Checking PDF links

2020/05/17

Sometime last year, Tayda changed their e-shop and many of their existing URLs stopped working. I won’t make any comment on that, suffice to say that links breaking was specifically one of the things I tried hard to avoid when building the new frequency central website.

Frequency Central links to Tayda in their build documents to specify what components a customer should buy. It’s a problem for us if those links break, because customers won’t know what to buy and might buy and incorrect component.

Fortunately, most links remained working, and only a few were broken. I set about trying to find the broken links.

I found pdf-link-checker, which can be installed with pip install pdf-link-checker. This checks a single PDF file and makes a HTTP request to all of the links in the file. Perfect!

Then we just write a very simple bash script to process all of our pdf files.

for f in *.pdf; do
    echo "Processing $f file..";
    python2 /usr/bin/pdf-link-checker $f
done

…which we can run to find all of the dead links.

...
Processing Raging-Bull-Build-Doc.pdf file..
Processing Routemaster-Build-Document.pdf file..
ERROR: URL http://www.taydaelectronics.com/10-x-resistor-220-ohm-1-4w-1-metal-film-pkg-of-10.html failed. Reason: HTTP Error 404: Not Found
Processing Seismograf-Build-Doc.pdf file..
Processing Stasis-Leak-Build-Doc.pdf file..
...

Thus automating the detection part of the process, making it much simpler to correct all of the hyperlinks.

(I used masterpdfeditor4 to edit the PDF links.)

P.S. I'm late to the party, but I recently got a twitter account that you can follow here.