Of lynx and curl
A quick note before we begin: my shell is zsh, and some aspects of this article may be zsh-specific, particularly the substitution trick. bash has similar ways to achieve these goals, but I won’t be going into anything bash-specific here.

At work, I was recently tasked with archiving several thousand records from a soon-to-be-mercifully-destroyed Lotus Notes database. Why they didn’t simply ask the DBA to do this is beyond me (just kidding, it almost certainly has to do with my time being less valuable, results be damned). No mind, however, as the puzzle was a welcome one, as was the opportunity to exercise my Unix (well, cygwin in this case) chops a bit. The exercise became a simple one once I realized the database had a web server available to me, and that copies of the individual record web views would suffice. A simple pairing of lynx and curl easily got me what I needed, and I realized that I use these two in tandem quite often. Here’s the breakdown:
There are two basic steps to this process: use lynx to generate a list of links, and use curl to download them. There are other means of doing this, particularly when multiple depths need to be spidered. I like the control and safety afforded to me by this two-step process, however, so for situations where it works, it tends to be my go-to.

To start, lynx --dump 'http://brhfl.com' will print out a clean, human-readable version of my homepage, with a list of all the links at the bottom, formatted like
1. http://brhfl.com/#content
2. http://brhfl.com/
3. http://brhfl.com/./about/
4. http://brhfl.com/./categories/
5. http://brhfl.com/./post/
…and so on (note to self: those ./ URLs function fine, and web browsers seem to transparently ignore them, but… maybe fix that?). For our purposes, we don’t want the formatted page, nor do we want the reference numbers. awk helps us here: lynx --dump 'http://brhfl.com' | awk '/http/{print $2}' looks for lines containing ‘http’, and only prints the second field of each matching line (the default field separator being whitespace).
http://brhfl.com/#content
http://brhfl.com/
http://brhfl.com/./about/
http://brhfl.com/./categories/
http://brhfl.com/./post/
…et cetera. For my purposes, I was able to single out only the links to records in my database by matching a second pattern. If we only wanted to return links to my ‘categories’ pages, we could do lynx --dump 'http://brhfl.com' | awk '/http/&&/categories/{print $2}', using a boolean AND to match both patterns.
http://brhfl.com/./categories/
http://brhfl.com/./categories/apple/
http://brhfl.com/./categories/board-games/
http://brhfl.com/./categories/calculator/
http://brhfl.com/./categories/card-games/
…and so on. Belaboring this any further would be more a primer on awk than anything, but it is necessary [1] for turning lynx --dump into a viable list of URLs. While this seems like a clumsy first step, it’s part of the reason I like this two-step approach: my list of URLs is a very real thing that can be reviewed, modified, filtered, &c. before curl ever downloads a byte. All of the above examples print to stdout, so something more like lynx --dump 'http://brhfl.com' | awk '/http/&&/categories/{print $2}' >> categories-urls would (appending to and not clobbering) store my URLs in a file. Then it’s on to curl.
for i in $(< categories-urls); curl -O "$i" worked just fine [2] for my database capture, but our example here would be less than ideal because of the pretty URLs. curl will, in fact, return
curl: Remote file name has no length!
…and stop right there. This is because the -O option simplifies things by saving the local copy of the file with the remote file’s name. If we want to (or need to) name the files ourselves, we use the lowercase -o filename instead.
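To illustrate the difference, a hypothetical pair of invocations (the URL is just one of the example links from above):
# -O derives the local file name from the end of the URL; a pretty URL ending
# in a slash leaves nothing to use, hence the error above
curl -O 'http://brhfl.com/./categories/apple/'
# -o lets us choose the local file name ourselves
curl -o apple.html 'http://brhfl.com/./categories/apple/'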
While this would be a great place to learn more about awk [3], we can actually cheat a bit here and let the shell help us. zsh has a tail-matching substitution built in, used much like basename to get the tail end of a path. Since URLs are just paths, we can do the same thing here. To test this, we can run for i in $(< categories-urls); echo ${i:t}.html and get
categories.html
apple.html
board-games.html
calculator.html
card-games.html
…blah, blah, blah. This seems to work, so all we need to do is plug it into our curl command: for i in $(< categories-urls); (curl -o "${i:t}".html "$i"; sleep 2). I added the two seconds of sleep when I did my db crawl so that I wasn’t hammering the aging server. I doubt it would have made a difference so long as I wasn’t making all of these requests in parallel, but I had other things to work on while it did its thing anyway.
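For clarity, here is the same loop written out long-form (nothing new, just the one-liner above spread across a few lines):
# same as the one-liner above, spread out for readability
for i in $(< categories-urls); do
  curl -o "${i:t}.html" "$i"   # ${i:t} is zsh’s tail modifier, much like basename
  sleep 2                      # a little courtesy pause between requests
done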
One more reason I like this approach to grabbing URLs: as we’re pulling things, we can very easily sort out the failed requests using curl -f, which returns a nonzero exit status upon failure. We can use this in tandem with the shell’s boolean OR to build a new list of URLs that have failed: (i="http://brhfl.com/fail"; curl -fo "${i:t}".html "$i" || echo "$i" >> failed-category-urls) gives us…
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl: (22) The requested URL returned error: 404 Not Found
~% < fail.html
zsh: no such file or directory: fail.html
zsh: exit 1 < fail.html
~% < failed-category-urls
http://brhfl.com/fail
…which we can then run through curl
again, if we’d like, to get the resulting status codes of these URLs: for i in $(< failed-category-urls); (printf "$i", >> failed-category-status-codes.csv; curl -o /dev/null --location --silent --head --write-out '%{http_code}\n' "$i" >> failed-category-status-codes.csv)
4. < failed-category-status-codes.csv
in this case gives us
http://brhfl.com/fail,404
…which we’re free to do what we want with. Which, in this case, is probably nothing. But it’s a good one-liner anyway.
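For completeness, here’s the whole process strung together as a rough sketch, using the example URL, filter, and file names from above (adjust to taste):
# 1. build a reviewable list of URLs (edit categories-urls by hand before step 2 if needed)
lynx --dump 'http://brhfl.com' | awk '/http/&&/categories/{print $2}' >> categories-urls
# 2. fetch each one, naming files by the tail of the URL and noting any failures
for i in $(< categories-urls); do
  curl -fo "${i:t}.html" "$i" || echo "$i" >> failed-category-urls
  sleep 2
done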
[1] sed and/or grep could sub in here, but awk is really the right tool for this one.
[2] Well, mostly. The filenames were all things like 0bbc93e72b0c16d7852580d8004ef57e?OpenDocument, which I tidied up a tiny bit using zmv: zmv '(*)?OpenDocument' '$1.html'
[3] This would be something like awk -F/ '{print $(NF-1)}', with -F specifying the field separator, $NF being a variable that represents the number of fields in a line, and then backtracking by one because of the trailing slash.
[4] There’s a bit to unpack here, and it’s beyond the scope of this article, so this is where I refer the reader to man curl. Suffice it to say, however, I use this curl one-liner to check status codes rather often.