Semaphore and sips redux

In this article, I use sem -j +5, which allows up to five more simultaneous jobs than there are CPU cores. -j takes plain integers, percentages, and +/- values relative to the core count: -j 4 caps things at four jobs, -j +0 runs one job per core, -j -1 runs one fewer job than the available cores, and so on.

I was going to simply edit my last post, but this might warrant its own, as it’s really more about sem and parallel than it is sips. parallel’s manpage describes it as ‘a shell tool for executing jobs in parallel using one or more computers’. It’s kind of a better version of xargs, and it is super powerful. The manpage starts early with a recommendation to watch a series of tutorials on YouTube and continues on to example after example after example. It’s intense.
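To give a flavor of plain parallel (as opposed to sem, and assuming GNU parallel is installed, since it isn’t a macOS builtin), its ::: syntax plus its replacement strings ({} for the input, {.} for the input minus its extension) turn the TIFF-to-PNG conversion discussed below into a loop-free one-liner. A sketch, not necessarily how I’d actually run it:

parallel sips -s format png {} --out {.}.png ::: ./*.tif

The rest of this post sticks with sem, though.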

In my previous post, I suggested using sem for easy parallel execution of sips conversions. sem is really just an alias for parallel --semaphore, described by its manpage (yes, it gets its own manpage) as a ‘counting semaphore [that] simply waits for a semaphore to become available and then runs the command given’. It’s a convenient and fairly accessible way to parallelize tasks. That manpage focuses on the specifics of how sem queues things up, how it waits to execute tasks, and so on. It does this using toilet metaphors, which is a whole other conversation, but for the most part it’s fairly clear, and it’s what I tend to reference when I’m figuring something out with sem.

In my last post (and in years of converting things this way), I had to decide between automating the cleanup/rm process and parallelizing the sips calls. The problem is, if you do this:

for i in ./*.tif; sem -j +5 sips -s format png "$i" --out "${i/.tif/.png}" && rm "$i"

…the parallelism gets all thrown off. The && is interpreted by the outer shell rather than handed to sem, so sem queues up sips, exits 0, and then rm destroys the file before sem even gets the chance to spawn sips. By the time the queued job runs, the file no longer exists, and sips has nothing to convert. The sem manpage doesn’t really address chaining commands in this manner (presumably it would be too difficult to fit into a toilet metaphor), but it occurred to me that I might come up with the answer if I just looked through enough of the examples in the parallel manpage (worth noting that a lot of the parallel syntax is specific to not being run in semaphore mode). The solution is facepalmingly simple: wrap the && in double quotes:

for i in ./*.tif; sem -j +5 sips -s format png "$i" --out "${i/.tif/.png}" "&&" rm "$i"

…which works a charm. We could take this even further and feed the PNGs directly into optipng:

for i in ./*.tif; sem -j +5 sips -s format png "$i" --out "${i/.tif/.png}" "&&" rm "$i" "&&" optipng "${i/.tif/.png}"

…or potentially adding optipng to the sem queue instead:

for i in ./*.tif; sem -j +5 sips -s format png "$i" --out "${i/.tif/.png}" "&&" rm "$i" "&&" sem -j +5 optipng "${i/.tif/.png}"

…I’m really not sure which is better (and I don’t think time will help me since sem technically exits pretty quickly).
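For what it’s worth, there is a way to make time meaningful here: sem --wait blocks until every job queued on the semaphore has finished, so timing the loop together with a trailing sem --wait in one subshell (spelled with do/done so the whole thing fits under a single time) ought to capture the real work. A sketch:

time ( for i in ./*.tif; do sem -j +5 sips -s format png "$i" --out "${i/.tif/.png}" "&&" rm "$i"; done; sem --wait )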

Darwin image conversion via sips

I use Lightroom for all of my photo ‘development’ and library management needs. Generally speaking, it is great software. Despite being horribly nonstandard (that is, using nonnative widgets), it is the only example of good UI/UX that I’ve seen out of Adobe in… at least a decade. I’ll be perfectly honest right now: I hate Adobe with a passion otherwise entirely unknown to me. About 85-90% of my professional life is spent in Acrobat Pro, which gets substantially worse every major release. I would guess that around 40% of my be-creative-just-to-keep-my-head-screwed-on time is spent in various pieces of CC (which, subscription model is just one more fuck-you, Adobe). But Lightroom has always been special. I beta tested the first release, and even then I knew… this was the rare excuse for violating so many native UI conventions. This made sense.

Okay, from that rant we come up with: thumbs-down to Adobe, but thumbs-up to Lightroom. But there’s one thing that Lightroom has never opted to solve, despite so many cries, and that is PNG export. Especially with so many photographers (myself included) using flickr, which reencodes TIFFs to JPEGs, but leaves the equally lossless PNG files alone, it is ridiculous that the Lightroom team refuses to incorporate a PNG export plugin. Just one more ’RE: stop making garbage’ memo that I need to forward to the clowns at Adobe.

All of this to just come to my one-liner solution for Mac users… sips is the CLI/Darwin counterpart to the image conversion machinery that macOS uses in Preview and elsewhere. The manpage is available online, conveniently. But my use is very simple – make a bunch of stupid TIFFs into PNGs.

for i in ./*.tif ; sips -s format png "$i" --out "${i/tif/png}" && rm "$i"

…is the basic line that I use on a directory full of TIFFs output from Lightroom. Note that this is zsh; the ${i/tif/png} substitution happens to work in bash as well, but the do-less for loop does not. Lightroom seemingly outputs some gross TIFFs, and sips throws an error for every file, but still exits 0 and spits out a valid PNG. sips does not do any parallelism on its own, so a better way to handle this may be (using semaphore):

for i in ./*.tif; sem -j +5 sips -s format png "$i" --out "${i/tif/png}"

…and then cleaning up the TIFFs afterward (rm ./*.tif). Either way. There’s probably a way to do both using flocks or some such, but I haven’t put much time into that race condition.
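One caveat on the ‘clean up afterward’ route, since sem returns as soon as a job is queued: an rm ./*.tif fired immediately after that loop could delete TIFFs whose conversions are still running. sem --wait blocks until everything queued on the semaphore has finished, so the safer sequence is something like:

for i in ./*.tif; sem -j +5 sips -s format png "$i" --out "${i/tif/png}"
sem --wait && rm ./*.tif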

At the end of the day, there are plenty of image conversion packages out there (ImageMagick comes to mind), but if you’re on macOS/Darwin… why not use the builtins if they function? And sips does, in a clean and simple way. While it certainly isn’t a portable solution, it’s worth knowing about for anyone who does image work on a Mac and feels comfortable in the CLI.
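(For comparison, the ImageMagick route is a one-liner too, assuming it’s installed; mogrify writes the PNGs alongside the TIFFs and leaves the originals for you to clean up:

mogrify -format png ./*.tif

…but that’s the portable answer, not the builtin one.)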

Of lynx and curl

I use zsh, and some aspects of this article may be zsh specific, particularly the substitution trick. bash has similar ways to achieve these goals, but I won’t be going into anything bash-specific here.

At work, I was recently tasked with archiving several thousand records from a soon-to-be-mercifully-destroyed Lotus Notes database. Why they didn’t simply ask the DBA to do this is beyond me (just kidding, it almost certainly has to do with my time being less valuable, results be damned). No matter, though, as the puzzle was a welcome one, as was the opportunity to exercise my Unix (well, Cygwin in this case) chops a bit. The exercise became a simple one once I realized the database had a web server available to me, and that copies of the individual record web views would suffice. A simple pairing of lynx and curl easily got me what I needed, and I realized that I use these two in tandem quite often. Here’s the breakdown:

There are two basic steps to this process: use lynx to generate a list of links, and use curl to download them. There are other means of doing this, particularly when multiple depths need to be spidered. I like the control and safety afforded to me by this two-step process, however, so for situations where it works, it tends to be my go-to. To start, lynx --dump '' will print out a clean, human-readable version of my homepage, with a list of all the links at the bottom, formatted like


…and so on (note to self: those ./ URLs function fine, and web browsers seem to transparently ignore them, but… maybe fix that?). For our purposes, we don’t want the formatted page, nor do we want the reference numbers. awk helps us here: lynx --dump '' | awk '/http/{print $2}' looks for lines containing ‘http’ and prints only the second field of each (awk’s default field separator being whitespace), which in lynx’s reference list is the URL itself.

…et cetera. For my purposes, I was able to single out only the links to records in my database by matching a second pattern. If we only wanted to return links to my ‘categories’ pages, we could do lynx --dump '' | awk '/http/&&/categories/{print $2}', using a boolean AND to match both patterns.

…and so on. Belaboring this any further would be more a primer on awk than anything, but it is necessary for turning lynx --dump into a viable list of URLs. While this seems like a clumsy first step, it’s part of the reason I like this two-step approach: my list of URLs is a very real thing that can be reviewed, modified, filtered, &c. before curl ever downloads a byte. All of the above examples print to stdout, so something more like lynx --dump '' | awk '/http/&&/categories/{print $2}' >> categories-urls would (appending to and not clobbering) store my URLs in a file.

Then it’s on to curl. for i in $(< categories-urls); curl -O "$i" worked just fine for my database capture, but our example here would be less than ideal because of the pretty URLs. curl will, in fact, return

curl: Remote file name has no length!

…and stop right there. This is because the -O option simplifies things by saving the local copy of the file with the remote file’s name. If we want to (or need to) name the files ourselves, we use the lowercase -o filename instead. While this would be a great place to learn more about awk, we can actually cheat a bit here and let the shell help us. zsh has a tail modifier (:t) built in, used much like basename to get the tail end of a path. Since URLs are just paths, we can do the same thing here. To test this, we can for i in $(< categories-urls); echo ${i:t}.html and get


…blah, blah, blah. This seems to work, so all we need to do is plug it into our curl command, for i in $(< categories-urls); (curl -o "${i:t}".html "$i"; sleep 2). I added the two seconds of sleep when I did my db crawl so that I wasn’t hammering the aging server. I doubt it would have made a difference so long as I wasn’t making all of these requests in parallel, but I had other things to work on while it did its thing anyway.
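To see what :t actually produces, here’s a throwaway example with a made-up URL (example.com is obviously hypothetical):

% i=https://example.com/categories/travel
% echo ${i:t}.html
travel.html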

One more reason I like this approach to grabbing URLs – as we’re pulling things, we can very easily sort out the failed requests using curl -f, which returns a nonzero exit status upon failure. We can use this in tandem with the shell’s boolean OR to build a new list of URLs that have failed: (i=""; curl -fo "${i:t}".html "$i" || echo "$i" >> failed-category-urls) gives us…

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (22) The requested URL returned error: 404 Not Found
~% < fail.html
zsh: no such file or directory: fail.html
zsh: exit 1      < fail.html
~% < failed-category-urls

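In the actual crawl, that curl -f check lives inside the download loop rather than being pointed at one URL at a time; with the politeness sleep from before, it comes out to something like:

for i in $(< categories-urls); (curl -fo "${i:t}".html "$i" || echo "$i" >> failed-category-urls; sleep 2)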
…which we can then run through curl again, if we’d like, to get the resulting status codes of these URLs: for i in $(< failed-category-urls); (printf "$i", >> failed-category-status-codes.csv; curl -o /dev/null --location --silent --head --write-out '%{http_code}\n' "$i" >> failed-category-status-codes.csv). < failed-category-status-codes.csv in this case gives us

,404

…which we’re free to do what we want with. Which, in this case, is probably nothing. But it’s a good one-liner anyway.
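As an aside, --write-out knows more variables than %{http_code}; %{url_effective}, for instance, is the URL curl ended up at after any redirects, so the whole CSV row can come out of curl itself if the separate printf feels clunky (with the caveat that the first column then reflects post-redirect URLs rather than the originals). Roughly:

for i in $(< failed-category-urls); curl -o /dev/null --location --silent --head --write-out '%{url_effective},%{http_code}\n' "$i" >> failed-category-status-codes.csv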

Making multiple directories with mkdir -p

I often have to create a handful of directories under one root directory. mkdir can take multiple arguments, of course, so one can do mkdir -p foo/bar foo/baz or mkdir foo !#:1/bar !#:1/baz (the latter, of course, would make more sense given a root directory with a longer name than ‘foo’). But a little trick that I feel slips past a lot of people is to use .. directory traversal to knock out a bunch of directories all in one pass. Since -p just makes whatever it needs to, and doesn’t care about whether or not any part of the directory you’re passing exists, mkdir -p foo/bar/../baz works to create foo/bar and foo/baz. This works for more complex structures as well, such as…

% mkdir -p top/mid-1/../mid-2/bottom-2/../../mid-3/bottom-3
% tree
└── top
    ├── mid-1
    ├── mid-2
    │   └── bottom-2
    └── mid-3
        └── bottom-3
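Brace expansion gets to the same place, and arguably reads more clearly once the structure deepens; the example above could just as well be:

% mkdir -p top/{mid-1,mid-2/bottom-2,mid-3/bottom-3}

…which is purely a matter of taste.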