
Speech synthesis

When I was in elementary school, I built much of my foundation in computing on the Commodore 64. It was a great system to learn on, with lots of tools available and easy ways to get ‘down to the wire’, so to speak. Though it was hard to see just how limited the machines were compared with what the future held, some programs really stood out for how completely impossible they seemed1. One such program was S.A.M. – the Software Automated Mouth, my first experience with synthesized speech2.

Speech synthesis has come a long way since. It’s built into current operating systems, it can be had in IC form for under $9, and it’s becoming increasingly present in day-to-day life. I routinely use Windows’ built-in speech synthesizer along with NVDA as part of my accessibility checking regimen. But I’m also becoming increasingly dismayed by the egregious use of speech synthesis when natural human speech would not only suffice but be better in every regard. Synthesis has the (theoretical) advantage of being able to say anything, without paying a person to do the job. I’m seeing more and more instances where this doesn’t pan out, and the robot is truly bad at its job to boot.

Three examples, all train-related (I suppose I spend a lot of time on trains): the new 7000-series DC Metro cars, the new MARC IV-series coaches, and the announcements at DC’s Union Station. None of these need to be synthesized. They’re all essentially announcing destinations – they have very limited vocabularies and don’t make use of the theoretical ability to say anything. Union Station’s robot occasionally announces delays and the like, but announcements beyond the norm often revert to a human. Metro and MARC trains only announce stops and have demonstrated no capacity for supplemental speech. Where old and new cars are paired, conductors/operators still need to make their own station stop announcements.

So these synthesizers don’t seem to have a compelling reason to exist. It could be argued that human labor is now potentially freed up, but given the robots’ limited vocabularies and grammars, the same thing could be accomplished with human voice recordings. I can’t imagine that hiring a voice actor, plus software to patch the recordings together into meaningful grammar, would be appreciably more expensive than the robot. In fact, before the 7000-series Metro cars, WMATA used recordings to announce door openings and closings; they replaced these recordings in 2006, and the voice actor was rewarded with a $10 fare card3.

Aside from simply not being necessary, the robots aren’t good at their job. This is, of course, bad programming – human error. But it feels like the people in charge of the voices are so far detached from the final product that they don’t realize how much they’re failing. The MARC IV coaches sound acceptable, but their grammar is bizarre. When the train is coming to a station stop, an acceptable thing to announce might be ‘arriving at Dickerson’, which is in fact what the conductors tend to say. The train, instead, says ‘this train stops at Dickerson’, which at face value says nothing beyond that the train will in fact stop there at some point. It’s bad information, communicated poorly.

Union Station’s robot has acceptable grammar, but she pronounces the names of stations completely wrong. Speech synthesizers generally have two components: the synthesizer proper, which knows how to make phonemes (the sounds that make up our speech), and a layer that translates the words of a given language into those phonemes. My old buddy S.A.M. had the S.A.M. speech core, and Reciter, which looked up word parts in a table to convert them to phonemes. This all had to fit into considerably less than 64K, so it wasn’t perfect, and (if memory serves) one could override Reciter with direct phonemes for mispronounced words. Apple’s say command (well, their Speech Synthesis API) allows on-the-fly switching between text and phoneme input using [[inpt TEXT]] and [[inpt PHON]] within a speech string4. So again, given just how limited the robot’s vocabulary is (none of these trains are adding station stops with any regularity), someone should have been able to review what the robot says and suggest overrides. Half the time, this robot gets so confused that she sounds like GLaDOS in her death throes.
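To make the override idea concrete, here’s a minimal sketch using Apple’s say command. The [[inpt]] switching syntax and the Monocacy phoneme string come straight from footnote 4; the Fred voice and the announcement phrasing are my own assumptions for illustration (the older MacinTalk voices like Fred reliably honor the embedded commands):

    # The text-to-phoneme layer takes its best guess at 'Monocacy' – and botches it:
    say -v Fred 'Arriving at Monocacy'

    # Override just the problem word with hand-written phonemes, then switch back
    # to plain text input for the rest of the announcement:
    say -v Fred 'Arriving at [[inpt PHON]] 2mOW1nAAkUXsIY [[inpt TEXT]], next stop Dickerson'

A transit agency’s announcement system almost certainly isn’t built on say, but the principle carries over: a small, fixed table of per-station pronunciation overrides would catch most of these botched names.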

Which brings me to my final point – the robots simply aren’t human. Even when they are pronouncing things well, they can be hard to understand. On the flipside, the DC Metro robot sounds realistic enough that she creeps me the hell out, which I can only assume is the auditory equivalent of the uncanny valley. I suppose a synthesized voice could have neutrality as an advantage – a grumpy human is probably more off-putting than a lifeless machine. But again, this is solvable with human recordings. I cannot imagine any robot being more comforting than a reasonably calm human.

Generally speaking, we’re reducing the workforce more and more, replacing workers with automation and machinery. It’s a necessary progression, though I’m not sure we’re prepared to deal with the unemployment consequences. It’s easy to imagine speech synthesis as a readily available extension of this concept – is talking a necessary job? But human speech is seemingly being replaced in instances where the speaking doesn’t actually replace a human’s job and/or a human recording would easily suffice. In some instances, the speech is merely one component of a larger job being replaced – take self-checkout machines (which tend to use human recordings despite the fact that grocery store inventories are far more volatile than train routes, hence ‘place your… object… in the bag’). But I feel like I’m seeing more and more instances of speech synthesis that is demonstrably worse than a human voice and serves no apparent purpose (beyond, presumably, lining someone’s pockets).


  1. I loved the C64 demo scene; these people were hackers in the truest sense, finding and exploiting loopholes in a constrained system. One of my favorites included a PCM recording of M.C. Miker G & Deejay Sven’s Holiday Rap, which was right up there with speech synthesis. I can’t find it on YouTube. I do now know how such a thing was accomplished – having analog and digital componentry smushed together on the same chip at the time led to audible clicks during output gain changes. This flaw could be exploited to produce 4-bit PCM audio. ↩︎
  2. One fun thing about how impossible S.A.M. was: in order to ensure the sound processing was time coherent, the software had to prevent any unnecessary hardware interrupts from occurring during speech, leading to the screen blanking while S.A.M. talked. ↩︎
  3. The original pre-2006 voice was, in my opinion, the most soothing of the lot (as were its door chimes). I can’t find a source to back this up, but I seem to remember that the original voice was some WMATA employee or board member’s wife – who I would also assume was unpaid. ↩︎
  4. One of the stations that the robot botches particularly badly is Monocacy. Apple’s speech synthesizer also gets it wrong: say 'Monocacy'. But, again, we can override this with phonemic input: say '[[inpt PHON]] 2mOW1nAAkUXsIY'. (If you don’t want to touch the command line, you should be able to highlight from the [[inpt PHON]] on, right-click, and choose Speech → Start Speaking.) See also this reference on customizing speech in Apple’s API. ↩︎