Here’s what happened:
Several years ago, as part of the TalkBank project, I wrote a suite of computer programs to clean up, correct, and transform SCOTUS oral argument transcripts into a particular text file format called CHAT, to enable analysis and further transformation.
The CHAT transcripts are aligned (by humans), utterance by utterance, with the available audio files, creating a nice way to read the text while listening, with the ability to pause, skip to a particular section, and so on.
Example: Hamdan v. Rumsfeld
SCOTUS provides an oral argument transcript in PDF format.
The linked transcripts can be played back in several ways:
- Directly from the CMU site.
- By downloading the 2005 corpus and running the CLAN desktop application on a transcript.
- Through Oyez, which transforms the CHAT files into an interface of its own, an entire site for each case.
What happened today
Today, as I have done on Thursdays or Fridays for years now when SCOTUS is in session, I converted this week’s oral argument transcripts to hand off to the Oyez people for their use.
The conversion software
The conversion happens in three steps.
The conversion is not fully automatic, because there are always errors of some kind in the transcripts that I correct manually. I wrote a set of twenty-nine separate Perl scripts that do a lot of cleanup and checking of transcripts as a pre-processing step.
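The scripts themselves are Perl and not reproduced here, but to give a flavor of the kind of cleanup and checking involved, here is a simplified sketch in Python; the normalization rules and the speaker-label pattern shown are illustrative assumptions, not the actual rules from my scripts:

```python
import re

# Illustrative cleanup/checking pass (simplified; the real scripts are Perl,
# and their actual patterns differ). First normalize typographic characters,
# then flag lines that need manual review.
SPEAKER_RE = re.compile(r'^(?:CHIEF JUSTICE|JUSTICE|MR\.|MS\.|MRS\.|GENERAL) [A-Z]+:')

def clean_line(line):
    # Normalize curly quotes and en-dashes to plain ASCII equivalents.
    return (line.replace('\u2018', "'").replace('\u2019', "'")
                .replace('\u201c', '"').replace('\u201d', '"')
                .replace('\u2013', '--'))

def suspicious(line):
    # Flag lines that look like a speaker turn but do not match the
    # expected "SPEAKER NAME:" format exactly.
    looks_like_turn = ':' in line[:40] and line[:1].isupper()
    return looks_like_turn and not SPEAKER_RE.match(line)

print(suspicious(clean_line('JUSTICE SCALIA: Son of a gun.')))  # False
print(suspicious('JUSTICE SCALlA: typo here.'))                 # True (lowercase 'l')
```

A line that fails a check like this is reported for manual judgment rather than silently passed along.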
Parsing, validation, transformation
The main task of parsing, validation, and transformation is handled by a program written in Java, built on the ANTLR parser generator framework.
Note: these programs were written several years ago, hence the use of ANTLR version 2 rather than ANTLR 3. Newer projects of mine have used ANTLR 3 (ANTLR 4 is only now about to be released).
A post-processing Perl script runs after the initial CHAT generation to convert numerical and other tokens into the desired spoken form; at the Pittsburgh Perl Workshop 2010, I gave a little talk, How do you pronounce “07-1191”?, about this part of the project.
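The flavor of that conversion can be sketched in Python. To be clear, the actual script is Perl, and the spoken convention I assume below, reading a docket number in two-digit groups ("07" as "oh seven", "1191" as "eleven ninety-one"), is my illustration, not necessarily the convention the corpus uses:

```python
# Sketch: convert a docket-number token like "07-1191" into a spelled-out
# spoken form. Assumes a two-digit year and a four-digit case number.
DIGITS = ['zero', 'one', 'two', 'three', 'four',
          'five', 'six', 'seven', 'eight', 'nine']
TEENS = ['ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen',
         'sixteen', 'seventeen', 'eighteen', 'nineteen']
TENS = {2: 'twenty', 3: 'thirty', 4: 'forty', 5: 'fifty',
        6: 'sixty', 7: 'seventy', 8: 'eighty', 9: 'ninety'}

def say_pair(two):
    # Speak a two-digit group: "07" -> "oh seven", "91" -> "ninety-one".
    a, b = int(two[0]), int(two[1])
    if a == 0:
        return ('oh ' + DIGITS[b]) if b else 'oh oh'
    if a == 1:
        return TEENS[b]
    return TENS[a] + ('-' + DIGITS[b] if b else '')

def say_docket(token):
    year, number = token.split('-')
    return ' '.join([say_pair(year), say_pair(number[:2]), say_pair(number[2:])])

print(say_docket('07-1191'))  # oh seven eleven ninety-one
```

The real script also has to decide, token by token, whether a number is a docket number, a year, a statute section, or something else, which is what made the problem interesting enough for a talk.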
Reliable legacy software
No new development has been done on these programs for several years since, as far as could be determined, there were no remaining bugs left in them.
These programs have, to date, successfully generated 3,468 CHAT files that have been validated. In the first year or two, bugs were found and fixed quickly; after that, I don’t remember the last time a bug was found.
The bug came today as I was processing this week’s case, 11-9953.
I got an error message: my pre-processing phase exited after an error while running the find-bad-ids.pl script.
I was confused. I looked at the transcript and saw:
JUSTICE SCALIA: And another of his counsel, Mr. Singer – of the three that he had – he was a graduate of Harvard law school, wasn’t he?
MS. SIGLER: Yes, Your Honor.
JUSTICE SCALIA: Son of a gun.
JUSTICE THOMAS: Well – he did not –
MS. SIGLER: I would refute that, Justice Thomas.
JUSTICE SOTOMAYOR: Counsel, do you want to define constitutionally adequate counsel? Is it anybody who’s graduated from Harvard and Yale?
I was confused about why my script would not recognize Justice Thomas. I looked at the Perl source code, saw where I initialized a table of Justice names, and saw:
(the fourteen-line table initialization, listing every Justice except Justice Thomas)
Seven years ago, I had simply forgotten to put Justice Thomas’s entry into the table initialization.
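To show the shape of the bug and of the fix, here is a hypothetical rendering in Python; the real table is in Perl, the keys and values are my invention, and the roster shown is just the current bench:

```python
# Hypothetical rendering of the Justice-name table (the real code is Perl,
# and the real table presumably also carries Justices who have since retired).
# For seven years the table was initialized like this; note who is absent:
JUSTICES = {
    'ROBERTS':   'Chief Justice Roberts',
    'SCALIA':    'Justice Scalia',
    'KENNEDY':   'Justice Kennedy',
    'GINSBURG':  'Justice Ginsburg',
    'BREYER':    'Justice Breyer',
    'ALITO':     'Justice Alito',
    'SOTOMAYOR': 'Justice Sotomayor',
    'KAGAN':     'Justice Kagan',
}

# The fix is one line:
JUSTICES['THOMAS'] = 'Justice Thomas'
```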
And this bug had lain undiscovered for seven years because Justice Thomas had not spoken during oral argument in seven years!
How the bug arose
Justice Thomas did speak in the following transcripts I successfully converted years ago:
What happened was that those transcripts were generated before I wrote the find-bad-ids.pl script. The main Java program has its own table of Justices; the purpose of the pre-processing Perl scripts is to catch errors earlier, before handing off to the Java program. In particular, approximate string matching makes it easy to correct typos before ANTLR ever sees the text to parse. Before the pre-processing scripts existed, the text that arrived at ANTLR often had a lot of systematic errors that were annoying to fix, so I wrote in Perl both a cleanup pre-processing phase and a checking pre-processing phase. These scripts include tests for all kinds of “suspicious” formatting, tokens, and ambiguity that require manual judgment and correction.
I still don’t know why, seven years ago, I forgot to put Justice Thomas into the table initialization code in my Perl script, but the result was a bug that went undetected until today. The lesson: your computer program can have bugs if your test data did not represent all possible situations, including that of Justice Thomas actually speaking during a SCOTUS oral argument!