Revisiting Knuth and McIlroy's word count programs
Dec 8, 2011 · 6 minute read · Comments · programming · literate programming · Unix · Donald Knuth · Doug McIlroy · Haskell · functional programming · C · C++ · Awk · Perl
Today I came across a blog post revisiting Jon Bentley’s 1986 challenge to Donald Knuth: write a literate program to solve a sample task, with Doug McIlroy critiquing the result.
The task:
Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.
Knuth came up with a typically clever, lengthy, low-level implementation. McIlroy then somewhat perversely wrote a six-line shell script that did the job, basically changing the subject away from literate programming and toward a critique of Knuth’s doing something low-level and complicated when unnecessary. The article publishing both Knuth’s and McIlroy’s solutions is available here. A followup article with David Hanson’s implementation in C is here.
I decided to bring the discussion forward a quarter of a century (25 years!) to the present. How would we solve the problem now?
(Update of 2013-06-29)
I have changed my mind about many things I said here, and also have more clarifications and new arguments to make, which I will eventually post on my new programming blog, The Conscientious Programmer.
Salient points of McIlroy’s solution
First, let’s look at McIlroy’s solution, which despite the passing of time is still a beautifully elegant illustration of why Unix is timeless. (By the way, Knuth is a C and Linux user to this day).
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
The code is self-explanatory if you are familiar with basic Unix command-line tools. The pipeline just transforms data starting from standard input until the desired result is computed.
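To make the stages concrete, here is a sketch of the pipeline wrapped so it can be run and tested; the function name wordfreq and the sample input are my own illustrations, not from the original article.

```shell
#!/bin/sh
# McIlroy's pipeline, wrapped as a shell function.
# $1 is the number of top words to print.
wordfreq() {
  tr -cs A-Za-z '\n' |  # squeeze every run of non-letters into a newline: one word per line
  tr A-Z a-z |          # normalize to lower case
  sort |                # group identical words together...
  uniq -c |             # ...so uniq can count each one
  sort -rn |            # order by descending count
  sed ${1}q             # print the first $1 lines, then quit
}

# The most frequent word in this sample is "the", with a count of 2.
printf 'the cat and the hat\n' | wordfreq 1
```

Note that the original script assumes its input on standard input and its line count as the positional parameter ${1}, which is why it composes so cleanly with other tools.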
What is beautiful about the code is that it decomposes the solution into intuitive stages and requires no mutable state. McIlroy’s program is a purely functional program.
Using a general-purpose programming language instead
I thought to myself, how would I write the program today, especially if I expected to modify it or add new features? The shell script does the job for the problem as stated, but would clearly be hard to extend. Indeed, historically, “scripting languages” such as Awk were invented in order to do more than simple shell scripts were suitable for, and Perl was Larry Wall’s response to Awk: an attempt to create a truly general-purpose language.
Nowadays, in 2011, languages and libraries are at a sufficiently high level that a program almost as concise as McIlroy’s could be written in Perl, Ruby, Python, or even the latest versions of Java, C#, etc. I’ll leave that as an exercise for you (feel free to post your solutions as comments below).
Haskell
I present a simple program in Haskell that I feel is closest to McIlroy’s both in spirit and in letter.
Here is my Haskell program, in two variants. The first is a standard source code file, while the second uses Haskell’s built-in support for its own notion of “literate programming”.