Revisiting Knuth and McIlroy's word count programs

Today I came across a blog post revisiting Jon Bentley's 1986 challenge to Donald Knuth: write a literate program to solve a sample task, and have Doug McIlroy critique it.

The task:

Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.

Knuth came up with a typically clever, lengthy, low-level implementation. McIlroy then somewhat perversely wrote a six-line shell script that did the job, basically changing the subject away from literate programming and toward a critique of Knuth’s doing something low-level and complicated when unnecessary. The article publishing both Knuth’s and McIlroy’s solutions is available here. A followup article with David Hanson’s implementation in C is here.

I decided to bring the discussion forward a quarter of a century (25 years!) to the present. How would we solve the problem now?

(Update of 2013-06-29)

I have changed my mind about many things I said here, and also have more clarifications and new arguments to make, which I will eventually post on my new programming blog, The Conscientious Programmer.

Salient points of McIlroy’s solution

First, let’s look at McIlroy’s solution, which despite the passing of time is still a beautifully elegant illustration of why Unix is timeless. (By the way, Knuth is a C and Linux user to this day).

tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

The code is self-explanatory if you are familiar with basic Unix command-line tools. The pipeline just transforms data starting from standard input until the desired result is computed.

What is beautiful about the code is that it decomposes the solution into intuitive steps and does not require any mutable state. McIlroy’s program is a purely functional program.

Using a general-purpose programming language instead

I thought to myself, how would I write the program today? Especially if I thought I would need to modify it, add new features? The shell script does the job for the problem as stated, but would clearly be hard to extend. Indeed, historically, “scripting languages” such as Awk were invented in order to do more than simple shell scripts were suitable for, and Perl was Larry Wall’s response to Awk, to create a truly general-purpose language.

Nowadays, in 2011, languages and libraries are at a sufficiently high level that a program almost as concise as McIlroy’s could be written in Perl, Ruby, Python, or even the latest versions of Java, C#, etc. I’ll leave that as an exercise for you (feel free to post your solutions as comments below).

Haskell

I present a simple program in Haskell that I feel is closest to McIlroy’s both in spirit and in letter.

Here is my Haskell program, in two variants. The first is a standard source code file, while the second uses Haskell’s built-in support for its own notion of “literate programming”.
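The non-literate variant runs along these lines. This is a reconstruction that mirrors McIlroy’s pipeline stage by stage; the module layout and names here are illustrative, and the exact code in the original file may differ:

```haskell
module Main where

import Data.Char (isAlpha, toLower)
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)
import qualified Data.Map as Map
import System.Environment (getArgs)

-- The n most frequent words, as (count, word) pairs, mirroring
-- tr | tr | sort | uniq -c | sort -rn | sed ${1}q.
wordCount :: Int -> String -> [(Int, String)]
wordCount n =
    take n                                 -- sed ${1}q
  . sortBy (comparing (Down . fst))        -- sort -rn
  . map (\(w, c) -> (c, w))
  . Map.toList
  . Map.fromListWith (+)                   -- sort | uniq -c
  . map (\w -> (w, 1))
  . words
  . map (\c -> if isAlpha c then toLower c else '\n')
                                           -- tr -cs A-Za-z '\n' | tr A-Z a-z

main :: IO ()
main = do
  [n] <- getArgs
  text <- getContents
  mapM_ (\(c, w) -> putStrLn (show c ++ " " ++ w)) (wordCount (read n) text)
```

Note one small deviation: isAlpha accepts any Unicode letter, whereas McIlroy’s tr -cs A-Za-z keeps only ASCII letters.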

I used GHC to compile and run the program. Sample output:

$ ghc -O6 --make WordCount
$ ./WordCount 10 < WordCount.lhs
35 the
16 a
11 list
10 of
9 text
9 for
8 n
8 map
8 count
7 string

Apart from missing leading spaces, this is the same output as from McIlroy’s shell script.

The literate program above explains each step of the Haskell “pipeline” I constructed.

Why I don’t do literate programming

I don’t currently use “literate programming” systems.

I experimented with programming in C and C++ and Standard ML using noweb over a decade ago, but found that for myself, it was not really beneficial.

There was little benefit in being able to rearrange code fragments at will. Furthermore, code spread out and interspersed with a lot of prose made it harder for me to take in the meaning of a screenful of text in my editor window.

Also, modern languages and programming styles make it much easier to express things concisely and less monolithically, such that I find that using ordinary comments suffices for my needs.

I find that literate programming in the Knuth style amounts to a macro system that distorts the layout of code.

Finally, literate programming interacts badly with editors and IDEs that are built specifically to operate on pure source code.

What do you think? Which of the variants of the same Haskell code above would you prefer to write, read, or maintain? The non-literate one or the literate one?

An alternative to literate programming

I should note that in practice, I would write suitable comments in a non-literate program primarily before the function definition. Also, I would not use such a large pipeline of expressions either: I would break out almost every line of the pipeline into its own little function, with its own comment. That is how I would actually write a nontrivial Haskell program, writing one or two line functions and testing each of them, before trying to hook them all up into a big pipeline.

I “cheated” in this case because McIlroy’s program already existed, so I simply translated it into a Haskell equivalent without real thought and testing.
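Concretely, that decomposition might look like this (the function names are hypothetical; each piece is small enough to test on its own before composing the full pipeline):

```haskell
import Data.Char (isAlpha, toLower)
import qualified Data.Map as Map

-- Split text into lowercase alphabetic words
-- (the two tr stages of McIlroy's pipeline).
tokenize :: String -> [String]
tokenize = words . map (\c -> if isAlpha c then toLower c else ' ')

-- Tally how often each word occurs (sort | uniq -c).
tally :: [String] -> Map.Map String Int
tally = Map.fromListWith (+) . map (\w -> (w, 1))
```

Each function can be exercised in GHCi before anything is hooked up, which is how small bugs get caught early.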

Comparison between the shell script and the Haskell program

The shell script operates on raw text and everything is just strings being parsed and reparsed by the respective Unix utility programs.

The Haskell program is statically typed. It is type-checked by the compiler, which generates native code. The program uses standard libraries and data types, such as lists and finite maps.

Also, the Haskell program could be refined, extended, optimized in various ways. The most important optimizations I can think of off the top of my head:

  • Using a better representation of strings than the default built-in “string as list of characters”. Easily accessible advice can be found on Stack Overflow and browsing through Haskell library documentation, such as for the text package.
  • Loop fusion and deforestation can be applied to deal with the apparent allocation of many intermediate lists in the pipeline. One of the selling points of a language like Haskell is the opportunity for the compiler to perform radical optimizations that are impractical in languages with unrestricted side effects.
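As a sketch of the first point: using the text package (an assumption; the program above works on plain String), the tokenizing stage could be written over Data.Text, which stores text compactly rather than as a linked list of characters:

```haskell
import Data.Char (isAlpha)
import qualified Data.Text as T

-- Lowercase and split on non-letters, dropping empty fragments;
-- T.split plays the role of tr -cs A-Za-z '\n'.
tokenizeT :: T.Text -> [T.Text]
tokenizeT = filter (not . T.null) . T.split (not . isAlpha) . T.toLower
```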

I don’t write many bash scripts these days. General-purpose programming languages can do a decent job munging data without difficulty. The situation was different decades ago, when C was the norm and few standard high-level libraries existed for the C world.

Conclusion

  • I am skeptical of literate programming.
  • McIlroy was ahead of his time, but that time has passed; we should take his contributions as inspiration to move further forward already, using advanced general-purpose programming languages.

Comments (18) Archived from Disqus

Dorin B

"McIlroy was ahead of his time", maybe you are right, but 
"move further forward already, using advanced general-purpose programming languages" says to me "move backwards". You missed the point in McIlroy's solution: to use reusable components. Reusable components might be specialized programs/languages for solving problems from specific domains.

Franklin Chen

No, I think I illustrated exactly the point that McIlroy was making, and I believe that if you emailed him, he would completely agree with me today. I know that he has been a big fan of generic, functional programming. In fact, he is a member of the Haskell community and has written papers involving Haskell code to illustrate functional programming, e.g., http://pastebin.com/ESpu19ZQ

Note how every single line in my Haskell program is in fact a reusable component. I could build a library of such components so that I could use them in other programs. In fact, there is a whole body of work on embedded domain specific languages into general-purpose languages. Note that my program is much more reusable than McIlroy's because it does not just operate on text, and could be modified to operate on other kinds of data that need to be counted and sorted.

StephenMcCracken

You said you prefer to write small functions in Haskell.  When I was using Haskell, I had trouble deciding when to make functions.   Often a helper-type function would have a very short definition, and I would feel that my function name was clumsy by comparison and didn't add value.  Also, I didn't like the Haskellish style of introducing auxiliary definitions with "where".  I prefer scheme-style or ML-style "let", which you can read top-down.

Franklin Chen

I prefer to write small functions in any programming language, whether Haskell or C or JavaScript. I've simply found from experience that (1) I forget what long code does, (2) any piece of code that is nontrivial enough to be possibly buggy and can be tested should be its own function, otherwise how do you unit test it? I don't think it is a coincidence that whenever I find a bug in my code, it is embedded in a long function that does many things, creates intermediate expressions, hands them somewhere, etc.

Franklin Chen

It turns out this is a followup to the original article. I've located the original article as well and therefore have included links to both now. Thanks!

sswam

The shell version is shorter and easier to understand than your Haskell. The UNIX way is not out of date yet.

This is the Unix philosophy: Write programs that do one thing and do it well.
Write programs to work together. Write programs to handle text streams, because
that is a universal interface. -- Doug McIlroy

Knuth's program and McIlroy's review are found in his book "Literate Programming", Chapter 6. I bought the book and I'm enjoying reading it.

Franklin Chen

Yes, three decades have gone by and we all program at a much higher level than was the norm back in the 1970s and 1980s!

Ian Fairman

I've had a go at this problem using the functional capabilities of Java 8.

https://ianfairman.wordpres...

Like Franklin, I've cheated somewhat by using the shell version as a template. Needless to say, prior to Java 8 such a concise solution would have been difficult.

Franklin Chen

Thanks! Yes, Java 8 was a good step forward for Java.

Bart

Hooray for JavaScript

var obj = {};
// Assumes `string` holds the input text.
string.split(' ').forEach(function (word) {
  if (obj[word]) obj[word]++;
  else obj[word] = 1;
});
var arr = [], hw;
for (var i = 0; i < 3; i++) {
  hw = {count: 0};
  // Standard iteration; the original `for each (... in ...)` was Firefox-only.
  Object.keys(obj).forEach(function (word) {
    if (obj[word] > hw.count) hw = {word: word, count: obj[word]};
  });
  arr.push(hw);
  delete obj[hw.word];
}

Define 'word' by adding a regex replace before this script (e.g. remove punctuation).

Peter Norvig

import collections
import re

def wordcount(filename, n=10):
    text = open(filename).read().lower()
    counts = collections.Counter(re.findall('[a-z]+', text))
    for word, count in counts.most_common(n):
        print(count, word)