The Unreasonable Effectiveness of AWK

Stephen Ramsay

“If you can’t do it with sed, C, awk, and the shell, you probably can’t do it.”

I remember reading that … well, I don’t know where I read it (or even if I did). It sounds like something from The Perl Journal in the late nineties, and I was certainly a devoted reader of that publication. But it could have been from any number of books and articles from that period. For some reason, it’s seared in my memory.

All of those languages—including sed—are Turing-complete, so I suppose the statement is formally and provably true. But I recall it being more of a cultural observation than a technical one: “Since we already have sed, C, awk, and the shell, why do we need something like Perl?”

C was my first programming language. First, in the sense that I didn’t really know what programming was about or how to do it at all before studying K&R a few decades ago. But it wasn’t until I picked up Perl that I gained any serious proficiency as a programmer, and by then I had earned my black belt in regular expressions from crafting little sed scripts (in the Korn shell, if you must know). I was munging texts and creating cgi scripts, so the answer to the question “Why Perl?” was simplicity and ergonomics, and it’s the same answer to the question “Why Python?” or “Why Ruby?”

But somehow, I never really learned awk. I have used it here and there over the years, but mostly for really simple one-off one-liners intended to extract data from something already in columns. I knew there was more to it, but never really bothered to find out what more there was.

I’ve become increasingly radical when it comes to the UNIX Philosophy as time goes by. Which is to say, I ask myself more often and with greater urgency whether this thing can really be done with simple little tools and a little shell (and writing little tools, usually in C, only when necessary). This very blog, in fact, is kind of an experiment in doing things The UNIX Way. I didn’t deploy a static site generator, but I didn’t really write one either. The whole thing is basically Pandoc running from Makefiles (with various other little command-line utilities lifting things here and there).

But one thing I had been looking at admiringly on other people’s sites was automatically generated indexes. I had a general sense of how this should work: find any keywords listed at the bottom of the Markdown file, and use those (and the file path, and maybe the title) to generate a nicely formatted index (in Markdown) that could then be turned into html like everything else on the site.

I don’t know what made me think of awk. It’s easy to do this sort of thing in Ruby or, well, Bash. Maybe I had that little quip in my mind.

At any rate, I’m sure I’m late to some party (or just continuing my admiration for 1970s computing). But: awk. Is. Amazing. And it’s made me rethink what makes for a good dsl.

In the case of my system, all I really needed to do was grab the keyword line from every Markdown file. “Every Markdown file” is a straightforward find operation, and the “grab” part is some kind of regex (/Keywords:/). The essence of awk is attaching some logic to the matched pattern, and as far as logic goes, awk has all the essentials: variables, loops, conditionals, and (I had no idea!) associative arrays. Every record (by default, a line) is automatically split into fields on a delimiter (by default, whitespace), and the fields are assigned to numbered variables ($1, $2, $3 … with $0 containing the whole record; there’s a quick one-liner illustrating the field mechanics after the example below). This leads to stuff like this:

/title-meta:/ {
    # gensub is a gawk extension: strip the label, keep the value
    title = gensub(/title-meta: /, "", "g", $0)
}

/Keywords:/ {
    keywords = gensub(/Keywords: /, "", "g", $0)
    split(keywords, keys, ", ")        # "a, b, c" -> keys[1], keys[2], keys[3]
    for (key in keys) {
        printf("%s:%s:%s\n", keys[key], title, FILENAME)
    }
}

That generates a stream of lines, each containing a keyword, a title, and a path, like so:

awk:The Unreasonable Effectiveness of AWK:../posts/unreasonable.md
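
To see the field mechanics in isolation, a one-liner will do (a throwaway illustration, not part of my indexing system):

$ echo "one two three" | awk '{ print $2, NF }'
two 3

Nothing was declared and nothing was split by hand; $2 and the field count NF are simply there waiting for you.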

Now, the important thing to notice in that indexing script is what we are not doing. We’re not creating file handles and opening files (or closing them), we’re not reading lines into arrays, and we’re not iterating through those arrays (for each open file) looking for interesting things. All of that is happening, of course, but we don’t have to set any of it up. And the reason for that is a matter of design: the domain of this dsl is transforming data in plain text files into some other form. Which is another way of saying that the domain always involves opening text files, matching patterns within them, and extracting or reformatting or otherwise munging whatever in those files is of interest.
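
Just to make the point vivid, here is roughly what that machinery looks like written out by hand, in awk itself, for a single file and a single pattern (a sketch using getline; nobody would actually write it this way):

BEGIN {
    file = ARGV[1]
    while ((getline line < file) > 0) {    # the loop awk normally runs for you
        if (line ~ /Keywords:/) {
            # ...the logic from the pattern block would go here...
        }
    }
    close(file)
}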

So while there are functions and loops and conditionals inside each pattern block, we’re really just focusing on the logic within the (implied) outer loop. It’s not quite declarative, but it’s close (and I can see how in some cases awk code would start to seem almost “logic-less”). But more importantly, it shaves off all the stuff that is part of the problem by definition.

In any event, my indexing system uses find to grab all the *.md files and pipes them through two awk scripts and an intervening sort. I run that one line with make, and I have my index.
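
For the curious, the whole thing amounts to something like this (a sketch; the directory and the script names are stand-ins for my actual layout):

find posts -name '*.md' -exec awk -f extract.awk {} + \
    | sort \
    | awk -f index.awk > index.md

And the second script, the one that turns the sorted keyword:title:path stream into Markdown, needs only a few lines (again a sketch, one that assumes no colons in the titles themselves):

BEGIN { FS = ":" }
$1 != current {            # input is sorted, so a new $1 starts a new entry
    printf("\n## %s\n\n", $1)
    current = $1
}
{ printf("- [%s](%s)\n", $2, $3) }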

The standard criticism of dsls is that having to learn a bunch of little languages can be more tedious and burdensome than learning a general-purpose language or textual interface. This is a valid complaint in general. I use xslt, css, and jq (the ingenious little tool for manipulating json files) from time to time, but I don’t use any of them often enough to feel like it’s easy to slip in and out of them. That is to say, I always feel like I have to spend a bit of time relearning the dsl before I can get anything done.

But because awk shares its fundamental idioms (both syntactically and conceptually) with sed, C, and the shell, it feels like something that’s easy to slip in and out of (and because I know those other tools extremely well, it only took me an hour or so to “learn” awk). There’s nothing wrong with creating a similarly powerful textual world out of Lisp, or Forth, or Lua, or Haskell, or even sql, but I wonder if one of the keys to writing a good dsl is to figure out what world you’re in and innovate only to the degree that it’s absolutely necessary. In some sense, this is really no different from guis: unless you’re trying to reimagine the user’s world (and that can be a laudable goal), you want to work within the conventions that are already there.

But then again, if you already have sed, C, awk, and the shell, what more could you possibly need?


Incoming: home blog index

Keywords: AWK, Perl, C, Bash, DSLs, UNIX

Last Modified: 2024-01-20T09:46:32-0600