500 million lines of code

Here at Bureauphile, we care about the measurement of performance. This post is about one terrible way to do it: lines of computer code. You may have seen the claim that healthcare.gov is backed by “about 500 million lines of software code”. That figure is from the last line of an NYT article, from an unnamed source. This was picked up by many people, probably because it makes for a nice headline about the bloat of government contracts, with a punch that a simple statement like this site doesn’t work for a lot of people doesn’t have. I ran into it again at Putative, which prompted me to give this some debunking.

First, let’s do the simplest of calculations. If it takes you one second to write a line of code, and you are a contractor working a solid eight hours/day shift, it will take you 17,361 days to write half a billion lines. With a 250-day person year, that’s 69 person-years. Of course, it takes a lot more than a second to write a line of code: more typical would be—I am not making this up— ten lines of code per day. Let’s give them 100 lines of code per day, then we’re still at 5 million days, or 20,000 person years to write all this up. The contractors started this project at the beginning of the year, and did not have 20,000 people working on it.

So the number is prima facie fishy.

But lines-of-code counts, even on the best of days, should not be taken seriously, because it is so difficult to define what is a line of code.

First, different languages have different whitespace customs. In C, you’ll find people who write

if (x==0)
 return INFINITY;
 return y/x;

whereas in other languages the custom is much closer to

if (x==0) return INFINITY; else return y/x;

So this simple sentiment could be one or eight lines of code.

One day, when avoiding work, I wrote a one-line script that counts lines of C-family code. The gist is that I omit comments, then look for lines that have something more than a }, a), or whitespace on them. Here it is, in one line:

sed -e 's|/\*.*\*/||' -e 's|//.*||' $* | awk -e '/\/\*/ {incomment=1}' -e '{if (!incomment) print $0}' -e '/\*\// {incomment=0}' - | grep '[^}) \t]' | wc -l

Following the ingrained UNIX tradition, this is four programs piped together, one of which (awk) has a three-line program specified on the command line. Readers good with awk will notice cases where this is inaccurate; it’s not worth caring.

I’m coming to a relative stopping point with my work on the the Apophenia library for statistical and scientific computing. How many lines of code does it take to implement—if I may be immodest for a moment—a darn solid statistical library?

14,860: using the nontrivial line counter above.

27,934: Simply counting lines, regardless of their content, including documentation and blanks [wc -l for the POSIX geeks].

39,947: There is a testing suite to verify that Apophenia calculates the right numbers, which includes several data sets, totalling 10,013 lines, which we can add in. More on this below.

41,397: It is a not uncommon technique to write code that generates other code. I use the m4 macro language to autogenerate 1,450 lines of code that gets distributed in the package (by a rough count), which we can add to the above.

56,330: I use GNU Autotools for installation, which takes the generation of code using m4 to the extreme: it produces a 16,933 line script based on a 119 line pre-script that I wrote. Can’t use the library without installing it, so add that on.

325,687: When I sit down to a fresh computer (say, one that just solidified in Amazon’s cloud), I have to install other libraries before Apophenia will run. Apophenia relies upon the GNU Scientific Library, which is 197,452 lines by my nontrivial line counter (they use the super-sparse format—it’s 299,716 lines by raw line count), and SQLite 3, which is 71,905 nontrivial lines of code. One could argue that those are a necessary part of Apophenia.

So Apohenia is somewhere between about 15,000 and 325,000 lines of code. If I’m bragging to friends about how efficient my codebase is, it’s the former; If I’m on a Dickensian government contract, it be the latter.

Getting back to healthcare.gov, one question we might really want to ask is: what lines of code might a maintainer one day have the responsibility of manually revising? The press has been reporting that 5 million lines of code have to be changed, which might be getting toward this question, though I am comfortable assuming that somebody just made up that number too.

We know that the site relies on code lifted from other projects, because the press has reported that they screwed up a copyright attribution for one of them. Apophenia is at arm’s length from the GNU Scientific and SQLite libraries, but web projects are more likely to function by cutting/pasting code from javascript libraries into what gets served up. However, if a bug is found in one of these libraries, the .gov maintainers would first file a bug report with the library maintainers, not try to fix it themselves (depending on the situation).

Is data code? The reader will be unsurprised to hear that there are about 141,000 procedure and diagnosis codes in the ICD-10 system. If there’s a database with 50 lines of description for each procedure (also plausible, especially in a sparse-on-the-page format like some XML), then you’ve got 7 million lines of “code” that needs maintaining right there.

The people running healthcare.gov surely bought whatever underlying data sets from a provider, and if an insurance company submits bad pricing files to healthcare.gov, the legal contracts probably say that it is the responsibility of the insurer to fix it. But in managing the real-world project, responsibility isn’t quite so clear-cut: end users will just see a wrong number and declare that healthcare.gov is broken.

Nobody expects the site to run in fifty lines of javascript that the developers tweeted to each other. Healthcare.gov is no doubt tens of thousands of lines of code by any measure, because health care insurance in the United States is in the running for the most complex system on Earth, and this web site has to simplify it and deliver it securely to millions of people in diverse contexts. But defining where this project ends and the projects, databases, and other underlying structures begin is futile, as is measuring the complexity of the task by lines of code.


1 thought on “500 million lines of code

  1. The Paperwork Reduction Act has similar idiotic views about public burden, in terms of asking for the number of items required for a data report. Asking for your birth month, day, and year (in three separate fields) is 3x is much work as asking for one field in MMDDYYYY? The only defense of these idiotic measurement attempts is that it’s really really hard to measure what they are trying to measure — but a stupid measurement is not much better than nothing at all.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s