Leprechauns and Unicorns of Software
Something like 25 years ago, Bill James wrote an essay asserting that one difference between smart and dumb baseball organizations was that dumb organizations behaved as though Major League talent was normally distributed on a standard bell curve, and smart organizations knew that it is not (it’s the far end of a bell curve).
Hold that thought.
So I just finished reading Laurent Bossavit’s self-published book The Leprechauns of Software Engineering, which I recommend quite highly. Bossavit looks at some of Software Engineering’s core pieces of received wisdom: that there is a 10x productivity difference between developers, that the cost of change rises exponentially over the course of a project. The kind of thing that all of us have heard and said dozens of times.
Bossavit does an interesting thing. He goes back to the original papers, sometimes wading through a maze of citations, to find the original source of these claims and to find out what empirical backing they may have. Turns out the answer is “basically none”. One by one, these claims are shown to be the result of small sample sizes, or no sample, or other methodological problems. They’ve become authoritative by force of repetition. (Which doesn’t mean they are wrong, just that we don’t know as much as we think we do.)
If you are like me, which is to say you are fascinated by the idea of empirical studies of software, but deeply skeptical of the practice, this book will take your head to some fun places.
The Bill James analogy, for example. What James is talking about is that in order to accurately value what you have, you need some idea of the context in which it occurs. In the baseball example, to know how to value a player who hits ten home runs (which we’ll pretend is average, for the sake of oversimplification), it’s helpful to have a good sense of how many players are out there who are capable of hitting twelve, or eight. If you erroneously assume that there are fewer players capable of hitting eight home runs than ten home runs, then some bad management decisions are in your future. Specifically, you’ll overvalue your ten home run player (or, more likely, overpay for somebody else’s ten home run player when your own eight home run player is available at a fraction of the cost).
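To make that concrete, here’s a toy simulation of the distribution point. All the numbers are invented, and the model is mine rather than James’s: if underlying talent is bell-shaped in the general population and the league only signs the far right tail, then talent inside the league piles up just above the cutoff instead of forming its own bell curve.

```python
import random

random.seed(42)

# Underlying talent in the general population: a plain bell curve.
population = [random.gauss(0, 1) for _ in range(1_000_000)]

# Pretend the league only signs the far right tail.
cutoff = 2.5  # invented threshold: roughly the top 0.6% of the population
league = [t for t in population if t >= cutoff]

# Within the league, count players just above the bar vs. well above it.
near_cutoff = sum(1 for t in league if t < cutoff + 0.25)
well_above = sum(1 for t in league if t >= cutoff + 0.5)
print(f"league size: {len(league)}")
print(f"just above the bar (within 0.25 sd): {near_cutoff}")
print(f"well above the bar (0.5+ sd higher): {well_above}")
```

Run it and the count just above the bar comfortably exceeds the count well above it: your ten-home-run player has plenty of eight-home-run near-substitutes, which is exactly the information the dumb organization is missing.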
I’m wary of taking this analogy too far, not least because it doesn’t necessarily reflect well on my overeducated typing fingers. There are all kinds of reasons to think that the curve for web developers is different from the one for baseball players. We don’t have a good idea of what the distribution curve of productivity is for developers, even if we had a good idea of what productivity is (we don’t), or a way of measuring it (ditto), or any idea of how individuals improve or decline based on teams (guess what). That said, I do not think that I have been on an actual team where people were genuinely 10x better than other people. (Total newbies notwithstanding, I guess.) Ten times is a lot of times: that’s one person’s week being another person’s half-day. Sustainably.
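And it’s worth remembering how easily a small study can produce a dramatic ratio in the first place. Here’s a hedged sketch (every number invented, none of it from Bossavit’s book): even when almost everyone sits within two or three times of the median, a noticeable fraction of small-group studies will measure a best-to-worst spread of 10x or more.

```python
import random
import statistics

random.seed(42)

def best_to_worst_ratio(group_size: int) -> float:
    # Invented model: "true" productivity is log-normal with modest spread,
    # and each person is measured once, on a single noisy task.
    measurements = []
    for _ in range(group_size):
        skill = random.lognormvariate(0, 0.5)  # most people within ~2-3x of median
        noise = random.lognormvariate(0, 0.3)  # one-off task/measurement noise
        measurements.append(skill * noise)
    return max(measurements) / min(measurements)

ratios = [best_to_worst_ratio(12) for _ in range(10_000)]
print(f"median best/worst ratio in a group of 12: {statistics.median(ratios):.1f}x")
share = sum(r >= 10 for r in ratios) / len(ratios)
print(f"groups that would report 10x or more: {share:.0%}")
```

None of which proves the 10x claim wrong; it just suggests that a single small study is a weak peg to hang it on, which is Bossavit’s point about sample sizes.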
But see what I did there? I palmed a card. I said that newbies don’t count in my personal recollection of my team’s productivity. Why not? For a good reason — I don’t think the productivity of somebody in intense learning mode has a lot to tell me about how software teams work. But that’s my decision, and it’s subjective, and suddenly I’m deciding which of my hypothetical data “really counts” and which doesn’t. That’s a normal process, of course, but it’s not How Science Is Supposed To Work. In reality, I’m already skeptical of the 10x finding, and pulling newbies out moves the data in a way I’m comfortable with, so I’m not likely to examine that decision too closely. (See Bias, Confirmation.)
I spent about five years reading and writing social science academic work, and if there’s one thing I learned, it’s to always be skeptical of any finding that confirms the author’s preconceptions. (See also: Stephen Jay Gould’s The Mismeasure of Man — well worth your time if you deal with data.) Data is complicated; any real study is going to generate a ton of it, and seemingly trivial decisions about how to manage it can have dramatic effects on the perceived results.
I spent a lot of time researching education, which shares with software engineering the problem that individual performance is much harder to measure, or even to define, than you might assume at first glance. Empirical education studies tend to fall into one of two groups:
- A study under very controlled lab conditions, where the researcher is claiming that a clear data result is applicable to the larger world.
- A study in the real world, with messier data, where the researcher is claiming that there is an effect and that it is because of some specific set of causes.
Both kinds of study are problematic — the first kind often has small or non-representative subject pools, and the second kind is often a long-term study of one group, with real questions as to whether the result is at all reproducible. On top of which you have the Hawthorne Effect (any group that knows it is being observed tends to increase performance, no matter what the intervention is) and the effect whose name I can never remember, where the more attention is paid to a specific metric, the less reliable that metric becomes as a proxy for overall performance.
Or, looking at this another way… I got into a conversation at SCNA this year about why SCNA talks so rarely have empirical results about the value of the software techniques discussed. My kind of glib answer was that we’re all a little afraid that empirical results wouldn’t support the value we perceive in what we might call the “SCNA way”. By which I partially mean that we’re afraid that a badly-designed study might suggest that, say, Test-Driven Development had little or no value, and we’d all have to expend energy dealing with it. (But of course, I’d say that any such study was badly designed, because of confirmation bias.)
But I also mean, I think, that I don’t need that kind of empirical testing: as interesting as I find the pursuit, I have little confidence that it will have much relevance to my day-to-day work. Agile methods, TDD, and what we call “good” coding practices make my job easier and more sane. I have my own experience to draw on for that — which I realize is not science, but it’s working for me. Asking for proof that they are the most efficient way to design software seems like impossible icing on the cake. For me, it’s enough that the methods I favor seem to result in saner, more pleasant work environments. It’s weird to simultaneously be interested in empirical results in my field and yet feel they are utterly separated from what I do, but that’s where I am.