Entropy Essays 5: Team Metrics
Or: Why Is a Software Team Like a Shortstop?
Previously on Locally Sourced: I’ve been writing these Entropy Essays about Agile and XP practices. Here’s the most recent one. You can see the rest here. Tell all your friends and colleagues to subscribe.
One of the great things about writing this newsletter is that it’s forced me to think some ideas through in more structured ways than I might have if I was just talking to myself in my own head. For me, writing down a thought packages it in my brain so that I can move on to consider other ideas that are implied by the original idea.
That said, you need to be careful about clever ideas that come to you as you are putting together an argument. People are really good at seeing patterns where patterns don’t exist, and there’s nothing as compelling as a pithy connection or implication of your argument. You need to make sure you think it’s true in addition to being clever.
All of which is to say that I really think these Entropy things are going somewhere, and think I made some connections that locked that in for me, but which might still be more glib than true.
Here’s today’s thought: Except for cases of very clear toxic badness, it’s nearly impossible from the outside to tell the difference between a well-functioning software team and a poorly functioning software team.
Clearly, the extremes are visible. But I think it’s quite hard to tell the difference between a team that is, say, 25% above average and one that is 25% below average. I think I’d even go so far as to say it’s challenging to observe a difference from the outside between a 50% above and 50% below team. (The fact that it’s impossible to define what “50% above average software team” even means goes to my point…) And yet, the difference between a 25% above average and 25% below average team is probably significant, and is likely the difference between profit and loss for many companies that employ software teams.
I’ve always been a big baseball fan, and in particular a fan of baseball statistics. In retrospect, one thing about baseball stats that was very appealing to me as a child was their lack of ambiguity. It is very clear what a baseball player is there to do – win baseball games – and so the goal of messing with baseball statistics is also clear – identify what a player does that makes the team more likely to win.
One thing you learn almost immediately is how small the differences between a good and a bad player are. One extra hit a week is the difference between a star and somebody struggling to stay in the majors. It’s nearly impossible to tell by watching one game who the stars are if you don’t already know.
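To make that arithmetic concrete, here is a rough sketch. The numbers are illustrative assumptions rather than real season data: roughly 20 at-bats a week over a 25-week season, with each player getting the same number of hits every week.

```python
# Illustrative assumptions (not real season data):
# about 20 at-bats a week over a 25-week season.
AT_BATS_PER_WEEK = 20
WEEKS_IN_SEASON = 25


def batting_average(hits_per_week: int) -> float:
    """Season batting average for a player with a steady weekly hit count."""
    total_hits = hits_per_week * WEEKS_IN_SEASON
    total_at_bats = AT_BATS_PER_WEEK * WEEKS_IN_SEASON
    return total_hits / total_at_bats


struggling = batting_average(5)  # five hits a week
star = batting_average(6)        # one extra hit a week

print(f"{struggling:.3f} vs {star:.3f}")  # prints "0.250 vs 0.300"
```

Under those assumptions, one extra hit a week is the whole gap between a .250 hitter and a .300 hitter, which is far too small a difference to spot in a single game.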
For decades and decades, there were no meaningful statistics kept about baseball fielding. Over the last twenty years or so, as those statistics have rolled out, it’s often been the case that there’s a sharp difference between judgments based on eyeballs and judgments based on data. We see a spectacular play where a shortstop makes a diving grab. And we know in our heads that the fact that there was a diving grab doesn’t mean anything other than a fun highlight. We know that a shortstop with quicker reflexes might have made the play look routine. We know that a shortstop with more knowledge might have known to stand where the ball was going and made the play look even easier. But it’s very hard to put all those pieces together to make a clear picture of skill over time, and so there’s an inevitable tendency to reward the effort of the diving grab because the effort is so easy to see.
And so it is with software teams. We don’t have metrics that effectively measure how well a team is performing. From the outside, you see at best that tickets get done and bugs get fixed, but you can’t tell whether a team with metaphorically quicker reflexes might have done the same thing and made it look easier.
And that’s the best case. I often think about the first time I was really in a corporate hierarchy, where I had not only a boss, but a grand boss, and a great-grand boss and a few more. We’d interact with the great-grand boss sometimes and he’d seem kind of clueless about how our team was doing. At the time it was frustrating for me, but now I hope it was frustrating for him as well. All he knew about our team was a line on a spreadsheet. What do you even put on that line that would accurately measure a team?
All this has a few implications:
- We don’t really know what works and what doesn’t, so being humble in the face of that uncertainty seems wise.
- Historically, we tend to reward effort because effort is so much easier to see than skill. That visible effort is sometimes erroneously called “passion”, and it’s a real problem.
- It’s extra hard to incentivize long-term processes or practices because as hard as it is to see how a team is performing in the moment, it’s orders of magnitude harder to see how they are impacting future issues.
- In the absence of real metrics, defaulting to practices that treat all team members (not just developers) with respect and increase their happiness seems wise to me.
All this relates to Agile because Agile practices are supposed to make for “better” teams and “happier” developers. But those practices are hard to execute and their effect is hard to measure, so the practices tend to fade around the edges to versions of them that are simpler. Probably not as good in the long term, but how can we even tell? That fade to simpler practices is where I get the entropy metaphor.
And if I had to sum up my team in a line on a spreadsheet, I think I’d put these things on it. But I also don’t think they tell the real story.
- All team members’ subjective 1-5 scores of how the team is going.
- All team members’ subjective 1-5 scores of code quality.
- How often there are really complex bugs caused by interactions of multiple components. (I have a theory that good code quality produces not fewer bugs, but easier bugs.)
- Either the amount of time it takes to deploy a basic fix to production or the amount of time subjectively lost to spinning wheels over slow tools. (I’m not sure that either of these actually measures anything useful about the team’s skill, but they probably measure something about the team’s happiness.)
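For what it’s worth, here is a minimal sketch of how that spreadsheet line might be assembled. Everything in it is an assumption for illustration: the `TeamSnapshot` name, the field names, the choice to average the 1-5 scores, and the choice to use deploy time rather than time lost to slow tools.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TeamSnapshot:
    """One reporting period of raw inputs. All names here are hypothetical."""
    team_scores: list[int]          # each member's 1-5 score of how the team is going
    code_quality_scores: list[int]  # each member's 1-5 score of code quality
    complex_bugs: int               # bugs caused by interactions of multiple components
    fix_deploy_minutes: float       # time to deploy a basic fix to production


def spreadsheet_line(snapshot: TeamSnapshot) -> dict[str, float]:
    """Collapse one snapshot into the single row a great-grand boss would see."""
    for score in snapshot.team_scores + snapshot.code_quality_scores:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
    return {
        # Averaging is an assumption; a median or the full spread may say more.
        "team_score": mean(snapshot.team_scores),
        "code_quality": mean(snapshot.code_quality_scores),
        "complex_bugs": float(snapshot.complex_bugs),
        "fix_deploy_minutes": snapshot.fix_deploy_minutes,
    }


# Made-up numbers for a four-person team.
line = spreadsheet_line(
    TeamSnapshot(
        team_scores=[4, 3, 5, 4],
        code_quality_scores=[3, 3, 4, 2],
        complex_bugs=2,
        fix_deploy_minutes=45.0,
    )
)
print(line)
```

Even in this toy version, two of the four columns are averages of opinions, which is part of why I don’t think the row tells the real story.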
What would you use?