Databases and Validation and Uncertainty
A long time ago, I studied research on what makes successful engineering teams. (Not programmers; other engineering fields.) I don’t remember a lot of it, but one phrase stands out: “preserving ambiguity”. Successful teams don’t make decisions that aren’t needed, and they don’t get themselves locked in too early.
One fact about the beginning of a project is that you know less about the project than you ever will. And yet teams are often asked to provide estimates, do architecture, and make long-ranging plans at the beginning of the project when it is guaranteed that some of the assumptions will be wrong.
It’s a hard problem, but one way to get around it is to preserve ambiguity. Be humble, and try not to lock yourself into abstractions or structures before you know how the project is going to be used.
Which brings me to data models and specifically data validations, something I was thinking of as I was planning out the data model for my project tracker. Data validations are one way that you can lock yourself into assumptions about the data in ways that can cost you later.
At the beginning of the project you don’t really know what your data constraints are. You might know some of them, but you don’t know all of them, and some of what you think is wrong. Being humble at the beginning of a project suggests caution about strict data validation.
Validation is one of those coding problems where the cost of not doing the thing is understood: bad data can get into the database and then you are forever fighting it. The cost of doing an unnecessary thing is not as clearly defined, and it often manifests far downstream from the original decision, making it hard to pin down the problem.
A Rails application can validate data at the database level via database constraints or at the code level via ActiveModel validations. I am less likely than many other Rails developers to include database constraints. I am also less likely to use ActiveRecord validations. (It’s probably worth mentioning that in the early days of Rails, the core team and DHH strongly recommended doing validations in code and not in the database, a recommendation that has gotten less forceful over time.)
There are a lot of developers, especially those who come from enterprise backgrounds, who put a lot of stock in having very strongly validated data. Even people who don’t have that kind of background look at me nervously when I say that validation might not be necessary, and may well be harmful.
You are probably looking at me nervously right now, which is why this is such a tricky argument to make, because it sounds like I’m advocating being sloppy.
Don’t be sloppy.
Do think more deeply about the costs of validating data and what your data really looks like in the world.
In some cases, of course, you do have strong reasons for requiring certain conditions in the data. But, especially at the beginning of the project, you may not really know what those conditions will be when your application starts dealing with real-world complexity. Saying that data has certain requirements can make your application less able to deal with a real-world situation where the requirement may not apply. (We’ve all seen this, for example, in applications that strictly enforce logic about people’s names that does not match the complexity of real people’s names.)
One way to think about this when looking at user facing data: is this piece of information so important that it is worth raising an error to a user in the process of giving us money?
Form validation errors are off-putting to users and can cause them to back out of a sales process. Is the data more important than the sale? Sometimes it is – you need a credit card authorization for a credit card sale. You probably want a user sign up to include a user name and a password.
But be cautious. I had a client that insisted on requiring company name and job title in their sales flow. And yes, they wanted to block a sale if the fields weren’t there.
Not only did they likely lose sales and annoy users, but they didn’t even get the benefit they wanted, which was clean demographic data about their users. Users would enter “n/a” or “asdf” or whatever. This data is especially costly because now not only do you not have clean data, but you have to actively clean up weird responses that could otherwise have just been left blank.
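As a rough illustration of the alternative, here is a minimal plain-Ruby sketch of treating those fields as optional and storing a blank answer as `nil`. The `optional_text` helper and the form hash are hypothetical names invented for this example; they are not from the client project or from Rails.

```ruby
# Hypothetical helper: keep a real answer, store nil for a blank one.
def optional_text(value)
  text = value.to_s.strip
  text.empty? ? nil : text
end

# Made-up form submission where the user skipped the optional fields.
form = { email: "pat@example.com", company: "   ", job_title: "" }

record = {
  email: form[:email],
  company: optional_text(form[:company]),
  job_title: optional_text(form[:job_title])
}

# Skipped fields are plainly nil, so later analysis can ignore them
# without first filtering out "n/a" or "asdf".
answered = record.values_at(:company, :job_title).compact
puts "optional answers given: #{answered.length}"
```

The point of the design is that a missing answer stays recognizably missing, rather than being disguised as data.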
Even worse, I don’t think you get a code benefit from the validation. On paper, an advantage of validating data is that you don’t have to continuously test for the state of the data; you can assume the validations are complete. In a Rails app, I don’t think you can. ActiveRecord objects are only automatically validated when saved, which means there are code paths where new data could easily go through the system before being validated, so you often need code checks anyway.
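To make the validate-on-save gap concrete, here is a small plain-Ruby stand-in. It is not ActiveRecord: the `Signup` class and `greeting_for` method are hypothetical, and the class only mimics the general shape of "validation runs when you call save". In real ActiveRecord there are also persistence calls that skip validations, such as `save(validate: false)` and `update_column`.

```ruby
# Hypothetical stand-in for a model that validates only on save.
class Signup
  attr_accessor :email
  attr_reader :errors

  def initialize(email: nil)
    @email = email
    @errors = []
    @persisted = false
  end

  # Validation runs only when something asks for it.
  def valid?
    @errors = []
    @errors << "email can't be blank" if email.nil? || email.strip.empty?
    @errors.empty?
  end

  # Refuse to persist invalid data, the way a save-time check would.
  def save
    return false unless valid?
    @persisted = true
  end

  def persisted?
    @persisted
  end
end

# This code can be handed an object that has never been validated,
# so it carries its own guard despite the validation above.
def greeting_for(signup)
  name = signup.email.to_s.split("@").first
  name.nil? || name.empty? ? "Hello there" : "Hello #{name}"
end

signup = Signup.new(email: nil)
puts greeting_for(signup) # runs before any validation has happened
puts signup.save          # only at this point does the validation fire
```

The guard in `greeting_for` is exactly the kind of check the validation was supposed to make unnecessary.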
There’s also the 100% problem. You have business logic that you think is a 100% true constraint, but in the real world it is more like a 99% or even 99.9% true case. Weird edge cases happen all the time, and if your database or code can’t handle them, you are causing headaches for your administrators or customer service personnel or whoever is handling your data.
To give an oversimplified example, if all your customers live in the US, you may assume that you can require each record to match one of the fifty US states and a five-digit zip code. But there are edge cases here. A customer might live on an army base, or they might move out of the US but still be your customer, or who knows what all. One feature of complicated real-world issues is that they are hard to enumerate in advance.
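Here is a minimal plain-Ruby sketch of what such a strict state-and-zip rule might look like, run against a few plausible customers. The `strict_us_address?` method and the sample records are made up for illustration, and the sample addresses are only meant to be representative of each edge case.

```ruby
# The fifty state abbreviations, and nothing else.
US_STATES = %w[
  AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD
  MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC
  SD TN TX UT VT VA WA WV WI WY
].freeze

# Hypothetical strict rule: a known state plus exactly five digits.
def strict_us_address?(state:, zip:)
  US_STATES.include?(state) && zip.match?(/\A\d{5}\z/)
end

# Made-up records covering the "99.9%" cases.
customers = {
  "typical customer"       => { state: "IL", zip: "60601" },
  "military address (APO)" => { state: "AE", zip: "09021" },
  "Washington, DC"         => { state: "DC", zip: "20500" },
  "ZIP+4 entered"          => { state: "IL", zip: "60601-1234" },
  "moved to Canada"        => { state: "ON", zip: "M5V 2T6" }
}

customers.each do |label, address|
  verdict = strict_us_address?(**address) ? "accepted" : "rejected"
  puts "#{label}: #{verdict}"
end
```

Only the first record passes. Each rejected record is a real customer that somebody now has to enter some other way.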
My personal experience with administrators and customer service reps is that they are extremely smart about working the system they have to manage the real world situation they are given. The ways in which they enter the data given the constraints, however, won’t necessarily match the way that you want to analyze the data.
A system that makes real use cases invalid is inviting creative data entry that defeats the purpose of the validation anyway. And again, you don’t know in advance which of your validations are truly 100% and which are just 99.9%.
Database validations, and in particular foreign key requirements, have a clear cost when testing, especially in a Rails app. By requiring foreign keys to exist, you are requiring tests to create objects that may not be necessary for the test. If your order object has a foreign-key requirement on a user, which has a foreign-key requirement on an address or something, then just to create a test with an order, you also need to create a user and an address. This all adds up and can raise the cost and reduce the value of tests.
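The setup chain can be sketched in plain Ruby. These classes are hypothetical and are not ActiveRecord models; each constructor simply insists on its parent object, standing in for a required foreign key. In a real Rails test each of these objects would typically also be a database insert.

```ruby
# Hypothetical models where each one requires its parent to exist.
class Address
  attr_reader :city

  def initialize(city:)
    @city = city
  end
end

class User
  attr_reader :address

  def initialize(address:)
    raise ArgumentError, "address is required" if address.nil?
    @address = address
  end
end

class Order
  attr_reader :user, :line_totals

  def initialize(user:, line_totals:)
    raise ArgumentError, "user is required" if user.nil?
    @user = user
    @line_totals = line_totals
  end

  def total
    line_totals.sum
  end
end

# The behavior under test is only the arithmetic in Order#total,
# but the requirements force the test to build the whole chain first.
address = Address.new(city: "Chicago")
user = User.new(address: address)
order = Order.new(user: user, line_totals: [5, 10])
puts order.total
```

Two of the three objects exist only to satisfy the requirement, and neither affects the result being checked.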
Data validation is a process rather than a yes/no question, and what it means to have valid data is going to be different at different times in the lifecycle of your code.
The live version of this post is at http://noelrappin.com/blog/2021/07/databases-and-validation-and-uncertainty, and you can comment there if you have something to say. If you like this, and want to see more like it in your email inbox, you can sign up at https://buttondown.email/noelrap.