It’s rare that I am intentionally provocative in my post titles, but I’d really like you to think about this one. I’ve known and worked with a lot of people who work with data over the years, many of whom call themselves data scientists and many who do the job of a data scientist under another name¹. One thing that worries me when they talk about their work is an absence of scientific rigour. This is a huge problem, and one I’ve talked about before.
The results that data scientists produce are becoming increasingly important in our lives, from determining what adverts we see to how we are treated by financial institutions or governments. These results can have a direct impact on people’s lives, and we have a moral and ethical obligation to ensure that they are correct.
As a mathematician, I trained on numerical proofs and patterns. As a scientist, I look for things to explain, and I was trained to understand statistical significance, repeatability and the analysis of data from theory and experimentation. As a software engineer, I trained on optimisation, portability and problem solving. But it was my first role in industry that taught me how to break things. That first role as a junior, seconded into a testing role, gave me a thorough grounding in skills that I believe are essential to anyone working in data science today, but that time and time again I see are lacking. Whether it’s a candidate at interview, a paper, or a conference presentation, there’s a distinct lack of formal testing that’s ruining the scientific nature of data science. The courses I’ve seen focus mainly on techniques and tests for significance, but not on testing your assumptions, code or data.
I can distil what is lacking into several distinct parts:
- a questioning of evidence
- making hypotheses based on evidence
- testing those hypotheses²
- making new hypotheses and starting again.
For me, that initial questioning of evidence and then the iterative testing is what makes science great. It’s the ultimate search for truth and there is no shame in changing your view based on new evidence³.
How often do you really get to understand the data you use? If you make any assumptions about the data, have you checked those assumptions? Do you just assume that the data is correct and nobody upstream could have possibly made a mistake in data gathering, input or pre-processing? Look at your data – really look at it. Are there any anomalies that could indicate things are not as they should be? Even if everything looks fine, what assumptions are you making in your code? Do you have binary fields (e.g. gender) that could be updated in future? Do you have formatting that could change? Do you always expect dates as DD-MM-YYYY? Get to understand your subconscious assumptions. If you are doing any processing, are you checking that this always works as you expect? Do you skip errors or investigate them?
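One way to make those subconscious assumptions explicit is to write them down as checks. This is only a minimal sketch: the field names `date` and `gender`, the set of known values and the sample records are all hypothetical, chosen to mirror the questions above.

```python
from datetime import datetime

# Assumptions (hypothetical): dates arrive as DD-MM-YYYY and the "gender"
# field currently takes one of two known values.
EXPECTED_DATE_FORMAT = "%d-%m-%Y"
KNOWN_GENDER_VALUES = {"F", "M"}


def check_record(record):
    """Return a list of the assumptions this record violates (empty if none)."""
    problems = []
    try:
        datetime.strptime(record["date"], EXPECTED_DATE_FORMAT)
    except (KeyError, TypeError, ValueError) as exc:
        problems.append(f"date: {exc!r}")
    if record.get("gender") not in KNOWN_GENDER_VALUES:
        problems.append(f"gender: unexpected value {record.get('gender')!r}")
    return problems


records = [
    {"date": "23-09-2014", "gender": "F"},
    {"date": "2014-09-23", "gender": "M"},  # ISO format, not DD-MM-YYYY
    {"date": "01-02-2015", "gender": "X"},  # a value introduced upstream later
]

# Map each record index to its violations, keeping only the records with problems.
bad = {i: p for i, p in enumerate(map(check_record, records)) if p}
print(bad)
```

The point is not these particular checks but that every violation is reported with enough detail to investigate it, rather than being coerced or dropped.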
A big red flag for me is when I see this:
    try:
        something...
    except:
        pass
This says, “I kept getting errors when running this so I wrapped it in a try block so I could ignore the failures without investigating them”. Now this may be fine: you may want to throw away the exceptions for good reason. If so, add a comment, because those requirements may change in future. I’d like to see this sort of thing highlighted as a warning of a possible error. If you’re silently ignoring problems, you may find that the data you actually process is a statistically insignificant subset of the original. Every warning should be investigated. I also want to know why warnings haven’t been fixed. They may be benign or they may have a considerable impact, so unless you have investigated them all, how can you know with confidence?
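As a rough alternative to swallowing everything, the sketch below catches only the exceptions it expects, logs each one, keeps the failing records for inspection and refuses to continue if too many fail. The function name `process_all`, the logger name and the `max_failure_rate` threshold are my own illustrative choices, not a standard API.

```python
import logging

logger = logging.getLogger("pipeline")  # hypothetical logger name


def process_all(records, transform, max_failure_rate=0.01):
    """Apply `transform` to each record, keeping failures instead of hiding them."""
    results, failures = [], []
    for index, record in enumerate(records):
        try:
            results.append(transform(record))
        except (ValueError, TypeError) as exc:  # only the errors we expect
            logger.warning("record %d failed: %r (%r)", index, record, exc)
            failures.append((index, record, exc))
    rate = len(failures) / len(records) if records else 0.0
    if rate > max_failure_rate:
        # Too much data was lost for the remainder to be trusted.
        raise RuntimeError(
            f"{len(failures)} of {len(records)} records failed ({rate:.0%})"
        )
    return results, failures


# One of four records cannot be parsed; we tolerate that here on purpose.
results, failures = process_all(["1", "2", "x", "4"], int, max_failure_rate=0.5)
print(results, [index for index, _, _ in failures])
```

Any exception type that is not listed still propagates, which is deliberate: an error you did not anticipate is exactly the one you want to see.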
Check your data after each transform to make sure you’ve not distorted or corrupted it. If you change a 4-channel image to RGB and resize it, does it still look how you expect afterwards? If not, fix it.
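To illustrate checking after a transform, here is a small pure-Python sketch. It assumes an image is a list of rows of integer tuples in the range 0 to 255, and that transparency should be composited over a white background. Both are assumptions made for this example, and resizing is left out to keep it short.

```python
def rgba_to_rgb(pixels, background=(255, 255, 255)):
    """Composite RGBA pixels (0-255 ints) over a solid background colour."""
    out = []
    for row in pixels:
        new_row = []
        for r, g, b, a in row:
            alpha = a / 255
            new_row.append(tuple(
                round(channel * alpha + bg * (1 - alpha))
                for channel, bg in zip((r, g, b), background)
            ))
        out.append(new_row)
    return out


def check_image(pixels, height, width, channels):
    """Raise ValueError if the image has lost the shape or value range we expect."""
    if len(pixels) != height:
        raise ValueError(f"expected {height} rows, got {len(pixels)}")
    for row in pixels:
        if len(row) != width:
            raise ValueError(f"expected {width} columns, got {len(row)}")
        for pixel in row:
            if len(pixel) != channels:
                raise ValueError(f"expected {channels} channels, got {len(pixel)}")
            if not all(0 <= value <= 255 for value in pixel):
                raise ValueError(f"value out of range: {pixel}")


rgba = [[(255, 0, 0, 255), (0, 0, 255, 0)]]  # opaque red, fully transparent blue
check_image(rgba, height=1, width=2, channels=4)
rgb = rgba_to_rgb(rgba)
check_image(rgb, height=1, width=2, channels=3)
print(rgb)
```

Shape and range checks only catch gross corruption. The transparent blue pixel silently becomes white here, which may or may not be what you wanted, and that is the kind of thing you only find by looking at the result.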
The best advice I can give here is a thorough test harness. Every function you write should not only handle the data you do expect, but also the data you don’t expect. Have you heard the joke about the tester who walks into a bar? Check your counts and look at the data that causes errors. Every problem piece of data should not only be part of your test harness, but should also inspire you to create potential problems you may not have encountered directly.
QA Engineer walks into a bar. Orders a beer. Orders 0 beers. Orders 999999999 beers. Orders a lizard. Orders -1 beers. Orders a sfdeljknesv.
— Bill Sempf (@sempf) September 23, 2014
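Turned into a test harness, the joke might look something like the sketch below. The function `parse_quantity` and the limit `MAX_ORDER` are hypothetical, invented purely so there is something to test.

```python
import unittest

MAX_ORDER = 100  # hypothetical business limit


def parse_quantity(raw):
    """Parse an order quantity, rejecting anything outside 1..MAX_ORDER."""
    if isinstance(raw, bool) or not isinstance(raw, (int, str)):
        raise ValueError(f"unsupported type: {type(raw).__name__}")
    try:
        quantity = int(raw)
    except ValueError:
        raise ValueError(f"not a whole number: {raw!r}") from None
    if not 1 <= quantity <= MAX_ORDER:
        raise ValueError(f"quantity out of range: {quantity}")
    return quantity


class ParseQuantityTests(unittest.TestCase):
    def test_expected_input(self):
        self.assertEqual(parse_quantity("1"), 1)
        self.assertEqual(parse_quantity(MAX_ORDER), MAX_ORDER)

    def test_unexpected_input(self):
        # Each case runs as its own sub-test, so one failure does not hide the rest.
        cases = [0, -1, 999999999, "a lizard", "sfdeljknesv", "", None, 1.5, True]
        for raw in cases:
            with self.subTest(raw=raw):
                with self.assertRaises(ValueError):
                    parse_quantity(raw)
```

Saved to a file, this can be run with `python -m unittest`. Whenever real data breaks the function, the offending value goes into `cases` so that it is tested from then on.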
Those of you who are machine learning data scientists also need to look carefully at your data – you can’t just assume that everything is fine and the system will learn around a small amount of bad data: mislabelled classes or incorrect data can significantly throw off your outputs. Purgamentum init, exit purgamentum⁴.
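A cheap first look at labels, before any training, could be something like this sketch. The helper `audit_labels`, the class names and the `min_count` threshold are all made up for illustration.

```python
from collections import Counter


def audit_labels(labels, known_classes, min_count=2):
    """Summarise labels: counts, unknown names, absent classes and rare classes."""
    counts = Counter(labels)
    return {
        "counts": dict(counts),
        # Labels that are not in the agreed class list (often typos or case errors).
        "unknown": sorted(set(counts) - set(known_classes)),
        # Agreed classes with no examples at all.
        "missing": sorted(set(known_classes) - set(counts)),
        # Agreed classes with suspiciously few examples.
        "rare": sorted(c for c in known_classes if 0 < counts[c] < min_count),
    }


labels = ["cat", "dog", "cat", "Cat", "dog", "bird"]
report = audit_labels(labels, known_classes={"cat", "dog", "bird", "fish"})
print(report)
```

This only finds labels that are malformed or unevenly distributed. A picture of a dog labelled as a valid `cat` will pass straight through, so there is no substitute for inspecting samples from each class.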
Word your initial hypotheses without pre-assuming a correlation. This will help you avoid bias as you build your models. If you are looking for a specific correlation then you’ll probably find something, but won’t necessarily see something more significant that you didn’t expect.
Test your hypotheses, and not just by using a segmented subset of your data here (although you really should do that). I’m particularly thinking about proper rigorous testing. What happens if you give your vision network things it hasn’t seen before? What happens if you’ve mislabelled a class? What happens if the data format changes? Are you checking that column 3 is always a timestamp or do you just take it in? Do you have any fencepost errors?
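For the question about column 3, a minimal sketch of checking rather than trusting could look like this. The timestamp format, the helper name `read_rows` and the sample data are assumptions for the example.

```python
import csv
import io
from datetime import datetime

# "Column 3" counted from one is index 2 counted from zero: a classic fencepost.
TIMESTAMP_COLUMN = 2


def read_rows(handle, timestamp_format="%Y-%m-%d %H:%M:%S"):
    """Yield parsed rows, raising with the line number if column 3 is not a timestamp."""
    for line_number, row in enumerate(csv.reader(handle), start=1):
        if len(row) <= TIMESTAMP_COLUMN:
            raise ValueError(
                f"line {line_number}: expected at least 3 columns, got {len(row)}"
            )
        try:
            stamp = datetime.strptime(row[TIMESTAMP_COLUMN], timestamp_format)
        except ValueError as exc:
            raise ValueError(
                f"line {line_number}: column 3 is not a timestamp: {exc}"
            ) from exc
        # Replace the raw string with the parsed timestamp, keeping other columns.
        yield row[:TIMESTAMP_COLUMN] + [stamp] + row[TIMESTAMP_COLUMN + 1:]


good = io.StringIO("a,1,2014-09-23 10:00:00\nb,2,2014-09-24 11:30:00\n")
rows = list(read_rows(good))
print(rows)
```

If an upstream change swaps two columns, this fails on the first affected line and says which one, instead of quietly feeding the wrong values into a model.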
I’m yet to find a data science course that places any emphasis on scientific rigour beyond basic significance of results. As such, I try to hire people who have these skills naturally: a healthy scepticism of the data and the process. I also recommend any of the books by James Whittaker, who approaches testing in a scientific way that appeals to me.
Finally, seek feedback from the people around you – validate your assumptions, check your code, find the mistakes before they affect someone’s life.
1. I don’t make a distinction here in job title; it’s the job itself that’s important.
2. Or, in old English, “proving”. “The exception proves the rule” means that the exception tests the rule, not that it confirms it. We have a wonderfully strange, dynamic and evolving language, but this phrase really sticks out as one that’s consistently misinterpreted 🙂
3. There is shame in blindly accepting things without verification of the evidence, or in refusing to change your view when the evidence with which you’re presented holds up to scrutiny. It makes me sad that most people don’t subscribe to this view.
4. Loosely, garbage in, garbage out.