Agile Data Science: your data point is probably an outlier

It’s not often that I feel the need to write a reactionary post, as the things that tend to inflame me are usually inflammatory by design. Today, however, I read something on LinkedIn that polarised debate within a group of people who should really appreciate learning from different data: Data Scientists.


What was interesting was how the responses fell neatly into one of two camps: the first praising the poster for speaking out, supported by nearly an order of magnitude more likes than the total number of comments; the second disagreeing and pointing out that agile can work. What was lost in all of this is that “can” is not synonymous with “always”: making it work takes a good team and a better implementation than many companies manage. What irked me most about the whole thread was the accusation that people doing data science with agile obviously “didn’t understand what science was”. I hate these sweeping generalisations, and I really do expect a higher standard of debate from anyone with either “data” or “science” anywhere near their profile.

So what exactly is the problem?

Science requires research and investigation: creating and testing hypotheses, documenting and repeating results. When you ask a scientist how long it will take to solve a problem, you will always get either a long timescale or an “I don’t know”. This is absolutely fine: if the problem were already understood well enough for the full solution to be estimated, it wouldn’t need research. Fatuous statements in the LinkedIn thread, such as “if agile worked then cancer could be cured in four sprints”¹, show how little many data scientists understand about the process. I suspect that enforced implementation has left a sour taste of agile for many data scientists, but I hope they respect data enough to understand that their single data point may not be significant in predicting the overall success or failure of something that is working well in many companies.

I do sympathise with this view, as I’ve implemented agile practices successfully over my career in technology while seeing other, similar companies fail, and many are failing still! The same objections raised against agile in traditional development apply to data science. I’ve given talks before on how you can do AI within a continuous-lifecycle environment, and with the correct approach it can work for everyone’s benefit.

A decade ago I was at a very large multinational company that was struggling to get its developers to change from a very traditional approach to more agile practices. One of the developers just couldn’t get it. When asked how long a task would take, he couldn’t tell me, shrugging and suggesting “2 years” for every task. So we took a step back. Instead of asking how long, I started asking what he needed to do. Before he realised it, he had a very simple list of tasks that he could estimate, plus a further list of things that he might or might not have to do, or which required further information. We planned in what we could, and everything that couldn’t be done yet was pushed back.
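
As a purely illustrative sketch of that decomposition (every task name and estimate below is invented), the trick is to split “how long will it take?” into work you can estimate now and open questions you can’t, and to plan in only the former:

```python
# Illustrative sketch only: the task names and estimates are invented.
# The point is the split: estimable work goes into the plan,
# unknowns are recorded and pushed back until they can be answered.
estimable_tasks = {
    "profile the raw data": "2 days",
    "write the cleaning pipeline": "3 days",
    "train a baseline model": "2 days",
}
open_questions = [
    "does the supplier feed include labels?",   # needs more information
    "is feature engineering needed at all?",    # depends on early results
]

sprint_plan = list(estimable_tasks)  # plan in only what we can estimate
backlog = open_questions             # everything else is pushed back
print(f"Planned: {sprint_plan}\nPushed back: {backlog}")
```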

I take exactly the same approach with research. And here’s the not-so-big secret: this is exactly how I did my PhD, and I’m sure I wasn’t alone in this. I had a weekly meeting with my supervisor where we discussed what I’d done in the past week and what I was intending to do next. Each week the plan was updated based on the previous results. If I hit a dead end, I’d yell early. The only difference between this and my current research is that back then it was all on paper in my notebook rather than in an online task manager.²

The scientific method isn’t about blindly stumbling forward: you think about what you need to do first. If you’re organised, you plan your time, work out what is going to block you and whose help you might need. When you’ve done your experiments, you check your results and plan what you are going to do next. This fits beautifully with the agile method.

So why does it fall down in industry, and why are so many data scientists against it?

Mainly because businesses convert it into “wagile”³: they follow a waterfall methodology, but having read up on sprints they split the tasks into chunks and merge the two together. This can work for mature solutions in production, where the issues to solve are known and logged, but it does not work for research, because the outputs from this sprint’s tasks will fundamentally change what happens in the next, and with it the overall timescales. Similarly, there is a disconnect over what the output of a sprint should be. Have a defined point you are aiming to reach, not (necessarily) something you can share. This also has the great advantage of documenting what you’ve done, what worked and what didn’t, something every scientist should be doing anyway. I think the majority of data scientists are being pushed to deliver something working within a sprint, with no time to do the science, and this just isn’t the correct approach; either that, or they aren’t following scientific methodology anyway.⁴

Solving the whole problem in one go is too big. You might need to look at the data before you even know how you are going to begin. You know, because you are an experienced data scientist, how long you will need with a data set to apply your toolkit. You know how long it will take to clean a data set of a certain size. You know how long it takes to create a first version of a CNN. Do not mistake these for final outputs and, more importantly, do not let the business mistake them for final versions. This is all an iteration. Once you’ve tried a few things, talk as a team about your next steps: what worked, which results look promising, what didn’t. This is the collaborative scientific process, and it is encapsulated in agile as the retrospective. Plan your next experiments. You should never have more than a few weeks’ worth of tasks. Report your progress in terms of accuracy and techniques tried, but most importantly be clear on what you are going to do next, as this is what the business needs for its planning. After a few projects done this way you will also start to get a feel for broader estimates. For the problems I’m working on, I know I need a sprint for the initial data review, another sprint for initial experimentation, and then about four more sprints to get something the business will accept as a first version. If I’m lucky it ends there; for more complicated problems, further sprints will be needed to reach a business-specific end point.
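
To make the “first version of a CNN” point concrete, here is the kind of deliberately small baseline I mean: a minimal sketch in Keras, where the input shape, class count and layer sizes are placeholder assumptions to be replaced by whatever your data review produced.

```python
# A minimal "first version" CNN, as an illustrative sketch only.
# NUM_CLASSES and INPUT_SHAPE are placeholder assumptions; replace them
# with whatever your data-review sprint actually found.
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 10            # assumption: number of target classes
INPUT_SHAPE = (64, 64, 3)   # assumption: image height, width, channels

def first_version_cnn() -> tf.keras.Model:
    """Deliberately small: the goal is a working baseline within one sprint."""
    inputs = tf.keras.Input(shape=INPUT_SHAPE)
    x = layers.Conv2D(16, 3, activation="relu")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = first_version_cnn()
model.summary()  # record alongside results so the retrospective has evidence
```

A baseline like this is easy to estimate precisely because nothing in it is novel; the science starts when you compare its results against the next iteration’s.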

It really is that easy, and this process works whether you are doing multi-year research or spending just a few months solving a business problem. Always bear in mind the focus of the research: in industry this may not be state-of-the-art results.


  1. Slightly paraphrased, but I’m not going to call out the individual.
  2. Although had they existed back then, I would probably have used one.
  3. I’ve been using this term since 2006, but I know others use it too and I have no idea where it originated.
  4. I suspect that a subset of people who call themselves data scientists do not actually follow the scientific method at all. I know a few people on LinkedIn who describe themselves as such because they once managed a database and wouldn’t know basic statistics if their lives depended on it… but this sort of thing happens every time a subject becomes trendy 🙂
