My LinkedIn news feed was lit up last week by a Medium post from Dario Radečić, originally published in December 2019, discussing how much maths is really needed for a job in data science. He starts by berating the answers from the Quora-posting PhD brainiacs who demand you know everything… While the article is fairly light-hearted, and is probably more an encouragement piece for anyone currently studying or trying to land that first job in data science, I felt that, as someone who hires data scientists1, I could add some substance from the other side.
Firstly, you do need mathematics to do data science; there's no getting away from that. The fewer qualifications you have, the more you will have to demonstrate the skills employers are looking for in other ways. However, we're in the 21st century. We don't use slide rules any more ;) and we don't even have to code the basic statistical toolkit ourselves. As an employer I expect you to know how to use these tools, adapt them and interpret them.
So how much maths do you need? As a minimum, enough to choose the correct tool and interpret the data correctly. That’s still too fluffy. When you look at what a data scientist actually does (and there’s a great summary of this on kdnuggets here) there is a programming component and a maths component consisting of statistics and probability.
As an aside, this is a starting point for what people will expect of you. Every job will differ in what it wants as a minimum and what will make you stand out. Most job descriptions list everything because companies often list everything they want rather than what they need – generally only a subset of this will get you the role. Naturally, the more you can offer an employer, the more attractive you will be as a hire. While there are many soft skills that are essential for a data scientist, we'll stick to the critical mathematics here.
If someone claims to be a data scientist2 then I expect them to know the following. By “know” I mean they should be able to:
- explain under what circumstances this technique is valid and when it could give misleading results
- create an application using this technique with standard libraries (e.g. NumPy, Pandas etc)
- interpret the results for a wider audience
I explicitly do not mean that you need to know:
- how to code this from first principles
- mathematical proofs of any of these
- any of these algorithms from memory
However, there will come a time in your career when you will have to start adapting some of the algorithms or implement something from a paper. At this point you will need to understand the standard mathematical notation to follow the reasoning in an academic paper and be able to follow the proofs, which will need more advanced mathematical skills. You’ll be continually learning as a data scientist so you’ll pick this up along the way3.
Probability and Distributions
This should be your starting point. If you are going to look at any data then you need to understand natural variation and outliers. Make sure that you can determine the probability of different results given various distributions. Be able to recognise deviations from these standard distributions and the impact this will have on the probability of getting a result. You should be able to create graphs of your data and interpret them in terms of inter-quartile ranges, quantiles, variance and standard deviation.
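As a minimal sketch of the kind of fluency I mean – using NumPy and SciPy on entirely synthetic data – you should be comfortable computing tail probabilities and summary statistics like these:

```python
import numpy as np
from scipy import stats

# Tail probability under a standard normal distribution:
# how likely is a result more than 2 standard deviations out?
p_above_2sd = 1 - stats.norm.cdf(2)   # one tail
p_outside_2sd = 2 * p_above_2sd       # both tails, roughly 4.6%

# Summary statistics for a synthetic sample.
rng = np.random.default_rng(42)
sample = rng.normal(loc=100, scale=15, size=1000)

q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1                     # inter-quartile range
std_dev = sample.std(ddof=1)      # sample standard deviation
variance = sample.var(ddof=1)     # sample variance

# A common (not universal) outlier rule: beyond 1.5 * IQR from the quartiles.
outliers = sample[(sample < q1 - 1.5 * iqr) | (sample > q3 + 1.5 * iqr)]
print(f"IQR={iqr:.1f}, sd={std_dev:.1f}, flagged outliers={len(outliers)}")
```

The 1.5 × IQR rule is one convention among several – knowing when a flagged point is genuinely interesting rather than noise is the maths-plus-judgement part.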
Hypotheses, estimates and intervals
You’ll know how to get point and interval estimates for data using maximum likelihood, confidence intervals, and t-intervals. Given a question phrased in business terms, you will be able to turn this into a hypothesis test with appropriate null and alternative hypotheses. For specific distributions you will be able to determine evidence in favour of, or against, the null hypothesis for both single and multivariate problems.
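For instance, here is a sketch with SciPy and made-up numbers of turning a business question (“does the new checkout page change average order value?”) into a two-sample test, with a t-interval alongside:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: order values for a control and a variant group.
# H0: the means are equal; H1: they differ.
control = rng.normal(loc=50.0, scale=10.0, size=200)
variant = rng.normal(loc=53.0, scale=10.0, size=200)

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(variant, control, equal_var=False)

# 95% t-interval for the control group's mean order value.
ci_low, ci_high = stats.t.interval(
    0.95, df=len(control) - 1, loc=control.mean(), scale=stats.sem(control)
)

alpha = 0.05
print(f"t={t_stat:.2f}, p={p_value:.4f}, reject H0 at 5%: {p_value < alpha}")
```

The choice of Welch's test over the pooled-variance version is itself one of those “which tool and why” decisions an interviewer will probe.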
Be able to talk about data sufficiency, bias, and type I and type II errors: how to deal with these problems and the impact they have on the inferences you draw from the results.
While you should already understand the impact of sample size on statistical inference, you should also know the different sources of sampling error and the impact they can have on your results: selection bias, random sampling error, over- and under-coverage, measurement/response error, processing errors, and participation bias.
Starting with Bayes’ theorem and simple probabilities, extend this to Bayesian inference with priors and posteriors, updating as new data becomes available. Be able to create hierarchical models. Be aware of Markov-Chain-Monte-Carlo techniques and when to use them.
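A minimal illustration of a Bayesian update, using the conjugate Beta-Binomial pair on made-up conversion data (one of the simple cases where no MCMC is needed):

```python
from scipy import stats

# Conjugate Beta-Binomial update for a conversion rate.
# Prior: Beta(2, 2) - weakly informative, centred on 0.5.
alpha_prior, beta_prior = 2, 2

# New data arrives: 30 conversions in 120 trials (hypothetical numbers).
conversions, trials = 30, 120

# Posterior: Beta(alpha + successes, beta + failures), available in
# closed form because the Beta prior is conjugate to the binomial.
posterior = stats.beta(alpha_prior + conversions,
                       beta_prior + (trials - conversions))

posterior_mean = posterior.mean()            # 32 / 124, about 0.258
ci_low, ci_high = posterior.interval(0.95)   # 95% credible interval
print(f"posterior mean={posterior_mean:.3f}, "
      f"95% credible interval=({ci_low:.3f}, {ci_high:.3f})")
```

Knowing when a conjugate shortcut like this suffices, and when the model is complicated enough to reach for MCMC, is exactly the kind of boundary knowledge discussed below.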
Regression

There’s no getting away from regression as a topic. Starting with linear regression, you should understand the different approaches to fitting the data, whether that is least squares, least absolute deviations, or something else; the different types of linear regression (simple, multiple, multivariate); and their assumptions. You should be able to use a standard programming library to create a linear regression model and understand the impact of changing the default parameters these libraries use.
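For instance, a least-squares fit with scikit-learn on synthetic data with a known answer – the comment on `fit_intercept` is exactly the kind of default parameter you should understand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Synthetic data with a known answer: y = 3x + 5 plus noise.
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 5 + rng.normal(0, 1, size=100)

# Ordinary least squares. fit_intercept=True is the default; turning it
# off forces the line through the origin, which would bias the slope here.
model = LinearRegression(fit_intercept=True).fit(X, y)

print(f"slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}, "
      f"R^2={model.score(X, y):.3f}")
```

The fitted slope and intercept should land close to the true 3 and 5, and being able to say *why* (noise level, sample size) matters more than the three lines of code.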
Other regression techniques that you should be able to use via the standard libraries include support vector machines, random forests and decision trees.
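A sketch of what “use via the standard libraries” means in practice: scikit-learn gives all three of those estimators the same fit/score interface, shown here on synthetic non-linear data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
# A non-linear relationship that a straight line would fit poorly.
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.1, size=300)

# The same fit/score interface works across all three estimator families.
models = {
    "decision tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "SVM (RBF kernel)": SVR(kernel="rbf"),
}
for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: training R^2 = {model.score(X, y):.3f}")
```

Note these are training-set scores only; knowing that you'd need held-out data to compare the models honestly is part of the minimum maths being argued for here.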
Clustering

Here you’re going to need an appreciation of the different clustering techniques and what each one’s definition of a cluster is, so that you can choose the correct technique for your problem. You should understand how centroid-, density-, distribution- and connectivity-based clustering will assign different labels to data points that are not close to the centre of a cluster. As with regression, you should know how to use the standard libraries to code a clustering example.
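The classic illustration, sketched with scikit-learn’s two-moons dataset: a centroid-based and a density-based algorithm disagree precisely because they define a cluster differently (the `eps` value here is a tuning choice for this particular data, not a general recipe):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaved half-circles: the "clusters" are connected shapes,
# not compact blobs.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Centroid-based: k-means assumes roughly spherical clusters and
# will slice each moon in half.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based: DBSCAN follows connected dense regions and can
# recover each moon whole.
db_labels = DBSCAN(eps=0.3).fit_predict(X)

print("k-means labels found:", sorted(set(km_labels)))
print("DBSCAN labels found:", sorted(set(db_labels)))  # -1 would mean noise
```

Plot the two labelings side by side and the "what is a cluster?" question stops being abstract.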
Return on Investment
Data science in business is not always about perfection; you will need to find thresholds for all your predictions. You will need to balance your predictions depending on whether false positives or false negatives carry more risk. At the same time, you will need to ensure that you don’t commit the sin of using intuition to force your model to give a pre-determined result. While calculating return on investment can be very simple algebra, you need to understand the business impact of your models, as this will affect the confidence the business has in you.
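As a sketch of that threshold-setting, with entirely hypothetical costs (the 50:5 ratio below is made up for illustration): sweep candidate thresholds and pick the one that minimises expected cost, rather than defaulting to 0.5:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical costs: a missed positive (false negative) costs 50,
# a false alarm (false positive) costs 5.
COST_FN, COST_FP = 50.0, 5.0

# Imbalanced synthetic classification data: ~10% positives.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

def expected_cost(threshold):
    preds = probs >= threshold
    false_neg = np.sum(~preds & (y_te == 1))
    false_pos = np.sum(preds & (y_te == 0))
    return COST_FN * false_neg + COST_FP * false_pos

# Sweep candidate thresholds instead of defaulting to 0.5.
thresholds = np.linspace(0.05, 0.95, 19)
best = min(thresholds, key=expected_cost)
print(f"cheapest threshold: {best:.2f}, cost there: {expected_cost(best):.0f}")
```

With false negatives ten times as costly as false positives, the cheapest threshold typically sits well below 0.5 – the algebra is trivial, but defending that number to the business is the real skill.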
You should be able to google all of the above, find examples, and quickly pick up the background and code samples you need; that won’t take you long. What you do need to commit to memory is when and where to use each technique – and when it will let you down.
When I was doing my PhD, the best piece of advice I was given was “learn the boundaries – once you know whether something is possible then you can find the answer quickly”. Nobody expects you to have every single algorithm at the front of your mind at all times. But if you know the basics and the boundaries, it will be easy to find what you need to solve the problems you are set.
What makes a data scientist stand out at interview is being able to discuss a problem, understand where its limitations are, and intuitively know the correct approach. You can’t get this from memorising proofs or coding regression from scratch. You get this from working with data, from making mistakes and investigating why the predictions from your models are wrong. You get this from knowing whether that outlier data point is a blip that can be ignored or something significant that needs attention. You get this from using the techniques over and over until they’re second nature.
Don’t forget the scientific principle too – without this you’re just a data analyst or engineer…
1. And I’m also one of the people he describes as posting bad advice – but I’d never write such unhelpful posts!
2. I’ve seen a lot of people on LinkedIn claim this when I know that they haven’t the faintest idea about any of these techniques :).
3. You may need to do this as self-study rather than just through experience, but it will be worth it.