One of my complaints about many data science masters courses is that they miss a lot of the basic skills that are essential to being effective in a business situation. It’s one of the things I was going to talk about at the Women in AI event that was postponed this week, and I’m more than happy to work with universities who want to help build a course. That said, some universities are realising this is missing and are adding it as optional courses.
One of the best sets of resources I’ve seen recently is from MIT. They’ve called it the “Missing Semester of Your CS Education” but it’s something that I believe should be fundamental to every course where the end result is a career developing applications in industry.
The MIT course is freely available online with accompanying YouTube videos linked against the description of each lecture. I’ve taken a look at some of them and highly recommend that you take a look – whatever your technical journey.
Why do I recommend this? Mainly because in industry you are expected to have these skills in order to work efficiently, and if you’re in academia you’ll benefit hugely from them too.
Command line, Shell and scripting
If you’re doing anything with Docker then you’re going to want to understand the command line and what you can do in different shells, particularly when it comes to scripting. When you are wrangling data, being able to script the renaming of files or selective copies can save you a lot of time.
I love vim as an editor. You may wonder why bother learning vim when you can use Visual Studio or other equally powerful development environments. While these are fantastic, if you are remotely connecting to a server or a Docker container you are unlikely to have a visual interface. If you do have one then it might be slow and frustrating. If you can quickly investigate and edit files with the simple tools then you’ll be more effective (like the issue I had to debug a couple of weeks ago). There are also environments where you are unable to install new applications on a remote server for security reasons, so knowing the basic editors will make your life easier.
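As a rough illustration of the kind of scripted renaming and selective copying I mean, here is a minimal sketch in Python rather than shell. The function names `add_prefix` and `copy_matching`, the `2020_` prefix and the file patterns are all made up for this example; the same job could be done in a few lines of bash.

```python
import shutil
import tempfile
from pathlib import Path


def add_prefix(directory, prefix, pattern="*.csv"):
    """Rename files in `directory` that match `pattern` by prepending `prefix`.

    Files that already start with `prefix` are skipped, so the script is
    safe to run twice. Returns the new paths in sorted order.
    """
    renamed = []
    # sorted() builds the full list first, so renaming does not disturb the loop
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file() and not path.name.startswith(prefix):
            target = path.with_name(prefix + path.name)
            path.rename(target)
            renamed.append(target)
    return renamed


def copy_matching(src, dst, pattern):
    """Copy only the files in `src` that match `pattern` into `dst`."""
    dst = Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(src).glob(pattern)):
        if path.is_file():
            target = dst / path.name
            shutil.copy2(path, target)  # copy2 also keeps timestamps
            copied.append(target)
    return copied


if __name__ == "__main__":
    # Demo on a throwaway directory so nothing real is touched
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw"
        raw.mkdir()
        for name in ("a.csv", "b.csv", "notes.txt"):
            (raw / name).write_text("example")
        print(add_prefix(raw, "2020_"))
        print(copy_matching(raw, Path(tmp) / "subset", "2020_a*"))
```

Trying a script like this on a temporary directory first, as the demo does, is a cheap way to check the pattern matches what you expect before pointing it at real data.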
Version control (git)
In an industrial setting you will not be working in isolation. You will be working on code with other people and, at times, more than one person will be working on the exact same piece of code. There are many tools to help with this and git is one of the best. Not only will you be able to merge different branches of code when multiple people are working on them, but you will also be able to see different versions of the code. It was working yesterday but you’ve made too many changes? You can see the changes. If you write sensible commit messages with each change then you’ll be able to track down what might have broken your code. If you’re the sort of person that has multiple versions of code (or your thesis!) in different directories then please start using source control. Future you will be grateful.
Debugging is one of the most common skills to lack. If it was working before and now is not, ask yourself what has changed. If it has never worked, understand how to break it apart and find the (first) point that is the problem. Good diagnostics is a rare skill and comes with practice, but if you can quickly get to the nub of the issue with your code then you’ll save yourself a lot of time going forward (and get a lot of kudos from your colleagues).
Understanding security is essential for any business application. You may think that your code isn’t a risk, but you should always code with good principles in mind because you never know when your code might end up in a position where it can compromise the rest of the system or expose data. Many of the problems we’ve seen with data exposure have been due to developers not taking security seriously.
What’s missing from the missing semester?
There’s still one thing that’s missing from this course: rigorous testing. If you have not covered how to set up a test harness, mock data and the general principles of test-driven development then please take a look at this. I’m currently writing a book on this but in the meantime there are lots of great resources online. Tutsplus has a great resource to get started with nose tests in Python, so please do take a look.
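To give a flavour of what a test harness with mock data looks like, here is a minimal sketch. It uses Python’s built-in `unittest` and `unittest.mock` rather than nose, purely so that it runs without installing anything. The `mean_score` function and the `fetch_scores` client method are hypothetical names invented for this example.

```python
import unittest
from unittest.mock import Mock


def mean_score(client, user_id):
    """Average the scores returned by a (hypothetical) data client."""
    scores = client.fetch_scores(user_id)
    if not scores:
        raise ValueError("no scores for user")
    return sum(scores) / len(scores)


class MeanScoreTest(unittest.TestCase):
    def test_averages_mock_data(self):
        # The mock stands in for a real database or API client
        client = Mock()
        client.fetch_scores.return_value = [2.0, 4.0, 6.0]
        self.assertEqual(mean_score(client, "u1"), 4.0)
        client.fetch_scores.assert_called_once_with("u1")

    def test_empty_data_raises(self):
        # Edge cases deserve their own test, not just the happy path
        client = Mock()
        client.fetch_scores.return_value = []
        with self.assertRaises(ValueError):
            mean_score(client, "u1")


if __name__ == "__main__":
    unittest.main()
```

In a test-driven workflow you would write the two tests first, watch them fail, and only then write `mean_score` to make them pass.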
If you want a list of data skills that you should know then I have another post here. Are there any other ancillary skills that you think data scientists need that I haven’t mentioned?