Source Code Control for Data Scientists

XKCD explains git source code control. 🙂

I work with many people who are recently out of academia. While they know how to code and are experts in their fields, they lack some of the rigour of computer science that experienced developers have. In addition to not understanding the problems of data in the wider world or how to test their solutions properly, they are often unaware of the importance of source code control and deployment. This is another aspect missing from academic courses – you cannot exist as a professional developer without it. While there are many source control setups, I’m most familiar with git.

I recently wrote a how-to guide for my team and was going to make that the focus of this post. However, I’ve seen some very good, more generic guides out there, so instead I’d like to explain why source code control is important and then give you the tools to learn this yourself.

Firstly, why source code control? If you’re working on your own, even as part of a larger team, the temptation is just to code locally. This is fine up until you find you’ve broken your code and don’t know why… It was working two weeks ago, wasn’t it? Without previous versions of your code (unless you have an infallible memory) you’ll be stuck.

“Aha”, you say, “but I regularly back up my work”. Great, you should always take back-ups and put them on a different machine. So you have date-stamped folders and incrementing file numbers, and they may even be on a server so others can see them. How would I know where you got up to, what you were working on, or whether it’s safe to take a copy? Where should I put my changes so you can benefit from them too? As soon as someone else is added to the project things start getting tricky.

Git¹ can make your life easy here and it’s not something that should be scary. GitHub itself even has a simple guide to get you started.

In my opinion there are a few rules that you should observe, even if you know source code control inside out.

Never code directly into master

Once a repository has been set up, all development should be done in branches. If you code in the master branch you make life difficult for everyone else. You could potentially break master just as a critical change needs to go out. What if you’re on vacation and someone else needs to make a change? The master branch should always be 100% tested, functional code that could be deployed. Don’t even open master in your local dev environment. It’s not worth the pain of accidental changes.
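As a minimal sketch of what working in a branch looks like on the command line: the scratch repository and the branch name feature/clean-data below are hypothetical, chosen only so the commands can be run safely anywhere. In a real project you would skip the setup lines and run the checkout inside your own clone.

```shell
# Scratch repository (illustration only) with an initial branch called master.
cd "$(mktemp -d)" && git init -q && git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"
git commit -q --allow-empty -m "Initial commit"

# Create a branch for the task and switch to it in one step.
# feature/clean-data is a made-up name; master itself is left untouched.
git checkout -q -b feature/clean-data

# List local branches; the asterisk marks the one you are on.
git branch
```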

Always check you are in the correct branch before making any changes

It sounds simple, but if you spend a day working in the wrong branch and have to move your changes across by hand, you’ll wish you’d taken the second to do a branch check first.
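The branch check really is a one-liner. The sketch below wraps it in a small guard that complains if you are on master; the scratch repository is an assumption made only so the example runs on its own, and the wording of the messages is invented.

```shell
# Scratch repository (illustration only), left sitting on master.
cd "$(mktemp -d)" && git init -q && git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"
git commit -q --allow-empty -m "Initial commit"

# The check itself: ask git for the name of the current branch.
branch="$(git rev-parse --abbrev-ref HEAD)"

if [ "$branch" = "master" ]; then
    echo "You are on master: switch to your own branch before editing."
else
    echo "On branch $branch."
fi
```

Running git status shows the same information along with any uncommitted changes.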

Commit at each stable point

A stable point is when you have finished a discrete task and your code is tested and working. You may also check in and commit unfinished or non-working code at the end of each day. This allows someone else to pick up your work if you are ill (for example). Even on big tasks you should be committing every day, having broken the task down into smaller sections. Push your changes to the origin server (which may or may not be GitHub.com) on your own branch.
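A rough sketch of that commit-and-push cycle follows. Everything in the setup is hypothetical: a bare repository in a temporary folder stands in for the origin server, and the file clean.py, the branch name and the commit message are made up for illustration.

```shell
# Scratch setup: a bare repository plays the part of the origin server.
work="$(mktemp -d)"
git init -q --bare "$work/origin.git"
git init -q "$work/project" && cd "$work/project"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"
git remote add origin "$work/origin.git"
git commit -q --allow-empty -m "Initial commit"
git push -q origin master

# Work happens on your own branch.
git checkout -q -b feature/clean-data
echo "rows = [r for r in rows if r]" > clean.py

# Stable point reached: stage the file and commit it with a useful message.
git add clean.py
git commit -q -m "Drop empty rows before analysis"

# Push the branch to origin; -u remembers the link for later pushes and pulls.
git push -q -u origin feature/clean-data
```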

Rebase your local branch before trying to merge into master

This may seem like overkill, but it will prevent master getting corrupted, which is pretty important if you like your evenings and weekends and don’t want to feel the wrath of the engineering department 😉². From your own branch, rebase onto master. Git temporarily sets aside your commits, brings in any new commits from master and then attempts to replay your commits on top. If someone else has changed the same lines you may get conflicts, in which case the rebase stops at the commit that conflicts and git tells you which files have problems.

In each conflicted file, work out which version should apply, then delete (or comment out) the extra code and the conflict marker lines that git wrapped around it. When all conflicts are resolved, you will need to check in (stage) the resolved files before continuing the rebase. Once the rebase finishes, your branch sits on top of the latest master. Test your code thoroughly – there may be functional conflicts that git did not spot. Check in and commit any changes in your branch before attempting to merge into master – this will give you a stable point to roll back to.
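To make the rebase steps concrete, here is a small sketch that deliberately creates a conflict in a scratch repository and then resolves it. The file config.py, its contents and the branch name feature/tune are all invented. The resolved file is staged with git add and git rebase --continue records the commit, which is one common way of doing the check-in step. With a real remote you would normally fetch first and rebase onto origin/master instead.

```shell
# Scratch repository (illustration only) with one committed file on master.
cd "$(mktemp -d)" && git init -q && git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"
echo "threshold = 1" > config.py
git add config.py && git commit -q -m "Add config"

# Your branch and master both change the same line, which forces a conflict.
git checkout -q -b feature/tune
echo "threshold = 2" > config.py && git commit -q -am "Raise threshold"
git checkout -q master
echo "threshold = 3" > config.py && git commit -q -am "Change threshold on master"

# From your own branch, rebase onto master. Git stops at the conflicting
# commit, and the status output marks the conflicted file with UU.
git checkout -q feature/tune
git rebase master || git status --short

# Resolve: keep the version that should apply and remove the marker lines
# (<<<<<<<, =======, >>>>>>>). Overwriting the file stands in for that edit.
echo "threshold = 2" > config.py
git add config.py

# Continue the rebase; GIT_EDITOR=true accepts the existing commit message.
GIT_EDITOR=true git rebase --continue

# Your commit now sits on top of master's latest commit.
git log --oneline
```

If a rebase goes badly, git rebase --abort returns the branch to where it was before you started.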

Never force anything in git

This will prevent a world of pain for everyone in your repository. If you force an update when others are following proper process then none of their merges will be clean and you will almost certainly break everything. Just because a command is there doesn’t mean you should use it. If you don’t know how to use it safely then you shouldn’t touch it³.
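The usual moment people reach for a forced push is when git rejects an ordinary one. The sketch below reproduces that situation with two hypothetical colleagues, Alice and Bob, sharing a made-up branch called feature/shared against a scratch origin, and shows the non-forced way out: bring in the other person’s commits, then push again.

```shell
# Scratch setup: a bare origin and Alice's working copy.
work="$(mktemp -d)"
git init -q --bare "$work/origin.git"
git init -q "$work/alice" && cd "$work/alice"
git symbolic-ref HEAD refs/heads/feature/shared
git config user.name "Alice" && git config user.email "alice@example.com"
git remote add origin "$work/origin.git"
echo "a" > notes.txt && git add notes.txt && git commit -q -m "Start notes"
git push -q -u origin feature/shared

# Bob clones the same branch, commits and pushes first.
git clone -q -b feature/shared "$work/origin.git" "$work/bob"
cd "$work/bob"
git config user.name "Bob" && git config user.email "bob@example.com"
echo "b" > bob.txt && git add bob.txt && git commit -q -m "Bob's change"
git push -q origin feature/shared

# Alice commits without knowing about Bob's work; her ordinary push is rejected.
cd "$work/alice"
echo "c" > alice.txt && git add alice.txt && git commit -q -m "Alice's change"
git push origin feature/shared || echo "Rejected: origin has commits you do not have."

# Forcing here would throw away Bob's commit. Instead, replay your work on
# top of his and push normally.
git pull -q --rebase origin feature/shared
git push -q origin feature/shared
```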

Useful Guides

Rather than rewrite other great posts here are some accessible beginner guides to source code control (and particularly Git) that you ought to read if you’re getting started.

Firstly, @jmourtada’s fantastic post on dev.to about Git, which made me abandon my half-written draft. It is really nicely done.

Also @dr_lepper’s series on git is your friend not your foe. The one on rebasing and merging is particularly interesting and the first in the series can be found here.

BetterExplained has a great guide explaining general source code control from first principles, which overlaps a little with the start of this post, but is also well worth a read.

¹ Other source code control systems are available.
² There is another school of thought that suggests merging into master rather than rebasing your local branch. I prefer the rebase then merge approach as it gives an extra level of protection and rollback. Rebasing and then merging does affect the history so it depends how you (or your company) want to see it.
³ Also doubles as a life rule!

Published by

janet
