Technical Working Setup & Best Practices
Note: This guide (and most of this repository) is written primarily for Python development but some parts are generally applicable to other languages.
When you are developing a codebase (perhaps for the first time), especially a RAP codebase, that uses functions stored in different directories, it can be tricky to know how to easily edit code, experiment with different sections of the pipeline, test other people's changes and debug errors.
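This guide includes sections on:
- What you will need to get started
- Code should generally not run on your local machine
- Typical Workflow
- Tips for how to write code
- Debugging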
If you run into issues with anything in this guide, one of the best ways to find answers is to search for your problem on Stack Overflow.
What is this guide for?
Setting up a technical working environment is a skill that takes time and energy to learn, and so it may be tempting to stick to "out-of-the-box" solutions such as the interactive platform on your Python training course, Google Colab, or Jupyter. These are great tools for prototyping, learning and experimenting, but there are many pitfalls to relying solely on "out-of-the-box" solutions.
At the end of this guide, the reader should feel more comfortable setting up and using their own technical development environment.
Your employer/educator may have restricted your ability to change any part of your technical working environment. If this is the case, then hopefully this guide will at least help you understand more about your current setup.
What you will need to get started
You will need the following, either on your local machine, or on a virtual machine that you will need to connect to. How to set up and connect to a virtual machine is not in the scope of this guide.
- A working Python installation. We recommend using the Anaconda Python distribution if you are going to use a conda environment.
- Git
This may need to be configured the first time you use it
Code should generally not run on your local machine
Your local machine refers to the processing power of your laptop itself. Logging into a platform that gives you access to remote computation, such as AWS or similar, does not count as using a local machine; that is what we call a 'virtual machine'.
Using a virtual machine ensures:
- Confidentiality: many projects at the NHS involve sensitive data that needs to be protected, which means only working with it in a secure computing environment.
- Computing Power and Scalability: We can use computing resources with more power than a single laptop, and scale them up and down as needed.
- Collaboration: Working with others in a shared environment can help reproducibility, especially at the start of a project.
However, this does not mean that your local machine cannot be used for code development, as long as nothing sensitive is contained in the code.
Typical Workflow
There is no 'right' or 'wrong' choice when it comes to workflow, and it is important that you feel comfortable with the tools you use to develop your code. It is also important that the workflow supports the needs of the project.
However, there are a lot of different setups and when you are new to Python/code, it can be reassuring to follow a suggested workflow, which we have included below.
TL;DR: aim to start analytical pieces of work in notebooks. As the codebase / pipeline grows, refactor the code into functions and classes contained within modules, alongside a full test suite. This will help ensure that the outputs are reproducible and the code is maintainable, and it will be necessary when it comes to automating pipelines in production.
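As a rough illustration of that progression, the sketch below shows a calculation that might start life in a notebook cell, moved into a small function with a pytest-style test next to it. The function name, its parameters and the figures are hypothetical and only stand in for whatever your own analysis does.

```python
"""Sketch: notebook-style logic refactored into a testable function.

All names and figures here are hypothetical, for illustration only.
"""


def calculate_rate(numerator: float, denominator: float, per: int = 100) -> float:
    """Return numerator / denominator scaled by `per`, rounded to 1 decimal place."""
    if denominator == 0:
        raise ValueError("denominator must not be zero")
    return round(numerator / denominator * per, 1)


def test_calculate_rate() -> None:
    """A pytest-style test: pytest would collect this from a test file."""
    assert calculate_rate(1, 4) == 25.0
    assert calculate_rate(3, 8, per=1000) == 375.0


# Calling the test directly also works when pytest is not installed.
test_calculate_rate()
```

In a real project the function would usually live in a module and the test in a separate test file, so that the whole suite can be run in one go.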
Tip
If you do use interactive notebooks (i.e. Jupyter notebooks, .ipynb files), make sure *.ipynb is added to your .gitignore file so that no interactive Python notebooks are tracked. Read more about why in this guide.
Starting a project
Here is a workflow to get you up and running with any new data science project. (Example code either shown or available via the links provided)
- Log in to virtual network. (optional, but recommended)
- Create a new conda environment for your project and install any packages you know you will need
- Activate your conda environment
- Create a new directory (folder) for your project in your preferred location
- Navigate inside the directory in the terminal, and initialise the Git repository:
    cd your-project-name
    git init
- Open the directory in VS Code, or an editor of your choice:
    code .
- Create a README.md and give your project a title
- Create a .gitignore. There is a useful VS Code extension for setting up your .gitignore file.
- Prototype code using interactive cells and/or notebooks
- Write code in .py files, separating out functions into modules (and grouping modules into directories) where appropriate
- Test functions in interactive console
- Use linters to make any formatting corrections
- Export your package requirements & dependencies to an environment.yml file:
    conda env export --from-history --no-builds | grep -v "prefix" > environment.yml
- Git add and commit the files you have created
- Create a new repo on GitHub or GitLab, then follow the instructions to push your code from your project directory to that repo. This will probably look something like:
    git remote add origin https://www.github.com/yourname/your-repo-name.git
    git push origin main
And now you're ready to go.
Working on an existing project
When you are working on an existing project, many of the above steps are no longer needed, and you may need to add in a few extras. (Example code either shown or available via the links provided)
- Log in to virtual network. (optional, but recommended)
- Navigate inside the directory in the terminal:
    cd your-project-name
- Check the current state of the project and make sure you're on the right branch for what you're doing using git status
- Make sure your local version of the project is up to date with GitHub using git pull (optional, as you're likely the only one working on your branch)
- Open the directory in VS Code, or an editor of your choice
- Prototype code using interactive cells and/or notebooks
- Write code in .py files, separating out functions into modules (and grouping modules into directories) where appropriate
- Test functions in interactive console
- Use linters to make any formatting corrections
- If you've added any new packages, make sure to update your environment.yml file:
    conda env export --from-history --no-builds | grep -v "prefix" > environment.yml
- Git add and commit the files you have changed
- Git push the committed changes
- When you are ready to create a pull/merge request, you will need to check that your working branch is up to date with the main branch and deal with any merge conflicts (optional, only create a pull request once you are happy with your changes and want them to be reflected in the main branch):
    git pull origin main
- Create a pull/merge request (optional, see above)
Tips for how to write code
This section is intended as broad advice for people who may not be very confident in Python and are nervous about 'breaking' an existing codebase.
- You can't really break anything if you've version controlled it properly. You can always revert back to a previous point in the codebase history.
- Your code doesn't have to be clean while you're developing it. When they are just starting to get to grips with RAP, some people can be nervous about writing code that isn't all contained in functions, or otherwise "nice and clean". It is totally ok for your code to start out messy. Ideally you would make commits frequently when your code is a bit nicer, but to test and try out, messy code is completely fine.
- Get something working, and then refine it. The most important thing is that you get the output that you want. Then you can go back and refine your code. Look for places where you've repeated code, which usually indicates that you can make use of existing functions or create new ones.
- Don't attempt everything all at once. If you have to write a piece of code that produces a calculated output in a particular format for multiple years and regions, start simple! Start with the output, and maybe start with just one year and one region. Then plug in the formatting, and then expand to other years and regions.
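The last two tips can be sketched roughly as follows: get the number right for one year and one region, add the formatting, then expand to everything. The records, field names and function names are all hypothetical, and a real pipeline would more likely read its data from a file or database.

```python
"""Sketch: get one year and one region working first, then expand.

The records, field names and function names are hypothetical.
"""

# A tiny made-up sample of the data, small enough to check by hand.
RECORDS = [
    {"year": 2021, "region": "North", "count": 10},
    {"year": 2021, "region": "North", "count": 5},
    {"year": 2021, "region": "South", "count": 7},
    {"year": 2022, "region": "North", "count": 12},
    {"year": 2022, "region": "South", "count": 8},
]


def total_for(records: list, year: int, region: str) -> int:
    """Step 1: the calculation for a single year and a single region."""
    return sum(
        row["count"]
        for row in records
        if row["year"] == year and row["region"] == region
    )


def format_row(year: int, region: str, total: int) -> str:
    """Step 2: plug in the output format once the number is right."""
    return f"{year},{region},{total}"


def build_output(records: list) -> list:
    """Step 3: expand to every year and region found in the data."""
    years = sorted({row["year"] for row in records})
    regions = sorted({row["region"] for row in records})
    return [
        format_row(year, region, total_for(records, year, region))
        for year in years
        for region in regions
    ]


# Check the simplest case first, then the expanded version.
print(total_for(RECORDS, 2021, "North"))  # 15
for line in build_output(RECORDS):
    print(line)
```

Because each step is its own small function, any repeated logic is easy to spot and each piece can be checked on its own before moving on.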
Debugging
It is much easier to debug code if you run through it step by step on a sample of the data. This is especially true when you are trying to run code made up of different functions coming from different modules.
- Interactive tools such as interactive cells and notebooks can provide an easy way to do this. Interactive cells have the benefit of being insertable within the code being developed itself, whereas at other times it's helpful to create an entirely new Jupyter notebook to put bits of code in and interrogate each output.
- Another common strategy is to create a blank Python file, such as temp.py, and only include the bits of code that you want to run, perhaps making use of interactive cells without changing the body of code that is version controlled.
- You can also insert breakpoints, using breakpoint() in your code, to only run code up to a specified point, which helps you to interrogate your variables to figure out why something isn't working. To proceed to the next breakpoint, type continue into the terminal.
- An alternative to manually inserting breakpoints is to use a built-in debugger, something that most IDEs offer, including VS Code. However, these debuggers can have a steep learning curve, and using them can sometimes hinder progress initially.
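As a minimal sketch of the breakpoint strategy, the example below pauses inside a function so that its variables can be inspected. The function, its data and the `debug` flag are hypothetical; the flag is only there so that the example runs without pausing unless you ask it to.

```python
"""Sketch: pausing inside a function with breakpoint().

The function, its arguments and the `debug` flag are hypothetical.
"""


def mean_length_of_stay(stays: list, debug: bool = False) -> float:
    """Return the mean of a list of stay lengths (in days)."""
    total = sum(stays)
    if debug:
        # Execution pauses here and drops into the pdb prompt, where you can
        # inspect `stays` and `total`. Type `continue` (or `c`) to carry on.
        breakpoint()
    return total / len(stays)


# Normal run: no pause.
print(mean_length_of_stay([2, 4, 6]))  # 4.0

# While investigating a problem, call it with debug=True instead:
# mean_length_of_stay([2, 4, 6], debug=True)
```

It is usually worth removing any breakpoint() calls again before committing, since they would pause the code for anyone else who runs it.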
If you do want to use any of these strategies, it's important to make sure that:
- temp.py is added to your .gitignore file
- all Jupyter notebooks are automatically ignored. You can do this by adding *.ipynb to your .gitignore. This means that all files with the .ipynb extension will not be tracked.
- any interactive cells you create are not committed into your codebase. This is not as important as the previous two points, and is more best practice than something to avoid at all costs, but it is advisable, especially when working on collaborative projects, to not commit anything interactive, as you cannot be sure that the cells will operate in the same way on somebody else's machine due to the setup required.
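For illustration, a scratch file such as temp.py might look something like the sketch below. The `# %%` comments are the cell markers that VS Code's Python tooling recognises for interactive cells; other editors may use a different convention. The function and sample data are hypothetical, and in a real project you would normally import the function from your own module rather than define it in the scratch file.

```python
"""temp.py: a scratch file for trying code out (kept out of version control).

The function and sample data are hypothetical stand-ins.
"""

# %% Set up a small sample to experiment with
sample = [" North ", "south", "NORTH", " South"]


# %% The function under investigation (hypothetical stand-in)
def clean_region(name: str) -> str:
    """Strip surrounding whitespace and standardise capitalisation."""
    return name.strip().title()


# %% Run it on the sample and look at the intermediate output
cleaned = [clean_region(name) for name in sample]
print(cleaned)

# %% Interrogate the result further
unique_regions = sorted(set(cleaned))
print(unique_regions)
```

The file also runs from top to bottom as an ordinary script, because the cell markers are just comments.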
Acknowledgements
Inspiration has been drawn from:
- Step-by-Step Guide to Setting Up a Professional Data Science Environment on Windows
- My Computer Setup for Data Science
- The Definitive Data Scientist Environment Setup
- Setup a Data Science Environment on your Computer
- VS Code documentation
External Links Disclaimer
NHS England makes every effort to ensure that external links are accurate, up to date and relevant, however we cannot take responsibility for pages maintained by external providers.
NHS England is not affiliated with any of the websites or companies in the links to external websites.
If you come across any external links that do not work, we would be grateful if you could report them by raising an issue on our RAP Community of Practice GitHub.