
Making Code Discoverable using GitHub Topics

TLDR

  • Apply topics to each of your published repos following the ontology described below
  • Focus initially on topics related to technique and domain - these are what people are usually most interested in
  • Then add even more value by applying topics from the remaining categories
  • There is a website which scans GitHub for NHS repositories and displays them by topic, making it easier to find useful code
Why should we care?
  • Applying topics to your repos will make it much easier for you and others to find and reuse useful bits of code
  • Using a common ontology will make the topics more useful - we will all be speaking the same language
Pre-requisites
Pre-requisite Importance Note
None! Anyone can do this - though you need to have published some code on GitHub already

A key aim of RAP is not only to automate our pipelines, but also to re-use useful code in other work. This relies on us publishing the code as publicly as possible, and then making it easy to find these useful bits of code. Topics in GitHub can help with this. However, we will get the most benefit from topics by using a common topic vocabulary to describe our GitHub code repos.

The topic ontology described in this guide will ensure our code can be searched by:

  • language and tech used
  • what methods were used
  • whether the code is recent or old (and if it is still updated)
  • what kinds of data the code was used with and where it came from

The Differences between 'topics' and 'tags'

In GitHub, tags and topics are different:

  • Topics are labels applied to whole repos which describe them, like keywords. Each repo can have up to twenty topics, and GitHub is good at searching and sorting results by topic.
  • Tags are labels applied to specific commits within a Git repo, and they are how releases are marked. For example, v0.1.0 might be a tag applied to a specific commit, locking in that this commit is Version 0.1.0.

Topics

Our aim with topics is to allow people to find code which might be useful to them, so they can reuse it. With this in mind, they usually want to know what kind of data the code was used on, which language it is written in, whether it uses compatible data structures (e.g. pandas or pyspark) and how recently it was made or updated (people are less trusting of ancient, dead code).

When applying topics to your code:

  • we suggest starting with the priority 1 categories below, i.e. Domain Area and Technique, as these are what people tend to be most concerned with
  • stick to the topics suggested below - this will ensure we get the most benefit out of them. If there are too many different topics, they become meaningless. If there are important ones missing, raise an issue against this GitHub repo with your suggestion for new topics
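
Topics can be added through the repo's "About" settings on GitHub, but they can also be set programmatically. The sketch below is one possible approach, not an official tool: it checks a list of topics against GitHub's naming rules as we understand them (lowercase letters, numbers and hyphens, at most 50 characters, at most twenty per repo) and builds a request for GitHub's REST endpoint that replaces all topics on a repository. The owner, repo name and token are hypothetical placeholders, and the request is built but deliberately not sent.

```python
import json
import re
import urllib.request

# Naming rules as we understand them from GitHub's documentation (an assumption
# worth re-checking): start with a lowercase letter or number, then lowercase
# letters, numbers or hyphens, 50 characters at most.
TOPIC_PATTERN = re.compile(r"[a-z0-9][a-z0-9-]{0,49}")
MAX_TOPICS_PER_REPO = 20


def build_topics_request(owner, repo, topics, token):
    """Build (but do not send) a request that REPLACES all topics on a repo."""
    invalid = [t for t in topics if not TOPIC_PATTERN.fullmatch(t)]
    if invalid:
        raise ValueError(f"Invalid topic names: {invalid}")
    if len(topics) > MAX_TOPICS_PER_REPO:
        raise ValueError(f"Too many topics: {len(topics)} > {MAX_TOPICS_PER_REPO}")
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/topics",
        data=json.dumps({"names": list(topics)}).encode("utf-8"),
        method="PUT",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


# Hypothetical owner, repo and token, for illustration only.
request = build_topics_request(
    "example-org",
    "example-repo",
    ["hospital-episode-statistics", "record-linkage", "pyspark", "python"],
    token="YOUR_TOKEN",
)
# urllib.request.urlopen(request) would send it; it is left out here so that
# running the sketch cannot change a real repository by accident.
```

Because this endpoint replaces the full topic list, any existing topics you want to keep need to be included in the list you send.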
| Priority | Category | Description | Example topics |
| --- | --- | --- | --- |
| 1 | Domain Area / Datasets / Data source | People will want to know what data these techniques have been applied to, if any. This might inspire them to do something similar, or highlight areas for collaboration. | secondary-care, primary-care, hospital-episode-statistics, gpdpr, civil-registration-of-deaths, gdppr, artificial (perhaps if it was using artificial data) |
| 1 | Technique | People will want to know what kinds of data processing, analyses, etc. were done - this might be quite broad as it should cover the sorts of reusable code chunks people might want to look at. | clustering, forecasting, classification, regression, statistical-disclosure-control, deduplication, entity-resolution, record-linkage, summarisation, data-cleansing, data-validation, hyperparameter-tuning, artificial-data-generation, etc. |
| 2 | Technology | If I want to re-use someone's Python or R code, and they made it using a different data structure to mine, that might cause problems, hence it's important to describe the technologies used. | dplyr, numpy, notebook, pandas, polars, pyspark, pytorch, scipy, sklearn, sparklyr, sqlalchemy, sqlalchemy-orm, tensorflow, etc. |
| 2 | Language | People often want to know if the code is using a language they know or use. Though GitHub can sometimes correctly identify the language used in the repo, it can struggle if you have a lot of documentation or use certain languages (such as SQL). | python, r, sql |
| 2 | Maturity | People might want to know if a codebase is made to a high standard, or by people who are just starting out. | baseline-rap, silver-rap, gold-rap |
| 2 | Opt-out of re-use | A topic for those people who want to publish their code, but make it clear that it is not optimised for re-use. | not-optimised-for-reuse |
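
One way to stay within the suggested vocabulary is to check a repo's topics against it before applying them. The following is a minimal sketch, assuming the vocabulary is copied by hand from the table above into a dictionary; the category keys, the `review_topics` function and the example topic `my-project` are all hypothetical names chosen for illustration. It reports topics that are not in the vocabulary and any priority 1 category that is not yet covered.

```python
# Vocabulary copied from the table above; the "etc." entries mean this
# is not exhaustive, so treat it as a starting point.
ONTOLOGY = {
    "domain": {"priority": 1, "topics": {
        "secondary-care", "primary-care", "hospital-episode-statistics",
        "gpdpr", "civil-registration-of-deaths", "gdppr", "artificial"}},
    "technique": {"priority": 1, "topics": {
        "clustering", "forecasting", "classification", "regression",
        "statistical-disclosure-control", "deduplication", "entity-resolution",
        "record-linkage", "summarisation", "data-cleansing", "data-validation",
        "hyperparameter-tuning", "artificial-data-generation"}},
    "technology": {"priority": 2, "topics": {
        "dplyr", "numpy", "notebook", "pandas", "polars", "pyspark", "pytorch",
        "scipy", "sklearn", "sparklyr", "sqlalchemy", "sqlalchemy-orm",
        "tensorflow"}},
    "language": {"priority": 2, "topics": {"python", "r", "sql"}},
    "maturity": {"priority": 2, "topics": {
        "baseline-rap", "silver-rap", "gold-rap"}},
    "opt-out": {"priority": 2, "topics": {"not-optimised-for-reuse"}},
}


def review_topics(topics):
    """Report topics outside the vocabulary and uncovered priority 1 categories."""
    # Reverse lookup: topic name -> category it belongs to.
    category_of = {
        topic: category
        for category, entry in ONTOLOGY.items()
        for topic in entry["topics"]
    }
    unknown = sorted(t for t in topics if t not in category_of)
    covered = {category_of[t] for t in topics if t in category_of}
    missing_priority_1 = sorted(
        category
        for category, entry in ONTOLOGY.items()
        if entry["priority"] == 1 and category not in covered
    )
    return {"unknown": unknown, "missing_priority_1": missing_priority_1}


report = review_topics(["python", "pandas", "forecasting", "my-project"])
print(report)
# {'unknown': ['my-project'], 'missing_priority_1': ['domain']}
```

An unknown topic is not necessarily wrong: it may be a good candidate for a suggestion raised as an issue, as described above.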

Using topics to find useful repos (and code)

You can search for repos by topic within GitHub using the search bar (e.g., as seen here, with tips on GitHub search syntax here), or you can use this helpful website, which gathers the repos and topics from the various NHS organisations on GitHub.
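
As an illustration of the search syntax, the sketch below builds a query from GitHub's `topic:` and `org:` search qualifiers and the matching URL for GitHub's repository search API. The function name and the organisation `example-org` are hypothetical; the same query string can be pasted straight into the GitHub search bar.

```python
from urllib.parse import urlencode


def topic_search(topics, org=None):
    """Return a GitHub search query and the matching search API URL."""
    # Each topic: qualifier narrows the results, so repos must carry all of them.
    terms = [f"topic:{topic}" for topic in topics]
    if org:
        # Optionally restrict the search to one organisation's repos.
        terms.append(f"org:{org}")
    query = " ".join(terms)
    url = "https://api.github.com/search/repositories?" + urlencode({"q": query})
    return query, url


# "example-org" is a placeholder, not a real organisation.
query, url = topic_search(["pandas", "record-linkage"], org="example-org")
print(query)  # topic:pandas topic:record-linkage org:example-org
print(url)
```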


Last update: March 1, 2024
External Links Disclaimer

NHS England makes every effort to ensure that external links are accurate, up to date and relevant, however we cannot take responsibility for pages maintained by external providers.

NHS England is not affiliated with any of the websites or companies in the links to external websites.

If you come across any external links that do not work, we would be grateful if you could report them by raising an issue on our RAP Community of Practice GitHub.