Skip to content

Tidy Data

Adopting inconsistent data formats leads to a huge amount of wasted effort and can actually lead to very complex code. By adopting tidy data format for your work you can both improve your service to users and simplify your own production pipeline.

Reference for this guide

Hadley Wickham's original paper on tidy format (pdf)

Tidy data: Why should I care?

"Tidy datasets are all alike but every messy dataset is messy in its own way" - Hadley Wickham

Teams spend a huge amount of effort reshaping and organising data to meet different needs. Tidy data is an attempt to align on a consistent shape for data that can meet most analytical use cases.

By doing as much intermediate data processing as possible in tidy format, you can often achieve huge improvements in code complexity and difficulty of maintaining code.

Note that adopting tidy data for stats production does not mean that your outputs need to be formatted this way. In fact, it is straightforward to get from a consistent tidy format to any unique format you require.

A common bad practice in stats production is to combine data production and data formatting in the same steps. Teams will calculate statistics and at the same time organise that data to match an excel table. This is a natural thing to do as you write code with your output table in mind but this approach leads to multiple problems:

  • Code often gets repeated for different tables - increasing the burden of updating the code each year and increasing the likelihood of errors.
  • Code is messy and difficult to maintain. Code should do one thing only - either produce the data or format the data.
  • Once the data has been formatted it becomes less useful for other analysts since it is not machine readable. Before another team can use the data, they need to reformat it.

Tidy data offers a consistent format for data analysis purposes that is easy for analysts to produce, easy for humans to interpret, and easy for statistical tools like Python and R to read. Tidy data offers more flexibility for readability and simplicity for data analysis purposes, as opposed to 'messy' or non-tidy data that can be have its own structure and rules. Tidy data involves "structuring datasets to facilitate analysis."

Having a consistent way of storing data, like in the tidy format, allows for analysts to easily use the same tools of analysis in a universal way, for example R or Python. Then, time is spent on producing all kinds of published outputs from the tidy structure, other pieces of analysis and data visualisation outputs such as dashboards.

An example of how tidy data enables analysts is that keeping this consistent form of storing data allows analysts to apply longitudinal methods to large datasets that hold years worth of data, critical for healthcare analytics and research.

What is tidy data

It can be confusing in a large dataset which data is an observation (e.g. all measurements for a specific patient across all attributes) and which data is a variable (i.e. the values for a specific type of attribute so the treatment type or the results of a treatment).

There are several conventions on how a tidy dataset is structured but the most common one follows Codd's 3rd normal form (Codd, 1990):

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

Example of tidying up your data

A 'messy' or non-tidy dataset displaying results to treatments applied to patients:

Patient_ID Treatment A Treatment B
1 N/A 3
2 11 5
3 9 12

The same data but in a tidy format:

Patient_ID Treatment Measured_Variable
1 a N/A
2 a 11
3 a 9
1 b 3
2 b 5
3 b 12

Notice how on this table:

  • Each measured variable (i.e. Treatment, Measured Variable) has its own column.
  • Each observation/measurement forms a row.
  • The treatment variable observations form a table of their own, ordering the dataset in a way that connects to the next subtable (i.e. Treatment b).

How to tidy your data

It's easy to imagine tidy data with a small scale dataset however with real world data things can be trickier. Tidy data follow a similar structure but messy data are unique in their own 'messiness'.

In the subsection above, an example of a messy dataset was transformed to a tidy format. Messy data will rarely be that simple and obviously structured in several different ways, depending on what the data is measuring. This subsection will cover the most common issues with messy data and how to apply the tidy format for each example:

Note: It's perfectly valid and understandable to have your own structured way of presenting data in your published outputs, like in the table below (Occupation and Salaries). You should only format your data like this as a final output step - after producing your data in tidy format. Tidy data focuses not so much on the data formatting perspective but on the data production side, as tidy data enables the analyst to focus on analytical questions and not on data processing.

Internal to NHS England

How to transform your tidy data into a publication style output example: From tidy to publication output.

This video can be found internal to NHS England within the RAP community of practice MS Teams page under:

Files > COP video tutorials > From tidy data to publication

1. BAD: Column headers as values, not as names

Using values instead of names as a column header is a quite common form of messy data, however it can be useful for specific cases and computations. In the table below, is an example of various occupations and number of people for each occupation and salary:

Occupation <£20k £21k-£30k £31k - £40k £41k - £50k £50k +
IT 20 30 60 40 20
Teacher 30 40 50 40 35
Doctor 20 40 50 25 40
Software Dev 10 10 25 25 55
Solicitor 25 25 60 30 50

In order to transform the table above into a tidier format, the columns are transformed into rows and into 2 columns/variables, one for income and one for number of people (example showing first rows for IT):

Occupation Income Count
IT <£20k 20
IT £21k - £30k 30
IT £31k - £40k 60
IT £41k - £50k 40
IT £50k + 20
Teacher ... ...

You can see that the data goes from wide to long.

2. BAD: Multiple variables stored in one column

Sometimes a dataset can contain columns which are comprised of multiple underlying variables. In the example table below, data shows an imaginary count of patients in hospitals per region, grouped by gender and age, combined in one column. So for example, m1930 stands for male patients aged 19 to 30 years old.

Region Year m018 m1930 m3145 m4665 m6680 m81o f018 f1930
West Midlands 2010 2 5 15 20 25 20 4 6
London 2010 5 10 15 20 25 40 5 10
Yorkshire and the Humber 2010 3 9 8 4 20 23 3 9
North West 2010 5 10 15 12 17 15 2 8

To break apart these columns, different approaches of processing can be used, depending on the column header such as breaking apart the string or if there are separators in the column name (e.g. _ or - or ;), capitals or any type of expression. The results of processing is show below, where the combined column of age and gender has broken into 2 separate variables and adding an extra column for the count of patients (first rows of West Midlands region):

Region Year Gender Age Cases
West Midlands 2000 m 0 - 18 2
West Midlands 2000 m 19 - 30 5
West Midlands 2000 m 31 - 45 15
West Midlands 2000 m 46 - 65 20
West Midlands 2000 m 66 - 80 25
West Midlands 2000 m 81 + 20
West Midlands 2000 f 0 - 18 4
West Midlands 2000 f 19 - 30 6

3. BAD: Variables stored in both rows and columns

The next example can be confusing to tidy, as variables are stored in both rows and columns, so Weather station, Year, Month across Day1, Day2 columns and so forth, and tempmax/tempmin (Element column) stored in rows:

Weather station Date Month Element Day1 Day2 Day3...
LS123 2018 1 tempmax 29 - -
LS123 2018 1 tempmin 10 - -
LS123 2018 2 tempmax - 28 25
LS123 2018 2 tempmin - 13 13
LS123 2018 3 tempmax 30 - -
LS123 2018 3 tempmin 18 - -
LS123 2018 4... tempmax - - -

To tidy this, a new column Date needs to be extracted from Year, Month and Day to correspond to each day of observation. Then variables minimum and maximum Temperatures for each date will be stored in two different columns:

Weather station Date tempmin tempmax
LS123 2018-01-01 10 23
LS123 2018-02-02 13 24
LS123 2018-02-03 13 25
LS123 2018-03-01 18 26
LS123 2018-04-24 17 27
LS123 2018-05-12 19 29

This table is in a tidy format, there is one row for one day's observations and two new columns, one for each variable of the column Element.

4. BAD: Multiple types of observations in one table

The next example is subset of data from the top Billboard songs for each year. This data is not in a tidy format. Weeks 4 - 75 have been edited out for readability purposes. Here, there are several variables, Artist, Song Title, Duration, Date Entered, Week and Rank. (as both are the same for each week the song spent in the billboard.)

Year Artist Song Title Duration Date Entered Week1 Week2 Week3
1966 The Beatles We Can Work It Out 3:00 1966-01-08 5 2 1
1966 The Beatles Paperback Writer 3:30 1966-06-25 6 3 1
1966 The Beach Boys Good Vibrations 3:00 1966-12-10 7 4 2
1966 The Lovin' Spoonful Summer in the City 3:20 1966-08-13 3 2 1
1966 Four Tops Reach Out I'll Be There 3:10 1966-10-15 9 3 2

Transforming this dataset into a tidy version would cause unnecessary duplication of observations (thus cancelling the 1 observation per row norm), in the table below. Imagine if Weeks 4-75 were not removed, so then we would have 72 rows of repeating the same Artist, Song Title and Duration variables. Date Entered has be edited to include the starting week of every week on the billboard:

Year Artist Song Title Duration Date Entered Week Rank
1966 The Beach Boys Good Vibrations 3:00 1966-12-10 1 7
1966 The Beach Boys Good Vibrations 3:00 1966-12-17 2 4
1966 The Beach Boys Good Vibrations 3:00 1966-12-24 3 2
1966 The Lovin' Spoonful Summer in the City 3:20 1966-08-13 1 3
1966 The Lovin' Spoonful Summer in the City 3:20 1966-08-20 2 2
1966 The Lovin' Spoonful Summer in the City 3:20 1966-08-27 3 1

In this instance, it is recommended that the tidy dataset is split into two, one for each observational unit, the song title and the rank of the song:

Table 1

ID Artist Song Title Duration
1 The Beatles We Can Work It Out 3:00
2 The Beatles Paperback Writer 3:30
3 The Beach Boys Good Vibrations 3:00
4 The Lovin' Spoonful Summer in the City 3:20
5 Four Tops Reach Out I'll Be There 3:10

Table 2

ID Date Entered Rank
3 1966-12-10 7
3 1966-12-17 4
3 1966-12-24 2
4 1966-08-13 3
4 1966-08-20 2
4 1966-08-27 1

This approach is useful for tidy data buts lacks in data processing tools' capabilities, so in that case these tables will have to be merged again.

5. One type of observational unit in multiple tables

There are datasets that have multiple tables and variables and observational units that are spread throughout those tables. Or these tables change frequently over time. Or there are different approaches to data types, missing data, data formats for each of those tables.

In these cases, it is recommended that each table is 'tidied' up, and then combine each tidy table. To avoid any potential data duplication, adding an extra IDcolumn in each table as well as a column that records the original table/file's name will ensure that each row of data in the merged tidy table is labelled with its source table/file.

Diabetes tidy data example

Internal to NHS England

This video can be found internal to NHS England within the RAP community of practice MS Teams page under:

Files > COP video tutorials > Diabetes tidy example

In this video:

  • SQL vs PySpark outputs comparison (messy vs tidy data)
  • Data shown in suppressed/published form and dummy data

Last update: March 1, 2024
External Links Disclaimer

NHS England makes every effort to ensure that external links are accurate, up to date and relevant, however we cannot take responsibility for pages maintained by external providers.

NHS England is not affiliated with any of the websites or companies in the links to external websites.

If you come across any external links that do not work, we would be grateful if you could report them by raising an issue on our RAP Community of Practice GitHub.