Basic Python Data Analysis operations
Python offers many ways to achieve multiple calculations, computations and operations. For data analysis and data science overall, Pandas is the most commonly used package or library to perform these operations, along with NumPy.
Reading in data
To get started with Pandas import:
Loading data from a .csv file
df - dataframe
df = pd.read_csv('your_file.csv')
df = pd.read_csv('your_file.csv', header=..., na_values=..., sep=..., etc)
Loading data from an Excel file .xlsx
Loading data from a SQL table/database
import pyodbc
import pandas as pd
connection_object = pyodbc.connect("Driver={SQL Server};"
"Server=xxxx;"
"Database=xxxx;"
"Trusted_Connection=yes;")
df = pd.read_sql(insert_your_sql_query, connection_object)
Add your Server address and Database name in the respective conditions above.
Loading data from an SPSS file (.sav)
There are two way to import a .sav file. One way:
or
The second option creates the dataframe but also captures the metadata of the .sav file, which is useful when running data validation checks.
Common Operations
There are many ways to perform various operations in Python, depending on the library you are using or the general approach and design you've opted to apply to your code. This guide presents several examples on how to perform common SAS operations using pandas, any suggestions or feedback is welcome.
Cases
You will soon notice after importing your data from the .sav file that the column headers are not aligned consistently to upper case or lower case, but a mixture of this. To apply lower or upper case to all column headers:
Extracting the required columns
To select a column:
Filter where a variable is not null/missing
To filter rows based on some specific criteria:
Joins
Add a new column
Sorting variables
Transposing columns
There's a few ways to transpose columns:
To set the name of the axis for the index or columns you can use rename_axis()
:
Grouping by variables
Aggregations
Once we've done the groupings above, add the aggregations:
new_df = df.groupby(grouped).agg(total_sum=("column to be summarised", "sum"), total_count=("column to be counted", "count")).reset_index()
Creating totals per row and per column
df.loc["Column Total"] = df.sum(numeric_only=True, axis=0)
df.loc[:, "Row Total"] = df.sum(numeric_only=True, axis=1)
Appending totals to a table
When creating different aggregations/groupings which are saved in different dataframes, you can then combine these aggregations into one table. For example, suppose you have calculated the totals for age and gender in different dataframes and you wish to append these results to the final output dataframe.
Creating derivations
To create a derivation based on the equivalent CASE WHEN SQL operation, there are several ways to do this in python:
df.loc[df["age"] < 0, "age11_15"] = df["age"]
df.loc[(df["age"] > 0) & (df["age"] < 11), "age11_15"] = 11
df.loc[(df["age"] > 10) & (df["age"] < 16), "age11_15"] = df["age"]
df.loc[df["age"] > 14, "age11_15"] = 15
This results in creating a new column "age11_15" in the existing dataframe, based on the CASE WHEN conditions we applied for the new derivation.
age11_15 = np.select(
[
df['age'] == 10, # WHEN
df['age'] > 15 # WHEN
],
[
11, # if age == 10 then assign 11
15 # if age > 15 assign 15
],
default=df['age'] # ELSE assign "age" column values
)
In the first bracket you assign the "WHEN" part of the condition, second bracket the "THEN", and "default=..." represents the "ELSE" part.
The NumPy option is faster and more efficient whereas Pandas is user friendlier and straightforward in its application. For datasets with only thousands of rows, whichever option you apply won't make a difference.
Apply a column order
Exporting the output
To check the parameters which can be adjusted for each file export, the pandas documentation provides useful resources. pandas.DataFrame.to_excel() for example.
Further reading
External Links Disclaimer
NHS England makes every effort to ensure that external links are accurate, up to date and relevant, however we cannot take responsibility for pages maintained by external providers.
NHS England is not affiliated with any of the websites or companies in the links to external websites.
If you come across any external links that do not work, we would be grateful if you could report them by raising an issue on our RAP Community of Practice GitHub.