PySpark
We have collated some information on styling in PySpark, logging and error handling, the benefits of PySpark, and a tutorial.
We have also gathered some information on unit testing, unit testing field definitions and Python functions; these are also applicable to PySpark.
PySpark Benefits
PySpark is the Python API for Apache Spark. It enables us to make use of the distributed processing available in NHS Digital, so analysts can run queries against datasets that are far too big to fit in a single computer's memory.
We recommend that analytical teams in NHS Digital generally use PySpark for their code. There are a number of reasons for this:
- In NHS England we have some very large datasets. Given our tech stack, the best way for us to process these datasets is to use PySpark. Other options risk running out of memory or disrupting the work of other teams by using compute resource inefficiently.
- If you know that you will sometimes need to use PySpark, it is easier to use it from the outset than to adapt Python or R code when you run out of memory.
- PySpark has critical mass in the NHS Digital data engineering community, so there is a depth of technical knowledge to draw on.
- PySpark is much easier to learn for people coming from SQL. Many of the same keywords are used: select, where, groupBy, etc. This is in sharp contrast to, for example, pandas, which has an extremely steep learning curve for new starters.
- Choosing one language to support enables us to provide better training and support.
- Aligning around one language as much as possible makes it easier for teams to support each other.
Note: we strongly believe that teams should have the option to use whatever tool they deem right for their situation. We focus our efforts on supporting PySpark but do not want to prevent teams from choosing another course.
External Links Disclaimer
NHS England makes every effort to ensure that external links are accurate, up to date and relevant; however, we cannot take responsibility for pages maintained by external providers.
NHS England is not affiliated with any of the websites or companies in the links to external websites.
If you come across any external links that do not work, we would be grateful if you could report them by raising an issue on our RAP Community of Practice GitHub.