Methodology

View code on GitHub

Note that on occassions where not all underlying data sources are available as at the most recent archived_on_date, we fall back to the next most recent version, e.g. GDPPR, Deaths and Vaccine Status were not provisioned as at 2024-06-04 so their versions as at 2024-05-28 are used in the batch provisioned as at June 2024.


Multisource Tables

The multisource tables consolidate a characteristic from multiple data sources. The data sources included so far are below. We plan to add data sources to this resource when the need arises - sources included so far have been project led.

Data Source Date of Birth Sex Ethnicity LSOA Record Date
GDPPR - all archived versions ✔ YEAR_OF_BIRTH ✔ SEX ✔ ETHNIC ✔ LSOA REPORTING_PERIOD_END_DATE
GDPPR - snomed codes       RECORD_DATE (DATE if missing)
HES APC ✔ MYDOB ✔ SEX ✔ ETHNOS ✔ LSOA11 EPISTART
HES OP ✔ APPTAGE_CALC ✔ SEX ✔ ETHNOS ✔ LSOA11 APPTDATE
HES AE ✔ ARRIVALAGE_CALC ✔ SEX ✔ ETHNOS ✔ LSOA11 ARRIVALDATE
SSNAP ✔ S1AGEONARRIVAL ✔ S1GENDER   ✔ LSOA_OF_RESIDENCE S1FIRSTARRIVALDATETIME
Vaccine status ✔ MYDOB     ✔ LSOA RECORDED_DATE
  • Date of Birth is only ever accurate to month and year and imputed with Day = 01
  • For HES OP, HES AE and SSNAP, an imputation algorithm is applied to derive Date of Birth from Age. Age represents completed years therefore, an upward adjustment of 0.5 years has been applied to approximate fractional ages.
  • LSOA is sourced from LSOA as at 2011. Please note that we do not currently clean LSOA’s that are not LSOA 2011. If the demographics table includes LSOAs from the 2001 census, these are derived from the original data sources. We are developing a version that will align 2001 LSOAs with their 2011 equivalents, ensuring that discrepancies in 2001 LSOA data within the 2011 columns are addressed and cleaned up.

GDPPR is derived from a JOURNALS (coded data) and PATIENTS (demographic) table whereby the latter contains the most up-to-date view of a patient’s demographics. Thus, demographics in GDPPR are not record specific. Using the latest batch of the extract will include the characteristic only as at its most recent recording but accessing all batches allows a person’s full history to be captured.

No data cleaning is ever applied to the multisource table.

Example

For Person A and B we find all date of birth records that exist in GDPPR, HES APC, HES AE, HES OP, SSNAP and Vaccine Status as at the archived_on date 2024-04-25. They are then harmonised and concatenated into the table hds_curated_assets__date_of_birth_multisource_2024_04_25:

ROW_NUM PERSON_ID DATE_OF_BIRTH RECORD_DATE DATA_SOURCE
1 A 1975-09-01 2020-05-18 GDPPR
2 A 1975-09-01 2021-01-03 GDPPR
3 A 1975-09-01 2023-08-09 GDPPR
4 B 1980-10-01 2022-10-02 GDPPR
5 B 1980-10-01 2024-05-09 GDPPR
6 A 1975-09-01 1800-01-01 HES APC
7 A 1975-09-01 1995-04-14 HES APC
8 A 1975-09-01 2024-02-20 HES APC
9 B 1985-10-01   HES APC
10 B 1980-10-01 2023-12-13 HES APC
11 B 1981-04-11 2003-10-11 HES AE
12 B 1980-07-02 2015-01-02 HES AE
13 B 1981-03-09 2023-09-09 HES AE
14 A 1977-03-01 2010-09-01 HES OP
15 B 1982-05-02 2004-11-02 HES OP
16 A 1975-12-01 2023-06-01 SSNAP
17 A 1975-12-12 2023-06-12 SSNAP
18 A 1976-03-03 2023-09-03 SSNAP
19 A 1976-06-01 2023-12-01 SSNAP
20 A 1975-09-01 2022-01-01 Vaccine Status

Mapping

Codes have been mapped to their descriptions for Sex and Ethnicity, by data source.

For example:

sex_code sex_description sex
0 Not known (not recorded) NULL
1 Male M
2 Female F
9 Not specified (indeterminate, i.e. unable to be classified as either male or female) I
ethnicity_raw_code ethnicity_raw_description ethnicity_18_code ethnicity_18_group ethnicity_5_group
93941000000 Greek Cypriot – ethnic finding 2001 census (finding) C Any other White background White
C Any other White background C Any other White background White
7 Chinese R Chinese Asian or Asian British

Individual Tables

We apply data cleaning, data restrictions and finally a selection algorithm to produce a one-record-per-person table.

Data Cleaning

  • Characteristic of interest must be valid i.e. Not Null/NA or blank. Codes that were mapped to Not Given/Stated/Known or Unknown are considered invalid.

  • Record date must be valid i.e. Not Null/NA or blank.

  • Additionally for Date of Birth: Date of Birth must be ≤ Record Date

Example

ROW_NUM PERSON_ID DATE_OF_BIRTH RECORD_DATE DATA_SOURCE
6 A 1975-09-01 1800-01-01 HES APC
9 B 1985-10-01   HES APC

We remove row 6 because Date of Birth is not ≤ Record Date.

We remove row 9 because the Record Date is NULL.

Data Resctictions Available as function arguments

  • Record date must be ≥ 1900-01-01

  • Record date must be ≤ most recent archived_on date (the max archived_on date for each data source is considered)

Example

ROW_NUM PERSON_ID DATE_OF_BIRTH RECORD_DATE DATA_SOURCE
5 B 1980-10-01 2024-05-09 GDPPR

We remove row 5 because the Record Date is ≥ archived_on date of 2024-04-25.

Want to use our selection algorithm but include only a subset of data sources or define your own minimum record date? Use our functions!

Selection Algorithm

Finally, we apply a selection algorithm to pick one record per person.

Depending on the characteristic, recency or data source is prioritised over the other for this selection. Time variant characteristics are selected differently to time invariant.

Demographic data Data sources priority levels Priority
Date of Birth GDPPR = 1
HES APC = 2
HES OP = HES AE = 3
Remaining data sources = 4
Data source > Recency
Sex GDPPR = 1
HES APC = 2
HES OP = HES AE = 3
Remaining data sources = 4
Data source > Recency
Ethnicity GDPPR SNOMED = 1
GDPPR = 2
HES APC = 3
HES OP = HES AE = 4
Remaining data sources = 5
Data source > Recency
LSOA GDPPR = HES APC = HES OP = HES AE = SSNP = Vaccine Status = … = 1 Most recent record (all data sources have the same priority index)
  • For Date of Birth, Sex and Ethnicity we prioritise primary care over secondary care. We also prioritise HES APC before HES OP and HES AE. Within data sources, we prioritise receny. Prioritising data source > recency means that we do not necessarily pick the most recent record available for each person. Instead, we pick the most recent record available in GDPPR and if there are no records in GDPPR we pick the most recent in HES APC and so on.

  • For LSOA we prioritise the most recent record regardless of source.

We prioritise SNOMED Ethnicity codes before GDPPR Ethnicity codes. Find out more about ethnicity in health research here.

Example

We apply our selection algorithm to the records remaining below:

ROW_NUM PERSON_ID DATE_OF_BIRTH RECORD_DATE DATA_SOURCE
1 A 1975-09-01 2020-05-18 GDPPR
2 A 1975-09-01 2021-01-03 GDPPR
3 A 1975-09-01 2023-08-09 GDPPR
4 B 1980-10-01 2022-10-02 GDPPR
7 A 1975-09-01 1995-04-14 HES APC
8 A 1975-09-01 2024-02-20 HES APC
10 B 1980-10-01 2023-12-13 HES APC
11 B 1981-04-11 2003-10-11 HES AE
12 B 1980-07-02 2015-01-02 HES AE
13 B 1981-03-09 2023-09-09 HES AE
14 A 1977-03-01 2010-09-01 HES OP
15 B 1982-05-02 2004-11-02 HES OP
16 A 1975-12-01 2023-06-01 SSNAP
17 A 1975-12-12 2023-06-12 SSNAP
18 A 1976-03-03 2023-09-03 SSNAP
19 A 1976-06-01 2023-12-01 SSNAP
20 A 1975-09-01 2022-01-01 Vaccine Status

For date of birth we prioritise data source > receny thus our final individual table for date of birth is hds_curated_assets__date_of_birth_individual_2024_04_25:

ROW_NUM PERSON_ID DATE_OF_BIRTH RECORD_DATE DATA_SOURCE
3 A 1975-09-01 2023-08-09 GDPPR
4 B 1980-10-01 2022-10-02 GDPPR

Ties

Conflicts can occur at the same record date at data sources that have the same priority index.

To maintain the one row per person requirement:

  • ties are flagged

  • all ties are pulled into an array

  • one record is randomly selected

Example

The following LSOA’s were found for Person A as at the most recent record date available for that person:

person_id lsoa record_date data_source
A E01000001 2020-05-18 GDPPR
A E01000002 2020-05-18 GDPPR

We randomly select one of the values, create a column that indicates this lsoa was picked from a conflict and retain the original conflicting lsoa’s in an array.

person_id lsoa record_date data_source lsoa_tie_flag lsoa_tie_values
A E01000002 2020-05-18 GDPPR 1 [E01000001, E01000002]

Need to know

Ties affect a very small % of persons once the selection algorithm has been applied. You should decide how you want to proceed with these. The random selection could be carried forward, the tie set to NULL or your own algorithm applied to select an appropriate value.


Demographics Table

Finally, we combine the indivdual tables for date of birth, sex, ethnicity and lsoa into one demographics table.

This table also maps lsoa to the appopriate region and IMD (2019) deciles and quintiles, contains a death flag and date of death and a gdppr flag and minimum gdppr record date.

person_id date_of_birth sex_code sex ethnicity_raw_code ethnicity_raw_description ethnicity_18_code ethnicity_18_group ethnicity_5_group lsoa region imd_decile imd_quintile death_flag date_of_death in_gdppr gdppr_min_date date_of_birth_tie_flag sex_tie_flag ethnicity_18_tie_flag ethnicity_5_tie_flag lsoa_tie_flag
A 1975-09-01 1 M 92461000000105 Pakistani or British Pakistani - ethnic category 2001 census (finding) J Pakistani Asian or Asian British E01000001 City of London 9 5 0   1 2020-01-01 0 0 0 0 1
B 1980-10-01 2 F A British A British White E01025616 East Midlands 7 4 0   1 2021-02-02 0 0 0 0 0
  • Note that there is a small group of people who have died but have no date of death.