Common Disclosure Risk Factors
Detailed below are a number of common disclosure risk factors that relate to the user or the data itself. The Data Custodian should consider these risks and add appropriate Data Protections prior to depositing their data with the ADA.
1. Incentive: It is often difficult to establish the motives of an individual or the incentives that they may have to disclose Personal Information. Generally, however, the more an individual or organisation stands to gain from re-identifying a record, the greater the disclosure risk. Incentive also forms the fundamental principle of trusted access. Here Data Owners share accountability for protecting data confidentiality with the Data Custodians, and the incentive for data users is the ongoing authorisation to access the data. Additional information regarding Trusted Users can be found under the Data Sharing Principles.
2. Data Sensitivity: Some variables may require additional treatment if they are deemed sensitive (e.g. health, TFN or criminal information). This treatment may be dictated by legislation and policy, such as the aforementioned Specific Privacy Act 1988 Regulations, as well as confidentiality obligations, such as Ethics approvals. This can be a significant balancing act as often the variables that are of most interest to researchers are also the most sensitive.
3. Data Age: As a general rule, the older the data, the harder it is to re-identify an individual or organisation, and the data is therefore considered less risky. This is simply because circumstances change: people move house, change jobs, receive pay rises and so on. These changes can, however, also render the data useless if circumstances have changed too much since the original research was conducted. Typically, older data acquires a basic level of data protection owing to this divergence. Data divergence due to the age of the information occurs in two common forms.
a. The first, Data-to-Data Divergence, exists as a result of differences between two separate datasets that may once have contained the same or similar data. For example, if one dataset is updated and another is not, the two diverge.
b. The second, Data-to-World Divergence, occurs as a result of changes in conditions since the data was collected. For example, a person may have changed jobs, moved home or had a pay rise.
It is possible for two separate sets of data to diverge at the same rate, known as parallel divergence. In this case, there is no difference in the probability of re-identification. It is for this reason that Data Age should not be solely relied upon to provide a proactive protection measure from disclosure risks, and therefore it should only be considered as an additional buffer to other data protection techniques.
4. Sample Data (Survey): Data based on surveys or samples taken from a population generally carries a lower disclosure risk than full population datasets. This is because there is inherent uncertainty over whether a particular target individual, organisation or entity exists within the sample. There may also be other records within the population that have identical characteristics to those in the sample, so the certainty of identifying an individual is further reduced. The risk is not reduced to zero, particularly when considering rare characteristics, which may be re-identifiable in a sample as well as in the population. To further protect the data, the specific sample selection methodology could be withheld from secondary users, thereby making deliberate identification attempts more difficult.
5. Unique or Rare Characteristics: Even if a sample contains only a few data items and categories, a disclosure risk may exist if the data contain unique, rare or remarkable characteristics (or a specific combination of characteristics). This risk depends on how rare or remarkable the characteristic is. For example, a widow aged 19 years in a sample is more likely to be identifiable than one aged 79 years. In addition, it is important to consider the rarity of a record from a population perspective. For example, there may be only one 79 year old widow in a sample, but they will not be unique in the entire population. The sampling process is a significant contributor to protecting the confidentiality of that individual, as a user is unlikely to know which 79 year old widow was actually selected. It is advisable, however, to protect that single individual in any subsequent outputs that may be publicly released (Open Access).
Where a record is not unique in a sample, it follows that it cannot be unique in the population. Likewise, a record that is unique in the population must, if selected, also be unique in the sample. A record could, however, be unique in a sample but not in the population as a whole, depending on the make-up of the sample data. It is this uncertainty that helps to provide some data protection, but if the intruder knows that a particular value is unique in the population then disclosure has occurred.
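The relationship between sample uniqueness and population uniqueness described above can be sketched in a few lines of Python. This is a minimal illustration only: the records, the variable names and the helper `unique_keys` are hypothetical, and in practice the full population is rarely available to the Data Custodian, so population uniqueness usually has to be estimated.

```python
from collections import Counter

# Hypothetical toy records of (marital_status, age). Not real data.
population = [
    ("widowed", 19),
    ("widowed", 79), ("widowed", 79), ("widowed", 79),
    ("married", 45), ("married", 45),
]
# Hypothetical sample drawn from the population above.
sample = [("widowed", 19), ("widowed", 79), ("married", 45)]


def unique_keys(records):
    """Return the combinations of characteristics that occur exactly once."""
    counts = Counter(records)
    return {key for key, n in counts.items() if n == 1}


sample_uniques = unique_keys(sample)
population_uniques = unique_keys(population)

# Unique in the sample AND in the population: highest risk.
high_risk = sample_uniques & population_uniques
# Unique in the sample only: partly protected by the sampling process.
protected_by_sampling = sample_uniques - population_uniques

print("High risk:", high_risk)
print("Protected by sampling:", protected_by_sampling)
```

In this toy data the 19 year old widow is unique at both levels, whereas the 79 year old widow is unique only in the sample.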
6. Population (Census) Data: This is data that represents all of the people in a particular cohort or group (i.e. a population), such as hospital patients or benefits claimants. This data is always considered to be at greater risk because there will be little uncertainty as to who is represented within the dataset. Coupled with the richness of the data that normally accompanies this form of information, disclosure risk is high and will require treatment.
7. Longitudinal Data: Data about a defined population collected over time, as opposed to datasets that are snapshots of different samples of the population, has significant disclosure issues. Individuals or organisations whose characteristics change over time are much more likely to be re-identified than those whose characteristics do not (in reality, very few individuals or organisations do not change over time). For example, a business that has a relatively constant income over five years but then triples its income for the next three years is more likely to be re-identified than a business with a constant income over the same time frame.
A very common example is transactional data such as that from a supermarket loyalty card. Here, purchases that are tracked over time are considered risky because of the potential to capture unique changes over time. Such changes can be indicative of marital, economic, employment, geographical and health status, which may increase the likelihood of a subject being unique in a dataset and assist in their re-identification.
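As a rough illustration of why a tracked history is riskier than a single snapshot, the following Python sketch compares uniqueness within each year against uniqueness of the whole trajectory. The business labels, income bands and the helper `unique_units` are hypothetical and chosen only to make the effect visible.

```python
from collections import Counter

# Hypothetical income bands per business over four years (illustrative only).
trajectories = {
    "A": ("low", "low", "low", "low"),
    "B": ("low", "low", "low", "low"),
    "C": ("low", "low", "high", "high"),
    "D": ("high", "high", "high", "high"),
    "E": ("high", "high", "high", "high"),
}


def unique_units(values_by_unit):
    """Return the units whose value is shared with no other unit."""
    counts = Counter(values_by_unit.values())
    return sorted(u for u, v in values_by_unit.items() if counts[v] == 1)


# Uniqueness when each year is viewed as a separate snapshot.
snapshot_uniques = [
    unique_units({unit: bands[year] for unit, bands in trajectories.items()})
    for year in range(4)
]

# Uniqueness when the four years are linked into one history.
trajectory_uniques = unique_units(trajectories)

print("Unique in each snapshot:", snapshot_uniques)
print("Unique across linked years:", trajectory_uniques)
```

No business stands out in any single year, but business C becomes unique once its change of band is visible across the linked years.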
8. Hierarchical Data: Where datasets have information at more than one level (e.g. personal as well as family level), the disclosure risk is greater. A table that appears to be non-disclosive at one level may contain information that a knowledgeable user could use to re-identify a contributor at another level. For example, a count of people with a household income of $801-$1000 per week may be 6; however, the 6 may refer to a single household at the higher level (2 parents and 4 children). This effectively discloses information about all of the people in the household and may lead to Data Subjects becoming unique within a dataset, and therefore more likely to become identifiable. In these hierarchical datasets, contributors at all levels must be protected. Hence it is typically necessary to change the data at both levels.
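The household example above can be checked mechanically by counting contributors at both levels for each table cell. The sketch below is illustrative only: the person records, household identifiers and the function `cell_summary` are hypothetical.

```python
from collections import defaultdict

# Hypothetical person records: (person_id, household_id, weekly household income band).
persons = [
    (1, "H1", "$801-$1000"), (2, "H1", "$801-$1000"), (3, "H1", "$801-$1000"),
    (4, "H1", "$801-$1000"), (5, "H1", "$801-$1000"), (6, "H1", "$801-$1000"),
    (7, "H2", "$1001-$1250"), (8, "H3", "$1001-$1250"), (9, "H3", "$1001-$1250"),
]


def cell_summary(records):
    """Map each income band to (number of people, number of distinct households)."""
    people = defaultdict(int)
    households = defaultdict(set)
    for _person_id, household_id, band in records:
        people[band] += 1
        households[band].add(household_id)
    return {band: (people[band], len(households[band])) for band in people}


summary = cell_summary(persons)
for band, (n_people, n_households) in summary.items():
    print(band, "people:", n_people, "households:", n_households)
```

A person-level count of 6 looks safe on its own, but the household-level count of 1 shows that the cell describes a single family.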
9. Data Accuracy/Quality: All data will contain some level of error, originating from the Data Subject, the data collector and/or the collection process. High data accuracy or quality actually increases the disclosure risk. This is not to suggest that low accuracy or quality should be striven for, or used as a method to manage disclosure risk, but that Data Owners should be aware that a small level of error, inherent in all data, offers a natural degree of data protection.
10. Microdata: Due to the richness of the detail associated with unit records, microdata files are potentially valuable resources for researchers and policy makers. The challenge with this data type is to strike the right balance between maximising availability of information for statistical and research purposes and maintaining confidentiality. Microdata brings with it two key risks: disclosure from a published analysis, and disclosure from a user accessing the unit records.
The second risk is often broken down into spontaneous recognition and malicious attempts at re-identification.
a. Spontaneous recognition is where, in the normal course of analysis, a data user recognises an individual or organisation without deliberately intending to (e.g. when checking outliers in a population).
b. A malicious attempt could involve someone looking for a specific individual in the data, or using other research to confirm the identity of an individual who stands out because of their characteristics.
11. Aggregate Data: This form of data is most susceptible to Attribute Disclosure, particularly from differencing when other data or information is available that has been generated from the same or a similar source. Mathematical approaches, including the use of simultaneous equations to break down cell and column details, are also particularly disclosive. Once attributes are known, the likelihood of re-identification of people, organisations or entities increases further.
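Differencing can be illustrated with two tables that cover almost the same group. The following Python sketch uses hypothetical published counts and hypothetical variable names; it simply subtracts one table from the other.

```python
# Hypothetical published counts from the same source (illustrative only).
# Table 1 covers people aged 20 to 29; Table 2 covers people aged 20 to 30.
table_20_to_29 = {"employed": 41, "unemployed": 12}
table_20_to_30 = {"employed": 42, "unemployed": 12}

# Subtracting the tables isolates the people aged exactly 30.
difference = {
    status: table_20_to_30[status] - table_20_to_29[status]
    for status in table_20_to_30
}

print(difference)
```

Neither table contains a small cell, yet the difference shows that there is exactly one 30 year old in the group and that this person is employed, which is an attribute disclosure for anyone who knows who that person is.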
12. Direct Identifiers: These are variables that contain information that directly identifies the individual, organisation or entity, for example a name or address. The variable could lead to identification either alone, or together with other direct identifiers, and often in combination with other readily available information.
13. Indirect Identifiers: These are variables that can be used to identify an individual, organisation or entity with a high probability. Examples of indirect identifiers include age, sex, date of birth, marital status and location (such as post code or census geography). Whilst these variables are not immediately obvious identifiers on their own, they may become so in combination with each other or with other information, particularly where unique or rare combinations exist, such as the 19 year old widow example earlier.
14. Key Variables: The variables that are of most interest to users are often also the most disclosive. These are the elements of the dataset that a potential intruder will use to identify an individual, and they are those for which some form of auxiliary information exists, allowing a comparison to be made that leads to re-identification. There are some similarities between the notions of Key Variables and Indirect Identifiers. The distinction is that a Key Variable is specific to a particular scenario and/or combination of datasets, whereas the term Indirect Identifier is focussed on a specific dataset and lists those variables which could be used as potential identifiers in any scenario. As such, the list of Indirect Identifiers is effectively a list of all possible Key Variables for all scenarios. As subsets of data are often used, the term Key Variable remains relevant because not all variables that would be listed as Indirect Identifiers will be present.
15. Low Counts: Within frequency tables, each cell in the table contains the number of contributors (e.g. individuals, households or organisations). Disclosures could be made from a table where cells have very low counts of contributors.
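A simple way to screen a frequency table for low counts is to flag every cell below a minimum threshold. The sketch below is a hedged illustration: the threshold of 5, the table contents and the function `low_count_cells` are all assumptions made for the example, not values prescribed by this guidance.

```python
MIN_COUNT = 5  # hypothetical threshold, chosen only for illustration

# Hypothetical frequency table: (region, marital_status) -> number of contributors.
frequency_table = {
    ("North", "married"): 120,
    ("North", "widowed"): 3,
    ("South", "married"): 98,
    ("South", "widowed"): 1,
    ("South", "divorced"): 0,
}


def low_count_cells(table, min_count=MIN_COUNT):
    """Return the non-empty cells with fewer than min_count contributors."""
    return sorted(cell for cell, n in table.items() if 0 < n < min_count)


flagged = low_count_cells(frequency_table)
print(flagged)
```

Empty cells are skipped here purely to keep the example short; whether zero counts also need treatment depends on the table and is not settled by this sketch.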
16. Magnitude Tables: In these tables, each cell contains summarising information about the numerical contributions to that cell (often in the form of a total or mean, but possibly a median, mode or range), for example tables reporting total turnover for groups of businesses. Disclosures can occur when the values of a small number of units (e.g. one or two businesses with extremely high turnover) dominate the total cell value.
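One common style of check for this risk is a dominance rule: a cell is flagged when its largest few contributors account for more than a set share of the cell total. The sketch below assumes such a rule with hypothetical parameters (the top 2 contributors and a 75% share); neither the parameters nor the function name `is_dominated` come from this guidance.

```python
def is_dominated(contributions, n=2, k=0.75):
    """Return True if the n largest contributions exceed share k of the cell total.

    n and k are hypothetical parameters chosen for illustration.
    """
    total = sum(contributions)
    if total <= 0:
        return False  # nothing to dominate in an empty or zero-valued cell
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return top_n / total > k


# Hypothetical turnover figures for the businesses in two table cells.
cell_a = [900, 50, 30, 20]   # two businesses contribute 95% of the total
cell_b = [100] * 10          # turnover is spread evenly

print(is_dominated(cell_a))
print(is_dominated(cell_b))
```

A flagged cell would typically need some form of treatment before release, since a dominant contributor's value can be closely estimated from the published total.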
17. Response Knowledge: The simple fact of knowing that somebody is contained within a dataset increases the likelihood of re-identification and disclosure. Response Knowledge can occur in two main ways: either the intruder knows that the data corresponds to a population and that the target is a member of that population, or the intruder has ad-hoc knowledge about a particular individual’s presence within the data (e.g. my neighbour tells me that they were surveyed). Response Knowledge is a more important consideration where the dataset is small, as this effectively decreases the size of the haystack that the needle is in and so increases the likelihood of re-identification.