Data Treatment Techniques


There is no one correct method of data treatment. The techniques applied to your data should be chosen dataset by dataset, keeping the data as useful as practicably possible while ensuring that there is no reasonable risk of re-identification. The ADA collection is made up of a wide range of datasets: some have been heavily treated and can therefore be released as Open Data, while others may contain identifiable items, are not in general circulation, and require specific conditions to be met before the data can be accessed.

Most data treatment techniques fit into one of two broad categories:

a. Data Reduction, which reduces the detail available to the user; or

b. Data Modification, which makes small changes to the data, masking the true values.

Whilst the removal of Direct Identifiers is an obvious and essential component of de-identification, other techniques should then be used to increase the likelihood that the data will remain de-identified. Coupled with other Data Sharing Principles, such as People and Settings controls, these methods should help minimise the risk of disclosure to a reasonable level. If you are unsure of the best or most appropriate method of data protection for your study, you should contact the ADA for expert advice. Some of the most common techniques are outlined below:

1. Sampling: For surveys, the sampling fraction is usually specified by the study design, so its choice often rests outside the disclosure control process. For other forms of data, there is value in considering sampling. It reduces the risk associated with Response Knowledge by creating uncertainty as to whether a particular population unit is contained in the dataset, and thus decreases the likelihood that a match is interpreted as a true re-identification. Even a 95% random sample creates uncertainty, and for most purposes will not unacceptably reduce the utility of the data.
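
As a minimal sketch of how a release sample might be drawn (using only Python’s standard library; the record layout and field names are illustrative, not ADA conventions):

    import random

    def sample_records(records, fraction=0.95, seed=None):
        """Draw a simple random sample containing the given fraction of records."""
        rng = random.Random(seed)
        k = round(len(records) * fraction)
        return rng.sample(records, k)

    # Illustrative records only: a real dataset would carry many more attributes.
    records = [{"unit": i, "age": 20 + i % 60} for i in range(1000)]
    released = sample_records(records, fraction=0.95, seed=42)
    print(len(released))  # 950: no user can be certain any given unit is present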

2. Threshold Rule: This rule sets a minimum number of unweighted contributors to a cell. Cells with counts below the threshold (Low Counts) are defined as posing an unacceptable risk and need to be protected. When using this rule, however, consider that there is no strict statistical basis for choosing one threshold value over another: higher values increase protection against re-identification but also degrade the utility of the original data, and a lower threshold may be appropriate for sampled datasets compared with population datasets. By implication, every cell at or above the threshold is defined as posing an acceptable disclosure risk. In practice, a cell below the threshold may not in fact be a disclosure risk, while a cell above the threshold may be disclosive. Judgement also needs to be exercised when considering cells with no contributors (i.e. zero cells), or where all contributors to a row or column are concentrated in a single cell (i.e. 100% cells), as these may also pose a disclosure risk.
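
A minimal sketch of flagging Low Counts in a frequency table, assuming the cells are held as a list of category combinations; the threshold of 3 here is illustrative only:

    from collections import Counter

    def low_count_cells(values, threshold=5):
        """Return the cells (category combinations) whose unweighted count
        falls below the threshold and therefore needs protection."""
        counts = Counter(values)
        return {cell: n for cell, n in counts.items() if 0 < n < threshold}

    # Illustrative cross-tabulation cells keyed by (region, condition).
    cells = [("NSW", "A"), ("NSW", "A"), ("ACT", "B"), ("ACT", "B"),
             ("ACT", "B"), ("NT", "C")]
    print(low_count_cells(cells, threshold=3))
    # {('NSW', 'A'): 2, ('NT', 'C'): 1} -- both need protection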

3. Cell Dominance or Concentration Rule: This typically applies to tables that present magnitude data such as income or turnover. The rule is designed to prevent re-identification of units that contribute a large percentage of a cell’s total. It defines the number of units that are allowed to contribute a set percentage of the total, and is also referred to as the (n, k) rule, where ‘n’ is the number of units that may not together contribute more than ‘k’ per cent of the total. For example, a (2, 75) dominance rule means that the top 2 units may not contribute more than 75% of the total cell value. As with the Threshold Rule, there is no strict statistical basis for choosing the values of n and k.
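
A sketch of an (n, k) dominance check for a single cell, assuming the individual contributions to the cell are available as a list of numbers:

    def violates_dominance(contributions, n=2, k=75):
        """(n, k) rule: flag a cell if its top n contributors account for
        more than k per cent of the cell total."""
        total = sum(contributions)
        if total <= 0:
            return False
        top_n = sum(sorted(contributions, reverse=True)[:n])
        return 100 * top_n / total > k

    print(violates_dominance([60, 25, 10, 5]))    # True: top 2 hold 85%
    print(violates_dominance([30, 25, 25, 20]))   # False: top 2 hold 55%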

4. P% Rule: This can be used in combination with the Cell Dominance Rule to protect the attributes of individual contributors to magnitude data. It aims to prevent the value of any contributor from being estimated to within a percentage ‘P’, and is helpful where a user knows that a particular person or organisation is contained in a dataset (i.e. when Response Knowledge is present). Applying the rule limits how close, in percentage terms, a contributor’s estimated value can be to its true value. For example, a rule of P = 20% means that the estimated value of any contributor to the dataset must differ from its true value by at least 20%.
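
One common formulation of the rule tests whether the cell total minus the two largest contributions is smaller than P% of the largest contribution, since the second-largest contributor can estimate the largest by subtracting their own value from the total. A sketch under that assumption:

    def fails_p_percent(contributions, p=20):
        """Flag a cell if the remainder (total minus the top two
        contributions) is smaller than p% of the largest contribution,
        i.e. the largest value could be estimated too closely."""
        x = sorted(contributions, reverse=True)
        if len(x) < 2:
            return True
        remainder = sum(x[2:])
        return remainder < (p / 100) * x[0]

    print(fails_p_percent([100, 40, 10, 5], p=20))   # True: 15 < 20
    print(fails_p_percent([100, 40, 30, 10], p=20))  # False: 40 >= 20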

5. K, L, T-Anonymisation: This family of rules (usually written k-anonymity, l-diversity and t-closeness) is a form of Statistical Disclosure Control (SDC), balancing the need for useful data against the risk of disclosure. K-anonymity requires every combination of possible values of the key variables to appear in at least ‘k’ records; it effectively sets a threshold on the number of occurrences of each combination of values. There is no definitive value for ‘k’, but common choices are 3 and 5. K-anonymity alone can still be disclosive where the sensitive value is the same for everyone in a group. For example, assume that the sensitive information is that a patient has cancer: k-anonymity requires at least ‘k’ such records in each age group, but if there are no respondents in an age group who do not have cancer, it is known that all ‘k’ individuals in that group have cancer. ‘L’-diversity addresses this by placing a further constraint: each equivalence class (group of data units sharing the same key attribute values) must contain multiple distinct values of any variable that is considered sensitive, for example by further breaking the category down into type of cancer. This offers greater variation on one hand, but also provides greater richness of information and can itself reveal information through the distribution of the sensitive values. To deal with this and other problems associated with l-diversity, a further constraint, ‘t’-closeness, has been introduced. To satisfy this criterion, the distribution of each sensitive variable within each equivalence class should be no further than a threshold ‘t’ from that variable’s distribution across the whole dataset.
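
A minimal sketch of checking k-anonymity and l-diversity over a list of records (field names and the values of k and l are illustrative; t-closeness, which compares distributions, is omitted for brevity):

    from collections import Counter

    def is_k_anonymous(records, quasi_identifiers, k=5):
        """True if every combination of quasi-identifier values occurs
        in at least k records."""
        combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return all(n >= k for n in combos.values())

    def is_l_diverse(records, quasi_identifiers, sensitive, l=2):
        """True if every equivalence class contains at least l distinct
        values of the sensitive variable."""
        classes = {}
        for r in records:
            key = tuple(r[q] for q in quasi_identifiers)
            classes.setdefault(key, set()).add(r[sensitive])
        return all(len(vals) >= l for vals in classes.values())

    records = [
        {"age_band": "25-34", "postcode": "2600", "diagnosis": "cancer"},
        {"age_band": "25-34", "postcode": "2600", "diagnosis": "cancer"},
        {"age_band": "25-34", "postcode": "2600", "diagnosis": "asthma"},
    ]
    print(is_k_anonymous(records, ["age_band", "postcode"], k=3))             # True
    print(is_l_diverse(records, ["age_band", "postcode"], "diagnosis", l=2))  # True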

6. Suppression: Removing Quasi-Identifiers (for example, significant dates, professions, income, etc.) that are unique to an individual, or that in combination with other information are reasonably likely to identify an individual, so that the data are only partially released. Suppression can usefully be applied as a de-identification technique to cells that do not meet the Threshold Rule; the suppressed cell may be annotated with an ‘X’ or ‘Not Released’, for example. Sometimes secondary or consequential suppression will also be required, such as where one cell in a table is suppressed but a cell total is published; secondary suppression prevents the suppressed value from being derived from the remaining information. A useful way of masking the true values of cells is to suppress just the primary cell and then amend the totals in the rows and columns to state that the total is greater than a certain figure, without giving the actual figures. This method also prevents users from determining a maximum and minimum value for cells (i.e. a bounded range that the true value must lie within). Suppressing values or removing records that cannot be protected will impact the quality of the data, as it limits the number of variables available for analysis. Aggregate data are themselves a primary example of suppression, since they are partial releases of the underlying microdata.
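
A sketch of primary suppression against a threshold, assuming a table of counts held as nested lists; the secondary suppression of totals described above would follow as a separate pass:

    def suppress_low_counts(table, threshold=5, marker="X"):
        """Primary suppression: replace counts below the threshold with a
        marker. Zero cells are left as-is here, although as noted above
        they may also need judgement."""
        return [[v if v == 0 or v >= threshold else marker for v in row]
                for row in table]

    table = [[12, 3, 20],
             [ 8, 0,  4]]
    print(suppress_low_counts(table))
    # [[12, 'X', 20], [8, 0, 'X']]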

7. Combining Categories: Combining into broader categories information or data that is likely to enable identification of an individual. A common example is the use of age ranges as opposed to single-year records: ages expressed in ranges (25-35 years) offer greater protection than ages expressed as single years (25, 26 and 27 years). Other options include describing industries at higher levels (e.g. mining rather than coal mining), or combining small territories with larger ones (e.g. ACT with NSW). Where possible, it is always best to combine categories that contain a small number of records, such as those that do not meet the Threshold Rule, so that the identities of individuals in those groups remain protected (e.g. combine the use of electric wheelchairs with manual wheelchairs). It is possible to combine different rows, creating larger category groups, or to combine adjacent columns to increase the number of values within a group. When combining categories, take care that other readily available data cannot be used to undo the combination and recover the original values. Data Owners and Custodians should wherever possible consider other tables or information being released at the same time, likely to be released in the future, or already released. Clearly this is not always possible to know or predict, so the Data Owner should err on the side of caution if there are any concerns.
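
A sketch of combining single-year ages into ranges; the band width and open-ended top band are illustrative choices, not prescribed values:

    def to_age_band(age, width=10, top=80):
        """Collapse a single-year age into a band such as '20-29', with an
        open-ended top band for ages of `top` and above."""
        if age >= top:
            return f"{top}+"
        lower = (age // width) * width
        return f"{lower}-{lower + width - 1}"

    print([to_age_band(a) for a in (25, 26, 27, 83)])
    # ['20-29', '20-29', '20-29', '80+'] -- the three single years are no
    # longer distinguishable from one another.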

8. Top/Bottom Coding: This is another form of combining categories, in which extreme values above an upper limit or below a lower limit are collapsed into an open-ended range, such as an age of ‘less than 15 years’ or ‘more than 80 years’. This is particularly useful for protecting small populations.
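
A sketch of top/bottom coding for an age variable, with illustrative limits of 15 and 80 matching the example above:

    def top_bottom_code(age, bottom=15, top=80):
        """Collapse extreme values into open-ended ranges."""
        if age < bottom:
            return f"<{bottom}"
        if age > top:
            return f">{top}"
        return str(age)

    print([top_bottom_code(a) for a in (9, 42, 91)])  # ['<15', '42', '>80']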

9. Perturbation: Alteration of some or all non-zero cells in a table (often with a random component) that would otherwise be likely to lead to the identification of an individual, such that the aggregate information or data is not significantly affected. This introduces a ‘tolerable error’ into the result: a user may make a reasonable guess at the true value of a cell, but will be unable to know the value with any certainty due to the added ‘noise’. For count data (frequency tables), a random number may be added to the original values, known as additive perturbation; this is often applicable for non-economically sensitive data. For magnitude tables, the original values can be multiplied by a random number, known as multiplicative perturbation. For example, adding a few dollars to a company’s income is unlikely to change the disclosure risk in any meaningful way, whereas multiplying it by a random amount or percentage will. Perturbation protects the cell values because a user cannot know how much perturbation has been applied, whilst the total remains within an acceptable margin of the original figure.

Both of these types can be further broken down into Global or Targeted approaches.

a. The Global approach means that each non-zero cell (including the totals) is perturbed independently. This means that there may not be consistency within a table (i.e. the totals do not equal the sum of the constituent cells).

b. Targeted perturbation treats only those cells that are deemed to be a disclosure risk, but this may require secondary perturbation to maintain additivity within the table. For example, it is possible to remove an amount from one cell and add half of it to each of two other cells, so that information loss is minimised and the totals remain accurate while the disclosure risk is still reduced. Another method is to move the value removed from one cell into an entirely new cell, creating a new sundry category. Care must be taken when using any of these methods, however, as all tables containing this data need to have matching perturbed values in order to prevent disclosure.

It is important to remember that all perturbation may adversely impact the usefulness of the data. Although the totals in a table may remain very close to their originals, some cell values may have changed by as much as 100%. When releasing perturbed data, the supporting information must make clear the process by which the data has been perturbed (although exact parameters should not be disclosed, to prevent the treatment being reversed). A caveat should also be added that any further analysis of the perturbed data may not give accurate results. This is catered for when populating the Metadata pages of the wiki for the Self-Deposit process.
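
A sketch of additive perturbation for counts and multiplicative perturbation for magnitudes, as described above; the noise ranges shown are illustrative, and real parameters would be chosen to suit the data and, as just noted, not published:

    import random

    rng = random.Random(12345)  # fixed seed shown only so the example is repeatable

    def perturb_count(count, max_noise=2):
        """Additive perturbation for frequency tables: add a small random
        integer to each non-zero count, keeping the result non-negative."""
        if count == 0:
            return 0
        return max(0, count + rng.randint(-max_noise, max_noise))

    def perturb_magnitude(value, spread=0.05):
        """Multiplicative perturbation for magnitude tables: scale the value
        by a random factor within +/- spread (here 5%)."""
        return value * rng.uniform(1 - spread, 1 + spread)

    print([perturb_count(c) for c in [12, 0, 7, 3]])  # zeros stay zero
    print(round(perturb_magnitude(1_250_000), 2))     # e.g. a firm's turnover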

10. Rounding: This is probably the simplest approach to data modification. A value is rounded to a specified base, so that all released values are divisible by the same number (often 3, 5 or 10); counts that are already a multiple of the base remain unchanged. The data are still numerical (i.e. contain no symbols or letters), which is a practical advantage for users requiring machine readability. Though the data lose some utility, rounding brings advantages from a confidentiality point of view, as users do not know whether a rounded value of three, for example, was originally a 2, 3 or 4. Users also will not know whether zeros are true zeros, which mitigates the problem of group disclosure where the original values may have breached disclosure rules.

Normally, the margins are rounded according to the same method, so in many cases rounding does not produce an additive table (i.e. one where the rows, columns and totals are consistent). Even if the true grand total or marginal totals were known from other sources, the user is still unlikely to be able to calculate the true values of the internal cells. Non-additivity is also the main disadvantage of rounding, as it creates inconsistency within the table: the cumulative effect of rounding each cell can push the summed totals significantly above or below the real value, which is why the margins are usually rounded as well. Graduated rounding can also be used for magnitude tables, meaning that the rounding base varies with the cell size: some cells may be rounded to base 100 (an original value of 132, say), others to base 10 (an original value of 14). Again, the totals will not equal the sum of the cells, as the total is also rounded, but it is now much harder to estimate the relative contribution of each cell.

A method of rounding that has presentational advantages is to release tables of percentages rather than actual counts. A value of 0% may not refer to 0 units of the population, depending upon the sample size, although it may be possible to determine a range for the value if the sample size is known. Displaying information in this manner preserves the message, as the percentages are still representative of the raw figures without being as exact. Rounding can therefore be very effective at reducing risk for individual tables of counts. Care should still be taken to consider the interactions between multiple outputs of data, particularly additivity and consistency between marginal totals in the different outputs, as these could lead to re-identification.
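
A sketch of rounding counts to base 3, illustrating the non-additivity discussed above (the base is an illustrative choice):

    def round_to_base(n, base=3):
        """Round a count to the nearest multiple of the base; counts that
        are already a multiple of the base are unchanged."""
        return base * round(n / base)

    row = [1, 4, 7]
    rounded = [round_to_base(v) for v in row]
    print(rounded)                  # [0, 3, 6]
    print(sum(rounded))             # 9
    print(round_to_base(sum(row)))  # 12 -- the separately rounded total need
                                    # not equal the sum of the rounded cells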

11. Swapping: To hide a record that may be identifiable by its unique characteristics, swap it with another record that shares some of its other characteristics. For example, someone living in NSW who speaks an uncommon language could have their record moved to Victoria, where the language is more commonly spoken. This allows characteristics to be reflected in the data without risk of re-identification. As a consequence of this method, though, additional data changes may be required: if this is hierarchical data, family-related information for the individual relocated to Victoria would also need to be adjusted so that the individual’s record and the family records remain consistent. Like most data-focussed controls, swapping increases uncertainty. It is particularly strong where multiple data outputs are being released from a single data source. For example, a sample of microdata with coarse geography plus aggregate population tables of counts for fine geography is a common set of census outputs. Modest data swapping amongst the fine geographical areas within the coarser areas means that the microdata itself is unchanged, while the modification of the aggregate data reduces the risk of subtraction attacks.
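
A sketch of swapping a geography value between two otherwise similar records, mirroring the NSW/Victoria example above; the record layout is illustrative:

    def swap_fields(records, i, j, fields):
        """Swap the listed fields between two records, leaving all other
        attributes untouched."""
        for f in fields:
            records[i][f], records[j][f] = records[j][f], records[i][f]

    records = [
        {"state": "NSW", "language": "RareLang", "age_band": "30-39"},
        {"state": "VIC", "language": "English",  "age_band": "30-39"},
    ]
    # Move the unusual language into a state where it is less unusual by
    # swapping geography between two records that share other attributes.
    swap_fields(records, 0, 1, ["state"])
    print(records)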

12. Choice of Variables: An obvious mechanism of disclosure control is excluding certain variables from the released dataset. The Data Custodian can reduce the number of key variables to which an intruder is likely to have access and/or reduce the number of target variables. With microdata, the choice is whether a variable appears in the dataset at all; with aggregate data, the choices are about which variables will be included in each table. If key variables are removed, the re-identification risk is reduced. If target variables are removed, the sensitivity of the data is lessened and the potential impact of any breach is reduced. The offset is that removing a variable can limit the analyses that can be performed.
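
A sketch of restricting the released variables to an approved list; the field names are illustrative:

    def restrict_variables(records, allowed):
        """Release only an approved subset of variables from each record."""
        return [{k: v for k, v in r.items() if k in allowed} for r in records]

    full = [{"name": "A. Citizen", "age_band": "30-39", "income": 52_000}]
    print(restrict_variables(full, allowed={"age_band", "income"}))
    # [{'age_band': '30-39', 'income': 52000}]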

13. Imputation: Creating new values generated from the original data so that the overall totals, values and patterns are preserved, but no value relates to any particular individual. For this to work without adversely affecting the data, it may be necessary to allow the original values to be modelled back in. One critical decision when imputing is what you tell the user: options range from simply stating that the data has been imputed, to stating how many values have been imputed, that a model has been used to conduct the imputation, or even which values have been imputed. However, information regarding how the imputation was carried out, i.e. how the model works, should not be released, as this could allow an individual to reverse-engineer the data treatment technique. Imputation can be an effective method if you are already using a model to generate missing data values. The impact on utility depends upon how good a model has been used to impute the values.
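
A deliberately simple sketch of filling missing values by mean substitution, so that the overall mean is preserved; a production approach would use a richer model, and, as noted above, its details would not be published:

    from statistics import mean

    def impute_missing(values):
        """Replace missing values (None) with the mean of the observed
        values, approximately preserving the overall total and mean."""
        observed = [v for v in values if v is not None]
        fill = mean(observed)
        return [fill if v is None else v for v in values]

    incomes = [52_000, None, 61_000, 48_000, None]
    print(impute_missing(incomes))  # missing entries become ~53,666.67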

14. Encryption or ‘Hashing’ of Identifiers: Techniques that obscure the original identifier rather than removing it altogether, usually for the purposes of linking different datasets together (but without sharing the information in an identified form).
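
A sketch of keyed hashing (HMAC-SHA256) of an identifier using Python’s standard library. The key shown is a placeholder; a keyed hash is used because an unkeyed hash of a common identifier can be recovered by exhaustive guessing:

    import hashlib
    import hmac

    SECRET_KEY = b"replace-with-a-long-random-secret"  # illustrative placeholder

    def hash_identifier(identifier: str) -> str:
        """Obscure an identifier with HMAC-SHA256. The same input always
        yields the same output, so hashed identifiers can still be used to
        link records across datasets, but the original value cannot be
        recovered without the secret key."""
        return hmac.new(SECRET_KEY, identifier.encode("utf-8"),
                        hashlib.sha256).hexdigest()

    print(hash_identifier("jane.citizen@example.com"))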