File & Folder Naming Conventions
Latest revision as of 01:38, 5 May 2021
It is important that each version of the data and its supporting documentation is clearly identified using the correct file naming convention. This helps to prevent erroneous access conditions from being applied to the published data, and potentially unauthorised access to the material. All data files and supporting documentation that are to be uploaded to Dataverse should therefore be named in accordance with the ADA guidance. This also makes the files easily identifiable to ADA Staff and standardises the naming convention across datasets and Dataverses.
Dataverse Uploads
All data and their supporting documentation should be simple to locate and identify. Dataverse automatically lists uploaded files in numerical and alphabetical order. Therefore, in order to list materials in a more meaningful order, the ADA have developed a standardised naming convention for files.
ADA Standard Naming Convention
The ADA standardised naming convention for Self-Deposit files uses the following format:
DataverseNumber_StudyName_Year_StudyArtefact_ADAID
Note that the Dataverse application will append a file extension to the name during the upload process.
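As a minimal sketch of the format above, the five components can be joined with underscores. The function name `build_file_name` and its parameters are hypothetical and are not part of any ADA or Dataverse tooling. The sketch replaces spaces inside a component with underscores (this guidance accepts either) and falls back to the placeholder ‘ADAID’ when no identification number is supplied.

```python
def build_file_name(dataverse_number, study_name, year, study_artefact, ada_id="ADAID"):
    """Join the five naming components with underscores.

    Hypothetical helper for illustration only. No file extension is
    appended here, because the guidance says Dataverse adds it on upload.
    """
    parts = [dataverse_number, study_name, str(year), study_artefact, ada_id]
    # Spaces inside a component are replaced with underscores (a choice
    # made for this sketch; the guidance accepts spaces as well).
    return "_".join(part.strip().replace(" ", "_") for part in parts)


print(build_file_name("1", "ANUPoll", 2018, "Questionnaire", "01212"))
print(build_file_name("99_ARCHIVE_ONLY", "ANUPoll", 2018, "Signed Consent Form"))
```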
Dataverse Number:
The number 0, 1, 2, 3 or 99, indicating the order that the files are to be arranged in.
- “0” is to be applied to all licensing information files (e.g. License Agreement Forms, License Terms and Conditions of Use Supporting Information, License Access Guestbook Supporting Information).
- “1” is applied to all other supporting document files (e.g. Questionnaires, Codebooks, Technical documents).
- All Data files (including all Double-Zipped Data files e.g. SPSS, Stata, SAS, CSV and Excel format files) are to be labelled with the prefix “2”.
- "3" is used by ADA Staff to identify Processing Reports that have been raised during the curation of your data. Processing Reports will be archived as part of the Archival Information Pack (AIP) for the dataset, and Data Owners should only see these files within a dataset if it has been returned to them for corrective action.
- Files that are uploaded for ARCHIVING ONLY as part of the Submission Information Pack (SIP), and which are not to be included as part of the published dataset are to be labelled with the Dataverse Number "99_ARCHIVE_ONLY".
Due to compatibility issues with Dataverse, all individual data files and any supporting documentation files that are in any of the following formats must be double-zipped prior to uploading to the dataset to preserve their formatting. Where multiple files are uploaded to the dataset as a folder and at least one of those files is of one of the following formats, the folder will need to be double-zipped.
File Formats requiring Double-Zipping prior to upload to a dataset are:
- SPSS
- SAS
- Stata
- CSV
- Excel
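The following is a minimal sketch of double-zipping a single file, assuming that "double-zipped" means a zip archive placed inside a second zip archive. The function name `double_zip` and the output path are hypothetical; only Python's standard `zipfile`, `tempfile` and `pathlib` modules are used. Folders are not handled by this sketch.

```python
import tempfile
import zipfile
from pathlib import Path


def double_zip(source: Path, outer: Path) -> Path:
    """Zip `source`, then zip that archive again into `outer`.

    Assumption: "double-zipped" means an archive inside an archive.
    Handles a single file only.
    """
    with tempfile.TemporaryDirectory() as tmp:
        # Inner archive: contains the original file under its own name.
        inner = Path(tmp) / (source.stem + ".zip")
        with zipfile.ZipFile(inner, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.write(source, arcname=source.name)
        # Outer archive: contains only the inner archive.
        with zipfile.ZipFile(outer, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.write(inner, arcname=inner.name)
    return outer


# Example call (paths are illustrative):
# double_zip(Path("survey.csv"), Path("upload/survey.zip"))
```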
Folders containing multiple files
Note - Although it is possible to upload multiple files in a folder to a Dataset, it is not recommended unless absolutely necessary as in doing so you reduce the findability of the files contained within the folder. This is due to the fact that both File Tags and Description Notes can only be added at the folder level in this case. For more information refer to: Adding File Tags and Description Notes.
In certain cases, a complete double-zipped folder of supporting documentation or data files may be produced (e.g. many longitudinal studies will have multiple files and it would be time consuming to double-zip and upload each file individually). Where packages of such files are to be uploaded as a single folder, the folder should be annotated with the term "-Z" immediately following the Dataverse Number. For example, a double-zipped folder of supporting documents that contains at least one file that would require double-zipping if uploaded individually would be identified with the prefix "1-Z".
Each file contained within the folder should still be individually named using the standard naming convention detailed in this section. In addition, as all individual data files in one of the aforementioned file formats are already required to be double-zipped, there is no need to identify these separately with a "-Z" addendum. This term is used purely to identify packages of files uploaded as a folder that contain one or more of these types of file.
Where a large number of Supporting Documents (i.e. greater than 40-50) are to be uploaded together as a folder, as long as the files within the folder would not require double-zipping if they were uploaded individually (for example the folder is a package of supporting documentation files that are not SPSS, SAS, Stata, CSV or Excel file formats), then the folder need not be double-zipped. A good example of this would be the License Document suite of forms. These could be uploaded directly as a folder, with no double-zipping required.
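The folder rule above can be sketched as a simple check: a folder needs double-zipping as soon as one of its files is in a listed format. The extension set below is an assumption, since this guidance names the formats (SPSS, SAS, Stata, CSV, Excel) but not their extensions, and the function names are hypothetical.

```python
from pathlib import Path

# Assumed mapping from the listed formats to common file extensions:
# SPSS (.sav, .por), SAS (.sas7bdat), Stata (.dta), CSV (.csv), Excel (.xls, .xlsx).
DOUBLE_ZIP_EXTENSIONS = {".sav", ".por", ".sas7bdat", ".dta", ".csv", ".xls", ".xlsx"}


def needs_double_zip(file_name: str) -> bool:
    """True if a single file is in one of the listed formats."""
    return Path(file_name).suffix.lower() in DOUBLE_ZIP_EXTENSIONS


def folder_needs_double_zip(file_names) -> bool:
    """True if at least one file in the folder is in a listed format."""
    return any(needs_double_zip(name) for name in file_names)


print(folder_needs_double_zip(["Licence_Form.pdf", "Guestbook.docx"]))
print(folder_needs_double_zip(["Codebook.pdf", "Wave1.dta"]))
```

A folder for which the check is true would take the "-Z" annotation; a folder of PDF licence forms would not.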
Files or Folders containing Sensitive or Personal Information
The suffix ‘-S’ is to be used to identify data files that contain Sensitive Information or Personal Information. Typically this is used to differentiate between data files that may be available for public release as Open Data and those that contain information requiring managed access. Many Data Owners will choose to upload both an open version of their data and a version that requires some form of access restriction. The former is likely to need far fewer safeguards when shared, and therefore requires less management and maximises the potential benefits of sharing the data.
Data Files used to create a Derived Dataset
Finally, for derived datasets (i.e. those made up from multiple sources of separate data), the suffixes ‘a’ through ‘z’ should be used to identify the individual data files used in the creation of the derived data file. Thus, if a new data file was created by linking data from the ATO and Medicare, the ATO data may have the Dataverse Number ‘2a’ whilst the Medicare data may have the Dataverse Number ‘2b-S’, the latter denoting that the data is also Sensitive.
Dataverse Number Examples:
0. Licensing Information files (License Agreement Form, License Terms and Conditions of Use, License Access Guestbook)
1. Supporting Documentation (Questionnaires, Codebooks, Technical documents, Data Dictionary etc. Remember to double-zip any files in SPSS, Stata, SAS, CSV or Excel formats)
2. Data files (all CSV, Stata, SPSS, SAS and Excel file formats are to be double-zipped prior to upload)
2a. Data file used in creation of a derived data file (all CSV, Stata, SPSS, SAS and Excel file formats are to be double-zipped prior to upload)
3. Processing Report created during the curation of the dataset and used for the recording of issues that require action prior to publication. Processing Reports may also include recommendations made by the ADA to improve the quality of the data.
99_ARCHIVE_ONLY. Files uploaded for archiving with the ADA only (for example Signed Consent forms or original interview transcripts that are not to be published as part of the dataset but are submitted to complete the deposit and to retain a full audit trail)
-S. The ‘S’ suffix, when displayed after the Dataverse Number, denotes that the Data file contains ‘Sensitive’ data.
-Z. The ‘Z’ suffix, when displayed after the Dataverse Number, denotes that multiple files are contained in a double-zipped folder. It also indicates that at least one of these files would require double-zipping if uploaded individually.
By way of example, the Dataverse Number ‘2-SZ’ would denote a double-zipped package of data files, which contain sensitive data.
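As a rough sketch, the Dataverse Number variants listed above can be recognised with a regular expression. The grammar used here is inferred from the examples in this section (‘2a’, ‘2b-S’, ‘1-Z’, ‘2-SZ’) and is an assumption: it accepts the combined suffix only in the order ‘-SZ’, and the names `DataverseNumber` and `parse_dataverse_number` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class DataverseNumber(NamedTuple):
    base: str                     # "0", "1", "2", "3" or "99_ARCHIVE_ONLY"
    source_letter: Optional[str]  # "a".."z" for files feeding a derived data file
    sensitive: bool               # True when the "-S" suffix is present
    zipped_folder: bool           # True when the "-Z" suffix is present


# Inferred grammar: base, optional source letter, optional -S / -Z / -SZ.
_PATTERN = re.compile(
    r"(?P<base>0|1|2|3|99_ARCHIVE_ONLY)(?P<letter>[a-z])?(?:-(?P<flags>SZ|S|Z))?"
)


def parse_dataverse_number(text: str) -> DataverseNumber:
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised Dataverse Number: {text!r}")
    flags = match.group("flags") or ""
    return DataverseNumber(
        match.group("base"), match.group("letter"), "S" in flags, "Z" in flags
    )


print(parse_dataverse_number("2-SZ"))
print(parse_dataverse_number("2b-S"))
```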
Study Name:
This relates to the name of the Project or Study that the dataset(s) belong to. If the full name is unreasonably long this can be abbreviated. For example, ‘The Australian Longitudinal Study on Women’s Health’ is abbreviated to ‘ALSWH’. Where the Study Name is more than a single word, you can either leave the space between words or use an underscore to separate the words. Either is accepted in Dataverse.
Year:
Refers to the year that the Project or Study was conducted in. If this spans multiple years you can enter the period in question. For example, 2018-19.
Study Artefact:
This should refer to the specific item in question. For supporting documentation it could be the item (e.g. Questionnaire, Codebook or Report); for data this is typically the file type (e.g. SAS, SPSS or Stata Data File). Where the Study Artefact is more than a single word, for example ‘Plain Language Statement’, you can either leave the space between words or use an underscore to separate the words. Either is accepted in Dataverse. All Processing Report files should be assigned the study artefact name ‘Processing Report’. Where there is more than one Processing Report in the dataset, each should be differentiated using study artefact terms such as ‘Processing Report 1 of 2’ or ‘Processing Report Sensitive Data Version’.
ADAID:
This refers to the five-digit ADA Identification number assigned to the Project or Study. For Self-Deposits it is unlikely that this number will have been allocated, although it may have been provided by the ADA Archivist when the ‘Shell Dataverse and dataset(s)’ were created. If it has not been provided, use ‘ADAID’ and the ADA Archivist will enter the correct identification details once you have uploaded the file to the dataset.
File Extension:
A file extension (e.g. .pdf, .zip and .xlsx) must be present for every file contained and listed within a dataset. This will be automatically added by Dataverse during the upload process and should not be added by the Data Owner.
Correctly named file examples:
A full example of a correct file name, adhering to the above naming conventions where the ADAID is known, is: ‘1_ANUPoll_2018_Questionnaire_01212.pdf’.
Where the ADAID is unknown and the Study Artefact is multiple words, a correct file name would be: '2_ANUPoll_2018_Data_File_Number_1_ADAID.CSV'.
Where the ADAID is unknown and the file is uploaded for Archiving Only, the correct file name would be: '99_ARCHIVE_ONLY_ANUPoll_2018_Signed_Consent_Form_ADAID.pdf'.
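The examples above can be split back into their components with a regular expression, sketched here under some assumptions that go beyond this guidance: components are separated by underscores rather than spaces, the Year is four digits with an optional range such as ‘2018-19’, and the Year is the first such token in the name. The names `NAME_PATTERN` and `split_file_name` are hypothetical.

```python
import re

# Assumed pattern: number_study_year_artefact_adaid[.extension]
NAME_PATTERN = re.compile(
    r"(?P<number>(?:0|1|2|3|99_ARCHIVE_ONLY)[a-z]?(?:-(?:SZ|S|Z))?)_"
    r"(?P<study>.+?)_"
    r"(?P<year>\d{4}(?:-\d{2,4})?)_"
    r"(?P<artefact>.+)_"
    r"(?P<adaid>\d{5}|ADAID)"
    r"(?P<ext>\.[A-Za-z0-9]+)?"
)


def split_file_name(name: str):
    """Return the components as a dict, or None if the name does not fit."""
    match = NAME_PATTERN.fullmatch(name)
    return match.groupdict() if match else None


for example in (
    "1_ANUPoll_2018_Questionnaire_01212.pdf",
    "2_ANUPoll_2018_Data_File_Number_1_ADAID.CSV",
    "99_ARCHIVE_ONLY_ANUPoll_2018_Signed_Consent_Form_ADAID.pdf",
):
    print(split_file_name(example))
```

A name that returns `None` is not necessarily wrong, since the pattern is stricter than the written convention; it only flags names worth checking by hand.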