Storage & Integrity

From ADA Public Wiki
Revision as of 00:26, 9 September 2024 by JMcDougall

The ADA archival workflow [34] outlines processes to manage the integrity of the data and metadata flow through the Archive. The ADA Workflow and Storage Diagram [47] reflects the distinct deposit, ingest, curation, access, and storage locations for each of the archival phases.

ADA Archive Training

The ADA Archivist team are trained with respect to the OAIS Reference Model and how it is implemented within ADA’s technical architecture:

  • Deposited data = SIP
  • Ingest (curation, processing, data management, preservation) = AIP
  • Access = DIP

The Archivist team members are trained to know where and how the data is stored for each of the SIP, AIP and DIP, and actively contribute to ongoing documentation and development of processes:

  • The archivist team are also trained in accessing and processing the raw Information Packages within ADA’s secure Remote Desktop Service.

The ADA access management team manages access to the DIP (dissemination version). The team is trained in the documented processes and Business Rules that determine whether an applicant’s request to access the DIP is granted or rejected. The technical management team follows established, informally documented workflows to contact the NCI Helpdesk when any storage-related issues arise.

Data & Storage Management

The Dataverse software [49] supports reporting, data management and auditing through user accounts that control access, authentication, and permissions to edit, upload, and download data. ADA previously developed the ADA Data Processing Tool (ADAPT) [6] based on the OAIS Reference Model [42]. ADAPT enables archivists to programmatically manage the movement of data and metadata between Dataverse instances and archival storage in order to manage data integrity.

  • Each user-actioned ADAPT function creates or appends to a log using a standard provenance ontology (PROV-O) [3]. The log is stored as part of the AIP so that archival activities actioned through ADAPT can be audited if required.
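
The create-or-append logging behaviour described above can be sketched as follows. This is an illustration only: the JSON Lines layout, the function name append_provenance and the choice of fields are assumptions, not ADAPT's actual log format. Only the prov: terms (prov:Activity, prov:wasAssociatedWith, prov:used, prov:endedAtTime) are taken from the PROV-O vocabulary.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_provenance(log_path: Path, agent: str, action: str, used: list[str]) -> dict:
    """Append one PROV-O style activity record to a JSON Lines log.

    Hypothetical helper: the record layout is an assumption made for
    illustration and is not taken from ADAPT.
    """
    record = {
        "@type": "prov:Activity",
        "rdfs:label": action,                 # name of the function that was run
        "prov:wasAssociatedWith": agent,      # the archivist who actioned it
        "prov:used": used,                    # files or datasets the action touched
        "prov:endedAtTime": datetime.now(timezone.utc).isoformat(),
    }
    # Append mode creates the log on first use and extends it afterwards.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

Because each call only appends, earlier entries are never rewritten, which is the property an audit trail stored with the AIP would need.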

ADAPT is moving towards its third version as ADA continues to develop strategies that reduce the risks of manual data and metadata management by building that management into a software application. Data and metadata are stored with each Dataverse instance and versioned when a dataset is published. ADAPT is used to move data and metadata between the Dataverse instances as required at each archival phase.

Dataverse exports metadata in a number of formats. The JSON export format of the DDI metadata also includes Dataverse system metadata, such as the fixity values (MD5 checksums) of uploaded data files. Archivists export the JSON metadata during the Ingest and Publish phases. The JSON export is copied to archival storage to preserve the original (SIP) metadata and data files and the published (DIP) metadata and data files, so that the integrity of digital objects can be verified from deposit to access and any changes to the data detected.

The archivist Data Curation Process [34] encapsulates data processing, including superseded and new versions of data, with support from the version control built into the Dataverse software.
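
A minimal sketch of how recorded MD5 values could be checked against files held in archival storage is given below. It assumes the expected checksums have already been read out of the JSON export into a plain mapping of file name to MD5 value. How that mapping is extracted, and the function names md5_of and verify_fixity, are assumptions made for illustration and are not part of ADAPT or Dataverse.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(expected: dict[str, str], directory: Path) -> dict[str, str]:
    """Compare recorded MD5 values with the files found in a directory.

    `expected` maps file name -> recorded MD5 (hypothetical input shape).
    Returns a status per file name: "ok", "mismatch" or "missing".
    """
    report = {}
    for name, recorded in expected.items():
        candidate = directory / name
        if not candidate.is_file():
            report[name] = "missing"
        elif md5_of(candidate) == recorded.lower():
            report[name] = "ok"
        else:
            report[name] = "mismatch"
    return report
```

Any status other than "ok" would indicate that a stored copy no longer matches the checksum recorded at upload.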

The Repository's Strategy For Multiple Copies

The assumption is that ‘multiple copies’ means ‘multiple copies of data’. ADA has defined a process for the Archivist team to follow to manage the different copies of the data corresponding to OAIS:

  • A dataset deposited to ADA via Deposit Dataverse is assigned a unique ADAID
  • SIP:
    • Copy of deposited data on Deposit Dataverse. Uploading to Dataverse creates a copy in the backend server directory /files/xxxxxx/ on deposit.ada.edu.au (where xxxxxx is the DOI suffix created by Dataverse for the dataset)
    • Copy of deposited data stored in NCI ADA-only project storage (../<ADAID>/original)
    • The files in the Dataverse /files/xxxxxx directory will be the same as the files in the NCI /<ADAID>/original/ directory
  • AIP:
    • Dataset + files + datafiles copied from deposit.ada.edu.au -> dataverse-test.ada.edu.au
    • Copying to Test Dataverse creates a copy in the backend server directory /files/yyyyyy/ on dataverse-test.ada.edu.au (where yyyyyy is the DOI suffix created by Dataverse for that dataset)
    • Copy of files + datafiles created in NCI ADA-only project storage (../<ADAID>/processing)
    • The files on dataverse-test and /files/yyyyyy are dynamic while the archivists process these files
  • DIP:
    • Once AIP processing is complete, Dataset metadata is copied from dataverse-test.ada.edu.au -> Production dataverse.ada.edu.au
    • Copying the files + datafiles creates a copy in the backend server directory /files/zzzzzz/ on dataverse.ada.edu.au (where zzzzzz is the DOI suffix created by Dataverse for that dataset)
    • The files in /files/zzzzzz on dataverse.ada.edu.au are exported and copied to the top level of NCI ../<ADAID>/ for preservation
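
The expectation that a Dataverse backend directory and its NCI counterpart hold the same files (for example /files/xxxxxx and /<ADAID>/original) could be checked along the following lines. This sketch compares file contents rather than file names, on the assumption that names may differ between the two locations. It also assumes the backend directory contains only the deposited files. The function names are hypothetical.

```python
import hashlib
from collections import Counter
from pathlib import Path


def content_digests(directory: Path) -> Counter:
    """Count the MD5 digests of every regular file below a directory.

    Files are read whole, which is adequate for a sketch but not for
    very large data files.
    """
    digests = Counter()
    for path in directory.rglob("*"):
        if path.is_file():
            digests[hashlib.md5(path.read_bytes()).hexdigest()] += 1
    return digests


def copies_match(dataverse_dir: Path, nci_dir: Path) -> bool:
    """True when both directories hold the same file contents, ignoring names."""
    return content_digests(dataverse_dir) == content_digests(nci_dir)
```

Using a Counter rather than a set means duplicate files are counted, so a copy that is missing one of two identical files is still reported as a mismatch.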

When data for a dataset that has been published on Production dataverse.ada.edu.au (DIP) is to be updated:

  • A new dataset (no metadata) is created on deposit.ada.edu.au into which the depositor deposits the new datafiles; this creates a copy of the files in the /files/bbbbbb directory on the Dataverse server.
  • On the NCI /<ADAID>/ storage:
    • A directory /superseded_<creation_date> is created
    • All of the files from the previous round of data being deposited, processed and disseminated are moved into the /superseded_<creation_date> folder
    • The new data is moved using ADAPT from Deposit (deposit.ada.edu.au) into /<ADAID>/original, and the deposit -> processing -> publish workflow starts again
    • To process the AIP, the files for the dataset on dataverse-test that was previously created are replaced with the new deposit files
  • The new DIP files from /<ADAID>/ are uploaded to the existing dataset on dataverse.ada.edu.au, and that dataset is versioned up to a new major version (x.0). This also places the new files in the /files/zzzzzz directory on the Dataverse server.
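
The NCI-side step of the update procedure above, moving the previous round of files into a /superseded_<creation_date> directory, can be sketched as below. The ISO date format in the directory name, the decision to leave earlier superseded_ directories where they are, and the function name are all assumptions. The source does not specify them.

```python
import shutil
from datetime import date
from pathlib import Path


def supersede_previous_round(adaid_dir: Path, creation_date: date) -> Path:
    """Move the current contents of an <ADAID> directory into superseded_<date>.

    Hypothetical helper illustrating the documented procedure; it is not
    part of ADAPT.
    """
    target = adaid_dir / f"superseded_{creation_date.isoformat()}"
    # mkdir() raises FileExistsError rather than silently reusing a folder.
    target.mkdir()
    for entry in list(adaid_dir.iterdir()):
        # Skip the new folder and any earlier superseded rounds (assumption).
        if entry.name.startswith("superseded_"):
            continue
        shutil.move(str(entry), str(target / entry.name))
    return target
```

After this step the <ADAID> directory holds only superseded_ folders, ready for the new deposit to be placed in /<ADAID>/original.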

Backup copies and snapshots of the data on the NCI ADA-only project storage space are created for disaster recovery under the DRHA Policy.

Risk Management Techniques Used To Inform The Strategy

The OAIS model is implemented with data stored for each of the SIP, AIP and DIP in the NCI <ADAID> directories and on the Dataverse backend servers. The archive team are trained to strictly follow documented archive procedures to manage the number of copies across the NCI and Dataverse directories. The ADA workflow procedures [34] minimise the probability of data copies becoming unsynchronised.

Procedures for Handling & Monitoring Deterioration of Storage Media

ADA Archival Storage and Dataverse instances [47] are provisioned, hosted and backed up on NCI servers. NCI has procedures for monitoring bit-level integrity to guard against the deterioration of its storage media. NCI notifies ADA of any planned upgrades and when those upgrades will take place.

ADA’s technical team works with NCI to keep server operating systems updated to supported versions:

  • NCI informs the ADA technical team when a virtual machine’s (VM) operating system (OS) is no longer going to be supported.
  • NCI provisions new VMs when necessary, and the ADA technical team moves Dataverse installations to these new VMs.

Procedures to Ensure Data & Metadata Are Only Deleted Through Approved Processes

Data in the SIP or AIP state can be deleted by the ADA archivist team if directed by depositors, or for legal reasons, with the archiving team maintaining records relating to the request for deletion. Data in the DIP state cannot be deleted.

As datasets on Deposit and Test Dataverses are not published, deleting a dataset from Deposit would not create problems with the temporary non-production DOI that is created for it. Simply deleting the dataset would delete everything relating to it, including the files stored in the backend server /files/xxxxxx directory. The DOI would not have to be tombstoned [4] as the DOI prefix is a fake or test prefix, not ADA’s production DOI prefix.

When a draft dataset is deleted on Deposit or Test, the files are deleted from the Dataverse server /files/xxxxxx directory structure. The data stored in the NCI /<ADAID> subdirectories are deleted by the data archivist team.

For datasets that have been published on ADA’s production Dataverse, ADA does not delete them but rather deaccessions them. This process results in the dataset being labelled as “Deaccessioned” in Dataverse and renders its files accessible only to users with the correct permission levels. The files remain in the Dataverse server /files/zzzzzz/ directory.
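
The rules in this section reduce to a small decision: data in the SIP or AIP state may be deleted when a deletion request is on record, while data in the DIP state is deaccessioned instead. The sketch below encodes that reading. The names Phase and removal_action, and the "refuse" outcome for a request with no record, are assumptions made for illustration.

```python
from enum import Enum


class Phase(Enum):
    """OAIS information package states used in this section."""
    SIP = "SIP"
    AIP = "AIP"
    DIP = "DIP"


def removal_action(phase: Phase, request_recorded: bool) -> str:
    """Return the action permitted for a removal request.

    "deaccession": published (DIP) data is never deleted.
    "delete": SIP or AIP data with a recorded deletion request.
    "refuse": no recorded request (an assumed outcome, not in the source).
    """
    if phase is Phase.DIP:
        return "deaccession"
    if not request_recorded:
        return "refuse"
    return "delete"
```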

References