Data Platform/Dataset archiving and deletion

From Wikitech
Datasets deprecation, archiving and deletion workflow diagram

Motivation

Deprecating, archiving, or deleting datasets helps maintain system efficiency, security, and compliance, and improves data relevance and accuracy.

Process

Identification and Review

  • Candidates for deprecation can be nominated at any time. Whenever a team works in an area where a legacy dataset qualifies for deprecation, the dataset should be flagged.
  • As part of quarterly planning, the Data Engineering (DE) team, with input from data stewards and others, identifies candidates for dataset deprecation.
    • Criteria for candidates
      • Expired TTL
      • Stale data (no updates in the last x months)
      • No downstream dependencies
      • Expired ownership
  • The review may also be triggered by deprecation of a legacy system, data migration needs, disk space constraints, or other factors such as usage or dependency metrics.
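
The candidate criteria above can be sketched as a simple check. This is a minimal illustration only: the `DatasetInfo` record, its field names, and the `deprecation_reasons` helper are hypothetical and not part of any existing tooling. The staleness threshold is left as a required parameter because the process does not fix a value for "x months", and the sketch only reports which criteria match; whether one criterion alone is sufficient remains a review decision.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class DatasetInfo:
    """Hypothetical metadata record; field names are assumptions."""
    name: str
    ttl_expires: Optional[date]   # None means no TTL is set
    last_updated: date
    downstream_count: int         # number of known downstream dependencies
    owner: Optional[str]          # None means ownership has expired


def deprecation_reasons(ds: DatasetInfo, today: date,
                        stale_after_days: int) -> List[str]:
    """Return the candidate criteria that this dataset matches."""
    reasons = []
    if ds.ttl_expires is not None and ds.ttl_expires < today:
        reasons.append("expired TTL")
    if today - ds.last_updated > timedelta(days=stale_after_days):
        reasons.append("stale data")
    if ds.downstream_count == 0:
        reasons.append("no downstream dependencies")
    if ds.owner is None:
        reasons.append("expired ownership")
    return reasons
```

A dataset that matches no criterion yields an empty list, so the result can be used directly to filter a list of datasets down to nomination candidates.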

Pre-deprecation Preparations

  • If a data steward exists for the candidate dataset, the steward confirms that the dataset can be deprecated without adverse effects. Otherwise, the Data Engineering team performs the mandatory investigations.
  • This involves:
    • Impact Analysis
      • Identifying and verifying downstream datasets, data pipelines, and dashboards derived from the dataset. Dependency checks are made primarily through DataHub, Airflow, GitHub, and notebook repositories.
    • Phabricator ticket
      • Creating a Phabricator ticket that documents the reasons for deprecation and the affected downstream assets, and tagging relevant stakeholders.
    • Communication (2x)
      • Posting the intent to deprecate on relevant Slack channels (#data-engineering-collab, #working-with-data) and data-related email lists to provide a grace period for feedback. This should be repeated at least once before the deprecation deadline.

Feedback Grace Period

  • The grace period for feedback should be 30 days. Multiple reminders should be sent in advance.
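
The grace period can be turned into concrete dates with a small helper. This is a sketch under stated assumptions: the 30-day period comes from the process above, but the function name and the default reminder offsets (14 and 3 days before the deadline) are illustrative choices, since the process does not specify when reminders should go out.

```python
from datetime import date, timedelta
from typing import List, Sequence, Tuple

GRACE_PERIOD_DAYS = 30  # feedback grace period stated in the process


def grace_period_schedule(
    announced_on: date,
    reminder_days_before: Sequence[int] = (14, 3),  # assumed offsets
) -> Tuple[date, List[date]]:
    """Return the feedback deadline and the reminder dates before it."""
    deadline = announced_on + timedelta(days=GRACE_PERIOD_DAYS)
    # Keep only offsets that fall strictly inside the grace period.
    reminders = sorted(
        deadline - timedelta(days=d)
        for d in reminder_days_before
        if 0 < d < GRACE_PERIOD_DAYS
    )
    return deadline, reminders
```

For an announcement on 2024-03-01, this gives a deadline of 2024-03-31 with reminders on 2024-03-17 and 2024-03-28.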

Deprecation Execution

  • Child tasks are established for deprecating associated dashboards, pipelines, and downstream datasets, and a timeline for completion is set.
  • After confirming all affected data usages have migrated or been deprecated, the data steward officially deprecates or archives the dataset.

Archiving vs. Deletion Decision

  • Post-deprecation, the decision to archive or delete the dataset is made based on:
    1. The impact of keeping the dataset around (e.g., risks, disk space).
    2. The presence of subscribers or dependents.
    3. Core vs. non-core dataset considerations.
  • Official channels (#data-engineering-collab and #working-with-data) are used for communications.

Archiving

  • If archiving is chosen, the dataset is renamed by adding an "_archived" suffix, indicating it is no longer active but still accessible for historical reference or legal compliance.
  • If technically possible, the dataset should be marked as read-only.
  • The archived status is documented in DataHub.
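
The renaming convention can be expressed as a small helper. This is only a sketch of the naming rule: the function name is hypothetical, and the actual rename (and any read-only flag) depends on the storage system holding the dataset, which is not covered here.

```python
ARCHIVED_SUFFIX = "_archived"  # suffix named in the archiving step


def archived_name(name: str) -> str:
    """Return the archived name for a dataset, adding the suffix at most once."""
    if name.endswith(ARCHIVED_SUFFIX):
        return name  # already archived; do not add the suffix twice
    return name + ARCHIVED_SUFFIX
```

Making the helper idempotent means that rerunning an archiving job does not produce names such as `example_dataset_archived_archived`.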

Deletion

  • All data, data definitions, and associated code are removed.
  • The deletion is documented in Phabricator and communicated through established channels.