Data Storage Guidelines

This page provides guidance on what types of data should and should not be stored in GitHub repositories at Imperial College London. Understanding these guidelines is essential for maintaining compliance with data protection regulations and using version control effectively.

What should NOT be stored in GitHub

The following types of data should never be stored in GitHub repositories:

Personal and sensitive data

Personal data subject to GDPR: Names, email addresses, phone numbers, addresses, or any other personally identifiable information (PII) that could identify living individuals.
Live research data containing personal information: Active datasets that include participant information, health records, or other sensitive personal data.
Student records: Academic records, grades, attendance data, or any other information that could identify students.
Staff information: Employment records, HR data, performance reviews, or payroll information.

Credentials and secrets

Passwords and API keys: Never commit passwords, API tokens, authentication keys, or access credentials.
Private keys: SSH private keys, the private keys belonging to SSL/TLS certificates, or any other cryptographic private keys.
Database connection strings: Connection strings containing usernames, passwords, or sensitive server information.

More information on storing secrets securely can be found at Secrets Management.
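One common way to keep credentials out of a repository is to read them from environment variables at runtime. The following is a minimal sketch of that approach, not a prescribed Imperial pattern: the variable names DB_USER and DB_PASSWORD, the helper names, and the PostgreSQL-style URL are all assumptions chosen for illustration.

```python
import os
from urllib.parse import quote


def get_required_secret(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def build_connection_string(host, database):
    """Assemble a connection string without hard-coding any credentials.

    DB_USER and DB_PASSWORD are hypothetical variable names used for illustration.
    """
    user = quote(get_required_secret("DB_USER"), safe="")
    # Percent-encode the password so characters such as '@' or '/' do not break the URL.
    password = quote(get_required_secret("DB_PASSWORD"), safe="")
    return f"postgresql://{user}:{password}@{host}/{database}"
```

The values themselves would then live outside the repository, for example in a local .env file that is excluded from version control or in a secret management service.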

Other sensitive data

Financial data: Bank account details, payment card information, or financial records.
Intellectual property with restrictions: Data or code that is covered by NDAs, embargoes, or other contractual restrictions.
Large datasets: GitHub is not designed for storing large data files. It warns about files larger than 50 MiB and blocks files larger than 100 MiB. Use appropriate data storage solutions instead.
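To catch oversized files before they are committed, a small check over the working directory can help. This is a sketch only: the function name and the default threshold constant are assumptions for illustration, with the threshold set to the 100 MB figure mentioned above (interpreted here as 100 MiB).

```python
from pathlib import Path

# Assumed threshold: the 100 MB limit mentioned above, interpreted as 100 MiB.
SIZE_LIMIT_BYTES = 100 * 1024 * 1024


def find_oversized_files(root, limit_bytes=SIZE_LIMIT_BYTES):
    """Return files under `root` larger than `limit_bytes`, skipping Git's own .git directory."""
    root_path = Path(root)
    oversized = []
    for path in root_path.rglob("*"):
        # Ignore Git's internal object store; only working files matter here.
        if ".git" in path.relative_to(root_path).parts:
            continue
        if path.is_file() and path.stat().st_size > limit_bytes:
            oversized.append(path)
    return sorted(oversized)
```

Calling find_oversized_files(".") from the top of a working copy returns the paths that would need Git LFS or an external data store.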

All repositories on GitHub.com (including private repositories in the ImperialCollegeLondon organisation) are stored on GitHub’s servers in the United States. This means data is subject to US jurisdiction and international data transfer regulations.

Why these guidelines matter

Data protection compliance

GDPR requirements: The UK General Data Protection Regulation (UK GDPR) requires organisations to protect personal data and only process it for legitimate purposes. Storing personal data in Git repositories without proper safeguards can lead to compliance violations.
Data breach risks: Git repositories maintain a complete history of all commits. Even if you delete sensitive data in a later commit, it remains in the repository history and can be accessed.
International data transfers: Because GitHub.com stores data in the US, storing personal data there may constitute an international data transfer, which requires additional safeguards under UK GDPR.

Security considerations

Difficult to remove: Once data is committed to Git, it becomes part of the permanent history. Removing it requires rewriting history, which can be complex and may not be fully effective if the repository has been forked or cloned.
Access control limitations: Even in private repositories, data can be exposed through security breaches, accidental sharing, or when collaborators leave the organisation.
Version control visibility: All collaborators with access to the repository can see the entire commit history, including any sensitive data that was previously committed.

What Git is designed for

Git and GitHub are specifically designed for:

Source code version control

Tracking code changes: Git excels at tracking changes to text-based source code files, allowing you to see who made what changes and when.
Collaboration: Multiple developers can work on the same codebase simultaneously, with Git helping to merge their changes.
Branching and merging: Create experimental branches, work on features in isolation, and merge them back when ready.

Documentation and configuration

Project documentation: README files, user guides, API documentation, and other text-based documentation.
Configuration files: Application settings, build configurations, and infrastructure-as-code definitions (excluding secrets).
Small reference files: Sample data files, test fixtures, and reference materials that support the codebase.

Best practices for using GitHub at Imperial

Use appropriate storage solutions

Research data: Use Imperial’s approved data stores as defined at Saving my files.
Large files: Use Git Large File Storage (LFS) for large files that must be version controlled, or use appropriate cloud storage solutions such as AWS S3 or Azure Blob Storage.
Secrets management: Use tools like GitHub Secrets, Azure Key Vault, or similar secret management services.

Use .gitignore files

Create .gitignore files to prevent accidentally committing files that shouldn’t be in version control:

# Environment variables and secrets
.env
.env.local
*.key
*.pem

# Data files
*.csv
*.xlsx
data/
datasets/

# Large files
*.zip
*.tar.gz
*.pdf

Regular security reviews

Regularly review your repositories for accidentally committed secrets or sensitive data.
Read proposed changes carefully before committing them to ensure no sensitive data exists.
Consider using automated secrets scanning tools that block commits containing secrets. More information can be found in the guidance on automated secrets scanning tools.
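To illustrate what such a review looks for, the sketch below scans text for a few well-known secret shapes. It is not a substitute for a dedicated scanning tool: the three patterns, their labels, and the function name are assumptions chosen for illustration, and real scanners use far larger rule sets.

```python
import re

# Illustrative patterns only; a dedicated scanner covers many more credential formats.
SECRET_PATTERNS = {
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "hard-coded password": re.compile(
        r"\b(password|passwd|secret|api_key)\s*[:=]\s*['\"][^'\"]{4,}['\"]",
        re.IGNORECASE,
    ),
}


def scan_text(text):
    """Return (line_number, label) pairs for every line that matches a pattern."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, label))
    return findings
```

A check like this could be run over the output of a staged diff before committing. Any finding should be treated as a prompt to look more closely, since simple patterns produce both false positives and false negatives.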

If you accidentally commit sensitive data

If you accidentally commit sensitive data to a repository:

Don’t panic, but act quickly.
Remove the data immediately by following GitHub’s guide on removing sensitive data.
Rotate any exposed credentials immediately (change passwords, regenerate API keys, etc.).
Contact the ICT Service Desk if the data is highly sensitive or if you need assistance.
Consider the repository compromised: If it was a public repository or was cloned by others, assume the data has been exposed.

Simply deleting a file in a new commit does NOT remove it from Git history. You must use specialised tools to rewrite history and remove the sensitive data from all commits. Even then, clones or forks of the repository may still contain the data, so you should treat it as compromised.
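To see why deletion alone is not enough, it can help to list every path that has ever been added to a repository, including files that no longer exist in the latest commit. The sketch below assumes the git command-line client is installed and that it is run against a local clone; the function names are hypothetical.

```python
import subprocess


def parse_name_only_output(log_output):
    """Collect the unique, non-empty paths from `git log --name-only` output."""
    return {line.strip() for line in log_output.splitlines() if line.strip()}


def paths_ever_added(repo_dir):
    """Return every path added by any commit reachable from any branch or tag."""
    result = subprocess.run(
        # --diff-filter=A keeps only additions; the empty format hides commit headers.
        ["git", "log", "--all", "--diff-filter=A", "--name-only", "--pretty=format:"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_name_only_output(result.stdout)
```

If a file containing secrets was committed and then deleted in a later commit, its path would still appear in the set returned by paths_ever_added, because the commit that added it remains in the history.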