Skip to content

ReCoDE Exemplar - Data-Scarce Behavioural Anomaly Detection

This exemplar provides a complete pipeline for unsupervised anomaly detection applied to univariate time series data. Using the InternalBleeding14 dataset from the UCR Time Series Anomaly Archive, the project demonstrates techniques for detecting irregular patterns in physiological-style sensor recordings, where normal operating conditions are occasionally interrupted by anomalous deviations. The exemplar guides learners through data preparation, preprocessing, Isolation Forest modelling, dimensionality reduction with PCA, clustering with HDBSCAN, model interpretation, and ethical considerations when analysing scarce or sensitive time series data. The exemplar is fully modular, industry-aligned, and reproducible for academic and applied machine learning use cases.

Scikit Camera Image

(Visual representative image to be inserted - PCA cluster visualisation with anomaly overlays will be included after Week 6 visual refinement.)

This exemplar was developed at Imperial College London by Duke T J Ludera in collaboration with Saranjeet Kaur S S Bhogal from Research Software Engineering and Dr. Jianliang Gao from Research Computing & Data Science at the Early Career Researcher Institute.

Learning Outcomes πŸŽ“

Upon completion, students will:

  • Preprocess univariate time series data for anomaly detection.
  • Implement Isolation Forest and HDBSCAN clustering in unsupervised anomaly detection contexts.
  • Interpret model outputs, identify anomalous deviations, and reflect on ethical challenges in modelling scarce time series data.

Target Audience 🎯

  • Postgraduate students
  • Early career data scientists
  • Researchers working on time series analysis, anomaly detection, fraud detection, operational monitoring, or applied machine learning pipelines.

Prerequisites βœ…

Academic πŸ“š

  • Python programming (intermediate level)
  • Familiarity with machine learning (unsupervised models, clustering)
  • Introductory understanding of anomaly detection and time series concepts

System πŸ’»

  • Python 3.10+
  • Anaconda or virtualenv recommended
  • Disk space: ~2 GB
  • RAM: 8 GB or higher

Hardware or HPC requirements

  • Standard desktop or laptop (no HPC required)

Getting Started πŸš€

  1. Clone this GitHub repository.
  2. Install environment using provided requirements.txt file.
  3. Launch Jupyter Notebook environment.
  4. Work through notebooks in sequence:
    notebooks/
    β”œβ”€β”€ 01_dataset_preparation.ipynb
    β”œβ”€β”€ 02_preprocessing_and_baseline_iforest.ipynb
    β”œβ”€β”€ 03_dimensionality_and_clustering.ipynb
    β”œβ”€β”€ 04_model_interpretation_and_explanation.ipynb
    β”œβ”€β”€ 05_ethical_reflection.ipynb
    β”œβ”€β”€ 06_visual_polishing_and_citations.ipynb
    β”œβ”€β”€ 07_reproducibility_and_environment_testing.ipynb
    └── 08_finalised_summary_notebook.ipynb
    
  5. Follow the markdown guidance and embedded exercises in Notebooks 02, 03, and 06, where the newly added Fantasia dataset is used for practical exploration of unsupervised anomaly detection methods.
  6. Review ethical reflection sections in Week 5.

Disciplinary Background πŸ”¬

This exemplar sits at the intersection of anomaly detection, unsupervised machine learning, and time series data science. While the dataset originates from physiological sensor measurements, it is applied here as a general case of unsupervised anomaly detection on sparse time series. The exemplar demonstrates practical techniques applicable to fraud detection, operational monitoring, industrial equipment diagnostics, and public sector data analysis.

Software Tools πŸ› οΈ

  • Python 3.x
  • pandas
  • numpy
  • scikit-learn
  • HDBSCAN
  • matplotlib
  • seaborn

Project Structure πŸ—‚οΈ

.
β”œβ”€β”€ notebooks
β”‚   β”œβ”€β”€ 01_dataset_preparation.ipynb
β”‚   β”œβ”€β”€ 02_preprocessing_and_baseline_iforest.ipynb
β”‚   β”œβ”€β”€ 03_dimensionality_and_clustering.ipynb
β”‚   β”œβ”€β”€ 04_model_interpretation_and_explanation.ipynb
β”‚   β”œβ”€β”€ 05_ethical_reflection.ipynb
β”‚   β”œβ”€β”€ 06_visual_polishing_and_citations.ipynb
β”‚   β”œβ”€β”€ 07_reproducibility_and_environment_testing.ipynb
β”‚   └── 08_finalised_summary_notebook.ipynb
β”œβ”€β”€ src
β”‚   └── (core model modules β€” optional extension)
β”œβ”€β”€ data
β”‚   └── InternalBleeding14.csv
β”œβ”€β”€ docs
β”œβ”€β”€ utils
β”œβ”€β”€ test
β”œβ”€β”€ LICENSE.md
β”œβ”€β”€ README.md
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ mkdocs.yml
└── .github/workflows

Code Organisation

notebooks/ β€” step-by-step Jupyter notebooks following weekly structure.

src/ β€” reusable model code extensions (optional).

data/ β€” dataset files.

docs/ β€” documentation for deployment.

utils/ β€” helper scripts.

test/ β€” testing scripts.

github/workflows/ β€” GitHub CI/CD automation.

Roadmap πŸ—ΊοΈ

Core 🧩

  • Dataset ingestion and preprocessing
  • Baseline Isolation Forest anomaly detection
  • PCA dimensionality reduction
  • HDBSCAN clustering
  • Model interpretation and markdown commentary
  • Visualisation of anomaly scores and clustering
  • Ethical reflection module
  • Fully reproducible codebase with documentation

Updates:
- Week 1 (13–16 May): Set up GitHub, load dataset, create initial notebook. - Week 2 (19–23 May): Start anomaly detection model (e.g. Isolation Forest). - Week 3 (26–30 May): Create plots and graphs (e.g. PCA, HDBSCAN). - Week 4 (2–6 May): Write markdown explanations and model interpretation. - Week 5 (9–13 May): Add ethics section to notebook. - Week 6 (16–20 June): Improve visuals, clean comments, add sources. - Week 7 (23–27 June): Add environment file and test notebook. - Week 8 (30 Jun–4 July): Finalise README and dataset information. - Week 9 (7–11 July): Peer review and polish. - Week 10 (14–18 July): Final check and submit core materials. - Week 11 (21-24 July): Final GitHub reproducibility setup and devcontainer testing

Extensions πŸ”Œ

  • Advanced ethical scenario analysis
  • Visualisation refinement for industry or academic presentation

Data πŸ“Š

List datasets used with:

  • Dataset: InternalBleeding14
  • Description: Univariate physiological-style time series used for anomaly detection benchmark tasks.
  • Source: UCR Time Series Anomaly Archive (2021)
  • Licence: Public benchmark dataset
  • Location: Included in repository

Best Practice Notes πŸ“

  • Version controlled via GitHub
  • GitHub Projects used for task tracking
  • Clean notebook structure aligned to Imperial ReCoDE 10-week schedule
  • Embedded markdown reflections for ethical context
  • BSD-3-Clause licence for reproducibility and reuse

Estimated Time ⏳

Task Estimated Time
Reading 3 hours
Practising 3 hours

Additional Resources πŸ”—

  • Wu, R., & Keogh, E. (2020). Current Time Series Anomaly Detection Benchmarks are Flawed and are Creating the Illusion of Progress. arXiv:2009.13807.
  • scikit-learn official documentation
  • HDBSCAN official documentation
  • Imperial College London ReCoDE Exemplar Guide

Note: Some learners may initially assume we are detecting scarce data per se. That is not the case. Scarcity here refers to the context in which anomalies occur. They are uncommon, possibly unlabelled, and embedded within larger typical patterns. The exemplar explores methods that work despite this scarcity.

Licence πŸ“„

BSD-3-Clause License