ReCoDE Exemplar - Data-Scarce Behavioural Anomaly Detection
This exemplar provides a complete pipeline for unsupervised anomaly detection applied to univariate time series data. Using the InternalBleeding14 dataset from the UCR Time Series Anomaly Archive, the project demonstrates techniques for detecting irregular patterns in physiological-style sensor recordings, where normal operating conditions are occasionally interrupted by anomalous deviations. The exemplar guides learners through data preparation, preprocessing, Isolation Forest modelling, dimensionality reduction with PCA, clustering with HDBSCAN, model interpretation, and ethical considerations when analysing scarce or sensitive time series data. The exemplar is fully modular, industry-aligned, and reproducible for academic and applied machine learning use cases.
(Visual representative image to be inserted - PCA cluster visualisation with anomaly overlays will be included after Week 6 visual refinement.)
This exemplar was developed at Imperial College London by Duke T J Ludera in collaboration with Saranjeet Kaur S S Bhogal from Research Software Engineering and Dr. Jianliang Gao from Research Computing & Data Science at the Early Career Researcher Institute.
Learning Outcomes
Upon completion, students will:
- Preprocess univariate time series data for anomaly detection.
- Implement Isolation Forest and HDBSCAN clustering in unsupervised anomaly detection contexts.
- Interpret model outputs, identify anomalous deviations, and reflect on ethical challenges in modelling scarce time series data.
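As a rough illustration of the first outcome, the sketch below turns a univariate series into overlapping, z-normalised windows that a tabular model such as Isolation Forest can consume. The function names, window length, step size, and the synthetic signal are all assumptions made for this example; the notebooks may preprocess the data differently.

```python
import numpy as np


def sliding_windows(series, window, step=1):
    """Split a 1-D series into overlapping windows (one row per window)."""
    series = np.asarray(series, dtype=float)
    if series.ndim != 1:
        raise ValueError("expected a univariate (1-D) series")
    if window > series.size:
        raise ValueError("window is longer than the series")
    # Start index of every window, spaced `step` samples apart.
    starts = np.arange(0, series.size - window + 1, step)
    return np.stack([series[s:s + window] for s in starts])


def z_normalise(windows, eps=1e-8):
    """Scale each window to zero mean and (approximately) unit variance."""
    mean = windows.mean(axis=1, keepdims=True)
    std = windows.std(axis=1, keepdims=True)
    return (windows - mean) / (std + eps)


# Synthetic stand-in for a sensor recording: a sine wave with one flat segment.
t = np.arange(1000)
signal = np.sin(2 * np.pi * t / 50)
signal[600:620] = 0.0  # injected anomaly, for illustration only

windows = z_normalise(sliding_windows(signal, window=50, step=10))
print(windows.shape)  # (96, 50)
```

Per-window normalisation emphasises changes in shape and hides changes in overall level, so it is a design choice rather than a default; normalising the whole series once is a reasonable alternative when amplitude shifts matter.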
Target Audience
- Postgraduate students
- Early career data scientists
- Researchers working on time series analysis, anomaly detection, fraud detection, operational monitoring, or applied machine learning pipelines.
Prerequisites
Academic
- Python programming (intermediate level)
- Familiarity with machine learning (unsupervised models, clustering)
- Introductory understanding of anomaly detection and time series concepts
System
- Python 3.10+
- Anaconda or virtualenv recommended
- Disk space: ~2 GB
- RAM: 8 GB or higher
Hardware or HPC requirements
- Standard desktop or laptop (no HPC required)
Getting Started
- Clone this GitHub repository.
- Install the environment using the provided requirements.txt file.
- Launch the Jupyter Notebook environment.
- Work through the notebooks in sequence:
notebooks/
├── 01_dataset_preparation.ipynb
├── 02_preprocessing_and_baseline_iforest.ipynb
├── 03_dimensionality_and_clustering.ipynb
├── 04_model_interpretation_and_explanation.ipynb
├── 05_ethical_reflection.ipynb
├── 06_visual_polishing_and_citations.ipynb
├── 07_reproducibility_and_environment_testing.ipynb
└── 08_finalised_summary_notebook.ipynb
- Follow the markdown guidance and embedded exercises in Notebooks 02, 03, and 06, where the newly added Fantasia dataset is used for practical exploration of unsupervised anomaly detection methods.
- Review the ethical reflection sections in Notebook 05 (Week 5).
Disciplinary Background
This exemplar sits at the intersection of anomaly detection, unsupervised machine learning, and time series data science. While the dataset originates from physiological sensor measurements, it is applied here as a general case of unsupervised anomaly detection on sparse time series. The exemplar demonstrates practical techniques applicable to fraud detection, operational monitoring, industrial equipment diagnostics, and public sector data analysis.
Software Tools
- Python 3.x
- pandas
- numpy
- scikit-learn
- HDBSCAN
- matplotlib
- seaborn
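Before opening the notebooks, it can help to confirm that these packages are present in the active environment. The snippet below is a minimal check using only the standard library. The PyPI distribution names in `REQUIRED` are assumptions (in particular, `hdbscan` refers to the standalone package, whereas recent scikit-learn releases also include their own HDBSCAN implementation), so adjust them to match requirements.txt.

```python
from importlib import metadata

# Distribution names as published on PyPI. These are assumptions for this
# sketch; requirements.txt in the repository is the authoritative list.
REQUIRED = ["pandas", "numpy", "scikit-learn", "hdbscan", "matplotlib", "seaborn"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(REQUIRED)
for name, version in report.items():
    print(f"{name:<14}{version or 'NOT INSTALLED'}")
```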
Project Structure
.
├── notebooks
│   ├── 01_dataset_preparation.ipynb
│   ├── 02_preprocessing_and_baseline_iforest.ipynb
│   ├── 03_dimensionality_and_clustering.ipynb
│   ├── 04_model_interpretation_and_explanation.ipynb
│   ├── 05_ethical_reflection.ipynb
│   ├── 06_visual_polishing_and_citations.ipynb
│   ├── 07_reproducibility_and_environment_testing.ipynb
│   └── 08_finalised_summary_notebook.ipynb
├── src
│   └── (core model modules, optional extension)
├── data
│   └── InternalBleeding14.csv
├── docs
├── utils
├── test
├── LICENSE.md
├── README.md
├── requirements.txt
├── mkdocs.yml
└── .github/workflows
Code Organisation
notebooks/ - step-by-step Jupyter notebooks following the weekly structure.
src/ - reusable model code extensions (optional).
data/ - dataset files.
docs/ - documentation for deployment.
utils/ - helper scripts.
test/ - testing scripts.
.github/workflows/ - GitHub CI/CD automation.
Roadmap
Core
- Dataset ingestion and preprocessing
- Baseline Isolation Forest anomaly detection
- PCA dimensionality reduction
- HDBSCAN clustering
- Model interpretation and markdown commentary
- Visualisation of anomaly scores and clustering
- Ethical reflection module
- Fully reproducible codebase with documentation
Updates:
- Week 1 (13-16 May): Set up GitHub, load dataset, create initial notebook.
- Week 2 (19-23 May): Start anomaly detection model (e.g. Isolation Forest).
- Week 3 (26-30 May): Create plots and graphs (e.g. PCA, HDBSCAN).
- Week 4 (2-6 June): Write markdown explanations and model interpretation.
- Week 5 (9-13 June): Add ethics section to notebook.
- Week 6 (16-20 June): Improve visuals, clean comments, add sources.
- Week 7 (23-27 June): Add environment file and test notebook.
- Week 8 (30 June-4 July): Finalise README and dataset information.
- Week 9 (7-11 July): Peer review and polish.
- Week 10 (14-18 July): Final check and submit core materials.
- Week 11 (21-24 July): Final GitHub reproducibility setup and devcontainer testing.
Extensions
- Advanced ethical scenario analysis
- Visualisation refinement for industry or academic presentation
Data
Datasets used:
- Dataset: InternalBleeding14
- Description: Univariate physiological-style time series used for anomaly detection benchmark tasks.
- Source: UCR Time Series Anomaly Archive (2021)
- Licence: Public benchmark dataset
- Location: Included in repository
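A loading step for a file like this might look as follows. The sketch assumes the CSV holds a single headerless column of numeric readings, which is an assumption about the file layout rather than something stated here; `load_univariate_series` is a hypothetical helper name, and the in-memory values are made up for demonstration.

```python
import io

import numpy as np
import pandas as pd


def load_univariate_series(source):
    """Read a single-column file into a 1-D float array, rejecting gaps."""
    frame = pd.read_csv(source, header=None)
    if frame.shape[1] != 1:
        raise ValueError(f"expected one column, found {frame.shape[1]}")
    # Coerce non-numeric entries (e.g. an unexpected header row) to NaN.
    values = pd.to_numeric(frame.iloc[:, 0], errors="coerce").to_numpy(dtype=float)
    if np.isnan(values).any():
        raise ValueError("series contains missing or non-numeric values")
    return values


# In the repository this would be called with "data/InternalBleeding14.csv",
# assuming that file is a single headerless column. An in-memory example:
demo = io.StringIO("0.12\n0.15\n0.11\n0.90\n0.13\n")
series = load_univariate_series(demo)
print(series.shape, series.min(), series.max())
```

Failing loudly on missing or non-numeric values keeps layout surprises from silently propagating into the anomaly scores.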
Best Practice Notes
- Version controlled via GitHub
- GitHub Projects used for task tracking
- Clean notebook structure aligned to Imperial ReCoDE 10-week schedule
- Embedded markdown reflections for ethical context
- BSD-3-Clause licence for reproducibility and reuse
Estimated Time
| Task | Estimated Time |
|---|---|
| Reading | 3 hours |
| Practising | 3 hours |
Additional Resources
- Wu, R., & Keogh, E. (2020). Current Time Series Anomaly Detection Benchmarks are Flawed and are Creating the Illusion of Progress. arXiv:2009.13807.
- scikit-learn official documentation
- HDBSCAN official documentation
- Imperial College London ReCoDE Exemplar Guide
Note: Some learners may initially assume we are detecting scarce data per se. That is not the case. Scarcity here refers to the context in which anomalies occur. They are uncommon, possibly unlabelled, and embedded within larger typical patterns. The exemplar explores methods that work despite this scarcity.
Licence
BSD-3-Clause License