04 Model Interpretation & Explanation¶
This Week 4 notebook focuses on interpreting the results of our anomaly detection pipeline, which combines Isolation Forest and HDBSCAN clustering. Building on the outputs of Notebooks 01-03, we draw behavioural insights and visual inferences from the models without introducing new code logic. The emphasis here is on understanding what the models reveal about the signal: its structure, its irregularities, and its interpretive value for real-world applications.
Step 1 - Summary of the Pipeline¶
The ReCoDE exemplar pipeline follows a modular structure:
- Data Preparation: Load and verify the InternalBleeding14 time series dataset.
- Preprocessing: Apply MinMaxScaler to normalise signal values between 0 and 1.
- Isolation Forest: Detect anomalies using unsupervised tree-based scoring (5% contamination).
- PCA Transformation: Reduce data dimensionality to a single component.
- HDBSCAN Clustering: Discover latent behavioural clusters and noise points.
- Visualisation: Overlay plots to reveal anomaly distribution and cluster structures.
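For orientation, the sketch below condenses these stages into a single code cell. It is illustrative only: the file path, column name, and HDBSCAN settings are assumptions, and the original notebooks may derive additional features before applying PCA.

```python
# Minimal sketch of the pipeline (illustrative; paths and parameters are assumptions)
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA
import hdbscan

# Data preparation: load the InternalBleeding14 series (file name and column are assumed)
signal = pd.read_csv("InternalBleeding14.csv")["value"].to_numpy().reshape(-1, 1)

# Preprocessing: scale values into [0, 1]
scaled = MinMaxScaler().fit_transform(signal)

# Isolation Forest: unsupervised anomaly scoring with 5% contamination
iso = IsolationForest(contamination=0.05, random_state=42)
if_labels = iso.fit_predict(scaled)            # -1 = anomaly, 1 = normal

# PCA transformation: reduce to a single component
component = PCA(n_components=1).fit_transform(scaled)

# HDBSCAN clustering: density-based grouping; label -1 denotes noise
clusterer = hdbscan.HDBSCAN(min_cluster_size=50)   # min_cluster_size is an assumed value
cluster_labels = clusterer.fit_predict(component)
```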
Step 2 - Model Assumptions¶
- The time series displays periodic behaviour with sporadic anomalies.
- Anomalies are rare and sparsely distributed.
- PCA captures meaningful signal variance in a single component.
- HDBSCAN does not require a pre-set number of clusters.
- Cluster -1 (noise) may indicate either outliers or weak patterning.
Step 3 - Isolation Forest Results¶
The Isolation Forest model detected anomalies by evaluating how easily each data point could be isolated in the scaled value space. It identified 374 anomalous points out of 7501 observations (approximately 5%), consistent with the model’s contamination parameter. These points typically reflect abrupt deviations from the regular amplitude oscillations.
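A quick way to reproduce this count, assuming the `if_labels` array from the sketch in Step 1, is:

```python
import numpy as np

# Count Isolation Forest anomalies (label -1) and express them as a share of the series
n_anomalies = int(np.sum(if_labels == -1))
print(f"Anomalies: {n_anomalies} / {len(if_labels)} "
      f"({n_anomalies / len(if_labels):.1%})")
```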
Step 4 - HDBSCAN Clustering Insights¶
HDBSCAN grouped data points into behavioural clusters using the PCA-reduced representation. The model revealed:
- Cluster 1: 7173 points (dominant behavioural group)
- Cluster 0: 99 points (minor variation group)
- Noise cluster (-1): 229 points flagged as unclassifiable
This clustering structure reinforces the presence of two distinct behavioural regimes; noise points may overlap with regions identified as anomalous earlier. The model offers an interpretable lens for understanding dense versus irregular signal patterns.
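Assuming the `cluster_labels` array from the sketch in Step 1, the cluster sizes can be tabulated as follows:

```python
import numpy as np

# Tabulate HDBSCAN cluster sizes; label -1 is HDBSCAN's noise cluster
labels, counts = np.unique(cluster_labels, return_counts=True)
for label, count in zip(labels, counts):
    name = "noise" if label == -1 else f"cluster {label}"
    print(f"{name}: {count} points")
```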
Step 5 - Commentary on Cross-Validation in Unsupervised Pipelines¶
In supervised learning, cross-validation typically relies on labelled data to assess model performance. However, in unsupervised anomaly detection, where labels are unavailable, such validation becomes more interpretive than definitive. In this step, we focus instead on the robustness, overlap, and divergence of outputs generated by different models in the pipeline.
This exemplar makes use of two complementary unsupervised methods:
- Isolation Forest (IF) identifies anomalies by recursively partitioning the data and isolating points that require fewer splits.
- HDBSCAN labels points as noise or assigns them to clusters based on local density and hierarchical structure.
Rather than treating either model as authoritative, we encourage learners to reflect on their interaction and internal consistency.
Suggested Exploratory Checks:¶
- Overlap: Compare the points flagged as anomalies by Isolation Forest with those labelled as noise by HDBSCAN. High overlap may suggest shared detection of structural irregularity (see the sketch after this list).
- Divergence: Examine whether any Isolation Forest outliers are embedded within HDBSCAN clusters. These may reflect globally anomalous but locally typical behaviour, which is especially relevant in behavioural or transport modelling.
- Stability: Run the pipeline on multiple random subsets of the data. Assess the consistency of model outputs using:
  - Internal validation metrics such as the Silhouette Score or Davies–Bouldin Index.
  - Set-based measures such as Jaccard Similarity to quantify overlap in anomaly labels across subsets.
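A minimal sketch of the overlap and divergence checks, assuming the `if_labels`, `cluster_labels`, and `component` arrays from the Step 1 sketch:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Boolean masks: Isolation Forest anomalies vs. HDBSCAN noise points
if_anomaly = (if_labels == -1)
hdb_noise = (cluster_labels == -1)

# Overlap: Jaccard similarity between the two sets of flagged points
intersection = np.sum(if_anomaly & hdb_noise)
union = np.sum(if_anomaly | hdb_noise)
print(f"Jaccard overlap: {intersection / union:.2f}")

# Divergence: Isolation Forest outliers that sit inside an HDBSCAN cluster
embedded = np.sum(if_anomaly & ~hdb_noise)
print(f"IF anomalies assigned to a cluster: {embedded}")

# Internal validation of the clustering (noise points excluded)
clustered = cluster_labels != -1
if len(np.unique(cluster_labels[clustered])) > 1:
    score = silhouette_score(component[clustered], cluster_labels[clustered])
    print(f"Silhouette score (noise excluded): {score:.2f}")
```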
These steps help learners form a critical view of unsupervised outputs without relying on ground truth, promoting deeper reasoning about the model’s behaviour and limitations.
Optional Extension:¶
Learners may explore a basic form of pseudo cross-validation by applying the full pipeline to repeated subsamples. Anomaly scores, cluster assignments, and overlap metrics may be tracked across runs to assess the reliability of results. This remains an optional activity and is intended to build confidence rather than serve as formal validation.
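One possible sketch of such a pseudo cross-validation loop is shown below; the subsample fraction, number of runs, and random seeds are arbitrary assumptions, and it reuses the `signal` array from the Step 1 sketch.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import MinMaxScaler

# Pseudo cross-validation: re-run Isolation Forest on random subsamples and track
# how consistently each retained point is flagged as anomalous
rng = np.random.default_rng(0)
n = len(signal)
flag_counts = np.zeros(n)
seen_counts = np.zeros(n)

for run in range(10):
    idx = np.sort(rng.choice(n, size=int(0.8 * n), replace=False))
    sub = MinMaxScaler().fit_transform(signal[idx])
    labels = IsolationForest(contamination=0.05, random_state=run).fit_predict(sub)
    flag_counts[idx] += (labels == -1)
    seen_counts[idx] += 1

# Points flagged in most of the runs that included them are more reliably anomalous
stability = np.divide(flag_counts, seen_counts, out=np.zeros(n), where=seen_counts > 0)
print(f"Points flagged in over half their runs: {int(np.sum(stability > 0.5))}")
```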
Step 6 - Visual Patterns and Significance¶
Interpreting the structure and meaning of anomalies
After detecting anomalies and clusters, it is essential to examine their visual context and practical significance. Anomalies should not be treated as definitive errors or artefacts. Rather, they require interpretation within the structure of the time series or the reduced latent space.
In this step, learners are encouraged to:
- Plot the original signal with detected anomaly points overlaid, as sketched after this list. Reflect on their position: do they appear at signal boundaries, isolated peaks, or periods of sudden fluctuation?
- Review dimensionality-reduced embeddings, such as UMAP or PCA projections, with cluster labels and anomalies highlighted. Consider whether anomalies fall outside dense regions, or whether they cluster in unexpected areas.
- Annotate the behavioural or operational context, for example:
  - “Anomalies are concentrated at signal troughs. This may indicate sensor dropout.”
  - “Outliers emerge during transitions between regimes. These may represent decision delays or hesitation.”
  - “High anomaly scores occur near maximum amplitude. This is often characteristic of sensor saturation.”
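A minimal plotting sketch, assuming the `signal`, `if_labels`, `component`, and `cluster_labels` arrays from the earlier sketches:

```python
import matplotlib.pyplot as plt
import numpy as np

t = np.arange(len(signal))
anomalous = if_labels == -1

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))

# Original signal with Isolation Forest anomalies overlaid
ax1.plot(t, signal.ravel(), lw=0.5, label="signal")
ax1.scatter(t[anomalous], signal.ravel()[anomalous], color="red", s=8, label="IF anomaly")
ax1.set_title("Signal with Isolation Forest anomalies")
ax1.legend()

# PCA component coloured by HDBSCAN cluster label (-1 = noise)
ax2.scatter(t, component.ravel(), c=cluster_labels, cmap="viridis", s=4)
ax2.set_title("PCA component coloured by HDBSCAN cluster")

plt.tight_layout()
plt.show()
```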
The objective here is to encourage a form of practical reasoning about model outputs. Not all anomalies are mistakes. Some may represent rare but valid occurrences, while others may arise from known artefacts such as sensor noise, boundary distortion, or missing data.
This interpretive practice supports learners in moving from algorithmic output to meaningful, defensible insights.
Step 7 - Educational Commentary¶
This notebook demonstrates the value of combining multiple unsupervised techniques for time series interpretation. Isolation Forest and HDBSCAN offer complementary strengths:
- Tree-based anomaly scoring (Isolation Forest)
- Density-based segmentation (HDBSCAN)
Together, they provide a layered understanding of behavioural structure within unlabelled, physiologically inspired data. This approach generalises well to transport analysis, medical monitoring, and public service data pipelines where labels are scarce but signal integrity is vital.
Reflective Task¶
Learners may now revisit anomaly visualisation in Notebook 03 and reflect on the behavioural meaning of detected points. Are any patterns consistent with artefacts? Which anomalies appear explainable versus unexpected?
Further Reading¶
- Rousseeuw, P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, pp. 53–65.
- Chandola, V., Banerjee, A. and Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys, 41(3), Article 15.
- Aggarwal, C.C. (2017). Outlier Analysis. 2nd edn. Springer.
- Campello, R.J.G.B., Moulavi, D. and Sander, J. (2013). Density-based clustering based on hierarchical density estimates. PAKDD 2013, pp. 160–172.
- Efron, B. and Tibshirani, R.J. (1994). An Introduction to the Bootstrap. CRC Press.