Scaling anomaly detection with RRCF

Adam Cassar

Co-Founder

As the volume of data grows, so does the need to scale the anomaly detection process. While the RRCF algorithm is powerful and efficient, it can struggle with high-velocity streams and high-dimensional points. Here are several strategies for scaling it.

Compute Summary Statistics Instead of Shingling

Shingling is a process that transforms a single time series into a multivariate one by stacking lagged versions of the data. Although this can help capture the temporal dependencies in the data, it also increases the dimensionality of the data points inserted into each tree, which can hamper performance.

An alternative approach is to compute summary statistics that capture the types of anomalies you are looking for. For instance, if you're interested in detecting spikes, your data points could consist of second central differences. If you're looking for long-term trends, your data points could consist of rolling means at different window sizes. This reduces the dimension of the points you're inserting into each tree, leading to better performance.
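As a sketch of this idea (the two feature choices and the window size here are illustrative, not prescribed by the article), the following builds two-dimensional summary points, a second central difference for spikes and a rolling mean for trends, in place of a higher-dimensional shingle:

```python
import numpy as np

def summary_features(series: np.ndarray, window: int = 16) -> np.ndarray:
    """Build low-dimensional feature points from a 1-D time series.

    Each point has two features instead of a high-dimensional shingle:
    a second central difference (spike detector) and a rolling mean
    (trend detector). The window size is an illustrative choice.
    """
    # Second central difference x[t-1] - 2*x[t] + x[t+1] highlights spikes.
    spike = np.convolve(series, [1.0, -2.0, 1.0], mode="same")
    # Rolling mean over `window` samples captures slower trends.
    kernel = np.ones(window) / window
    trend = np.convolve(series, kernel, mode="same")
    return np.column_stack([spike, trend])

# Example: a flat series with one spike yields a large spike feature there.
x = np.zeros(100)
x[50] = 10.0
feats = summary_features(x)
print(feats.shape)  # (100, 2) -- 2-D points instead of, say, 16-D shingles
```

Each row of `feats` is then a candidate point for insertion into the trees, at a fraction of the dimensionality a shingled representation would have.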

Placeholder for Summary Statistics graph

Buffer Input and Compute Rolling Summary Statistics

When data is arriving too quickly to be inserted into the trees, buffering the input and computing rolling summary statistics (mean, median, max, etc.) can help manage the influx of data. This reduces the number of points that need to be inserted into the trees and allows the algorithm to keep up with the streaming data.
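A minimal sketch of this buffering approach, using only the standard library (the buffer size and the particular statistics are illustrative choices):

```python
from collections import deque
from statistics import mean, median

class StreamBuffer:
    """Buffer incoming samples and emit one summary point per full buffer.

    Instead of inserting every raw sample into the forest, insert one
    (mean, median, max) tuple per `size` samples; `size` is illustrative.
    """
    def __init__(self, size: int = 8):
        self.size = size
        self.buf = deque(maxlen=size)

    def push(self, value):
        """Add a sample; return a summary point when the buffer fills."""
        self.buf.append(value)
        if len(self.buf) == self.size:
            point = (mean(self.buf), median(self.buf), max(self.buf))
            self.buf.clear()
            return point  # this point would be inserted into the RRCF trees
        return None

stream = StreamBuffer(size=4)
points = [p for v in [1, 2, 3, 4, 5, 6, 7, 40] if (p := stream.push(v))]
print(points)  # two summary points for eight raw samples
```

Eight raw samples collapse into two tree insertions here; tuning the buffer size trades detection latency against insertion rate.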

Placeholder for Rolling Summary Statistics graph

Parallelisation

RRCF can be parallelised, which is particularly useful when dealing with multiple independent time series. Different RRCF instances can be run for each time series, using separate processes or server instances. This distributes the computational load and can significantly improve performance.

For instance, if you have 10 independent time series, you can run 10 instances of RRCF in parallel, each focusing on one time series. This allows you to scale up the anomaly detection process to handle larger volumes of data.
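A sketch of this fan-out, where a thread pool stands in for the separate processes or server instances you would use in production, and `detect_anomalies` is a hypothetical placeholder for a real per-series RRCF scorer:

```python
from concurrent.futures import ThreadPoolExecutor
import random

def detect_anomalies(series):
    """Stand-in for one RRCF instance scoring a single time series.

    A real implementation would maintain a forest for this series and
    return per-point anomaly scores; a simple deviation-from-mean score
    keeps this sketch self-contained.
    """
    m = sum(series) / len(series)
    return [abs(v - m) for v in series]

# Ten independent time series, one detector instance each, run in parallel.
random.seed(0)
all_series = [[random.gauss(0, 1) for _ in range(100)] for _ in range(10)]

with ThreadPoolExecutor(max_workers=10) as pool:
    scores = list(pool.map(detect_anomalies, all_series))

print(len(scores), len(scores[0]))  # 10 series, 100 scores each
```

Because each series has its own detector state, the instances share nothing and scale horizontally across processes or machines.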

Placeholder for Parallelization graph

Conclusion

Scaling the RRCF algorithm for large datasets involves several strategies, including computing summary statistics, buffering input, and parallelisation. These methods can help manage high-dimensional data and high data velocities, allowing for efficient and effective anomaly detection even as data volumes grow. By implementing these strategies, you can ensure that your anomaly detection processes remain robust and reliable, no matter the size of your data.
