Scaling anomaly detection with RRCF

Adam Cassar

Co-Founder

2 min read

As data volumes grow, so does the need to scale the anomaly detection process. While the Robust Random Cut Forest (RRCF) algorithm is powerful and efficient, it can struggle with large, high-dimensional data. Here are several strategies for scaling RRCF.

Compute Summary Statistics Instead of Shingling

Shingling is a process that transforms a single time series into a multivariate one by stacking lagged versions of the data. Although this can help capture the temporal dependencies in the data, it also increases the dimensionality of the data points inserted into each tree, which can hamper performance.
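
For reference, here is a minimal sketch of shingling with the open-source rrcf Python package; the noisy sine-wave stream and the shingle size of 4 are illustrative assumptions:

```python
import numpy as np
import rrcf

# Illustrative univariate stream: a noisy sine wave.
rng = np.random.default_rng(42)
stream = np.sin(np.linspace(0, 20 * np.pi, 1000)) + rng.normal(0, 0.1, 1000)

# Shingling with size=4 turns each window of 4 consecutive values into one
# 4-dimensional point, so temporal context travels with every inserted point.
tree = rrcf.RCTree()
for i, point in enumerate(rrcf.shingle(stream, size=4)):
    tree.insert_point(point, index=i)
```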

An alternative approach is to compute summary statistics that capture the types of anomalies you are looking for. For instance, if you're interested in detecting spikes, your data points could consist of second central differences. If you're looking for long-term trends, your data points could consist of rolling means at different window sizes. This reduces the dimension of the points you're inserting into each tree, leading to better performance.
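
As a hedged sketch of the alternative, each inserted point below is a 2-dimensional vector of summary statistics rather than a full shingle: a second central difference to expose spikes and a rolling mean to expose trend shifts. The particular features and the 16-sample window are illustrative choices, not part of RRCF itself:

```python
import numpy as np
import pandas as pd
import rrcf

rng = np.random.default_rng(42)
series = pd.Series(np.sin(np.linspace(0, 20 * np.pi, 1000))
                   + rng.normal(0, 0.1, 1000))

# Second central difference x[t-1] - 2*x[t] + x[t+1]: large values flag spikes.
spike = series.shift(1) - 2 * series + series.shift(-1)

# Rolling mean over an illustrative 16-sample window: drift flags trend changes.
trend = series.rolling(window=16).mean()

# Each point stays 2-dimensional regardless of how much history
# the statistics summarise.
features = pd.concat([spike, trend], axis=1).dropna().to_numpy()

tree = rrcf.RCTree()
for i, point in enumerate(features):
    tree.insert_point(point, index=i)
```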

Placeholder for Summary Statistics graph

Buffer Input and Compute Rolling Summary Statistics

When data is arriving too quickly to be inserted into the trees, buffering the input and computing rolling summary statistics (mean, median, max, etc.) can help manage the influx of data. This reduces the number of points that need to be inserted into the trees and allows the algorithm to keep up with the streaming data.
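
A minimal sketch of the idea, assuming an illustrative 32-sample buffer and a stand-in generator for the fast feed: each full buffer collapses into a single (mean, max) point, so tree insertions drop by a factor of 32:

```python
import numpy as np
import rrcf

def fast_feed(n=10_000):
    # Stand-in for a high-rate telemetry stream.
    rng = np.random.default_rng(0)
    for _ in range(n):
        yield rng.normal()

BUFFER_SIZE = 32  # illustrative; tune to your ingest rate
tree = rrcf.RCTree()

buffer, inserted = [], 0
for value in fast_feed():
    buffer.append(value)
    if len(buffer) == BUFFER_SIZE:
        # One aggregate point summarises the whole buffer.
        tree.insert_point(np.array([np.mean(buffer), np.max(buffer)]),
                          index=inserted)
        inserted += 1
        buffer.clear()
```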

Placeholder for Rolling Summary Statistics graph

Parallelisation

RRCF can be parallelised, which is particularly useful when dealing with multiple independent time series. Different RRCF instances can be run for each time series, using separate processes or server instances. This distributes the computational load and can significantly improve performance.

For instance, if you have 10 independent time series, you can run 10 instances of RRCF in parallel, each focusing on one time series. This allows you to scale up the anomaly detection process to handle larger volumes of data.
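
A sketch of this pattern with Python's multiprocessing, assuming ten synthetic series as stand-ins for real metrics; each worker builds its own tree and returns a CoDisp anomaly score for every point:

```python
import numpy as np
import rrcf
from multiprocessing import Pool

def score_series(series):
    """Run one independent RRCF instance over a single time series
    and return the CoDisp anomaly score for every point."""
    tree = rrcf.RCTree()
    for i, x in enumerate(series):
        tree.insert_point(np.array([x]), index=i)
    return [tree.codisp(i) for i in range(len(series))]

if __name__ == "__main__":
    # Ten independent series; in practice, one per monitored metric.
    rng = np.random.default_rng(7)
    all_series = [rng.normal(size=500) for _ in range(10)]

    # One worker per series distributes the computational load.
    with Pool(processes=10) as pool:
        all_scores = pool.map(score_series, all_series)
```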

Placeholder for Parallelization graph

Conclusion

Scaling the RRCF algorithm for large datasets involves several strategies, including computing summary statistics, buffering input, and parallelisation. These methods can help manage high-dimensional data and high data velocities, allowing for efficient and effective anomaly detection even as data volumes grow. By implementing these strategies, you can ensure that your anomaly detection processes remain robust and reliable, no matter the size of your data.
