Scaling anomaly detection with RRCF

Adam Cassar

Co-Founder

As the volume of data grows, so does the need to scale the anomaly detection process. While the RRCF algorithm is powerful and efficient, it can struggle with high-velocity streams and high-dimensional points. Here are several strategies for scaling it.

Compute Summary Statistics Instead of Shingling

Shingling is a process that transforms a single time series into a multivariate one by stacking lagged versions of the data. Although this can help capture the temporal dependencies in the data, it also increases the dimensionality of the data points inserted into each tree, which can hamper performance.

An alternative approach is to compute summary statistics that capture the types of anomalies you are looking for. For instance, if you're interested in detecting spikes, your data points could consist of second central differences. If you're looking for long-term trends, your data points could consist of rolling means at different window sizes. This reduces the dimension of the points you're inserting into each tree, leading to better performance.
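As a sketch of this idea (the two feature choices and the window size here are illustrative, not prescribed by the article), the following builds two-dimensional summary points, a second central difference for spikes and a rolling mean for trends, in place of a higher-dimensional shingle:

```python
import numpy as np

def summary_features(series: np.ndarray, window: int = 16) -> np.ndarray:
    """Build low-dimensional feature points from a 1-D time series.

    Each point has two features instead of a high-dimensional shingle:
    a second central difference (spike detector) and a rolling mean
    (trend detector). The window size is an illustrative choice.
    """
    # Second central difference x[t-1] - 2*x[t] + x[t+1] highlights spikes.
    spike = np.convolve(series, [1.0, -2.0, 1.0], mode="same")
    # Rolling mean over `window` samples captures slower trends.
    kernel = np.ones(window) / window
    trend = np.convolve(series, kernel, mode="same")
    return np.column_stack([spike, trend])

# Example: a flat series with one spike yields a large spike feature there.
x = np.zeros(100)
x[50] = 10.0
feats = summary_features(x)
print(feats.shape)  # (100, 2) -- 2-D points instead of, say, 16-D shingles
```

Each row of `feats` is then a candidate point for insertion into the trees, at a fraction of the dimensionality a shingled representation would have.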

Placeholder for Summary Statistics graph

Buffer Input and Compute Rolling Summary Statistics

When data is arriving too quickly to be inserted into the trees, buffering the input and computing rolling summary statistics (mean, median, max, etc.) can help manage the influx of data. This reduces the number of points that need to be inserted into the trees and allows the algorithm to keep up with the streaming data.
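A minimal sketch of this buffering approach, using only the standard library (the buffer size and the particular statistics are illustrative choices):

```python
from collections import deque
from statistics import mean, median

class StreamBuffer:
    """Buffer incoming samples and emit one summary point per full buffer.

    Instead of inserting every raw sample into the forest, insert one
    (mean, median, max) tuple per `size` samples; `size` is illustrative.
    """
    def __init__(self, size: int = 8):
        self.size = size
        self.buf = deque(maxlen=size)

    def push(self, value):
        """Add a sample; return a summary point when the buffer fills."""
        self.buf.append(value)
        if len(self.buf) == self.size:
            point = (mean(self.buf), median(self.buf), max(self.buf))
            self.buf.clear()
            return point  # this point would be inserted into the RRCF trees
        return None

stream = StreamBuffer(size=4)
points = [p for v in [1, 2, 3, 4, 5, 6, 7, 40] if (p := stream.push(v))]
print(points)  # two summary points for eight raw samples
```

Eight raw samples collapse into two tree insertions here; tuning the buffer size trades detection latency against insertion rate.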

Placeholder for Rolling Summary Statistics graph

Parallelisation

RRCF can be parallelised, which is particularly useful when dealing with multiple independent time series. Different RRCF instances can be run for each time series, using separate processes or server instances. This distributes the computational load and can significantly improve performance.

For instance, if you have 10 independent time series, you can run 10 instances of RRCF in parallel, each focusing on one time series. This allows you to scale up the anomaly detection process to handle larger volumes of data.
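A sketch of this fan-out, where a thread pool stands in for the separate processes or server instances you would use in production, and `detect_anomalies` is a hypothetical placeholder for a real per-series RRCF scorer:

```python
from concurrent.futures import ThreadPoolExecutor
import random

def detect_anomalies(series):
    """Stand-in for one RRCF instance scoring a single time series.

    A real implementation would maintain a forest for this series and
    return per-point anomaly scores; a simple deviation-from-mean score
    keeps this sketch self-contained.
    """
    m = sum(series) / len(series)
    return [abs(v - m) for v in series]

# Ten independent time series, one detector instance each, run in parallel.
random.seed(0)
all_series = [[random.gauss(0, 1) for _ in range(100)] for _ in range(10)]

with ThreadPoolExecutor(max_workers=10) as pool:
    scores = list(pool.map(detect_anomalies, all_series))

print(len(scores), len(scores[0]))  # 10 series, 100 scores each
```

Because each series has its own detector state, the instances share nothing and scale horizontally across processes or machines.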

Placeholder for Parallelization graph

Conclusion

Scaling the RRCF algorithm for large datasets involves several strategies, including computing summary statistics, buffering input, and parallelisation. These methods can help manage high-dimensional data and high data velocities, allowing for efficient and effective anomaly detection even as data volumes grow. By implementing these strategies, you can ensure that your anomaly detection processes remain robust and reliable, no matter the size of your data.
