Applied RRCF - thresholding techniques.

Adam Cassar

Co-Founder

Thresholding the RRCF Score: An Important Step in Anomaly Detection

Once we've applied the RRCF algorithm to our streaming data, the resulting scores are a measure of how anomalous each data point is. However, to classify data points as "normal" or "anomalous", we need to set a threshold. This step is critical, as it defines what level of deviation is considered anomalous, and a well-chosen threshold limits both the over-identification and the under-identification of anomalies.

Placeholder for RRCF score graph

Why is Thresholding Needed?

Thresholding is a vital step in anomaly detection because it helps discriminate between normal and anomalous behavior. Without a threshold, we would be left with a set of scores that indicate relative degrees of anomalousness but lack a clear dividing line between what is considered normal and what is considered an anomaly.

Setting the threshold too low could lead to a high rate of false positives, where normal data points are misclassified as anomalies. On the other hand, setting the threshold too high could result in a high rate of false negatives, where actual anomalies are not detected.
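To make this trade-off concrete, here is a minimal sketch that counts false positives and false negatives at a few candidate thresholds. The scores and labels are made-up illustrative values rather than real RRCF output, the helper name `confusion_counts` is hypothetical, and we assume that a higher score means a more anomalous point.

```python
from typing import List, Tuple


def confusion_counts(
    scores: List[float], labels: List[bool], threshold: float
) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) when flagging scores > threshold."""
    false_positives = sum(
        1 for s, is_anomaly in zip(scores, labels) if s > threshold and not is_anomaly
    )
    false_negatives = sum(
        1 for s, is_anomaly in zip(scores, labels) if s <= threshold and is_anomaly
    )
    return false_positives, false_negatives


# Made-up scores for illustration; True marks the points we treat as real anomalies.
scores = [1.0, 1.2, 0.9, 1.5, 2.1, 1.1, 6.0, 1.3, 8.5, 1.8]
labels = [False, False, False, False, False, False, True, False, True, False]

for threshold in (1.0, 3.0, 7.0):
    fp, fn = confusion_counts(scores, labels, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

On this toy data the lowest threshold flags six normal points, the highest one misses an anomaly, and the middle one separates the two groups cleanly. Real score distributions rarely separate this neatly, which is why the choice of method matters.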

How to Set the Threshold?

There are several methods to set the threshold for the RRCF scores, including the Median Absolute Deviation (MAD), Min/Max, and others. The choice of method can depend on the characteristics of the data and the specific use case.

Median Absolute Deviation (MAD)

The Median Absolute Deviation is a robust measure of variability in a data set: it is the median of the absolute differences between each value and the data's median. In the context of RRCF scores, the typical approach is to set the threshold at some multiple of the MAD above the median score. Because both the median and the MAD are barely affected by a few extreme values, this approach is robust to outliers and is particularly useful when the scores have a heavy-tailed distribution.
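A MAD-based threshold can be sketched in a few lines of standard-library Python. The function name `mad_threshold`, the multiplier `k = 3.0` and the sample scores are assumptions chosen for illustration; the right multiplier depends on your data.

```python
import statistics
from typing import List


def mad_threshold(scores: List[float], k: float = 3.0) -> float:
    """Threshold set k MADs above the median score (k is an illustrative choice)."""
    median = statistics.median(scores)
    # MAD: the median of the absolute deviations from the median.
    mad = statistics.median([abs(s - median) for s in scores])
    return median + k * mad


# Made-up scores for illustration, not real RRCF output.
scores = [1.0, 1.2, 0.9, 1.5, 2.1, 1.1, 6.0, 1.3, 8.5, 1.8]

threshold = mad_threshold(scores)
flagged = [s for s in scores if s > threshold]
print(f"threshold={threshold:.2f}, flagged={flagged}")
```

Here the median is 1.4 and the MAD is 0.4, so the threshold lands at about 2.6 and only the two large scores are flagged. Note that the two large scores barely move the median or the MAD, which is the robustness described above.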

Placeholder for MAD graph

Min/Max

Another approach is to use the minimum and maximum values of the RRCF scores to set the threshold, for example by placing it at a fixed percentage of the way from the minimum to the maximum score. While this method is straightforward, it is sensitive to extreme values: a single very large score stretches the range and drags the threshold up with it.
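As a rough sketch of the range-based idea, the snippet below places the threshold a fixed fraction of the way between the minimum and maximum score. The name `min_max_threshold`, the fraction of 0.8 and the sample scores are all assumptions for illustration. The second half adds one extreme value to show how strongly it shifts the result.

```python
from typing import List


def min_max_threshold(scores: List[float], fraction: float = 0.8) -> float:
    """Threshold placed `fraction` of the way from the minimum to the maximum score."""
    lowest, highest = min(scores), max(scores)
    return lowest + fraction * (highest - lowest)


# Made-up scores for illustration, not real RRCF output.
scores = [1.0, 1.2, 0.9, 1.5, 2.1, 1.1, 6.0, 1.3, 8.5, 1.8]

threshold = min_max_threshold(scores)
flagged = [s for s in scores if s > threshold]
print(f"threshold={threshold:.2f}, flagged={flagged}")

# One extreme score stretches the range and pulls the threshold up with it.
scores_with_spike = scores + [50.0]
spike_threshold = min_max_threshold(scores_with_spike)
spike_flagged = [s for s in scores_with_spike if s > spike_threshold]
print(f"threshold={spike_threshold:.2f}, flagged={spike_flagged}")
```

With the original scores the threshold is about 6.98. After the single spike of 50.0 it jumps to about 40.18, and the scores of 6.0 and 8.5 are no longer flagged.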

Placeholder for Min/Max graph

Z-Score

A z-score threshold is set a fixed number of standard deviations above the mean score. Because the mean and standard deviation are themselves pulled by extreme values, it tends to suit scores that are roughly symmetric and not heavy-tailed. Several other methods can also be used, depending on the characteristics of the data, such as thresholds based on quartiles of the data or machine learning techniques that dynamically adjust the threshold based on the observed data.
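Two of these statistical options, standard deviations from the mean and quartiles, can be sketched with the standard library. The function names, the multipliers (`k = 2.0` for the z-score variant, `k = 1.5` for the quartile variant) and the sample scores are assumptions for illustration, not recommended settings.

```python
import statistics
from typing import List


def zscore_threshold(scores: List[float], k: float = 2.0) -> float:
    """Threshold set k population standard deviations above the mean."""
    return statistics.mean(scores) + k * statistics.pstdev(scores)


def iqr_threshold(scores: List[float], k: float = 1.5) -> float:
    """Threshold set k interquartile ranges above the third quartile."""
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return q3 + k * (q3 - q1)


# Made-up scores for illustration, not real RRCF output.
scores = [1.0, 1.2, 0.9, 1.5, 2.1, 1.1, 6.0, 1.3, 8.5, 1.8]

for name, threshold in (
    ("z-score", zscore_threshold(scores)),
    ("quartile", iqr_threshold(scores)),
):
    flagged = [s for s in scores if s > threshold]
    print(f"{name}: threshold={threshold:.2f}, flagged={flagged}")
```

On a sample this small, the large scores inflate both the mean and the standard deviation, so a z-score threshold can end up above points we would want to flag. The methods need not agree with each other, and it is worth comparing them on your own score distribution before settling on one.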

Placeholder for Other Methods graph

Conclusion

Thresholding is a crucial step in the anomaly detection process. It provides a clear boundary between what is considered normal and what is considered anomalous, enabling the effective identification of potential issues such as cyber threats or system errors. The choice of thresholding method depends on the specific use case and the characteristics of the data. Regardless of the method used, it's important to ensure that the threshold is set in a way that balances the need to detect anomalies against the risk of false positives and negatives.

© PEAKHOUR.IO PTY LTD 2025   ABN 76 619 930 826    All rights reserved.