Thresholding the RRCF Score: An Important Step in Anomaly Detection
Once we've applied the RRCF algorithm to our streaming data, each data point has a score that measures how anomalous it is, with higher scores indicating more anomalous points. However, to classify data points as "normal" or "anomalous", we need to set a threshold. This step is critical because it defines what level of deviation counts as anomalous, and therefore controls whether anomalies are over-reported or under-reported.
Why is Thresholding Needed?
Thresholding is a vital step in anomaly detection because it helps discriminate between normal and anomalous behavior. Without a threshold, we would be left with a set of scores that indicate relative degrees of anomalousness but lack a clear dividing line between what is considered normal and what is considered an anomaly.
Setting the threshold too low could lead to a high rate of false positives, where normal data points are misclassified as anomalies. On the other hand, setting the threshold too high could result in a high rate of false negatives, where actual anomalies are not detected.
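To make this trade-off concrete, the sketch below counts false positives and false negatives at three candidate thresholds. The scores and ground-truth labels are made-up illustrative values, not real RRCF output, and count_errors is a hypothetical helper name.

```python
# Toy anomaly scores with hypothetical ground-truth labels (True = real anomaly).
scores = [1.0, 1.2, 0.9, 1.1, 4.0, 1.3, 0.8, 9.5]
labels = [False, False, False, False, True, False, False, True]

def count_errors(scores, labels, threshold):
    """Return (false_positives, false_negatives) when flagging scores > threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y)
    return fp, fn

# Compare a low, a moderate, and a high threshold on the same scores.
for t in (0.85, 2.0, 5.0):
    fp, fn = count_errors(scores, labels, t)
    print(f"threshold={t}: false positives={fp}, false negatives={fn}")
```

With these toy values, the lowest threshold flags most of the normal points, while the highest one misses the milder of the two anomalies.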
How to Set the Threshold?
There are several methods to set the threshold for the RRCF scores, including the Median Absolute Deviation (MAD), Min/Max, and others. The choice of method can depend on the characteristics of the data and the specific use case.
Median Absolute Deviation (MAD)
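The Median Absolute Deviation is a robust measure of variability in a data set: it is the median of the absolute differences between each value and the median of all values. In the context of RRCF scores, we can use MAD to set a threshold. The typical approach is to set the threshold as some multiple of the MAD above the median, that is, threshold = median + k × MAD, where k is a multiplier chosen for the use case. Because a few extreme scores barely move either the median or the MAD, this approach is robust to outliers and can be particularly useful when the scores follow a heavy-tailed distribution.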
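A minimal sketch of a MAD-based threshold, using only the Python standard library. The function name mad_threshold, the multiplier k=3.0, and the sample scores are assumptions chosen for illustration; a suitable multiplier depends on the data.

```python
from statistics import median

def mad_threshold(scores, k=3.0):
    """Threshold = median + k * MAD (k is an illustrative default, not a rule)."""
    med = median(scores)
    # MAD: median of the absolute deviations from the median.
    mad = median([abs(s - med) for s in scores])
    return med + k * mad

# Toy scores standing in for RRCF output; one value is clearly extreme.
scores = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 9.5]

threshold = mad_threshold(scores, k=3.0)
anomalies = [s for s in scores if s > threshold]
print(f"threshold={threshold:.2f}, anomalies={anomalies}")
```

One caveat with this sketch: if more than half of the scores are identical, the MAD is zero and the threshold collapses to the median, so a small floor on the MAD may be needed in practice.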
Min/Max
Another approach is to use the minimum and maximum values of the RRCF scores to set the threshold. This could involve placing the threshold at a fixed fraction of the range between the minimum and maximum scores, for example at 80% of the way from the minimum to the maximum. While this method is straightforward, it is sensitive to extreme values: a single very large score stretches the range and raises the threshold, which can hide more moderate anomalies.
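The range-based idea can be sketched as follows. The function name minmax_threshold, the fraction of 0.8, and the sample scores are illustrative assumptions. The second call adds one very large score to show how strongly the threshold depends on the maximum.

```python
def minmax_threshold(scores, fraction=0.8):
    """Place the threshold at a fixed fraction of the [min, max] range."""
    lo, hi = min(scores), max(scores)
    return lo + fraction * (hi - lo)

# Toy scores standing in for RRCF output.
scores = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 9.5]
threshold = minmax_threshold(scores)
flagged = [s for s in scores if s > threshold]

# Same scores plus one much larger value: the range stretches,
# the threshold rises, and 9.5 is no longer flagged.
scores_extreme = scores + [50.0]
threshold_extreme = minmax_threshold(scores_extreme)
flagged_extreme = [s for s in scores_extreme if s > threshold_extreme]

print(f"threshold={threshold:.2f}, flagged={flagged}")
print(f"threshold={threshold_extreme:.2f}, flagged={flagged_extreme}")
```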
Other Methods
Several other methods can be used to set the threshold, depending on the characteristics of the data. These include statistical techniques such as placing the threshold a fixed number of standard deviations above the mean (a z-score threshold) or deriving it from the quartiles of the scores, as well as machine learning techniques that adjust the threshold dynamically based on the observed data.
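The standard-deviation and quartile approaches mentioned above can be sketched with the Python standard library. The function names, the multipliers (k=2.0 and k=1.5), and the sample scores are assumptions for illustration. The quartile variant uses the common "upper fence" convention of Q3 plus a multiple of the interquartile range, which is one possible reading of a quartile-based threshold.

```python
from statistics import mean, pstdev, quantiles

def zscore_threshold(scores, k=2.0):
    """Threshold = mean + k * standard deviation (population form)."""
    return mean(scores) + k * pstdev(scores)

def iqr_threshold(scores, k=1.5):
    """Threshold = Q3 + k * IQR, the usual upper-fence convention."""
    q1, _, q3 = quantiles(scores, n=4)  # three cut points: Q1, median, Q3
    return q3 + k * (q3 - q1)

# Toy scores standing in for RRCF output.
scores = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 9.5]

z_flagged = [s for s in scores if s > zscore_threshold(scores)]
iqr_flagged = [s for s in scores if s > iqr_threshold(scores)]
print("z-score:", z_flagged)
print("IQR:", iqr_flagged)
```

Because the mean and standard deviation are themselves pulled upward by extreme scores, the z-score multiplier usually needs more tuning than the quartile-based one.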
Conclusion
Thresholding is a crucial step in the anomaly detection process. It provides a clear boundary between what is considered normal and what is considered anomalous, enabling the effective identification of potential issues such as cyber threats or system errors. The choice of thresholding method depends on the specific use case and the characteristics of the data. Regardless of the method used, the threshold should be set so that it balances the cost of missed anomalies (false negatives) against the cost of false alarms (false positives).