This article explores the use of Double Median Absolute Deviation (Double MAD) for [anomaly detection](/learning/threat-detection/what-is-anomaly-detection/) in time series
data, particularly in skewed or non-symmetric distributions. Double MAD, which calculates two median absolute
deviations — one for data below the median and one for data above — provides a more nuanced approach than traditional
MAD, allowing for accurate detection of anomalies even in skewed data distributions. We also delve into its application
in identifying slow abuse, like bots, by catching lower range anomalies. However, it's important to note Double MAD's
limitations such as not capturing seasonal data shape and trends over time. A comparison is also drawn with the Z-score
method, highlighting that the choice between the two depends on the nature of your data. The article provides insights
into the practical implementation of Double MAD and its potential to improve your data analysis toolkit.
Unveiling the Power of Double MAD in Anomaly Detection
As we tread deeper into the digital age, the importance of leveraging data for informed decision-making is becoming increasingly apparent. Anomaly detection in time-series data is one such vital application. By identifying patterns that deviate from the norm, businesses can proactively take measures to address potential issues or leverage unexpected opportunities.
One powerful technique for anomaly detection is the Median Absolute Deviation (MAD) and, more specifically, its extension, the Double MAD. This article will delve into the world of Double MAD, exploring its utility for anomaly detection in time series data and its application in identifying anomalous clients.
Understanding MAD and Double MAD
MAD is a robust measure of variability that is far less sensitive to outliers than the standard deviation. It is the median of the absolute deviations from the data's median, so a handful of extreme values barely move it. This gives a more reliable picture of 'normal' behaviour in datasets that contain outliers or heavy tails.
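As a minimal sketch of the definition above (not the article's own implementation), MAD can be computed in Rust in two steps: take the median, then take the median of the absolute deviations from it. The function names `median` and `mad` are illustrative, and the code assumes the input contains no NaN values.

```rust
/// Median of a slice; returns None for empty input.
/// Assumption: the input contains no NaN values.
fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        // Even length: average the two middle values.
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    } else {
        Some(sorted[mid])
    }
}

/// Median absolute deviation: the median of |x - median(x)|.
fn mad(values: &[f64]) -> Option<f64> {
    let center = median(values)?;
    let deviations: Vec<f64> = values.iter().map(|x| (x - center).abs()).collect();
    median(&deviations)
}
```

For the values `[1.0, 2.0, 3.0, 4.0, 100.0]` the median is 3 and the absolute deviations are 2, 1, 0, 1 and 97, so `mad` returns `Some(1.0)`. The extreme value 100 has no effect on the result, which is the robustness the paragraph above describes.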
Double MAD extends MAD by calculating two values: one from the data below the median and another from the data above it. Splitting the data this way gives each side of the distribution its own measure of spread, which improves detection on asymmetric data, a common trait of real-world time series.
Why Double MAD?
While MAD provides a robust way to understand the 'normal' range of a dataset, it assumes a symmetric distribution of data around the median, which may not always hold true. This is where Double MAD shines, offering an enhanced anomaly detection process for skewed or asymmetric datasets.
In time-series analysis, especially with 24-hour cycles like web traffic or server usage, patterns can exhibit seasonality and trend components. These patterns can often be asymmetric, making Double MAD a valuable tool for capturing the variability in different parts of the data.
Using Double MAD in Anomaly Detection
The Double MAD implementation provided uses Rust, a systems programming language known for its speed and memory safety. The code calculates the lower and upper MAD values, along with their respective thresholds. Anomalies can then be detected by comparing each data point to these thresholds.
An anomaly is defined as a data point that deviates significantly from the expected range. If a data point falls below the lower MAD threshold or above the upper one, it can be flagged as an anomaly. This approach is especially effective when handling datasets with high variability or extreme values.
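The thresholding step described above can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not the article's actual implementation: the names `Thresholds`, `double_mad_thresholds` and `find_anomalies` are hypothetical, values equal to the median are counted in both halves (one common convention), the multiplier `k` is chosen by the caller, and the input is assumed to contain no NaN values.

```rust
/// Lower and upper anomaly thresholds derived from Double MAD.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Thresholds {
    lower: f64,
    upper: f64,
}

/// Median of an already sorted, non-empty slice.
fn median_of_sorted(sorted: &[f64]) -> f64 {
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    }
}

/// Computes median - k * MAD-lower and median + k * MAD-upper.
/// Returns None for empty input. Assumption: no NaN values.
fn double_mad_thresholds(values: &[f64], k: f64) -> Option<Thresholds> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let center = median_of_sorted(&sorted);

    // Deviations of the points at or below the median, and at or above it.
    let mut lower_dev: Vec<f64> =
        sorted.iter().filter(|&&x| x <= center).map(|x| center - x).collect();
    let mut upper_dev: Vec<f64> =
        sorted.iter().filter(|&&x| x >= center).map(|x| x - center).collect();
    lower_dev.sort_by(|a, b| a.total_cmp(b));
    upper_dev.sort_by(|a, b| a.total_cmp(b));

    let mad_lower = median_of_sorted(&lower_dev);
    let mad_upper = median_of_sorted(&upper_dev);
    Some(Thresholds {
        lower: center - k * mad_lower,
        upper: center + k * mad_upper,
    })
}

/// Returns (index, value) for every point outside the thresholds.
fn find_anomalies(values: &[f64], t: Thresholds) -> Vec<(usize, f64)> {
    values
        .iter()
        .copied()
        .enumerate()
        .filter(|&(_, x)| x < t.lower || x > t.upper)
        .collect()
}
```

With the made-up series `[1, 10, 11, 12, 12, 13, 14, 16, 20, 30, 90]` and `k = 3.0`, the median is 13, MAD-lower is 1.5 and MAD-upper is 5, giving thresholds of 8.5 and 28. The points 1, 30 and 90 fall outside that range and are returned as anomalies. The choice of `k` controls how strict the detector is and would need tuning for real data.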
Double MAD for Anomalous Client Detection
Beyond time-series data, Double MAD can also be instrumental in identifying anomalous behaviour among clients. By comparing each client's behaviour against the Double MAD of the time-series data, one can pinpoint clients that deviate from the norm.
For instance, in the context of web service usage, an anomalous client might be one that is sending an unusually high or low number of requests. By using Double MAD, you can effectively flag such outliers and take appropriate action, like investigating potential misuse or reaching out to understand and address any issues they may be facing.
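A rough Rust sketch of this idea, assuming each client has already been reduced to a single request count for the period under review. The client identifiers, the `Flag` enum and the `flag_clients` function are hypothetical names introduced for illustration, and the multiplier `k` is an assumption the caller must tune.

```rust
/// Why a client was flagged.
#[derive(Debug, PartialEq)]
enum Flag {
    TooLow,
    TooHigh,
}

/// Sorts the values and returns their median. Panics on empty input.
fn sorted_median(mut v: Vec<f64>) -> f64 {
    assert!(!v.is_empty(), "median of empty input");
    v.sort_by(|a, b| a.total_cmp(b));
    let mid = v.len() / 2;
    if v.len() % 2 == 0 {
        (v[mid - 1] + v[mid]) / 2.0
    } else {
        v[mid]
    }
}

/// Flags clients whose request count lies outside
/// [median - k * MAD-lower, median + k * MAD-upper].
/// Assumption: counts contain no NaN values.
fn flag_clients<'a>(counts: &[(&'a str, f64)], k: f64) -> Vec<(&'a str, Flag)> {
    if counts.is_empty() {
        return Vec::new();
    }
    let values: Vec<f64> = counts.iter().map(|&(_, c)| c).collect();
    let center = sorted_median(values.clone());
    let mad_lower = sorted_median(
        values.iter().filter(|&&c| c <= center).map(|c| center - c).collect(),
    );
    let mad_upper = sorted_median(
        values.iter().filter(|&&c| c >= center).map(|c| c - center).collect(),
    );
    let (lower, upper) = (center - k * mad_lower, center + k * mad_upper);

    counts
        .iter()
        .filter_map(|&(id, c)| {
            if c < lower {
                Some((id, Flag::TooLow))
            } else if c > upper {
                Some((id, Flag::TooHigh))
            } else {
                None
            }
        })
        .collect()
}
```

For the invented counts `a: 100, b: 110, c: 95, d: 105, e: 120, crawler: 2, scraper: 900` and `k = 3.0`, the median is 105, MAD-lower is 7.5 and MAD-upper is 10, so the accepted range is 82.5 to 135. `crawler` is flagged as `TooLow` and `scraper` as `TooHigh`, while the other five clients pass.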
Detecting Lower-Range Anomalies: A Case of Slow Abuse
An interesting application of Double MAD is in detecting lower-range anomalies, a pattern often associated with slow abuse such as low-rate bots or low-and-slow variants of Distributed Denial of Service (DDoS) attacks. Unlike high-volume attacks, this kind of abuse is characterised by an unusually low frequency of activity that stays consistent over a prolonged period. Such consistent, low-level activity can fly under the radar of anomaly detection systems that only watch for spikes, which makes it a potentially harmful threat.
Because Double MAD derives a separate lower threshold from the lower MAD, it can effectively detect these lower-range anomalies and provide early warning of slow abuse. This ability to detect both high and low anomalies makes Double MAD a versatile and powerful tool for anomaly detection.
The Math Behind Double MAD
To illustrate the power of Double MAD, let's consider a dataset from a right-skewed distribution. Applying the conventional MAD approach might lead to false positives where normal data points are marked as outliers. This is because MAD uses a symmetric interval around the median, which doesn't account for the skewed nature of our data.
With Double MAD, we instead calculate two MADs — one for the data below the median (MAD-lower) and another for the data above (MAD-upper). Outlier thresholds are then defined using these two MADs. The lower threshold is calculated as the median minus a multiplier (k) times MAD-lower. The upper threshold is the median plus k times MAD-upper.
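Written out, the definitions in the paragraph above take the following form. The symbols m, T-lower and T-upper are notation introduced here for illustration, and counting values equal to the median in both halves is an assumed convention:

```latex
\begin{aligned}
m &= \operatorname{median}(x_1, \dots, x_n) \\
\mathrm{MAD}_{\text{lower}} &= \operatorname{median}\{\, |x_i - m| : x_i \le m \,\} \\
\mathrm{MAD}_{\text{upper}} &= \operatorname{median}\{\, |x_i - m| : x_i \ge m \,\} \\
T_{\text{lower}} &= m - k \cdot \mathrm{MAD}_{\text{lower}} \\
T_{\text{upper}} &= m + k \cdot \mathrm{MAD}_{\text{upper}}
\end{aligned}
```

A point x_i is flagged as an anomaly when x_i < T-lower or x_i > T-upper. For example, with m = 13, MAD-lower = 1.5, MAD-upper = 5 and k = 3, the thresholds are 13 - 4.5 = 8.5 and 13 + 15 = 28.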
This approach takes into account the asymmetric nature of our data, thereby providing more accurate anomaly detection. For example, in a right-skewed distribution, the upper threshold widens to accommodate the long right tail, so only the extreme right-tail values are identified as outliers. The lower threshold, meanwhile, stays tight around the compact left side, so unusually low values are still caught.
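The difference between the two methods can be seen on a small right-skewed sample. The sketch below is illustrative only: the data is made up, the function names `mad_bounds`, `double_mad_bounds` and `outliers` are hypothetical, and `k = 3.0` is an assumed multiplier.

```rust
/// Sorts the values and returns their median. Panics on empty input.
/// Assumption: no NaN values.
fn median(mut v: Vec<f64>) -> f64 {
    assert!(!v.is_empty(), "median of empty input");
    v.sort_by(|a, b| a.total_cmp(b));
    let mid = v.len() / 2;
    if v.len() % 2 == 0 {
        (v[mid - 1] + v[mid]) / 2.0
    } else {
        v[mid]
    }
}

/// Symmetric interval from plain MAD: median -/+ k * MAD.
fn mad_bounds(values: &[f64], k: f64) -> (f64, f64) {
    let m = median(values.to_vec());
    let mad = median(values.iter().map(|x| (x - m).abs()).collect());
    (m - k * mad, m + k * mad)
}

/// Asymmetric interval from Double MAD:
/// median - k * MAD-lower, median + k * MAD-upper.
fn double_mad_bounds(values: &[f64], k: f64) -> (f64, f64) {
    let m = median(values.to_vec());
    let lower = median(values.iter().filter(|&&x| x <= m).map(|x| m - x).collect());
    let upper = median(values.iter().filter(|&&x| x >= m).map(|x| x - m).collect());
    (m - k * lower, m + k * upper)
}

/// Values that fall outside the given (low, high) bounds.
fn outliers(values: &[f64], (lo, hi): (f64, f64)) -> Vec<f64> {
    values.iter().copied().filter(|&x| x < lo || x > hi).collect()
}
```

For the sample `[10, 11, 12, 12, 13, 14, 16, 20, 30, 90]` with `k = 3.0`, plain MAD yields the symmetric interval 6 to 21 and flags both 30 and 90. Double MAD yields 9 to 33 and flags only 90. The upper bound has moved out to follow the long right tail, while the lower bound has moved in towards the tightly packed low values. Whether 30 should count as normal is a judgment call that depends on the data and the chosen `k`.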
Wrapping Up
In an era of big data, being able to accurately detect anomalies in time series data is increasingly vital. The Double MAD approach provides a robust, nuanced method for achieving this, allowing businesses to better understand their data, spot potential issues early, and ultimately make more informed decisions.
Whether you're monitoring web traffic, server usage, or client behaviour, leveraging Double MAD can offer valuable insights and help ensure your operations continue to run smoothly. The ability to detect both high and low anomalies makes it especially powerful, providing protection against potential threats like slow abuse.
Understanding and implementing Double MAD can be a game changer in your data analysis toolkit, providing a more holistic view of your data and enabling you to stay one step ahead of potential anomalies.