
Anomalous Behavior Detection Algorithms

Anomaly detection is a critical problem that has been studied across diverse research areas and application domains. This article offers a structured, comprehensive overview of selected anomaly detection algorithms, written for data scientists, data analysts, and machine learning specialists.

Concept of Anomaly Detection

An unexpected change within a time period that diverges sharply from the other observations can be regarded as abnormal behavior. In other words, anomaly detection is the task of identifying the outliers in a dataset: data points that behave considerably differently from the rest and therefore do not conform to the normal profile.

Anomalous points may be produced by errors in the data; however, following Hawkins, they may also point to a previously unidentified or hidden process or behavior, whether historical or ongoing.

As the volume of publicly available data grows, outlier detection algorithms are adapted to run on these datasets and surface unusual patterns. For instance, a “suspiciously high” number of login attempts might indicate a possible cyber intrusion, and a considerable increase in incoming network traffic can point to malicious activity in a network. What these activities share is that they are “interesting” and “unusual” to data scientists and data analysts. This “interestingness”, or real-life relevance, of anomalies is an essential element of anomaly detection.

Types of Anomalies

The literature distinguishes three kinds of anomalies.

Figure 1. Types & Examples of Anomaly Detection. (Image by the author)

Descriptions can be found below:


1. Point Anomaly: A single item in a dataset whose attributes differ markedly from those of the other items.

Figure 2. The point anomaly is marked with red. (Image by the author)

2. Contextual Anomaly: An item that is anomalous only with respect to a specific context. This kind of anomaly may go unrecognized when the contextual information is absent.

Figure 3. The contextual anomalous point that can be explained in its context. (Image by the author)

3. Collective Anomaly: A group of related instances that is anomalous as a whole, even though the individual instances may not be anomalous on their own. The combined effect of the events is what is considered when analyzing outlier behavior.

Figure 4. Collective anomaly highlighted by the red line. (Image by the author)
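The three categories can be made concrete with a small sketch on synthetic data. Everything in it is an assumption made for illustration: the login counts, temperature readings, and signal values are invented, the helper names are hypothetical, and the detectors (a modified z-score based on the median absolute deviation, and a rolling-window flatness check) are simple stand-ins rather than the algorithms covered later in this article.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Indices whose modified z-score (median/MAD based) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []
    # 0.6745 and 3.5 are the constants commonly used with the modified z-score.
    return [i for i, v in enumerate(values) if 0.6745 * abs(v - med) / mad > threshold]


def contextual_outliers(readings, threshold=3.5):
    """Apply the same test separately inside each context (here: the month)."""
    by_context = {}
    for idx, (context, value) in enumerate(readings):
        by_context.setdefault(context, []).append((idx, value))
    flagged = []
    for members in by_context.values():
        local = mad_outliers([v for _, v in members], threshold)
        flagged.extend(members[i][0] for i in local)
    return sorted(flagged)


def flat_runs(signal, window=4, tolerance=0.5):
    """Start indices of windows in which the signal barely moves."""
    starts = []
    for start in range(len(signal) - window + 1):
        chunk = signal[start:start + window]
        if max(chunk) - min(chunk) <= tolerance:
            starts.append(start)
    return starts


# 1. Point anomaly: one login count is far from all the others.
logins = [12, 15, 11, 14, 13, 240, 12, 16, 13, 14]
point_hits = mad_outliers(logins)

# 2. Contextual anomaly: 29 degrees is ordinary in July but not in January.
readings = [("Jan", 2), ("Jan", 4), ("Jan", 3), ("Jan", 1), ("Jan", 29),
            ("Jan", 3), ("Jan", 2), ("Jan", 3),
            ("Jul", 28), ("Jul", 30), ("Jul", 29), ("Jul", 31), ("Jul", 27), ("Jul", 30)]
global_hits = mad_outliers([value for _, value in readings])  # context ignored
context_hits = contextual_outliers(readings)                  # context used

# 3. Collective anomaly: every value lies in the usual range, but the flat run does not.
signal = [5, 9, 5, 9, 5, 9, 7, 7, 7, 7, 7, 9, 5, 9, 5]
pointwise_hits = mad_outliers(signal)
collective_hits = flat_runs(signal)

print(point_hits)       # [5]
print(global_hits)      # []
print(context_hits)     # [4]
print(pointwise_hits)   # []
print(collective_hits)  # [6, 7]
```

The second and third cases show why the distinction matters: a detector that looks at values one by one, without context or ordering, reports nothing for either of them.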

Table of Contents

1. Statistical Approach
1.1. Minimum Covariance Determinant (MCD)
1.2. Principal Component Analysis (PCA)

2. Distance-based Approach
2.1. Local Outlier Factor (LOF)
2.2. Novelty Detection Local Outlier Factor (ND LOF)
2.3. Mahalanobis Distance (MDist)

3. Density-based Approach
3.1. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
3.2. Ordering Points To Identify the Clustering Structure (OPTICS)

4. Isolation-based Approach
4.1. Isolation Forest (iForest)

5. Classification-based Approach
5.1. One-Class SVM

1. STATISTICAL APPROACH

1.1. Minimum Covariance Determinant (MCD)

Minimum Covariance Determinant (MCD) is a robust covariance estimator intended for Gaussian-distributed data. It searches for the subset of a specified number of data points whose covariance matrix has the smallest determinant.
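The definition can be followed literally on a tiny dataset. The sketch below is an illustration only, under stated assumptions: the eight 2-D points are invented, the function names are hypothetical, and the exhaustive search over every subset is feasible only because the data is so small (practical implementations use faster approximate searches and apply correction factors that are omitted here). Points are then flagged by their squared Mahalanobis distance to the estimate, using a 99% chi-square cutoff as one plausible choice.

```python
import math
from itertools import combinations


def mean_and_cov(points):
    """Sample mean and 2x2 sample covariance, returned as (sxx, sxy, syy)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def determinant(cov):
    sxx, sxy, syy = cov
    return sxx * syy - sxy ** 2


def exact_mcd(points, h):
    """Check every h-point subset; keep the one with the smallest covariance determinant."""
    def subset_det(indices):
        return determinant(mean_and_cov([points[i] for i in indices])[1])

    best = min(combinations(range(len(points)), h), key=subset_det)
    center, cov = mean_and_cov([points[i] for i in best])
    return set(best), center, cov


def mahalanobis_sq(point, center, cov):
    """Squared Mahalanobis distance, using the closed-form inverse of a 2x2 matrix."""
    sxx, sxy, syy = cov
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / determinant(cov)


# Hypothetical data: six well-behaved points followed by two gross outliers.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.4), (0.3, 0.8),
          (8.0, -5.0), (7.0, 9.0)]

support, center, cov = exact_mcd(points, h=6)

# For 2-D Gaussian data the squared distance is approximately chi-square distributed
# with 2 degrees of freedom, whose 99% quantile has the closed form -2 * ln(0.01).
cutoff = -2 * math.log(0.01)
outliers = [i for i, p in enumerate(points) if mahalanobis_sq(p, center, cov) > cutoff]

print(sorted(support))  # [0, 1, 2, 3, 4, 5]
print(outliers)         # [6, 7]
```

Because the two outliers never enter the selected subset, they cannot inflate the covariance estimate, which is what keeps their distances large enough to be flagged.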

Because a covariance matrix geometrically describes an ellipsoid, the MCD algorithm fits a single symmetric, elliptical region and works best with elliptically symmetric unimodal distributions. It is therefore better suited to detecting outliers in data drawn from a unimodal distribution and is not recommended for multi-modal data. Its performance also degrades as the dataset gets smaller or departs further from unimodality.
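The caution about multi-modal data can be seen in a small experiment. The sketch below is a simplified stand-in, not a reference implementation: it refines a subset with the concentration step (C-step) used by the FastMCD algorithm, but starts from a single deterministic subset (the points nearest the coordinate-wise median) instead of many random ones, and it omits the usual correction factors. The two Gaussian clusters, the seed, and all function names are assumptions made for illustration.

```python
import math
import random
from statistics import median


def mean_and_cov(points):
    """Sample mean and 2x2 sample covariance, returned as (sxx, sxy, syy)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(point, center, cov):
    """Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix."""
    sxx, sxy, syy = cov
    dx, dy = point[0] - center[0], point[1] - center[1]
    det = sxx * syy - sxy ** 2
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def c_step_mcd(points, h, max_iter=100):
    """Approximate MCD: repeatedly keep the h points closest to the current estimate."""
    n = len(points)
    # Assumed starting subset: the h points nearest the coordinate-wise median.
    mx = median(x for x, _ in points)
    my = median(y for _, y in points)
    order = sorted(range(n), key=lambda i: (points[i][0] - mx) ** 2 + (points[i][1] - my) ** 2)
    support = sorted(order[:h])
    for _ in range(max_iter):
        center, cov = mean_and_cov([points[i] for i in support])
        order = sorted(range(n), key=lambda i: mahalanobis_sq(points[i], center, cov))
        refined = sorted(order[:h])
        if refined == support:  # fixed point reached
            break
        support = refined
    center, cov = mean_and_cov([points[i] for i in support])
    return support, center, cov


# Hypothetical bimodal data: two legitimate clusters of different sizes.
rng = random.Random(0)
big = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
small = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(25)]
points = big + small

h = (len(points) + 2 + 1) // 2  # roughly half the data, a common default for 2-D
support, center, cov = c_step_mcd(points, h)

cutoff = -2 * math.log(0.01)  # 99% chi-square quantile for 2 degrees of freedom
flagged = {i for i, p in enumerate(points) if mahalanobis_sq(p, center, cov) > cutoff}

print("support drawn only from the big cluster:", all(i < 60 for i in support))
print("small-cluster points flagged:", sum(1 for i in flagged if i >= 60), "of", len(small))
```

The single ellipse settles on the larger cluster, so every point of the smaller cluster is reported as an outlier even though it belongs to a perfectly regular second mode. Some points on the fringe of the larger cluster may be flagged as well, since the uncorrected estimate is computed from only the most central half of the data.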

For the formulation and detailed parameter explanations, see this article.
