Anomaly detection
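In data mining, anomaly detection is the identification of items, events or observations that do not conform to an expected pattern or to the other items in a data set.[1] Anomalous items typically translate into some kind of problem, such as bank fraud, a structural defect, a medical problem or an error in a text. Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions.[2]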
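In particular, when detecting abuse and network intrusion, the interesting objects are often not rare objects but unexpected bursts of activity. This pattern does not fit the common statistical definition of an outlier as a rare object, so many anomaly detection methods (unsupervised methods in particular) fail on such data unless it has been aggregated appropriately. A cluster analysis algorithm, by contrast, may be able to detect the micro-clusters formed by these patterns.[3]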
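There are three broad categories of anomaly detection methods.[1] Unsupervised anomaly detection methods detect anomalies in unlabeled test data, under the assumption that the majority of instances in the data set are normal, by looking for the instances that fit the rest of the data least well. Supervised anomaly detection methods require a data set whose instances have been labeled "normal" and "abnormal", and involve training a classifier (the key difference from many other statistical classification problems is the inherently unbalanced nature of anomaly detection). Semi-supervised anomaly detection methods build a model of normal behavior from a given normal training data set, and then test the likelihood that a test instance was generated by the learned model.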
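The semi-supervised case can be illustrated with a minimal sketch. It assumes one-dimensional data and a Gaussian model of normal behavior; the function names, the sample values and the log-likelihood cutoff of -6.0 are illustrative choices, not taken from the cited literature.

```python
import math
import statistics

def fit_normal_model(train):
    """Fit a 1-D Gaussian to training data assumed to contain only normal behavior."""
    return statistics.mean(train), statistics.stdev(train)

def log_likelihood(x, model):
    """Log-density of x under the fitted Gaussian; lower values mean less likely."""
    mu, sigma = model
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def is_anomaly(x, model, min_log_likelihood=-6.0):
    """Flag x when the model considers it too unlikely (cutoff is an assumption)."""
    return log_likelihood(x, model) < min_log_likelihood

# Hypothetical training set containing only normal readings.
normal_train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
model = fit_normal_model(normal_train)

print(is_anomaly(10.1, model))  # a reading close to the training data
print(is_anomaly(12.0, model))  # a reading far from the training data
```

In practice the cutoff would be tuned on held-out data, and the Gaussian could be replaced by any density or reconstruction model that yields a score for unseen instances.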
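Applications
Anomaly detection is used in a variety of domains, such as intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, and detection of ecosystem disturbances. It is often used in preprocessing to remove anomalous data from a data set. In supervised learning, removing the anomalous data from the data set often yields a statistically significant increase in accuracy.[4][5]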
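As a rough illustration of the preprocessing use, the sketch below drops values that fall outside Tukey's fences (1.5 times the interquartile range beyond the quartiles). This rule is a simple stand-in chosen for the example, not the instance-filtering method of the cited studies, and the helper name and sample values are hypothetical.

```python
import statistics

def remove_outliers_iqr(values, factor=1.5):
    """Keep only values inside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical feature column with one corrupted measurement.
raw = [10, 12, 11, 13, 12, 11, 95]
cleaned = remove_outliers_iqr(raw)
print(cleaned)
```

The cleaned column would then be passed to the learning algorithm in place of the raw one.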
Popular techniques
Several anomaly detection techniques have been proposed in the literature. Some of the popular ones are:
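- Density-based techniques (nearest neighbor,[6][7][8] local outlier factor,[9] and many more variations of this concept[10]).
- Subspace-based[11] and correlation-based[12] outlier detection for high-dimensional data.[13]
- One-class support vector machines.[14]
- Replicator neural networks.[15]
- Cluster analysis-based outlier detection.[16][17]
- Deviations from association rules and frequent itemsets.
- Fuzzy logic-based outlier detection.
- Ensemble techniques, using feature bagging,[18][19] score normalization[20][21] and different sources of diversity.[22][23]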
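The performance of the different methods depends heavily on the data set and the parameters, and no method shows much systematic advantage over the others when compared across many data sets and parameter settings.[24][25]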
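One of the simplest neighbor-based scores in the list above can be sketched as follows: each point is scored by the distance to its k-th nearest neighbor, so isolated points receive large scores. This is a brute-force illustration only; the function name, the sample points and the choice k=2 are assumptions made for the example.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Four points in a tight group and one isolated point (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_outlier_scores(points, k=2)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(most_anomalous, scores)
```

The result depends on k, which is one instance of the parameter sensitivity noted above. The quadratic cost of this version is the reason practical implementations rely on index structures.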
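Application to data security
Anomaly detection was proposed for intrusion detection systems (IDS) by Dorothy Denning in 1986.[26] Anomaly detection for IDS is normally accomplished with thresholds and statistics, but it can also be done with soft computing and inductive learning.[27] The types of statistics proposed by 1999 included profiles of users, workstations, networks, remote hosts, groups of users, and programs, based on frequencies, means, variances, covariances and standard deviations.[28] The counterpart of anomaly detection in intrusion detection is misuse detection.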
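A threshold-and-statistics check of this kind can be sketched as a per-subject profile that tracks the mean and standard deviation of an event count and flags observations that deviate too far. The class name, the three-standard-deviation threshold and the sample counts are hypothetical, and the sketch is not a reconstruction of any particular IDS.

```python
import statistics

class FrequencyProfile:
    """Hypothetical profile of an event count for one subject (e.g. logins per hour)."""

    def __init__(self, max_deviations=3.0):
        self.history = []
        self.max_deviations = max_deviations

    def observe(self, count):
        """Return True if count is abnormal; only normal counts update the profile."""
        abnormal = False
        if len(self.history) >= 2:
            mean = statistics.mean(self.history)
            std = statistics.stdev(self.history)
            # Guard against a zero standard deviation in a constant history.
            abnormal = abs(count - mean) > self.max_deviations * max(std, 1e-9)
        if not abnormal:
            self.history.append(count)
        return abnormal

profile = FrequencyProfile()
flags = [profile.observe(c) for c in [4, 5, 6, 5, 4, 6, 5, 40]]
print(flags)
```

Excluding flagged observations from the history keeps a burst of abnormal activity from shifting the profile, at the cost of adapting more slowly to legitimate changes in behavior.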
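Software
- ELKI is an open-source Java data mining toolkit that contains several anomaly detection algorithms, as well as index acceleration for them.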
See also
- Outliers in statistics
- Change detection
- Novelty detection
- Hierarchical temporal memory
References
- Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey (PDF). ACM Computing Surveys. 2009, 41 (3): 1–58 [2016-09-13]. doi:10.1145/1541880.1541882. (Archived from the original (PDF) on 2021-05-06).
- Hodge, V. J.; Austin, J. A Survey of Outlier Detection Methodologies (PDF). Artificial Intelligence Review. 2004, 22 (2): 85–126 [2016-09-13]. doi:10.1007/s10462-004-4304-y. (Archived from the original (PDF) on 2015-06-22).
- Dokas, Paul; Ertoz, Levent; Kumar, Vipin; Lazarevic, Aleksandar; Srivastava, Jaideep; Tan, Pang-Ning. Data mining for network intrusion detection (PDF). Proceedings NSF Workshop on Next Generation Data Mining. 2002 [2016-09-13]. (Archived from the original (PDF) on 2015-09-23).
- Tomek, Ivan. An Experiment with the Edited Nearest-Neighbor Rule. IEEE Transactions on Systems, Man, and Cybernetics. 1976, 6 (6): 448–452. doi:10.1109/TSMC.1976.4309523.
- Smith, M. R.; Martinez, T. Improving classification accuracy by identifying and removing instances that should be misclassified. The 2011 International Joint Conference on Neural Networks (PDF). 2011: 2690 [2016-09-13]. ISBN 978-1-4244-9635-8. doi:10.1109/IJCNN.2011.6033571. (Archived (PDF) from the original on 2016-11-09).
- Knorr, E. M.; Ng, R. T.; Tucakov, V. Distance-based outliers: Algorithms and applications. The VLDB Journal the International Journal on Very Large Data Bases. 2000, 8 (3–4): 237–253. doi:10.1007/s007780050006.
- Ramaswamy, S.; Rastogi, R.; Shim, K. Efficient algorithms for mining outliers from large data sets. Proceedings of the 2000 ACM SIGMOD international conference on Management of data – SIGMOD '00: 427. 2000. ISBN 1-58113-217-4. doi:10.1145/342009.335437.
- Angiulli, F.; Pizzuti, C. Fast Outlier Detection in High Dimensional Spaces. Principles of Data Mining and Knowledge Discovery. Lecture Notes in Computer Science 2431: 15. 2002. ISBN 978-3-540-44037-6. doi:10.1007/3-540-45681-3_2.
- Breunig, M. M.; Kriegel, H.-P.; Ng, R. T.; Sander, J. LOF: Identifying Density-based Local Outliers (PDF). Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. SIGMOD. 2000: 93–104 [2016-09-13]. ISBN 1-58113-217-4. doi:10.1145/335191.335388. (Archived from the original (PDF) on 2015-09-23).
- Schubert, E.; Zimek, A.; Kriegel, H.-P. Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection. Data Mining and Knowledge Discovery. 2012, 28: 190–237. doi:10.1007/s10618-012-0300-z.
- Kriegel, H. P.; Kröger, P.; Schubert, E.; Zimek, A. Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data. Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science 5476: 831. 2009. ISBN 978-3-642-01306-5. doi:10.1007/978-3-642-01307-2_86.
- Kriegel, H. P.; Kroger, P.; Schubert, E.; Zimek, A. Outlier Detection in Arbitrarily Oriented Subspaces. 2012 IEEE 12th International Conference on Data Mining: 379. 2012. ISBN 978-1-4673-4649-8. doi:10.1109/ICDM.2012.21.
- Zimek, A.; Schubert, E.; Kriegel, H.-P. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining. 2012, 5 (5): 363–387. doi:10.1002/sam.11161.
- Schölkopf, B.; Platt, J. C.; Shawe-Taylor, J.; Smola, A. J.; Williamson, R. C. Estimating the Support of a High-Dimensional Distribution. Neural Computation. 2001, 13 (7): 1443–71. PMID 11440593. doi:10.1162/089976601750264965.
- Hawkins, Simon; He, Hongxing; Williams, Graham; Baxter, Rohan. Outlier Detection Using Replicator Neural Networks. Data Warehousing and Knowledge Discovery. Lecture Notes in Computer Science 2454. 2002: 170–180. ISBN 978-3-540-44123-6. doi:10.1007/3-540-46145-0_17.
- He, Z.; Xu, X.; Deng, S. Discovering cluster-based local outliers. Pattern Recognition Letters. 2003, 24 (9–10): 1641–1650. doi:10.1016/S0167-8655(03)00003-5.
- Campello, R. J. G. B.; Moulavi, D.; Zimek, A.; Sander, J. Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM Transactions on Knowledge Discovery from Data. 2015, 10 (1): 5:1–51. doi:10.1145/2733381.
- Lazarevic, A.; Kumar, V. Feature bagging for outlier detection. Proc. 11th ACM SIGKDD international conference on Knowledge Discovery in Data Mining. 2005: 157–166. ISBN 1-59593-135-X. doi:10.1145/1081870.1081891.
- Nguyen, H. V.; Ang, H. H.; Gopalkrishnan, V. Mining Outliers with Ensemble of Heterogeneous Detectors on Random Subspaces. Database Systems for Advanced Applications. Lecture Notes in Computer Science 5981: 368. 2010. ISBN 978-3-642-12025-1. doi:10.1007/978-3-642-12026-8_29.
- Kriegel, H. P.; Kröger, P.; Schubert, E.; Zimek, A. Interpreting and Unifying Outlier Scores. Proceedings of the 2011 SIAM International Conference on Data Mining: 13–24. 2011 [2016-09-13]. ISBN 978-0-89871-992-5. doi:10.1137/1.9781611972818.2. (Archived from the original (PDF) on 2019-06-12).
- Schubert, E.; Wojdanowski, R.; Zimek, A.; Kriegel, H. P. On Evaluation of Outlier Rankings and Outlier Scores. Proceedings of the 2012 SIAM International Conference on Data Mining: 1047–1058. 2012 [2016-09-13]. ISBN 978-1-61197-232-0. doi:10.1137/1.9781611972825.90. (Archived from the original (PDF) on 2019-06-16).
- Zimek, A.; Campello, R. J. G. B.; Sander, J. R. Ensembles for unsupervised outlier detection. ACM SIGKDD Explorations Newsletter. 2014, 15: 11–22. doi:10.1145/2594473.2594476.
- Zimek, A.; Campello, R. J. G. B.; Sander, J. R. Data perturbation for outlier detection ensembles. Proceedings of the 26th International Conference on Scientific and Statistical Database Management – SSDBM '14: 1. 2014. ISBN 978-1-4503-2722-0. doi:10.1145/2618243.2618257.
- Campos, Guilherme O.; Zimek, Arthur; Sander, Jörg; Campello, Ricardo J. G. B.; Micenková, Barbora; Schubert, Erich; Assent, Ira; Houle, Michael E. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery. 2016, 30 (4): 891. ISSN 1384-5810. doi:10.1007/s10618-015-0444-8.
- Anomaly detection benchmark data repository (archived copy at the Internet Archive) of the Ludwig-Maximilians-Universität München; Mirror (archived copy at the Internet Archive) at University of São Paulo.
- Denning, D. E. An Intrusion-Detection Model (PDF). IEEE Transactions on Software Engineering. 1987, SE–13 (2): 222–232 [2016-09-13]. doi:10.1109/TSE.1987.232894. CiteSeerX: 10.1.1.102.5127. (Archived from the original (PDF) on 2015-06-22).
- Teng, H. S.; Chen, K.; Lu, S. C. Adaptive real-time anomaly detection using inductively generated sequential patterns (PDF). Proceedings of the IEEE Computer Society Symposium on Research in Security and Privacy. 1990: 278–284. ISBN 0-8186-2060-9. doi:10.1109/RISP.1990.63857. [permanent dead link]
- Jones, Anita K.; Sielken, Robert S. Computer System Intrusion Detection: A Survey. Technical Report, Department of Computer Science, University of Virginia, Charlottesville, VA. 1999. CiteSeerX: 10.1.1.24.7802.