Molecule mining

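Molecule mining is data mining applied to molecules. Because molecules can be represented as molecular graphs, it is closely related to graph mining and structured data mining. The main problem is how to represent molecules while discriminating between data instances. One approach is to use chemical similarity metrics, which have a long tradition in cheminformatics.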

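The typical way to compute chemical similarity is to use chemical fingerprints, but this discards the underlying information about the molecular topology. Mining the molecular graphs directly avoids this problem. The inverse QSAR problem is likewise relevant to vectorial mappings.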
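The loss of topology can be illustrated with a minimal Python sketch. The fingerprint used here (the set of bonded element pairs) is a toy feature set invented for this illustration, not a standard fingerprint, and the function names `bond_fingerprint` and `tanimoto` are likewise assumptions. Molecules are given as heavy-atom graphs: a list of element labels plus a list of bonds.

```python
def bond_fingerprint(atoms, bonds):
    """Toy fingerprint: the set of unordered element pairs joined by a bond."""
    return frozenset(tuple(sorted((atoms[i], atoms[j]))) for i, j in bonds)


def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient |A & B| / |A | B| of two feature sets."""
    union = fp_a | fp_b
    if not union:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(fp_a & fp_b) / len(union)


# Heavy-atom graphs (hydrogens omitted): (element labels, bonds as index pairs)
n_butane = (["C", "C", "C", "C"], [(0, 1), (1, 2), (2, 3)])
isobutane = (["C", "C", "C", "C"], [(0, 1), (0, 2), (0, 3)])
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
dimethyl_ether = (["C", "O", "C"], [(0, 1), (1, 2)])

# Different topologies, identical toy fingerprints
sim_butanes = tanimoto(bond_fingerprint(*n_butane), bond_fingerprint(*isobutane))
# Isomers that differ in one bonded element pair
sim_c2h6o = tanimoto(bond_fingerprint(*ethanol), bond_fingerprint(*dimethyl_ether))
```

With this toy fingerprint, n-butane and isobutane are indistinguishable even though their graphs differ, which is the kind of information loss that direct graph mining is meant to avoid.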

Coding (molecule i, molecule j ≠ i)

Kernel methods

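Marginalized graph kernel^[1]
Optimal assignment kernel^[2]^[3]^[4]
Pharmacophore kernel^[5]
C++ (and R) implementation combining
- the marginalized graph kernel between labeled graphs^[1]
- extensions of the marginalized kernel^[6]
- Tanimoto kernels^[7]
- graph kernels based on tree patterns^[8]
- kernels based on pharmacophores for the 3D structure of molecules^[5]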
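To give a feel for how a graph kernel compares two molecules directly, here is a minimal Python sketch of a fixed-length walk kernel. It is a deliberate simplification: it counts pairs of walks with identical atom-label sequences, ignores bond labels, and uses no random-walk probabilities, so it is not the marginalized graph kernel of the cited work. The names `adjacency` and `walk_kernel` are assumptions made for this illustration.

```python
import math


def adjacency(n_atoms, bonds):
    """Adjacency lists for an undirected molecular graph."""
    adj = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    return adj


def walk_kernel(mol_a, mol_b, length=1):
    """Count pairs of walks with `length` bonds (one walk per molecule)
    whose atom-label sequences are identical."""
    atoms_a, bonds_a = mol_a
    atoms_b, bonds_b = mol_b
    adj_a = adjacency(len(atoms_a), bonds_a)
    adj_b = adjacency(len(atoms_b), bonds_b)
    # counts[(i, j)]: matching walk pairs ending at atom i of A and atom j of B
    counts = {(i, j): 1
              for i, label_a in enumerate(atoms_a)
              for j, label_b in enumerate(atoms_b) if label_a == label_b}
    for _ in range(length):
        extended = {}
        for (i, j), c in counts.items():
            for u in adj_a[i]:
                for v in adj_b[j]:
                    if atoms_a[u] == atoms_b[v]:
                        extended[(u, v)] = extended.get((u, v), 0) + c
        counts = extended
    return sum(counts.values())


ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
dimethyl_ether = (["C", "O", "C"], [(0, 1), (1, 2)])

k_aa = walk_kernel(ethanol, ethanol)
k_bb = walk_kernel(dimethyl_ether, dimethyl_ether)
k_ab = walk_kernel(ethanol, dimethyl_ether)
# Cosine-style normalization so that a molecule compared with itself scores 1
similarity = k_ab / math.sqrt(k_aa * k_bb)
```

The dynamic programme runs over pairs of atoms, which is equivalent to counting walks in the direct product of the two graphs; a kernel value obtained this way can be passed to any kernel method, such as a support vector machine.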

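Maximum common graph methods

MCS-HSCS^[9] (highest scoring common substructure (HSCS) ranking strategy for a single MCS)
Small Molecule Subgraph Detector (SMSD)^[10] is a Java-based software library for calculating the maximum common subgraph (MCS) between small molecules. The MCS can be used to measure the similarity or distance between two molecules. It is also used for screening drug-like compounds by hitting molecules that share a common subgraph (substructure).^[11]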
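The idea of deriving a similarity from a maximum common subgraph can be sketched in Python as follows. This is a brute-force backtracking search suitable only for very small graphs; it is not the algorithm used by SMSD or MCS-HSCS. It finds a maximum common induced subgraph on atom labels, does not require the result to be connected, and ignores bond orders. The function names and the Tanimoto-style score are assumptions for this illustration.

```python
def max_common_subgraph(mol_a, mol_b):
    """Largest atom mapping {index in A: index in B} that preserves
    atom labels and the presence or absence of bonds (brute force)."""
    atoms_a, bonds_a = mol_a
    atoms_b, bonds_b = mol_b
    edges_a = {frozenset(b) for b in bonds_a}
    edges_b = {frozenset(b) for b in bonds_b}
    best = {}

    def extend(i, mapping, used):
        nonlocal best
        # Prune: even matching every remaining atom cannot beat the best so far
        if len(mapping) + (len(atoms_a) - i) <= len(best):
            return
        if i == len(atoms_a):
            best = dict(mapping)
            return
        for j in range(len(atoms_b)):
            if j in used or atoms_a[i] != atoms_b[j]:
                continue
            # Bonds to already-mapped atoms must agree in both molecules
            if all((frozenset((i, p)) in edges_a) == (frozenset((j, q)) in edges_b)
                   for p, q in mapping.items()):
                mapping[i] = j
                used.add(j)
                extend(i + 1, mapping, used)
                del mapping[i]
                used.remove(j)
        extend(i + 1, mapping, used)  # alternative: leave atom i unmatched

    extend(0, {}, set())
    return best


def mcs_similarity(mol_a, mol_b):
    """Tanimoto-style score: shared atoms / atoms in the union."""
    shared = len(max_common_subgraph(mol_a, mol_b))
    return shared / (len(mol_a[0]) + len(mol_b[0]) - shared)


n_butane = (["C", "C", "C", "C"], [(0, 1), (1, 2), (2, 3)])
isobutane = (["C", "C", "C", "C"], [(0, 1), (0, 2), (0, 3)])
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
dimethyl_ether = (["C", "O", "C"], [(0, 1), (1, 2)])

mapping_butanes = max_common_subgraph(n_butane, isobutane)
score_butanes = mcs_similarity(n_butane, isobutane)
score_c2h6o = mcs_similarity(ethanol, dimethyl_ether)
```

Unlike the toy fingerprint comparison, this score distinguishes n-butane from isobutane, because the mapping must respect which atoms are bonded to which.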

Coding (molecule i)

Molecular query methods

Warmr^[12]^[13]
AGM^[14]^[15]
PolyFARM^[16]
FSG^[17]^[18]
MolFea^[19]
MoFa/MoSS^[20]^[21]^[22]
Gaston^[23]
LAZAR^[24]
ParMol^[25] (contains MoFa, FFSM, gSpan, and Gaston)
optimized gSpan^[26]^[27]
SMIREP^[28]
DMax^[29]
SAm/AIm/RHC^[30]
AFGen^[31]
gRed^[32]
G-Hash^[33]
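Several of the tools listed above mine substructures that occur frequently across a set of molecules. The following Python sketch shows the support-counting idea in its simplest form. It is restricted to linear path fragments of bounded length, whereas miners such as gSpan, MoFa, or Gaston handle general subgraphs with far more efficient search strategies; the function names and the tiny data set are assumptions for this illustration.

```python
from collections import Counter


def path_fragments(atoms, bonds, max_bonds=2):
    """Label sequences of all simple paths with at most `max_bonds` bonds."""
    adj = [[] for _ in atoms]
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    found = set()

    def walk(path):
        labels = tuple(atoms[k] for k in path)
        # A path and its reverse describe the same fragment
        found.add(min(labels, labels[::-1]))
        if len(path) - 1 < max_bonds:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return found


def frequent_fragments(molecules, min_support, max_bonds=2):
    """Fragments present in at least `min_support` molecules, with their support."""
    support = Counter()
    for atoms, bonds in molecules:
        # Each molecule contributes at most 1 to the support of a fragment
        support.update(path_fragments(atoms, bonds, max_bonds))
    return {frag: n for frag, n in support.items() if n >= min_support}


molecules = [
    (["C", "C", "O"], [(0, 1), (1, 2)]),  # ethanol
    (["C", "O", "C"], [(0, 1), (1, 2)]),  # dimethyl ether
    (["C", "C", "N"], [(0, 1), (1, 2)]),  # ethylamine
]
frequent = frequent_fragments(molecules, min_support=2)
```

Support is counted per molecule rather than per occurrence, so a fragment that appears several times in one molecule still counts once. Frequent fragments found this way can then serve as features for classification or as query patterns.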

Methods based on special architectures of neural networks

BPZ^[34]^[35]
ChemNet^[36]
CCS^[37]^[38]
MolNet^[39]
Graph machines^[40]

See also

Molecular Query Language
Chemical graph theory

References

^ H. Kashima, K. Tsuda, A. Inokuchi, Marginalized Kernels Between Labeled Graphs, The 20th International Conference on Machine Learning (ICML2003), 2003.
^ H. Fröhlich, J. K. Wegner, A. Zell, Optimal Assignment Kernels For Attributed Molecular Graphs, The 22nd International Conference on Machine Learning (ICML 2005), Omnipress, Madison, WI, USA, 2005, 225-232.
^ H. Fröhlich, J. K. Wegner, A. Zell, Kernel Functions for Attributed Molecular Graphs - A New Similarity Based Approach To ADME Prediction in Classification and Regression, QSAR Comb. Sci., 2006, 25, 317-326. doi:10.1002/qsar.200510135
^ H. Fröhlich, J. K. Wegner, A. Zell, Assignment Kernels For Chemical Compounds, International Joint Conference on Neural Networks 2005 (IJCNN'05), 2005, 913-918.
^ P. Mahé, L. Ralaivola, V. Stoven, J.-P. Vert, The pharmacophore kernel for virtual screening with support vector machines, J Chem Inf Model, 2006, 46, 2003-2014. doi:10.1021/ci060138m
^ P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret and J.-P. Vert. Extensions of marginalized graph kernels. Proceedings of the 21st ICML. 2004: 552–559.
^ L. Ralaivola, S. J. Swamidass, H. Saigo and P. Baldi. Graph kernels for chemical informatics. Neural Networks. 2005, 18: 1093–1110. doi:10.1016/j.neunet.2005.07.009.
^ P. Mahé and J.-P. Vert. Graph kernels based on tree patterns for molecules. Machine Learning. 2009, 75 (1): 3–35. ISSN 0885-6125. doi:10.1007/s10994-008-5086-2.
^ J. K. Wegner, H. Fröhlich, H. Mielenz, A. Zell, Data and Graph Mining in Chemical Space for ADME and Activity Data Sets, QSAR Comb. Sci., 2006, 25, 205-220. doi:10.1002/qsar.200510009
^ S. A. Rahman, M. Bashton, G. L. Holliday, R. Schrader and J. M. Thornton, Small Molecule Subgraph Detector (SMSD) toolkit, Journal of Cheminformatics 2009, 1:12. doi:10.1186/1758-2946-1-12
^ Archived copy. Retrieved 2017-07-02. Archived from the original on 2020-01-28.
^ R. D. King, A. Srinivasan, L. Dehaspe, Warmr: a data mining tool for chemical data, J. Comput.-Aid. Mol. Des., 2001, 15, 173-181. doi:10.1023/A:1008171016861
^ L. Dehaspe, H. Toivonen, R. D. King, Finding frequent substructures in chemical compounds, 4th International Conference on Knowledge Discovery and Data Mining, AAAI Press, 1998, 30-36.
^ A. Inokuchi, T. Washio, T. Okada, H. Motoda, Applying the Apriori-based Graph Mining Method to Mutagenesis Data Analysis, Journal of Computer Aided Chemistry, 2001, 2, 87-92.
^ A. Inokuchi, T. Washio, K. Nishimura, H. Motoda, A Fast Algorithm for Mining Frequent Connected Subgraphs, IBM Research, Tokyo Research Laboratory, 2002.
^ A. Clare, R. D. King, Data mining the yeast genome in a lazy functional language, Practical Aspects of Declarative Languages (PADL2003), 2003.
^ M. Kuramochi, G. Karypis, An Efficient Algorithm for Discovering Frequent Subgraphs, IEEE Transactions on Knowledge and Data Engineering, 2004, 16(9), 1038-1051.
^ M. Deshpande, M. Kuramochi, N. Wale, G. Karypis, Frequent Substructure-Based Approaches for Classifying Chemical Compounds, IEEE Transactions on Knowledge and Data Engineering, 2005, 17(8), 1036-1050.
^ C. Helma, T. Cramer, S. Kramer, L. de Raedt, Data Mining and Machine Learning Techniques for the Identification of Mutagenicity Inducing Substructures and Structure Activity Relationships of Noncongeneric Compounds, J. Chem. Inf. Comput. Sci., 2004, 44, 1402-1411. doi:10.1021/ci034254q
^ T. Meinl, C. Borgelt, M. R. Berthold, Discriminative Closed Fragment Mining and Perfect Extensions in MoFa, Proceedings of the Second Starting AI Researchers Symposium (STAIRS 2004), 2004.
^ T. Meinl, C. Borgelt, M. R. Berthold, M. Philippsen, Mining Fragments with Fuzzy Chains in Molecular Databases, Second International Workshop on Mining Graphs, Trees and Sequences (MGTS2004), 2004.
^ T. Meinl, M. R. Berthold, Hybrid Fragment Mining with MoFa and FSG, Proceedings of the 2004 IEEE Conference on Systems, Man & Cybernetics (SMC2004), 2004.
^ S. Nijssen, J. N. Kok. Frequent Graph Mining and its Application to Molecular Databases, Proceedings of the 2004 IEEE Conference on Systems, Man & Cybernetics (SMC2004), 2004.
^ C. Helma, Predictive Toxicology, CRC Press, 2005.
^ M. Wörlein, Extension and parallelization of a graph-mining-algorithm, Friedrich-Alexander-Universität, 2006.
^ K. Jahn, S. Kramer, Optimizing gSpan for Molecular Datasets, Proceedings of the Third International Workshop on Mining Graphs, Trees and Sequences (MGTS-2005), 2005.
^ X. Yan, J. Han, gSpan: Graph-Based Substructure Pattern Mining, Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), IEEE Computer Society, 2002, 721-724.
^ A. Karwath, L. D. Raedt, SMIREP: predicting chemical activity from SMILES, J Chem Inf Model, 2006, 46, 2432-2444. doi:10.1021/ci060159g
^ H. Ando, L. Dehaspe, W. Luyten, E. Craenenbroeck, H. Vandecasteele, L. Meervelt, Discovering H-Bonding Rules in Crystals with Inductive Logic Programming, Mol Pharm, 2006, 3, 665-674. doi:10.1021/mp060034z
^ P. Mazzatorta, L. Tran, B. Schilter, M. Grigorov, Integration of Structure-Activity Relationship and Artificial Intelligence Systems To Improve in Silico Prediction of Ames Test Mutagenicity, J. Chem. Inf. Model., 2006, ASAP alert. doi:10.1021/ci600411v
^ N. Wale, G. Karypis. Comparison of Descriptor Spaces for Chemical Compound Retrieval and Classification, ICDM, 2006, 678-689.
^ A. Gago Alonso, J. E. Medina Pagola, J. A. Carrasco-Ochoa and J. F. Martínez-Trinidad, Mining Frequent Connected Subgraphs Reducing the Number of Candidates, in Proc. of ECML-PKDD, pp. 365–376, 2008.
^ Xiaohong Wang, Jun Huan, Aaron Smalter, Gerald Lushington, Application of Kernel Functions for Accurate Similarity Search in Large Chemical Databases, BMC Bioinformatics, Vol. 11 (Suppl 3): S8, 2010.
^ Baskin, I. I.; V. A. Palyulin; N. S. Zefirov. [A methodology for searching direct correlations between structures and properties of organic compounds by using computational neural networks]. Doklady Akademii Nauk SSSR. 1993, 333 (2): 176–179.
^ I. I. Baskin, V. A. Palyulin, N. S. Zefirov. A Neural Device for Searching Direct Correlations between Structures and Properties of Organic Compounds. J. Chem. Inf. Comput. Sci. 1997, 37 (4): 715–721. doi:10.1021/ci940128y.
^ D. B. Kireev. ChemNet: A Novel Neural Network Based Method for Graph/Property Mapping. J. Chem. Inf. Comput. Sci. 1995, 35 (2): 175–180. doi:10.1021/ci00024a001.
^ A. M. Bianucci; Micheli, Alessio; Sperduti, Alessandro; Starita, Antonina. Application of Cascade Correlation Networks for Structures to Chemistry. Applied Intelligence. 2000, 12 (1-2): 117–146. doi:10.1023/A:1008368105614.
^ A. Micheli, A. Sperduti, A. Starita, A. M. Bianucci. Analysis of the Internal Representations Developed by Neural Networks for Structures Applied to Quantitative Structure-Activity Relationship Studies of Benzodiazepines. J. Chem. Inf. Comput. Sci. 2001, 41 (1): 202–218. PMID 11206375. doi:10.1021/ci9903399.
^ O. Ivanciuc. Molecular Structure Encoding into Artificial Neural Networks Topology. Roumanian Chemical Quarterly Reviews. 2001, 8: 197–220.
^ A. Goulon, T. Picot, A. Duprat, G. Dreyfus. Predicting activities without computing descriptors: Graph machines for QSAR. SAR and QSAR in Environmental Research. 2007, 18 (1-2): 141–153. PMID 17365965. doi:10.1080/10629360601054313.

Further reading

Schölkopf, B., K. Tsuda and J. P. Vert: Kernel Methods in Computational Biology, MIT Press, Cambridge, MA, 2004.
R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley & Sons, 2001. ISBN 0-471-05669-3
Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997. ISBN 0-521-58519-8
R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, Wiley-VCH, 2000. ISBN 3-527-29913-0

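See also

Quantitative structure-activity relationship
ADME
Partition coefficient

External links

Small Molecule Subgraph Detector (SMSD) - a Java-based software library for calculating the maximum common subgraph (MCS) between small molecules.
5th International Workshop on Mining and Learning with Graphs, 2007
Overview for 2006
Molecule mining (basic chemical expert systems)
ParMol and master thesis documentation - Java - open source - distributed mining - benchmark algorithm library
TU München - Kramer group
Molecule mining (advanced chemical expert systems)
DMax Chemistry Assistant - commercial software
AFGen - software for generating fragment-based descriptors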
