Discuss the application of data mining in Bioinformatics

Abstract. Bioinformatics is an active field that intersects with and draws on a wide range of disciplines. Studying the background and current state of bioinformatics, and the application of data mining within it, helps to promote the development of biology and related sciences. Progress in bioinformatics depends on breakthroughs in related disciplines; at the same time, its development provides information, materials and research methods to those disciplines.

Key words: Bioinformatics, Data mining, Application.

1. Introduction

Bioinformatics is the core of biotechnology and emerged alongside genome research. It is a subject that combines biology, computer science and network technology, and its research content has developed with the emergence and growth of genome research. The
Human Genome Project has been launched and carried out, and nucleic acid and protein data have grown rapidly with it; how to obtain useful information from these massive data is an urgent problem for bioinformatics to solve. This places higher demands on bioinformatics and also challenges information theory and technology to meet the needs of data collection, collation, retrieval and analysis. As an emerging technology based on databases, statistics and artificial intelligence, data mining offers genome scientists data analysis tools that were not available before and provides a new and powerful means of analyzing and extracting gene and protein information. Data mining and bioinformatics fit together well, and the considerable potential of their combination is drawing increasing attention in the field of bioinformatics. This article introduces the concept of data mining and the steps of biological data mining, and discusses potential applications of data mining and the development and application of bioinformatics mining tools. Studies show that data mining is a powerful tool for processing biological information, and its application is expected to make further progress.

2. The conception of bioinformatics

Bioinformatics is a science that uses computers to store, retrieve and analyze biological information; it is one of the important frontiers of the life sciences and physical sciences. The development of bioinformatics depends on breakthroughs in biology, computer science and other related disciplines; in turn, bioinformatics provides information, materials and methods for these disciplines. It queries, searches, compares and analyzes biological information in order to gain rational knowledge of gene coding, gene regulation, and the structure, function and interrelationships of proteins and nucleic acids. Bioinformatics uses the genomic information in coding regions to simulate the spatial structure of proteins and predict protein function, combines this with physiological and biochemical information about organisms and life processes to outline the underlying molecular mechanisms, and finally applies the results to the molecular design of proteins and nucleic acids, drug design and personalized health care.

The three important parts of bioinformatics are genome informatics, protein structure modeling and drug design. The starting point is the analysis of DNA sequence information; information in the protein-coding regions is then used for protein structure prediction and simulation, and drug design follows on the basis of the specific protein functions.

3. The relation of data mining and knowledge discovery

There are two popular views of the relation between data mining and knowledge discovery. One view is that they are the same concept under different names in different areas: the term knowledge discovery is used in scientific research, and data mining in engineering applications. The other view holds that knowledge discovery is the acquisition and mining of knowledge from massive data, where such knowledge is implicit, previously unknown and potentially useful information; on this view, data mining is the core stage of knowledge discovery.

Data mining and the knowledge discovery system form an organic whole. A data mining system is the knowledge discovery process organized around a data mining task, and all of its algorithms serve that system. Studying data mining systems aims to establish a sound system architecture that favors the reuse and embedding of mining algorithms and the organic combination of algorithms with the other modules of the system. Figure 1 shows the prototype structure of one mining system.

4. Data mining classification and mining steps

4.1 Data mining involves many fields and methods, including artificial intelligence, statistics, visualization and parallel computing, and it can be classified in several ways.

4.1.1 According to the mining task, it can be divided into classification models, clustering, association rule discovery, sequence analysis, variance analysis and data visualization.

4.1.2 According to the mining object, it can be divided into relational databases, object-oriented databases, spatial databases, temporal databases, text data sources, multimedia databases and the Web.

4.1.3 According to the mining method, it can be divided into machine learning methods, statistical methods, neural network methods, decision trees, visualization and nearest neighbor techniques. Machine learning can be further divided into inductive learning methods (such as decision trees and rule induction), case-based learning and genetic algorithms. Statistical methods can be divided into regression analysis (multivariate regression and other regression methods), discriminant analysis (BDF, Fisher discriminant, nonparametric discriminant), cluster analysis (system clustering, dynamic clustering) and exploratory analysis (principal component analysis, correlation analysis and so on).

4.2 Data mining includes three parts: business requirements, a large amount of data and the mining algorithm. The first thing to establish in real data mining
is the business requirement. Mining algorithms are currently a research hotspot, with work focused mainly on adopting new mining algorithms to solve specific business problems. A mining algorithm can be built into a mining tool. The common process is as follows:

(1) Analyze the problem. The source database must be assessed to confirm that it meets the standards required for data mining. The expected results are then determined and the most suitable algorithm for the job is chosen.

(2) Extract, clean and check the data. The extracted data is loaded into a database whose structure and data model are compatible, providing clean, consolidated data with a uniform structure. The created model is then browsed to ensure that all data is present and complete.

(3) Create and debug the model. The algorithm is applied to the model to produce a structure; the data in that structure is then browsed to confirm that it accurately represents the "facts" in the source data. This is the important point. Although it may not be possible to do this for every detail, viewing the generated model may reveal important characteristics.

(4) Query the data mining model. Once the model has been built, its data can be used for decision support. In the Microsoft data mining solution, this step usually uses a front-end query program written in VB or ASP against the OLE DB for Data Mining provider.

(5) Maintain the data mining model. After the model has been built, the characteristics of the initial data (such as validity) may change, and some of these changes can greatly affect precision, because they alter the properties on which the original model was based. Maintaining the data mining model is therefore a very important step.

5. The application of data mining in Bioinformatics

5.1 Data mining based on privacy protection

Data mining technology provides effective tools for biologists, but it also raises privacy protection problems. For example, a research unit's confidential data, personal medical diagnostic records and medical records are potentially open to misuse. During data mining, privacy can be protected by limiting data access, blurring data, removing unnecessary groupings, adding noise to the data and other methods.

Anonymity technology is the most direct technique for hiding identity. As a privacy protection technique for data mining, it neither protects the mining results nor hides or disguises the original data. Instead, all data, including the private data, is released, but others who obtain the private data cannot deduce the identity of the data owner from it.

For example, in a medical information table, date of birth, zip code and drug allergy are taken as the set of identifying attributes that characterize a specific record, and past medical history is the private attribute to be protected. Anonymity-based privacy protection hides the set of attributes that can serve as a unique identifier, and so protects privacy indirectly. From the table we can see that the identifying attribute values differ from record to record, so each set of values can be associated with one particular record, and hence with one specific person. The private data is then matched to a particular person, and privacy cannot be protected. But if only zip code and drug allergy are chosen as the identifying attributes, with past medical history as the private attribute, then two records share the values 07030 and no allergy, while their private attribute values differ (polio and colitis). A record marked 07030 with no allergy therefore cannot be uniquely determined, and personal privacy is protected.
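The identifying-attribute argument above can be checked mechanically. The following is a minimal sketch in Python, assuming the anonymity condition described in the text is read as k-anonymity with k = 2 (a term the text itself does not use). The records, the field names and the helper functions `group_sizes` and `is_k_anonymous` are hypothetical, since the original table is not reproduced here; only the zip value 07030, the "no allergy" value and the histories polio and colitis come from the text.

```python
from collections import Counter

# Hypothetical records for illustration; the original table is not available.
records = [
    {"birth": "1965-03-18", "zip": "07030", "allergy": "none", "history": "polio"},
    {"birth": "1971-11-02", "zip": "07030", "allergy": "none", "history": "colitis"},
    {"birth": "1958-07-25", "zip": "07028", "allergy": "penicillin", "history": "asthma"},
    {"birth": "1980-01-09", "zip": "07028", "allergy": "penicillin", "history": "diabetes"},
]


def group_sizes(rows, identifying_attributes):
    """Count how many rows share each combination of identifying values."""
    return Counter(tuple(row[a] for a in identifying_attributes) for row in rows)


def is_k_anonymous(rows, identifying_attributes, k):
    """True if every combination of identifying values occurs in at least k rows."""
    return all(n >= k for n in group_sizes(rows, identifying_attributes).values())


# Date of birth makes every record unique, so one record maps to one person.
with_birth = is_k_anonymous(records, ["birth", "zip", "allergy"], k=2)

# Zip and allergy alone leave at least two candidate records per combination.
without_birth = is_k_anonymous(records, ["zip", "allergy"], k=2)

print(with_birth, without_birth)  # prints: False True
```

With date of birth included, every record forms its own group and the check fails. With only zip code and drug allergy, each group holds two records, so the past medical history cannot be tied to one person.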
Many data mining and machine learning systems and tools have been applied to the processing of biological information. General-purpose data mining and analysis systems include SAS Enterprise Miner, IBM Intelligent Miner, SGI MineSet and so on. Some specialized integrated software packages also play a great role in processing biological information. GCG (Genetics Computer Group) is used mainly for analyzing DNA sequences and protein sequences. Staden is a software package for DNA and protein sequence analysis. In addition, there are Sequencher, which is used for large-scale sequencing, and Vector NTI, which is used for rapid cloning. GeneMine is a bioinformatics data mining system developed by Molecular Applications Group; it supports filtering, computation and clustering of biological information data, as well as further comprehensive analysis and visualization. At present, the
world's database giants Oracle and IBM have embedded biological information mining tools in Oracle 9i and DB2, which greatly improves the security of biological data and the accuracy of analysis.

5.2 Data cleaning, data integration and semantic integration of heterogeneous, distributed databases

Many countries and organizations have established biological sequence databases and protein structure and function databases, which provide a wealth of information, but the data is scattered and the storage media are increasingly varied. Even within a single database there are many repeated sequence entries and some highly similar data, which easily results in data redundancy. Semantic integration of heterogeneous, distributed databases has therefore become an important task. The data cleaning and data integration methods of data mining can help to solve the problem of data redundancy.

5.3 Similarity search and alignment of DNA sequences

Sequence alignment can identify the evolutionary relationship between a newly discovered gene and a known gene family, establish whether they are homologous or similar, and find the maximum match between them, thereby quantifying their degree of similarity. Because sequence data consists of discrete symbols, the precise relationships between its nucleotides play an important role, so exploring efficient search and alignment algorithms is very important in sequence analysis.

Path analysis and evolution analysis can likewise be applied to the different stages of a disease. A disease may be caused by more than one gene, with different genes playing a role at different stages. By means of path analysis and evolution analysis we can find the pathogenic gene sequences active at each stage, so that drugs can be developed for the different stages and a more effective therapeutic effect achieved.

5.4 Analysis of genome characteristics and co-occurring gene sequences

For a group of sequences from a gene family, the only way of