Research and Improvement of Database Storage Method

Abstract: This paper presents a massive data storage and parallel processing method based on the MPP architecture. It proposes a fully persistent data storage approach that starts from the client request and integrates the idea of Map/Reduce: the work of the system is distributed to each data node, which gives high scalability, high availability and high concurrency. A simulation test on a set of distributed data nodes verifies the feasibility of this mass data storage mode.

Keywords: Parallel processing; data storage; distributed database

0 Introduction
Data volumes are large and growing fast, so storage efficiency declines, while business demand and competitive pressure place higher requirements on the timeliness and validity of data processing. In response, computer technicians have gradually begun to research new technologies, including distributed file systems, distributed databases, distributed caches and a variety of MPP-based NoSQL distributed storage schemes.

A centralized relational database faces enormous challenges in dealing with massive data and shows the following shortcomings: (1) a single database can store only a limited amount of data, and once the data grows to the terabyte level a single database can no longer meet the demand; (2) the data is stored on a single server, so once that server fails the entire data center may be paralyzed, which means low reliability; (3) centralized storage capacity is limited and can be increased only by upgrading the hard disks, which eventually hits a storage bottleneck and tends to cause low performance.

To sum up, the problems that a centralized database must solve are: first, the storage of a huge amount of data; second, the storage and processing of continuously incremental data; and finally, high-performance, high-concurrency query processing over massive data. With large amounts of data, query processing becomes very slow and poses serious performance problems. The data can therefore be broken up from a whole into parts and handled by divide and conquer. This paper builds on that idea and introduces a parallel processing architecture based on MPP, in which the request load is distributed across the individual databases.

1 Database storage method

(1) Parallel processing technology

Parallel processing uses parallel means to achieve efficient computation in information processing. In other words, it refers to performing a number of tasks within the same time range; the tasks may be of the same nature or of different natures. As long as their execution overlaps in time, it is called parallel. It is mainly used in high-performance processors, large database management, complex mathematical modeling and other fields, and its scope of application is still expanding. It mainly involves three aspects: simultaneity, concurrency and pipelining.

(2) The MPP architecture and the idea of MapReduce

Massively Parallel Processing (MPP) refers to a massively parallel processing architecture. It is usually used for mathematical modeling, computation-heavy database processing, complex weather modeling and other fields. Its characteristic is that it accommodates multiple processing servers that run in parallel and are connected through network communication, so that a number of low-cost servers can process a task together.

(3) The SQL parsing technology

SQL can be parsed in three ways.
The first uses a dedicated grammar parsing tool such as ANTLR; this approach is highly flexible and can be customized, but the workload is very large. The second uses an open-source SQL parser, which takes less work but may not fully meet the functional requirements. The third uses the parsing functions that come with the SQL database.

(4) Data segmentation technique

Data segmentation follows one of two models. The first is vertical segmentation, which places different tables on different database hosts. The other is horizontal segmentation, which splits the data of a single table across multiple database hosts according to a condition derived from the business logic. For a massive database with many tables, vertical segmentation is suitable: closely related tables are kept together on one server. If there are few tables but each holds a very large amount of data, horizontal segmentation is more appropriate: the data is split across multiple servers according to a segmentation rule. In reality the situation is more complicated, and a combination of the two approaches may be used.
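As a rough illustration of horizontal segmentation, the Python sketch below assigns the rows of one table to database hosts with two common rules, a hash of the key and a numeric range. The host names, the `user_id` key and the range boundaries are hypothetical; the paper does not specify its segmentation rules at this level of detail.

```python
import zlib

# Hypothetical host names; the paper does not name its database servers.
HOSTS = ["db0", "db1", "db2"]


def hash_shard(key, hosts=HOSTS):
    """Pick a host from a stable hash of the sharding key."""
    # zlib.crc32 gives the same value on every run, unlike hash() for str.
    return hosts[zlib.crc32(str(key).encode("utf-8")) % len(hosts)]


def range_shard(user_id, boundaries=(1000, 2000), hosts=HOSTS):
    """Pick a host by comparing a numeric key with ascending upper bounds.

    hosts must contain exactly one more entry than boundaries.
    """
    for upper, host in zip(boundaries, hosts):
        if user_id < upper:
            return host
    return hosts[len(boundaries)]  # last host takes keys above the final bound


def split_rows(rows, key, shard=hash_shard):
    """Group the rows of one table by the host each row would be stored on."""
    placement = {}
    for row in rows:
        placement.setdefault(shard(row[key]), []).append(row)
    return placement
```

Calling `split_rows(rows, "user_id", shard=range_shard)` returns a dictionary from host name to the rows placed there. Hash rules tend to spread rows evenly, while range rules keep neighbouring keys on the same host, which can help range queries.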
The user must choose between them according to the specific business and the actual situation.

2 The profile of distributed database

In a distributed database, data is stored in different local databases that may be supported by different operating systems; each node has its own data management center, and the networks that connect the nodes may differ. Although the underlying implementation of a distributed database is complex, the client application does not need to know that the database is distributed. It can treat the database as a whole: the underlying structure is hidden, and the client operates on the database directly and transparently. Figure 1 shows the principle of distributed storage.

The difference between a distributed and a centralized database is that the data is dispersed into parts stored on different physical nodes while remaining logically a unified whole. From the user's point of view, the data in a distributed database system appears to be stored on a single computer, and the user does not notice any difference. Figure 2 shows the distributed storage architecture.

Compared with a centralized database system, a distributed system can shard its data, so it is extensible, and the data nodes are highly independent of one another. To make the system more reliable, data redundancy can be added by keeping copies of the data. A commonly used method in distributed database systems is read/write separation with a master-slave database configuration.
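A minimal sketch of read/write separation, assuming a single master and a list of slave copies: statements that begin with SELECT go to the slaves in rotation, and everything else goes to the master. The class name, the database names and the simple first-word test are illustrative assumptions, not the design of the system described in this paper.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the master and spread reads over the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        # Rotate over the slaves; stays None when no slave is configured.
        self._next_slave = itertools.cycle(self.slaves) if self.slaves else None

    def route(self, sql):
        """Return the name of the database that should run this statement."""
        parts = sql.lstrip().split(None, 1)
        verb = parts[0].upper() if parts else ""
        if verb == "SELECT" and self._next_slave is not None:
            return next(self._next_slave)
        return self.master  # writes, DDL, or no slave available
```

For example, `ReadWriteRouter("master", ["slave1", "slave2"]).route("SELECT 1")` yields a slave name, while an INSERT or UPDATE is routed to the master. A real router would also need to handle transactions and replication lag, which this sketch ignores.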
Keeping copies in a master-slave configuration has two advantages. First, when one database fails, the system can access the copy in another database; even if one node is paralyzed, the data center as a whole keeps running, which makes the system more stable and reliable. Second, the user can select a copy according to distance, and the location of each data node can likewise be chosen according to distance. For example, the server that stores the Beijing business data is placed in Beijing and the server that stores the Shenzhen business data is placed in Shenzhen, which reduces the communication cost of the system and improves the performance of the whole system [3].

3 The design of distributed database system

3.1 The overall system architecture

According to the design goals above, Figure 3 shows the schematic diagram of the whole system, a distributed mass data storage solution. The system consists of 4 modules [4]:

(1) The SQL analysis module. SQL statements are complex and come in diverse formats and forms, and the result of analyzing them serves as the basis for data segmentation. The SQL statement is compiled into byte code and a syntax tree is generated. The advantages of this method are high accuracy, a clear hierarchy and a correct data structure, but it requires knowledge of syntax trees and is more difficult than plain string parsing. Some open-source tools for parsing SQL statements are currently available. Starting from the parsing rules of SQL and the statement parsing process, this paper studies an efficient and convenient tool for analyzing SQL rules and uses it in the system to operate on each part of the SQL statement.

(2) The data distribution module. Without data segmentation in a cluster system, many database servers store exactly the same data. This wastes hardware resources, and keeping the copies consistent costs additional time and efficiency in data synchronization. Once the data volume grows by another level, a single server will probably be unable to store it all, so adopting a data segmentation strategy is only a matter of time. The solution combines existing data segmentation strategies with the business logic to provide a variety of segmentation methods, and reserves an interface through which users can flexibly implement their own custom segmentation, which makes the system more usable.

(3) The parallel processing module. It is composed of a distribution server and database servers. Compared with a centralized database, the cost of a distributed query must take the following factors into account: CPU processing time, I/O time and data transmission time on the network.

(4) The aggregate processing module. Result aggregation falls into two cases: for a single-database query the result is returned directly, while for a query that spans multiple machines and databases the results must be summarized at the forwarding node. The vast majority of current distributed database solutions have no step for aggregating results; this paper focuses on merging the results of several common query types, such as ORDER BY and JOIN.

3.2 The detailed flow chart of the system

First, a forwarding node receives the SQL statement from the client. Taking into account factors such as the current workload of the parse nodes, the expected time to complete the parsing work and the response time predicted from the query history, it forwards the SQL statement to a parse node for syntax analysis. If all work passes through a single forwarding node, high concurrency becomes a problem. To eliminate the performance bottleneck of a single node, we design multiple forwarding nodes, each of which can forward its tasks to different parse nodes.
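As a rough illustration of a forwarding node handing statements to parse nodes in turn (the round-robin rotation this paper adopts), the Python sketch below cycles through a list of nodes. The class name and node names are hypothetical, and the workload and query-history factors mentioned above are not modelled.

```python
import itertools


class RoundRobinDispatcher:
    """Hand each incoming SQL statement to the next parse node in turn."""

    def __init__(self, parse_nodes):
        if not parse_nodes:
            raise ValueError("at least one parse node is required")
        # cycle() repeats the node list endlessly in its original order.
        self._cycle = itertools.cycle(list(parse_nodes))

    def dispatch(self, sql):
        """Return (node, sql): the parse node chosen for this statement."""
        return next(self._cycle), sql
```

With three nodes, consecutive calls to `dispatch` pick the first, second and third node and then start again from the first, so each node receives roughly the same number of statements regardless of how expensive each statement is.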
This paper uses the round-robin method to distribute tasks to the parse nodes in turn so that the workload is balanced. Second, the parse node parses the SQL query statement into an easy-to-understand SQL object and, through its interface, calls the appropriate methods to operate on the SQL statement. Then, the node forwards the SQL statement to different database servers according to the segmentation algorithm, so that the database servers process the query in parallel. Finally, the database servers execute the SQL statement, and the query results are summarized and returned. For a single-database query, the result is returned directly to the client; for a multi-database query, the results are forwarded to the parse node, which merges the result sets according to the query conditions. SQL parsing, data segmentation, forwarding and merging are completed by the four modules above working together.

4 The detailed architecture of DB Mapping

Figure 4 shows the detailed architecture of DB Mapping.

4.1 The SQL parser module

1. Syntax check: check that the spelling and syntax of the SQL query meet the specification.
2. Semantic check: check that the objects involved in the SQL query exist and that the user initiating the query request has the corresponding query permission.
3. Parse the SQL statement: analyze the query using the internal SQL algorithm and generate a parse tree, which is passed to the next module.

4.2 The data distribution module

Connection pool management covers creating, allocating, pooling and recycling connections. A data connection is a limited and precious resource in the system. If every operation request created a new data connection, a large amount of system resources would be consumed. A connection pool is therefore needed: idle connections are kept in the pool rather than released immediately, and when a new connection is needed one can be taken from the pool directly, without establishing a new connection. Thread pool management: because the system uses a parallel processing architecture, the system can exist in