A Survey on Event Mining for ICT Network Infrastructure Management

1 Introduction

Nowadays in China, there are more than six hundred million netizens [1]. On April 11, 2015, the number of simultaneous online users of the Chinese instant messaging application QQ reached two hundred million [2]. The fast growth of the Internet pushes the rapid development of information technology (IT) and communication technology (CT). Many traditional IT service and CT equipment providers are facing the fusion of IT and CT in the age of digital transformation, and are evolving into ICT enterprises. Large global ICT enterprises, such as Apple, Google, Microsoft, Amazon, Verizon, and AT&T, have been contributing to the performance improvement of IT services and CT equipment. As a result, IT services and CT equipment have become increasingly powerful. The world's top high-performance computing system, the Chinese Tianhe-2 supercomputer, runs at 33.86 petaflops [3]. The data I/O of a modern Internet backbone router exceeds tens of terabytes per second, while its routing table usually consists of millions of routes. Modern networks also keep growing in scale. A global information grid [4] built by the US military is capable of collecting, processing, storing, disseminating and managing information from more than two million nodes.

These large-scale, high-performance ICT networks are supported by ICT network infrastructures. ICT network infrastructure refers to the combination of all computing and network hardware components, as well as software resources of an ICT network. Computing hardware components include computing servers, storage systems, etc. Network hardware components include routers, switches, LAN cards, etc. Software resources include virtual machine platforms, operating systems, security applications, network operation and management platforms, etc. These resources facilitate the communications and services between users and service providers.

The network infrastructure of a large ICT enterprise, e.g., a world-wide online shopping company like Amazon, usually has several world-wide data centers. Each data center has tens of thousands of servers, switches, routers, and firewalls, as well as other affiliated systems like power supply systems or cooling systems. A typical architecture of data centers is shown in Fig. 1 [5]. The ICT network infrastructure for carriers is even more complex. For example, besides data centers, there are nation-wide communication networks in a 3G/4G network infrastructure (Fig. 2) [6]. Each communication network includes access network equipment, core network equipment, transport network equipment, and other application systems, containing tens of thousands of network elements that provide authentication, billing, data/voice communications, and multimedia services.

These large-scale complex networks introduce many difficulties in designing, architecting, operating, and maintaining the corresponding network infrastructures, on which multiple complex systems are coordinated to ensure that the computation and communication functions work smoothly. Cloud technology is widely used in modern ICT network infrastructures due to the development of virtualization technology and its low cost. But cloud technology also brings hierarchy and heterogeneity to network infrastructures. During the operation and maintenance of network infrastructures, equipment failure, communication error and system misconfiguration have a high impact on the reliability of the whole network [7]-[9], and in turn destabilize upper-level services and business.

Traditionally, system administrators resolve the aforementioned incidents according to a workflow consisting of detection, localization and repair, by using network tools such as ping, traceroute, and tcpdump, or network monitoring toolkits such as Nagios [10], Zabbix [11], and OpsView [12]. This process is well known to be labor-intensive and error-prone, and may not be effective when the systems/networks become large and complex.

Fortunately, several industry organizations have already paid attention to these issues and have put considerable effort into developing specifications related to best practices in operating and maintaining large-scale complex systems/networks. In the IT service area, the Information Technology Infrastructure Library (ITIL) [13] is a collection of specifications for service management, in which the best practices are organized according to the full life cycle of IT services, including incident management, failure management, problem management, configuration management, and knowledge management. In the carrier service area, international organizations, such as ITU-T [14] and TM Forum [15], also make recommended specifications for managing telecommunication network infrastructures, some of whose ideas are borrowed from ITIL.

Fig. 3 shows a general workflow of problem detection, determination and resolution for IT service providers prescribed by the ITIL specifications [16]. The workflow aims at resolving incidents and quickly restoring the provision of services while relying on monitoring or human intervention to detect the malfunction of a component [16]. For problem detection, there is usually monitoring software running on servers or network elements, which continuously monitors the status of network elements and detects possible problems by computing metrics for the hardware and software performance at regular intervals. The monitoring software issues an alert if those metrics are not acceptable according to predefined thresholds, known as monitoring situations, and emits an event if the alert does not disappear after a period. All events coming from the network infrastructure are consolidated in an enterprise console, where these events are analyzed and corresponding incident tickets are created, if necessary, in an Incident, Problem, and Change (IPC) system. System administrators are responsible for the problem determination and resolution based on the detailed information in these tickets. The efficiency of these resources is critical for the provisioning of the services [17].

However, the best practices in those specifications only provide guidance on operating and maintaining network infrastructure, in the form of a standard workflow of consecutive procedures and definitions. Many key issues in these procedures are not answered in these specifications, especially in large-scale complex networks. The challenges in managing large-scale network infrastructures are listed as follows:

1) Large complex network infrastructures are heterogeneous and often consist of various network elements made by different equipment makers. Different software components run on the various network elements and generate a huge number of messages and alerts in different types and formats. The heterogeneity complicates the management work [18], [19], and understanding these messages and alerts is not an easy task. In a small network, system administrators can analyze the messages and alerts one by one, and understand their corresponding event types. Clearly, this is not practical in large complex networks. Automatic event generation is important for reducing the maintenance cost with limited human resources.

2) The diagnosis and resolution depend on experienced system administrators who analyze performance metrics, alert logs, event information and other network characteristics. Unexpected behaviors are usually discovered in the daily operation of large complex networks. Malfunction of certain network elements can cause alerts in both upper-level business applications and other connected network elements. The scale and complexity of root cause analysis [20] in such networks are often beyond the ability of human operators. Therefore, automatic root cause analysis is necessary in managing large complex network infrastructures.

3) Root cause analysis identifies the actual network element that causes an alert, while failure prediction tries to avoid the situation where the expected services cannot be delivered [21], [22]. Proactive fault management can enhance the network reliability, and is usually done by system administrators based on predefined business rules. With failure prediction, proactive fault management can be more efficient. Failure prediction based on historical incident tickets and server attributes plays an important role in managing large complex network infrastructures.

Mining valuable knowledge from events and tickets can efficiently improve the performance of system diagnosis. In this survey, we focus on recent research studies dealing with the above three challenges.
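The problem detection step of the workflow described earlier in this section (issue an alert when a metric violates a predefined threshold, and emit an event only if the alert does not disappear after a period) can be sketched as follows. This is a minimal illustration under stated assumptions, not the behavior of any particular monitoring product: the names `ThresholdMonitor`, `Event`, and `run_monitor`, the `persist_samples` parameter, and the idea of counting the "period" in consecutive samples are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Event:
    element: str       # network element that produced the event
    metric: str        # name of the monitored metric
    value: float       # metric value at the moment the event was emitted
    sample_index: int  # index of the sample that triggered the event


class ThresholdMonitor:
    """Alert on a threshold violation; emit an event if the alert persists.

    Assumption: the persistence period is expressed as a number of
    consecutive samples taken at regular intervals.
    """

    def __init__(self, element: str, metric: str,
                 threshold: float, persist_samples: int) -> None:
        self.element = element
        self.metric = metric
        self.threshold = threshold
        self.persist_samples = persist_samples
        self._violations = 0   # consecutive samples above the threshold
        self._emitted = False  # True once the current alert became an event

    def observe(self, sample_index: int, value: float) -> Optional[Event]:
        if value <= self.threshold:
            # The alert disappeared: reset the state without emitting anything.
            self._violations = 0
            self._emitted = False
            return None
        self._violations += 1
        if self._violations >= self.persist_samples and not self._emitted:
            # The alert has persisted long enough: emit exactly one event.
            self._emitted = True
            return Event(self.element, self.metric, value, sample_index)
        return None


def run_monitor(monitor: ThresholdMonitor,
                samples: Iterable[float]) -> List[Event]:
    """Feed a series of metric samples to the monitor and collect events."""
    events = []
    for index, value in enumerate(samples):
        event = monitor.observe(index, value)
        if event is not None:
            events.append(event)
    return events
```

Calling `run_monitor(ThresholdMonitor("server-01", "cpu_util", 90.0, 3), samples)` on a series of CPU readings returns only the events for violations that lasted at least three consecutive samples. Short spikes are ignored, so they never reach the enterprise console or create tickets.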

The remainder of this survey is organized as follows. Section 2 reviews the event generation approaches. Root cause analysis and failure prediction are investigated in Section 3 and Section 4, respectively. Finally, Section 5 concludes the survey.

2 Event Generation

The monitoring software on network elements in large complex networks generates a huge number of alerts, alarms, and messages, indicating the equipment status in real time. These alerts, alarms, and messages are usually collected in log files. The data in log files may include the time, the element name, the running states of software components (e.g., started, interrupted, connected, and stopped), and other performance parameter values. In this section, we mainly focus on the methodologies of event generation from log files.

The contents of log files in some systems are unstructured, that is, each event is stored as a short message in plain text files, such as server logs, Linux logs and Hadoop logs. In other systems, the logs may be semi-structured or structured, e.g., Windows event logs and database management system logs. Such logs are often stored in a database. Each record in the database represents an event, often including time, server name, process name, error code and other related information. Many data mining algorithms are based on structured or semi-structured data, and cannot handle unstructured textual logs. Event generation converts textual logs into structured events for later analysis.

A simple log example is shown in Table 1 [23], in which messages from a Simple File Transfer Protocol (SFTP) log are collected from an FTP software called FileZilla. Each line in Table 1 is a short message describing a certain event. In order to analyze the behaviors of FTP visits, these raw log messages need to be translated into types of events. The generated events are usually organized by timeline so that people can understand the behaviors and discover event patterns [24]. In Table 1, Message 4 is the event of uploading a webpage to the FTP server, and Message 9 is an error alert that the operation of creating a new directory is not successful. By converting raw log messages into canonical events, these events can be correlated across logs from different elements.

It seems that obtaining events from the log files is not a difficult task. However, due to the heterogeneity of network infrastructures, each network element generates raw messages with its own format and contents. These messages may be disparate and inconsistent, which creates difficulty in deciphering events reported by multiple network elements [24]. For example, suppose we need to perform the following task: if any element stops, the system administrator is notified by email. Given the variability among different network elements, one element might record "The server has stopped" in the log file, while another one might record "The server has changed the state from running to stop." The inconsistency in log files makes the above task difficult. All the messages indicating the stop status from all network elements must be collected in order to write a program to automate this simple task. This is hardly feasible in large complex networks with newly added network elements and many legacy network elements.

When one needs to analyze the historical event data across multiple elements, it is necessary to encode semantics in a system-independent manner. All raw log messages in log files should be consistent in semantics across similar fields, which allows the organization of common semantic events into categories. The converted canonical events provide the ability of describing the semantics of log data as well as the initial connection of
