Foreign Literature Translation

Original Text

Enterprise Information Integration: Successes, Challenges and Controversies

Material Source: New York: ACM Press
Author: Alon Y. Halevy (Editor), Naveen Ashish, Dina Bitton, Michael Carey, Denise Draper, Jeff Pollock, Arnon Rosenthal, Vishal Sikka

ABSTRACT

The goal of EII systems is to provide uniform access to multiple data sources without having to first load them into a data warehouse. Since the late 1990s, several EII products have appeared in the marketplace and significant experience has been accumulated from fielding such systems. This collection of articles, by individuals who were involved in this industry in various ways, describes some of these experiences and points to the challenges ahead.

1. INTRODUCTORY REMARKS

Beginning in the late 1990s, we have been witnessing the budding of a new industry: Enterprise Information Integration (EII). The vision underlying
this industry is to provide tools for integrating data from multiple sources without having to first load all the data into a central warehouse. In the research community we have been referring to these as data integration systems. This collection of articles accompanies a session at the SIGMOD 2005 Conference in which the authors discuss the successes of the EII industry, the challenges that lie ahead of it and the controversies surrounding it. The following few paragraphs serve as an introduction to the issues, admittedly with the perspective of the editor.

Several factors came together at
the time to contribute to the development of the EII industry. First, some technologies developed in the research arena had matured to the point that they were ready for commercialization, and several of the teams responsible for these developments started companies (or spun off products from research labs). Second, the needs of data management in organizations had changed: the need to create coherent external web sites required integrating data from multiple sources; the web-connected world raised the urgency for companies to start communicating with others in various ways. Third, the emergence of XML piqued the appetites of people to share data. Finally, there was a general atmosphere in the late 90s that any idea is worth a try (even good ones!). Importantly, data warehousing solutions were deemed inappropriate for supporting these needs, and the cost of ad-hoc solutions was beginning to become unaffordable.

Broadly speaking, the architectures underlying the products were based on similar principles. A data integration scenario started with identifying the data sources that would participate in the application, and then building a virtual schema (often called a mediated schema), which would be queried by users or applications. Query processing would begin by reformulating a query posed over the virtual schema into queries over the data sources, and then executing it efficiently with an engine that created plans that span multiple data sources and dealt with the limitations and capabilities of each source. Some of the companies coincided with the emergence of XML, and built their systems on an XML data model and query language (XQuery was just starting to be developed at the time). These companies had to address double the problems of the other companies, because the research on efficient query processing and integration for XML was only in its infancy, and hence they did not have a vast literature to draw on.

Some of the first applications in which these systems were fielded successfully were customer-relationship management, where the challenge was to provide the customer-facing worker with a global view of a customer whose data resides in multiple sources, and digital dashboards that required tracking information from multiple sources in real time.

As with any new industry, EII has faced many challenges, some of which still impede its growth today. The
following are representative ones:

Scaleup and performance: The initial challenge was to convince customers that the idea would work. How could a query processor that accesses the data sources in real time have a chance of providing adequate and predictable performance? In many cases, administrators of (very carefully tuned) data sources would not even consider allowing a query from an external query engine to hit them. In this context EII tools often faced competition from the relatively mature data warehousing tools. To complicate matters, the warehousing tools started emphasizing their real-time capabilities, supposedly removing one of the key advantages of EII over warehousing. The challenge was to explain to potential customers the tradeoffs between the cost of building a warehouse, the cost of a live query and the cost of accessing stale data. Customers want simple formulas they could apply to make their buying decisions, but those are not available.

Horizontal vs. Vertical growth: From a business perspective, an EII company had to decide whether to build a horizontal platform that can be used in any application or to build special tools for a particular vertical. The argument for the vertical approach was that customers care about solving their entire problem, rather than paying for yet another piece of the solution and having to worry about how it integrates with other pieces. The argument for the horizontal approach is the generality of the system and often the inability to decide (in time) which vertical to focus on. The problem boiled down to how to prioritize the scarce resources of a startup company.

Integration with EAI tools and other middleware: To put things mildly, the space of data management middleware products is a very complicated one. Different companies come at related problems from different perspectives and it's often difficult to see exactly which part of the problem a tool is solving. The emergence of EII tools only further complicated the problem. A slightly more mature sector is EAI (Enterprise Application Integration), whose products try to facilitate hooking up applications to talk to each other and thereby support certain workflows. Whereas EAI tends to focus on arbitrary applications, EII focuses on the data and querying it. However, at some point, data needs to be fed into applications, and their output feeds into other data
sources. In fact, to query the data one can use an EII tool, but to update the data one typically has to resort to an EAI tool. Hence, the separation between EII and EAI tools may be a temporary one. Other related products include data cleaning tools and reporting and analysis tools, whose integration with EII and EAI stands to see significant improvement.

Meta-data Management and Semantic Heterogeneity: One of the key issues faced in data integration projects is locating and understanding the data to be integrated. Often, one would find that the data needed for a particular integration application is not even captured in any source in the enterprise. In other cases, significant effort is needed in order to understand the semantic relationships between sources and convey those to the system. Tools addressing these issues are still in their infancy. They require both a framework
for storing the meta-data across an enterprise, and tools that make it easy to bridge the semantic heterogeneity between sources and maintain it over time.

Summary: The EII industry is real: in 2005 it is expected to have revenues of at least half a billion dollars. However, it is clear that the products we have today will have to change considerably in order for this industry to realize its full potential, and its positioning still needs to be further refined. I personally believe that the success of the industry will depend to a large extent on delivering useful tools at the higher levels of the information food chain, namely for meta-data management and schema heterogeneity.

2. TOWARDS COST EFFECTIVE AND SCALABLE INFORMATION INTEGRATION

EII is an area that I have had involvement with for the past several years. I am currently a researcher at NASA Ames Research Center, where data integration has been and remains one of my primary research and technical interests. At NASA I have been involved both in developing information integration systems in various domains and in developing mission-driven applications of use to NASA. The domains have included integration of enterprise information, and integration of aviation safety related data sources. Previously I was a co-developer of some of the core technologies for Fetch Technologies, an information extraction and integration company spun out of my research group at USC/ISI in 2000. My work at NASA
has provided the opportunity to look at EII not only from a developer and provider perspective, but also from a consumer perspective in terms of applying EII to the NASA enterprise's information management needs.

A primary concern for EII today regards the scalability and economic aspects of data integration. The need for middleware and integration technology is inevitable, more so with the shifting computing paradigms as noted in [10]. Our experience with enterprise data integration applications in the NASA enterprise tells us that traditional schema-centric mediation approaches to data integration problems are often overkill and lead to excessive and unnecessary investment in terms of time, resources and cost. In fact, the investment in schema management per new source integrated and in heavy-weight middleware are reasons why user costs increase directly (linearly) with the user benefit, with
the primary investment going to the middleware IT product and service providers. What is beneficial to end users, however, are integration technologies that truly demonstrate economies of scale, with costs of adding newer sources decreasing significantly as the total number of sources integrated increases.

How are scalability and cost-effectiveness in data integration achieved? Note that the needs of different data integration applications are very diverse. Applications might require data integration across anywhere from a handful of information sources to literally hundreds of sources. The data
in any source could range from a few tables that could well be stored in a spreadsheet to something that requires a sophisticated DBMS for storage and management. The data could be structured, semi-structured, or unstructured. Also, the query processing requirements for any application could vary from requiring just basic keyword search capabilities across the different sources to sophisticated structured query processing across the integrated collection. This is why a one-size-fits-all approach is often unsuitable for many applications and provides the motivation for developing an integration approach that is significantly more nimble and adaptable to the needs of each integration application.

We begin by eliminating some tacit, schema-centric assumptions that seem to be holding for data integration technology, namely:

Data must always be stored and managed in DBMS systems. Actually, requirements of applications vary greatly, ranging from data that can well be stored in spreadsheets to data that does indeed require DBMS storage.

The database must always provide for and manage the structure and semantics of the data through formal schemas. Alternatively, the "database" can be nothing more than intelligent storage. Data could be stored generically, and imposition of structure and semantics (schema) may be done by clients as needed.

Multiple schemas from several independent sources, and the interrelationships between them, must be managed. Alternatively, any imposition of schema can be done by the clients, as and when needed by applications. This assumption is based on a 1960s paradigm where clients had almost negligible computing power. Clients of today have significant processing power, and sophisticated functionality can well be pushed to the client side.

Translation

Enterprise Information Integration: Successes, Challenges and Controversies

Material Source: New
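York: ACM Press
Author: Alon Y. Halevy (Editor), Naveen Ashish, Dina Bitton, Michael Carey, Denise Draper, Jeff Pollock, Arnon Rosenthal, Vishal Sikka

Abstract

The goal of EII is to provide a uniform system that draws on multiple data sources without first loading them into a data warehouse. Since the late 1990s, several EII products have appeared on the market, and significant experience has been accumulated from such systems. This collection of articles, by individuals involved in the industry in various ways, describes some of these experiences and points out the challenges ahead.

1. Introduction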
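Since the late 1990s, we have witnessed the budding of a new industry: Enterprise Information Integration. The vision underlying this industry is to provide tools for integrating data from multiple sources without first loading all the data into a central warehouse. In the research community, we have been referring to these as data integration systems. This collection of articles accompanies a session at the 2005 International Conference on Management of Data that discusses the successes of the EII industry, the challenges ahead and the surrounding controversies. The next few paragraphs serve as an introduction to the issues, from the editor's perspective.

Several factors came together at the time to contribute to the development of the EII industry. First, some technologies developed in research had matured to the point of being ready for commercialization, and several of the teams responsible started companies (or spun off products from research labs). Second, the needs of data management in organizations had changed: creating externally facing web sites required integrating data from multiple sources; the connected web world raised the urgency for companies to start communicating in various ways. Third, the emergence of XML (Extensible Markup Language) stimulated people to share data. Finally, in the late 1990s there was a general atmosphere that any idea was worth a try, even very good ones. Importantly, data warehousing solutions were ill-suited to supporting these needs, and the cost of ad hoc solutions was becoming a burden.

Generally speaking, the architectures of the products were based on similar principles. A data integration scenario started with identifying the data sources participating in the application, and then building a virtual schema (often called a mediated schema) to be queried by users or applications. Query processing would reformulate a query posed over the virtual schema into queries over the data sources, and then execute it efficiently with an engine that created plans spanning multiple data sources and handled the limitations and capabilities of each source. Meanwhile, some companies coincided with the emergence of XML and built their systems on an XML data model and query language; at that time a query language for XML was only just beginning to be developed. These companies had to solve twice the problems of other companies, because research on efficient query processing and integration for XML was only in its infancy, and so they had no large body of literature to draw on.

Some of the first successful applications of these systems were in customer relationship management, where the challenge was to give customer-facing staff a global view of customer data residing in multiple sources, and in digital dashboards that required real-time tracking of information from multiple sources.

As with any new industry, EII has faced many challenges, some of which still impede its growth today. The following are representative ones:

Scale-up and performance: The first task was to convince customers that the idea was feasible. How could a query processor that accesses data sources in real time provide adequate performance? In many cases, the administrators of the data sources would not even consider allowing queries from an external engine. In this context, EII tools often faced competition from the relatively mature data warehousing tools. To make matters worse, warehousing tools began to emphasize their real-time capabilities, supposedly removing one of the key advantages of EII over warehousing. The challenge was to explain to potential customers the tradeoffs among the cost of building a warehouse, the cost of a live query and the cost of accessing stale data. Customers want a simple formula they can apply to make purchasing decisions, but none is available.

Horizontal vs. vertical growth: From a business perspective, an EII company had to decide whether to build a horizontal platform that can be used in any application or to build special tools for a particular vertical. The argument for the vertical approach was that customers care about solving their entire problem, rather than paying more for one more piece of the solution that must be combined with other pieces. The argument for the horizontal approach was the generality of the system and, often, the inability to decide which vertical to focus on. The problem came down to how to allocate the scarce resources of a startup.

Integration with EAI tools and other middleware: To put it mildly, the space of data management middleware products is very complicated. Different companies approach related problems from different perspectives, and it is often hard to see clearly which part of the problem a tool solves. The emergence of EII tools only complicated the problem further. A slightly more mature sector is EAI (Enterprise Application Integration), whose products try to facilitate connecting applications so they can communicate with each other, thereby supporting certain workflows. Whereas EAI targets arbitrary applications, EII focuses on data and querying. However, at some point, data needs to be fed into applications, and their output feeds other data sources. In fact, to query data one can use an EII tool, but to update data one usually has to resort to an EAI tool. Hence, the separation between EII and EAI may be temporary. Other related products include data cleaning tools and reporting and analysis tools, whose integration with EII and EAI has yet to improve significantly.

Metadata management and semantic heterogeneity: A key issue in data integration projects is locating and understanding the data to be integrated. Often, one finds that the data needed for a particular integration application is not captured in any source. In other cases, significant effort is needed to understand the semantic relationships between sources and convey them to the system. Tools for addressing these issues are still at an early stage. They require both a framework for storing metadata across an enterprise and tools that make it easy to bridge the semantic heterogeneity between sources and maintain it over time.

Summary: The EII industry is real; in 2005 its revenues are expected to be at least 500 million dollars. However, for this industry to realize its full potential, the products we have today must change considerably, and its positioning needs further refinement. I personally believe that the success of the industry depends largely on providing useful tools at the higher levels of the information chain, namely for metadata management and schema heterogeneity.

2. Towards Cost-Effective and Scalable Information Integration

EII is an area I have been involved with for the past several years. I am currently a researcher at the NASA Ames Research Center, where data integration has been one of my main technical research interests. I have developed information integration systems in various domains, as well as applications for specific mission-driven uses at NASA. The domains include integration of enterprise information and integration of aviation-safety-related data sources. Previously I was a co-developer of core technologies for an information extraction and integration company spun out of my research group at USC/ISI in 2000. My work at NASA has given me the opportunity to look at EII not only from a developer and provider perspective, but also from a consumer perspective, in terms of the NASA enterprise's information management needs.

A primary concern for EII today is the scalability and economics of data integration. The need for middleware and integration technology is inevitable, all the more so with shifting computing paradigms [10]. Our experience with enterprise data integration applications at NASA tells us that traditional mediation approaches to data integration are often overkill and lead to excessive or unnecessary investment of time, resources and cost. In fact, the investment in schema management for each new source and in heavyweight middleware is why user costs rise directly with user benefit, with the primary investment going to the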