Foreign Literature Translation

Original Text

Testing SQL Server's Query Optimizer: Challenges, Techniques and Experiences
Material Source: IEEE Data Engineering Bulletin
Author: Leo Giakoumakis, Cesar Galindo-Legaria

1 Introduction

Today's query optimizers provide highly sophisticated functionality that is designed to serve a large variety of workloads, data sizes and usage patterns. They are the result of many years of research and development, which has come at the cost of increased engineering complexity, specifically in validating correctness and measuring quality. There are several unique characteristics that make query optimizers exceptionally complex systems to validate, more so than most other software systems. Query optimizers handle a practically infinite input space of declarative data queries (e.g. SQL, XQuery), logical/physical schemas and data. A simple enumeration of all possible input combinations is infeasible, and it is hard to predict or extrapolate expected behavior by grouping similar elements of the input space into equivalence classes. The query optimization process itself is of high algorithmic complexity, and relies on inexact cost estimation models. Moreover, query optimizers ought to satisfy workloads and usage scenarios with a variety of different requirements and expectations, e.g. to optimize for throughput or for response time. Over time, the number of existing customers that need to be supported increases, a fact that introduces constraints on advancing query optimization technology without disturbing existing customer expectations. While new optimizations may improve query performance by orders of magnitude for some workloads, the same optimizations may cause performance regressions (or unnecessary overhead) for other workloads. For those reasons, a large part of the validation process of the query optimizer is meant to provide an understanding of the different tradeoffs and design choices with respect to their impact across different customer scenarios. At the same time, the validation process needs to provide an assessment of regression risk for code changes that may have a large impact across a large number of workload and query types.

2 Key Challenges

The goal of query optimization is to produce efficient execution strategies for declarative queries. This involves the selection of an optimal execution plan out of a space of alternatives, while operating within a set of resource constraints. Depending on the optimization goals, the best-performing strategy could be optimized for response time, throughput, I/O, memory, or a combination of such goals. The different attributes of the query optimization process and the constraints within which it has to function make the tuning of the optimization choices and tradeoffs a challenging problem.

Large input space and multiple paths: The expressive power of query languages results in a practically infinite space of inputs to the query optimizer. For each query the query optimizer considers a large number of execution plans, which are code paths that need to be exercised and validated. The unbounded input space of possible queries, along with the large number of alternative execution paths, generates a combinatorial explosion that makes exhaustive testing impossible. The selection of a representative set of test cases that achieves appropriate coverage of the input space can be a rather difficult task.
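To make the scale of this combinatorial explosion concrete, the short sketch below (not from the paper; it uses the standard counting argument for binary join trees) computes how many distinct bushy join trees exist for n relations: n! orderings of the leaves times the Catalan number C(n-1) of tree shapes.

```python
from math import comb, factorial

def catalan(n: int) -> int:
    """n-th Catalan number: the count of binary tree shapes with n+1 leaves."""
    return comb(2 * n, n) // (n + 1)

def bushy_join_trees(n: int) -> int:
    """Distinct bushy join trees over n base relations:
    n! leaf orderings times catalan(n - 1) tree shapes."""
    return factorial(n) * catalan(n - 1)

for n in (2, 4, 8, 12, 16):
    print(f"{n:2d} relations -> {bushy_join_trees(n):,} join trees")
```

Already at 8 relations there are over 17 million join trees, before any physical operator choices (join algorithms, access paths) are even considered.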
Optimization time: The problem of finding the optimal join order in query optimization is NP-hard. Thus, in many cases the query optimizer has to cut its path aggressively through the search space and settle for a plan that is hopefully near the theoretical optimum. The infeasibility of exhaustive search introduces a tradeoff between optimization time and plan performance. Finding the “sweet spot” between optimization time/resources and plan performance, along with the tuning of the different heuristics, is a challenging engineering problem. New optimizations typically introduce new alternatives and extend the search space, often making it necessary to re-tune such tradeoff decisions.
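As an illustration of cutting through the search space, here is a minimal greedy join-ordering heuristic (a toy with made-up cardinalities and selectivities, not SQL Server's algorithm): instead of enumerating all orders, it repeatedly joins the pair with the smallest estimated intermediate result, trading plan optimality for near-linear optimization time.

```python
import itertools

# Hypothetical catalog: base-table cardinalities and pairwise join selectivities.
CARD = {"A": 1_000, "B": 50_000, "C": 200, "D": 10_000}
SEL = {frozenset("AB"): 0.001, frozenset("BC"): 0.01, frozenset("AC"): 0.05,
       frozenset("AD"): 0.002, frozenset("BD"): 0.0005, frozenset("CD"): 0.1}

def est_card(left, right, cards):
    """Estimate |left JOIN right|: product of input cardinalities times the
    most selective predicate linking the two sides (1.0 = cross product)."""
    sel = min((SEL[frozenset({a, b})] for a in left for b in right
               if frozenset({a, b}) in SEL), default=1.0)
    return cards[left] * cards[right] * sel

def greedy_join_order(tables):
    """Repeatedly merge the pair of subplans with the smallest estimated
    intermediate result -- a cheap stand-in for exhaustive enumeration."""
    cards = {frozenset(t): float(CARD[t]) for t in tables}
    plans = {frozenset(t): t for t in tables}
    while len(cards) > 1:
        l, r = min(itertools.combinations(cards, 2),
                   key=lambda p: est_card(p[0], p[1], cards))
        cards[l | r] = est_card(l, r, cards)
        plans[l | r] = f"({plans[l]} JOIN {plans[r]})"
        for s in (l, r):
            del cards[s], plans[s]
    root = next(iter(cards))
    return plans[root], cards[root]

tree, rows = greedy_join_order("ABCD")
print(tree, f"~{rows:,.0f} estimated rows")
```

A heuristic like this can be arbitrarily worse than the optimum, which is exactly the tradeoff the text describes between optimization time and plan performance.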
Cardinality estimation: A factor that complicates the validation of execution plan optimality is the reliance of the query optimizer on cardinality estimation. Query optimizers mainly rely on statistical information to make cardinality estimates, which are inherently inexact and have known limitations as data and query patterns become more complex. Moreover, there are query constructs and data patterns that are not covered by the mathematical model used to estimate cardinalities. In such cases, query optimizers make crude estimations or resort to simple heuristics. While in the early days of SQL Server the majority of workloads consisted of prepared, single query-block statements, today query-generator interfaces are very common, producing complex ad-hoc queries with characteristics that make cardinality estimation very challenging. Inevitably, testing the plan selection functionality of the query optimizer depends on the accuracy of the cardinality estimation. Improvements in the estimation model, such as increasing the amount of detail captured by statistics and enhancing the cardinality estimation algorithms, increase the quality of the plan selection process. However, such enhancements typically come with additional CPU cost and increased memory consumption.
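A classic failure mode is predicate correlation. The sketch below (synthetic data, not from the paper) contrasts the textbook independence assumption, which multiplies per-column selectivities, against the true combined selectivity when two columns are strongly correlated.

```python
import random

random.seed(42)

# Synthetic table: "city" and "zip_prefix" are perfectly correlated columns.
rows = [("Seattle", "981") if random.random() < 0.3 else ("Boston", "021")
        for _ in range(100_000)]

def selectivity(pred) -> float:
    return sum(1 for r in rows if pred(r)) / len(rows)

sel_city = selectivity(lambda r: r[0] == "Seattle")
sel_zip = selectivity(lambda r: r[1] == "981")

# Textbook estimate multiplies per-column selectivities (independence).
independent_est = sel_city * sel_zip * len(rows)
actual = selectivity(lambda r: r[0] == "Seattle" and r[1] == "981") * len(rows)

print(f"independence estimate: {independent_est:,.0f} rows")
print(f"actual:                {actual:,.0f} rows")  # ~3x the estimate here
```

Under- or over-estimates like this propagate up the plan tree, which is why downstream plan choices, and the tests that validate them, depend on estimation accuracy.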
Cost estimation: Cost models used by query optimizers, similarly to cardinality estimation models, are also inexact and incomplete. Not all hardware characteristics, runtime conditions, and physical data layouts are modeled by the query optimizer. Although such design choices can obviously lead to reliability problems, they are often reasonable compromises, chosen in order to avoid highly complex designs or to satisfy optimization time and memory constraints.

[Figure 1: An illustration of the database application space]
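To see why modeling gaps matter, consider a deliberately simplified page-based cost model (the constants and formulas are assumptions for illustration, not SQL Server's model). A cardinality estimate that is off by two orders of magnitude flips which access path looks cheaper:

```python
# Illustrative page-based cost model (assumed constants, not SQL Server's).
PAGE_ROWS = 100          # rows stored per data page
RANDOM_IO_FACTOR = 4.0   # random I/O assumed 4x the cost of sequential I/O

def scan_cost(table_rows: int) -> float:
    """Sequential scan: read every data page once."""
    return table_rows / PAGE_ROWS

def seek_cost(matching_rows: int) -> float:
    """Index seek: roughly one random lookup per qualifying row."""
    return matching_rows * RANDOM_IO_FACTOR

table_rows = 1_000_000
for estimate in (100, 10_000):   # optimizer's guess of qualifying rows
    best = "seek" if seek_cost(estimate) < scan_cost(table_rows) else "scan"
    print(f"estimated {estimate:>6} matching rows -> choose {best}")
```

Real models must also weigh caching, parallelism and memory grants, which is where the compromises described above come in.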
“Two wrongs can make a right” and Overfitting: Occasionally, the query optimizer can produce nearly optimal plans even in the presence of large estimation errors and estimation guesses. These can be the result of “lucky” combinations of two or more inaccuracies canceling each other out. Additionally, applications may be built in a way that relies on specific limitations of the optimizer's model. Such overfitting of the application's behavior to the limitations of the optimizer's model can happen intentionally, when a developer has knowledge of specific system idiosyncrasies and develops the application in a way that depends on those idiosyncrasies. It can also happen unintentionally, when the developer keeps trying different ways of building the application until the desired performance is achieved (because a specific combination of events was hit). Of course, there are no guarantees that system idiosyncrasies and lucky combinations of events will remain constant between product releases or over changes during the application lifecycle. Therefore, applications (and any tests based on such applications) that rely on overfitting may experience unpredictable regressions when the conditions on which they depend change.
Adaptive optimization and self-tuning techniques: Self-tuning techniques, used to simplify the tasks of system administration and to mitigate the effect of estimation errors, themselves generate tuning and validation challenges. For example, SQL Server's policy for automatically updating statistics can be too eager for certain customer scenarios, resulting in unnecessary CPU and I/O consumption, and too lazy for others, resulting in inaccurate cost estimations. Advanced techniques used to mitigate cost model inaccuracies and limitations, for example the use of execution feedback to correct cardinality estimates, or the implementation of corrective actions during execution time, introduce similar tradeoffs and tuning problems.
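As a sketch of the eager-vs-lazy tension, the function below approximates the spirit of SQL Server's legacy auto-update-statistics threshold (roughly 500 rows plus 20% of the table modified; the actual product logic has more cases and has changed across releases):

```python
def stats_need_update(table_rows: int, rows_modified: int) -> bool:
    """Sketch of a fixed-threshold policy in the spirit of SQL Server's
    legacy auto-update-statistics rule; illustrative, not the real logic."""
    if table_rows <= 500:
        return rows_modified > 500
    return rows_modified > 500 + 0.20 * table_rows

# The same fixed threshold is relatively eager for small tables and very
# lazy for huge ones -- the tuning tension described above.
for rows, modified in ((1_000, 800), (100_000_000, 1_000_000)):
    print(f"{rows:>11,} rows, {modified:>9,} modified ->",
          stats_need_update(rows, modified))
```

On the 100-million-row table, a million modified rows (enough to skew estimates for a skewed column) still does not trigger an update, while the small table refreshes after 800 changes.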
Optimization quality is a problem of statistical nature: SQL Server's customer base includes a variety of workload types with varying performance requirements. Figure 1 illustrates the space of different workloads. Workloads in the bottom-left area of the space are typical Online Transaction Processing (OLTP) workloads, which include simple, often parameterized queries. Such workloads require short optimization times and benefit from plan reuse. Workloads on the right side of the space may include Decision Support System (DSS) or data warehousing applications, which usually consist of complex queries over large data sets. DSS workloads have a higher tolerance for longer optimization times, and thus more advanced optimization techniques can be used for them. They typically contain ad-hoc queries generated by query-generator tools/interfaces. The middle area of the application space contains a larger variety of applications that cannot be characterized as simply as the ones above. Those applications can contain a mixture of simple and more complex queries, which can be either short- or long-running. Changes in the optimization process affect queries from different parts of the application space in different ways, either because of shifts in existing tradeoffs and policies, or because of issues related to overfitting. Inevitably, that makes the measurement of optimization quality a problem of statistical nature. As an example, Figure 2 illustrates the results of experiments with new query optimizer features on a realistic query workload. In most cases the new features provide significant performance gains (especially for long-running queries), but they cause regressions for some parts of the workload (all points below 1 in Figure 2). Some short-running queries were affected by increases in compilation time, while a few others regressed because of a suboptimal plan choice or a lack of tuning for the specific hardware used in the experiment. While in this example the benefits of the new functionality outweigh the performance regressions, there have been other cases where it was more difficult to make a judgment about the advantages vs. the disadvantages of introducing a particular new feature.

[Figure 2: Performance impact of new optimizations]
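Evaluating such an experiment is itself a statistical exercise. A minimal sketch (with hypothetical timings, not the paper's data) aggregates per-query speedup ratios like those plotted in Figure 2 and isolates the regressions, i.e. the points that would fall below 1:

```python
from statistics import geometric_mean

# Hypothetical per-query timings (seconds): baseline vs. new optimizer build.
baseline = {"q1": 0.12, "q2": 0.05, "q3": 45.0, "q4": 310.0, "q5": 2.4}
candidate = {"q1": 0.19, "q2": 0.05, "q3": 6.1, "q4": 42.0, "q5": 2.6}

ratios = {q: baseline[q] / candidate[q] for q in baseline}  # >1 means faster
regressions = {q: round(r, 2) for q, r in ratios.items() if r < 1.0}

print(f"geometric-mean speedup: {geometric_mean(ratios.values()):.2f}x")
print(f"regressed queries ({len(regressions)}/{len(ratios)}): {regressions}")
```

A healthy aggregate speedup can coexist with individually painful regressions, which is precisely why a single summary number is not enough to judge a new feature.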
3 Future Challenges and Conclusions

Query optimization has a very big impact on the performance of a DBMS, and it continuously evolves with new, more sophisticated optimization strategies. We describe two more challenges, which we expect will play a larger role in the future.

The transformation-based optimizer architecture of Volcano and Cascades provides an elegant framework, which makes the addition of new optimization rules easy. While it is straightforward to test each rule in isolation using simple use cases, it is harder to test the possible combinations and interactions between rules and to ensure plan correctness. Also, with every addition of a new exploration rule, the search space expands and the number of possible plan choices increases accordingly. There is a need for advanced metrics and tools that help analyze the impact of such changes on the plan space. As query optimizers advance, the opportunities for optimizations that provide value across most scenarios decrease; hence optimization logic becomes more granular. There has been research indicating that query optimizers are already making very fine-grained choices, perhaps unnecessarily so, given the presence of cardinality estimation errors.

Although we described query optimization testing with a focus on correctness and optimality, another interesting dimension of query optimization quality is the concept of performance predictability. For a certain segment of mission-critical applications we see the need for predictable performance to be as important as the need for optimal performance. More work is needed on defining, measuring and validating predictability for different classes of applications.

Clearly, not all the challenges that we presented in this paper have been fully tackled. The validation process and testing techniques will continue to evolve along with the evolution of the optimization technology and product goals. The techniques described in this paper allow basic validation and also provide insight regarding the impact of code changes in the optimization process. As query optimizers become more sophisticated and are supplemented with more self-tuning techniques, additional challenges will continue to surface.
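One technique in the spirit of validating rule combinations, shown below as a hedged sketch rather than SQL Server's actual harness, is differential testing of physical alternatives: every plan choice must produce the same rows as a naive reference implementation on randomized inputs.

```python
import random

def nested_loop_join(left, right, key):
    """Naive O(n*m) reference 'plan' used as the correctness oracle."""
    return sorted((l, r) for l in left for r in right if l[key] == r[key])

def hash_join(left, right, key):
    """Alternative physical operator an optimizer might substitute."""
    buckets = {}
    for row in right:
        buckets.setdefault(row[key], []).append(row)
    return sorted((l, r) for l in left for r in buckets.get(l[key], []))

# Differential test: both plans must agree on many randomized inputs.
random.seed(0)
for trial in range(1_000):
    left = [(random.randrange(5), "L", i) for i in range(random.randrange(8))]
    right = [(random.randrange(5), "R", i) for i in range(random.randrange(8))]
    assert hash_join(left, right, 0) == nested_loop_join(left, right, 0), trial
print("1,000 randomized trials passed")
```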
Translation

Testing SQL Server's Query Optimizer: Challenges, Techniques and Experiences
Material Source: IEEE Data Engineering Bulletin
Author: Leo Giakoumakis, Cesar Galindo-Legaria

1 Introduction

Today's query optimizers provide highly sophisticated functionality designed to serve a large variety of workloads, data sizes and usage patterns. They are the result of many years of research and development, which has come at the cost of increased engineering complexity, particularly in validating correctness and measuring quality. Query optimizers have several unique characteristics that make them more complex to validate than most other software systems. The declarative data queries they handle (e.g. SQL and XQuery), together with logical/physical schemas and data, form a practically infinite input space. Simply enumerating all possible input combinations is infeasible, and it is hard to predict or extrapolate expected behavior by grouping similar elements of the input space into equivalence classes. The query optimization process itself has high algorithmic complexity and relies on inexact cost estimation models. Moreover, query optimizers must satisfy workloads with a variety of different requirements, for example optimizing for throughput or for response time. Over time, the number of existing customers that must be supported keeps growing, which constrains how query optimization technology can be advanced without disturbing existing customer expectations. Although new optimization techniques may improve query performance by orders of magnitude for some workloads, the same optimizations may cause regressions or unnecessary overhead for others. For these reasons, a large part of the query optimizer's validation process is meant to provide an understanding of the different tradeoffs and design choices with respect to their impact on different customer scenarios. At the same time, the validation process needs to assess the regression risk of code changes that may have a large impact across many workload and query types.

2 Key Challenges

The goal of query optimization is to produce efficient execution strategies for declarative queries. This involves selecting an optimal execution plan out of a space of alternatives while operating within a set of resource constraints. Depending on the optimization goals, the best-performing strategy may be optimized for response time, throughput, I/O, memory, or a combination of these goals. The different attributes of the query optimization process, and the constraints within which it must function, make tuning the optimization choices and tradeoffs a challenging problem.

Large input space and multiple paths: the expressive power of query languages gives the query optimizer a practically infinite input space. For each query, the optimizer considers a large number of execution plans, which are code paths that must be exercised and validated. The unbounded input space of possible queries, together with the large number of alternative execution paths, produces a combinatorial explosion that makes exhaustive testing impossible. Selecting a representative set of test cases that achieves appropriate coverage of the input space is a rather difficult task.

Optimization time: finding the optimal join order in query optimization is NP-hard. Therefore, in many cases the query optimizer has to cut its path aggressively through the search space and settle for a plan that is hopefully close to the theoretical optimum. Finding the “sweet spot” between optimization time/resources and plan performance, while tuning the various heuristics, is a challenging engineering problem. New optimizations typically introduce new alternatives and extend the search space, often requiring such tradeoff decisions to be re-tuned.

Cardinality estimation: one factor that complicates the validation of execution plan optimality is the optimizer's reliance on cardinality estimation. Query optimizers mainly rely on statistical information to make cardinality estimates, which are inherently inexact and have known limitations as data and query patterns become more complex. Moreover, some query constructs and data patterns are not covered by the mathematical model used to estimate cardinalities; in such cases the optimizer makes crude estimates or falls back on simple heuristics. In the early days of SQL Server most workloads consisted of prepared, single query-block statements, whereas today query-generator interfaces are very common, producing complex ad-hoc queries whose characteristics make cardinality estimation very challenging. Inevitably, testing the optimizer's plan selection functionality depends on the accuracy of cardinality estimation. Improvements to the estimation model, such as increasing the detail captured by statistics and enhancing the cardinality estimation algorithms, improve the quality of plan selection; however, such enhancements typically bring additional CPU cost and increased memory consumption.

Cost estimation: the cost models used by query optimizers, like the cardinality estimation models, are also inexact and incomplete. Not all hardware characteristics, runtime conditions and physical data layouts can be modeled by the query optimizer. Although such design choices can clearly lead to reliability problems, they are often reasonable compromises made to avoid highly complex designs or to satisfy optimization time and memory constraints.

“Two wrongs can make a right” and overfitting: even amid large estimation errors and guesses, the query optimizer can sometimes produce a nearly optimal plan; this may be the “lucky” result of two or more inaccuracies canceling each other out. In addition, applications may be built to rely on specific limitations of the optimizer's model. Such overfitting of an application's behavior around the optimizer's limitations can happen intentionally, when a developer knows particular system idiosyncrasies and builds the application to depend on them. It can also happen unintentionally, when a developer keeps trying different ways of building the application until the desired performance is reached (because a specific combination of events was hit). Of course, there is no guarantee that such system idiosyncrasies and lucky combinations of events will stay the same between product releases or across changes during the application's lifecycle. Therefore, applications that rely on overfitting (and any tests based on such applications) may experience unpredictable regressions when the conditions they depend on change.

Adaptive optimization and self-tuning techniques: self-tuning techniques, used to simplify system administration tasks and to mitigate the effect of estimation errors, themselves create tuning and validation challenges. For example, SQL Server's policy for automatically updating statistics can be too eager in some customer scenarios, causing unnecessary CPU and I/O consumption, and too lazy in others, causing inaccurate cost estimates. Advanced techniques for mitigating cost-model errors and limitations, such as using execution feedback to correct cardinality estimates or taking corrective actions at execution time, raise similar tradeoff and tuning problems.

Optimization quality is a problem of statistical nature: SQL Server's customer base includes many workload types with different performance requirements. Figure 1 illustrates the space of different database applications. The bottom-left area holds typical Online Transaction Processing (OLTP) workloads, which include simple, often parameterized queries. Such workloads require short optimization times and benefit from plan reuse. Workloads on the right side of the space may include Decision Support System (DSS) or data warehousing applications, which usually involve complex queries over large data sets. DSS workloads have a higher tolerance for longer optimization times, so more advanced optimization techniques can be used for them. They typically contain ad-hoc queries generated by query-generator tools/interfaces. The middle of the space holds a larger variety of applications that cannot be characterized as simply as the ones above.