Foreign Literature Translation

Original Text

Testing SQL Server's Query Optimizer: Challenges, Techniques and Experiences
Material Source: IEEE Data Engineering Bulletin
Author: Leo Giakoumakis, Cesar Galindo-Legaria

1 Introduction

Today's query optimizers provide highly sophisticated functionality that is designed to serve a large variety of workloads, data sizes and usage patterns. They are the result of many years of research and development, which has come at the cost of increased engineering complexity, specifically in validating correctness and measuring quality. There are several unique characteristics that make query optimizers exceptionally complex systems to validate, more so than most other software systems. Query optimizers handle a practically infinite input space of declarative data queries (e.g. SQL, XQuery), logical/physical schemas and data. A simple enumeration of all possible input combinations is infeasible, and it is hard to predict or extrapolate expected behavior by grouping similar elements of the input space into equivalence classes. The query optimization process itself is of high algorithmic complexity, and relies on inexact cost estimation models. Moreover, query optimizers ought to satisfy workloads and usage scenarios with a variety of different requirements and expectations, e.g. to optimize for throughput or for response time. Over time, the number of existing customers that need to be supported increases, a fact that introduces constraints on advancing query optimization technology without disturbing existing customer expectations. While new optimizations may improve query performance by orders of magnitude for some workloads, the same optimizations may cause performance regressions (or unnecessary overhead) for other workloads. For those reasons, a large part of the validation process of the query optimizer is meant to provide an understanding of the different tradeoffs and design choices with respect to their impact across different customer scenarios. At the same time, the validation process needs to provide an assessment of regression risk for code changes that may have a large impact across a large number of workload and query types.

2 Key Challenges

The goal of query optimization is to produce efficient execution strategies for declarative queries. This involves the selection of an optimal execution plan out of a space of alternatives, while operating within a set of resource constraints. Depending on the optimization goals, the best-performing strategy could be optimized for response time, throughput, I/O, memory, or a combination of such goals. The different attributes of the query optimization process and the constraints within which it has to function make the tuning of the optimization choices and tradeoffs a challenging problem.

Large input space and multiple paths: The expressive power of query languages results in a practically infinite space of inputs to the query optimizer. For each query the query optimizer considers a large number of execution plans, which are code paths that need to be exercised and validated. The unbounded input space of possible queries, along with the large number of alternative execution paths, generates a combinatorial explosion that makes exhaustive testing impossible. The selection of a representative set of test cases that achieves appropriate coverage of the input space can be a rather difficult task.
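To make the scale of this combinatorial explosion concrete, the short sketch below (not from the paper; it uses the standard counting argument for binary join trees) computes how many distinct bushy join trees exist for n relations: n! orderings of the leaves times the Catalan number C(n-1) of tree shapes.

```python
from math import comb, factorial

def catalan(n: int) -> int:
    """n-th Catalan number: the count of binary tree shapes with n+1 leaves."""
    return comb(2 * n, n) // (n + 1)

def bushy_join_trees(n: int) -> int:
    """Distinct bushy join trees over n base relations:
    n! leaf orderings times catalan(n - 1) tree shapes."""
    return factorial(n) * catalan(n - 1)

for n in (2, 4, 8, 12, 16):
    print(f"{n:2d} relations -> {bushy_join_trees(n):,} join trees")
```

Already at 8 relations there are over 17 million join trees, before any physical operator choices (join algorithms, access paths) are even considered.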
Optimization time: The problem of finding the optimal join order in query optimization is NP-hard. Thus, in many cases the query optimizer has to cut its path aggressively through the search space and settle for a plan that is hopefully near the theoretical optimum. The infeasibility of exhaustive search introduces a tradeoff between optimization time and plan performance. Finding the “sweet spot” between optimization time/resources and plan performance, along with the tuning of the different heuristics, is a challenging engineering problem. New optimizations typically introduce new alternatives and extend the search space, often making it necessary to re-tune such tradeoff decisions.
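As an illustration of cutting through the search space, here is a minimal greedy join-ordering heuristic (a toy with made-up cardinalities and selectivities, not SQL Server's algorithm): instead of enumerating all orders, it repeatedly joins the pair with the smallest estimated intermediate result, trading plan optimality for near-linear optimization time.

```python
import itertools

# Hypothetical catalog: base-table cardinalities and pairwise join selectivities.
CARD = {"A": 1_000, "B": 50_000, "C": 200, "D": 10_000}
SEL = {frozenset("AB"): 0.001, frozenset("BC"): 0.01, frozenset("AC"): 0.05,
       frozenset("AD"): 0.002, frozenset("BD"): 0.0005, frozenset("CD"): 0.1}

def est_card(left, right, cards):
    """Estimate |left JOIN right|: product of input cardinalities times the
    most selective predicate linking the two sides (1.0 = cross product)."""
    sel = min((SEL[frozenset({a, b})] for a in left for b in right
               if frozenset({a, b}) in SEL), default=1.0)
    return cards[left] * cards[right] * sel

def greedy_join_order(tables):
    """Repeatedly merge the pair of subplans with the smallest estimated
    intermediate result -- a cheap stand-in for exhaustive enumeration."""
    cards = {frozenset(t): float(CARD[t]) for t in tables}
    plans = {frozenset(t): t for t in tables}
    while len(cards) > 1:
        l, r = min(itertools.combinations(cards, 2),
                   key=lambda p: est_card(p[0], p[1], cards))
        cards[l | r] = est_card(l, r, cards)
        plans[l | r] = f"({plans[l]} JOIN {plans[r]})"
        for s in (l, r):
            del cards[s], plans[s]
    root = next(iter(cards))
    return plans[root], cards[root]

tree, rows = greedy_join_order("ABCD")
print(tree, f"~{rows:,.0f} estimated rows")
```

A heuristic like this can be arbitrarily worse than the optimum, which is exactly the tradeoff the text describes between optimization time and plan performance.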
Cardinality estimation: A factor that complicates the validation of execution plan optimality is the reliance of the query optimizer on cardinality estimation. Query optimizers mainly rely on statistical information to make cardinality estimates, which are inherently inexact and have known limitations as data and query patterns become more complex. Moreover, there are query constructs and data patterns that are not covered by the mathematical model used to estimate cardinalities. In such cases, query optimizers make crude estimations or resort to simple heuristics. While in the early days of SQL Server the majority of workloads consisted of prepared, single query-block statements, today query-generator interfaces are very common, producing complex ad-hoc queries with characteristics that make cardinality estimation very challenging. Inevitably, testing the plan selection functionality of the query optimizer depends on the accuracy of the cardinality estimation. Improvements in the estimation model, such as increasing the amount of detail captured by statistics and enhancing the cardinality estimation algorithms, increase the quality of the plan selection process. However, such enhancements typically come with additional CPU cost and increased memory consumption.
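A classic failure mode is predicate correlation. The sketch below (synthetic data, not from the paper) contrasts the textbook independence assumption, which multiplies per-column selectivities, against the true combined selectivity when two columns are strongly correlated.

```python
import random

random.seed(42)

# Synthetic table: "city" and "zip_prefix" are perfectly correlated columns.
rows = [("Seattle", "981") if random.random() < 0.3 else ("Boston", "021")
        for _ in range(100_000)]

def selectivity(pred) -> float:
    return sum(1 for r in rows if pred(r)) / len(rows)

sel_city = selectivity(lambda r: r[0] == "Seattle")
sel_zip = selectivity(lambda r: r[1] == "981")

# Textbook estimate multiplies per-column selectivities (independence).
independent_est = sel_city * sel_zip * len(rows)
actual = selectivity(lambda r: r[0] == "Seattle" and r[1] == "981") * len(rows)

print(f"independence estimate: {independent_est:,.0f} rows")
print(f"actual:                {actual:,.0f} rows")  # ~3x the estimate here
```

Under- or over-estimates like this propagate up the plan tree, which is why downstream plan choices, and the tests that validate them, depend on estimation accuracy.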
Cost estimation: Cost models used by query optimizers, similarly to cardinality estimation models, are also inexact and incomplete. Not all hardware characteristics, runtime conditions, and physical data layouts are modeled by the query optimizer. Although such design choices can obviously lead to reliability problems, they are often reasonable compromises, chosen in order to avoid highly complex designs or to satisfy optimization time and memory constraints.

[Figure 1: An illustration of the database application space]
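To see why modeling gaps matter, consider a deliberately simplified page-based cost model (the constants and formulas are assumptions for illustration, not SQL Server's model). A cardinality estimate that is off by two orders of magnitude flips which access path looks cheaper:

```python
# Illustrative page-based cost model (assumed constants, not SQL Server's).
PAGE_ROWS = 100          # rows stored per data page
RANDOM_IO_FACTOR = 4.0   # random I/O assumed 4x the cost of sequential I/O

def scan_cost(table_rows: int) -> float:
    """Sequential scan: read every data page once."""
    return table_rows / PAGE_ROWS

def seek_cost(matching_rows: int) -> float:
    """Index seek: roughly one random lookup per qualifying row."""
    return matching_rows * RANDOM_IO_FACTOR

table_rows = 1_000_000
for estimate in (100, 10_000):   # optimizer's guess of qualifying rows
    best = "seek" if seek_cost(estimate) < scan_cost(table_rows) else "scan"
    print(f"estimated {estimate:>6} matching rows -> choose {best}")
```

Real models must also weigh caching, parallelism and memory grants, which is where the compromises described above come in.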
“Two wrongs can make a right” and Overfitting: Occasionally, the query optimizer can produce nearly optimal plans even in the presence of large estimation errors and estimation guesses. These can be the result of “lucky” combinations of two or more inaccuracies canceling each other out. Additionally, applications may be built in a way that relies on specific limitations of the optimizer's model. Such overfitting of the application's behavior to the limitations of the optimizer's model can happen intentionally, when a developer has knowledge of specific system idiosyncrasies and develops the application in a way that depends on those idiosyncrasies. It can also happen unintentionally, when the developer keeps trying different ways of building the application until the desired performance is achieved (because a specific combination of events was hit). Of course, there are no guarantees that system idiosyncrasies and lucky combinations of events will remain constant between product releases or over changes during the application lifecycle. Therefore, applications (and any tests based on such applications) that rely on overfitting may experience unpredictable regressions when the conditions on which they depend change.
Adaptive optimization and self-tuning techniques: Self-tuning techniques, used to simplify the tasks of system administration and to mitigate the effect of estimation errors, themselves generate tuning and validation challenges. For example, SQL Server's policy for automatically updating statistics can be too eager for certain customer scenarios, resulting in unnecessary CPU and I/O consumption, and too lazy for others, resulting in inaccurate cost estimations. Advanced techniques used to mitigate cost model inaccuracies and limitations, for example the use of execution feedback to correct cardinality estimates, or the implementation of corrective actions during execution time, introduce similar tradeoffs and tuning problems.
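As a sketch of the eager-vs-lazy tension, the function below approximates the spirit of SQL Server's legacy auto-update-statistics threshold (roughly 500 rows plus 20% of the table modified; the actual product logic has more cases and has changed across releases):

```python
def stats_need_update(table_rows: int, rows_modified: int) -> bool:
    """Sketch of a fixed-threshold policy in the spirit of SQL Server's
    legacy auto-update-statistics rule; illustrative, not the real logic."""
    if table_rows <= 500:
        return rows_modified > 500
    return rows_modified > 500 + 0.20 * table_rows

# The same fixed threshold is relatively eager for small tables and very
# lazy for huge ones -- the tuning tension described above.
for rows, modified in ((1_000, 800), (100_000_000, 1_000_000)):
    print(f"{rows:>11,} rows, {modified:>9,} modified ->",
          stats_need_update(rows, modified))
```

On the 100-million-row table, a million modified rows (enough to skew estimates for a skewed column) still does not trigger an update, while the small table refreshes after 800 changes.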
Optimization quality is a problem of statistical nature: SQL Server's customer base includes a variety of workload types with varying performance requirements. Figure 1 illustrates the space of different workloads. Workloads in the bottom-left area of the space are typical Online Transaction Processing (OLTP) workloads, which include simple, often parameterized queries. Such workloads require short optimization times and benefit from plan reuse. Workloads on the right side of the space may include Decision Support System (DSS) or data warehousing applications, which usually consist of complex queries over large data sets. DSS workloads have a higher tolerance for longer optimization times, and thus more advanced optimization techniques can be used for them. They typically contain ad-hoc queries generated by query-generator tools/interfaces. The middle area of the application space contains a larger variety of applications that cannot be characterized as simply as the ones above. Those applications can contain a mixture of simple and more complex queries, which can be either short- or long-running. Changes in the optimization process affect queries from different parts of the application space in different ways, either because of shifts in existing tradeoffs and policies, or because of issues related to overfitting. Inevitably, that makes the measurement of optimization quality a problem of statistical nature. As an example, Figure 2 illustrates the results of experiments with new query optimizer features on a realistic query workload. In most cases the new features provide significant performance gains (especially for long-running queries), but they cause regressions for some parts of the workload (all points below 1 in Figure 2). Some short-running queries were affected by increases in compilation time, while a few others regressed because of a suboptimal plan choice or a lack of tuning for the specific hardware used in the experiment. While in this example the benefits of the new functionality outweigh the performance regressions, there have been other cases where it was more difficult to make a judgment about the advantages vs. the disadvantages of introducing a particular new feature.

[Figure 2: Performance impact of new optimizations]
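Evaluating such an experiment is itself a statistical exercise. A minimal sketch (with hypothetical timings, not the paper's data) aggregates per-query speedup ratios like those plotted in Figure 2 and isolates the regressions, i.e. the points that would fall below 1:

```python
from statistics import geometric_mean

# Hypothetical per-query timings (seconds): baseline vs. new optimizer build.
baseline = {"q1": 0.12, "q2": 0.05, "q3": 45.0, "q4": 310.0, "q5": 2.4}
candidate = {"q1": 0.19, "q2": 0.05, "q3": 6.1, "q4": 42.0, "q5": 2.6}

ratios = {q: baseline[q] / candidate[q] for q in baseline}  # >1 means faster
regressions = {q: round(r, 2) for q, r in ratios.items() if r < 1.0}

print(f"geometric-mean speedup: {geometric_mean(ratios.values()):.2f}x")
print(f"regressed queries ({len(regressions)}/{len(ratios)}): {regressions}")
```

A healthy aggregate speedup can coexist with individually painful regressions, which is precisely why a single summary number is not enough to judge a new feature.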
3 Future Challenges and Conclusions

Query optimization has a very big impact on the performance of a DBMS, and it continuously evolves with new, more sophisticated optimization strategies. We describe two more challenges, which we expect will play a larger role in the future.

The transformation-based optimizer architecture of Volcano and Cascades provides an elegant framework, which makes the addition of new optimization rules easy. While it is straightforward to test each rule in isolation using simple use cases, it is harder to test the possible combinations and interactions between rules and to ensure plan correctness. Also, with every addition of a new exploration rule, the search space expands and the number of possible plan choices increases accordingly. There is a need for advanced metrics and tools that help analyze the impact of such changes on the plan space. As query optimizers advance, the opportunities for optimizations that provide value across most scenarios decrease; hence optimization logic becomes more granular. There has been research indicating that query optimizers are already making very fine-grained choices, perhaps unnecessarily so, given the presence of cardinality estimation errors.

Although we described query optimization testing with a focus on correctness and optimality, another interesting dimension of query optimization quality is the concept of performance predictability. For a certain segment of mission-critical applications we see the need for predictable performance to be as important as the need for optimal performance. More work is needed on defining, measuring and validating predictability for different classes of applications.

Clearly, not all the challenges that we presented in this paper have been fully tackled. The validation process and testing techniques will continue to evolve along with the evolution of the optimization technology and product goals. The techniques described in this paper allow basic validation and also provide insight regarding the impact of code changes in the optimization process. As query optimizers become more sophisticated and are supplemented with more self-tuning techniques, additional challenges will continue to surface.
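One technique in the spirit of validating rule combinations, shown below as a hedged sketch rather than SQL Server's actual harness, is differential testing of physical alternatives: every plan choice must produce the same rows as a naive reference implementation on randomized inputs.

```python
import random

def nested_loop_join(left, right, key):
    """Naive O(n*m) reference 'plan' used as the correctness oracle."""
    return sorted((l, r) for l in left for r in right if l[key] == r[key])

def hash_join(left, right, key):
    """Alternative physical operator an optimizer might substitute."""
    buckets = {}
    for row in right:
        buckets.setdefault(row[key], []).append(row)
    return sorted((l, r) for l in left for r in buckets.get(l[key], []))

# Differential test: both plans must agree on many randomized inputs.
random.seed(0)
for trial in range(1_000):
    left = [(random.randrange(5), "L", i) for i in range(random.randrange(8))]
    right = [(random.randrange(5), "R", i) for i in range(random.randrange(8))]
    assert hash_join(left, right, 0) == nested_loop_join(left, right, 0), trial
print("1,000 randomized trials passed")
```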
Translation

Testing SQL Server's Query Optimizer: Challenges, Techniques and Experiences
Material Source: IEEE Data Engineering Bulletin
Author: Leo Giakoumakis, Cesar Galindo-Legaria

1 Introduction

Today's query optimizers provide highly sophisticated functionality designed to serve a large variety of workloads, data sizes and usage patterns. They are the result of many years of research and development, which has come at the cost of increased engineering complexity, particularly in validating correctness and measuring quality. Query optimizers have several unique characteristics that make them more complex to validate than most other software systems. The declarative data queries they handle (e.g. SQL and XQuery), together with logical/physical schemas and data, form a practically infinite input space. Simply enumerating all possible input combinations is infeasible, and it is hard to predict or extrapolate expected behavior by grouping similar elements of the input space into equivalence classes. The query optimization process itself has high algorithmic complexity and relies on inexact cost estimation models. Moreover, query optimizers must satisfy workloads with a variety of different requirements, for example optimizing for throughput or for response time. Over time, the number of existing customers that must be supported keeps growing, which constrains how query optimization technology can be advanced without disturbing existing customer expectations. Although new optimization techniques may improve query performance by orders of magnitude for some workloads, the same optimizations may cause regressions or unnecessary overhead for others. For these reasons, a large part of the query optimizer's validation process is meant to provide an understanding of the different tradeoffs and design choices with respect to their impact on different customer scenarios. At the same time, the validation process needs to assess the regression risk of code changes that may have a large impact across many workload and query types.

2 Key Challenges

The goal of query optimization is to produce efficient execution strategies for declarative queries. This involves selecting an optimal execution plan out of a space of alternatives while operating within a set of resource constraints. Depending on the optimization goals, the best-performing strategy may be optimized for response time, throughput, I/O, memory, or a combination of these goals. The different attributes of the query optimization process, and the constraints within which it must function, make tuning the optimization choices and tradeoffs a challenging problem.

Large input space and multiple paths: the expressive power of query languages gives the query optimizer a practically infinite input space. For each query, the optimizer considers a large number of execution plans, which are code paths that must be exercised and validated. The unbounded input space of possible queries, together with the large number of alternative execution paths, produces a combinatorial explosion that makes exhaustive testing impossible. Selecting a representative set of test cases that achieves appropriate coverage of the input space is a rather difficult task.

Optimization time: finding the optimal join order in query optimization is NP-hard. Therefore, in many cases the query optimizer has to cut its path aggressively through the search space and settle for a plan that is hopefully close to the theoretical optimum. Finding the “sweet spot” between optimization time/resources and plan performance, while tuning the various heuristics, is a challenging engineering problem. New optimizations typically introduce new alternatives and extend the search space, often requiring such tradeoff decisions to be re-tuned.

Cardinality estimation: one factor that complicates the validation of execution plan optimality is the optimizer's reliance on cardinality estimation. Query optimizers mainly rely on statistical information to make cardinality estimates, which are inherently inexact and have known limitations as data and query patterns become more complex. Moreover, some query constructs and data patterns are not covered by the mathematical model used to estimate cardinalities; in such cases the optimizer makes crude estimates or falls back on simple heuristics. In the early days of SQL Server most workloads consisted of prepared, single query-block statements, whereas today query-generator interfaces are very common, producing complex ad-hoc queries whose characteristics make cardinality estimation very challenging. Inevitably, testing the optimizer's plan selection functionality depends on the accuracy of cardinality estimation. Improvements to the estimation model, such as increasing the detail captured by statistics and enhancing the cardinality estimation algorithms, improve the quality of plan selection; however, such enhancements typically bring additional CPU cost and increased memory consumption.

Cost estimation: the cost models used by query optimizers, like the cardinality estimation models, are also inexact and incomplete. Not all hardware characteristics, runtime conditions and physical data layouts can be modeled by the query optimizer. Although such design choices can clearly lead to reliability problems, they are often reasonable compromises made to avoid highly complex designs or to satisfy optimization time and memory constraints.

“Two wrongs can make a right” and overfitting: even amid large estimation errors and guesses, the query optimizer can sometimes produce a nearly optimal plan; this may be the “lucky” result of two or more inaccuracies canceling each other out. In addition, applications may be built to rely on specific limitations of the optimizer's model. Such overfitting of an application's behavior around the optimizer's limitations can happen intentionally, when a developer knows particular system idiosyncrasies and builds the application to depend on them. It can also happen unintentionally, when a developer keeps trying different ways of building the application until the desired performance is reached (because a specific combination of events was hit). Of course, there is no guarantee that such system idiosyncrasies and lucky combinations of events will stay the same between product releases or across changes during the application's lifecycle. Therefore, applications that rely on overfitting (and any tests based on such applications) may experience unpredictable regressions when the conditions they depend on change.

Adaptive optimization and self-tuning techniques: self-tuning techniques, used to simplify system administration tasks and to mitigate the effect of estimation errors, themselves create tuning and validation challenges. For example, SQL Server's policy for automatically updating statistics can be too eager in some customer scenarios, causing unnecessary CPU and I/O consumption, and too lazy in others, causing inaccurate cost estimates. Advanced techniques for mitigating cost-model errors and limitations, such as using execution feedback to correct cardinality estimates or taking corrective actions at execution time, raise similar tradeoff and tuning problems.

Optimization quality is a problem of statistical nature: SQL Server's customer base includes many workload types with different performance requirements. Figure 1 illustrates the space of different database applications. The bottom-left area holds typical Online Transaction Processing (OLTP) workloads, which include simple, often parameterized queries. Such workloads require short optimization times and benefit from plan reuse. Workloads on the right side of the space may include Decision Support System (DSS) or data warehousing applications, which usually involve complex queries over large data sets. DSS workloads have a higher tolerance for longer optimization times, so more advanced optimization techniques can be used for them. They typically contain ad-hoc queries generated by query-generator tools/interfaces. The middle of the space holds a larger variety of applications that cannot be characterized as simply as the ones above.