ImageVerifierCode 换一换
格式:PPT , 页数:45 ,大小:538KB ,
资源ID:371516      下载积分:20 文钱
快捷下载
登录下载
邮箱/手机:
温馨提示:
快捷下载时,用户名和密码都是您填写的邮箱或者手机号,方便查询和重复下载(系统自动生成)。 如填写123,账号就是123,密码也是123。
特别说明:
请自助下载,系统不会自动发送文件的哦; 如果您已付费,想二次下载,请登录后访问:我的下载记录
支付方式: 支付宝    微信支付   
验证码:   换一换

加入VIP,省得不是一点点
 

温馨提示:由于个人手机设置不同,如果发现不能下载,请复制以下地址【https://www.wenke99.com/d-371516.html】到电脑端继续下载(重复下载不扣费)。

已注册用户请登录:
账号:
密码:
验证码:   换一换
  忘记密码?
三方登录: QQ登录   微博登录 

下载须知

1: 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。
2: 试题试卷类文档,如果标题没有明确说明有答案则都视为没有答案,请知晓。
3: 文件的所有权益归上传用户所有。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 本站仅提供交流平台,并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

版权提示 | 免责声明

本文(XML代价估计方法研究综述.PPT)为本站会员(天***)主动上传,文客久久仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。 若此文所含内容侵犯了您的版权或隐私,请立即通知文客久久(发送邮件至hr@wenke99.com或直接QQ联系客服),我们立即给予删除!

XML代价估计方法研究综述.PPT

1、XML 代价估计方法研究综述,XML Group,Outline,研究代价估计的必要性xml中代价估计研究的难点背景知识介绍现有方法分类,研究代价估计的必要性,查询优化的基础不同分支对查询目标的选择率不同,如果选择性低的分支先于选择性高的分支执行,就会有效的减少中间结果从而提供查询效率结构连接在XML中是一个很重要的操作,连接的顺序的选择需要计算不同连接顺序的代价及早给用户提供反馈信息,xml中代价估计研究的难点,XML数据的不规则性是对传统统计信息方法的重要挑战,具体表现在以下几个方面:路径依赖性,如同为name结点,在person下和在city下出现意义就完全不同结构相关性嵌套在不同祖先下

2、面的同类结点的个数差别可能很大,如book结点下的author个数是不确定的值相关性/purchase/Itemtype=bookprice100XML的有序性制约了转换规则的灵活性所有这些问题,都使得在xml中采用传统的代价估计方法不切实际,会带来很大的误差。针对xml数据的特点,我们应该寻求一种新的代价估计方法。,Background Knowledge,XML Data ModelA large, node-labeled tree T(V,E),Example XML Document Tree,Background Knowledge,XML Data ModelA large, n

3、ode-labeled tree T(V,E),Example XML Document,Background Knowledge,XML Query ModelA node-labeled query tree TQEach node of TQ is labeled with a variable name qi in Q Each edge (qi,qj) of TQ is annotated with an XPath expression path (qi,qj) that describes the specific structural constraints specified

4、 in Q,Query Fragment: for $b in doc(“bib.xml”)/bib/book$a in $b/author $t in $b/title,现有估计方法的分类,路径选择性计算Twig匹配个数统计值谓词选择率估计结构连接代价估计,Overview,data tree,different summarization methods based on:,path,all m-length path,F/B-stability,LoreMcHugh et al, VLDB99,pruning,PathTree, Markov TableAboulnaga et al,

5、VLDB01,XSketchPolyzotis et al, VLDB02,type,XSketchesPolyzotis et al, SIGMOD02,+ value,+ twig,XSketchesPolyzotis et al, SIGMOD04,Region Code,2D Model,1D Model,StatiXFreire et al, SIGMOD02,PH HistogramWu Y EDBT 2002,Interva&Position ModelH.Lu SIGMOD 2003,路径选择性计算,问题描述 /a/bc/d/eg/e/f/*/*/e/f,Path Tree &

6、 Markov Table,Ashraf Aboulnaga, Alaa R. Alameldeen, and Jeffry F.Naughton. Estimating the selectivity of XML path expressions for Internet scale applications. VLDB 2001,Path Tree,ConstructionStart from an original path treeCount of nodes in absolute pathsAggregate the tree to fit in the limited spac

7、eSibling-*EstimationMatch the query against the path tree, matching * as the last resort.Need to decide whether to use total count or average count.,An XML document and its path tree,Estimation:count(A/B/D)=1 count(A/C/D)=1,Aggregate Path Tree,Aggregate Path Tree,A,1,H,6,E,5,D,7,J,4,I,2,红色结点表示那些不频繁出

8、现的结点,需要删除,从而保证path树能够放入内存,Sibling-* Summarization,A,1,C,9,*,*,K,*,f=23n=2,6,8,3,把查询和path 树匹配需要决定用总数还是平均数,估计方法,Estimation:count(/C/H/K)=11 count(/C/*/K)=23,Markov Table,构造一个表,存储长度=2。当查询路径长度=m时,可以直接从表中读出结果,否则,用公式计算。例如,m3,查询为/A/B/C/D,则当表不能装入内存的时候,进行剪枝操作。,Markov Table : Example,a,b,b,c,d,d,e,e,e,f,d,a,b

9、,d,d,e,1,2,2,1,3,c,f,1,1,m=2,suffix-*,Q1: a/c/e,Twig 匹配个数统计,问题描述Twig Query (basic building block of declarative query languages for XML),FOR $f IN document(“person.xml”)/department/faculty FOR $r in $f/RA, FOR $t in $f/TA,Department,Faculty,RA,TA,$f,$r,$t,XSketch方法,N.Polyzotis, M.Garofalakis. Statis

10、tical Synopses for Graph-Structured XML Databases, In Proc.of ACM SIGMOD Conf.2002 N. Polyzotis and M. Garofalakis. Structure and value synopses for xml data graphs. In Proceedings of 28th VLDB Conf., 2002.N.Polyzotis, M.Garofalakis, and Yannis Iosnnidis. Selectivity Estimation for XML Twigs. In Pro

11、c.of the 20th Intl.Conf. on Data Engineering,2004N. Polyzotis and M. Garofalakis and Y.Ioannidis. Approximate XML Query Answers. In Proc .ACM SIGMOD 2004,XSketch方法,B/F stability,Definition: Let U,V be sets of elements in an XML data Tree T. We say that V is backward-stable(B-stale) with respect to U

12、, iff for each vV there exists a u U such that edge (u,v) is in T.Similarly,U is said to be forward-stable(F-stable) with respect to V,iff for each u U there exists a v V such that the edge(u,v) is in G.,XSketch方法,How to estimate the cardinality of xpath query such as A/P/K or A/p ?,Utilizing edge s

13、tability,We can give an accurate match number:Count(A/P/K)=6Count(A/p)=3,But what about A/P/T which doesnt conform to backward stability?,2 assumptions Path independence Backward-edge uniformityCount(A/P/T)=f(A/P/T)*count(T)f(A/P/T)=f(A/P)*f(P/T|A/P)=f(P/T|A/P) =f(P/T)count(P)/(count(B)+count(P),XSk

14、etch方法(处理值谓词),v1,v3,v5,v2,v4,B,B,F,F,STN(v5),Dep(v5)=v1,v4,v1,v4,Histogram at v5,H(1, 4)= count(v11v2/v3v44/v5),v3 v5 count(v11v2/v33v44/v5)=H(1, 4)*count(v33)/count(v3),A1Edge-Value IndependenceAcross Non-Satble Edgesf(u/v|v) f(u/v)A2Value-Independence Outside Correlation Scope,XSkecth方法(处理twig),Pr

15、oblem with the former modelLet us see a simple example,for t0 in A, t1 in t0/B , t2 in t0/C,(a),(b),(c) Structural XSketch,2000 binding tuples on the first document10100 tuples on the second document,XSkecth方法(处理twig),Singe-path XSketch model dose not store detailed enough information to capture the

16、 correlation patterns between multiple path expressions.If node A records a two-dimensional distribution fA(b,c) for the fraction of elements in A that have b children in B and c children in C.,XSkecth方法(处理twig),CK,CY,Elements,fp,2,1,p4,0.25,P5,4(0.2521100.25110.51)5,CP,CN,2,1,2,1,0.25,2,1,1,1,0.5,1

17、,1,p8,p9,Another Example,Edge-distribution: fP(CY,CK,CP,CN),P,Y,P,K,A,P,A,N,Statix方法,Freire J, Jayant R, Ramanath M, Roy P and Simeon J. StatiX: Making XML Count. In: Proc. of ACM SIGMOD 2002, USA.,Statix方法,Constructionevery instance node is assigned a unique id(type id+sequential node id ) during p

18、arsing and validation, meanwhile gathering statisticsConstructing the histogramsHistogramBuild histogram for each typeMight contain histogram describing the distribution of the parents of a typeValue histogram can be built for types that are defined in terms of base types (eg. integer)EstimationConv

19、ert the query into SQL (i.e., relational join on IDs)Use standard histogram multiplication,StatiX : Example,FOR $s in document (“imdb.xml”)/show, $r in $s/reviewWHERE $s/year 1992RETURN $s/title, $r,SELECT title, reviewFROM Show, ReviewWHERE Show.year 1992 AND Show.ID = Review.ParID,STAT Show CARD:

20、5 ID: 16 STAT Show_Year VALUE: 19902001 BUCKET_NUM: 2 BUCKETS: 1990, 1994): 3; 1994, 2001): 2STAT Review CARD: 8 ID: 3038 PARENT HISTOGRAM Show BUCKET_NUM: 3 BUCKETS: 1, 2): 4; 2, 3): 1; 3, 5): 3 ,STAT Aka CARD:6 ID:6-12 PARENT HISTOGRAM Show_Episode BUCKET_NUM: 3 BUCKETS: 1, 2): 1; 2, 6): 0; 6, 12)

21、: 7 ,31/28/5=2.4,Summary,结构连接代价估计,问题描述给定任意的祖先节点集合A和后代节点集合D,计算A/D结果集合的大小,估计匹配的祖先后代的结果个数 当同一祖先有多个后代,或者同一后代有多个祖先时,节点对个数可能远远大于祖先或后代的个数。应用连接顺序的选择,结构连接代价估计,Existed work2D model: pH histogram Wu Y, Patel J and Jagadish H. V. Estimating Answer Sizes for XML Queries. In: Proc. of the EDBT 2002. Interval and

22、Positional model W.Wang, H.Jiang, H.Lu, and J.F.Xu. Containment join Size Estimation:Models and Methods. In:Proc. Of ACM SIGMOD 2003 ,Absolute Region Code (start, end), blah 1234 5678 0000 ,contact,name,phone,blah,office,home,mobile,1234,5678,0000,(1,16),(2,4),(3,3),(5,15),(6,8),(7,7),(9,11),(10,10)

23、,(12,14),(13,13),2,3,4,a.start d.start and d.end a.end,pH Histogram,Forbidden Regions For a node,R1和R2是结点A的ForbiddenRegion 两个结点的(start,end)满足:(1)no overlap (2)nested 不可能有部分重叠的(partiallyoverlap)情况,pH Histogram,EstPA=HistP1A1/4HistP2A+HistP2B+HistP2C+HistP2E+1/2(HistP2D+HistP2F),Ancestor-based Join Es

24、timation,EstPA=HistP2A1/4HistP1A+HistP1F+HistP2G+HistP1H,Descendant-based Join Estimation,pH Histogram,est = 3*(12+1+2+0.25*5) = 48.75,start,end,start,start,end,end,For off-diagonal cells:,pH for A,pH for D,Interval & Position Models,Map into new problems under 2 newly proposed models.Interval model

25、 and Position modelHistogram based method that is adaptive to the data: PL histogramAssumption: 1D uniform of the descendant setEstimation is robust when correlation is lowSampling based method that captures the correlation: IM Sampling and PM SamplingAssumption: Height of the XML data tree is a sma

26、ll constantBoth have theoretical bounds on the accuracy of the estimation,Interval Model,Interval ModelMap elements to 1D line segments (points)IMA(A): interval setIMD(D): point set|A SJ D|=| interval-point pairs|,interval set,point set,Position Model,Position ModelUse frequency tables to record cov

27、erage and starting point informationPMA(S): coverage tablePMD(S): start table|A SJ D|=Containment joinEquijoin,coverage table,start table,Point-Line (PL) -Histogram,Partition the pace into partitionsEstimate the overlapping (interval, point) pairs within each bucket.AssumptionsData distribution of A

28、 and D are independentD conforms to uniform distribution,Sampling in Interval Model,Adapted from bifocal samplingIM Algorithm:Choose m samples from IMD(D)Sum up the m subjoin sizes.X = Scale up the result by |IMD(D)|/mStatistical GuaranteeE(X) = X (join size)with high probability(by Hoeffding bound),2,1,3,6,m=2,Problem,How to handle parent-child relationship?How to count |A/B/C|How to use estimation for Query Optimization,Question ?,Then We done !,Thank you !,

Copyright © 2018-2021 Wenke99.com All rights reserved

工信部备案号浙ICP备20026746号-2  

公安局备案号:浙公网安备33038302330469号

本站为C2C交文档易平台,即用户上传的文档直接卖给下载用户,本站只是网络服务中间平台,所有原创文档下载所得归上传人所有,若您发现上传作品侵犯了您的权利,请立刻联系网站客服并提供证据,平台将在3个工作日内予以改正。