1、XML 代价估计方法研究综述,XML Group,Outline,研究代价估计的必要性xml中代价估计研究的难点背景知识介绍现有方法分类,研究代价估计的必要性,查询优化的基础不同分支对查询目标的选择率不同,如果选择性低的分支先于选择性高的分支执行,就会有效的减少中间结果从而提供查询效率结构连接在XML中是一个很重要的操作,连接的顺序的选择需要计算不同连接顺序的代价及早给用户提供反馈信息,xml中代价估计研究的难点,XML数据的不规则性是对传统统计信息方法的重要挑战,具体表现在以下几个方面:路径依赖性,如同为name结点,在person下和在city下出现意义就完全不同结构相关性嵌套在不同祖先下
2、面的同类结点的个数差别可能很大,如book结点下的author个数是不确定的值相关性/purchase/Itemtype=bookprice100XML的有序性制约了转换规则的灵活性所有这些问题,都使得在xml中采用传统的代价估计方法不切实际,会带来很大的误差。针对xml数据的特点,我们应该寻求一种新的代价估计方法。,Background Knowledge,XML Data ModelA large, node-labeled tree T(V,E),Example XML Document Tree,Background Knowledge,XML Data ModelA large, n
3、ode-labeled tree T(V,E),Example XML Document,Background Knowledge,XML Query ModelA node-labeled query tree TQEach node of TQ is labeled with a variable name qi in Q Each edge (qi,qj) of TQ is annotated with an XPath expression path (qi,qj) that describes the specific structural constraints specified
4、 in Q,Query Fragment: for $b in doc(“bib.xml”)/bib/book$a in $b/author $t in $b/title,现有估计方法的分类,路径选择性计算Twig匹配个数统计值谓词选择率估计结构连接代价估计,Overview,data tree,different summarization methods based on:,path,all m-length path,F/B-stability,LoreMcHugh et al, VLDB99,pruning,PathTree, Markov TableAboulnaga et al,
5、VLDB01,XSketchPolyzotis et al, VLDB02,type,XSketchesPolyzotis et al, SIGMOD02,+ value,+ twig,XSketchesPolyzotis et al, SIGMOD04,Region Code,2D Model,1D Model,StatiXFreire et al, SIGMOD02,PH HistogramWu Y EDBT 2002,Interva&Position ModelH.Lu SIGMOD 2003,路径选择性计算,问题描述 /a/bc/d/eg/e/f/*/*/e/f,Path Tree &
6、 Markov Table,Ashraf Aboulnaga, Alaa R. Alameldeen, and Jeffry F.Naughton. Estimating the selectivity of XML path expressions for Internet scale applications. VLDB 2001,Path Tree,ConstructionStart from an original path treeCount of nodes in absolute pathsAggregate the tree to fit in the limited spac
7、eSibling-*EstimationMatch the query against the path tree, matching * as the last resort.Need to decide whether to use total count or average count.,An XML document and its path tree,Estimation:count(A/B/D)=1 count(A/C/D)=1,Aggregate Path Tree,Aggregate Path Tree,A,1,H,6,E,5,D,7,J,4,I,2,红色结点表示那些不频繁出
8、现的结点,需要删除,从而保证path树能够放入内存,Sibling-* Summarization,A,1,C,9,*,*,K,*,f=23n=2,6,8,3,把查询和path 树匹配需要决定用总数还是平均数,估计方法,Estimation:count(/C/H/K)=11 count(/C/*/K)=23,Markov Table,构造一个表,存储长度=2。当查询路径长度=m时,可以直接从表中读出结果,否则,用公式计算。例如,m3,查询为/A/B/C/D,则当表不能装入内存的时候,进行剪枝操作。,Markov Table : Example,a,b,b,c,d,d,e,e,e,f,d,a,b
9、,d,d,e,1,2,2,1,3,c,f,1,1,m=2,suffix-*,Q1: a/c/e,Twig 匹配个数统计,问题描述Twig Query (basic building block of declarative query languages for XML),FOR $f IN document(“person.xml”)/department/faculty FOR $r in $f/RA, FOR $t in $f/TA,Department,Faculty,RA,TA,$f,$r,$t,XSketch方法,N.Polyzotis, M.Garofalakis. Statis
10、tical Synopses for Graph-Structured XML Databases, In Proc.of ACM SIGMOD Conf.2002 N. Polyzotis and M. Garofalakis. Structure and value synopses for xml data graphs. In Proceedings of 28th VLDB Conf., 2002.N.Polyzotis, M.Garofalakis, and Yannis Iosnnidis. Selectivity Estimation for XML Twigs. In Pro
11、c.of the 20th Intl.Conf. on Data Engineering,2004N. Polyzotis and M. Garofalakis and Y.Ioannidis. Approximate XML Query Answers. In Proc .ACM SIGMOD 2004,XSketch方法,B/F stability,Definition: Let U,V be sets of elements in an XML data Tree T. We say that V is backward-stable(B-stale) with respect to U
12、, iff for each vV there exists a u U such that edge (u,v) is in T.Similarly,U is said to be forward-stable(F-stable) with respect to V,iff for each u U there exists a v V such that the edge(u,v) is in G.,XSketch方法,How to estimate the cardinality of xpath query such as A/P/K or A/p ?,Utilizing edge s
13、tability,We can give an accurate match number:Count(A/P/K)=6Count(A/p)=3,But what about A/P/T which doesnt conform to backward stability?,2 assumptions Path independence Backward-edge uniformityCount(A/P/T)=f(A/P/T)*count(T)f(A/P/T)=f(A/P)*f(P/T|A/P)=f(P/T|A/P) =f(P/T)count(P)/(count(B)+count(P),XSk
14、etch方法(处理值谓词),v1,v3,v5,v2,v4,B,B,F,F,STN(v5),Dep(v5)=v1,v4,v1,v4,Histogram at v5,H(1, 4)= count(v11v2/v3v44/v5),v3 v5 count(v11v2/v33v44/v5)=H(1, 4)*count(v33)/count(v3),A1Edge-Value IndependenceAcross Non-Satble Edgesf(u/v|v) f(u/v)A2Value-Independence Outside Correlation Scope,XSkecth方法(处理twig),Pr
15、oblem with the former modelLet us see a simple example,for t0 in A, t1 in t0/B , t2 in t0/C,(a),(b),(c) Structural XSketch,2000 binding tuples on the first document10100 tuples on the second document,XSkecth方法(处理twig),Singe-path XSketch model dose not store detailed enough information to capture the
16、 correlation patterns between multiple path expressions.If node A records a two-dimensional distribution fA(b,c) for the fraction of elements in A that have b children in B and c children in C.,XSkecth方法(处理twig),CK,CY,Elements,fp,2,1,p4,0.25,P5,4(0.2521100.25110.51)5,CP,CN,2,1,2,1,0.25,2,1,1,1,0.5,1
17、,1,p8,p9,Another Example,Edge-distribution: fP(CY,CK,CP,CN),P,Y,P,K,A,P,A,N,Statix方法,Freire J, Jayant R, Ramanath M, Roy P and Simeon J. StatiX: Making XML Count. In: Proc. of ACM SIGMOD 2002, USA.,Statix方法,Constructionevery instance node is assigned a unique id(type id+sequential node id ) during p
18、arsing and validation, meanwhile gathering statisticsConstructing the histogramsHistogramBuild histogram for each typeMight contain histogram describing the distribution of the parents of a typeValue histogram can be built for types that are defined in terms of base types (eg. integer)EstimationConv
19、ert the query into SQL (i.e., relational join on IDs)Use standard histogram multiplication,StatiX : Example,FOR $s in document (“imdb.xml”)/show, $r in $s/reviewWHERE $s/year 1992RETURN $s/title, $r,SELECT title, reviewFROM Show, ReviewWHERE Show.year 1992 AND Show.ID = Review.ParID,STAT Show CARD:
20、5 ID: 16 STAT Show_Year VALUE: 19902001 BUCKET_NUM: 2 BUCKETS: 1990, 1994): 3; 1994, 2001): 2STAT Review CARD: 8 ID: 3038 PARENT HISTOGRAM Show BUCKET_NUM: 3 BUCKETS: 1, 2): 4; 2, 3): 1; 3, 5): 3 ,STAT Aka CARD:6 ID:6-12 PARENT HISTOGRAM Show_Episode BUCKET_NUM: 3 BUCKETS: 1, 2): 1; 2, 6): 0; 6, 12)
21、: 7 ,31/28/5=2.4,Summary,结构连接代价估计,问题描述给定任意的祖先节点集合A和后代节点集合D,计算A/D结果集合的大小,估计匹配的祖先后代的结果个数 当同一祖先有多个后代,或者同一后代有多个祖先时,节点对个数可能远远大于祖先或后代的个数。应用连接顺序的选择,结构连接代价估计,Existed work2D model: pH histogram Wu Y, Patel J and Jagadish H. V. Estimating Answer Sizes for XML Queries. In: Proc. of the EDBT 2002. Interval and
22、Positional model W.Wang, H.Jiang, H.Lu, and J.F.Xu. Containment join Size Estimation:Models and Methods. In:Proc. Of ACM SIGMOD 2003 ,Absolute Region Code (start, end), blah 1234 5678 0000 ,contact,name,phone,blah,office,home,mobile,1234,5678,0000,(1,16),(2,4),(3,3),(5,15),(6,8),(7,7),(9,11),(10,10)
23、,(12,14),(13,13),2,3,4,a.start d.start and d.end a.end,pH Histogram,Forbidden Regions For a node,R1和R2是结点A的ForbiddenRegion 两个结点的(start,end)满足:(1)no overlap (2)nested 不可能有部分重叠的(partiallyoverlap)情况,pH Histogram,EstPA=HistP1A1/4HistP2A+HistP2B+HistP2C+HistP2E+1/2(HistP2D+HistP2F),Ancestor-based Join Es
24、timation,EstPA=HistP2A1/4HistP1A+HistP1F+HistP2G+HistP1H,Descendant-based Join Estimation,pH Histogram,est = 3*(12+1+2+0.25*5) = 48.75,start,end,start,start,end,end,For off-diagonal cells:,pH for A,pH for D,Interval & Position Models,Map into new problems under 2 newly proposed models.Interval model
25、 and Position modelHistogram based method that is adaptive to the data: PL histogramAssumption: 1D uniform of the descendant setEstimation is robust when correlation is lowSampling based method that captures the correlation: IM Sampling and PM SamplingAssumption: Height of the XML data tree is a sma
26、ll constantBoth have theoretical bounds on the accuracy of the estimation,Interval Model,Interval ModelMap elements to 1D line segments (points)IMA(A): interval setIMD(D): point set|A SJ D|=| interval-point pairs|,interval set,point set,Position Model,Position ModelUse frequency tables to record cov
27、erage and starting point informationPMA(S): coverage tablePMD(S): start table|A SJ D|=Containment joinEquijoin,coverage table,start table,Point-Line (PL) -Histogram,Partition the pace into partitionsEstimate the overlapping (interval, point) pairs within each bucket.AssumptionsData distribution of A
28、 and D are independentD conforms to uniform distribution,Sampling in Interval Model,Adapted from bifocal samplingIM Algorithm:Choose m samples from IMD(D)Sum up the m subjoin sizes.X = Scale up the result by |IMD(D)|/mStatistical GuaranteeE(X) = X (join size)with high probability(by Hoeffding bound),2,1,3,6,m=2,Problem,How to handle parent-child relationship?How to count |A/B/C|How to use estimation for Query Optimization,Question ?,Then We done !,Thank you !,