LanguageTheoryCombiantioricsandBioinfromatics.ppt

上传人:ga****84 文档编号:366239 上传时间:2018-09-27 格式:PPT 页数:52 大小:282KB
下载 相关 举报
LanguageTheoryCombiantioricsandBioinfromatics.ppt_第1页
第1页 / 共52页
LanguageTheoryCombiantioricsandBioinfromatics.ppt_第2页
第2页 / 共52页
LanguageTheoryCombiantioricsandBioinfromatics.ppt_第3页
第3页 / 共52页
LanguageTheoryCombiantioricsandBioinfromatics.ppt_第4页
第4页 / 共52页
LanguageTheoryCombiantioricsandBioinfromatics.ppt_第5页
第5页 / 共52页
点击查看更多>>
资源描述

1、Grammatical ComplexityOf Symbolic Sequences:A Brief Introducton,Bailin HaoT-Life Research Center, Fudan UniversityInstitute of Theoretical Physics, Academia SinicaThe Santa Fe Institute, New Mexico, USAhttp:/ Paradigms in TheoreticalDescription of Nature,Deterministic: based on periodicities and rec

2、urrences, from Kepler to Yang-MillsStochastic: based on randomness, from Brownian motion to MSR field theory of hydrodynamics and molecular motorsFractal, self-similar, scale invariant: from phase transitions and critical phenomena to chaotic dynamicsFiniteness is the unifying Physics: languages,语言学

3、(language 而非 philology)方法,统计语言学 “字”的频度和关联 Zipf 定律 代数语言学:生成语法和语法复杂性 串行生成:Chomsky体系 平行生成:Lindenmayer 体系(来自发育生物学) 可因式化语言(Factorizable language),自然语言与遗传语言,相似处:多义性 冗余度 容错和纠错 长程关联 均基于离散的排列组合系统有某些语法,但不能完全生成方言、个体差异性演化、突变、灭绝历史“垃圾”、古语、“化石”外来语、横向交换,相异处: 标点符号和间隔不同 两种语言的相互作用 二维、三维的相互作用 重复序列的数目和作用,An Observation,

4、u d c s b tcharge, mass, flavor, charm, p n e charge, mass, spin, magnetic momentum, H C N O P atomic number, ion radius, valence, affinity, H2O NO CO2 molecular weight, polarity, a c g t A D E F G H W Y VBRCA1 PDGF,A PROGRAMME:,Coarse-Grained Description of NatureUse of Symbols and Symbolic Strings

5、LanguageGrammar and Complexity (Chomsky, Lindenmayer, etc.)So far this programme has been best realized in the study of dynamics by using Symbolic Dynamics.There have been preliminary attempts in analyzing biological sequences.,It may not be a coincidence that the two systems in the universe that mo

6、st impress us with their open-ended complex design life and mind are based on discrete combinatorial systems. Many biologists believe that if inheritance were not discrete, evolution as we know it could not have taken place.S. Pinker, The Language Instinct (1995),Simple Examples,At the level of word

7、s: DOG GOD At sentence level: Dog bites Man Man bites Dog,N C EGF (Epidermal GF)N C Chymotrypsin (胰凝乳蛋白酶) N C Urokinase (UK) (尿激酶)N C Factor IX (凝血因子IX, X-mas抗血友病因子)N C Plasminogen (纤维蛋白融酶原) 几种丝氨酸蛋白酶的domain组合 B.Alberts 等,Mol.Biology of the Cell 第三版 1994. P.123,Ca 结合蛋白,含3个-s-s-,GC 语法复杂性 字母表 例1. = a,

8、c, g, t 例2. = A, C, D W, Y 例3. = a, z, A, Z, +, , 字母表中各种字母组成的一切字母串 (包括空串) * *的任何子集是基于的一种语言语法 = 字母表,初始字母,产生规则 基于该语法的语言,Classification of Formal Languages,Chomsky Hierarchy Sequential production rulesLindenmayer SystemsParallel production rules,Generative Grammar,S Sentence NP Noun PhraseVP Verb Phras

9、eAdj AdjectiveArt Article,S if S then SS either S or S,Non-Terminal and Terminal Symbols,N boy | girl | scientist | V sees | believes | loves | eats | Adj young | good | beautiful | Art a | one | the,S NP VPVP V NPNP (Art) Adj* N,Chomsky 语法层次 N 非终结字母集(工作用符号) T 终结字母集 S N 起始字母 P = 生成规则(x y)的集合 x, y 为字

10、母串 关于 x, y 的不同规定导致不同语法 语法 G = (N, T, P, S)0 类语法 x (NT)* N(NT)* y (NT)*,至少含有一个非终结字母,1 类语法 上下文有关语法 x = t1 a t2 t1, t2 T* a N 2 类语法 上下文无关语法 x = a N 3 类语法 正规语法 x = a y = b 或 bc a, c N b = 空 或 b T,形式语言的Chomsky层次,a,b,(i),(ii),A transfer function, (a, R) = bA Finite State Automaton(FSA),FSA: Finite State A

11、utomata,Deterministic FSANon-Deterministic SFAEquivalence of DFSA and NDFSA: subset constructionMinimal DFSAMyhill-Nerode theorem (1958): number of nodes in minDFSA,A Pushdown Automaton,Pushdown list StackFirst In Last Out (FILO),A Turing MachineAlan M. Turing (1912-1954),FSA + R/W tapeChurch-Turing

12、 Thesis (1936):Any effective (mechanical) computation canbe carried out by a Turing machine,Terminals = a, b, cNon-terminal = A, BSequential rules: B aBAc | abc bA bb cA Ac B abc B aBAc aabcAc aabAcc B abAc aaBAcAc aaBAAc aaabcAAc aaabAcAc aaabbAcc,Example: ai b ici | i0 CSL,Rules to Generate Gene-L

13、ike Sequences( according to David Searls ),gene upstream transcript downstreamtranscript 5-untranslated-region start-codon coding-region 3-untranslated-regioncoding-region codon coding-region | stop-codon | splice | coding regioncodon lys | asn | thr | met | glu | his | pro | asp | ala | gly | tyr |

14、 trp | phe | leu | ile | ser | arg | gln | val | cysstart-codon metstop-codon taa | tag | tga,leu tt purine | ct base (6)ser ag pyrimidine | tc base (6)arg ag purine | cg base (6)val gt base pro cc base (4)ala gc base gly gg base (4)thr ac base (4) ile at pyrimidine | ata (3)lys aa purine asn aa pyr

15、imidine (2)gln ca purine his ca pyrimidine (2)glu ga purine cys tg pyrimidine (2)phe tt pyrimidine tyr ta pyrimidine (2)asp ga pyrimidine (2)met atg trp tggbase m a | c | g | t purine a | gprimidine c | t,splice intron intron gt | intron-body | agsplice a a intron splice c c intronsplice t t intron

16、splice g g intron a splice intron a c splice intron c t splice intron t g splice intron gupstream enhancer promotor enhancer enhancer promotor silencer isolator ,These rules are capable to generate an unlimitedset of gene-like sequences, mostly biological nonsense. They may be used to recognize gene

17、-like segments in long DNA sequences.Syntax versus Semantics: texts vs. grammar. Physics behind this coarse-grained description:stereochemistry, interaction between proteins andDNA chains, metallic ions etc.,Symbolic Dynamics Languages,1991,1999,谢惠民定理和猜测,单峰映射揉序列对应的语言中的正规语言只有周期序列和最终周期序列两种如何走向比正规语言高一级

18、的上下文无关语言?猜测:单峰映射揉序列对应的语言中没有上下文无关语言,Subintervals determined by the periodic kneadingSequence (RLRRC),Order of visits in the periodic kneadingSequence (RLLRC),Transformations of subintervals,a c + d (on reading L)b d (on reading R)c b + c (on reading R)d a (on reading R),Transfer Functions,Stefan matr

19、ix for 256P in Feigenbaum cascade,Stefan matrix for F13=233; Case (a),Stefan matrix for F13=233. Case (b),Stefan matrix for F13=233. Case (c),Stefan matrix for F13=233. Case (d),Symbolic Dynamics Languages,1991,1999,br bl ar al albr blar Alphabet: S = ar, al, br, bl Production rules: Initial symbol

20、(axiom) = ar Grammar: G = (S, P, ) Language: L (G) S*,Development of Anabaena catenula (串珠藻项圈藻属),br ar ar albr bl al al blar,P =,Lindenmayer SystemsParallel production rules. Finer classification D0L Deterministic, no interaction, i.e., context-free0L non-deterministic, no interaction IL non-determi

21、nistic, with Interaction, i.e., context sensitiveT0L with Table of production rulesTIL E0L Extended to non-terminal symbolsET0L EIL REL of Chomsky,RGL Regular CFL Context-FreeCSL Context-Sensitive REL Recursively Enumerable,CSL,CFL,RGL,FIN,DOL,REL,Chomsky LindenmayerIndexed,0:REL,1:CSL,IND,ET0L,E0L,

22、2:CFL,3:RGL,IL,T0L,0L,D0L,EIL,L = aibici | i 0 CSLG = (S, T, ) = abc S = a, b, c T = t1, t2 T1 = a aa, b bb, c cc T2 = a , b , c T0L,Example a la Lindenmayer,Dyck language: A language of nested parentheses,Many types of parenthesesFinite depth of nestingContext-free languageOur case:Only 3 types of

23、parenthesesShallow nestingConjecture (Xie): may be regular language,模糊语言学 形式推广不难:Z .G .Yu (喻祖国2001) 如何定量地引用生物知识 Consensus 序列和权重矩阵随机语法 隐马可夫链 = 随机正规语法 更高阶的随机语法?,Factorizable Languages,Symbolic dynamics leads to factorizable languagesA complete genome defines a factorizable langaugeAn amino acid sequen

24、ce with unique reconstruction (at certain K) defines a factorizable language,Modeling in Biology,CellsTissuesOrgans“Systems”: circulation, respiration, reproduction, neural, sensory, musclular, etc.Organisms, population, ecosystemsAnimals versus plantsPlant development, morphology, physiology and pa

25、thology,Modeling of Plant MorphologyBy using L-System,P. Prusinkiewicz, J. Hanan, Lindenmayer Systems, Fractals, and Plants, LN in Biomath., vol. 79, Springer, 1989P. Prusinkiewicz, A. Lindenmayer, The Algorithmic Beauty of Plants, Springer, 1990P. Prusinkiewicz, M. Hammel, J. Hanan, R. Mech, Visual

26、 models of plant development, Chap.9 in Handbook of Formal Languages, Vol.3, Springer, 1997,Consistency of Macro-and Micro-Description of Nature,Molecular phylogeny versus phylogeny based on morphological featuresModeling plant development without getting into molecular and cellular descriptionNo ne

27、ed to model protein folding by invoking quarks!,Some Useful URLs,www.grogra.org (Growth Grammar)http:/putableplant.orghttp:/algorithmicbotany.org,Huimin Xie 谢惠民 Grammatical Complexity and 1D dynamical Systems Vol.6 in Directions in Chaos WSPC, 1996.谢惠民 复杂性与动力系统 上海科技教育出版社, 1994Bailin Hao, Weimou Zheng, Applied Symbolic Dynamics and Chaos (WSPC, 1998), Chap. 8J.Hopcroft, J.Ullman, Introduction to Automata Theory, Languages and Computation, Addison-Wesley, 1979.,Thanks!,

展开阅读全文
相关资源
相关搜索

当前位置:首页 > 学术论文资料库 > 毕业论文

Copyright © 2018-2021 Wenke99.com All rights reserved

工信部备案号浙ICP备20026746号-2  

公安局备案号:浙公网安备33038302330469号

本站为C2C交文档易平台,即用户上传的文档直接卖给下载用户,本站只是网络服务中间平台,所有原创文档下载所得归上传人所有,若您发现上传作品侵犯了您的权利,请立刻联系网站客服并提供证据,平台将在3个工作日内予以改正。