Undergraduate Graduation Project (Thesis): Foreign-Language Literature Translation

Original Text

A Conceptual Density-Based Approach for the Disambiguation of Toponyms

Nowadays, a huge quantity of information is stored in digital format. A great portion of this information is constituted by textual and unstructured documents, where geographical references are usually given by means of place names. A common problem with textual information retrieval is represented by polysemous words, that is, words that can have more than one sense. This problem is also present in the geographical domain: place names may refer to different locations in the world. In this paper we investigate the use of our word sense disambiguation technique in the geographical domain, with the aim of resolving ambiguous
place names. Our technique is based on WordNet conceptual density. Due to the lack of a reference corpus tagged with WordNet senses, we carried out the experiments over a set of 1,210 place names extracted from the SemCor corpus, which we named GeoSemCor and made publicly available. We compared our method with the most-frequent baseline and the enhanced Lesk method, which previously has not been tested in large contexts. The results show that a better precision can be achieved by using a small context (phrase level), whereas a greater coverage can be obtained by using large contexts (document level). The proposed method should be tested with other corpora, due to the fact that our experiments evidenced the excessive bias
towards the most frequent sense of the GeoSemCor.

Keywords: word sense disambiguation; toponym resolution; conceptual density; spatial indexing

1 Introduction

A great portion of the information currently available in digital format is constituted by textual and unstructured documents. The continuous growth of this kind of information and the increasing
number of users that can access it constitute a challenge to the developers of information retrieval (IR) systems. One of the most challenging problems is the ambiguity of human language. When searching for specific keywords, it is desirable to eliminate occurrences in documents where the word or words are used in an inappropriate sense (Ide and Veronis 1998). Ambiguity can be of various types: proper names may identify different classes of named entities (for instance, 'London' may identify the writer 'Jack London' or a city in the UK), or may be used as a name for different instances of the same class; e.g. 'London' is also a city in Canada.

The task of assigning the most appropriate sense to a word within its context is named word sense disambiguation (WSD). Notably, this is still an open problem in the field of natural language processing (NLP). Many approaches have been developed and evaluated (at the Senseval and SemEval competitions), but no single dominant method has emerged. Usually, WSD approaches are categorized into corpus-based and knowledge-based. The former use annotated data to train a model, which is used later to carry out the disambiguation process. The latter are based on the use of external resources such as ontologies, thesauri, or dictionaries. On the one hand, corpus-based methods give better results, but they are limited by the lack of annotated corpora; on the other hand, knowledge-based methods do not need training data, but there are often limitations on the cases in which they can be used, resulting in lower coverage and precision (Snyder and Palmer 2004). Previous work demonstrates that WSD is useful for IR only in the case of improving precision (Sanderson 1996, Gonzalo et al. 1998, Rosso et al. 2004), or if it is used in a restricted domain (Paliouras et al. 1998, Steffen et al. 2004).

Our previous experiences at GeoCLEF (Buscaldi et al. 2006b, c) drew our attention to the problem of the ambiguity of place names (toponyms). In this paper we study the application of a knowledge-based WSD method in the geographical domain, specifically to the disambiguation of toponyms. The method we propose is based on the one (Rosso et al. 2003) we developed for the disambiguation of nouns, which implemented a variation of the conceptual density formula by Agirre and Rigau (1996). We used WordNet (Miller 1995) as an external knowledge resource.

Toponym disambiguation is a relatively new field. From an NLP perspective, it is merely the application of WSD to place names. Its most direct application should be the improvement of searches both on the web and in large news collections, because it is very common to find geographical information in web pages or news stories (e.g. 'elections in Italy', 'plane crash in Teheran'). A growing interest in the field of geographical information retrieval (GIR) is testified by the recent creation of the GeoCLEF exercise and the increase in the
attendance at the GIR workshops held at the last SIGIR events. The lack of a reference corpus has long been an obstacle to the evaluation of algorithms for toponym resolution (Leidner 2004). Recently, some corpora have been compiled (Garbin and Mani 2005, Leidner 2006), but the lack of a mapping between WordNet and the location IDs used in these corpora prevented us from evaluating our method with these resources. We overcame this problem by selecting the geographical entities in the SemCor corpus, which was originally developed for the WSD task.

In the following section, we will give an overview of the previous efforts in the field of toponym disambiguation. In section 3, we will provide a brief description of the WordNet ontology. In section 4, we will describe our WSD method. In section 5, we will summarize the experiments carried out and the systems we compared our method to, together with a description of the corpus we built. Finally, we will discuss the results obtained.

2 Previous work on toponym resolution

Toponym resolution can be defined as the task of assigning an ambiguous place name to the actual location that it represents in a given context. For instance, the word 'Cambridge' is ambiguous. It could be used to represent one of the following locations (according to WordNet):

(i) Cambridge (a city in eastern England on the River Cam; site of Cambridge University);
(ii) Cambridge (a city in Massachusetts just north of Boston; site of Harvard University and the Massachusetts Institute of Technology).

As in the generic WSD task, the clues that can be used to disambiguate the word are found in the context. For instance, the presence of 'Boston' in the context may be a hint that the correct sense of 'Cambridge' is the second one.

Existing methods for the disambiguation of toponyms may be subdivided into three categories:

(i) map-based: methods that use an explicit representation of places on a map;
(ii) knowledge-based: they exploit external knowledge sources such as gazetteers, Wikipedia or ontologies;
(iii) data-driven or supervised: based on standard machine learning techniques.

Among the first ones, Smith and Crane (2001) proposed a method for toponym resolution based on the geographical coordinates of places: the locations in the context are arranged in a map, weighted by the number of times they appear. Then, a centroid of this map is calculated and compared with the actual locations related to the ambiguous toponym. The location closest to the 'context map' centroid is selected as the right one. They reported precisions of between 74% and 93% (depending on test configuration), where precision is calculated as the number of correctly disambiguated toponyms divided by the number of toponyms in the test collection. The GIPSY subsystem by Woodruff and Plaunt (1994) is also based on spatial coordinates, although in this case they are used to build polygons. Woodruff and Plaunt (1994) reported issues with noise and runtime problems.

The methods of Olligschlaeger and Hauptmann (1999) and Rauch et al. (2003) are based on evidence collected from a variety of sources, especially gazetteers. The information collected in order to disambiguate the place names may vary from population data (references to populous places are more frequent than those to the less populated ones) to the presence of postal addresses. Olligschlaeger and Hauptmann (1999) reported a precision of 75% for their rule-based method. Overell et al. (2006) presented a method based on Wikipedia, which takes advantage of some of its features, such as the article templates, categories and referents (links to other articles in Wikipedia).

A naive Bayes classifier is used by Smith and Mann (2003) to classify place names with respect to the US states or foreign countries. They reported precisions between 21.8% and 87.4%, depending on the test collection used. Garbin and Mani (2005) used a rule-based classifier, obtaining precisions between 65.3% and 88.4%, also depending on the test corpus. The weakness of supervised methods is the need for a large quantity of training data in order to obtain a high precision. Moreover, the inability to classify unseen toponyms is also a major problem that affects this class of methods.

3 The WordNet ontology

WordNet is a complex lexical database of English, developed at Princeton University under the direction of G. Miller (Miller 1995). Its latest version (3.0) contains 155,327 words grouped into 117,597 synsets. A synset ('set of synonyms') is a group of words that are considered semantically equivalent. An example of a synset for a geographical location is the following: {London, Greater London, British capital, capital of the United Kingdom}. Each synset is associated
with a unique ID and a gloss, i.e. the definition of the concept (in the case of London: 'the capital and largest city of England; located on the Thames in southeastern England; financial, industrial and cultural center'). Moreover, the most important feature of WordNet is that it also provides a set of semantic relationships which connect different synsets. In Figure 1,
we show a portion of WordNet surrounding the 'London' synset. In the example some important semantic relationships are visible, e.g. the hypernymy or 'is-a' relationship. This relationship connects two concepts where one is more general than the other, such as 'clock' and 'cuckoo clock'. The inverse relationship (from a more specific concept to a more general one) is called hyponymy, i.e. 'cuckoo clock' is a hyponym of 'clock'. The meronymy, or 'part-of', relationship connects concepts that are a part of the other and vice versa (in the latter case it is named holonymy). In the example of Figure 1, 'England' is a holonym of 'London'. Finally, the instance relationship connects abstract concepts to real-world instances, such as 'clock' and 'Big Ben'. Most
relationships connect words of the same lexical category, also known as part-of-speech (POS) category, such as those named here, which connect only noun concepts.

WordNet has been widely used in NLP, mainly because of its role as a sense inventory. It was also employed to semantically annotate the Brown corpus (Kucera and Francis 1967), obtaining the SemCor (Semantic Correspondence) corpus (Landes et al. 1998). In SemCor every word belonging to the noun, verb, adjective and adverb POS categories has been labeled with a WordNet sense. It is often used as a training corpus for supervised word sense disambiguation methods.

4 Conceptual density-based word sense disambiguation

Conceptual density (CD) was introduced by Agirre and Rigau (1996) as a measure of the correlation between the sense of a given word and its context. It is computed on WordNet subhierarchies, determined by the hypernymy relationship. The disambiguation algorithm by means of CD consists of the following steps:

(i) select the next ambiguous word w, with |w| senses;
(ii) select the context c_w, i.e. a sequence of words, for w;
(iii) build |w| subhierarchies, one for each sense of w;
(iv) for each sense s of w, calculate CD(s);
(v) assign to w the sense which maximizes CD(s).

In the CD formula, M is the number of relevant synsets in the subhierarchy, n is the total number of synsets in the subhierarchy, and f is the rank of frequency of the word sense related to the subhierarchy (e.g. 1 for the most frequent sense, 2 for the second one, etc.). The inclusion of the frequency rank means that less frequent senses are selected only when M/n ≥ 1. The relevant synsets are both the synsets of the word to disambiguate and those of the context words. Our formulation allows solving some problems with the original CD due to the higher granularity of newer WordNet versions.

The WSD system based on this formula obtained 81.5% in precision over the nouns in SemCor (baseline: 75.5%, calculated by assigning to each noun its most frequent sense), and participated in the Senseval-3 competition as the CIAOSENSO system (Buscaldi et al. 2004), obtaining 75.3% in precision over nouns in the all-words task (baseline: 70.1%). These results were obtained with a context window of only two nouns, the one preceding and the one following the word to disambiguate.

When we considered adapting this algorithm to the disambiguation of toponyms, we realized that the hypernymy relationship was not suitable. For instance, Cambridge(1) and Cambridge(2) are both instances of the 'city' concept and therefore they share the same hypernym. The result is that the subhierarchies are composed only of the synsets of the two senses of 'Cambridge', and they are left undisambiguated because their density is the same (in both cases it is 1). Our idea is to consider the holonymy relationship instead of hypernymy. With this relationship it is possible to create subhierarchies that allow discerning different locations having the
same name in a more effective way.

5 Experiments

The holonym-based CD disambiguator described in the previous section was tested over a collection of 1,210 toponyms. Its results were compared with the most-frequent (MF) baseline, obtained by assigning to each toponym its most frequent sense, and with another WordNet-based method which uses its glosses, and those of its context words, to disambiguate it. We were not able to compare our method with any map-based method, principally because WordNet does not provide coordinates of the geographical entities. Some efforts for the integration of WordNet with geographical gazetteers have been undertaken (Buscaldi et al. 2006a), but a ready-to-use mapping still does not exist. Neither did we carry out a comparison with a corpus-based method, because of the small amount of data contained in the collection.

Source: Davide Buscaldi and Paolo Rosso, International Journal of Geographical Information Science, Mar 2008, Vol. 22 Issue 3, pp. 301-313, 13 p., 2 diagrams, 6 charts.

Translation

A Conceptual Density-Based Approach for Disambiguating Place Names

Nowadays, a huge amount of information is stored in digital format. A large part of this information consists of textual and
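unstructured documents, in which geographical references are usually given by means of place names. A common problem in textual information retrieval is posed by polysemous words, that is, words that can have more than one sense. This problem also exists in the geographical domain: place names may refer to different locations in the world. This paper investigates the use of our word sense disambiguation technique in the geographical domain, with the aim of resolving ambiguous place names. Our technique is based on WordNet conceptual density. Because no reference corpus tagged with WordNet senses is available, we carried out the experiments over a set of 1,210 place names extracted from the SemCor corpus, which we named GeoSemCor and made publicly available. We compared our method with the most-frequent baseline and the enhanced Lesk method, which had not previously been tested in large contexts. The results show that higher precision can be achieved with a small context (phrase level), whereas greater coverage can be obtained with large contexts (document level). The proposed method should be tested with other corpora, because our experiments revealed an excessive bias towards the most frequent sense in GeoSemCor.

Keywords: word sense disambiguation; toponym resolution; conceptual density; spatial indexing

1 Introduction

A large part of the information currently available in digital format consists of textual and unstructured documents. The continuous growth of this kind of information, and the increasing number of users who can access it, pose a challenge to the developers of information retrieval (IR) systems. One of the most challenging problems is the ambiguity of human language. When searching for specific keywords, it is desirable to eliminate occurrences in documents where the word or words are used in an inappropriate sense (Ide and Veronis 1998). Ambiguity can be of various types: proper names may identify different classes of named entities (for example, 'London' may identify the writer 'Jack London' or a city in the UK), or may serve as a name for different instances of the same class; for example, 'London' is also a city in Canada.

The task of assigning the most appropriate sense to a word within its context is called word sense disambiguation (WSD). Notably, this is still an open problem in the field of natural language processing (NLP). Many approaches have been developed and evaluated (at the Senseval and SemEval competitions), but no single dominant method has emerged. WSD approaches are usually divided into corpus-based and knowledge-based. The former use annotated data to train a model, which is later used to carry out the disambiguation. The latter rely on external resources such as ontologies, thesauri or dictionaries. On the one hand, corpus-based methods give better results, but they are limited by the lack of annotated corpora; on the other hand, knowledge-based methods do not need training data, but there are often limits on the cases in which they can be used, which reduces coverage and precision (Snyder and Palmer 2004). Previous work shows that WSD is useful for IR only in the case of improving precision (Sanderson 1996, Gonzalo et al. 1998, Rosso et al. 2004), or if it is used in a restricted domain (Paliouras et al. 1998, Steffen et al. 2004).

Our previous experience at GeoCLEF (Buscaldi et al. 2006b, c) drew our attention to the problem of the ambiguity of place names (toponyms). In this paper we study the application of a knowledge-based WSD method in the geographical domain, specifically to the disambiguation of toponyms. The method we propose is based on the one (Rosso et al. 2003) that we developed for the disambiguation of nouns, which implemented a variation of the conceptual density formula of Agirre and Rigau (1996). We used WordNet (Miller 1995) as an external knowledge resource.

Toponym disambiguation is a relatively new field. From an NLP perspective, it is merely the application of WSD to place names. Its most direct application should be the improvement of searches, both on the web and in large news collections, because geographical information is very common in web pages and news stories (for example, 'elections in Italy' or 'plane crash in Teheran'). The growing interest in the field of geographical information retrieval (GIR) is shown by the recent creation of the GeoCLEF exercise and the increased attendance at the GIR workshops held at the latest SIGIR events. The lack of a reference corpus has long been an obstacle to the evaluation of toponym resolution algorithms (Leidner 2004). Recently some corpora have been compiled (Garbin and Mani 2005, Leidner 2006), but the lack of a mapping between WordNet and the location IDs used in these corpora prevented us from evaluating our method with these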
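resources. We overcame this problem by selecting the geographical entities in the SemCor corpus, which was originally developed for the WSD task.

In the next section, we give an overview of earlier efforts in the field of toponym disambiguation. In section 3, we briefly describe the WordNet ontology. In section 4, we describe our WSD method. In section 5, we summarize the experiments and the systems we compared our method with, together with a description of the corpus we built. Finally, we discuss the results obtained.

2 Previous work on toponym resolution

Toponym resolution can be defined as the task of assigning an ambiguous place name to the actual location it represents in a given context. For example, the word 'Cambridge' is ambiguous. It can be used to represent one of the following locations (according to WordNet):

(i) Cambridge (a city in eastern England on the River Cam; site of Cambridge University);
(ii) Cambridge (a city in Massachusetts just north of Boston; site of Harvard University and the Massachusetts Institute of Technology).

As in the generic WSD task, the clues that can be used to disambiguate the word are found in the context. For example, the presence of 'Boston' in the context may be a hint that the correct sense of 'Cambridge' is the second one.

Existing methods for toponym disambiguation can be divided into three categories:

(i) map-based: methods that use an explicit representation of places on a map;
(ii) knowledge-based: they exploit external knowledge sources such as gazetteers, Wikipedia or ontologies;
(iii) data-driven or supervised: based on standard machine learning techniques.

Among the first, Smith and Crane (2001) proposed a method for toponym resolution based on the geographical coordinates of places: the locations in the context are arranged on a map, weighted by the number of times they appear. The centroid of this map is then calculated and compared with the actual locations associated with the ambiguous toponym.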
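The map-based idea attributed here to Smith and Crane (2001) can be illustrated with a minimal Python sketch. It is not their implementation: the function names `weighted_centroid` and `resolve_toponym`, the approximate coordinates, and the use of plain Euclidean distance on latitude/longitude degrees (rather than a great-circle distance) are all assumptions made for illustration.

```python
from collections import Counter
from math import hypot


def weighted_centroid(mentions):
    """Return the (lat, lon) centroid of the context locations.

    `mentions` holds one (lat, lon) pair per occurrence in the text, so a
    place mentioned twice counts twice (frequency weighting).
    The list must not be empty.
    """
    counts = Counter(mentions)
    total = sum(counts.values())
    lat = sum(point[0] * n for point, n in counts.items()) / total
    lon = sum(point[1] * n for point, n in counts.items()) / total
    return lat, lon


def resolve_toponym(candidates, mentions):
    """Pick the candidate referent closest to the context centroid.

    Assumption: Euclidean distance on raw degrees, a simplification.
    """
    c_lat, c_lon = weighted_centroid(mentions)
    return min(
        candidates,
        key=lambda name: hypot(candidates[name][0] - c_lat,
                               candidates[name][1] - c_lon),
    )


# Illustrative, approximate coordinates (not taken from the paper).
BOSTON = (42.36, -71.06)
NEW_YORK = (40.71, -74.01)
candidates = {
    "Cambridge (England)": (52.21, 0.12),
    "Cambridge (Massachusetts)": (42.37, -71.11),
}
best = resolve_toponym(candidates, [BOSTON, BOSTON, NEW_YORK])
print(best)
```

With a context that mentions Boston twice and New York once, the centroid falls in the north-eastern United States, so the Massachusetts referent is the nearer candidate.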
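The location closest to the 'context map' centroid is selected as the correct one. They reported precisions of between 74% and 93% (depending on the test configuration), where precision is calculated as the number of correctly disambiguated toponyms divided by the number of toponyms in the test collection. The GIPSY subsystem of Woodruff and Plaunt (1994) is also based on spatial coordinates, although in this case they are used to build polygons. Woodruff and Plaunt (1994) reported issues with noise and runtime.

The methods of Olligschlaeger and Hauptmann (1999) and Rauch et al. (2003) are based on evidence collected from a variety of sources, especially gazetteers. The information collected to disambiguate the place names may range from population data (references to populous places are more frequent than those to less populated ones) to the presence of postal addresses. Olligschlaeger and Hauptmann (1999) reported a precision of 75% for their rule-based method. Overell et al. (2006) presented a method based on Wikipedia, which takes advantage of some of its features, such as article templates, categories and referents (links to other articles in Wikipedia).

A naive Bayes classifier was used by Smith and Mann (2003) to classify place names with respect to US states or foreign countries. They reported precisions between 21.8% and 87.4%, depending on the test collection used. Garbin and Mani (2005) used a rule-based classifier, obtaining precisions between 65.3% and 88.4%, also depending on the test corpus. The weakness of supervised methods is the need for a large quantity of training data in order to obtain high precision. Moreover, the inability to classify unseen toponyms is also a major problem that affects this class of methods.

3 The WordNet ontology

WordNet is a complex lexical database of English, developed at Princeton University under the direction of G. Miller (Miller 1995). Its latest version (3.0) contains 155,327 words grouped into 117,597 synsets. A synset ('set of synonyms') is a group of words that are considered semantically equivalent. An example of a synset for a geographical location is the following: {London, Greater London, British capital, capital of the United Kingdom}. Each synset is associated with a unique ID and a gloss, i.e. the definition of the concept (in the case of London: 'the capital and largest city of England; located on the Thames in southeastern England; financial, industrial and cultural center'). Moreover, the most important feature of WordNet is that it also provides a set of semantic relationships which connect different synsets. In Figure 1, we show a portion of WordNet surrounding the 'London' synset. In this example some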