Digital Curation for the Big Data Sciences
Zhang Zhixiong (张智雄), National Science Library, Chinese Academy of Sciences

Outline
  The rise of digital curation
  What is digital curation?
  How do digital curation and preservation differ?
  Digital curation challenges, problems and responses raised by big data science
  Conclusion

1. The Rise of Digital Curation

Data Deluge

From Data Deluge to Data Curation
  Philip Lord, Alison Macdonald, Liz Lyon, David Giaretta
  The Digital Archiving Consultancy Limited and the Digital Curation Centre

Founding of the Digital Curation Centre (DCC)
  With the support of the e-Science Core Programme, the DCC was founded on 1 March 2004
  Headquartered at the National e-Science Centre in Edinburgh
  University of Edinburgh (lead; Informatics, Law, Information Services and research institutes)
  University of Glasgow (HATII and Information Services)
  UKOLN, University of Bath
  Council for the Central Laboratory of the Research Councils (CCLRC)

Conferences and journals
  International Digital Curation Conference, Bath, 29-30 September 2005
  8th International Digital Curation Conference, Amsterdam, 14-17 January 2013
  DigCCurr 2007, DigCCurr 2009, DigCCurr 2013 (Chapel Hill, North Carolina, United States)
    An International Symposium on Digital Curation (April 18-20, 2007)
    Digital Curation Practice, Promise and Prospects (April 1-3, 2009)
    Public Symposium, 2010-2013
  International Journal of Digital Curation, published since 2006; Vol 8, No 1 (2013)

Organisations named after curation
  The Greek Digital Curation Unit (DCU) at the Athena Research Centre (2007)
  UC3, University of California Curation Center (2010)
  The Digital Research and Curation Center at the Johns Hopkins University's Sheridan Libraries
  The Digital Curation Institute, established by the University of Toronto's iSchool (2010)
  Purdue University Libraries' Distributed Data Curation Center (D2C2) (2009)

Curation-related education and training
  DigCCurr I (2006-09) and DigCCurr II (2008-13): School of Information and Library Science (SILS), University of North Carolina at Chapel Hill; NARA
    Preserving Access to Our Digital Future: Building an International Digital Curation Curriculum
    Extending an International Digital Curation Curriculum to Doctoral Students and Practitioners
  International Data curation Education Action (IDEA) Working Group
  Developing an International Curation and Preservation Training and Education Roadmap
  Education for Digital Stewardship: Librarians, Archivists or Curators?
  Master's Programme in Digital Curation, Luleå University of Technology
  IFLA, 2011: “Education for Digital Curation”
  Board on Research Data and Information: Symposium on Digital Curation in the Era of Big Data: Career Opportunities and Educational Requirements

Related tools
  Data Asset Framework (DAF): enumerating and auditing data holdings
  DRAMBORA: self-assessment of possible risk
  TRAC (Trustworthy Repositories Audit & Certification: Criteria and Checklist)
  Digital Preservation Suite: preservation plans
  DROID: identifies file formats

2. What Is Digital Curation?

First, a word on digital preservation
  Digital content is a double-edged sword
  Advantages: convenient and easy to use, copyable, easy to transmit, portable in bulk
  Problems
    Fragility: deletion, theft, modification, distortion
    Dependence: on technology, systems, standards, software, context (metadata), organisations, economics
    Rapid obsolescence: media, hardware, software, formats

Digital preservation
  Became a major concern on 1 May 1996
  Preserving Digital Information: Report of the Task Force on Archiving of Digital Information
  Commission on Preservation and Access; Research Libraries Group, Inc. (RLG)
  Goal: “continued access indefinitely into the future of records stored in digital electronic form.”
  http://www.clir.org/pubs/reports/pub63/reports/pub63watersgarrett.pdf

Digital preservation in the early 21st century
  Digital preservation (DP) had become an important field within digital libraries
  Main research topics: preservation strategies and methods, preservation metadata, storage architectures, preservation repositories, preservation workflows, Web archiving, preservation information models
  Main standard: Open Archival Information System (OAIS, 2002)
  Main preservation systems and services: e-Depot DIAS, NDIIPP, LOCKSS, Portico, CDL DPR, FCLA DAITSS

Why did digital curation still emerge?
  Two accepted terms already existed: digital preservation and digital archiving
  Why propose digital curation as well?
  What is digital curation?
  How do its thinking and methods differ from those of digital preservation?

Digital curation: a newly coined term
  Digital Data Curation Task Force: Report of the Task Force Strategy Discussion Day, Tuesday 26th November 2002, Centre Point, London WC1 (January 2003)
  e-Science Curation Report: Data curation for e-Science in the UK: an audit to establish requirements for future curation and provision (2003)
  JCSR: the Joint Information Systems Committee's Committee for the Support of Research (the JISC research support committee)

Digital Data Curation Task Force
  Convened by Tony Hey, then chair of the JCSR
  Goal: to clarify and build a curation strategy for the UK's primary research data
  Meeting date: 26 November 2002
  The application of the term “curation” is new, and in several ways the meeting found itself grappling with questions of scope, with frequent overlap with questions relating to digital preservation.
  It did not reach a definition of the term.

Digital Data Curation Task Force: What is curation?
  Dr John Taylor, Director General of the Research Councils
  Tony Hey: distinguish the actions involved in caring for digital data beyond its original use from digital preservation
  Seamus Ross: “curation in the museum sense” covers three core concepts: conservation, preservation and access
  Alison Allden: “curation” implies active management of information, involving planning. Re-use of data is a core issue: if data is to be re-used, it needs special treatment
  Rolf Apweiler: curation is when people add value to data
  Jeremy Frey: curation is research work in itself: managing, improving, enhancing data

e-Science Curation Report: origin of the term
  “Curation” derives from “curator”: somebody who keeps something for the public good, whose value often needs to be brought out by the curator
  Two important features
    more support for explicit policies with regard to data sharing
    the digital curator is a store-keeper, but he should take an active role in promoting and adding value to his holdings

e-Science Curation Report: then and now
  Previously: “curation” was commonly used to refer to the work done on genomic and proteomic databases, annotating and managing annotations
  Now: it covers a wider context than just archiving; it embraces the care of the record within scientific context and environment

e-Science Curation Report: working definitions
  Curation: The activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose, and available for discovery and re-use. For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose. Higher levels of curation will also involve maintaining links with annotation and with other published materials.
  Archiving: A curation activity which ensures that data is properly selected, stored, can be accessed and that its logical and physical integrity is maintained over time, including security and authenticity.
  Preservation: An activity within archiving in which specific items of data are maintained over time so that they can still be accessed and understood through changes in technology.

e-Science Curation Report: objective
  The objective of digital curation of primary research data is to keep data which is valuable, potentially valuable or which is required to be kept; and in such a way that it is accessible and usable by others (while observing relevant restrictions), that its value is maintained and, where possible, enhanced; and that this activity and service should be provided at affordable and justifiable cost.

JISC circular definition
  JISC circular 6/03 (Revised), July 2003
  The term digital curation is increasingly being used for the actions needed to maintain and utilise digital data and research results over their entire life-cycle for current and future generations of users.

DCC definition 1
  DCC Approach to Digital Curation, 15 Aug 2004
  curation: general term, taking care of things
  data curation: looking after and adding value to data
  digital curation: looking after and somehow adding value to digital data. This probably implies creating some new data from the existing, in order to make the latter more useful and fit for purpose.

DCC definition 2
  DCC Charter and Statement of Principles
  What is digital curation? Digital curation is maintaining and adding value to a trusted body of digital research data for current and future use; it encompasses the active management of data throughout the research lifecycle.
  http://www.dcc.ac.uk/about-us/dcc-charter/dcc-charter-and-statement-principles

DCC definition 3
  Digital curation involves maintaining, preserving and adding value to digital research data throughout its lifecycle.
  The active management of research data reduces threats to their long-term research value and mitigates the risk of digital obsolescence. Meanwhile, curated data in trusted digital repositories may be shared among the wider UK research community.
  As well as reducing duplication of effort in research data creation, curation enhances the long-term value of existing data by making it available for further high quality research.
  http://www.dcc.ac.uk/digital-curation/what-digital-curation

DCC definition 4
  DCC Briefing Papers
  Digital curation is the management and preservation of digital data over the long-term.
  All activities involved in managing data from planning its creation, best practice in digitisation and documentation, and ensuring its availability and suitability for discovery and re-use in the future are part of digital curation.
  Digital curation can also include managing vast data sets for daily use, for example ensuring that they can be searched and continue to be readable.
  Digital curation is therefore applicable to a large range of professional situations from the beginning of the information life-cycle to the end; digitisers, metadata creators, funders, policy-makers, and repository managers to name a few examples.
  http://www.dcc.ac.uk/resources/briefing-papers/introduction-curation

3. How Do Curation and Preservation Differ?

JISC comparison of preservation and curation
  JISC Digital Preservation briefing paper
  Digital preservation: actions and interventions that ensure continued and reliable access to authentic digital objects for as long as they are deemed to be of value.
  Digital curation: maintaining and adding value to a trusted body of digital information for future and current use; active management and appraisal of data over the entire life cycle. It builds upon the underlying concepts of digital preservation, emphasising opportunities for added value and knowledge through annotation and continuing resource management.
  http://sitecore.jisc.ac.uk/publications/briefingpapers/2006/pub_digipreservationbp.aspx

ARL comparison
  New Roles for New Times: Digital Curation for Preservation, March 2011
  Digital curation refers to the actions people take to maintain and add value to digital information over its lifecycle, including the processes used when creating digital content.
  Digital preservation focuses on the “series of managed activities necessary to ensure continued access to digital materials for as long as necessary.”
  Where these actions intersect, digital curation facilitates preservation.

Comparison in “Digital Curation: The Emergence of a New Discipline”
  On preservation: digital preservation efforts originally focussed on ensuring that material survived technical obsolescence and organisational mismanagement. Preservation implied a passive state, where material would be mothballed in an inaccessible “dark archive”, with only a few authorised users, to ensure that it retained its integrity and authenticity.
  On curation: ensuring that digital material is managed throughout its lifecycle so that it remains accessible to those who need to use it. Metadata is used to both improve accessibility and discoverability; and to control authentication procedures, creating audit trails to ensure that material cannot be accessed or altered by those not authorised to do so. Digital material is actively preserved, used and reused for new purposes, creating new materials. This is Digital Curation: the management and preservation of digital material to ensure accessibility over the long-term.

Problems addressed
  Preservation: technological obsolescence and organisational failure
  Curation: From Data Deluge to Data Curation; data volumes and the complexity of the data itself

Purpose of the activity
  Preservation: the survival of the data; ensuring integrity, trustworthiness and authenticity
  Curation: making data usable for research; managing data and adding value to it

Goals achieved
  Preservation: data remains accessible, understandable and usable
  Curation: data is managed across its entire lifecycle, including its creation and the new data generated from existing data, so that data is used and regenerated

Who is served?
  Preservation: future generations
  Curation: current and future users

Activity model
  Preservation: OAIS reference model
  Curation: DCC Curation Lifecycle Model

OAIS reference model
  6 functional entities, 3 types of information package, 3 roles

DCC Curation Lifecycle Model
  Full Lifecycle Actions
    Description and Representation Information
    Preservation Planning
    Community Watch and Participation
    Curate and Preserve
  Sequential Actions
    Conceptualise
    Create or Receive
    Appraise and Select
    Ingest
    Preservation Action
    Store
    Access, Use and Reuse
    Transform
  Occasional Actions
    Dispose
    Reappraise
    Migrate

Participants
  Preservation: data providers, data preservers, authorised users
  Curation: data creators, data providers, data archivists, data consumers

Time span covered
  Preservation: from the point the data is deposited through the required future period, ensuring that the data survives
  Curation: from the creation of the data across its entire lifecycle, with disposal along the way

Scope of data use
  Preservation: authorised access
  Curation: data sharing and data reuse

Approaches and methods
  Preservation: migration, emulation
  Curation: creation and management; add value to generate new sources of information and knowledge

Active or passive stance
  Preservation: “Preservation implied a passive state”
  Curation: “Digital material is actively preserved”; “active management of data throughout the research lifecycle”; “active management and appraisal of data over the entire life cycle”

Where material is kept
  Preservation: inaccessible “dark archive”
  Curation: open trusted repositories

4. Digital Curation Challenges

e-Science Curation Report [three figure slides; only the title survives]

Data tsunami, data deluge, hyperscale data
  CERN (European Organization for Nuclear Research)
  ESA (European Space Agency)
  Future data volumes will be larger and will grow faster
  Astronomical observation data
    Sloan Digital Sky Survey: 25 terabytes of data in the ten years up to 2008
    2014: Large Synoptic Survey Telescope, 20 terabytes per night
    2019: the Square Kilometre Array radio telescope will generate 50 TB of processed data; counted as raw data, 7000 TB per second

Big data and big data science
  The era of big data science has arrived
  It is not limited to large facilities or to particular fields
  Big data science is a new paradigm of scientific discovery: data-intensive science, data-intensive discovery
  It exists in every field of research
  Data produced by observation, experiment and computation is growing in value, whether in the physical sciences, the humanities or the social sciences

Data as the infrastructure
  European Union: “In a sense, the physical and technical infrastructure becomes invisible and the data themselves become the infrastructure, a valuable asset, on which science, technology, the economy and society can advance”
  The GRDI2020 project will build a Research Data Infrastructure that promotes the integration of data management systems, digital libraries, research libraries, digital repositories, tools and research teams
  The eighth Framework Programme (Horizon 2020 programme) makes data e-infrastructure a priority area
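
To put the astronomy figures quoted in the data-deluge slide (Sloan Digital Sky Survey, Large Synoptic Survey Telescope, Square Kilometre Array) on a common scale, the following is a minimal Python sketch. It is only a back-of-the-envelope illustration: the function names are hypothetical, and it assumes 365 observing nights per year, continuous operation over an 86,400-second day, and decimal units (1 EB = 1,000,000 TB), none of which come from the slides.

```python
# Back-of-the-envelope comparison of the data volumes quoted in the slides.
# Assumptions (not from the slides): 365 observing nights per year,
# continuous operation over an 86,400-second day, and decimal units
# (1 EB = 1,000,000 TB).

NIGHTS_PER_YEAR = 365
SECONDS_PER_DAY = 86_400
TB_PER_EB = 1_000_000


def sdss_tb_per_night(total_tb: float = 25.0, years: float = 10.0) -> float:
    """Average SDSS volume per night if 25 TB is spread evenly over 10 years."""
    return total_tb / (years * NIGHTS_PER_YEAR)


def lsst_to_sdss_ratio(lsst_tb_per_night: float = 20.0) -> float:
    """How many times larger the quoted LSST nightly volume is than the SDSS average."""
    return lsst_tb_per_night / sdss_tb_per_night()


def ska_raw_eb_per_day(raw_tb_per_second: float = 7000.0) -> float:
    """Quoted SKA raw rate accumulated over one full day, in exabytes."""
    return raw_tb_per_second * SECONDS_PER_DAY / TB_PER_EB


if __name__ == "__main__":
    print(f"SDSS average: {sdss_tb_per_night():.4f} TB per night")
    print(f"LSST / SDSS:  about {lsst_to_sdss_ratio():.0f}x")
    print(f"SKA raw:      about {ska_raw_eb_per_day():.1f} EB per day")
```

Under these assumptions the quoted LSST nightly volume is roughly 2,900 times the SDSS nightly average, and the quoted SKA raw rate would accumulate to several hundred exabytes per day, which is the scale shift the slide points to.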