NTCIR

NII Testbeds and Community for Information access Research

Document Collection (NTCIR-11 Temporalia)


NTCIR-11 Temporalia uses a web corpus called the "LivingKnowledge news and blogs annotated subcollection", constructed by the LivingKnowledge project and distributed by Internet Memory.

The collection is 20 GB uncompressed and over 5 GB zipped. It spans May 2011 to March 2013 and contains around 3.8M documents collected from about 1500 different blogs and news sources. The data is split into 970 files, each named after the date it covers plus some information about its sources (there might be more than one file per day).

Each file contains a number of text documents. For each document, the following information is available:


<doc id=lk-20130223040102_592>

<meta-info>

<tag name="host">www.spiegel.de</tag>

<tag name="date">2013-02-22</tag>

<tag name="url">http://www.spiegel.de/international/business/eu-widens-libor-scandal-investigation-and-threatens-heavy-fines-a-884948.html#ref=rss</tag>

<tag name="sourcerss">http://www.spiegel.de/international/index.rss</tag>

<tag name="title">EU Widens LIBOR Scandal Investigation and Threatens Heavy Fines</tag>

</meta-info>

<text>

...

</text>



The “doc id” is a unique document identifier in the collection. The “host” is the hostname the text was pulled from, the “date” is the publishing date of the document, the “url” is the URL the text was pulled from, the “sourcerss” is the RSS feed that was accessed to retrieve the page, and the “title” is the title of the page.
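The header fields described above can be read with a few regular expressions. The following is a minimal sketch, not an official tool: the function name `parse_header` and the shortened sample record are illustrative assumptions, and the pattern assumes every field follows the `<tag name="...">...</tag>` layout shown in the example.

```python
import re

# Shortened copy of the example header shown above (illustrative only).
SAMPLE = """<doc id=lk-20130223040102_592>
<meta-info>
<tag name="host">www.spiegel.de</tag>
<tag name="date">2013-02-22</tag>
<tag name="title">EU Widens LIBOR Scandal Investigation and Threatens Heavy Fines</tag>
</meta-info>
"""

# The id in the example is unquoted; optional quotes are accepted as well.
DOC_ID_RE = re.compile(r'<doc id="?([^">\s]+)"?>')
TAG_RE = re.compile(r'<tag name="([^"]+)">(.*?)</tag>', re.DOTALL)


def parse_header(record):
    """Return (doc_id, metadata dict) for one document record.

    Hypothetical helper: assumes the header layout from the example.
    """
    match = DOC_ID_RE.search(record)
    doc_id = match.group(1) if match else None
    meta = {name: value.strip() for name, value in TAG_RE.findall(record)}
    return doc_id, meta


doc_id, meta = parse_header(SAMPLE)
print(doc_id)
print(meta["host"], meta["date"])
```

Regular expressions are used here rather than an XML parser because the unquoted `id` attribute in the example is not well-formed XML.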

The content of the page sits between the <text> tags. This collection also provides three kinds of annotations: sentence splitting, named entities, and time annotations.
Each sentence in the content of the page is surrounded by <SE> tags.

Each identified named entity is surrounded by <E> tags. The type of the entity is included inside the tag, for instance <E type="E:ORGANIZATION:CORPORATION">YouWalkAway.com</E>.

Each time reference identified in the text is surrounded by <T> tags, for instance <T val="2012">the end of 2012</T>. The tag contains a “val” attribute giving the estimated point in time the expression refers to.
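The three annotation types can be pulled out of a page with simple patterns. This is a hedged sketch under stated assumptions: the sample sentence is invented for illustration (only the two tagged fragments come from the description above), the function name `extract_annotations` is hypothetical, and the patterns assume that tags are not nested inside a tag of the same kind and that attributes are written exactly as in the examples.

```python
import re

# Invented sample sentence; only the tagged fragments come from the examples.
SAMPLE = (
    '<SE><E type="E:ORGANIZATION:CORPORATION">YouWalkAway.com</E> '
    'expects more clients by <T val="2012">the end of 2012</T>.</SE>'
)

SENTENCE_RE = re.compile(r'<SE>(.*?)</SE>', re.DOTALL)
ENTITY_RE = re.compile(r'<E type="([^"]*)">(.*?)</E>', re.DOTALL)
TIME_RE = re.compile(r'<T val="([^"]*)">(.*?)</T>', re.DOTALL)


def extract_annotations(text):
    """Return (entities, times) as lists of (surface text, attribute value)."""
    entities = [(surface, etype) for etype, surface in ENTITY_RE.findall(text)]
    times = [(surface, val) for val, surface in TIME_RE.findall(text)]
    return entities, times


entities, times = extract_annotations(SAMPLE)
sentences = SENTENCE_RE.findall(SAMPLE)
print(entities)
print(times)
print(len(sentences))
```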

An example of an annotated document can be found below.


For participants who wish to use only the plain text, we will provide a script that removes these annotations and leaves only the textual part of each page.
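Until that script is available, the idea can be approximated in a few lines. The sketch below is not the organizers' script: the function name `strip_annotations` and the sample sentence are assumptions, and it only removes the <SE>, <E>, and <T> tags described above while leaving every other tag untouched.

```python
import re

# Matches opening and closing SE, E, and T tags, with or without attributes.
# The match is case-sensitive, so lowercase tags such as <text> are kept.
ANNOTATION_RE = re.compile(r'</?(?:SE|E|T)(?:\s[^>]*)?>')


def strip_annotations(text):
    """Remove annotation tags and keep the text they enclose."""
    return ANNOTATION_RE.sub("", text)


# Invented sample sentence for illustration.
SAMPLE = (
    '<SE><E type="E:ORGANIZATION:CORPORATION">YouWalkAway.com</E> '
    'expects more clients by <T val="2012">the end of 2012</T>.</SE>'
)

plain = strip_annotations(SAMPLE)
print(plain)
```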

How to get a copy of the document collection

Please contact us first via Google Groups or email (tc4fia at googlegroups dot com) with your name, affiliation, and registered group id. We will then let you know the contact address of Leïla Medjkoune, Head of Web Archiving, Internet Memory, who manages the distribution of the document collection.

How to remove tags or correct markup inconsistency


Please read the How to and the FAQ.