NTCIR

NII Testbeds and Community for Information access Research

CrossLingual Link Discovery Task

Last Update: 08 September 2011

Change Log:
07/09/2011    Crosslink evaluation results released
08/08/2011    Crosslink evaluation tool with ground-truth released
05/08/2011    Participant paper submission page ready
18/07/2011    Crosslink submission open
15/07/2011    Crosslink evaluation tool for training released
31/05/2011    Crosslink validation tool v2.2 released
18/05/2011    Crosslink test topics released
02/02/2011    Crosslink training topics released
31/01/2011    Crosslink validation tool released
11/01/2011    Wikipedia CJK XML collections released
 

1.    Introduction

Cross-lingual link discovery (CLLD) is the task of automatically finding potential links between documents in different languages. It differs from traditional cross-lingual information retrieval (CLIR): CLIR can be viewed as creating a virtual link between a cross-lingual query and the retrieved documents, whereas CLLD actively recommends a set of meaningful anchors in the source document and uses them, together with contextual information from the surrounding text, as queries to establish links to documents in other languages.

Wikipedia is an online multilingual encyclopaedia that contains a very large number of articles covering most written languages, and it therefore includes extensive hypertext links between documents of the same language for easy reading and referencing. However, pages in different languages are rarely linked, except for the cross-lingual links between pages on the same subject. This poses serious difficulties for users who seek information or knowledge from sources in different languages. Cross-lingual link discovery therefore tries to break the language barrier in knowledge sharing. With CLLD, users are able to discover documents in languages they are familiar with, or in languages that have a richer set of documents than their language of choice.

For English there are several link discovery tools that assist topic curators in discovering prospective anchors and targets for a given document. No such tools yet exist that support the cross-linking of documents across multiple languages. This task aims to incubate technologies assisting CLLD and to enhance the user experience of viewing or editing documents in a cross-lingual manner. Language differences, ambiguities, and other language-specific issues such as Chinese segmentation all make this task even more challenging. Researchers interested in cross-lingual link discovery are welcome to join us; in particular, researchers from the CLIR and link discovery communities are encouraged to participate in this exciting task.
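A minimal sketch of one possible CLLD pipeline is given below (in Python). It is not the method prescribed by the task: candidate anchors are found by matching the topic text against a pre-built dictionary of known English anchor texts, and each anchor is mapped to target-language documents through a pre-built table of Wikipedia cross-language title links. The dictionaries, function name, and matching strategy are all illustrative assumptions.

    # One possible CLLD pipeline (illustrative only, not prescribed by the task):
    # match known English anchor texts against the topic, then map the linked
    # English article titles to target-language documents via cross-language links.
    def discover_links(topic_text: str,
                       anchor_dict: dict[str, str],       # anchor text -> English article title
                       langlinks: dict[str, list[str]]    # English title -> target-language doc ids
                       ) -> dict[str, list[str]]:
        """Return a mapping from each recommended anchor to candidate target documents."""
        links: dict[str, list[str]] = {}
        lowered = topic_text.lower()
        for anchor, en_title in anchor_dict.items():
            if anchor.lower() in lowered:                  # naive surface-form match
                targets = langlinks.get(en_title, [])
                if targets:
                    links[anchor] = targets
        return links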


To participate, please visit one of the following registration pages:
(English version)
http://research.nii.ac.jp/ntcir/ntcir-9/howto.html
(Japanese version)
http://research.nii.ac.jp/ntcir/ntcir-9/howto-ja.htm


2.    Task Definition

Generally, links between documents can be classified as either outgoing or incoming. In this task we focus mainly on outgoing links that start from English source documents and point to Chinese, Korean, or Japanese target documents. The CLLD task comprises the following three subtasks:
•    English to Chinese CLLD
•    English to Japanese CLLD
•    English to Korean CLLD
Participants can choose one or more of the above three subtasks to participate in.
 
The English topics and the target corpora consist of actual Wikipedia pages in XML format with rich structural information. To submit a run for a given subtask, participants are required to choose the most suitable anchors in the English topic document and, for each anchor, identify the most suitable documents in the target-language corpus. For each topic we allow up to 250 anchors, each with up to 5 targets, for a total of up to 1,250 outgoing links per topic.
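Below is a minimal sketch (in Python) of how a run could be trimmed to these limits. The Anchor structure and the ranking scores are hypothetical; the official submission format is described on the Submission page.

    # Trim a candidate link set to the task limits: at most 250 anchors per topic,
    # at most 5 targets per anchor (up to 1,250 outgoing links per topic).
    from dataclasses import dataclass, field

    MAX_ANCHORS_PER_TOPIC = 250
    MAX_TARGETS_PER_ANCHOR = 5

    @dataclass
    class Anchor:
        text: str                                     # anchor text found in the English topic
        offset: int                                   # character offset of the anchor in the topic
        targets: list = field(default_factory=list)   # (target_doc_id, score) pairs

    def build_run(anchors: list[Anchor]) -> list[Anchor]:
        """Keep the highest-scoring targets per anchor and the best-scoring anchors overall."""
        for a in anchors:
            a.targets = sorted(a.targets, key=lambda t: t[1], reverse=True)[:MAX_TARGETS_PER_ANCHOR]
        anchors = [a for a in anchors if a.targets]
        anchors.sort(key=lambda a: a.targets[0][1], reverse=True)
        return anchors[:MAX_ANCHORS_PER_TOPIC]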


Extra task (Optional):

When a target document has been chosen for an anchor, by default reading starts at the beginning of the article. In focused information retrieval, however, a reading entry point other than the beginning of the target text can be given to the user so that the wanted information can be found more efficiently. In this extra task, therefore, a best entry point (BEP) should be specified in the target document as the place to start reading the referenced material. This is particularly useful for picking out a snippet of meaningful information from a lengthy document.
 
To differentiate the two, the main task is called the anchor-to-file task and the extra task is called the anchor-to-bep task.
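The difference between the two run types can be illustrated with a small sketch (in Python); the field names and the example identifiers are hypothetical and do not reflect the official submission schema.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class CrossLingualLink:
        anchor_text: str                   # anchor chosen in the English topic
        topic_id: int                      # English source topic
        target_doc_id: int                 # document in the Chinese/Japanese/Korean corpus
        bep_offset: Optional[int] = None   # best entry point; None for anchor-to-file runs

    # anchor-to-file: reading starts at the beginning of the target article
    file_link = CrossLingualLink("martial arts", 19501, 123456)

    # anchor-to-bep: reading starts at the specified entry point inside the target
    bep_link = CrossLingualLink("martial arts", 19501, 123456, bep_offset=2048)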


3.    Topic and Document Collection

3.1 Training and Test Topics
Training topics

 #   Title          ID
 1   Australia      4689264
 2   Femme fatale   299098
 3   Martial arts   19501

Only three topics are used for system training. Participants may use these topics to create dry runs. To download a copy of the training topics, please click here.

Test topics
A set of 25 articles will be randomly chosen from the English Wikipedia and used as test topics. Participants should use these topics to create runs for the formal final submission. Participants should orphan these topics by removing all hyperlinks to and from them. The corresponding topic pages in Chinese, Japanese, and Korean should also be removed from the document collections.
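A minimal sketch (in Python) of how a topic file could be orphaned is given below. It assumes the links in the topic XML are marked with a dedicated element (here hypothetically named <link>); only the anchor text of each link is kept, and any nested markup inside the anchor is dropped. Removing incoming links in the corpus and the corresponding CJK pages is not shown.

    import xml.etree.ElementTree as ET

    def orphan_topic(path_in: str, path_out: str, link_tag: str = "link") -> None:
        """Unwrap every link element in the topic so only its anchor text remains."""
        tree = ET.parse(path_in)
        root = tree.getroot()
        for parent in list(root.iter()):
            for child in list(parent):
                if child.tag == link_tag:
                    text = (child.text or "") + (child.tail or "")
                    siblings = list(parent)
                    idx = siblings.index(child)
                    if idx == 0:
                        parent.text = (parent.text or "") + text
                    else:
                        siblings[idx - 1].tail = (siblings[idx - 1].tail or "") + text
                    parent.remove(child)
        tree.write(path_out, encoding="utf-8")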

3.2 Document Collection
The training and test collections for the three subtasks are identical. The collections consist of search-engine-friendly XML files created from Wikipedia MySQL database dumps taken in June 2010. The details of the collections are as follows:

Language     Articles    Size     Dump date
Chinese      318,736     2.7 GB   27/06/2010
Japanese     716,088     6.1 GB   24/06/2010
Korean       201,596     1.2 GB   28/06/2010

4.    Submission
Please refer to the Submission page for detailed information.
 
5.    Evaluation
Please refer to the Evaluation page for assessment methods and evaluation metrics.
 
6.     Task Organizers

Shlomo Geva       Queensland University of Technology, Australia

Andrew Trotman    University of Otago, New Zealand

Yue Xu            Queensland University of Technology, Australia

Kelly Itakura     Queensland University of Technology, Australia

Eric Tang (l4.tang AT qut.edu.au)    Queensland University of Technology, Australia

Darren Huang      Queensland University of Technology, Australia

Task mailing list: crosslink


7. Paper  

Participant paper submission page: https://www.easychair.org/conferences/?conf=ntcir9crosslink

8. Schedule

30/09/2010    Call for task participation
05/10/2010    Kick-off event in Tokyo
20/01/2011    Task registration due
10/01/2011    Document set release
31/01/2011    Crosslink validation tool release
02/02/2011    Crosslink training topics release
01~05/2011    Dry run
03~07/2011    Formal run
18/05/2011    Crosslink test topics release
15/07/2011    Crosslink evaluation tool for training release
31/07/2011    Submission deadline for Crosslink results
01/08/2011    Crosslink assessment tool release
08/08/2011    Crosslink evaluation tool with ground-truth release
21/08/2011    Crosslink evaluation tool release
07/09/2011    Evaluation results due
07/09/2011    Task overview partial release
20/09/2011    Participant paper submission due
20/10/2011    Task overview paper submission due
04/11/2011    All camera-ready copy for the Proceedings due
06~09/12/2011    NTCIR-9 Meeting, NII, Tokyo, Japan

9. Resources

1)  Crosslink Kick-off slides:  NTCIR-9-CLLD-KICK-OFF-3.pptx

2)  Assessment and evaluation tool-kits project site: http://code.google.com/p/crosslink/ 



 