NTCIR-10 Crosslink-2: Task Description

1.    Introduction

Cross-lingual link discovery (CLLD) is concerned with automatically finding potential links between documents in different languages. In contrast to traditional information retrieval tasks, where queries are not attached to an explicit context or are only loosely attached to one, cross-lingual link discovery algorithms actively recommend a set of meaningful anchors in the context of a source document and establish links to documents in an alternative language. CLLD is helpful for complementary knowledge discovery in different language domains.

Currently, in a knowledge base such as Wikipedia, articles in different languages are rarely cross-linked except for directly equivalent pages (on the same subject) in different languages. This can pose serious difficulties for users seeking information or knowledge from sources in different languages, or where there is no equivalent page in one language or another.

Figure 1: Lost in translation

Figure 1 shows several different language versions of the page on “Custard”. Note that: 1) anchors are largely linked to articles in the source languages; 2) not all cross-language equivalent links exist – the English article “Custard” is not linked to the Italian custard article “Crema pasticcera”, and vice versa; 3) some cross-language equivalent links are incorrect – the Chinese custard article “奶黄” is incorrectly linked to the Italian pudding article “Budino”, and vice versa.

Therefore, the job of CLLD is to help identify a set of meaningful anchors in the context of a source document and to establish links to documents in an alternative language of the user's preference.
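
The output of a CLLD system can be pictured as a list of anchors, each carrying a ranked list of target documents in the other language. The following Python sketch is only an illustration of that idea: the class and field names (Anchor, offset, targets) and the document identifiers such as "en:Custard" are assumptions made for this example, and this is not the official Crosslink run format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Anchor:
    offset: int   # character offset of the anchor in the source document
    text: str     # anchor text as it appears in the source document
    targets: List[str] = field(default_factory=list)  # ranked target ids (hypothetical)


def flatten(anchors: List[Anchor]) -> List[Tuple[int, str, int, str]]:
    """Expand anchors into (offset, anchor text, rank, target id) rows."""
    rows = []
    for anchor in sorted(anchors, key=lambda a: a.offset):
        for rank, target in enumerate(anchor.targets, start=1):
            rows.append((anchor.offset, anchor.text, rank, target))
    return rows


# Hypothetical Chinese source text linked to hypothetical English targets.
anchors = [
    Anchor(4, "布丁", ["en:Pudding"]),
    Anchor(0, "奶黄", ["en:Custard", "en:Pudding"]),
]
for row in flatten(anchors):
    print(row)
```

Keeping the targets as a ranked list per anchor reflects the idea that one anchor may plausibly point to more than one document in the target language.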

Crosslink was successfully held as a pilot task of NTCIR-9 in 2011. By the end of the experimentation season, a total of 57 runs from 11 teams had been received.

2.    Task Definition

For the Crosslink task at NTCIR-10, we are planning three subtasks that are similar to those of NTCIR-9 but run in the opposite link direction. The new subtasks are:

    Chinese to English CLLD (C2E)

    Japanese to English CLLD (J2E)

    Korean to English CLLD (K2E)

The new subtasks are not simple replicas of the previous Crosslink tasks. They allow the CLLD approaches proposed at NTCIR-9 for suggesting good links from English documents to Chinese / Japanese / Korean ones to be re-examined in a different linking environment.

In addition, this time participants will have to deal with an extra problem when cross-linking documents: there are no explicit word boundaries in Chinese / Japanese text, nor within a Korean eojeol. The natural language processing needed for linking CJK documents to English ones could make this task even more challenging.
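
To illustrate the word-boundary problem, the sketch below finds candidate anchors in unsegmented text by greedy forward maximum matching against a set of known article titles. This is one common dictionary-based technique, shown here under stated assumptions; it is not a method prescribed by the task. The function name, the max_len parameter and the tiny title set are hypothetical.

```python
def longest_match_anchors(text, titles, max_len=8):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    that is a known title; if none matches, advance by one character.
    Returns a list of (offset, matched title) pairs.
    """
    anchors = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in titles:
                match = candidate
                break
        if match:
            anchors.append((i, match))
            i += len(match)
        else:
            i += 1
    return anchors


# Hypothetical title set; a real system would load titles from the collection.
titles = {"奶", "奶黄", "布丁"}
print(longest_match_anchors("奶黄不是布丁", titles))
```

Because the longest candidate is tried first, "奶黄" is preferred over the shorter title "奶" at the same position. A greedy matcher of this kind can still pick the wrong boundary in ambiguous text, which is part of what makes the CJK-to-English direction harder.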

The participating teams of the NTCIR-9 Crosslink task committed huge efforts to creating various CLLD systems from scratch. To allow those systems to be evaluated further with different topics, runs are also accepted for the subtasks of the previous evaluation round:
•    English to Chinese CLLD
•    English to Japanese CLLD
•    English to Korean CLLD

Keeping the same subtasks in the new evaluation round will make it possible to track the continued improvement of the existing CLLD systems.

3.    Topic and Document Collection

3.1 Training and Test Topics
Training topics
For system training, please download the test topics and the document collections for the NTCIR-9 Crosslink task when they are available. Participants may use these topics to create dry runs.

Test topics
Four sets of topics, 25 articles each, in four languages (ECJK) have been chosen from the new Wikipedia document collections and used as test topics. Participants should use these topics to create runs for formal final submissions. These topics will be provided as orphaned Wikipedia articles by removing all hyperlinks to and from these documents. The corresponding topic pages in English, Chinese, Japanese and Korean will not be contained in the document collections.
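
As a rough illustration of what orphaning a topic involves, the sketch below removes outgoing links from a piece of text while keeping the visible anchor text. It assumes MediaWiki-style link markup ([[Target]] and [[Target|label]]); the actual topic files and the procedure used by the organizers may differ, and the function name is hypothetical. Removing incoming links would additionally require the same treatment of every other document that links to the topic.

```python
import re

# Matches [[Target]] and [[Target|label]] (no nested brackets).
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def strip_links(wikitext):
    """Replace each wiki link with its visible text (label if present, else target)."""
    return WIKILINK.sub(lambda m: m.group(2) or m.group(1), wikitext)


print(strip_links("[[Custard]] is often compared with [[Pudding|pudding]]."))
```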

3.2 Document Collection
For the Crosslink-2 task, an English Wikipedia collection and new CJK collections were created for the evaluation of the new subtasks.

The collections consist of search-engine-friendly XML files created from recent dumps of the Wikipedia MySQL database. The details of the collections are given as follows (the language of the corpus, the number of articles, the size of the corpus, and the date of the dump):

Language     Articles      Size     Date of dump
English      3,581,772     33G      04/01/2012
Chinese      404,620       3.6G
Japanese     858,610       9.8G
Korean       297,913       2.2G     22

4.    Submission
5.    Evaluation
6.   Task Organizers

Shlomo Geva                Queensland University of Technology, Australia
In-Su Kang                     Kyungsung University, South Korea
Fuminori Kimura           Ritsumeikan University, Japan
Yi-Hsun Lee                   Academia Sinica, Taiwan
Eric Tang                       Queensland University of Technology, Australia
Andrew Trotman           University of Otago, New Zealand
Yue Xu                           Queensland University of Technology, Australia

7. Paper   

8. Schedule

~~/02/2012          Call for Task Participation
31/08/2012          Task registration due
07~11/2012         Dry run
09~12/2012         Formal run
01/07/2012          Crosslink test topics release
15/07/2012          Crosslink validation tool for test topics release
27/08/2012          Crosslink validation tool for test topics release  
31/11/2012          Run submissions due
21/12/2012          Run submissions due
01/12/2012          Crosslink assessment tool release
01/12/2012          Crosslink evaluation tool with ground-truth release
31/01/2013          Evaluation results due
01/02/2013          Task overview partial release
31/03/2013          Participant paper submission due
20/04/2013          Task overview paper submission due
01/05/2013          All camera-ready copy for the Proceedings due

18-21/06/2013    EVIA 2013/NTCIR-10 Meeting

9. Resources

1)  The overview paper for the NTCIR-9 Crosslink task: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings9/NTCIR/01-NTCIR9-OV-CROSSLINK-TangL.pdf 
2)  NTCIR-9 Crosslink Website: http://ntcir.nii.ac.jp/CrossLink
4)  Assessment and evaluation tool-kits project site: http://code.google.com/p/crosslink/ 
