Cross-lingual link discovery (CLLD) is concerned with automatically finding potential links between documents in different languages. In contrast to traditional information retrieval tasks, where queries are not attached to explicit context or are only loosely attached to it, cross-lingual link discovery algorithms actively recommend a set of meaningful anchors in the context of a source document and establish links to documents in an alternative language. CLLD is helpful for complementary knowledge discovery in different language domains.
Currently, in a knowledge base such as Wikipedia, articles in different languages are rarely cross-linked except for direct equivalent pages (on the same subject) in different languages. This can pose serious difficulties for users seeking information or knowledge from sources in different languages, or where there is no equivalent page in one language or another.
Figure 1: Lost in translation
Figure 1 shows several different language versions of the page on “Custard”. Note that: 1) anchors are largely linked to articles in the source language; 2) not all cross-language equivalent links exist – the English article “Custard” is not linked to the Italian custard article “Crema pasticcera”, and vice versa; 3) some cross-language equivalent links are incorrect – the Chinese custard article “奶黄” is incorrectly linked to the Italian pudding article “Budino”, and vice versa.
Therefore, the job of CLLD is to help identify a set of meaningful anchors in the context of a source document and establish links to documents in an alternative language of the user's preference.
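The anchor-and-target idea described above can be illustrated with a minimal Python sketch. It is not the method prescribed by the task, only one simple baseline under stated assumptions: an anchor is first matched against source-language article titles and then mapped to the target language through a table of cross-language equivalent pages. The names `CrossLink`, `discover_links` and the `equivalents` table are hypothetical, and the toy data reuses the “Custard” example from Figure 1.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CrossLink:
    anchor: str        # anchor text as it appears in the source document
    offset: int        # character offset of the anchor in the source text
    target_lang: str   # language code of the target document
    target_title: str  # title of the target document


def discover_links(text: str,
                   equivalents: Dict[str, Dict[str, str]],
                   target_lang: str) -> List[CrossLink]:
    """Recommend anchors whose title has an equivalent page in target_lang.

    `equivalents` maps a source-language title to its equivalent titles,
    keyed by language code (a hypothetical, hand-made table).
    """
    links = []
    for title, by_lang in equivalents.items():
        offset = text.find(title)          # first occurrence only
        if offset == -1 or target_lang not in by_lang:
            continue
        links.append(CrossLink(title, offset, target_lang, by_lang[target_lang]))
    return sorted(links, key=lambda link: link.offset)


# Toy data: assumed equivalences, for illustration only.
equivalents = {
    "custard": {"it": "Crema pasticcera", "zh": "奶黄"},
    "pudding": {"it": "Budino"},
}
text = "A trifle is layered with custard and fruit, unlike a pudding."
links = discover_links(text, equivalents, "zh")
for link in links:
    print(link)
```

A real system would also have to rank candidate anchors and targets, and cope with missing or incorrect equivalent links such as those noted for Figure 1.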
Crosslink was successfully held as a pilot task of NTCIR-9 in 2011. By the end of the experimentation season, a total of 57 runs from 11 teams had been received.
To participate, please visit one of the following registration pages:
(English version) http://research.nii.ac.jp/ntcir/ntcir-10/howto.html
(Japanese version) http://research.nii.ac.jp/ntcir/ntcir-10/howto-ja.html
2. Task Definition
For the Crosslink task at NTCIR-10, we are planning three new subtasks that are similar to those of NTCIR-9 but have the opposite link direction. The new subtasks are:
• Chinese to English CLLD (C2E)
• Japanese to English CLLD (J2E)
• Korean to English CLLD (K2E)
The new subtasks are not simple replicas of the previous Crosslink tasks. They allow the CLLD approaches proposed at NTCIR-9 for suggesting good links from English documents to Chinese / Japanese / Korean ones to be re-examined in a different linking environment.
In addition, participants will have to deal with an extra problem when cross-linking documents: there are no word boundaries in Chinese / Japanese text, and Korean text is separated into eojeols (spacing units) rather than individual words. The natural language processing required for CJK-to-English document linking could make this task even more challenging.
As the participating teams of the NTCIR-9 Crosslink task committed huge efforts to creating various CLLD systems from scratch, and to allow further evaluation of those systems with different topics, runs are also allowed for the subtasks of the previous evaluation round:
• English to Chinese CLLD
• English to Japanese CLLD
• English to Korean CLLD
Having the same subtasks in the new evaluation round will make it possible to see the continuous improvement of the existing CLLD systems.
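To make the word-boundary problem mentioned for the CJK-to-English subtasks concrete, the following Python sketch applies forward maximum matching, a common dictionary-based segmentation heuristic, to find candidate anchors in unsegmented text. It is only an illustration, not a technique required by the task; the function name `forward_max_match` and the tiny vocabulary of article titles are assumptions.

```python
from typing import List, Set, Tuple


def forward_max_match(text: str, vocabulary: Set[str]) -> List[Tuple[int, str]]:
    """Greedily split text into the longest vocabulary entries, left to right.

    Characters that start no vocabulary entry are emitted as single-character
    tokens. Returns (offset, token) pairs.
    """
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append((i, candidate))
                i += size
                break
    return tokens


# Hypothetical vocabulary of Chinese article titles ("custard", "pudding").
vocabulary = {"奶黄", "布丁"}
tokens = forward_max_match("奶黄和布丁", vocabulary)

# Tokens found in the vocabulary are candidate anchors.
anchors = [(offset, token) for offset, token in tokens if token in vocabulary]
print(anchors)
```

Greedy matching can choose the wrong split when dictionary entries overlap, which is one reason CJK-to-English linking needs more language processing than the English-source subtasks.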
3. Topic and Document Collection
3.1 Training and Test Topics
Training topics
For system training, please download the test topics and the document collections for the NTCIR-9 Crosslink task when they are available. Participants may use these topics to create dry runs.
Four sets of topics, 25 articles each in four languages (ECJK), have been chosen from the new Wikipedia document collections and are used as test topics. Participants should use these topics to create runs for formal final submissions. These topics will be provided as orphaned Wikipedia articles by removing all hyperlinks to and from these documents. The corresponding topic pages in English, Chinese, Japanese and Korean will not be contained in the document collections.
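Orphaning a topic, as described above, means removing links in both directions. The Python sketch below shows the idea on a made-up markup; the actual XML schema of the collections is not specified here, so the `<link href="...">` element, the file names and both function names are assumptions for illustration only.

```python
import re

# Assumed link markup: <link href="target.xml">anchor text</link>
LINK_RE = re.compile(r'<link\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)</link>', re.DOTALL)


def strip_outgoing_links(topic_xml: str) -> str:
    """Remove every link from a topic, keeping only the anchor text."""
    return LINK_RE.sub(lambda m: m.group(2), topic_xml)


def strip_incoming_links(doc_xml: str, topic_file: str) -> str:
    """Remove only the links in another document that point to the topic."""
    def unlink(match):
        if match.group(1) == topic_file:
            return match.group(2)   # keep the text, drop the link
        return match.group(0)       # leave unrelated links untouched
    return LINK_RE.sub(unlink, doc_xml)


topic = '<p>Made with <link href="milk.xml">milk</link> and eggs.</p>'
other = ('<p><link href="custard.xml">Custard</link> and '
         '<link href="milk.xml">milk</link></p>')

print(strip_outgoing_links(topic))
print(strip_incoming_links(other, "custard.xml"))
```

For real data, a proper XML parser would be more robust than a regular expression.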
3.2 Document Collection
For the Crosslink-2 task, an English Wikipedia collection and new CJK collections were created for the evaluation of the new subtasks.
The collections consist of search-engine-friendly XML files created from recent dumps of the Wikipedia MySQL database. The details of the collections are given below (the language of the corpus, the number of articles, the size of the corpus, and the date of the dump):
Language   Articles    Size   Dump date
English    3,581,772   33G    04/01/2012
Chinese    404,620     3.6G   11/01/2012
Japanese   858,610     9.8G   04/01/2012
Korean     297,913     2.2G   22/01/2012
Please refer to the Submission page for detailed information.
Please see the evaluation page of Crosslink-1. A copy of the latest evaluation tool is available for download.
6. Task Organizers
Shlomo Geva Queensland University of Technology, Australia
In-Su Kang Kyungsung University, South Korea
Fuminori Kimura Ritsumeikan University, Japan
Yi-Hsun, Lee Academia Sinica, Taiwan
Eric Tang Queensland University of Technology, Australia
Andrew Trotman University of Otago, New Zealand
Yue Xu Queensland University of Technology, Australia
Task mailing list:
7. Paper
Participant paper submission page: https://www.easychair.org/conferences/?conf=crosslink2
8. Schedule
~~/02/2012 Call for Task Participation
31/08/2012 Task registration due
07~11/2012 Dry run
09~12/2012 Formal run
01/07/2012 Crosslink test topics release
15/07/2012 Crosslink validation tool for test topics release
27/08/2012 Crosslink validation tool for test topics release
31/11/2012 Run submissions due
21/12/2012 Run submissions due
01/12/2012 Crosslink assessment tool release
01/12/2012 Crosslink evaluation tool with ground-truth release
31/01/2013 Evaluation results due
01/02/2013 Task overview partial release
31/03/2013 Participant paper submission due
20/04/2013 Task overview paper submission due
01/05/2013 All camera-ready copy for the Proceedings due
5) Open source link discovery tools and library: