ChangeLog:
07/09/2011 Crosslink evaluation results released
08/08/2011 Crosslink evaluation tool with ground-truth released
05/08/2011 Participant paper submission page ready
18/07/2011 Crosslink submission open
15/07/2011 Crosslink evaluation tool for training released
31/05/2011 Crosslink validation tool v2.2 released
18/05/2011 Crosslink test topics released
02/02/2011 Crosslink training topics released
31/01/2011 Crosslink validation tool released
11/01/2011 Wikipedia CJK XML collections released
1. Introduction
Cross-lingual link discovery (CLLD) is a way of automatically finding potential links between documents in different languages. It differs from traditional cross-lingual information retrieval (CLIR): CLIR can be viewed as a process of creating a virtual link between a provided cross-lingual query and the retrieved documents, whereas CLLD actively recommends a set of meaningful anchors in the source document and uses them as queries, together with contextual information from the text, to establish links with documents in other languages.
Wikipedia is an online multilingual encyclopaedia that contains a very large number of articles in most written languages, with extensive hypertext links between documents in the same language for easy reading and referencing. However, pages in different languages are rarely linked, except for the cross-lingual links between pages about the same subject. This can pose serious difficulties for users who seek information or knowledge from sources in different languages. Cross-lingual link discovery therefore tries to break the language barrier in knowledge sharing. With CLLD, users can discover documents in languages that they are familiar with, or that offer a richer set of documents than their language of choice.
For English there are several link discovery tools that assist topic curators in discovering prospective anchors and targets for a given document. No such tools yet exist to support the cross-linking of documents in multiple languages. This task aims to incubate the technologies that assist CLLD and to enhance the user experience of viewing or editing documents in a cross-lingual manner. Language differences, ambiguities and other language issues such as Chinese segmentation all make this task even more challenging. Researchers with an interest in cross-lingual link discovery are welcome to join us. In particular, researchers from either the CLIR or the link discovery community are encouraged to participate in this exciting task.
Generally, a link between documents can be classified as either outgoing or incoming. In this task we mainly focus on outgoing links that start from English source documents and point to Chinese, Korean, or Japanese target documents. The CLLD task comprises the following three subtasks:
• English to Chinese CLLD
• English to Japanese CLLD
• English to Korean CLLD
Participants can choose to participate in one or more of the three subtasks. The English topics and the target corpus consist of actual Wikipedia pages in XML format with rich structured information. To submit a run for a given task, participants are required to choose the most suitable anchors in the English topic document and, for each anchor, identify the most suitable documents in the target language corpus. For each topic we allow up to 250 anchors, each with up to 5 targets, giving a total of up to 1250 outgoing links per topic.
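The per-topic limits stated above (up to 250 anchors, up to 5 targets per anchor) can be checked before a run is submitted. The following is only a minimal sketch: the `Anchor` class, its fields, and the function names are hypothetical and do not describe the official submission format or the Crosslink validation tool.

```python
from dataclasses import dataclass, field
from typing import List

# Limits taken from the task description.
MAX_ANCHORS_PER_TOPIC = 250
MAX_TARGETS_PER_ANCHOR = 5


@dataclass
class Anchor:
    """Hypothetical in-memory model of one anchor in an English topic."""
    offset: int                # assumed: character offset of the anchor text
    length: int                # assumed: length of the anchor text
    text: str
    targets: List[str] = field(default_factory=list)  # target document IDs


def check_topic_limits(anchors: List[Anchor]) -> List[str]:
    """Return a list of problems; an empty list means the limits are met."""
    problems = []
    if len(anchors) > MAX_ANCHORS_PER_TOPIC:
        problems.append(
            f"{len(anchors)} anchors exceed the limit of {MAX_ANCHORS_PER_TOPIC}"
        )
    for anchor in anchors:
        if len(anchor.targets) > MAX_TARGETS_PER_ANCHOR:
            problems.append(
                f"anchor '{anchor.text}' has {len(anchor.targets)} targets, "
                f"limit is {MAX_TARGETS_PER_ANCHOR}"
            )
    return problems


def count_links(anchors: List[Anchor]) -> int:
    """Total number of outgoing links proposed for one topic."""
    return sum(len(anchor.targets) for anchor in anchors)
```

A topic that uses every allowed slot has 250 × 5 = 1250 links, which matches the stated maximum.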
Extra task (Optional):
Once a target document has been chosen for an anchor, reading starts by default at the beginning of the article. In focused information retrieval, however, the user can be given a reading entry point other than the beginning of the target text, so that the wanted information is reached more efficiently. In the extra task, a best entry point (BEP) should therefore be specified in the target document as the place to start reading the referenced material. This is particularly useful for picking out a snippet of meaningful information in a lengthy document.
To differentiate the two, the main task is called the anchor-to-file task and the extra task is called the anchor-to-bep task.
3. Topic and Document Collection
3.1 Training and Test Topics
Training topics
#  Title          ID
1  Australia      4689264
2  Femme fatale   299098
3  Martial arts   19501
Only three topics are used for system training. Participants may use these topics to create dry runs. To download a copy of the training topics, please click here.
Test topics
A set of 25 articles will be randomly chosen from the English Wikipedia and used as test topics. Participants should use these topics to create runs for the formal final submission. Participants should orphan these topics by removing all hyperlinks to and from them. The corresponding topic pages in Chinese, Japanese and Korean should also be removed from the document collections.
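The orphaning step can be illustrated on a simplified link graph. This is a sketch under stated assumptions, not a prescribed procedure: the collection is modelled here as a plain dictionary from document ID to linked document IDs, whereas the real collections are XML files, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Set


def orphan_topics(links: Dict[str, Set[str]],
                  topic_ids: Iterable[str]) -> Dict[str, Set[str]]:
    """Remove all links to and from the topic documents.

    `links` maps a document ID to the IDs it links to (assumed model).
    A new dictionary is returned; the input is left unchanged.
    """
    topics = set(topic_ids)
    orphaned = {}
    for doc_id, targets in links.items():
        if doc_id in topics:
            # Drop every outgoing link of a topic document.
            orphaned[doc_id] = set()
        else:
            # Drop every incoming link that points at a topic document.
            orphaned[doc_id] = set(targets) - topics
    return orphaned


def drop_counterparts(collection: Dict[str, str],
                      counterpart_ids: Iterable[str]) -> Dict[str, str]:
    """Remove the target-language pages that correspond to the topics."""
    drop = set(counterpart_ids)
    return {doc_id: doc for doc_id, doc in collection.items()
            if doc_id not in drop}
```

Returning new dictionaries rather than mutating the inputs keeps the original collection available for comparison.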
3.2 Document Collection
The training and test collections for the three subtasks are exactly the same. The collections consist of search-engine-friendly XML files created from Wikipedia MySQL database dumps taken in June 2010. The details of the collections are as follows:
Language   Articles   Size   Dump date
Chinese    318,736    2.7G   27/06/2010
Japanese   716,088    6.1G   24/06/2010
Korean     201,596    1.2G   28/06/2010
4. Submission
Please refer to the Submission page for detailed information.
5. Evaluation
Please refer to the Evaluation page for assessment methods and evaluation metrics.
30/09/2010 Call for task participation
05/10/2010 Kick-off event in Tokyo
20/12/2010 Task registration due
20/01/2011 Task registration due
05/01/2011 Document set release
10/01/2011 Document set release
31/01/2011 Crosslink validation tool release
02/02/2011 Crosslink training topics release
01~05/2011 Dry run
03~07/2011 Formal run
18/05/2011 Crosslink test topics release
01/07/2011 Crosslink test topics release
15/07/2011 Crosslink evaluation tool for training release
31/07/2011 Submission deadline for Crosslink results
01/08/2011 Crosslink assessment tool release
08/08/2011 Crosslink evaluation tool with ground-truth release
21/08/2011 Crosslink evaluation tool release
22/08/2011 Evaluation results due
22/08/2011 Task overview partial release
07/09/2011 Evaluation results due
07/09/2011 Task overview partial release
20/09/2011 Participant paper submission due
20/10/2011 Task overview paper submission due
04/11/2011 All camera-ready copy for the Proceedings due
06~09/12/2011 NTCIR-9 Meeting, NII, Tokyo, Japan