NTCIR

NII Testbeds and Community for Information access Research

CrossLingual Link Discovery: Submission

Last Update: 29 July 2011

ChangeLog:
31/05/2011    A new Crosslink validation tool (version 2.2) released
20/05/2011    Scripts for topic removal in corpora released
23/03/2011    A new set of sample runs released
07/03/2011    An English-to-Chinese sample run released
07/03/2011    A new Crosslink validation tool (version 2.1) released
 
1.    Tasks
Main task (anchor-to-file):
•    For each topic we will allow up to 250 anchors, each anchor with up to 5 targets, so there are up to 1250 outgoing links per topic.
•    Up to 5 runs may be submitted.

Extra task (anchor-to-bep) (Optional):

If you decide to participate in the extra task, you may submit anchor-to-bep runs only, without providing runs for the main task. The evaluation tool can also treat the runs of this extra task as main task runs by automatically setting the BEP offsets to zero (the beginning of the text).

•    For each topic we will allow up to 250 anchors, each anchor with up to 5 BEPs, each in a different target document, so there are up to 1250 outgoing links per topic.
•    Up to 5 runs may be submitted.
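The per-topic limits above can be checked before a run is written out. The following is a minimal sketch in Python, not part of the official validation tool. It assumes a hypothetical in-memory layout in which each topic is a dictionary mapping an anchor key, here an (offset, length) pair, to the list of target document IDs suggested for that anchor.

```python
MAX_ANCHORS_PER_TOPIC = 250
MAX_TARGETS_PER_ANCHOR = 5


def check_topic_limits(anchors):
    """Return a list of human-readable problems for one topic.

    `anchors` is a hypothetical structure: {(offset, length): [doc_id, ...]}.
    An empty list means the topic respects the limits stated above.
    """
    problems = []
    if len(anchors) > MAX_ANCHORS_PER_TOPIC:
        problems.append(
            f"{len(anchors)} anchors exceed the limit of {MAX_ANCHORS_PER_TOPIC}"
        )
    for key, targets in anchors.items():
        if len(targets) > MAX_TARGETS_PER_ANCHOR:
            problems.append(
                f"anchor {key}: {len(targets)} targets exceed the limit of "
                f"{MAX_TARGETS_PER_ANCHOR}"
            )
        # Each BEP of an anchor must be in a different target document.
        if len(set(targets)) != len(targets):
            problems.append(f"anchor {key}: duplicate target documents")
    return problems
```

A topic that passes both limits can contain at most 250 × 5 = 1250 outgoing links, which matches the figure given above.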
 
2.    Rules
  • Since the topics are not orphaned documents, the link mark-ups remaining in topics are for your own reference. Participants must not use those existing links in the topic documents to locate anchors and suggest target documents.

  • Topic files including their CJK counterparts should be excluded from Wikipedia document collections either physically or virtually.
Please refer to item 2) of Section 4 for a helper tool that removes the CJK topic counterparts from the corpora.
  • Finding cross-lingual number, month-day, or year links (e.g. 1, 2, 2012, 1900s, February 24) is not a big issue in link discovery, and Wikipedia also has special rules on chronological items [1]. To keep the focus on high-quality links, recommending these kinds of links is not required in the task. If such links are included in a submission, they will be rejected and not considered in either assessment or evaluation.
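Participants may want to drop number, month-day, and year anchors before writing a run. The sketch below is one possible filter in Python. The exact patterns the organizers use to reject such links are not specified here, so the regular expression is an assumption that only covers the kinds of examples listed above, and `is_chronological` is a hypothetical helper name.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|August|"
    "September|October|November|December"
)

# Assumed patterns: plain numbers and years ("1", "2012"), decades ("1900s"),
# and month-day pairs in either order ("February 24", "24 February").
CHRONOLOGICAL = re.compile(
    rf"^(?:\d+s?|(?:{MONTHS}) \d{{1,2}}|\d{{1,2}} (?:{MONTHS}))$",
    re.IGNORECASE,
)


def is_chronological(anchor_text):
    """Return True if the anchor text looks like a number, year, or date."""
    return CHRONOLOGICAL.match(anchor_text.strip()) is not None
```

Anchors for which `is_chronological` returns True would simply be skipped when the run is generated.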
 
3.    Result Submission
We will use XML for the output results. The specification of result submission is similar to that of the INEX Link-the-Wiki task [2]. The formats for anchor-to-file and anchor-to-bep run submissions are exactly the same; the only difference is whether the BEP is specified. In the main anchor-to-file task, the BEP offset should be set to zero by default; in the anchor-to-bep task, a BEP should be given and its offset in the target document should be specified.

3.1    Submission XML File DTD
<!ELEMENT crosslink-submission (details, description, collections, topic+)>
<!ATTLIST crosslink-submission
   participant-id CDATA #REQUIRED
   run-id CDATA #REQUIRED
   task (A2F|A2B) #REQUIRED
   default_lang (zh|ja|ko) #REQUIRED
>
<!ELEMENT details (machine, time)>
<!ELEMENT machine (cpu, speed, cores, hyperthreads, memory)>
<!ELEMENT cpu (#PCDATA)>
<!ELEMENT speed (#PCDATA)>
<!ELEMENT cores (#PCDATA)>
<!ELEMENT hyperthreads (#PCDATA)>
<!ELEMENT memory (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT collections (collection+)>
<!ELEMENT collection (#PCDATA)>
<!ELEMENT topic (outgoing)>
<!ATTLIST topic
   file CDATA #REQUIRED
   name CDATA #REQUIRED
>

<!ELEMENT outgoing (anchor+)>

<!ELEMENT anchor (tofile+)>
<!ATTLIST anchor
   name CDATA #REQUIRED
   offset CDATA #REQUIRED
   length CDATA #REQUIRED
>
<!ELEMENT tofile (#PCDATA)>
<!ATTLIST tofile
   bep_offset CDATA #REQUIRED
   lang (zh|ja|ko) #REQUIRED
   title CDATA #REQUIRED
>

The root element crosslink-submission should carry the participant's ID, the run ID (which should include your university affiliation), the task (either A2F or A2B), and the default target language (zh, ja, or ko). The linking algorithm should be described in the description element. The collections element contains a list of the document collections used in the run. Generally, each collection element should contain one of the following: Chinese Wikipedia, Japanese Wikipedia, or Korean Wikipedia. Each topic should be contained in a topic element, which should contain an anchor element for each anchor text that should be linked. Each anchor element should include offset, length, and name attributes describing the recommended anchor, and should have one or more tofile sub-elements containing the target document ID. Each tofile element should give the language ID, title, and BEP of the linked document in its lang, title, and bep_offset attributes respectively.
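The element structure described above can be generated with a standard XML library rather than by string concatenation, which also takes care of attribute escaping. The following Python sketch uses `xml.etree.ElementTree` from the standard library. The function name `build_submission` and the shape of its arguments are assumptions made for illustration, and the sketch handles a single topic and a single collection only.

```python
import xml.etree.ElementTree as ET


def build_submission(run, machine, cpu_seconds, topic_file, topic_name, anchors):
    """Build a one-topic submission and return it as an XML string.

    Hypothetical argument shapes:
      run     : dict with participant_id, run_id, task, lang,
                description, collection
      machine : dict with cpu, speed, cores, hyperthreads, memory
      anchors : list of (offset, length, name, targets), where targets is
                a list of (doc_id, title, bep_offset)
    """
    root = ET.Element("crosslink-submission", {
        "participant-id": run["participant_id"],
        "run-id": run["run_id"],
        "task": run["task"],
        "default_lang": run["lang"],
    })
    details = ET.SubElement(root, "details")
    machine_el = ET.SubElement(details, "machine")
    for tag in ("cpu", "speed", "cores", "hyperthreads", "memory"):
        ET.SubElement(machine_el, tag).text = machine[tag]
    ET.SubElement(details, "time").text = f"{cpu_seconds:.2f} seconds"
    ET.SubElement(root, "description").text = run["description"]
    collections = ET.SubElement(root, "collections")
    ET.SubElement(collections, "collection").text = run["collection"]

    topic = ET.SubElement(root, "topic", {"file": topic_file, "name": topic_name})
    outgoing = ET.SubElement(topic, "outgoing")
    for offset, length, name, targets in anchors:
        anchor = ET.SubElement(outgoing, "anchor", {
            "name": name, "offset": str(offset), "length": str(length),
        })
        for doc_id, title, bep_offset in targets:
            tofile = ET.SubElement(anchor, "tofile", {
                "bep_offset": str(bep_offset), "lang": run["lang"], "title": title,
            })
            tofile.text = doc_id  # the target document ID is the element text
    return ET.tostring(root, encoding="unicode")
```

The returned string can be written to a UTF-8 file; for an anchor-to-file run, every `bep_offset` passed in would be 0.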

CAUTION:
  • The position (zero-based offset) of an anchor or BEP is calculated by counting the number of bytes (NOT the number of characters) from the beginning of the entire file (the original topic file). Similarly, the length of an anchor is the number of bytes occupied by that anchor text.

  • If an anchor has a name that doesn't match the text in the topic at the given offset and length, it will be discarded in the pooling.

  • It is OK to include a tag in the anchor. For example, if an anchor "A Sample <it>Anchor" is found, the anchor element must look like the following:
          <anchor offset="768" length="19" name="A Sample Anchor">
However, an anchor containing an incomplete XML tag will also be discarded in the pooling. For example,
          an incorrect anchor: "A Sample <i"
          <anchor offset="768" length="11" name="A Sample">

The anchor length should be calculated by counting every byte of the extracted anchor, including all XML tags it contains in the topic file. The name attribute, however, should contain only the text left after removing all XML tags (including incomplete ones), because of the XML well-formedness requirement.
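Because offsets and lengths are counted in bytes, they should be computed on the raw bytes of the topic file rather than on a decoded string. The Python sketch below illustrates this under simplifying assumptions: the topic file is UTF-8, the anchor is located by a plain byte search, and only complete tags are stripped for the name attribute. `describe_anchor` is a hypothetical helper, not part of the validation tool.

```python
import re

# Matches complete tags only, e.g. b"<it>" or b"</it>".
COMPLETE_TAG = re.compile(rb"<[^>]*>")


def describe_anchor(topic_bytes, anchor_bytes, start=0):
    """Return (offset, length, name) for an anchor found in a topic file.

    offset : zero-based byte position from the beginning of the file
    length : number of bytes in the anchor, including any tags inside it
    name   : the anchor text with complete tags removed
    """
    offset = topic_bytes.find(anchor_bytes, start)
    if offset < 0:
        raise ValueError("anchor not found in topic")
    length = len(anchor_bytes)
    name = COMPLETE_TAG.sub(b"", anchor_bytes).decode("utf-8")
    return offset, length, name


# Usage: read the topic in binary mode so that no decoding shifts the offsets.
# with open("9638.xml", "rb") as handle:   # hypothetical file name
#     topic_bytes = handle.read()
```

For the anchor "A Sample <it>Anchor" this yields a length of 19 and the name "A Sample Anchor", in line with the example above. A multi-byte character before the anchor moves the byte offset further than the character offset, which is why the file should not be decoded first.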

3.2    Run Timings
Along with the submission of links, participants are asked to submit details of the time taken to generate the results (excluding the output to the submission file, but including the reading and parsing of the orphan) and details of the hardware used to generate the results. Time should be CPU time in seconds (to 2 decimal places). The details of the computer are difficult to specify precisely, because details such as cache size differ from machine to machine, so only the CPU, number of cores, hyper-threading, and main memory are requested.
An example of run details is shown below.
<details>
   <machine>
      <cpu>Intel Celeron</cpu>
      <speed>1.06GHz</speed>
      <cores>1</cores>
      <hyperthreads>1</hyperthreads>
      <memory>128MB</memory>
   </machine>
   <time>3.04 seconds</time>
</details>
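One way to obtain the CPU time for the time element is sketched below in Python, using `time.process_time` from the standard library, which reports the user and system CPU time of the current process. The wrapper `timed_cpu` is a hypothetical helper; it assumes that the whole link discovery step runs inside a single function call in a single process, and that writing the submission file happens outside that call.

```python
import time


def timed_cpu(func, *args, **kwargs):
    """Run func and return (result, formatted CPU time).

    The time is CPU time in seconds, formatted to 2 decimal places
    in the same style as the <time> element in the example above.
    """
    start = time.process_time()
    result = func(*args, **kwargs)
    elapsed = time.process_time() - start
    return result, f"{elapsed:.2f} seconds"
```

Work done in child processes is not counted by `time.process_time`, so a multi-process system would need a different measurement.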


3.3    Submission Samples
ANCHOR-TO-FILE SAMPLE:
English-to-Chinese
<outgoing>
   <anchor offset="768" length="8" name="Balloons">
      <tofile bep_offset="0" lang="zh" title="气球">4424</tofile>
      <tofile bep_offset="0" lang="zh" title="气球炸弹">442489</tofile>
      <tofile bep_offset="0" lang="zh" title="气球男孩事件">64424</tofile>
      <tofile bep_offset="0" lang="zh" title="红气球之旅">14424</tofile>
      <tofile bep_offset="0" lang="zh" title="猫与气球">43224</tofile>
   </anchor>
   ...
</outgoing>


ANCHOR-TO-BEP SAMPLE:
English-to-Chinese
<outgoing>
   <anchor offset="768" length="8" name="Balloons">
      <tofile bep_offset="637" lang="zh" title="气球">4424</tofile>
      <tofile bep_offset="238343" lang="zh" title="气球炸弹">442489</tofile>
      <tofile bep_offset="23438" lang="zh" title="气球男孩事件">64424</tofile>
      <tofile bep_offset="8997" lang="zh" title="红气球之旅">14424</tofile>
      <tofile bep_offset="334" lang="zh" title="猫与气球">43224</tofile>
   </anchor>
   ...
</outgoing>
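The two samples differ only in their bep_offset values, so a run can be checked against its declared task mechanically. The Python sketch below is an illustration rather than the official check: it assumes the run is available as an XML string with straight ASCII quotes around attribute values, and `bep_offsets_match_task` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def bep_offsets_match_task(xml_text):
    """Return True if the bep_offset values are consistent with the task.

    A2F runs: every bep_offset must be 0.
    A2B runs: every bep_offset must be a non-negative byte offset.
    """
    root = ET.fromstring(xml_text)
    task = root.get("task")
    offsets = [int(tofile.get("bep_offset")) for tofile in root.iter("tofile")]
    if task == "A2F":
        return all(offset == 0 for offset in offsets)
    return all(offset >= 0 for offset in offsets)
```

Typographic quotes around attribute values make a file ill-formed XML, in which case `ET.fromstring` raises `ParseError` before any check is made.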


An example of an English-to-Chinese submission is given below:

<crosslink-submission participant-id="QUT"
   run-id="QUT_E2C_A2B_01"
   task="A2B"
   default_lang="zh">
   <details>
      <machine>
         <cpu>Intel Celeron</cpu>
         <speed>1.06GHz</speed>
         <cores>1</cores>
         <hyperthreads>1</hyperthreads>
         <memory>128MB</memory>
      </machine>
      <time>3.04 seconds</time>
   </details>
   <description>Describe the approach here, NOT in the run-id.</description>
   <collections>
      <collection>Chinese Wikipedia</collection>
   </collections>
   <topic file="9638" name="99 Luftballons">
      <outgoing>
         <anchor offset="768" length="8" name="Balloons">
            <tofile bep_offset="637" lang="zh" title="气球">4424</tofile>
            <tofile bep_offset="238343" lang="zh" title="气球炸弹">442489</tofile>
            <tofile bep_offset="23438" lang="zh" title="气球男孩事件">64424</tofile>
            <tofile bep_offset="8997" lang="zh" title="红气球之旅">14424</tofile>
            <tofile bep_offset="334" lang="zh" title="猫与气球">43224</tofile>
         </anchor>
         ...
      </outgoing>
   </topic>
</crosslink-submission>

Three complete sample runs (English-to-Chinese, English-to-Japanese and English-to-Korean), created with a very simple link discovery algorithm on the three training topics, are provided for your reference. They can be downloaded from here. They are also included in the submission validation software package. For more information about the validation tool, please read on.

4.    Submission Validation Tool

1) To ensure that participants have the anchor offset, anchor length, and BEP offset correctly specified, a helper tool is provided for submission validation. Click here to download the latest release. Feedback, comments, and bug reports are welcome (crosslink). Please refer to Instruction.pdf contained in the package for usage details.

The validation tool package includes three programs: CrosslinkValidation, RunChecker, and xml2txt.

  • CrosslinkValidation
For visual validation of a submission run: the suggested anchors in the topics and their target documents are located and displayed on the screen.
  • RunChecker
This script checks whether all anchors are correctly specified with the given offset and length.
  • xml2txt
This tool returns the text in the input file at the given offset and length, or replaces all XML tags with spaces.
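For participants who want a quick self-check without the package, the ideas behind these tools can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the implementation of RunChecker or xml2txt: it assumes UTF-8 topic files, treats anything between "<" and ">" as a tag, and replaces each tag with the same number of spaces so that byte offsets stay unchanged. All three function names are hypothetical.

```python
import re

TAG = re.compile(rb"<[^>]*>")


def blank_tags(data):
    """Replace every complete tag with spaces of the same byte length."""
    return TAG.sub(lambda match: b" " * len(match.group()), data)


def text_at(data, offset, length):
    """Return the bytes in [offset, offset + length) decoded as UTF-8."""
    return data[offset:offset + length].decode("utf-8", errors="replace")


def anchor_matches(data, offset, length, name):
    """Check that the tag-free text at the given byte span equals name."""
    span = data[offset:offset + length]
    return TAG.sub(b"", span).decode("utf-8", errors="replace") == name
```

Running `anchor_matches` over every anchor of a run, with the topic file read in binary mode, gives an early warning about anchors that would be discarded in the pooling because their name and span disagree.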


2) As stated in the rules, the English topics, including their CJK counterparts, should not be linked. The scripts for removing the topic files from the document collections for this purpose are available here. Instructions on using the scripts are given in readme.txt.


5.    References
[1]  Wikipedia:Manual of Style (linking). http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style_(linking)#Specific_cases
[2]  INEX 2010 Link-The-Wiki Track. http://www.inex.otago.ac.nz/tracks/wiki-link/wiki-link.asp