NII Testbeds and Community for Information access Research

Crosslink 2: Submission Specification

Last Update: 12 December 2012

12/12/2012    Number of allowed runs for the English-to-CJK task changed.
07/11/2012    Submission site now open

1.    Tasks

Chinese-to-English, Japanese-to-English, and Korean-to-English subtasks:
•    Up to 5 runs
•    For each topic we will allow up to 250 anchors, each anchor with up to 5 targets. So there are up to 1250 outgoing links per topic.

English-to-Chinese, English-to-Japanese, and English-to-Korean subtasks:
•    Up to 2 runs. 
•    For each topic we will allow up to 250 anchors, each anchor with up to 5 targets. So there are up to 1250 outgoing links per topic.
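The per-topic limits above can be checked before submission. The following is a minimal sketch, not part of the official tooling; the function name and the input layout (a list of anchor name and target list pairs) are assumptions made for illustration.

```python
# Limits taken from the task description above.
MAX_ANCHORS_PER_TOPIC = 250
MAX_TARGETS_PER_ANCHOR = 5
MAX_LINKS_PER_TOPIC = MAX_ANCHORS_PER_TOPIC * MAX_TARGETS_PER_ANCHOR  # 1250


def check_topic_limits(anchors):
    """Return a list of problems for one topic; an empty list means OK.

    `anchors` is a hypothetical in-memory layout: a list of
    (anchor_name, [target_document_ids]) pairs.
    """
    problems = []
    if len(anchors) > MAX_ANCHORS_PER_TOPIC:
        problems.append(
            f"{len(anchors)} anchors exceed the limit of {MAX_ANCHORS_PER_TOPIC}"
        )
    for name, targets in anchors:
        if len(targets) > MAX_TARGETS_PER_ANCHOR:
            problems.append(
                f"anchor {name!r} has {len(targets)} targets, "
                f"limit is {MAX_TARGETS_PER_ANCHOR}"
            )
    return problems
```

An empty result means the topic stays within 250 anchors and 5 targets per anchor.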

2.    Run Naming 

For easy identification and performance comparison, each run must be assigned a unique run ID in the following format:

RunID = (Group ID)-(Source Language Code)2(Target Language Code)-A2F-(two-digit run number)-(Algorithm Acronym)

The CEJK language codes are as follows:

  • C (Chinese)
  • E (English)
  • J (Japanese)
  • K (Korean)
For example, if team SAM uses an algorithm (GDA) for its first submission to the Korean-to-English subtask, the run ID should be SAM-K2E-A2F-01-GDA, and the file name of the run should be SAM-K2E-A2F-01-GDA.xml.
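The naming format can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, and the checks that group IDs and acronyms contain no hyphens or spaces, and that exactly one side of the language pair is English, are assumptions inferred from the format and the subtask list rather than stated rules.

```python
import re

LANG_CODES = {"C", "E", "J", "K"}

# Assumed shape: no hyphens or whitespace inside the group ID or acronym.
RUN_ID_RE = re.compile(r"^[^-\s]+-[CEJK]2[CEJK]-A2F-\d{2}-[^-\s]+$")


def make_run_id(group, source, target, run_number, algorithm):
    """Build a run ID such as SAM-K2E-A2F-01-GDA."""
    if source not in LANG_CODES or target not in LANG_CODES:
        raise ValueError("language codes must be one of C, E, J, K")
    # Assumption: every subtask pairs English with one CJK language.
    if (source == "E") == (target == "E"):
        raise ValueError("exactly one of source/target should be E")
    if not 1 <= run_number <= 99:
        raise ValueError("run number must fit in two digits")
    run_id = f"{group}-{source}2{target}-A2F-{run_number:02d}-{algorithm}"
    if not RUN_ID_RE.match(run_id):
        raise ValueError(f"malformed run ID: {run_id}")
    return run_id


def run_file_name(run_id):
    """The run file is named after the run ID with an .xml extension."""
    return run_id + ".xml"
```

Calling make_run_id("SAM", "K", "E", 1, "GDA") reproduces the run ID from the example above.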

3.    Rules 

  • Please link only the main text of the topics (i.e. the text that starts after the <bdy> tag and ends before any reference or external links section, e.g. <st>References</st>). Links outside this scope will not be assessed or evaluated.
  • Chronological items such as number, month-day, or year links (e.g. 1, 2, 2012, 1900s, February 24) should not be included in your submission(s). Any such links will be rejected and will not be considered in either assessment or evaluation.
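Chronological anchors can be filtered out before a run is written. The sketch below is one possible interpretation of the rule, covering only the kinds of examples listed above and only English month names; the exact pattern used by the organizers is not given, so the regular expression and function names are assumptions. Accepting the day-before-month order (e.g. 24 February) is also an assumption.

```python
import re

MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)
_MONTH = "|".join(MONTHS)

# Matches bare numbers and years (1, 2012), decades (1900s), and
# month-day pairs in either order (February 24, 24 February).
CHRONO_RE = re.compile(
    rf"^(?:\d+s?|(?:{_MONTH})\s+\d{{1,2}}|\d{{1,2}}\s+(?:{_MONTH}))$",
    re.IGNORECASE,
)


def is_chronological(anchor_text):
    """True if the anchor text looks like a number, year, or month-day."""
    return bool(CHRONO_RE.match(anchor_text.strip()))


def drop_chronological(anchor_texts):
    """Keep only the anchors that are not chronological items."""
    return [text for text in anchor_texts if not is_chronological(text)]
```

Anchors that merely contain digits, such as "99 Luftballons", are kept because the pattern must match the whole anchor text.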
4.    Result Submission

4.1    Submission XML File DTD
<!ELEMENT crosslink-submission (details, description, collections, topic+)>
<!ATTLIST crosslink-submission
   participant-id CDATA #REQUIRED
   run-id CDATA #REQUIRED
   task (A2F) #REQUIRED
   source_lang (zh|en|ja|ko) #REQUIRED
   default_lang (zh|en|ja|ko) #REQUIRED
>

<!ELEMENT details (machine, time)>
<!ELEMENT machine (cpu, speed, cores, hyperthreads, memory)>
<!ELEMENT cpu (#PCDATA)>
<!ELEMENT speed (#PCDATA)>
<!ELEMENT cores (#PCDATA)>
<!ELEMENT hyperthreads (#PCDATA)>
<!ELEMENT memory (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT collections (collection+)>
<!ELEMENT collection (#PCDATA)>
<!ELEMENT topic (outgoing)>
<!ATTLIST topic
   file CDATA #REQUIRED
   name CDATA #REQUIRED
>

<!ELEMENT outgoing (anchor+)>

<!ELEMENT anchor (tofile+)>
<!ATTLIST anchor
   offset CDATA #REQUIRED
   length CDATA #REQUIRED
   name CDATA #REQUIRED
>
<!ELEMENT tofile (#PCDATA)>
<!ATTLIST tofile
   bep_offset CDATA #REQUIRED
   lang (zh|en|ja|ko) #REQUIRED
   title CDATA #REQUIRED
>

The root element crosslink-submission should contain the participant's ID, the run ID (which should include your university affiliation), the task (which should be A2F), the source language (zh, en, ja, or ko), and the default target language (zh, en, ja, or ko). The linking algorithm should be described in the description element. The collections element contains a list of the document collections used in the run; generally, each collection element should contain one of the following: Chinese Wikipedia, English Wikipedia, Japanese Wikipedia, or Korean Wikipedia. Each topic should be contained in a topic element, which should contain an anchor element for each anchor text that should be linked. Each anchor element should include offset, length, and name attributes describing the recommended anchor, and should have one or more tofile sub-elements, each containing a target document ID. Each tofile element should also give the language, title, and best entry point (BEP) of the linked document, in the lang, title, and bep_offset attributes respectively.

  • The position (zero-based offset) of an anchor is calculated by counting the number of bytes (NOT the number of characters) from the beginning of the entire file (the original topic file). Similarly, the length of an anchor is the number of bytes occupied by that anchor text.

  • If an anchor's name does not match the text in the topic at the given offset and length, the anchor will be discarded during pooling.

  • It is OK to include tag(s) inside an anchor, but the tag(s) must be removed from the anchor's name attribute. For example, if the anchor "A Sample <it>Anchor" is found, it should be specified as:
          <anchor offset="768" length="19" name="A Sample Anchor">  

However, if an anchor contains an incomplete XML tag, it will also be discarded during pooling. For example, the following specification of the anchor "A Sample <i" is incorrect, even though the anchor name is given correctly:

          <anchor offset="768" length="11" name="A Sample">

To sum up, the anchor length should be calculated in bytes over the whole anchor as extracted from the topic file, including all XML tags, but the name attribute of an anchor should contain only the text that remains after all XML tags (including incomplete ones) are removed, because of XML well-formedness requirements.
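The byte-based offset, length, and name rules can be illustrated with a short Python sketch. It assumes the topic file is UTF-8 encoded, which the specification does not state; the function names are hypothetical, and this is not the code used in pooling.

```python
import re

# Matches complete tags only; an unterminated "<i" is left in place.
TAG_RE = re.compile(r"<[^<>]*>")


def anchor_spec(topic_bytes, anchor_markup, start=0):
    """Return (offset, length, name) for an anchor as it appears in the file.

    `anchor_markup` is the anchor exactly as extracted, inner tags included.
    Offset and length are counted in bytes, not characters.
    """
    name = TAG_RE.sub("", anchor_markup)
    if "<" in name or ">" in name:
        raise ValueError("anchor contains an incomplete XML tag")
    raw = anchor_markup.encode("utf-8")  # assumption: UTF-8 topic files
    offset = topic_bytes.find(raw, start)
    if offset < 0:
        raise ValueError("anchor not found in topic file")
    return offset, len(raw), name


def name_matches(topic_bytes, offset, length, name):
    """Check that the bytes at (offset, length) reduce to `name`."""
    try:
        text = topic_bytes[offset:offset + length].decode("utf-8")
    except UnicodeDecodeError:
        return False  # the span cuts through a multi-byte character
    return TAG_RE.sub("", text) == name


# Two CJK characters before the anchor: 8 characters but 12 bytes.
topic = "<bdy>気球 A Sample <it>Anchor</it> text</bdy>".encode("utf-8")
spec = anchor_spec(topic, "A Sample <it>Anchor")
```

Here spec is (12, 19, "A Sample Anchor"): the offset counts 3 bytes for each of the two CJK characters, and the length of 19 includes the 4 bytes of the <it> tag that the name omits.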

4.2    Run Timings
Along with their links, participants are asked to submit the time taken to generate the results (excluding output to the submission file, but including the reading and parsing of the orphan) and details of the hardware used to generate the results. Time should be given as CPU time in seconds (to 2 decimal places). A computer is difficult to specify completely because details such as cache size differ from machine to machine, so only the CPU, number of cores, hyper-threading, and main memory are requested.
An example of run details is shown below.
      <cpu>Intel Celeron</cpu>
   <time>3.04 seconds</time>
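One way to measure CPU time and format it as in the example above is sketched below, using Python's standard time.process_time. The wrapper names are hypothetical, and what the wrapped function covers (reading and parsing the topic, generating links, but not writing the run file) is left to the caller.

```python
import time


def timed_cpu_seconds(fn, *args, **kwargs):
    """Run fn and return (result, CPU seconds used by this process).

    time.process_time() counts user plus system CPU time of the current
    process and excludes time spent sleeping.
    """
    start = time.process_time()
    result = fn(*args, **kwargs)
    elapsed = time.process_time() - start
    return result, elapsed


def format_time_element(cpu_seconds):
    """Render the time element with two decimal places."""
    return f"<time>{cpu_seconds:.2f} seconds</time>"
```

Wrapping the link-generation step in timed_cpu_seconds and passing the elapsed value to format_time_element yields a line in the same shape as the example.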

4.3    Submission Samples

A sample of an English-to-Chinese submission is given below:

<crosslink-submission participant-id="QUT"
         <cpu>Intel Celeron</cpu>
      <time>3.04 seconds</time>
   <description>Describe the approach here, NOT in the run-id.</description>
      <collection>Chinese Wikipedia</collection>
   <topic file="9638" name=" 99 Luftballons">
   <anchor offset="768" length="8" name="Balloons">
      <tofile bep_offset="0" lang="zh" title="气球">4424</tofile>
      <tofile bep_offset="0" lang="zh" title="气球炸弹">442489</tofile>
      <tofile bep_offset="0" lang="zh" title="气球男孩事件">64424</tofile>
      <tofile bep_offset="0" lang="zh" title="红气球之旅">14424</tofile>
      <tofile bep_offset="0" lang="zh" title="猫与气球">43224</tofile>
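A complete, well-formed submission file can be produced with Python's standard xml.etree.ElementTree, which also takes care of attribute quoting and escaping. The sketch below follows the element nesting of the DTD in section 4.1 and the attribute names used in this section; the function name, the input dictionaries, the run ID QUT-E2C-A2F-01-DEMO, and the PLACEHOLDER machine values are all hypothetical.

```python
import xml.etree.ElementTree as ET

MACHINE_FIELDS = ("cpu", "speed", "cores", "hyperthreads", "memory")


def build_submission(root_attrs, machine, time_text, description,
                     collections, topics):
    """Return the submission as an XML string (hypothetical helper)."""
    root = ET.Element("crosslink-submission", root_attrs)
    details = ET.SubElement(root, "details")
    machine_el = ET.SubElement(details, "machine")
    for field in MACHINE_FIELDS:
        ET.SubElement(machine_el, field).text = machine[field]
    ET.SubElement(details, "time").text = time_text
    ET.SubElement(root, "description").text = description
    collections_el = ET.SubElement(root, "collections")
    for collection in collections:
        ET.SubElement(collections_el, "collection").text = collection
    for topic in topics:
        topic_el = ET.SubElement(
            root, "topic", {"file": topic["file"], "name": topic["name"]})
        outgoing = ET.SubElement(topic_el, "outgoing")
        for anchor in topic["anchors"]:
            anchor_el = ET.SubElement(outgoing, "anchor", {
                "offset": str(anchor["offset"]),
                "length": str(anchor["length"]),
                "name": anchor["name"],
            })
            for target in anchor["targets"]:
                tofile = ET.SubElement(anchor_el, "tofile", {
                    "bep_offset": str(target["bep_offset"]),
                    "lang": target["lang"],
                    "title": target["title"],
                })
                tofile.text = target["id"]
    return ET.tostring(root, encoding="unicode")


example_xml = build_submission(
    root_attrs={"participant-id": "QUT",
                "run-id": "QUT-E2C-A2F-01-DEMO",  # hypothetical run ID
                "task": "A2F", "source_lang": "en", "default_lang": "zh"},
    machine={"cpu": "Intel Celeron", "speed": "PLACEHOLDER", "cores": "PLACEHOLDER",
             "hyperthreads": "PLACEHOLDER", "memory": "PLACEHOLDER"},
    time_text="3.04 seconds",
    description="Describe the approach here, NOT in the run-id.",
    collections=["Chinese Wikipedia"],
    topics=[{"file": "9638", "name": "99 Luftballons", "anchors": [{
        "offset": 768, "length": 8, "name": "Balloons",
        "targets": [
            {"bep_offset": 0, "lang": "zh", "title": "气球", "id": "4424"},
            {"bep_offset": 0, "lang": "zh", "title": "气球炸弹", "id": "442489"},
        ]}]}],
)
```

To write the run to disk with an XML declaration, ET.ElementTree(ET.fromstring(example_xml)).write(path, encoding="utf-8", xml_declaration=True) is one option.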

5.    Submission Validation Tool

To help participants specify anchor offsets and lengths correctly, a submission validation tool is provided. The latest release is available for download. Feedback, comments, and bug reports are welcome (crosslink). Please refer to Instruction.pdf in the package for usage details.

The validation tool package includes three programs: CrosslinkValidation, RunChecker, and xml2txt.

  • CrosslinkValidation
For visual validation of your submission(s): the suggested anchors in the topics and their target documents are located and displayed on the screen.
  • RunChecker
This script helps you check whether all anchors are correctly specified with the given offset and length.
  • xml2txt
This tool returns the text in the input file for a given offset and length, or replaces all XML tags with spaces.
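For readers without access to the package, the two behaviours described for xml2txt can be approximated in Python. This is not the tool's implementation: the function names are hypothetical, and replacing each tag with the same number of spaces (so that byte offsets stay valid) is an assumption about what "replace all XML tags with spaces" means.

```python
import re

# Operates on bytes because offsets and lengths are byte-based.
TAG_BYTES_RE = re.compile(rb"<[^<>]*>")


def text_at(topic_bytes, offset, length):
    """Return the text found at the given byte offset and length."""
    return topic_bytes[offset:offset + length].decode("utf-8", errors="replace")


def blank_tags(topic_bytes):
    """Replace every complete tag with spaces of equal byte length."""
    return TAG_BYTES_RE.sub(lambda m: b" " * len(m.group(0)), topic_bytes)


sample = b"<bdy>A Sample <it>Anchor</it></bdy>"
blanked = blank_tags(sample)
```

Because blanked has the same length as sample, an offset and length computed on the original file still point at the same anchor text after the tags are blanked.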

6.    Run Submission

To submit your runs when they are ready, please go to the submission site, http://www.inex.otago.ac.nz/crosslink, and follow the instructions. After submission, please click on "View Runs" to make sure all your runs have been successfully accepted.