Main task (anchor-to-file):
• For each topic we will allow up to 250 anchors, each anchor with up to 5 targets. So there are up to 1250 outgoing links per topic.
• up to 5 runs
Extra task (anchor-to-bep) (Optional):
If you decide to participate in the extra task, you may submit
anchor-to-bep runs only, without providing runs for the main
task, because the evaluation tool can also treat the runs of this extra
task as main task runs by automatically setting the BEP offsets to
zero (beginning of the text).
• For each topic we will allow up to 250 anchors, each anchor with up to 5 BEPs, each in a different target document. So there are up to 1250 outgoing links per topic.
• up to 5 runs
Since the topics are not orphaned documents, the link mark-ups remaining in topics are for your own reference. Participants must not use those existing links in the topic documents to locate anchors and suggest target documents.
Topic files including their CJK counterparts should be excluded from Wikipedia document collections either physically or virtually.
Please refer to Section 4.2 for an assistant tool that helps remove the CJK topic counterparts from the corpora.
Finding cross-lingual number, month-day, or year links (e.g., 1, 2, 2012, 1900s, February 24) is not a big issue in link discovery, and Wikipedia also has special rules on chronological items [1]. In order to obtain higher-quality links, recommending these kinds of links is not required in the task. If such links are included in a submission, they will be rejected and not considered in either assessment or evaluation.
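As an illustration of the rule above, a run could drop chronological anchors before writing the submission. The following is a minimal sketch only: the function name is_chronological and the regular expressions are assumptions based on the examples listed above (1, 2, 2012, 1900s, February 24), not the filter used by the assessment.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Hypothetical patterns derived from the examples in the task text.
_CHRONO_PATTERNS = [
    re.compile(r"\d+"),                      # plain numbers and years: 1, 2, 2012
    re.compile(r"\d+s"),                     # decades and centuries: 1900s
    re.compile(rf"(?:{MONTHS}) \d{{1,2}}"),  # month-day: February 24
]


def is_chronological(anchor_text: str) -> bool:
    """Return True if the whole anchor text is a number, year, decade or month-day."""
    text = anchor_text.strip()
    return any(p.fullmatch(text) for p in _CHRONO_PATTERNS)


# Keep only the anchors that are not purely chronological.
candidates = ["2012", "1900s", "February 24", "Apollo 11"]
kept = [a for a in candidates if not is_chronological(a)]
```

Because the patterns must match the entire anchor, an anchor such as "Apollo 11" that merely contains a number is kept.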
3. Result Submission
We will use XML for the output results. The specification of result
submission is similar to that of the INEX Link-the-Wiki task [2]. The formats for anchor-to-file and anchor-to-bep run submissions are exactly the same. The only difference between the two is whether the BEP is specified. In the main anchor-to-file task, the offset of the BEP should be set to zero by default; in the anchor-to-bep task, a BEP should be given and its offset in the target document should be specified.
<!ELEMENT anchor (tofile+)>
<!ATTLIST anchor
name CDATA #REQUIRED
offset CDATA #REQUIRED
length CDATA #REQUIRED
>
<!ELEMENT tofile (#PCDATA)>
<!ATTLIST tofile
bep_offset CDATA #REQUIRED
title CDATA #REQUIRED
>
The root element crosslink-submission should contain the participant's ID, the run ID (which should include your university affiliation), the task, which should be either A2F or A2B, and the default target language (zh, ja, or ko). The linking algorithm should be described in the description node. The collections element contains a list of the document collections used in the run. Generally, the collection element should contain text from one of the following: Chinese Wikipedia, Japanese Wikipedia or Korean Wikipedia. Each topic should be contained in a topic element, which should contain an anchor element for each anchor text that should be linked. Each anchor element should include offset, length and name attributes giving the details of the recommended anchor, and should also have one or more tofile sub-elements with the target document ID contained within them. The tofile element should contain the following information: the language ID, title and BEP of the linked document (specified in the lang, title, and bep_offset attributes respectively).
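The anchor part of this structure can be sketched with Python's standard xml.etree.ElementTree module. This is an illustrative sketch, not an official writer: the helper name build_anchor is an assumption, and the document ID and title in the usage lines are made-up values. Only the element and attribute names (anchor, tofile, offset, length, name, lang, title, bep_offset) come from the description above.

```python
import xml.etree.ElementTree as ET

MAX_TARGETS = 5  # each anchor may have up to 5 targets


def build_anchor(name, offset, length, targets):
    """Build one <anchor> element.

    targets is a list of (doc_id, lang, title, bep_offset) tuples;
    bep_offset stays 0 for anchor-to-file (A2F) runs.
    """
    if not 1 <= len(targets) <= MAX_TARGETS:
        raise ValueError("an anchor needs between 1 and 5 targets")
    anchor = ET.Element(
        "anchor",
        {"offset": str(offset), "length": str(length), "name": name},
    )
    for doc_id, lang, title, bep_offset in targets:
        tofile = ET.SubElement(
            anchor,
            "tofile",
            {"bep_offset": str(bep_offset), "lang": lang, "title": title},
        )
        tofile.text = str(doc_id)  # the target document ID is the element content
    return anchor


# Hypothetical values, for illustration only.
example = build_anchor("A Sample", 768, 8, [("12345", "zh", "Example title", 0)])
xml_text = ET.tostring(example, encoding="unicode")
```

ElementTree escapes attribute values on serialization, which helps keep the submission well-formed when an anchor name contains characters such as & or quotation marks.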
The position (zero-based offset) of an anchor or BEP is calculated by counting the number of bytes (NOT the number of characters) from the beginning of the entire file (the original topic file). Similarly, the length of an anchor is the number of bytes occupied by that anchor text.
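The difference between byte and character counts matters as soon as the text contains multi-byte characters. The sketch below converts character indices into a byte offset and byte length. It assumes the topic file is UTF-8 encoded and that topic_text is the exact decoded content of the whole file (read in binary mode and decoded, so no newline translation takes place); the function name byte_span is hypothetical.

```python
def byte_span(topic_text: str, char_start: int, char_end: int,
              encoding: str = "utf-8"):
    """Convert a character range [char_start, char_end) into (byte offset, byte length)."""
    offset = len(topic_text[:char_start].encode(encoding))
    length = len(topic_text[char_start:char_end].encode(encoding))
    return offset, length


# "é" occupies two bytes in UTF-8, so "link" starts at character 5 but at byte 6.
text = "café link"
span = byte_span(text, 5, 9)  # (6, 4)
```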
If an anchor has a name that doesn't match the text in the topic at the given offset and length, it will be discarded in the pooling.
It is OK to include a tag in the anchor. For example, if an anchor "A Sample <it>Anchor" is found, the name attribute of that anchor must contain only the text with the tag removed, i.e., name="A Sample Anchor".
However, an anchor containing an incomplete XML tag will also be discarded in the pooling. For example,
an incorrect anchor: "A Sample <i"
<anchor offset="768" length="11" name="A Sample">
The anchor length should be calculated in bytes over everything in the anchor extracted from the topic file, including all XML tags. The name attribute, however, should contain only the text that remains after removing all XML tags (including incomplete ones), to keep the submission XML well-formed.
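The length and name rules can be illustrated as follows. This is a sketch under assumptions: the helper names are hypothetical, the topic file is assumed to be UTF-8, and tags are removed with simple regular expressions rather than a full XML parser. Only an unterminated tag at the end of the anchor is treated as incomplete here.

```python
import re

_TAG = re.compile(r"<[^>]*>")          # a complete tag such as <it>
_PARTIAL_TAG = re.compile(r"<[^>]*$")  # an unterminated tag at the end, such as <i


def anchor_name_and_length(raw_anchor: bytes, encoding: str = "utf-8"):
    """Return (name, length): the name has tags removed, the length counts all bytes."""
    text = raw_anchor.decode(encoding)
    name = _PARTIAL_TAG.sub("", _TAG.sub("", text))
    return name, len(raw_anchor)


def has_incomplete_tag(raw_anchor: bytes, encoding: str = "utf-8") -> bool:
    """Flag anchors that end inside a tag; such anchors are discarded in the pooling."""
    text = _TAG.sub("", raw_anchor.decode(encoding))
    return _PARTIAL_TAG.search(text) is not None


name, length = anchor_name_and_length(b"A Sample <it>Anchor")
# name is "A Sample Anchor"; length is 19 because the 4 bytes of <it> are counted
```

A run generator could call has_incomplete_tag on each candidate and widen or shrink the anchor span until it no longer cuts a tag in half.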
3.2 Run Timings
Along with the submission of links, participants are asked to submit
details of the time taken to generate the results (excluding the output
to the submission file, but including the reading and parsing of the
orphan) and details of the hardware used to generate the results. Time
should be CPU time in seconds (to 2 decimal places). The details of the
computer are difficult to specify because details such as cache size differ
from machine to machine, so the details of the CPU, number of cores,
hyper-threading, and main memory are requested.
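One way to obtain CPU time in this form is Python's time.process_time(), which reports the user and system CPU time of the current process and excludes time spent sleeping. The wrapper below is a minimal sketch; the name timed_cpu and the function passed to it are placeholders, not part of the task software.

```python
import time


def timed_cpu(func, *args, **kwargs):
    """Run func and return (result, CPU seconds formatted to 2 decimal places)."""
    start = time.process_time()
    result = func(*args, **kwargs)
    elapsed = time.process_time() - start
    return result, f"{elapsed:.2f}"


# Placeholder workload standing in for reading, parsing and linking a topic.
result, cpu_seconds = timed_cpu(sum, range(1000))
```

To follow the rule above, the wrapper would be placed around the reading, parsing and linking steps only, with the writing of the submission file left outside the timed call.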
An example of run details is shown below.
Three complete sample runs (English-to-Chinese, English-to-Japanese and
English-to-Korean) using a very simple link discovery algorithm on the
three training topics were created for your reference. They can be downloaded from here. They are also included in the submission validation software package. For more information about the validation tool, please read on.
4. Submission Validation Tool
1) To ensure that participants have the anchor offset and length, and the BEP offset,
correctly specified, an assistant tool is provided for submission
validation. Click here to download the latest release. Feedback, comments and bug reports are welcome (crosslink). Please refer to Instruction.pdf contained in the package for details of usage.
The validation tool package includes three programs: CrosslinkValidation, RunChecker, xml2txt.
CrosslinkValidation: for visual validation of a submission run; the suggested anchors in the topics and their target documents are located and displayed on the screen.
RunChecker: this script helps you check whether all anchors are correctly specified with the given offset and length.
xml2txt: this tool returns the text in the input file for a given offset and length, or replaces all XML tags with spaces.
2) As stated in the rules, the English topics, including their CJK counterparts, should not be linked. Here are the scripts for removing the topic files from the document collections for this purpose. Instructions on using the scripts are given in the readme.txt.