NTCIR

NII Testbeds and Community for Information access Research

Automatic Evaluation Procedure at NTCIR-10


Tools

The following tools were used for the automatic evaluation.

For all subtasks

For EJ subtask

  • MeCab: version 0.994
    http://code.google.com/p/mecab/downloads/detail?name=mecab-0.994.tar.gz
  • Dictionary for MeCab: mecab-ipadic-2.7.0-20070801.tar.gz (installed with the --with-charset=utf8 option)
    http://sourceforge.net/projects/mecab/files/mecab-ipadic/

In the commands below, ${pathName} denotes the directory containing all the scripts.

Procedure for CE and JE subtasks

Character normalization and entity conversion for English sentences were performed as follows:
cat e.txt | \
${pathName}/z2h-utf8.pl | \
${pathName}/convert_uspto_xml_entities.pl -f ${pathName}/USPTO-ST32-US-Grant-025xml.ent | \
${pathName}/z2h-utf8.pl > e.standardized.txt
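
As a rough illustration of the normalization step, the following Python sketch maps full-width (multibyte) ASCII variants to their single-byte forms. It is not the actual z2h-utf8.pl script, whose exact character mapping is not shown here; the function name z2h and the choice to map only U+FF01 to U+FF5E and the ideographic space are assumptions made for this example. Entity conversion is not covered.

```python
# Hypothetical stand-in for z2h-utf8.pl; the real script's mapping may differ.
FULLWIDTH_START, FULLWIDTH_END = 0xFF01, 0xFF5E  # full-width '!' .. '~'
OFFSET = 0xFEE0  # distance between a full-width form and its ASCII counterpart
IDEOGRAPHIC_SPACE = 0x3000


def z2h(text: str) -> str:
    """Convert full-width ASCII variants to single-byte characters.

    All other characters (kana, kanji, plain ASCII) are left unchanged.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if FULLWIDTH_START <= code <= FULLWIDTH_END:
            out.append(chr(code - OFFSET))
        elif code == IDEOGRAPHIC_SPACE:
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)


print(z2h("ＡＢＣ　１２３"))  # prints: ABC 123
```

Applying the function twice gives the same result as applying it once, so running the normalization both before and after the entity conversion, as in the pipeline above, is harmless for text that is already single-byte.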

We used "mteval-v13a.pl" for tokenization and for calculating the BLEU and NIST scores. For the RIBES score, we tokenized the sentences with the tokenization function of "mteval-v13a.pl" and calculated the score with "RIBES.py". All scores are case sensitive. Apart from the case-sensitivity setting, the default parameters of the tools were used.

Procedure for EJ subtask

We used the following tokenization:
(1) Removing all single-byte white spaces from the submitted files.
(2) Converting single-byte letters, numbers, and special symbols into multibyte characters for standardization purposes.
(3) Tokenizing all Japanese sentences with MeCab (version 0.994) and mecab-ipadic-2.7.0-20070801 encoded in UTF-8. We concatenated each sequence of Arabic numerals into a single word.
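
Steps (1) and (2) can be sketched in Python as below. This is only an illustration under stated assumptions: the helper names remove_ascii_spaces and h2z are hypothetical, and the mapping (printable ASCII U+0021 to U+007E shifted to the full-width block) is an assumed approximation of what the conversion script does. Step (3) requires MeCab and is not reproduced here.

```python
OFFSET = 0xFEE0  # distance between an ASCII character and its full-width form


def remove_ascii_spaces(text: str) -> str:
    """Step (1): delete single-byte spaces; other whitespace is untouched."""
    return text.replace(" ", "")


def h2z(text: str) -> str:
    """Step (2): convert printable ASCII (letters, digits, symbols) to full-width.

    Assumed mapping for illustration; the actual script may treat some
    symbols differently.
    """
    return "".join(
        chr(ord(ch) + OFFSET) if 0x21 <= ord(ch) <= 0x7E else ch
        for ch in text
    )


line = "図 1 は ABC-2 を示す。"
print(h2z(remove_ascii_spaces(line)))  # prints: 図１はＡＢＣ－２を示す。
```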

More specifically, we standardized and tokenized the sentences with the procedure below, and then used "mteval-v13a.pl" to calculate the BLEU and NIST scores and "RIBES.py" to calculate the RIBES score. The default parameters of the tools were used.
Character normalization and tokenization for Japanese sentences were performed as follows:
cat j.txt | \
perl -pe 's/ +//g;' | \
${pathName}/z2h-utf8.pl | \
${pathName}/convert_uspto_xml_entities.pl -f ${pathName}/USPTO-ST32-US-Grant-025xml.ent | \
${pathName}/h2z-utf8-without-space.pl | \
mecab -O wakati | \
perl -Mencoding=utf8 -pe 'while(s/([0-9]) ([0-9])/$1$2/g){} s/ $//;' > j.tok.txt
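
The final Perl step joins digits that the tokenizer separated and removes the trailing space left at the end of each line. A minimal Python sketch of the same idea follows; the function name join_digit_runs is hypothetical, and the character class [0-9] is copied as written in the command above. Whether the digits reaching this step are single-byte or multibyte depends on the earlier conversion scripts, which are not shown, so treat the sample input as an assumption.

```python
import re


def join_digit_runs(tokenized: str) -> str:
    """Merge space-separated digits into one token and drop one trailing space.

    The Perl command repeats its substitution in a loop because each pass
    consumes the digit to the right of a match. Lookarounds do not consume
    the digits, so a single pass is enough here.
    """
    joined = re.sub(r"(?<=[0-9]) (?=[0-9])", "", tokenized)
    return re.sub(r" $", "", joined)


print(join_digit_runs("特許 1 2 3 号 "))  # prints: 特許 123 号
```

Spaces between a digit and a non-digit token are kept, so only runs of numerals are merged.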