NTCIR

NII Testbeds and Community for Information access Research

Data preparation for the CE subtask


We assume the following:
The text files are in the corpus.org directory.
train-all.zh.txt is training data in Chinese (UTF-8).
train-all.en.txt is training data in English.
dev.*.txt is development data.
test.*.txt is test data.
These files contain only sentences and do not contain ID information.
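Because the files carry no ID information, the sentence pairs are aligned only by line position. A minimal sketch of a sanity check, assuming a hypothetical helper named check_parallel (not part of the baseline scripts), compares the line counts of two files before any further processing:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the baseline scripts): verify that two
# files have the same number of lines, since pairs are matched by position.
check_parallel() {
  # usage: check_parallel FILE_A FILE_B
  local a=$1 b=$2 na nb
  na=$(wc -l < "$a") || return 2   # return 2 if a file cannot be read
  nb=$(wc -l < "$b") || return 2
  na=$((na)); nb=$((nb))           # drop padding that some wc versions add
  if [ "$na" -ne "$nb" ]; then
    echo "line count mismatch: $a=$na $b=$nb" >&2
    return 1
  fi
  echo "$na"                       # print the shared line count
}
```

For example, `check_parallel corpus.org/train-all.zh.txt corpus.org/train-all.en.txt` prints the shared line count, or returns a non-zero status if the two files differ in length.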

export SCRIPTS_ROOTDIR=${pathName}/moses-scripts/scripts-XXXXXXXX-XXXX
export SCRIPTS_DIR=${pathName}/scripts
export PATH_CHINESE_SEG=${pathName}/stanford-chinese-segmenter-2008-05-21
mkdir corpus.tok
cd corpus.tok

Training data

${PATH_CHINESE_SEG}/segment.sh ctb ../corpus.org/train-all.zh.txt UTF-8 0 | \
perl -Mencoding=utf8 -pe 'tr/[]/［］/; s/\|/｜/g;' > train-all.tok.lower.zh

cat ../corpus.org/train-all.en.txt | ${SCRIPTS_DIR}/tokenizer.perl -l en | perl -pe 's/\|/｜/g;' > train-all.tok.en

cat train-all.tok.en | perl -pe '$_=lc($_);' > train-all.tok.lower.en

  • Filter out long sentences (keep sentence pairs with 1 to 40 tokens on each side).

${SCRIPTS_ROOTDIR}/training/clean-corpus-n.perl train-all.tok.lower zh en train-all.clean1-40 1 40
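To illustrate what the length filter does, here is a rough sketch in shell and awk. It is not the Moses script: the function name filter_by_length is hypothetical, the argument order simply mirrors the command above, and any additional checks the real script performs are omitted. It keeps a sentence pair only when both sides have between MIN and MAX whitespace-separated tokens.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of length-based corpus filtering (not clean-corpus-n.perl).
# Reads CORPUS.L1 and CORPUS.L2, writes OUT.L1 and OUT.L2.
# Assumes that sentences contain no tab characters.
filter_by_length() {
  # usage: filter_by_length CORPUS L1 L2 OUT MIN MAX
  local corpus=$1 l1=$2 l2=$3 out=$4 min=$5 max=$6
  : > "$out.$l1"   # make sure both outputs exist even if nothing passes
  : > "$out.$l2"
  paste "$corpus.$l1" "$corpus.$l2" |
    awk -F'\t' -v min="$min" -v max="$max" \
        -v o1="$out.$l1" -v o2="$out.$l2" '
      {
        n1 = split($1, a, " ")   # token count, side 1
        n2 = split($2, b, " ")   # token count, side 2
        if (n1 >= min && n1 <= max && n2 >= min && n2 <= max) {
          print $1 > o1
          print $2 > o2
        }
      }'
}
```

Calling `filter_by_length train-all.tok.lower zh en train-all.clean1-40 1 40` would mirror the invocation above; both output files stay line-aligned because each pair is kept or dropped as a unit.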

Development data

${PATH_CHINESE_SEG}/segment.sh ctb ../corpus.org/dev.zh.txt UTF-8 0 | \
perl -Mencoding=utf8 -pe 'tr/[]/［］/; s/\|/｜/g;' > dev.tok.zh

cat ../corpus.org/dev.en.txt | ${SCRIPTS_DIR}/tokenizer.perl -l en | perl -pe 's/\|/｜/g;' | perl -pe '$_=lc($_);' > dev.tok.en

  • The baseline system uses only the first 500 development sentences to reduce MERT time.

head -n 500 dev.tok.zh > dev.tok.500.zh
head -n 500 dev.tok.en > dev.tok.500.en

Test data

${PATH_CHINESE_SEG}/segment.sh ctb ../corpus.org/test.zh.txt UTF-8 0 | \
perl -Mencoding=utf8 -pe 'tr/[]/［］/; s/\|/｜/g;' > test.tok.zh