NTCIR

NII Testbeds and Community for Information access Research

Data preparation for the JE and EJ subtask


We assume the following:
The text files are in the corpus.org directory.
train-all.ja.txt is the Japanese training data (EUC-JP encoded, 1993-2005).
train-all.en.txt is the English training data (1993-2005).
dev.*.txt are the development data.
test.*.txt are the test data.
These files contain only sentences, one per line, with no ID information.
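
Because the files carry no sentence IDs, the Japanese and English sides are presumably paired by line position alone, so it may be worth confirming that each pair of files has the same number of lines before tokenizing. The following is a minimal sketch of such a check; the helper name same_line_count is hypothetical and is not part of the baseline scripts.

```shell
# Hypothetical helper (not part of the baseline scripts):
# succeeds only if the two files have the same number of lines.
same_line_count() {
  a=$(wc -l < "$1")
  b=$(wc -l < "$2")
  # Arithmetic expansion strips any padding that wc may emit.
  [ $((a)) -eq $((b)) ]
}

# Example usage, assuming the directory layout described above:
#   same_line_count corpus.org/train-all.ja.txt corpus.org/train-all.en.txt \
#     || echo "train-all: line counts differ" >&2
```

The same check can be applied to the dev and test pairs.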

export SCRIPTS_ROOTDIR=${pathName}/moses-scripts/scripts-XXXXXXXX-XXXX
export SCRIPTS_DIR=${pathName}/scripts
mkdir corpus.tok
cd corpus.tok

Training data

cat ../corpus.org/train-all.ja.txt | \
mecab -O wakati | \
nkf -Ew | \
perl -Mencoding=utf8 -pe 'tr/０-９Ａ-Ｚａ-ｚ［］/0-9A-Za-z[]/; while(s/([0-9]) ([0-9])/$1$2/g){} while(s/([A-Za-z]) ([A-Za-z])/$1$2/g){} s/ +$//; s/\|/｜/g;' > train-all.tok.lower.ja

cat ../corpus.org/train-all.en.txt | ${SCRIPTS_DIR}/tokenizer.perl -l en | perl -pe 's/\|/｜/g;' > train-all.tok.en

cat train-all.tok.en | perl -pe '$_=lc($_);' > train-all.tok.lower.en

Filter out long sentences:
${SCRIPTS_ROOTDIR}/training/clean-corpus-n.perl train-all.tok.lower ja en train-all.clean1-40 1 40
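
The last two arguments to clean-corpus-n.perl are the minimum and maximum sentence lengths in tokens, so the command above keeps only sentence pairs in which both sides have between 1 and 40 tokens. As a rough illustration of that length criterion only, here is a simplified sketch using paste and awk. The function name filter_by_length is hypothetical, the output is tab-separated pairs rather than two files, and the real script may apply further checks that this sketch does not reproduce.

```shell
# Hypothetical sketch of the length criterion only (not the Moses script).
# Usage: filter_by_length SRC_FILE TGT_FILE MIN_TOKENS MAX_TOKENS
# Prints the kept pairs as "source<TAB>target".
# Assumes the tokenized text itself contains no tab characters.
filter_by_length() {
  paste "$1" "$2" | awk -F'\t' -v min="$3" -v max="$4" '{
    ns = split($1, s, " ")   # number of whitespace-separated source tokens
    nt = split($2, t, " ")   # number of whitespace-separated target tokens
    if (ns >= min && ns <= max && nt >= min && nt <= max) print
  }'
}

# Example usage, assuming the files produced above:
#   filter_by_length train-all.tok.lower.ja train-all.tok.lower.en 1 40
```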

Development data

cat ../corpus.org/dev.ja.txt | \
mecab -O wakati | \
nkf -Ew | \
perl -Mencoding=utf8 -pe 'tr/０-９Ａ-Ｚａ-ｚ［］/0-9A-Za-z[]/; while(s/([0-9]) ([0-9])/$1$2/g){} while(s/([A-Za-z]) ([A-Za-z])/$1$2/g){} s/ $//; s/\|/｜/g;' > dev.tok.ja

cat ../corpus.org/dev.en.txt | ${SCRIPTS_DIR}/tokenizer.perl -l en | perl -pe 's/\|/｜/g;' | perl -pe '$_=lc($_);' > dev.tok.en

The baseline system uses only the first 500 development sentences to reduce MERT time:
head -n 500 dev.tok.ja > dev.tok.500.ja
head -n 500 dev.tok.en > dev.tok.500.en

Test data

For the JE subtask:
cat ../corpus.org/test.ja.txt | \
mecab -O wakati | \
nkf -Ew | \
perl -Mencoding=utf8 -pe 'tr/０-９Ａ-Ｚａ-ｚ［］/0-9A-Za-z[]/; while(s/([0-9]) ([0-9])/$1$2/g){} while(s/([A-Za-z]) ([A-Za-z])/$1$2/g){} s/ $//; s/\|/｜/g;' > test.tok.ja

For the EJ subtask:
cat ../corpus.org/test.en.txt | ${SCRIPTS_DIR}/tokenizer.perl -l en | perl -pe 's/\|/｜/g;' | perl -pe '$_=lc($_);' > test.tok.en