Patent Machine Translation Task at NTCIR-9
- A Chinese to English subtask has been added
- Human evaluations will be carried out
- Parallel corpora consisting of 1 million Chinese-English and 3 million Japanese-English sentence pairs will be provided
Motivation
Patent information is important to society worldwide. There is a strong need for translation, both to understand patent information written in foreign languages and to apply for patents in foreign countries. Patents are a challenging domain for machine translation because patent sentences can be quite long and structurally complex. We have organized a patent machine translation task (PatentMT) to address this significant practical need and to advance research on this challenging problem.
Goals
PatentMT is not competition-oriented; its eventual goal is to foster cooperative work and scientific exchange. To this end, the organizers propose a research task and an open experimental infrastructure for the scientific community working on machine translation. The goals of PatentMT are as follows:
- To develop challenging and significant practical research into patent machine translation.
- To investigate the performance of state-of-the-art machine translation in terms of patent translations involving Japanese, English, and Chinese.
- To compare the effects of different methods of patent translation by applying them to the same test data.
- To create publicly available parallel corpora of patent documents and human evaluations of MT results for patent information processing research.
- To drive machine translation research, which is an important technology for cross-lingual information access to understand information written in unknown languages.
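| Subtask | Parallel corpus |
| --- | --- |
| Chinese to English | 1 million patent description sentence pairs |
| Japanese to English | 3 million patent description sentence pairs |
| English to Japanese | 3 million patent description sentence pairs (shared with Japanese to English) |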
- Participants choose the subtasks in which they would like to participate.
- Resources planned to be provided:
  - Chinese to English subtask: A parallel corpus consisting of 1 million Chinese-English patent description sentence pairs, a large-scale monolingual patent corpus in English, and a test set of patent descriptions
  - Japanese to English subtask: A parallel corpus consisting of 3 million Japanese-English patent description sentence pairs, a large-scale monolingual patent corpus in English, and a test set of patent descriptions
  - English to Japanese subtask: A parallel corpus consisting of 3 million Japanese-English patent description sentence pairs, a large-scale monolingual patent corpus in Japanese, and a test set of patent descriptions
- Participants are requested to machine translate the test sets.
- The submitted translation results will be evaluated through both human evaluation and automatic evaluation, with human evaluation as the primary evaluation. The human evaluation criteria will be adequacy and acceptability; acceptability is defined in Fig. 1. Because of budgetary limitations, acceptability will be evaluated only for selected systems. Systems will be selected according to the following criteria: (i) many types of methods should be included, and (ii) among systems using the same type of method, those with higher adequacy are given priority.
- Task definition (PDF) (updated Section 5 on 2011.4.21)
The task definition is almost the same as that for the NTCIR-8 Patent Translation Task.
The submission format and the submission method for the translated results are described in this document.
- Case recovery and de-tokenization must be done on the translation results. (added 2011.4.21)
- Participants are requested to submit a paper describing the MT system, the utilized resources, and their results using the provided test data, and are requested to present their papers at the workshop.
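The case recovery and de-tokenization requirement above can be illustrated with a minimal sketch for English output. It is not the organizers' procedure: the function names and the rule set are assumptions made for illustration, and the case recovery shown here only capitalizes the sentence-initial letter, whereas a real system would typically use a trained recaser and a full detokenizer.

```python
import re


def detokenize(tokens):
    """Join English tokens and reattach punctuation (simplified, assumed rules)."""
    text = " ".join(tokens)
    # Remove the space before closing punctuation: "signal ." -> "signal."
    text = re.sub(r"\s+([.,;:!?%)\]])", r"\1", text)
    # Remove the space after opening brackets: "( 12" -> "(12"
    text = re.sub(r"([(\[])\s+", r"\1", text)
    # Reattach clitics split off by a tokenizer: "does n't" -> "doesn't"
    text = re.sub(r"\s+(n't|'s|'re|'ve|'ll|'d|'m)\b", r"\1", text)
    return text


def recover_case(text):
    """Naive case recovery: capitalize only the first alphabetic character."""
    for i, ch in enumerate(text):
        if ch.isalpha():
            return text[:i] + ch.upper() + text[i + 1:]
    return text


def postprocess(tokenized_line):
    """Turn one lowercased, tokenized output line into submission-style text."""
    return recover_case(detokenize(tokenized_line.split()))


if __name__ == "__main__":
    print(postprocess("the sensor ( 12 ) does n't detect the signal ."))
```

Proper nouns, acronyms, and chemical or unit names inside a sentence are not restored by this sketch, which is why statistical recasing is normally preferred for patent text.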
Fig. 1 Acceptability
2011.1.20: Registration due (extended from 2010.12.20)
2011.1.5: Training data release
2011.5.9: Test data release
2011.5.22: Translation results submission due (UTC)
2011.8.19: Evaluation results release
2011.9.20: MT system description due
2011.11.4: Camera-ready due
2011.12.6-9: NTCIR-9 workshop
If participants register by 2010.12.20 and NII/HKIED receives a user agreement by 2011.1.4, NII/HKIED will provide data to participants on 2011.1.5.
If participants register after 2010.12.20, NII/HKIED will provide data to participants after 2011.1.5 and after NII/HKIED receives a user agreement.
(NII will release the NTCIR-8 PATMT training data to the public for research use before 2011.1.5. Participants are allowed to use the data before the 2011.1.5 release date.)
Registration forms are available on the official NTCIR-9 page.
- Benjamin K. Tsou (Hong Kong Institute of Education/City University of Hong Kong)
- Kapo Chow (Hong Kong Institute of Education)
- Bin Lu (City University of Hong Kong)
- Isao Goto (NICT)
- Eiichiro Sumita (NICT)
If you have any questions or suggestions about the task, please feel free to send an email to the organizers.