Building a Vietnamese WordNet-annotated corpus for advanced tasks in NLP
Given a WordNet-annotated corpus on the English side, one can use the word alignments provided by the GIZA toolkit to project the WordNet tags onto the Vietnamese side. Once this is achieved, the newly generated corpus is expected to contain useful semantic information that could help an SMT system reach better performance. The problem is that alignments take several forms: 1-1, 1-n, m-1, and m-n. We therefore propose heuristics for combining overlapping alignments to obtain the best projection result, which is then evaluated on a hand-labeled test set.
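To make the projection step concrete, the following is a minimal Python sketch, not the heuristics used in this work, which are not detailed here. It assumes the alignments have already been converted to "i-j" index pairs (English index, then Vietnamese index), and it resolves overlapping links with a purely illustrative rule: take the most frequent candidate tag, breaking ties in favor of the leftmost English token. The function names, the tag strings, and the tie-break rule are all assumptions made for illustration.

```python
from collections import Counter


def parse_alignment(line):
    """Parse 'i-j' pairs (English index i, Vietnamese index j) from one line."""
    return [tuple(int(x) for x in pair.split("-")) for pair in line.split()]


def project_tags(en_tags, pairs, vi_len):
    """Project English tags onto Vietnamese tokens through alignment links.

    en_tags: one tag per English token, or None for untagged tokens.
    pairs:   (english_index, vietnamese_index) alignment links.
    vi_len:  number of Vietnamese tokens.
    """
    # Collect the candidate tags for each Vietnamese token, in English order.
    candidates = [[] for _ in range(vi_len)]
    for en_i, vi_j in sorted(pairs):
        tag = en_tags[en_i]
        if tag is not None:
            candidates[vi_j].append(tag)

    projected = []
    for tags in candidates:
        if not tags:
            # Unaligned token, or aligned only to untagged English tokens.
            projected.append(None)
            continue
        # Hypothetical heuristic for m-1 and m-n links:
        # most frequent tag first, then the leftmost English source.
        counts = Counter(tags)
        best = max(counts, key=lambda t: (counts[t], -tags.index(t)))
        projected.append(best)
    return projected


# Illustrative sentence pair: "students read books" / "sinh viên đọc sách".
# The tag strings are placeholders chosen for this example.
en_tags = ["student.n.01", "read.v.01", "book.n.01"]
pairs = parse_alignment("0-0 0-1 1-2 2-3")  # "students" links to two tokens (1-n)
print(project_tags(en_tags, pairs, vi_len=4))
```

In this sketch a 1-n link simply copies the same tag to every linked Vietnamese token, while m-1 and m-n links are where the tie-break rule matters. A real system would likely also merge the linked Vietnamese tokens into a single multi-syllable word before tagging.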