Bilingual lexicons are important resources for machine translation and automatic alignment systems. This paper presents a simple and effective model for extracting bilingual lexicons automatically from pre-aligned parallel texts by using information retrieval techniques. The model is based on the assumption that two words/phrases are likely to be translations if they are aligned to the same word/phrase in a third language.
As a use case, we used the Greek-English, Latin-English, and Persian-English aligned parallel texts available in the Perseus Digital Library to produce Greek-Latin, Greek-Persian, and Latin-Persian dynamic lexicons.
Our datasets comprise 104 Ancient Greek/English works, 59 Latin/English works, and 494 Persian/English poems. The Ancient Greek/English dataset consists of approximately 210k sentence pairs with 4,320k Ancient Greek words, and the Latin/English dataset consists of approximately 123k sentence pairs with 2,330k Latin words, whereas the Persian/English dataset consists of 64k translation pairs (23k of them unique).
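The pivot assumption stated above (two words are likely translations if they are aligned to the same word in a third language) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the toy alignment pairs, and the use of cosine similarity over pivot-word counts as the ranking score are all assumptions made for illustration, standing in for whichever information retrieval scoring the model actually uses.

```python
from collections import Counter, defaultdict
from math import sqrt


def pivot_vectors(aligned_pairs):
    """Map each word to a Counter of the pivot-language words it is aligned to."""
    vectors = defaultdict(Counter)
    for word, pivot_word in aligned_pairs:
        vectors[word][pivot_word] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def induce_lexicon(source_pairs, target_pairs, top_k=3):
    """Rank target-language candidates for each source word via the shared pivot.

    Assumption: candidates are scored by cosine similarity of their
    pivot-word count vectors; candidates sharing no pivot word are dropped.
    """
    source_vecs = pivot_vectors(source_pairs)
    target_vecs = pivot_vectors(target_pairs)
    lexicon = {}
    for src_word, src_vec in source_vecs.items():
        scored = [(tgt, cosine(src_vec, tgt_vec)) for tgt, tgt_vec in target_vecs.items()]
        scored = [(tgt, score) for tgt, score in scored if score > 0.0]
        # Highest score first; ties broken alphabetically for determinism.
        scored.sort(key=lambda item: (-item[1], item[0]))
        lexicon[src_word] = scored[:top_k]
    return lexicon


# Hypothetical toy alignments (word, English pivot word), not taken from Perseus.
greek_english = [("θεός", "god"), ("θεός", "god"), ("θεός", "deity"),
                 ("λόγος", "word"), ("λόγος", "speech")]
latin_english = [("deus", "god"), ("deus", "deity"),
                 ("verbum", "word"), ("oratio", "speech")]

lexicon = induce_lexicon(greek_english, latin_english)
for greek_word, candidates in lexicon.items():
    print(greek_word, candidates)
```

In this toy setting, the Greek and Latin words never co-occur in any text; they are linked only through the English words they share, which is the point of using English as the pivot.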