LABORATORY 15
Laboratory of Computational Linguistics

Head of Laboratory – Dr.Sc. (Linguistics) Igor Boguslavsky

Tel.: (095) 299-49-27; E-mail: bogus@iitp.ru

The leading researchers of the laboratory include:

Full member of the Russian Academy of Sciences, Dr.Sc. (Linguistics) Jury Apresjan
Dr.Sc. (Linguistics) Vladimir Sannikov
Dr. Leonid Iomdin
Dr. Leonid Mitjushin
Dr. Leonid Tsinman
Nikolay Grigoriev
Svetlana Grigorieva
Alexander Lazursky
Irina Sagalova
Victor Sizov

RESEARCH ACTIVITIES

The Laboratory's main research area is the functioning of natural language as a means of information transmission.

Basic research in the laboratory is oriented towards the development of a fully operational formal model of language of the Meaning ⇔ Text class. This model simulates human linguistic behavior, including the basic human ability to produce and comprehend natural language texts.

PRINCIPAL RESULTS

In 1999, the following results were achieved.

1. A deconversion module was constructed that transforms structures of the Universal Networking Language (UNL) into correct Russian sentences. The work was carried out within an international project implemented under the auspices of the United Nations. The ultimate goal of the project is to overcome the language barrier on the Internet by enabling Web users from different countries to communicate with each other, each in his or her native language. The project's main idea is to use a specially designed interlingua, the UNL, for the exchange of information within the network. Any meaning conveyed by a text written in any natural language can be represented in UNL. For every natural language two reciprocal procedures have been developed: the conversion procedure, which (interactively) translates a text written in that language into a UNL text, and the deconversion procedure, which translates a UNL expression into a text in the given natural language. Both procedures are made available to any user through an Internet server. The laboratory's task in the project is to create both procedures for Russian as a new module of the ETAP-3 system. The current prototype version of the deconverter is available at http://proling.iitp.ru/Deco.

2. The laboratory continued to develop the ETAP-3 machine translation system. In particular,

The combinatorial dictionaries of English and Russian were substantially expanded and now contain 48,000-50,000 entries each.

Over 1,500 entries in each dictionary have been supplied with information on lexical functions, which are used to improve translation quality and as a means of quasi-synonymous paraphrasing. More than 100 lexical functions are in use.
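As an illustration of how lexical function information in a dictionary entry can be used, here is a minimal Python sketch. The entries, the table layout and the helper names (`LEXICAL_FUNCTIONS`, `lf_values`, `collocate`) are assumptions made for this example; they do not reproduce the ETAP-3 dictionary format. The function names Oper1 (support verb) and Magn (intensifier) come from the standard lexical function inventory of the Meaning ⇔ Text theory.

```python
# Hypothetical miniature of a combinatorial dictionary: each keyword maps
# lexical function names to the values recorded for that keyword.
LEXICAL_FUNCTIONS = {
    "attention": {"Oper1": ["pay"], "Magn": ["close"]},
    "decision": {"Oper1": ["make", "take"]},
    "rain": {"Magn": ["heavy"]},
}


def lf_values(keyword, function):
    """Return the recorded values of a lexical function (empty list if none)."""
    return LEXICAL_FUNCTIONS.get(keyword, {}).get(function, [])


def collocate(keyword, function):
    """Build a collocation from the first recorded value, or None if absent."""
    values = lf_values(keyword, function)
    if not values:
        return None
    return f"{values[0]} {keyword}"


print(collocate("rain", "Magn"))        # heavy rain
print(lf_values("decision", "Oper1"))   # ['make', 'take']
```

In translation, a lookup of this kind lets the system choose the collocate that the target language requires rather than translating the collocate word for word.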

The format of an entry of the combinatorial dictionary was modified in order to tune the translation of texts belonging to different subject areas.

The software of the ETAP system, originally programmed on a VAX computer running the VMS operating system, was ported to the Windows NT environment.

An innovative mechanism was developed for handling so-called weakened syntactic rules, which are used to increase parser robustness.

A new parsing algorithm, based on gradual extraction of increasingly larger treelike fragments of syntactic structures of sentences, was developed and tested.
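The general idea of growing larger treelike fragments from smaller ones can be sketched as follows. This is a simplified illustration and not the laboratory's algorithm: the part-of-speech tags, the `RULES` table and the function `parse_bottom_up` are assumptions introduced here. Each word starts as a one-node fragment, and adjacent fragments are merged whenever a rule allows the head of one to govern the head of the other.

```python
# Hypothetical licensing rules: (head POS, dependent POS) pairs.
RULES = {("NOUN", "ADJ"), ("NOUN", "DET"), ("VERB", "NOUN")}


def parse_bottom_up(tags):
    """Merge adjacent fragments until no rule applies.

    Returns (heads, arcs): the head indices of the remaining fragments and
    the dependency arcs found, each as (head index, dependent index).
    """
    heads = list(range(len(tags)))  # every word is initially its own fragment
    arcs = []
    merged = True
    while merged and len(heads) > 1:
        merged = False
        for i in range(len(heads) - 1):
            left, right = heads[i], heads[i + 1]
            if (tags[right], tags[left]) in RULES:
                arcs.append((right, left))
                del heads[i]          # left fragment is absorbed by the right
            elif (tags[left], tags[right]) in RULES:
                arcs.append((left, right))
                del heads[i + 1]      # right fragment is absorbed by the left
            else:
                continue
            merged = True
            break                     # rescan from the start after each merge
    return heads, arcs


heads, arcs = parse_bottom_up(["DET", "ADJ", "NOUN", "VERB"])
print(heads, arcs)
```

When the input cannot be fully connected, the function returns several fragment heads instead of failing, which is one simple way to keep partial analyses.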

A grammar checking system for Russian texts was developed. The system detects and corrects a broad class of errors in grammatical agreement and case government.

A series of experiments in integrating various machine translation strategies within a single MT system was carried out. The ETAP-3 system, which relies on a rule-based strategy (deductive approach), was supplemented by inductive-approach engines that were dynamically introduced into the system (statistical processing of parallel text corpora, use of translation memories).
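A minimal sketch of one kind of check involved, adjective-noun agreement, is given below. It covers detection only, and the `Word` record, its feature codes and `agreement_errors` are assumptions made for illustration rather than a description of the system itself. It also ignores the fact that Russian adjectives do not distinguish gender in the plural.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    form: str
    gender: str   # "m", "f" or "n"
    number: str   # "sg" or "pl"
    case: str     # "nom", "gen", ...


# Features on which an attributive adjective must match its noun.
AGREEMENT_FEATURES = ("gender", "number", "case")


def agreement_errors(adjective, noun):
    """Return the names of the features on which the two words disagree."""
    return [name for name in AGREEMENT_FEATURES
            if getattr(adjective, name) != getattr(noun, name)]


novaya = Word("новая", "f", "sg", "nom")
novyj = Word("новый", "m", "sg", "nom")
dom = Word("дом", "m", "sg", "nom")

print(agreement_errors(novaya, dom))  # ['gender']
print(agreement_errors(novyj, dom))   # []
```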
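One simple way to combine the two approaches is to consult a translation memory first and fall back to the rule-based engine when no stored translation exists. The sketch below assumes this arrangement purely for illustration: the memory contents, the toy lexicon and the word-for-word `rule_based_translate` stub are hypothetical stand-ins and do not reflect how ETAP-3 integrates its engines.

```python
# Hypothetical translation memory: previously translated segments.
TRANSLATION_MEMORY = {"good morning": "доброе утро"}

# Toy lexicon used by the stub that stands in for the rule-based engine.
LEXICON = {"i": "я", "read": "читаю"}


def rule_based_translate(sentence):
    """Stub for a rule-based engine: word-for-word lookup, unknown words kept."""
    return " ".join(LEXICON.get(word, word) for word in sentence.split())


def translate(sentence):
    """Return (translation, source), where source is 'memory' or 'rules'."""
    key = sentence.strip().lower()
    if key in TRANSLATION_MEMORY:
        return TRANSLATION_MEMORY[key], "memory"
    return rule_based_translate(key), "rules"


print(translate("Good morning"))  # ('доброе утро', 'memory')
print(translate("I read"))        # ('я читаю', 'rules')
```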

An experimental version of the ETAP-3 system is available at http://proling.iitp.ru

3. An in-depth theoretical study of Russian word formation was undertaken, which served as the basis for a computer-implemented model. In particular,

An optimal strategy for the introduction of the word formation component into NLP systems was proposed and tested in a series of computer experiments. The strategy takes into account

the productivity of a word formative model;

the degree of implementational complexity;

for MT systems – the complexity of lexical transfer rules involving units formed with the help of the model in the translation from and into Russian.

To implement word formation analysis, a set of computer programs was written, including a new morphology compiler and an analyzer capable of representing word-formation information.

4. A new version of the apparatus of lexical functions and paraphrasing was developed.

A new model of paraphrasing includes three sets of rules:

identification of lexical functions in an arbitrary sentence (transition from surface syntactic structures to deep syntactic structures);
canonization of syntactic structures;
paraphrasing proper.
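The three sets of rules can be read as stages applied in sequence. The following sketch shows that reading on a deliberately tiny scale; the structure format, the `OPER1` and `V0` tables and all function names are assumptions made for this example and are not the laboratory's rule formalism. Here `V0` lists the verb derived from a noun, and the canonization stage is a stand-in that only normalizes voice.

```python
# Hypothetical lexical data: support verbs (Oper1) and derived verbs (V0).
OPER1 = {"decision": {"make", "take"}, "advice": {"give"}}
V0 = {"decision": "decide", "advice": "advise"}


def identify(structure):
    """Stage 1: label a support verb with the lexical function it realizes."""
    verb, noun = structure["verb"], structure["noun"]
    if verb in OPER1.get(noun, set()):
        return {**structure, "verb": "Oper1"}
    return structure


def canonize(structure):
    """Stage 2 (stand-in): reduce voice variants to one canonical active form."""
    return {**structure, "voice": "active"}


def paraphrase(structure):
    """Stage 3: rewrite Oper1(N) + N as the verb derived from N, if listed."""
    if structure["verb"] == "Oper1" and structure["noun"] in V0:
        return {"verb": V0[structure["noun"]], "voice": structure["voice"]}
    return structure


def run(structure):
    """Apply the three stages in order."""
    return paraphrase(canonize(identify(structure)))


# "a decision was made" -> "decide"
print(run({"verb": "make", "noun": "decision", "voice": "passive"}))
```

Structures in which no lexical function is recognized pass through the last stage unchanged.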

A new version of the lexical function inventory was developed (definitions and representative lists for each function).

Complex experiments were performed to test the paraphrasing model.

5. The development of a computer aid for learners of Russian and English was continued. The aid helps learners improve their command of Russian and English vocabulary. In particular,

An update of the government patterns was begun. The patterns are being unified on the basis of a detailed semantic and syntactic classification of predicate words.

A new version of the semantic language for analytic definitions of lexical units was created and used as the basis for writing new lexicographic definitions.

A language of semantic features for overlapping vocabulary classification was created. Information on the features was included in the entries of the combinatorial dictionaries of Russian and English.

6. The compilation of a morphologically and syntactically tagged corpus of Russian texts was continued. Each sentence in this corpus is supplied with a full morphological structure and a syntactic dependency tree. A corpus of Russian texts containing ca. 1 million words was formed and prepared for tagging. Morphological and syntactic tagging was completed for a part of the corpus comprising about 4,200 sentences, or 56 thousand words.
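A sentence annotated in this way can be pictured as a list of tokens, each carrying its morphological analysis and a pointer to its syntactic governor. The sketch below is a hypothetical representation, not the corpus format: the `Token` fields, the feature strings and the relation labels are assumptions. It also includes a check that the governor pointers form a tree.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    lemma: str
    features: str   # morphological tags, e.g. "NOUN NOM SG"
    head: int       # index of the governing token, -1 for the root
    relation: str   # label of the dependency on the governor


def is_tree(tokens):
    """True if there is exactly one root and every token reaches it."""
    n = len(tokens)
    if sum(t.head == -1 for t in tokens) != 1:
        return False
    for start in range(n):
        seen, i = set(), start
        while tokens[i].head != -1:
            if i in seen or not 0 <= tokens[i].head < n:
                return False      # cycle or pointer outside the sentence
            seen.add(i)
            i = tokens[i].head
    return True


# "Мальчик читает книгу" ("The boy reads a book"), with illustrative labels.
sentence = [
    Token("Мальчик", "мальчик", "NOUN NOM SG", 1, "subject"),
    Token("читает", "читать", "VERB PRES 3 SG", -1, "root"),
    Token("книгу", "книга", "NOUN ACC SG", 1, "object"),
]
print(is_tree(sentence))
```

For scale, 56 thousand words in about 4,200 sentences corresponds to an average of roughly 13 words per sentence.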

GRANTS FROM:

Russian Foundation of Basic Research (No. 99-06-80277): "Development of an Operational Meaning ⇔ Text Linguistic Model (third release)".

Russian Foundation of Basic Research (No. 98-07-90072): "Creation of an Annotated Corpus of Russian Texts".

Russian Foundation of Basic Research (No. 99-06-80292): "A Formal Model of Paraphrasing of Sentences for Natural Text Processing".

Russian Foundation of Basic Research (No. 99-06-80276): "Theory and Practice of Introducing Word Formative Component into Automatic Text Processing Systems for Russian".

Russian State Scientific Foundation (No. 99-04-00318): "Computer Assisted Learning of Language Vocabulary".

PUBLICATIONS IN 1999

Apresjan Ju.D. Russian Theoretical Semantics at the End of the 20th Century // Izvestiya AN, Seriya literatury i yazyka. 1999. No. 4. P. 39-53 (in Russian).

Apresjan Ju.D. Principles of Systematic Lexicography and the Explanatory Dictionary // Poetics. History of Literature. Linguistics. A collection for the 70th birthday of Vyacheslav Vsevolodovich Ivanov. Moscow: 1999. P. 634-650 (in Russian).

Apresjan Ju.D. Basic Mental Predicates of State in Russian // Slavic Studies. A collection for the jubilee of S. M. Tolstaya. Moscow: 1999. P. 44-58 (in Russian).

Boguslavsky I.M., Iomdin L.L. The Semantics of Rapidity // Voprosy Jazykoznanija. 1999. No. 6. P. 13-30 (in Russian).

Boguslavsky I. Translation to and from Russian: the ETAP-3 System // Proceedings of the Workshop of the European Association for Machine Translation (in press).

Grigoriev N.V. A Bottom-Up Algorithm for Building a Dependency Tree for the ETAP-3 System // Proceedings of the International Workshop Dialogue'99, pp. 28-33, 1999 (in Russian).

Iomdin L., Streiter O. Learning from Parallel Corpora: Experiments in Machine Translation // Proceedings of the International Workshop Dialogue'99, pp. 79-88, 1999.

Iomdin L., Carl M., Pease C., Streiter O. Towards a Dynamic Linkage of Example-Based and Rule-Based Machine Translation // Machine Translation. 2000, issue 5 (in press).

Iomdin L., Streiter O. et al. Learning, Forgetting and Remembering: Statistical Support for Rule-Based MT // Proceedings of the 8th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI99), 1999.

Tsinman L.L., Sizov V.G. The ETAP System: Procedures for Weakening Syntactic Rules and Their Use // Proceedings of the International Workshop Dialogue'99, pp. 321-326, 1999 (in Russian).