LABORATORY 15

Laboratory of Computational Linguistics

Head of Laboratory – Dr.Sc. (Linguistics) Igor Boguslavsky

Tel.: (095) 299-49-27; E-mail: bogus@iitp.ru

The leading researchers of the laboratory include:

Full member of the Russian Academy of Sciences, Dr.Sc. (Linguistics) Jury D. Apresjan

Dr.Sc. (Linguistics) Vladimir Z. Sannikov

Nikolay V. Grigoriev

Dr. Leonid L. Iomdin

Alexander V. Lazursky

Dr. Leonid G. Mitjushin

Irina E. Kayali

Dr. Leonid L. Tsinman

Leonid G. Kreidlin

Dr. Svetlana A. Grigorieva

Nadezhda E. Frid

RESEARCH ACTIVITIES

The Laboratory’s main research area is the functioning of natural language as a means of transmitting information. Its basic research aims to develop a fully operational formal model of language of the Meaning ⇔ Text type. The model simulates human linguistic behavior, that is, the human ability to produce and comprehend natural language texts.

PRINCIPAL RESULTS

All scientific results obtained in 2001 extend the functional capabilities of the multipurpose NLP system ETAP-3. A demo version of the system is available on the Internet at http://proling.iitp.ru.

1.        New versions of the combinatorial dictionaries of Russian and English have been developed. Each of the two dictionaries now contains up to 53,000 lexical entries, which is comparable in size to large traditional general-purpose bilingual dictionaries. Both dictionaries have changed not only quantitatively but also qualitatively. Until quite recently, the general strategy of the Laboratory’s lexicographic work was to minimize the polysemy of lexical items by aggregating several meanings of a word within one entry. Such aggregated entries were supplied with correspondingly aggregated patterns of government, which increased the probability of errors in machine translation and other NLP systems. The greater text-processing speed of modern computers has made it possible to give up that strategy, and the introduction of more sophisticated lexicographic information into the dictionary entries has made it necessary to do so. The new lexicographic strategy takes the polysemy of a word into account to a greater extent and handles each of its meanings as a separate lexical item with its own pattern of government, its own set of lexical functions, and its own translations for its various uses. As a result, the Russian and English combinatorial dictionaries now reflect the real structure of the lexical systems of both languages much more accurately, which has improved machine translation quality.
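
The move from aggregated entries to one entry per meaning can be pictured with a small data-structure sketch. The field names, the sample word and its values below are illustrative assumptions and do not reproduce the actual ETAP-3 dictionary format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LexicalItem:
    """One meaning of a word, stored as a separate dictionary entry."""
    lemma: str
    sense_id: int
    government: List[str]  # hypothetical notation for the pattern of government
    lexical_functions: Dict[str, List[str]] = field(default_factory=dict)
    translations: List[str] = field(default_factory=list)


# Hypothetical entries: two meanings of English "charge" are kept apart,
# each with its own government pattern, lexical functions and translations.
DICTIONARY = [
    LexicalItem("charge", 1, ["subject", "object", "with + NP"],
                {"S0": ["charge"]}, ["обвинять"]),
    LexicalItem("charge", 2, ["subject", "object"],
                {"S0": ["charging"]}, ["заряжать"]),
]


def senses(lemma: str) -> List[LexicalItem]:
    """Return every separate lexical item recorded for a lemma."""
    return [item for item in DICTIONARY if item.lemma == lemma]


for item in senses("charge"):
    print(item.sense_id, item.government, item.translations)
```

With aggregated entries both meanings would share a single, merged pattern of government; here a translation module can pick the item whose own pattern matches the sentence.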

2.        The Russian morphological dictionary, which now contains up to 120,000 lexical units, has been further supplemented with proper and geographical names. It has also been stripped of doublets of the "бомбовый – бомбовой" type, which appeared in it as a result of integration with Zaliznyak’s Grammatical Dictionary of Russian.

3.        The new algorithm of morphological analysis, which was developed in 2000 and is used in every ETAP-based NLP application, has been implemented using finite-state transducer (FST) technology. The basic characteristics of the new system of morphological analysis are:

·           a high speed of operation (several thousand words a second),

·           bidirectionality (the same set of data can be used for both analysis and generation),

·           compactness (it requires very little RAM and disk space).
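
Bidirectionality can be illustrated with a toy finite-state transducer in which one set of arcs serves both analysis and generation. This is a minimal sketch under stated assumptions: the arcs, the tag names (`+N`, `+Sg`, `+Pl`) and the English example are invented for illustration and are not taken from the ETAP-3 morphological data.

```python
from typing import List, Set, Tuple

# Each arc is (source state, lexical symbol, surface symbol, target state).
# An empty string means "no symbol on this side" (epsilon).
ARCS: List[Tuple[int, str, str, int]] = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", "", 4),
    (4, "+Sg", "", 5),
    (4, "+Pl", "s", 5),
]
FINAL_STATES: Set[int] = {5}


def transduce(symbols: List[str], to_lexical: bool) -> List[str]:
    """Run the transducer over `symbols` and return every output string.

    to_lexical=True reads the surface side and writes the lexical side
    (analysis); to_lexical=False does the reverse (generation).
    """
    results: List[str] = []

    def walk(state: int, pos: int, out: str) -> None:
        if pos == len(symbols) and state in FINAL_STATES:
            results.append(out)
        for source, lexical, surface, target in ARCS:
            if source != state:
                continue
            read, write = (surface, lexical) if to_lexical else (lexical, surface)
            if read == "":
                walk(target, pos, out + write)  # epsilon: consume no input
            elif pos < len(symbols) and symbols[pos] == read:
                walk(target, pos + 1, out + write)

    walk(0, 0, "")
    return results


def analyze(word: str) -> List[str]:
    """Surface form -> lexical descriptions."""
    return transduce(list(word), to_lexical=True)


def generate(lexical_symbols: List[str]) -> List[str]:
    """Lexical description -> surface forms."""
    return transduce(lexical_symbols, to_lexical=False)


print(analyze("cats"))                         # ['cat+N+Pl']
print(generate(["c", "a", "t", "+N", "+Sg"]))  # ['cat']
```

Both functions traverse the same `ARCS` table; only the side that is read and the side that is written are swapped.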

4.        The capabilities of ETAP-3’s parsing module have been expanded with a weighting mechanism aimed at generating the most probable syntactic structure for each processed sentence. A prototype syntactic parser has been created that uses the results of statistical analysis of large corpora when producing the syntactic structure of a particular sentence.
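
The general idea of weighting competing structures can be sketched as follows. The report does not describe how ETAP-3 computes its weights, so the scoring scheme (a sum of log-probabilities of dependency links), the relation names and all numbers below are assumptions made for illustration only.

```python
import math
from typing import Dict, List, Tuple

Link = Tuple[str, str, str]  # (head, relation, dependent)

# Hypothetical link probabilities standing in for corpus statistics.
LINK_PROB: Dict[Link, float] = {
    ("saw", "subj", "I"): 0.9,
    ("saw", "obj", "man"): 0.8,
    ("saw", "instr", "telescope"): 0.3,
    ("man", "attr", "telescope"): 0.1,
}
DEFAULT_PROB = 0.01  # fallback for links absent from the statistics


def score(structure: List[Link]) -> float:
    """Log-probability of a structure, treating its links as independent."""
    return sum(math.log(LINK_PROB.get(link, DEFAULT_PROB)) for link in structure)


def best_structure(candidates: List[List[Link]]) -> List[Link]:
    """Pick the candidate structure with the highest score."""
    return max(candidates, key=score)


# Two candidate structures for "I saw the man with a telescope".
verb_attachment = [("saw", "subj", "I"), ("saw", "obj", "man"),
                   ("saw", "instr", "telescope")]
noun_attachment = [("saw", "subj", "I"), ("saw", "obj", "man"),
                   ("man", "attr", "telescope")]

print(best_structure([noun_attachment, verb_attachment]))
```

With these invented numbers the verb attachment is selected, because its distinguishing link has the higher probability.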

5.        The algorithm for resolving the grammatical and functional ambiguity of Russian words on the basis of morphological data and linear context, which was designed in 2000, has been implemented. It is now being run to collect material intended to improve the performance of the morphological and syntactic analyzers.
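
Resolving ambiguity from morphological data and linear context can be pictured as discarding readings that are incompatible with an already resolved neighbour. The lexicon, the tag set and the three rules below are deliberately crude assumptions, given in English for readability; they are not the Laboratory's rules for Russian.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical lexicon: every word form with all of its possible tags.
LEXICON: Dict[str, Set[str]] = {
    "the": {"DET"},
    "they": {"PRON"},
    "can": {"NOUN", "VERB", "AUX"},
    "rusts": {"NOUN", "VERB"},
    "swim": {"NOUN", "VERB"},
}

# Each rule: (readings of the left neighbour, readings to discard).
RULES: List[Tuple[Set[str], Set[str]]] = [
    ({"DET"}, {"VERB", "AUX"}),  # no verb directly after a determiner
    ({"PRON"}, {"NOUN"}),        # no noun directly after a pronoun
    ({"NOUN"}, {"NOUN"}),        # toy rule: no noun directly after a noun
]


def disambiguate(words: List[str]) -> List[Set[str]]:
    """Narrow each word's readings using its left neighbour, left to right."""
    readings = [set(LEXICON[w]) for w in words]
    for i in range(1, len(readings)):
        for left_context, excluded in RULES:
            if readings[i - 1] == left_context:
                narrowed = readings[i] - excluded
                if narrowed:  # never discard the last remaining reading
                    readings[i] = narrowed
    return readings


print(disambiguate(["the", "can", "rusts"]))
print(disambiguate(["they", "can", "swim"]))
```

In the second sentence some ambiguity remains, which is the kind of residual material a later syntactic analyzer would have to settle.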

6.        A series of experiments has been carried out to integrate the ETAP-3 system with an online question-answering system, IAW (I Ask Web), which conducts a dialogue with the user in natural language. ETAP was supplemented with a semantic dictionary covering three domains: online stores, real estate and taxes. In addition, a module was written that transforms the syntactic structure generated by ETAP into the corresponding semantic representation. The use of ETAP’s facilities made it possible to choose the answer to a question more accurately. Substantial progress has also been made in syntactic ambiguity resolution. The integration of ETAP and IAW has been completed; the integrated system is in operation and is available at www.iaskweb.com.

7.        The deconversion module for UNL (Universal Networking Language), which generates target Russian texts from input UNL semantic representations, has been further developed. In cooperation with partners from Spain, Italy, France and India, a series of experiments on the simultaneous generation of texts in the five languages on computers located in the five respective countries has been prepared and carried out. Within this framework, specific recommendations have been worked out for further improving UNL and the methods of recording information in that language. The system is available at http://www.unl.ru.

8.        Work on the project "Computer-Aided Learning of Lexica" has been completed. Learner’s dictionaries of Russian and English containing up to 2,500 entries each have been created. They store the following types of information about each lexeme: a) its part of speech, b) its translation or translations into the other working language, c) its analytical definition, d) its semantic features, or descriptors, e) its pattern of government, f) the values of its lexical functions. The total number of lexical functions is 107. On this basis several computer lexical games have been designed, for example: guess a word from its analytical definition, supply the values of the lexical functions proposed by the computer for a given word, or supply the values of a particular lexical function for the words proposed by the computer. The product includes a system that assigns a numerical score to the user’s answers depending on the number of correct answers and the linguistic complexity of the questions.
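
A score that depends both on the number of correct answers and on the complexity of the questions could be computed along the following lines. The report gives no formula, so the complexity scale (1 to 3) and the weighting scheme are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Answer:
    correct: bool
    complexity: int  # hypothetical difficulty of the question: 1 (easy) to 3 (hard)


def score(answers: List[Answer]) -> float:
    """Percentage of achievable points, each question weighted by complexity."""
    total = sum(a.complexity for a in answers)
    if total == 0:
        return 0.0
    earned = sum(a.complexity for a in answers if a.correct)
    return 100.0 * earned / total


session = [Answer(True, 1), Answer(False, 3), Answer(True, 2)]
print(score(session))  # 50.0
```

Two of the three answers are correct, but the missed question was the hardest one, so the weighted score is 50 percent rather than 67.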

9.        Work on the project "A Formal Model of Paraphrasing for NLP Systems" has been completed. In addition to the paraphrasing rules outlined in the classical version of the "Meaning – Text" theory, a large number of new rules have been introduced that bear on synonymic relations in the derivational and syntactic subsystems of the language, in particular:

-rules working with the so-called Aktionsarten of the Russian verb (transformations of inceptive, finitive, causative and liquidative verbs into the corresponding verbal phrases, e.g., «Зал зашумел – В зале поднялся шум», "The hall grew noisy – Noise arose in the hall");

-rules working with the so-called indefinite-personal constructions of Russian (transformations of the type «Его обманули – Он был обманут», "They deceived him – He was deceived").

In this connection, about two dozen new lexical functions absent from the classical version of the "Meaning – Text" model were introduced into the paraphrasing system.
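
How a paraphrasing rule draws on lexical functions can be shown with one rule of the classical type, which replaces a verb by a support verb plus the corresponding noun (V ⇔ Oper1(S0(V)) + S0(V)). The sketch below is an illustration under stated assumptions: the two small tables are hypothetical dictionary values, and the rule shown is not one of the new rules developed in the project.

```python
from typing import Dict, Optional, Tuple

# Hypothetical values of the lexical function S0 (derived noun of a verb).
S0: Dict[str, str] = {
    "help": "help",
    "analyze": "analysis",
    "decide": "decision",
}

# Hypothetical values of the lexical function Oper1 (support verb of a noun).
OPER1: Dict[str, str] = {
    "help": "give",
    "analysis": "carry out",
    "decision": "make",
}


def paraphrase(verb: str) -> Optional[Tuple[str, str]]:
    """Apply V => Oper1(S0(V)) + S0(V); return None if a value is missing."""
    noun = S0.get(verb)
    if noun is None:
        return None
    support_verb = OPER1.get(noun)
    if support_verb is None:
        return None
    return (support_verb, noun)


print(paraphrase("decide"))   # ('make', 'decision')
print(paraphrase("analyze"))  # ('carry out', 'analysis')
print(paraphrase("sleep"))    # None
```

The rule itself is language-independent; everything specific to a word comes from the dictionary, which is why adding new lexical functions directly extends the range of available paraphrases.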

10.    The first stage of work on the second part of the tagged corpus of Russian texts (collection and primary processing of a corpus of sentences) has been completed. The textual material for the second part of the corpus consists of Internet news feeds, that is, sets of brief news items issued by news agencies. The texts were taken from the sites www.yandex.ru, www.lenta.ru, www.rbc.ru, www.polit.ru and some others. On the one hand, this material is stylistically and syntactically well suited to automatic processing, because it requires very little post-editing. On the other hand, the results of tagging texts of this kind are extremely useful for improving the performance of information retrieval systems. Alongside the collection of texts, work on tagging the sentences has begun.

GRANTS

·        Russian Foundation for Basic Research (No. 99-06-80277): "Development of an Operational Meaning ⇔ Text Linguistic Model (third release)".

·        Russian Foundation for Basic Research (No. 99-06-80292): "A Formal Model of Paraphrasing of Sentences for Natural Text Processing".

·        Russian Foundation for Basic Research (No. 01-06-80453): "Development of a Compound Parsing Algorithm for the Linguistic Processor ETAP-3".

·        Russian Foundation for Basic Research (No. 01-07-90405): "Creation of an Annotated Corpus of Russian Texts (second release)".

·        Russian State Scientific Foundation (No. 99-04-00318): "Computer-Assisted Learning of Language Vocabulary".

Publications in 2001

1.      Jurij D. Apresjan. Semantyka leksykalna. Synonimiczne środki języka. Przeł. Zofia Kozłowska i Andrzej Markowski. Drugie wydanie polskie przygotowały Zofia Kozłowska i Elżbieta Janus. Wrocław – Warszawa – Kraków: Ossolineum, 2000 (actually published in 2001).

2.      Boguslavsky I., On the scales and implicatures of EVEN // Pragmatics and Flexibility of Word Meaning. Ed. by E. Németh t., K. Bibok. Current Research in the Semantics/Pragmatics Interface, 8, Elsevier Science, 2001.

3.      Апресян Ю.Д. Смыслы ‘знать’ и ‘считать’ в системе русского языка // Међународни научни скуп о лексикографиjи и лексикологиjи «Дескриптивна лексикографиjа стандартног jезика и њене теориjске основе». Резимеи. Београд – Нови Сад, 2001, 1-2.

4.      Апресян Ю.Д. Глагол заставлять: семантический класс, синонимия, многозначность // Жизнь языка. Сборник статей к 80-летию Михаила Викторовича Панова. М.: 2001, 13-27.

5.      Апресян Ю.Д. Системообразующие смыслы ‘знать’ и ‘считать’ в русском языке // Русский язык в научном освещении. 2001, № 1, 5-26.

6.      Апресян Ю.Д. Значение и употребление // ВЯ. 2001, № 4, 3-22.

7.      Апресян Ю.Д. Синонимия предикатов группы ждать // Слово. Юбилеен сборник, посветен на 70-годишнината на проф. Ирина Червенкова. София, 2001, 16-32.

8.      Апресян Ю.Д. «Русский синтаксис в научном освещении» в контексте современной лингвистики // А. М. Пешковский. Русский синтаксис в научном освещении. Издание 8-е. Языки славянской культуры, М.: 2001, III-XXXIII.

9.      Апресян Ю.Д. Восхищение и восторг: сходства и различия // Традиционное и новое в русской грамматике. Сборник статей памяти В. А. Белошапковой. М.: «Индрик», 2001, 94-106.

10. Апресян Ю.Д. От значений к несемантическим свойствам лексем: знание и мнение // Русский язык: пересекая границы. Дубна, 2001, 7-18.

11. Апресян Ю.Д., Ботякова В.В., Латышева Т.Э. и др. Англо-русский синонимический словарь. М.: Русский язык, 2001, изд. 6-е, стереотипное, 543 с.

12. Апресян Ю.Д., Иомдин Л.Л., Медникова Э.М., Петрова А.В. и др. Новый большой англо-русский словарь. М.: Русский язык, 2001. Изд. 6-е, стереотипное. Т. I, 832 с., Т. II, 828 с., Т. III, 824 с.

13. Богуславский И.М. Об одной загадке языка Пушкина // A. S. Puškin und die kulturelle Identität Russlands / Gerhard Ressel (Hrsg.). – Frankfurt am Main; Berlin; Bern; Bruxelles; New York; Oxford; Wien: Lang, 2001, S. 133-144.

14. Богуславский И.М. Модальность, сравнительность и отрицание. // Русский язык в научном освещении. 2001, № 1, 27-51.

15. Григорьева С.А. Степень и количество // Труды Международного семинара Диалог'2001 по компьютерной лингвистике, Аксаково, 2001, с. 68-75.

16. Григорьева С.А., Григорьев Н.В., Крейдлин Г.Е. Словарь языка русских жестов. Языки русской культуры // Wiener Slawistischer Almanach Sonderband 49, Москва-Вена, 2001, 230 с.

17. Крейдлин Г.Е., Фрид Н.Е. Вслух про себя (семантика и синтаксис одной русской частицы) // Лингвистика на рубеже эпох: идеи и топосы. Сборник статей. М.: РГГУ, 2001, с. 46-67.

In print

1.      Jurij D. Apresjan. Principles of Systematic Lexicography // In Honour of B. T. S. Atkins (in print).

2.      Boguslavsky I. UNL from the linguistic point of view (in print).

3.      Boguslavsky I. Even in discourse: Interaction of lexical meanings and interpretation strategies (in print).

4.      Iomdin L., Carl M., Pease C., Streiter O. Towards a Dynamic Linkage of Example-Based and Rule-Based Machine Translation // MT (in print).

5.      Апресян Ю.Д. О лексических функциях семейства REAL – FACT // Сборник в честь Z. Saloni (в печати).

6.      Апресян Ю.Д. Наказание в языковой картине мира // Сборник в честь Анджея Богуславского (в печати).

7.      Апресян Ю.Д. Системность лексики: семантические парадигмы и семантические альтернации // Сборник в честь С. Кароляка (в печати).

8.      Григорьева С.А. Словарная статья синонимического ряда ПОЛНОСТЬЮ (в печати).

9.      Григорьева С.А. Словарная статья синонимического ряда ПОЧТИ (в печати).

10. Григорьева С.А. Словарная статья синонимического ряда ЧАСТИЧНО (в печати).

11. Григорьева С.А. Словарная статья синонимического ряда ВРЯД ЛИ (в печати).

12. Иомдин Л.Л. Синтаксические особенности фразеологических единиц: новые подробности // Сборник статей в честь 70-летия проф. А. Богуславского (в печати).