Speaker verification package and its use for pronunciation training

Viatcheslav Makhonin

Laboratory of Data Analysis, Error Correction Codes and Cryptology
Institute for Information Transmission Problems (IPPI RAN)
19, Bol. Karetny pereulok
GSP-4 Moscow, 101447
RUSSIAN FEDERATION

The method used to obtain the pronunciation representation is based on visualization: the PC display shows information about the similarity between the polytonal wavelet modulation spectra of the current speech signal and the nearest reference template (etalon) from a stored set of pronunciation templates.

The purpose of the research is to find the best ways for a learner to interact with a training computer during computer-assisted pronunciation training in foreign languages. The interaction takes place in a multimedia virtual environment that allows simultaneous visualization of synthetic images of templates pronounced by announcers with a perfect command of the language under study and synthetic images of the sounds pronounced by the learner, displayed together with the instructions and recommendations produced by the training program.

The problems that arise are more complex than in conventional trial-and-error approaches, in which the learner tries to produce current spectra, similar to sonograms, that resemble those obtained from successful pronunciations of selected foreign words and phrases. The added complexity comes from the need to choose the most successful attempt from the dialogue between person and machine, both when training with templates and when imitating the sound of foreign speech from memory.

The first group of technological problems concerns the choice of the senses involved in the interaction. The interface of a virtual environment need not limit man-machine interaction to sounds and to displays of faces, as in lip reading; it can also appeal to the vibratory or electrosomatic senses, or even to senses of a person that have not yet been investigated.

The second group of technological problems concerns the choice of the dimensions of the signals and of the codes used for their successive discrete displays after transcoding. The range here is extremely wide, from arrays of binary codes of sample sequences, through phoneme and syllable units, down to gradations of meaning.

In the proposed approach the system uses polytonally modulated formant characteristics of a person's utterances to calculate the parameters of that person's individual physiological articulation abilities. Templates synthesized in multimedia will take these individual wavelet parameters into account, which should increase training performance and reduce computational cost.

The hardware for the first stage of the experiments may require the manufacture of several unique mask designs with built-in converters of the forms of information representation. The converters will be connected to a computer, which at this stage may also need to be fairly powerful.

The second stage, after the results of the first have been processed, will be directed at preparing hardware and software for mass-market personal computers. The polytonal modulation spectra of Walsh wavelets are filtered through a two-layer perceptron; some of its receptive fields are shown below. In these pictures the numbers are the indices of the elements within the receptive field, and the outlined elements are those whose modulation products are collected. At the second perceptron layer, only ganglions that have collected more than their neighbors are marked with the number of the dominating receptive field; the others are not marked with any integer.

 __________________________     __________________________
|  _____                   |   |              __________  |
| |  13 |  14   15   16    |   |   13    14  | 15    16 | |
| |     |                  |   |        _____|__________| |
| |  9  |  10   11   12    |   |    9  | 10  | 11    12   |
| |_____|_____             |   |   ____|_____|            |
|    5  |   6 |  7    8    |   |  | 5  |  6     7     8   |
|       |_____|__________  |   |  |    |                  |
|    1     2  |  3    4  | |   |  | 1  |  2     3     4   |
|             |__________| |   |  |____|                  |
|__________________________|   |__________________________|
   1st receptive field            2nd receptive field
 __________________________     __________________________
|                   ____   |   |  __________              |
|    13    14   15 | 16 |  |   | | 13    14 |  15    16   |
|                  |    |  |   | |__________|_____        |
|    9     10   11 | 12 |  |   |    9    10 |  11 |  12   |
|               ___|____|  |   |            |_____|_____  |
|    5      6  | 7 |  8    |   |    5     6     7 |   8 | |
|   ___________|___|       |   |                  |     | |
|  | 1     2   | 3    4    |   |    1     2     3 |   4 | |
|  |___________|           |   |                  |_____| |
|__________________________|   |__________________________|
   3rd receptive field            4th receptive field
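
The marking rule described above can be illustrated with a minimal sketch. It is not taken from the package: the element sets are one possible reading of the outlined cells in the four pictures, "neighbors" is assumed to mean the competing receptive fields over the same 16 elements, and the names RECEPTIVE_FIELDS, collect and dominating_field are hypothetical.

```python
# Element indices (1..16, numbered as in the pictures) read off the outlined
# cells of each receptive field. This reading of the pictures is an assumption.
RECEPTIVE_FIELDS = {
    1: (13, 9, 6, 3, 4),
    2: (15, 16, 10, 5, 1),
    3: (16, 12, 7, 1, 2),
    4: (13, 14, 11, 8, 4),
}


def collect(products, field):
    """Sum the modulation products of the elements that belong to one field."""
    return sum(products[i - 1] for i in RECEPTIVE_FIELDS[field])


def dominating_field(products):
    """Return the number of the field that collected strictly more than all
    the others, or None when no single field dominates (cell stays unmarked)."""
    totals = {f: collect(products, f) for f in RECEPTIVE_FIELDS}
    best = max(totals.values())
    winners = [f for f, total in totals.items() if total == best]
    return winners[0] if len(winners) == 1 else None


# Example: 16 modulation products with energy only on the elements of field 1.
products = [0] * 16
for index in RECEPTIVE_FIELDS[1]:
    products[index - 1] = 1
print(dominating_field(products))  # field 1 collects the most here
```

Returning None for a tie mirrors the statement that a ganglion which does not collect more than its neighbors is not marked with any integer.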

Two examples of modulation spectral representations processed in this way are shown below. The first represents the short Russian phrase [lena mila malinu] pronounced by a male speaker; the second represents the short French phrase "maurice berce son enfant" pronounced by a female speaker. Cells whose collected modulation products are smaller than the threshold are marked "#". The upper line consists of intensities accumulated over the different Walsh functions and the different tones, which are taken from the speaker's register. The lines in the middle show the numbers of the selected fields for cells above the corresponding thresholds and the sign "#" for cells below them. Analogously, the lower part of the table represents the modulation products collected from the different tones.

 Inten. 3321121133224223422321431##

        ###########################
 Walsh  ###########################
        ###########################
        ##1######55##5#####1#######
        7#4#5577#127427##5#77#75###
        6###47#####2#######77##5###
        ###########################
 Spect. ###########################
        ###########################
        ###########################
        ###########################
        ###########################

        ###########################
        ###########################
 High   ###########################
        ###########################
        ###########################
    T   ###########################
    O   #######3#2#################
    N   #########5#################
    E   ######3##1#################
    S   #####4#####################
        #####7###########5#########
        ###########################
 Low    ###########################
        5##########################
         beg.      TIME        end

The reasons for using Walsh spectral representations are connected not only with their ability to accelerate calculations but also with the ability of some of them to represent discontinuous acoustical events of speech, first described in ULB R.A. 12/1 [1]. These methods of representing discontinuous acoustical events of speech are described in detail in the software package and in the patent [3, 2].
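
The computational advantage of Walsh representations mentioned here can be illustrated with the standard fast Walsh-Hadamard transform, which needs only additions and subtractions. This is a generic textbook sketch, not the code of the package: the function name fwht is hypothetical, the output is in natural (Hadamard) order, and whether the package uses this ordering or this algorithm is not stated in the text.

```python
def fwht(samples):
    """Fast Walsh-Hadamard transform, natural (Hadamard) order, unnormalised.

    Uses n * log2(n) additions and subtractions and no multiplications.
    """
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(samples)
    h = 1
    while h < n:
        # Butterfly step: combine blocks of length h pairwise.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


frame = [1, 0, 1, 0, 0, 1, 1, 0]
spectrum = fwht(frame)
print(spectrum)  # [4, 2, 0, -2, 0, 2, 0, 2]

# Applying the transform twice returns the input scaled by its length.
print(fwht(spectrum) == [len(frame) * v for v in frame])  # True
```

Because the basis functions take only the values +1 and -1, a sudden onset or break in the signal can be matched by a single basis function, which is one common argument for using them with discontinuous events.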

Images obtained after acoustic signal processing are compared with images formed earlier and saved on the hard disk. The nearest one, selected by its features, is displayed together with the new one. This lets the package user assess his pronunciation samples by comparison with the reference (etalon) ones.
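
The selection of the nearest stored image can be sketched as follows. The text does not say which features or which distance the package uses, so this sketch assumes that an image is a list of equally long rows of marks like those in the tables and that the distance is the number of cells that differ; the names distance and nearest_template are hypothetical.

```python
def distance(image_a, image_b):
    """Count the cells in which two equally sized images differ.

    This cell-by-cell (Hamming) distance is an assumption; the package may
    compare other features.
    """
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("images must have the same shape")
    return sum(
        cell_a != cell_b
        for row_a, row_b in zip(image_a, image_b)
        for cell_a, cell_b in zip(row_a, row_b)
    )


def nearest_template(new_image, templates):
    """Return (name, distance) of the stored template closest to new_image."""
    name = min(templates, key=lambda key: distance(new_image, templates[key]))
    return name, distance(new_image, templates[name])


# Two tiny stored images and one new image, using the marks from the tables.
stored = {
    "a": ["#1#", "###"],
    "b": ["#7#", "5##"],
}
print(nearest_template(["#7#", "5#2"], stored))  # ('b', 1)
```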

In the second example the data representation is the same; only the range of tones is shifted to match the higher voice of the speaker. The Speaker Verification Package was used by the philologist H. A. Khatemlyanskaya in her research on the French consonant system. We tried to carry out experimental phonetic research to determine the importance of the prosody factor, because this factor influences the assimilation process. The experimental data show that our Speaker Verification Package could be used in complex experimental phonetic research, for example to establish what place the prosody factor occupies in the assimilation process or how variable consonants are.

 Int.  1#########1###############

       ##########################
 Walsh ##########################
       ##########################
       #####5##1#####1###########
       74###5#84#8527############
       ##264##377562#############
       ########2646##############
       ##########################
       ##########################
 Spect.##5#######################
       4#5#######################
       742#######################

       ##########################
 High  ##########################
       ##########################
     T ##########################
     O ##########################
     N ##########################
     E ##########################
     S ##########################
       ##########################
       ##########################
       ##########################
       ##5#######################
       ##########################
   Low ##########3###############
       ######5##3#7##############
       2##3######26##############
        beg.     TIME        end
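
The marking convention used in both tables (a field number where the collected modulation products reach the threshold, "#" elsewhere) can be sketched in a few lines. The function name render_row, its arguments and the example values are hypothetical; the actual thresholds of the package are not given in the text.

```python
def render_row(collected, field_numbers, threshold):
    """Render one table row from per-cell collected values.

    A cell shows its selected field number when the collected value is not
    smaller than the threshold and a field was selected; otherwise it shows "#".
    """
    return "".join(
        str(field) if field is not None and value >= threshold else "#"
        for value, field in zip(collected, field_numbers)
    )


# Three cells: the first is below the threshold, the other two are not.
print(render_row([0.1, 0.9, 0.5], [3, 7, 2], threshold=0.5))  # #72
```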

In the detailed planning of these stages, the experience of French researchers on cognitive attitudes in problem-solving situations will also be used. Their research was undertaken last year at the Institut de la communication parlee in Grenoble (Les Cahiers de l'ICP, Rapport de Recherche N 3, "Attitude cognitives et actes de langage en situation de communication homme / machine", Jean Caelen, Anne-Lise Frechet, p. 137-152). The results of Swedish researchers (Bjorn Granstrom et al., Royal Institute of Technology (KTH), STL-QPSR, Oct. 15, 1994, p. 93-111) will also be used.

It should be noted, however, that the stability of perception and prerecognition of these intermediate messages, generated while the threshold levels of the transmitted signal modulations are continuously decreased, should serve as the basis for choosing the signal dimensions of the intermediate displays. Such displays are suitable for constructing messages to be recognized, as has been observed in extensive experience of listening to the speech of an individual announcer against an intense noise background. This choice corresponds to the law of perception of N. N. Lange, which has already been used successfully in processing optical information with fixation of prerecognition images.

Learning a speaking skill can be modelled in multimedia as a sequence of cognitive psychology events, by specifying how neural tissue carries out computations while students attempt to imitate the pronunciation of a short phrase reproduced acoustically by the computer as a template. Visual cues of the student's pronunciation, presented on the display screen together with the template features, are an important factor in judging the perceived similarity of pronounced sentences to their templates; the graphics are based on polytonally modulated Walsh wavelets. The problem to be investigated is how to find the best cue for the student's optical, acoustic and articulatory activity, both for memorizing during training and for recall during examination.

The research on a cognitive multimedia model for computer-assisted learning of a speaking skill is planned to be carried out in the Laboratory of Data Analysis, Error Correction Codes and Cryptology of IPPI RAN together with researchers from Moscow Pedagogical University, and it is open to any scientific cooperation with Western researchers.

References

[1] V. Makhonin. On the representation of discontinuous speech acoustical events. Rapport d'activites de l'institut de phonetique, ULB, R.A. 12/1, Bruxelles, 1978.
[2] Patent of the Russian Federation N 1700584 for the invention "The method of measurement of index of transmission of speech", date of patent August 2, 1993, assignee IPPI, Academy of Sciences of Russia.
[3] Dialogue program package for the verification of pronunciation, registration N 50930000119, developers V.A. Makhonin, I.A. Orlov, A.N. Pirojenko, S.N. Krinov and I.V. Shleifman.
