The aim of this work was to develop a new method for blind reverberation time estimation that also takes the characteristics of the speech signal into account. Every algorithm for blind reverberation time estimation consists of two basic components: a method for finding suitable segments of the received speech signal that exhibit energy decay, and an optimal estimator of the parameter. This work attempted to improve both. Accordingly, the existing maximum-likelihood estimator of reverberation time was compared with the proposed estimator based on the method of least squares, using three sets of test signals. The comparison showed that, at a comparable level of accuracy and precision, the proposed estimator computes the parameter value significantly faster. To improve the selection of speech signal segments suitable for reverberation time estimation, an analysis not previously presented in the literature was carried out, examining the relationship between the accuracy of reverberation time estimation and the phonetic content of the word whose energy decay sequence the estimation is based on. Using a logatome speech corpus together with measured impulse responses, impulse responses generated according to the statistical model of late reverberation, and impulse responses obtained by the ray-tracing method, the influence of speaking style and speaker sex on estimation accuracy was examined in addition to the influence of phonetic content. Based on the insight gained from these analyses and on the criterion for linearity of the energy decay curve set out in the ISO 3382 standard, an improved method was developed for selecting the energy decay segments from which the reverberation time is estimated. Comparing the results obtained with the improved segment selection against those obtained with the baseline selection method, using the proposed estimator, showed an increase in estimation accuracy for both synthesised and measured impulse responses.
Abstract (English)
An estimate of the reverberation time (RT) of an enclosed space is needed in many applications, such as automatic speech recognition systems, hearing aids and teleconference systems, in order to enhance, either objectively or subjectively, the speech signal acquired at electroacoustic sensors. In practice, because of the wide range of possible geometries and acoustic qualities of the boundaries of an enclosure, the RT value has to be obtained blindly, i.e. from the received speech signal only. Current methods for blind RT estimation based on the time-domain statistical model of late reverberation do not take into account how the speech signal departs from the standardised signals used for RT measurement. Therefore, the first goal of the present research was to analyse whether a systematic relationship exists between the blindly obtained RT estimate and the phonetic content of the uttered words, as well as the speaking style used. The second goal was to increase the accuracy and precision of a state-of-the-art method for blind RT estimation, based on the conclusions drawn from the results of these analyses. In Chapter 1, the problem of blind RT estimation from received speech signals is presented, together with the motivation for the research performed. The main goals and the scientific contributions to be achieved are then stated. In Chapter 2, a concise introduction is given to the acoustic characterisation of enclosures and to standardised methods for reverberation time measurement. The chapter ends with examples showing the relationship between the RT value and the change in the modulation transfer function (MTF), which is directly related to the level of speech signal degradation. A comprehensive review of the methods used for blind reverberation time estimation is presented in Chapter 3, with the methods grouped by their level of complexity.
The methods are discussed from the standpoints of the signal pre-processing step used, the estimator chosen for RT estimation and, most significantly, the accuracy of the methods, as well as the speech corpora and acoustic impulse response databases used for algorithm assessment. Finally, the limitations of current approaches to assessing algorithm accuracy are discussed, leading to the conclusion that a systematic analysis of the influence of the speech signal on RT estimation accuracy is needed. In Chapter 4, the database consisting of both synthesised and measured acoustic impulse responses is presented, and the characteristics of the logatome speech corpus used to assess the influence of phonetic content and speaking style on RT estimation error are explained in detail. Furthermore, based on knowledge of the differences in the production mechanisms used to articulate different phoneme types, the logatomes are grouped into mutually exclusive classes, and the long-term average spectra (LTAS) of the logatome-terminating phonemes are determined. The method used for obtaining segments of energy decay from the recorded speech signal and the RT estimator based on least squares are explained in detail. Finally, based on limitations stemming from both the acoustic properties of rooms and the time-frequency distribution of speech signal energy, a restriction is set on the lowest octave band in which RT estimation can be carried out. The results of the analyses, which reveal a systematic relationship between the phonetic composition of the logatome ending and the speaking style used on the one hand, and estimation accuracy and precision on the other, are presented and discussed in Chapter 5. This relationship has been assessed both at the level of the speaker sample and at the level of a single speaker, for synthesised and measured impulse responses separately, and the results agree across these cases.
In Chapter 6, additional analyses are presented. They indicate, first, that the speech sounds preceding the last vowel in a logatome have no influence on the accuracy of the RT estimate obtained at the termination of the uttered word and, second, that short RTs can be estimated with high accuracy inside a word during certain phoneme transitions. In Chapter 7, the improved method for selecting the speech signal segments used for RT estimation, based on the observations from the preceding chapters, is presented. In Chapter 8, the accuracy improvement achieved with the modified method of segment selection is presented. In this thesis, a database of recordings of one- and two-syllable words uttered by speakers of both sexes has been used to assess the influence that the phonetic content of a word and the speaking style (loud, soft, fast, slow, normal and questioning) have on the level of RT estimation error for short reverberation times (below 0.5 s). An estimator based on least squares has been proposed and compared with the maximum-likelihood estimator in terms of accuracy, precision and processing time. It was observed that the order and type of the consonants following the final vowel in a word, the subglottal pressure used (correlated with speaking style) and the fundamental frequency of vocal fold vibration all influence RT estimation accuracy. Logatomes ending with a vowel followed by a plosive enabled high-accuracy estimation of RT regardless of the speaking style or octave band in question. On the other hand, words ending with a vowel followed by a fricative produced large estimation errors in octave bands from 1 kHz upwards. Based on these observations, the method for selecting the speech segments from which the RT is estimated was modified in accordance with the ISO 3382 standard, which reduced the level of estimation error.
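As a rough illustration of the idea behind a least-squares RT estimate, the following Python sketch fits a straight line to an energy decay curve in decibels and converts the slope to the time needed for a 60 dB decay. It is not the estimator developed in the thesis, whose details are not given in this abstract. The function names, the synthetic test signal, the use of Schroeder backward integration and the -5 dB to -25 dB fitting range are all assumptions made for the example, and the correlation coefficient is returned only as a simple indicator of how linear the decay is.

```python
import numpy as np


def synth_decay(rt, fs=16000, dur=1.0, seed=0):
    """Hypothetical test signal: white noise whose energy falls 60 dB in `rt` seconds."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(dur * fs)) / fs
    # An amplitude ratio of 1000 corresponds to a 60 dB drop in energy.
    envelope = np.exp(-np.log(1000.0) * t / rt)
    return rng.standard_normal(t.size) * envelope


def schroeder_db(x):
    """Energy decay curve in dB via backward integration of the squared signal."""
    edc = np.cumsum(x[::-1] ** 2)[::-1]
    return 10.0 * np.log10(edc / edc[0])


def ls_rt(edc_db, fs, hi=-5.0, lo=-25.0):
    """Least-squares line fit over [lo, hi] dB; returns (RT in seconds, correlation r)."""
    idx = np.where((edc_db <= hi) & (edc_db >= lo))[0]
    t = idx / fs
    slope, _intercept = np.polyfit(t, edc_db[idx], 1)  # slope in dB per second
    r = np.corrcoef(t, edc_db[idx])[0, 1]  # close to -1 for a nearly linear decay
    return -60.0 / slope, r


if __name__ == "__main__":
    fs = 16000
    x = synth_decay(rt=0.3, fs=fs)
    rt_est, r = ls_rt(schroeder_db(x), fs)
    print(f"estimated RT: {rt_est:.3f} s, correlation: {r:.4f}")
```

In a blind setting the decay would come from a selected segment of reverberant speech rather than from a synthetic envelope, and a segment whose decay is far from linear (correlation well away from -1) could be rejected before estimation.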