The intellectual property of the theses is jointly owned by the University of Crete and FORTH, where indicated.
2023
Panagiotis Pantalos,
Exploration of Non-Stationary Speech Protection
for Highly Intelligible Time-Scale Compression [PDF] - funded by IACM-FORTH
Abstract: Speech recordings are everywhere, from social media, YouTube, and online learning to
podcasts and audiobooks. In today’s fast-paced world, it is sometimes necessary to speed
up speech recordings in order to promote faster information consumption. A population
group that benefits the most from such technologies is visually impaired individuals who
employ screen reading on their mobile phones. A series of algorithms have been developed
for the time-scale expansion or compression of speech recordings. It is well known that
fast speech, also known as time-scale compressed speech, is less intelligible due to a loss
of speech parts that are important in distinguishing syllables and words. The majority of
these parts are non-stationary in nature, such as transient sounds, plosives, and fricatives.
In this work, we investigate algorithms for non-stationary speech protection in order
to provide highly intelligible time-scale compression. We base our experiments on the so-called Waveform
Similarity Overlap-and-Add (WSOLA) method of time-scale compression.
WSOLA is capable of providing both uniform and non-uniform time-scale compression. We
propose to characterize speech waveforms according to their non-stationarity using simple
time and frequency domain criteria. Utilizing a frame-by-frame analysis, the first criterion
(C1) is based on the RMS energy of each frame. Additionally, we implement a Line Spectral
Frequency (LSF)-based criterion, named C2, and in combination with C1, we end up with
a hybrid non-stationarity detection criterion named C3. C1 and C3 are implemented on
a dataset of Greek speech recordings named GrHarvard.
The latter consists of 720 sentences,
recorded by speakers of both genders, that form 72 phonemically balanced lists of 10 sentences each.
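For illustration, a minimal sketch of how an RMS-energy criterion in the spirit of C1 might flag non-stationary frames is given below; the frame length, hop size, and threshold rule are illustrative assumptions rather than the exact settings of the thesis. Frames flagged this way would be protected (compressed less), while stationary frames absorb most of the time-scale compression.

```python
import numpy as np

def c1_rms_mask(x, frame_len=512, hop=256, ratio_thresh=2.0):
    """Frame-by-frame RMS energy and a crude non-stationarity flag: a frame is
    marked non-stationary when its energy jumps (up or down) by more than
    ratio_thresh relative to the previous frame. Frame length, hop and the
    threshold rule are illustrative, not the settings used in the thesis."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop: i * hop + frame_len]
        rms[i] = np.sqrt(np.mean(frame ** 2) + 1e-12)
    change = np.maximum(rms[1:] / rms[:-1], rms[:-1] / rms[1:])
    return rms, np.concatenate(([False], change > ratio_thresh))
```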
Intelligibility and preference experiments were performed on four of the GrHarvard
lists involving both sighted and visually impaired individuals. Subsequently, a statistical
analysis was carried out to assess the significance of the differences in the results obtained
from both experiments’ tests. In the first experiment, we conducted a comparative analysis
involving uniform WSOLA, non-uniform C1-based WSOLA, and non-uniform C3-based
WSOLA. The principal objective was to assess whether the incorporation of protective
measures had a positive or negative impact on the intelligibility of speech signals. The
findings consistently demonstrated that C1-based WSOLA outperformed the others in both
intelligibility and user preference. It was followed by C3-based WSOLA, with uniform
WSOLA ranking last. In this experiment, characterized by substantial differences, the
majority of observed variations were found to be statistically significant. In the second
experiment, our objective was to assess the same three methods under equal words per
minute (WPM) conditions. This made it challenging for users to distinguish between
different methods and resulted in more uniform outcomes. Differences primarily stemmed
from variations within the signals, related to the sizes of their stationary and non-stationary
parts. Even though the C1-based method tended to achieve the highest intelligibility
(in most cases except at 0.25), it remained challenging to definitively determine which
method was superior in both preference and intelligibility tests. Yet, despite our initial
expectations of better performance in the results of the visually impaired group compared
to the control group, such variations did not materialize, mainly due to the limited number
of visually impaired participants willing to participate in our tests. Consequently, all of
these challenges led the majority of observed results not to attain statistical significance,
even though a discernible pattern was occasionally evident among the methods.
Future work may include further parameter tuning of the stationarity detection algorithm.
As an example, different lengths of analysis and hop frames can be used, as
well as pitch-synchronous analysis in stationary parts of speech. Furthermore, the base method used for time-scale compression can be replaced by other, more complex
models (such as the Harmonic+Noise model). Finally, further experiments
- including a larger sample of visually impaired people - could strengthen statistical
conclusions about the performance of each method.
2019
Irene Sisamaki,
End-to-End Neural based Greek Text-to-Speech Synthesis [PDF]
Abstract: Text-to-speech (TTS) synthesis is the automatic conversion of written text to spoken
language. TTS systems play an important role in natural human-computer
interaction. Concatenative speech synthesis and statistical parametric speech synthesis
were the prominent methods used for decades. In the era of Deep learning,
end-to-end TTS systems have dramatically improved the quality of synthetic
speech. The aim of this work was the implementation of an end-to-end neural
based TTS system for the Greek Language. The neural network architecture of
Tacotron-2 is used for speech synthesis directly from text. The system is composed
of a recurrent sequence-to-sequence feature prediction network that maps character
embeddings to acoustic features, followed by a modified WaveNet model acting
as a vocoder to synthesize time-domain waveforms from the predicted acoustic features.
Developing TTS systems for any given language is a significant challenge and
requires large amounts of high-quality acoustic recordings.
Because of this, these systems are only available for the most commonly and widely spoken languages.
In this work, experiments are described for various languages and databases which
are freely available. A Greek database, initially created for speech recognition, has
been obtained from ILSP (Institute for Language and Speech Processing). In our
first experiment, only 3 hours of recorded speech in Greek have been used. Then
the technique of language adaptation has been applied, using 3 hours in Greek and
18 hours in Spanish. We also have applied speaker adaptation in order to produce
speech with specific speakers from our database. Our TTS system for Greek can
generate good-quality speech with very natural prosody. An evaluation with a
listening test by 30 volunteers gave a score in MOS (Mean Opinion Score) of 3.15
to our model and 3.82 to the original recordings.
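For reference, the MOS figures quoted above are simply averages of listener ratings on a 1-5 scale; a minimal sketch of how such a score could be computed from the collected ratings follows, where the confidence-interval calculation is an illustrative addition rather than something reported in the thesis.

```python
import numpy as np

def mean_opinion_score(ratings):
    """MOS: mean of listener ratings on a 1-5 scale, with a rough 95%
    confidence interval (normal approximation); e.g. one call for the
    synthetic stimuli and one for the original recordings."""
    r = np.asarray(ratings, dtype=float)
    mos = r.mean()
    ci = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
    return mos, ci
```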
2015
Theodora Yakoumaki,
Expressive speech analysis and classification using adaptive sinusoidal modeling [PDF] - funded by ICS-FORTH
Abstract: Emotional (or stressed/expressive) speech can be defined as the speech style produced by
an emotionally charged speaker. Speakers that feel sad, angry, happy and neutral put a certain
stress in their speech that is typically characterized as emotional. Processing of emotional speech
is considered among the most challenging speech styles for modelling, recognition, and classification.
The emotional condition of speakers may be revealed by the analysis of their speech,
and such knowledge could be effective in emergency conditions, health care applications, and as
pre-processing step in recognition and classification systems, among others.
Acoustic analysis of speech produced under different emotional conditions reveals a great
number of speech characteristics that vary according to the emotional state of the speaker. Therefore
these characteristics could be used to identify and/or classify different emotional speech
styles. There is little research on using the parameters of the Sinusoidal Model (SM), namely amplitude,
frequency, and phase, as features to separate different speaking styles. However, the
estimation of these parameters is subject to an important constraint: they are derived under
the assumption of local stationarity, that is, the speech signal is assumed to be stationary inside
the analysis window. Nonetheless, speaking styles described as fast or angry may not hold this
assumption. Recently, this problem has been handled by the adaptive Sinusoidal Models (aSMs),
by projecting the signal onto a set of amplitude and frequency varying basis functions inside the
analysis window. Hence, sinusoidal parameters are more accurately estimated.
In this thesis, we propose the use of an adaptive Sinusoidal Model (aSM), the extended adaptive
Quasi-Harmonic Model (eaQHM), for emotional speech analysis and classification. The
eaQHM adapts the amplitude and the phase of the basis functions to the local characteristics of
the signal. Firstly, the eaQHM is employed to analyze emotional speech into accurate, robust, continuous,
time-varying parameters (amplitude and frequency). It is shown that these parameters
can adequately and accurately represent emotional speech content. Using a well known database
of pre-labeled narrowband expressive speech (SUSAS) and the emotional database of Berlin, we
show that very high Signal to Reconstruction Error Ratio (SRER) values can be obtained, compared
to the standard Sinusoidal Model (SM). Specifically, eaQHM outperforms SM on average
by 100% in SRER. Additionally, formal listening tests, on a wideband custom emotional speech
database of running speech, show that eaQHM outperforms SM from a perceptual resynthesis
quality point of view. The parameters obtained from the eaQHM models can represent more
accurately an emotional speech signal. We propose the use of these parameters in an application
based on emotional speech, the classification of emotional speech. Using the SUSAS and
Berlin databases we develop two separate Vector Quantizers (VQs) for the classification, one for
amplitude and one for frequency features. Finally, we suggest a combined amplitude-frequency
classification scheme. Experiments show that both single and combined classification schemes
achieve higher performance when the features are obtained from eaQHM.
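For reference, the Signal-to-Reconstruction Error Ratio used above to compare eaQHM and SM has the standard form 20·log10(‖x‖ / ‖x − x̂‖); a minimal computation sketch follows (whole-signal version; the thesis may average it differently across frames or utterances).

```python
import numpy as np

def srer_db(x, x_hat):
    """SRER in dB: higher values mean the model (eaQHM or SM) reconstructs
    the analyzed waveform x more faithfully with its estimate x_hat."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return 20.0 * np.log10(np.linalg.norm(x) / (np.linalg.norm(x - x_hat) + 1e-12))
```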
Sofia-Elpiniki Yannikaki,
Voicing detection in spontaneous and real-life recordings from music lessons [PDF] - funded by ICS-FORTH
Abstract: Speech is one of the most important abilities that we have, since it is one of the principal
ways of communication with the world. In the past few years a lot of interest has been
shown in developing voice-based applications. Such applications involve the isolation of
speech from an audio file. The algorithms that achieve this are called Voice Detection
algorithms. From the analysis of a given input audio signal, the parts containing voice are
kept while the other parts (noise, silence, etc) are discarded. In this way a great reduction
of the information to be further processed is achieved.
The task of Voice Detection is closely related to Speech/Nonspeech Classification. In
addition, Singing Voice Detection and Speech/Music Discrimination can be seen as subclasses
of what we generally call Voice Detection. When dealing with such tasks, an audio
signal is given as an input to a system and is then processed. The signal is usually analysed
in frames, from which features are extracted. The frame duration depends mostly on the
application and sometimes on the features being used. Many features have been proposed
until now. There are two categories in which the features could be divided, time domain
and frequency domain features. In the time domain, the short-time energy, the zero-crossing
rate and autocorrelation-based features are most often used. In the frequency domain, cepstral
features are most frequently used, due to the useful information they carry about speech presence.
To be more specific, in Singing Voice Detection and in Speech/Music Discrimination the
state-of-the-art features are the Mel-Frequency Cepstral Coefficients (MFCCs). It has been reported
that this particular feature provides the best performance in the majority of the cases.
In this thesis an algorithm is developed that performs voice detection in spontaneous
and real-life recordings from music lessons. The content of the recordings was such that
the proposed algorithm was challenged to discriminate both speech and singing voice from
music and other noises. A classic approach for this problem would use MFCCs as the
discrimination feature and an SVM classifier for the classification into “speech” or “nonspeech”.
In our work the methodology of this approach is expanded by preserving the
MFCCs as the main feature and incorporating three other features, namely the Cepstral
Flux, the Clarity and the Harmonicity. Cepstral Flux is extracted from the Cepstrum,
while Clarity and Harmonicity are time-domain autocorrelation-based features. The goal
is to improve with these additional features the performance of the system that uses only
the MFCCs. So, different combinations of the three additional features with the MFCCs
were examined and evaluated. A 10-fold cross-validation is applied on segments, which are
labelled as “speech” or “nonspeech”. The database used for the training and the testing
purposes of our algorithm consists of three seminars. Two of them concern traditional
Cretan music classes with lira and the third one traditional Cretan music classes with lute.
Each recording has been carried out under different environmental conditions.
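A minimal sketch of the baseline pipeline described above (mean MFCCs per labelled segment, an SVM classifier, 10-fold cross-validation) is given below using librosa and scikit-learn; the additional Cepstral Flux, Clarity and Harmonicity features would simply be appended to the per-segment vector. File paths, sampling rate and SVM settings are illustrative assumptions.

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def segment_features(path, sr=16000, n_mfcc=13):
    """Mean MFCC vector for one labelled segment; cepstral flux, clarity and
    harmonicity values would be appended here in the extended system."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return mfcc.mean(axis=1)

def evaluate(segment_paths, labels):
    """10-fold cross-validation of an RBF SVM on 'speech' vs 'nonspeech'."""
    X = np.vstack([segment_features(p) for p in segment_paths])
    y = np.asarray(labels)          # 1 = speech, 0 = nonspeech
    return cross_val_score(SVC(kernel="rbf"), X, y, cv=10).mean()
```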
Performance evaluation was conducted using the Detection Error Tradeoff (DET) and Receiver
Operating Characteristic (ROC) curves as a visual evaluation tool. Also, the Equal
Error Rate (EER), the Efficiency and the Area Under the Curve (AUC) were computed in
each case. Each seminar was evaluated separately, as well as all together. A combination
of training and testing sets from different seminars was also used, in order to provide
reliable results. It is shown that the use of the additional features significantly enhances
the performance of the classic algorithm that uses only the MFCCs, by about 0.5% to
20%. Specifically, it is observed that three out of the five combinations stand out,
reducing the miss probability by about 20% at a false alarm probability of 5%.
2014
Olympia Simantiarki,
Voice quality assessment using phase information : application on voice pathology [PDF] - funded by ICS-FORTH
Abstract: One of the most important human abilities is speech, along with hearing. Speech is the primary
way in which we connect with society. Our voice can uncover a great deal of information about us
to other people. It reveals our energy level, our emotions, our personality and our artistry. Voice
abnormalities may cause social isolation or may create problems in the professional field. Due to
this significance of the voice, the early detection of a voice pathology is essential.
A well-known voice abnormality is called Spasmodic Dysphonia (SD). SD is a neurological
disease primarily affecting the regular contraction of the muscles around vocal cords, causing
their undesirable vibration. This abnormal vibration of muscles of the glottis has an impact on
speech. Someone who suffers from SD speaks more tremulously and produces disruptions during speech.
Similar indications also appear in normophonic speakers, usually related to stress, voice fatigue,
etc. Even for the normophonic cases, these indications may be a first symptom of a neurological
disease, so an early diagnosis is necessary. Therefore, algorithms that measure the intensity of
the symptoms are very useful.
Traditional methods that detect and quantify voice pathologies use the amplitude information
of the speech signal. More refined approaches require the isolation of the glottal source
signal, since the glottis is related to voice abnormalities. However, in both cases the amplitude-based
methods are not very reliable because the amplitude spectrum cannot capture characteristics of
the glottis. A better indicator of voice irregularities is the phase information. Nevertheless, very
few studies use the phase information because it is difficult to manipulate. Moreover,
studies that work with phase information use inverse filtering techniques for extracting the
glottal source signal and then extract features from the phase spectrogram of the glottal
source.
In this thesis, a novel phase-based method for voice quality assessment is presented.
The proposed method is less complex than the state-of-the-art methods, which use
inverse filtering for extracting the glottal source. Firstly, the instantaneous amplitudes, phases
and frequencies are estimated from the speech signal by an adaptive harmonic model. From the
instantaneous phases of the speech signal, through mathematical formulas, a new phase spectrum,
the Phase Distortion (PD) spectrum, is extracted, which is highly correlated with the shape of the glottal
source. From the time variance of the PD spectrum (PDD), a new metric called the Regularity Ratio
(RR) is proposed to capture the irregularities of the glottal source.
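A minimal sketch of the kind of phase processing involved is shown below, assuming the instantaneous phases of the harmonics are already available from the adaptive harmonic model; the simplified phase-distortion definition and the circular deviation follow the general phase-distortion literature and are not necessarily the exact formulas of the thesis (the Regularity Ratio itself is thesis-specific and not reproduced here).

```python
import numpy as np

def phase_distortion(inst_phase):
    """inst_phase: (K harmonics, N frames) instantaneous phases.
    Simplified phase distortion per frame: PD_k = phi_{k+1} - phi_k - phi_1."""
    phi = np.asarray(inst_phase)
    return phi[1:, :] - phi[:-1, :] - phi[0:1, :]

def phase_distortion_deviation(pd):
    """Circular standard deviation of PD over time, one value per harmonic
    pair; larger values indicate a more irregular glottal source."""
    r = np.abs(np.mean(np.exp(1j * pd), axis=1))
    return np.sqrt(-2.0 * np.log(r + 1e-12))
```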
Finally, the efficiency of our method is validated on a database containing speakers with SD
before and after the botulinum toxin injection. The results show that the obtained ranking is
highly correlated with the subjective evaluations provided by medical doctors not only on the
overall severity of SD but also on other features like tremor and jitter, revealing that our proposed
feature, the RR, can be applied to other voice pathologies.
Myron Apostolakis,
Development of interactive user interfaces for voice training aimed to children with hearing loss using web technologies in real time [PDF] - funded by ICS-FORTH
Abstract: International studies and global statistics have shown that 1.5% of children up to the age of 20 have reduced hearing ability, while 1 in 22 school-age children have hearing problems.
This means that there are currently about one million children with hearing problems in Europe, while in the USA 12,000 children are born with hearing loss every year. In Greece, hard-of-hearing
children are estimated at about 80,000. These figures place hearing loss first among the diseases of newborns. It is common for people with hearing loss to face problems at the level of
communication. Due to the lack of auditory feedback to the brain, the speech production system of these children does not develop normally. Since deaf individuals cannot hear their own speech,
they cannot tune their voices towards a more "correct" sound. In practice, they are unable to control the speech production organs (tongue, teeth, etc.) properly, because they cannot realize
what the correct way to do so is. As a result, they speak too loudly on vowels or produce consonants incorrectly. However, a person who lost his or her hearing at an older age has a greater
chance of speaking more correctly. We thus reach the general conclusion that everything is a matter of feedback. The purpose of this thesis is to introduce a new approach to speech therapy
tools based on the use of multimedia web technologies, taking into account the particular characteristics of people with hearing problems, so that they can acquire better communication skills.
The present approach exploits acoustic characteristics of speech, such as intensity, pitch and spectrograms, using them as visual feedback in order to teach a person with hearing loss how to
improve the control of his or her voice. More specifically, we have developed a web site where the user can log in and practice with a collection of online speech games. The technologies used
for the implementation of the games are Java, JavaScript, HTML5, CSS3 and frameworks such as Apache Shiro and Hibernate. The database used is MySQL, and XAMPP serves as the web server. The games
run in the web browser and interact with the user by analysing a distinct property of his or her voice in real time. Each game can be played under the supervision of a team of speech therapists,
or even by the users themselves from any location. The scores of each game are computed and sent to our web server, where they are stored and statistically processed. The users' performance over
time is then displayed in real time through graphs. The supervising speech therapists can use these graphs to identify possible weaknesses and adjust the goals of each game accordingly, with the
ultimate aim of further improving the user. Our collection of games was also presented to and evaluated by experienced users of similar software (speech therapists). Finally, we compare the
state-of-the-art technologies (HTML5, JavaScript) used for developing the games of this work with older ones, such as Java, in terms of their flexibility and their current performance.
2010
Christina-Alexandra Lionoudaki,
Determining glottal closure and opening instants in speech [PDF] - funded by ICS-FORTH
Abstract: Voice quality is a complex attribute of voice, but one important aspect arises from the regularity and duration of the
closed phase from vocal fold cycle to cycle. The determination of the closed phase requires the accurate detection of the glottal
closure (GCI) and glottal opening (GOI) instants. In the literature, many methods have been suggested in this direction, employing
either the Electroglottographic (EGG) or the speech signal.
This work presents a robust algorithm for the detection of glottal instants from the EGG signal and a study on the interaction between Amplitude-Frequency components of speech and glottal phases.
The determination of GCIs and GOIs is quite straightforward using Electroglottographic (EGG) signals. The derivative of the EGG
offers a simple way of detecting the important instants during the production of speech: the glottal closing and opening instants.
In this thesis we suggest an alternative method to the simple derivative, which is based on spectral methods. Spectral methods
provide an elegant way to conduct first and higher order derivatives on discrete time data, with high accuracy.
Furthermore, we introduce a new way to differentiate the EGG signal for estimating the main glottal instants. The gradient of the electroglottographic
signal is computed with a method referred to as "Slope Filtering". This approach proves to be robust in revealing the major peaks
in the slope filtered EGG signal, even in cases where the quality of the EGG recordings is not good. Contrary to the simple
derivative of the EGG signal, the peaks can be well distinguished and uniquely specified in the slope filtered signal.
The proposed method exhibits high accuracy over voiced segments, including the onset and offset regions. The derivation of
glottal phases from the speech signal has drawn great attention in recent years. A novel approach, relying on the speech signal,
is proposed based on a Quasi-Harmonic (QHM) representation of speech. The adaptive QHM algorithm estimates the instantaneous
AM-FM components of the speech signal. The extracted components, which are used for the reconstruction of the signal, are
correlated to the glottal phases generated from the EGG signal. The AM component follows a steady pattern for each glottal
phase, whereas the FM component shows low variations across various speakers.
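For context, a baseline detection of GCIs and GOIs from the plain EGG derivative (the simple approach the thesis improves upon with Slope Filtering) can be sketched as below; the polarity convention, peak-height factor and minimum-f0 spacing are illustrative assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def gci_goi_from_degg(egg, fs, min_f0=60.0):
    """Glottal closure/opening instants from the EGG derivative (dEGG).
    With the usual polarity, closures give strong positive dEGG peaks and
    openings give (weaker) negative peaks; min_f0 bounds the peak spacing."""
    degg = np.diff(np.asarray(egg, dtype=float))
    min_dist = int(fs / min_f0)                       # at most one event per cycle
    gci, _ = find_peaks(degg, distance=min_dist, height=0.3 * degg.max())
    goi, _ = find_peaks(-degg, distance=min_dist, height=0.3 * np.max(-degg))
    return gci, goi                                   # sample indices
```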
Maria Astrinaki,
Real time voice pathology detection using autocorrelation pitch estimation and short time Jitter estimator [PDF] - funded by ICS-FORTH
Abstract: Voice is the result of the coordination of the whole pneumophonoarticulatory apparatus.
Voice pathologies have become a social concern, as voice and speech play an important role in certain professions,
and in the quality of life of the general population. The analysis of the voice allows the identification of diseases
of the vocal apparatus and is currently carried out by an expert doctor through methods based on auditory analysis.
In recent years, emphasis has been placed on early pathology detection, for which classical perturbation measurements
(jitter, shimmer, HNR, etc.) have been used. Going one step further, the present work aims to implement a real-time voice
pathology detection system, combined with a Java interface.
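As an illustration of the building blocks mentioned above, a minimal offline sketch of autocorrelation-based pitch estimation and a short-time (local) jitter measure follows; the actual system is a real-time implementation with a Java interface, and the thresholds below are illustrative assumptions.

```python
import numpy as np

def pitch_period_autocorr(frame, fs, f0_min=60.0, f0_max=400.0):
    """Pitch period (in samples) of one frame from the autocorrelation maximum
    within the plausible lag range; returns None for (crudely) unvoiced frames."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return lag if ac[lag] > 0.3 * ac[0] else None

def local_jitter_percent(periods):
    """Local jitter (%): mean absolute difference of consecutive pitch periods
    over the mean period, a classical short-time perturbation measure."""
    p = np.asarray([q for q in periods if q is not None], dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(p))) / np.mean(p)
```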
George P. Kafentzis,
On the inverse filtering of speech [PDF] - funded by ICS-FORTH
Abstract: In all proposed source-filter models of speech production, Inverse Filtering (IF) is a well known technique
for obtaining the glottal flow waveform, which acts as the source in the vocal tract system. The estimation of glottal flow is of high
interest in a variety of speech areas, such as voice quality assessment, speech coding and synthesis as well as speech modifications.
A major obstacle in comparing and/or suggesting improvements in the current state of the art approaches is simply the lack of real data
concerning the glottal flow. In other words, the results obtained from various inverse filtering algorithms cannot be directly evaluated
because the actual glottal flow waveform is simply unknown. In this direction, the use of synthetic speech that has been
created using an artificial glottal waveform is widely adopted in the literature. This kind of evaluation, however, is not truly objective because
speech synthesis and IF are typically based on similar models of the human voice production apparatus, in our case, the traditional
source-filter model.
This thesis presents three well-known IF methods based on Linear Prediction Analysis (LPA), along with a new method whose
performance is compared to the others. The first one is based on the conventional autocorrelation LPA, and the second one on the
conventional closed phase covariance LPA. The closed phase is identified using Plumpe and Quatieri’s suggested method based on using
statistics on the first formant frequencies during a pitch period. The third one is based on the work of Alku et al., who proposed an IF
method based on a Mathematically Constrained Closed Phase Covariance LPA, in which mathematical constraints are imposed on the conventional
covariance analysis. This results in more realistic root locations of the model on the z-plane. Finally, Magi et al. suggested a new method
for extracting the vocal tract filter, called Stabilized Weighted LP Analysis (SWLP), in which a short-time energy window controls the
performance of the LP model. This method is suggested for IF due to its interesting property of applying emphasis on speech samples which
typically occur during the closed phase region of the speech signal. This is expected to yield a more robust, in the acoustic sense, vocal
tract filter estimate than the conventional autocorrelation LP. The three IF approaches along with the suggested new one are applied on a
database of physically modeled speech signals. In this case, the glottal flow and the speech signal are available and direct evaluation of
IF methods can be performed.
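For concreteness, the first of the compared approaches, conventional autocorrelation-LPA inverse filtering, can be sketched as below; the windowing choice, LPC order and leaky-integration constant are illustrative assumptions, and pre-emphasis and lip-radiation handling are omitted.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_autocorrelation(frame, order=18):
    """LPC inverse filter A(z) via the autocorrelation (Levinson-type) method."""
    x = frame * np.hanning(len(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1: len(x) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])    # predictor coefficients
    return np.concatenate(([1.0], -a))             # A(z) = 1 - sum_k a_k z^-k

def inverse_filter(frame, order=18):
    """Glottal-flow estimate: remove the vocal tract with A(z), then apply a
    leaky integrator to (roughly) undo the differentiation by lip radiation."""
    A = lpc_autocorrelation(frame, order)
    residual = lfilter(A, [1.0], frame)            # glottal flow derivative
    glottal_flow = lfilter([1.0], [1.0, -0.99], residual)
    return glottal_flow, residual
```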
Robust time and frequency parametrization measures are applied on both the actual glottal flow and the estimated
ones, in order to evaluate the performance of the methods. These measures include the Normalized Amplitude Quotient (NAQ), the difference between
the first two harmonics (H1-H2) of the glottal spectrum, and the Harmonic Richness Factor (HRF), along with the Signal to Reconstruction Error ratio
(SRER). Experiments were conducted on physically modeled sustained vowels (/aa/, /ae/, /eh/, /ih/) over a wide range of fundamental frequencies (105 to 255 Hz) for both
male and female speech. Glottal flow estimates were produced using short-time pitch-synchronous analysis and synthesis for the covariance-based methods,
whereas for the autocorrelation methods, a long analysis window and a short synthesis window were used. The results and measures are compared and discussed,
showing the prevalence of the covariance methods, but the suggested method typically produces better results than the conventional autocorrelation LP,
according to our metrics.
George Tzedakis,
Fast least-squares solution for harmonic and sinusoidal models [PDF] - funded by ICS-FORTH
Abstract: The sinusoidal model and its variants are commonly used in speech processing. In the literature,
there are various methods for the estimation of the unknown parameters of the sinusoidal model. Among them, the best-known
methods are the ones based on the Fast Fourier Transform (FFT), on Analysis-By-Synthesis (ABS) approaches, and on Least
Squares (LS) methods. The LS methods are more accurate and actually optimum for Gaussian noise, and thus, more appropriate
for high quality estimations. In addition, LS methods prove to be able to cope with short analysis windows. On the contrary,
the FFT- and ABS-based methods cannot handle overlapping frequency responses; in other words, they cannot handle short
analysis windows. This is important since in the case of short analysis windows the stationary assumption for the signal is
more valid. However, LS solutions are in general slower compared to FFT-based algorithms and optimized implementations of ABS
schemes.
In the present thesis, our goal is to alleviate the computational burden that the LS-based techniques bear, such that
both increased accuracy and a faster computational implementation can be achieved. The four models whose amplitude
coefficients will be estimated, namely the Harmonic, Sinusoidal, Quasi-Harmonic and Generalized Quasi-Harmonic models, are
reintroduced. Then, each model is studied individually and the straightforward LS solution for the amplitude estimation is presented.
The sources of computational load in the case of an LS solution are indicated and various computational improvements are introduced
for each model in terms of its computational complexity and execution time. The first speed-up process consists of performing matrix
multiplications manually, which yields a direct formula for every element of the result. For the next accelerating method, we show
how we can calculate a certain matrix of exponentials using primarily multiplications. As a final acceleration, having realized that
certain elements of a matrix, which needs to be calculated and then inverted, play a less important role in the process of deriving
the solution, we allow certain approximations of the matrix by omitting the calculation of the less important elements.
Finally, it is demonstrated that by following the suggested steps, the complexity of the LS-based solution, along with the execution time, is reduced.
The methods are evaluated by analyzing and re-synthesizing randomly created synthetic signals and calculating the Mean Square Error,
Signal-to-Reconstruction Error Ratio and CPU time improvement for each step. Next, in an effort to test the robustness of our acceleration
methods, we illustrate their competence in analyzing noisy synthetic signals. Furthermore, as a final test we check the ability of our
amplitude estimation mechanisms to analyze and synthesize real-world voiced speech signals.
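For concreteness, the straightforward LS amplitude estimation for the harmonic model (the baseline whose cost the thesis reduces) can be sketched as below; a single frame, a known f0 and np.linalg.lstsq as the solver are illustrative assumptions.

```python
import numpy as np

def harmonic_ls_amplitudes(x, f0, fs, n_harm):
    """Least-squares complex amplitudes a_k of a harmonic model
    x[n] ~ sum_{k=-K..K} a_k exp(j 2 pi k f0 n / fs) over one frame.
    Building E and solving the normal equations dominates the cost that the
    thesis's accelerations target."""
    n = np.arange(len(x))[:, None]                   # (N, 1) sample indices
    k = np.arange(-n_harm, n_harm + 1)[None, :]      # (1, 2K+1) harmonic numbers
    E = np.exp(2j * np.pi * k * f0 * n / fs)         # basis matrix (N, 2K+1)
    a, *_ = np.linalg.lstsq(E, x.astype(complex), rcond=None)
    x_hat = (E @ a).real                             # resynthesized frame
    return a, x_hat
```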
George Grekas,
On speaker interpolation and speech conversion for parallel corpora [PDF] - funded by ICS-FORTH
Abstract: In daily speech the linguistic information plays a major role in the communication between people.
However, voice quality and individuality are important in speech recognition and understanding. For instance, it is exceptionally
significant to understand and discriminate between two or more speakers in a radio or a television program. Voice individuality,
apart from providing the aforementioned advantages in communication, enriches our daily life with variety. For a number of modern
applications it is important to create and maintain databases for different speakers, for example, in gaming, in text-to-speech
synthesis and in cartoon movies. This may be time consuming and expensive, depending on the requirements of the application.
Speaker interpolation (SI) is the process of producing an intermediate voice between two or more speakers, while voice conversion
(VC) is the technique of processing the voice of one person, namely the source speaker, such that his/her voice resembles the voice
of another person, namely the target speaker. Moreover, the converted or interpolated speech should sound natural and intelligible.
Despite the extended research in VC, high-quality voice conversion has not been achieved yet. A number of reasons explain this current
shortcoming, the main ones being a) the oversmoothing effect of statistical modeling, b) inaccurate estimation of the
speaker-dependent features, and c) the inadequacy of the synthesis methods used. Voice conversion methods are based on spectral envelope
information, which represents the vocal tract, since it has an important role in speech individuality. In conventional VC, the excitation
signal of the source speaker is first extracted by inverse filtering. Then this excitation signal is filtered by the vocal tract of
the target speaker. In speaker interpolation, the excitation signal is filtered by an interpolated vocal tract of the given speakers.
The scope of this thesis is to address this research gap and achieve high-quality speaker interpolation and voice conversion on parallel
corpora using accurate methods for spectral envelope estimation (true envelope), time and frequency alignment (piecewise linear time and
frequency warping), and speech synthesis (interpolated lattice filter or overlap and add). With the use of precise methods in each processing
step, the artifacts currently encountered in voice conversion were expected to be reduced. In speaker interpolation, the produced vocal tract is not just an
interpolation between the given speakers, but the vocal tract length can be altered, producing a broad range of voices. Hence, given a limited
database, a substantially larger one that contains individual speakers for every use can be created.
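As a toy illustration of the interpolation step only, assuming two speakers' spectral envelopes have already been estimated (e.g. with the true envelope method) on the same frequency grid and time-aligned, a log-domain weighting can be sketched as below; this weighting is an illustrative choice rather than the exact scheme of the thesis.

```python
import numpy as np

def interpolate_envelopes(env_a, env_b, alpha=0.5):
    """Intermediate-voice spectral envelope: alpha = 0 returns speaker A,
    alpha = 1 returns speaker B, values in between give an interpolated
    vocal tract; interpolation is done on log magnitudes."""
    log_env = (1.0 - alpha) * np.log(env_a + 1e-12) + alpha * np.log(env_b + 1e-12)
    return np.exp(log_env)
```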
Maria C. Koutsogiannaki,
Voice tremor detection using adaptive quasi-harmonic model [PDF] - funded by ICS-FORTH
Abstract: Speech, along with hearing, is the most important human ability. Voice not only audibly represents
us to the world, but also reveals our energy level, personality, and artistry. Possible disorders may lead to social isolation or
may create problems for certain professional groups. Most singers seek professional voice help for vocal fatigue, anxiety, throat tension,
and pain. All these symptoms must be quickly addressed to restore the voice and provide physical and emotional relief. Normophonic and
dysphonic speakers have a mutual voice characteristic. Tremor, a rhythmic change in pitch and loudness, appears both in healthy subjects
and in subjects with voice disorders. Physiological tremor or microtremor appears to be a derivative of natural processes. Pathological
tremor, however, is distinguishable and characterized by strong periodical patterns of large amplitude that affect the quality of voice
and impair the patient's ability to communicate.
However, researchers examine not only pathological but also physiological tremor,
since they believe that it may be the first or only symptom of a neurological disease, or may indicate vocal fatigue. Therefore, the
analysis of vocal microtremor in normophonic speakers is also important. Traditional methods of vocal tremor detection involve visual
inspection of oscillograms or spectrograms. A more accepted approach is the estimation of the fundamental frequency of the voice signal
and then the extraction of the attributes of the signal that modulates the fundamental frequency, namely its amplitude and frequency.
However, current methods for vocal tremor estimation are characterized by three limitations: a) the extraction only of the first harmonic
for analysis, b) the short duration of the analyzed sustained vowel, and c) the use of a single value to represent a time-varying signal
as tremor.
This thesis presents and validates a novel accurate method for the estimation of the vocal tremor characteristics on sustained
vowels uttered by normophonic and dysphonic subjects, and identifies the attributes that define vocal tremor, that is, the leveled modulation
amplitude of the harmonics of the signal and its deviation. The extraction of vocal tremor characteristics is performed in three steps. The
first step consists of the estimation of the instantaneous amplitude and instantaneous frequency of every sinusoid-component of the speech
signal using a recently proposed AM-FM decomposition algorithm, the so-called Adaptive Quasi-Harmonic Model. AQHM is an adaptive algorithm
which is able to represent accurately multi-component AM-FM signals like speech. Moreover, AQHM estimates all the instantaneous components
of speech, and thus, in contrast to previous studies, any of the instantaneous components can be used for the analysis of vocal tremor and
not only the first harmonic. The second step concerns the subtraction from the instantaneous component of the very slow modulations that are
derived from the pulsation of the heart. This is achieved by filtering the instantaneous component using a Savitzky-Golay smoothing filter.
Finally, at the third step the modulation frequency and the modulation level of the analyzed instantaneous component are estimated. The
analyzed instantaneous component is assumed to contain time-varying features, since the modulations are primarily non-stationary. The
estimation is performed by employing the AQHM algorithm and two distinct evaluation approaches, namely the Extended Kalman Smoother and the
Hilbert transform. Finally, the efficiency of the algorithm is validated on four databases containing normophonic and dysphonic speakers.
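A minimal sketch of the second and third steps on a single instantaneous-frequency track is shown below: removing very slow (e.g. heart-rate-related) drift with a Savitzky-Golay smoother and summarizing the remaining modulation. The window length, polynomial order and the simple standard-deviation summary are illustrative assumptions; the thesis estimates the time-varying modulation with AQHM, an Extended Kalman Smoother and the Hilbert transform.

```python
import numpy as np
from scipy.signal import savgol_filter

def tremor_residual(inst_f0, frame_rate, slow_win_s=0.5, polyorder=3):
    """Subtract the very slow trend of an instantaneous-frequency track,
    leaving the tremor-range modulation."""
    win = max(int(slow_win_s * frame_rate), polyorder + 2) | 1   # odd window
    slow = savgol_filter(inst_f0, window_length=win, polyorder=polyorder)
    return inst_f0 - slow

def modulation_level_percent(residual, inst_f0):
    """Crude modulation level: residual spread relative to the mean frequency."""
    return 100.0 * np.std(residual) / np.mean(inst_f0)
```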
2009
Miltiadis Vasilakis,
Spectral based short-time features for voice quality assessment [PDF] - funded by ICS-FORTH
Abstract: In the context of voice quality assessment, phoniatricians are aided by the measurement of several
phenomena that may reveal the existence of pathology in voice. Among the most prominent of such
phenomena are jitter and shimmer. Jitter is defined as perturbations of the glottal cycle and
shimmer is defined as perturbations of the glottal excitation amplitude. Both phenomena occur
during voice production, especially in the case of vowel phonation. Acoustic analysis methods are
usually employed to estimate jitter using the radiated speech signal as input. Most of these methods
measure jitter in the time domain and are based on pitch period estimation, consequently, they
are sensitive to the error of this estimation. Furthermore, the lack of robustness exhibited
by pitch period estimators makes the use of continuous speech recordings as input problematic,
and essentially limits jitter measurement to sustained vowel signals. Similarly for shimmer, time
domain acoustic analysis methods are usually called to estimate the phenomenon in speech signals,
based on estimation of peak amplitude per period. Moreover, these methods, for both phenomena,
are affected by averaging and explicit or implicit use of low-pass information. The use of mathematical
descriptions for jitter and shimmer, in order to transfer the estimation from the time domain
to the frequency domain, may alleviate these problems.
Using a mathematical model that couples two periodic events to capture local aperiodicity
allows jitter to be modeled as the shift of one of the two periodic events with respect to the
other. Said model, when transformed to the frequency domain, displays interesting spectral trends
between the harmonic and subharmonic subspectra. The two spectral parts are shown to form a
beat spectrum, with the number of intersections between them directly dependent on the shift related
to jitter. This behavior was exploited to develop a short-time Spectral Jitter Estimator (SJE).
Experiments with synthetic signals of jittered phonation showed that SJE provides accurate local
estimates of jitter. Further evaluation was conducted on two databases of actual sustained vowel
recordings from healthy and pathological voices.
Comparison with corresponding estimations from
the Multi-Dimension Voice Program (MDVP) and the Praat system revealed that SJE outperforms
both in normal versus pathological voice discrimination accuracy by at least 4%, as this was judged
using Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC) index.
Examination of the short-time statistics of SJE showed that there is a higher correlation with the
existence of pathology in voice, due to the fact that SJE takes into account the full spectrum.
SJE was also shown to be robust against errors in pitch period estimations, which, combined
with the ability to estimate jitter over short time intervals, makes SJE a very good candidate for
measuring jitter in continuous speech. Through cross-database validation a threshold of pathology
for SJE has been determined. By applying this threshold to a database of reading text recordings
from normophonic and dysphonic speakers, a second threshold and new features were established,
especially for monitoring jitter in continuous speech. In terms of AUC, the suggested features for
reading text provide a discrimination score of about 95%, while the second threshold provides a
Classification Rate (CR) of 87.8%. Furthermore, estimated short-time jitter values from reading
text were found to confirm the studies showing the decrease of jitter with increasing fundamental
frequencies, and the more frequent presence of high jitter values in the case of pathological voices
as time increases.
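As a toy illustration of the two-periodic-event idea behind SJE, the sketch below interleaves two pulse trains, shifting every second pulse by a fixed amount: with zero shift only the harmonics of 1/T0 remain, while a non-zero shift raises the subharmonics at odd multiples of 1/(2·T0), which is the spectral behavior the estimator exploits. The pulse shapes and lengths are arbitrary illustrative choices, not the thesis's actual model signal.

```python
import numpy as np

def jittered_pulse_train(T0=100, shift=4, n_cycles=64):
    """Two interleaved periodic pulse trains; the second is delayed by 'shift'
    samples, playing the role of the jitter in the two-event model."""
    x = np.zeros(2 * T0 * n_cycles)
    x[::2 * T0] = 1.0                 # first periodic event, period 2*T0
    x[T0 + shift::2 * T0] = 1.0       # second event, nominally T0 later
    return x

# Harmonic vs. subharmonic content: compare shift = 0 against shift > 0.
spec_clean = np.abs(np.fft.rfft(jittered_pulse_train(shift=0)))
spec_jitter = np.abs(np.fft.rfft(jittered_pulse_train(shift=4)))
```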
A mathematical model that combines two periodic events also allows for the modeling of shimmer
by applying different amplitude deviations on the two events. Again, by transforming the model
from the time domain to the frequency domain, notable spectral properties are observed. Using these
properties, four features indicative of shimmer were created to evaluate the model. Experiments with
synthetic shimmered phonation signals, as well as the two afore-mentioned databases of sustained
vowel recordings, showed that the model correctly captures the shimmer phenomenon and further
development should be pursued.
2007
Andre Holzapfel,
A Component Based Music Classification Approach [PDF] - funded by ICS-FORTH
Abstract: This thesis introduces a new feature set based on a Non-negative Matrix Factorization approach for the
classification of musical signals into genres, using only the synchronous organization of music events (the vertical dimension of music). This
feature set generates a vector space to describe the spectrogram representation of a music signal. The space is modeled statistically by
a mixture of Gaussians (GMM). A new signal is classified by considering the likelihoods over all the estimated feature vectors given these
statistical models, without constructing a model for the signal itself. Cross-validation tests on two commonly utilized datasets for this
task show the superiority of the proposed features compared to the widely used MFCC type of representation, both in terms of classification accuracy
(an improvement of over 9%) and in terms of a stability measure introduced in this thesis for GMMs. Furthermore, we compare results of Non-negative
Matrix Factorization and Independent Component Analysis when used for the approximation of spectrograms, documenting the superiority of
Non-negative Matrix Factorization. Based on our findings we give a concept for a complete musical genre classification system using matrix
factorization and Support Vector Machines.
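A minimal sketch of the classification idea with scikit-learn is given below, assuming per-genre GMMs have already been fitted on training features; whether the NMF basis is learned per signal or globally, and all model sizes, are simplifications/assumptions rather than the thesis's exact configuration.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.mixture import GaussianMixture

def nmf_frame_features(mag_spectrogram, n_components=8):
    """Describe a non-negative magnitude spectrogram (freq x frames) by NMF
    activations; each frame becomes one feature vector."""
    nmf = NMF(n_components=n_components, init="nndsvda", max_iter=500)
    return nmf.fit_transform(mag_spectrogram.T)      # (frames, components)

def classify(frame_features, genre_gmms):
    """Pick the genre whose (pre-fitted) GaussianMixture assigns the highest
    total log-likelihood over all frame vectors; no model is built for the
    unknown signal itself."""
    scores = {g: gmm.score_samples(frame_features).sum()
              for g, gmm in genre_gmms.items()}
    return max(scores, key=scores.get)
```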
Yannis Sfakianakis,
A statistical approach for intrusion detection [PDF] - funded by ICS-FORTH
Abstract: Since the Internet’s growth, network security has played a vital role in the computer industry.
Attacks are becoming much more sophisticated, and this fact has led the computer community
to look for better and more advanced countermeasures. Malicious users existed long before the
Internet was created; however, the Internet gave intruders a major boost towards their
potential compromises. Naturally, the Internet provides convenience and comfort to
every user, and “bad news” is merely an infelicity. Clearly the Internet is a step forward;
it must nevertheless be used for the correct reasons and towards the right cause.
As computer technology becomes more elaborate and complex, program vulnerabilities
become more frequent and compromises effortless. A means of attack containment is
the so-called “Intrusion Detection System” (IDS).
In this thesis we built a network anomaly IDS, using statistical properties from the
network’s traffic. We were interested in building a general-purpose, adaptive and data-independent
system with as few parameters as possible. The types of attacks it can detect are
Denial of Service attacks and probing attacks. We used three models for our experiments:
Fisher’s Linear Discriminant, Gaussian mixture models and Support Vector Machines.
In our experiments we found that the most important part of statistical intrusion
detection is the feature selection. Better results can be achieved when both classes are
modeled (attack and normal traffic). The best results were achieved using Fisher’s Linear
Discriminant method, that is, a 90% detection rate with a 5% false alarm rate.
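For illustration, the best-performing of the three models can be sketched with scikit-learn's LinearDiscriminantAnalysis (equivalent to Fisher's Linear Discriminant for two classes); the feature names and the use of the class posterior as an anomaly score are illustrative assumptions, not the thesis's exact setup.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_and_score(X_train, y_train, X_test):
    """X_*: statistical traffic features per connection or time window
    (e.g. packet counts, byte rates); y_train: 1 = attack (DoS/probe),
    0 = normal. Both classes are modelled, which improved results."""
    lda = LinearDiscriminantAnalysis()
    lda.fit(X_train, y_train)
    return lda.predict_proba(X_test)[:, 1]          # score: P(attack)
```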
2006
Yannis Pantazis,
Detection of Discontinuities in Concatenative Speech Synthesis [PDF] - funded by ICS-FORTH
Abstract: In the last decade, unit selection synthesis has become a hot topic in speech synthesis research.
Unit selection gives the greatest naturalness due to the fact that it does not apply a large amount of digital signal processing
to the recorded speech, which often makes recorded speech sound less natural. In order to find the best units in the database, unit
selection is based on two cost functions, the target cost and the concatenation cost. The concatenation cost refers to how well adjacent units
can be joined. The problem of finding a concatenation cost function is broken into two subproblems: finding the proper
parameterizations of the signal and finding the right distance measure.
Recent studies attempted to specify which concatenation distance measures are able to predict audible discontinuities and thus correlate highly
with human perception of discontinuity at the concatenation point. However, none of the concatenation costs used so far can measure the similarity
(or (dis)continuity) of two consecutive units efficiently. Many features such as line spectral frequencies (LSF) and Mel frequency cepstral coefficients (MFCC)
have been used for the detection of discontinuities.
In this study, three new sets of features for detecting discontinuities are introduced.
The first set of features is obtained by modeling the speech signal as a sum of harmonics with time-varying complex amplitude, which
yields a nonlinear speech model. The second set of features is based on a nonlinear speech analysis technique which tries to decompose
speech signals into AM and FM components. The third feature set exploits the nonlinear nature of the ear. Using Lyon’s auditory model,
the behaviour of the cochlea is measured by evaluating neural firing rates. To measure the difference between two vectors of such parameters,
we need a distance measure. Examples of such measures are absolute distance (L1 norm) and Euclidean distance (L2 norm). However,
these measures are naive and provide rather poor results. We further suggest using Fisher’s linear discriminant as well as a quadratic
discriminant as discrimination functions. Linear regression, which employs a least-squares method, was also tested as a discrimination
function. The evaluation of the objective distance measures (or concatenation costs) as well as the training of the discriminant functions
was performed on two databases. To build a database, a psychoacoustic listening experiment is performed and listeners’ opinions are obtained.
The first database was created by Klabbers and Veldhuis in Holland, while the second database was created by Stylianou and Syrdal at AT&T Labs.
Therefore, we are able to compare the same approaches on different databases and obtain more robust results.
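A minimal sketch contrasting a naive concatenation cost with a learned one is given below, assuming feature vectors (e.g. MFCCs or harmonic-model parameters) for the frames on either side of the join and a discriminant direction w already trained on the listener-labelled data; the names and the exact form of the learned cost are illustrative assumptions.

```python
import numpy as np

def euclidean_cost(feat_left, feat_right):
    """Naive concatenation cost: L2 distance between the feature vectors on
    either side of the concatenation point."""
    return float(np.linalg.norm(np.asarray(feat_left) - np.asarray(feat_right)))

def discriminant_cost(feat_left, feat_right, w, b=0.0):
    """Learned concatenation cost: project the absolute feature difference
    onto a discriminant direction w (e.g. Fisher's) trained on joins labelled
    audible/inaudible; larger values predict audible discontinuities."""
    d = np.abs(np.asarray(feat_left) - np.asarray(feat_right))
    return float(np.dot(w, d) + b)
```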
Results obtained from the two different psychoacoustic listening tests showed that the nonlinear harmonic model using Fisher’s linear discriminant or linear regression performed very well in
both tests. It was significantly better than MFCCs combined with the Euclidean distance, which is a common concatenation cost in modern TTS systems.
Another good concatenation cost, though less good than the nonlinear harmonic model, is the AM-FM decomposition, again with Fisher’s linear discriminant or linear
regression. These results indicate that a concatenation cost based on nonlinear features separated by a statistical discriminant function is a good choice.