Completed M.Sc. theses



The intellectual property of the theses is jointly owned by the University of Crete and FORTH, where indicated.




2023

Panagiotis Pantalos, Exploration of Non-Stationary Speech Protection for Highly Intelligible Time-Scale Compression [PDF] - funded by IACM-FORTH

Abstract: Speech recordings are everywhere, from social media, YouTube, and online learning to podcasts and audiobooks. In today’s fast-paced world, it is sometimes necessary to speed up speech recordings in order to promote faster information consumption. The population group that benefits most from such technologies is visually impaired individuals who use screen readers on their mobile phones. A series of algorithms have been developed for the time-scale expansion or compression of speech recordings. It is well known that fast speech, also known as time-scale compressed speech, is less intelligible due to the loss of speech parts that are important in distinguishing syllables and words. The majority of these parts are non-stationary in nature, such as transient sounds, plosives, and fricatives. In this work, we investigate algorithms for non-stationary speech protection in order to provide highly intelligible time-scale compression. We base our experiments on the so-called Waveform Similarity Overlap-and-Add (WSOLA) method of time-scale compression. WSOLA is capable of providing both uniform and non-uniform time-scale compression. We propose to characterize speech waveforms according to their non-stationarity using simple time- and frequency-domain criteria. Using a frame-by-frame analysis, the first criterion (C1) is based on the RMS energy of each frame. Additionally, we implement a Line Spectral Frequency (LSF)-based criterion, named C2, and, in combination with C1, we end up with a hybrid non-stationarity detection criterion named C3. C1 and C3 are implemented on a dataset of Greek speech recordings named GrHarvard.
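As an illustration of the frame-based C1 idea, the sketch below flags frames whose RMS energy changes abruptly from one frame to the next; the function name, frame sizes, and threshold are hypothetical choices for illustration, not the thesis's actual criterion.

```python
import numpy as np

def rms_nonstationarity(signal, frame_len=512, hop=256, ratio_threshold=2.0):
    """Flag frames whose RMS energy jumps sharply relative to the
    previous frame -- a simple time-domain non-stationarity cue."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    rms = np.array([
        np.sqrt(np.mean(signal[i * hop : i * hop + frame_len] ** 2))
        for i in range(n_frames)
    ])
    eps = 1e-12  # guard against division by zero in silent frames
    ratio = rms[1:] / (rms[:-1] + eps)
    # A frame is "non-stationary" if its energy rose or fell sharply.
    flags = np.concatenate(
        ([False], (ratio > ratio_threshold) | (ratio < 1.0 / ratio_threshold))
    )
    return rms, flags
```

In a WSOLA-style compressor, frames flagged this way could be exempted from compression so that plosives and other transients are preserved.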

2019

Irene Sisamaki, End-to-End Neural based Greek Text-to-Speech Synthesis [PDF]

Abstract: Text-to-speech (TTS) synthesis is the automatic conversion of written text to spoken language. TTS systems play an important role in natural human-computer interaction. Concatenative speech synthesis and statistical parametric speech synthesis were the prominent methods for decades. In the era of deep learning, end-to-end TTS systems have dramatically improved the quality of synthetic speech. The aim of this work was the implementation of an end-to-end neural TTS system for the Greek language. The neural network architecture of Tacotron-2 is used for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to acoustic features, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from the predicted acoustic features. Developing a TTS system for any given language is a significant challenge and requires a large amount of high-quality acoustic recordings.

2015

Theodora Yakoumaki, Expressive speech analysis and classification using adaptive sinusoidal modeling [PDF] - funded by ICS-FORTH

Abstract: Emotional (or stressed/expressive) speech can be defined as the speech style produced by an emotionally charged speaker. Speakers who feel sad, angry, happy, or neutral put a certain stress in their speech that is typically characterized as emotional. Emotional speech is considered among the most challenging speech styles for modelling, recognition, and classification. The emotional condition of speakers may be revealed by the analysis of their speech, and such knowledge could be useful in emergency conditions, in health care applications, and as a pre-processing step in recognition and classification systems, among others. Acoustic analysis of speech produced under different emotional conditions reveals a great number of speech characteristics that vary according to the emotional state of the speaker. Therefore, these characteristics could be used to identify and/or classify different emotional speech styles. There is little research on the parameters of the Sinusoidal Model (SM), namely amplitude, frequency, and phase, as features to separate different speaking styles. However, the estimation of these parameters is subject to an important constraint: they are derived under the assumption of local stationarity, that is, the speech signal is assumed to be stationary inside the analysis window. Nonetheless, speaking styles described as fast or angry may not satisfy this assumption. Recently, this problem has been handled by the adaptive Sinusoidal Models (aSMs), which project the signal onto a set of amplitude- and frequency-varying basis functions inside the analysis window. Hence, sinusoidal parameters are estimated more accurately.

Sofia-Elpiniki Yannikaki, Voicing detection in spontaneous and real-life recordings from music lessons [PDF] - funded by ICS-FORTH

Abstract: Speech is one of the most important abilities we have, since it is one of the principal ways of communicating with the world. In the past few years, much interest has been shown in developing voice-based applications. Such applications involve the isolation of speech from an audio file. The algorithms that achieve this are called Voice Detection algorithms. From the analysis of a given input audio signal, the parts containing voice are kept while the other parts (noise, silence, etc.) are discarded. In this way, a great reduction of the information to be further processed is achieved.
The task of Voice Detection is closely related to Speech/Nonspeech Classification. In addition, Singing Voice Detection and Speech/Music Discrimination can be seen as subclasses of what we generally call Voice Detection. When dealing with such tasks, an audio signal is given as input to a system and is then processed. The signal is usually analysed in frames, from which features are extracted. The frame duration depends mostly on the application and sometimes on the features being used. Many features have been proposed to date; they can be divided into two categories, time-domain and frequency-domain features. In the time domain, the short-time energy, the zero-crossing rate, and autocorrelation-based features are most often used. In the frequency domain, cepstral features are most frequently used, due to the useful information they carry about speech presence.
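The time-domain features mentioned above can be sketched as follows; this is a minimal illustration of short-time energy and zero-crossing rate with hypothetical frame parameters, not the configuration used in the thesis.

```python
import numpy as np

def frame_features(x, frame_len=400, hop=200):
    """Per-frame short-time energy and zero-crossing rate, two classic
    time-domain voice-detection features."""
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start : start + frame_len]
        energy = np.sum(frame ** 2) / frame_len
        # Fraction of consecutive samples whose sign differs.
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        feats.append((energy, zcr))
    return np.array(feats)
```

Voiced speech typically shows high energy and low ZCR, whereas fricatives and noise show the opposite pattern, which is what makes these two features a useful pair for a simple detector.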

2014

Olympia Simantiarki, Voice quality assessment using phase information: application on voice pathology [PDF] - funded by ICS-FORTH

Abstract: Speech, along with hearing, is one of the most important human abilities. Speech is the primary way in which we engage with society. Our voice can reveal a great deal of information about us to other people: our energy level, our emotions, our personality, and our artistry. Voice abnormalities may cause social isolation or may create problems in the professional field. Because of this significance of the voice, the early detection of a voice pathology is essential. A well-known voice abnormality is Spasmodic Dysphonia (SD). SD is a neurological disease primarily affecting the regular contraction of the muscles around the vocal folds, causing their undesirable vibration. This abnormal vibration of the muscles of the glottis has an impact on speech. A person who suffers from SD speaks with more tremor and experiences disruptions during speech. Similar indications also appear in normophonic speakers, usually related to stress, voice fatigue, etc. Even for normophonic cases, these indications may be a first symptom of a neurological disease, so an early diagnosis is necessary. Therefore, algorithms that measure the intensity of the symptoms are very useful.

Myron Apostolakis, Development of interactive user interfaces for voice training aimed to children with hearing loss using web technologies in real time [PDF] - funded by ICS-FORTH

Abstract: International surveys and global statistics have shown that 1.5% of children up to the age of 20 have reduced hearing ability, while 1 in 22 school-age children has hearing problems. This indicates that there are currently about one million children with hearing problems in Europe, while in the USA 12,000 children are born with hearing loss every year. In Greece, the number of hard-of-hearing children is estimated at about 80,000. These figures place hearing loss first among the diseases of newborns. It is common for people with hearing loss to have problems at the communication level. Due to the lack of auditory feedback to the brain, children's speech production system does not develop normally. Since deaf individuals cannot hear their own speech, they cannot tune their voices towards a more "correct" sound. In fact, they are unable to control the speech production organs (tongue, teeth, etc.) properly, because they cannot perceive the correct way to do so. As a result, they speak too loudly on vowels or produce consonants incorrectly. However, a person who lost their hearing at an older age is more likely to speak correctly. We thus reach the general conclusion that everything is a matter of feedback. The purpose of this thesis is to introduce a new approach to speech therapy tools based on multimedia web technologies, taking into account the particular characteristics of people with hearing problems, so that they can acquire better communication skills.

2010

Christina-Alexandra Lionoudaki, Determining glottal closure and opening instants in speech [PDF] - funded by ICS-FORTH

Abstract: Voice quality is a complex attribute of voice, but one important aspect arises from the regularity and duration of the closed phase from one vocal fold cycle to the next. The determination of the closed phase requires the accurate detection of the glottal closure instant (GCI) and the glottal opening instant (GOI). In the literature, many methods have been suggested in this direction, employing either the Electroglottographic (EGG) or the speech signal.
This work presents a robust algorithm for the detection of glottal instants from the EGG signal, and a study on the interaction between the amplitude-frequency components of speech and the glottal phases. The determination of GCIs and GOIs is quite straightforward using Electroglottographic (EGG) signals. The derivative of the EGG offers a simple way of detecting the important instants during the production of speech: the glottal closing and opening instants. In this thesis we suggest an alternative to the simple derivative which is based on spectral methods. Spectral methods provide an elegant way to compute first- and higher-order derivatives of discrete-time data with high accuracy.
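A minimal sketch of spectral differentiation, the idea behind the proposed alternative to the simple EGG derivative: differentiation of a periodic discrete-time signal becomes multiplication of its spectrum by jω. The function name and interface are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def spectral_derivative(x, fs):
    """Differentiate a (periodic) discrete signal in the frequency
    domain: multiply its spectrum by j*omega and invert."""
    n = len(x)
    X = np.fft.fft(x)
    omega = 2j * np.pi * np.fft.fftfreq(n, d=1.0 / fs)
    if n % 2 == 0:
        omega[n // 2] = 0.0  # zero the Nyquist bin so the result stays real
    return np.fft.ifft(X * omega).real
```

Applied to an EGG cycle, the strong negative and positive peaks of such a derivative would mark candidate glottal closing and opening instants, respectively.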

Maria Astrinaki, Real time voice pathology detection using autocorrelation pitch estimation and short time Jitter estimator [PDF] - funded by ICS-FORTH

Abstract: Voice is the result of the coordination of the whole pneumophonoarticulatory apparatus. Voice pathologies have become a social concern, as voice and speech play an important role in certain professions and in the general population's quality of life. The analysis of the voice allows the identification of diseases of the vocal apparatus and is currently carried out by an expert doctor through methods based on auditory analysis. In recent years, emphasis has been placed on early pathology detection, for which classical perturbation measurements (jitter, shimmer, HNR, etc.) have been used. Going one step further, the present work aims to implement a real-time voice pathology detection system, combined with a Java interface.
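As a rough illustration of the autocorrelation pitch estimation at the core of such a system, the sketch below picks the autocorrelation peak within a plausible lag range; the function name and defaults are assumptions for illustration, not the thesis implementation.

```python
import numpy as np

def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate pitch as the autocorrelation peak within a plausible
    lag range -- the classic short-time pitch estimator."""
    frame = frame - np.mean(frame)
    # One-sided autocorrelation: lags 0 .. len(frame)-1.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag
```

A short-time jitter estimator can then be built on top of this by tracking the cycle-to-cycle variation of the estimated pitch periods.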

George P. Kafentzis, On the inverse filtering of speech [PDF] - funded by ICS-FORTH

Abstract: In all proposed source-filter models of speech production, Inverse Filtering (IF) is a well-known technique for obtaining the glottal flow waveform, which acts as the source of the vocal tract system. The estimation of the glottal flow is of high interest in a variety of speech areas, such as voice quality assessment, speech coding and synthesis, as well as speech modifications. A major obstacle in comparing and/or suggesting improvements to the current state-of-the-art approaches is simply the lack of real data concerning the glottal flow. In other words, the results obtained from various inverse filtering algorithms cannot be directly evaluated because the actual glottal flow waveform is simply unknown. In this direction, the use of synthetic speech created from an artificial glottal waveform is widely suggested in the literature. This kind of evaluation, however, is not truly objective, because speech synthesis and IF are typically based on similar models of the human voice production apparatus, in our case the traditional source-filter model.

George Tzedakis, Fast least-squares solution for harmonic and sinusoidal models [PDF] - funded by ICS-FORTH

Abstract: The sinusoidal model and its variants are commonly used in speech processing. In the literature, there are various methods for the estimation of the unknown parameters of the sinusoidal model. Among them, the best-known methods are those based on the Fast Fourier Transform (FFT), on Analysis-by-Synthesis (ABS) approaches, and on Least Squares (LS) methods. The LS methods are more accurate, and actually optimal for Gaussian noise, and thus more appropriate for high-quality estimation. In addition, LS methods prove able to cope with short analysis windows. On the contrary, the FFT- and ABS-based methods cannot handle overlapping frequency responses; in other words, they cannot handle short analysis windows. This is important since, in the case of short analysis windows, the stationarity assumption for the signal is more valid. However, LS solutions are in general slower than FFT-based algorithms and optimized implementations of ABS schemes.
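A minimal sketch of the LS estimation discussed above: for a known fundamental frequency, the harmonic model is linear in its cosine/sine coefficients, so the amplitudes follow from a single least-squares solve. The function and its interface are illustrative; this is the plain (slow) solve, not the fast solution proposed in the thesis.

```python
import numpy as np

def ls_harmonic_fit(x, fs, f0, num_harmonics):
    """Least-squares amplitude estimation for a harmonic model:
    x[n] ~ sum_k a_k cos(2*pi*k*f0*n/fs) + b_k sin(2*pi*k*f0*n/fs)."""
    n = np.arange(len(x))
    cols = []
    for k in range(1, num_harmonics + 1):
        w = 2 * np.pi * k * f0 * n / fs
        cols.append(np.cos(w))
        cols.append(np.sin(w))
    B = np.column_stack(cols)                      # basis matrix
    coef, *_ = np.linalg.lstsq(B, x, rcond=None)   # LS solve
    amps = np.hypot(coef[0::2], coef[1::2])        # per-harmonic amplitudes
    return amps, B @ coef
```

Because the LS solve works with the raw samples rather than a long transform window, it remains well-posed even for analysis windows much shorter than those an FFT-based estimator would need.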

George Grekas, On speaker interpolation and speech conversion for parallel corpora [PDF] - funded by ICS-FORTH

Abstract: In daily speech, linguistic information plays a major role in the communication between people. However, voice quality and individuality are also important in speech recognition and understanding. For instance, it is exceptionally significant to understand and discriminate between two or more speakers in a radio or television program. Voice individuality, apart from providing the aforementioned advantages in communication, enriches our daily life with variety. For a number of modern applications it is important to create and maintain databases for different speakers, for example in gaming, in text-to-speech synthesis, and in cartoon movies. This may be time-consuming and expensive, depending on the requirements of the application.

Maria C. Koutsogiannaki, Voice tremor detection using adaptive quasi-harmonic model [PDF] - funded by ICS-FORTH

Abstract: Speech, along with hearing, is among the most important human abilities. Voice not only audibly represents us to the world, but also reveals our energy level, personality, and artistry. Possible disorders may lead to social isolation or may create problems for certain professional groups. Most singers seek professional voice help for vocal fatigue, anxiety, throat tension, and pain. All these symptoms must be quickly addressed to restore the voice and provide physical and emotional relief. Normophonic and dysphonic speakers have a mutual voice characteristic. Tremor, a rhythmic change in pitch and loudness, appears both in healthy subjects and in subjects with voice disorders. Physiological tremor, or microtremor, appears to be a by-product of natural processes. Pathological tremor, however, is distinguishable and characterized by strong periodic patterns of large amplitude that affect the quality of the voice and impair the patient's ability to communicate.

2009

Miltiadis Vasilakis, Spectral based short-time features for voice quality assessment [PDF] - funded by ICS-FORTH

Abstract: In the context of voice quality assessment, phoniatricians are aided by the measurement of several phenomena that may reveal the existence of pathology in the voice. Among the most prominent of these phenomena are jitter and shimmer. Jitter is defined as perturbations of the glottal cycle, and shimmer is defined as perturbations of the glottal excitation amplitude. Both phenomena occur during voice production, especially in the case of vowel phonation. Acoustic analysis methods are usually employed to estimate jitter using the radiated speech signal as input. Most of these methods measure jitter in the time domain and are based on pitch period estimation; consequently, they are sensitive to the error of this estimation. Furthermore, the lack of robustness exhibited by pitch period estimators makes the use of continuous speech recordings as input problematic, and essentially limits jitter measurement to sustained vowel signals. Similarly for shimmer, time-domain acoustic analysis methods are usually employed to estimate the phenomenon in speech signals, based on the estimation of the peak amplitude per period. Moreover, these methods, for both phenomena, are affected by averaging and by the explicit or implicit use of low-pass information. The use of mathematical descriptions of jitter and shimmer, in order to transfer the estimation from the time domain to the frequency domain, may alleviate these problems.
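For reference, the classical time-domain definitions of jitter and shimmer that motivate the frequency-domain measures can be sketched as simple perturbation statistics over per-cycle periods and peak amplitudes (the function names are hypothetical):

```python
import numpy as np

def jitter_percent(periods):
    """Local jitter: mean absolute difference of consecutive glottal
    cycle lengths, relative to the mean cycle length (in percent)."""
    p = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(p))) / np.mean(p)

def shimmer_percent(amps):
    """Local shimmer: the same perturbation measure applied to the
    per-cycle peak amplitudes."""
    a = np.asarray(amps, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(a))) / np.mean(a)
```

Both statistics depend entirely on accurate per-cycle segmentation, which is exactly the sensitivity the abstract describes and the motivation for spectral, short-time alternatives.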

2007

Andre Holzapfel, A Component Based Music Classification Approach [PDF] - funded by ICS-FORTH

Abstract: This thesis introduces a new feature set based on a Non-negative Matrix Factorization approach for the classification of musical signals into genres, using only the synchronous organization of music events (the vertical dimension of music). This feature set generates a vector space to describe the spectrogram representation of a music signal. The space is modeled statistically by a mixture of Gaussians (GMM). A new signal is classified by considering the likelihoods over all the estimated feature vectors given these statistical models, without constructing a model for the signal itself. Cross-validation tests on two datasets commonly used for this task show the superiority of the proposed features compared to the widely used MFCC type of representation, based on classification accuracies (an improvement of over 9%) as well as on a stability measure for GMMs introduced in this thesis. Furthermore, we compare the results of Non-negative Matrix Factorization and Independent Component Analysis when used for the approximation of spectrograms, documenting the superiority of Non-negative Matrix Factorization. Based on our findings, we give a concept for a complete musical genre classification system using matrix factorization and Support Vector Machines.
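A minimal sketch of the NMF building block: Lee-Seung multiplicative updates for V ≈ WH under the Euclidean objective, which could be applied to a magnitude spectrogram. This is a generic textbook implementation under assumed defaults, not the thesis's feature-extraction pipeline.

```python
import numpy as np

def nmf(V, rank, n_iter=500, seed=0):
    """Non-negative matrix factorization V ~ W @ H via Lee-Seung
    multiplicative updates (Euclidean objective)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + 1e-3  # small offset keeps entries positive
    H = rng.random((rank, m)) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)
    return W, H
```

For a spectrogram V (frequency x time), the columns of W act as spectral basis components and the rows of H as their activations over time; statistics of such components are one plausible route to the kind of feature vectors the abstract describes.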

Yannis Sfakianakis, A statistical approach for intrusion detection [PDF] - funded by ICS-FORTH

Abstract: Since the Internet's growth, network security has played a vital role in the computer industry. Attacks are becoming much more sophisticated, and this fact has led the computer community to look for better and more advanced countermeasures. Malicious users existed long before the Internet was created; however, the Internet gave intruders a major boost towards potential compromises. Naturally, the Internet provides convenience and comfort to every user, and "bad news" is merely an infelicity. Clearly the Internet is a step forward; nevertheless, it must be used for the correct reasons and towards the right cause. As computer technology becomes more elaborate and complex, program vulnerabilities are more frequent and compromises effortless. One means of attack containment is the so-called "Intrusion Detection System" (IDS). In this thesis we built a network anomaly IDS, using statistical properties of the network's traffic. We were interested in building a general-purpose, adaptive, and data-independent system with as few parameters as possible. The types of attacks it can detect are Denial of Service attacks and probing attacks. We used three models in our experiments: Fisher's Linear Discriminant, the Gaussian mixture model, and Support Vector Machines. In our experiments we found that the most important part of statistical intrusion detection is feature selection. Better results can be achieved when both classes (attack and normal traffic) are modeled. The best results were achieved using Fisher's Linear Discriminant, namely a 90% detection rate with a 5% false alarm rate.
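As a sketch of the best-performing model above, Fisher's Linear Discriminant reduces to one linear solve, w = Sw^-1 (m1 - m0), where Sw is the within-class scatter matrix. The helper below is a generic implementation with hypothetical naming, not the thesis's detector.

```python
import numpy as np

def fisher_direction(X0, X1):
    """Fisher's linear discriminant: the direction w maximizing
    between-class over within-class scatter, w = Sw^-1 (m1 - m0)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter = sum of the two class scatter matrices.
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) + \
         np.cov(X1, rowvar=False) * (len(X1) - 1)
    # Small ridge term guards against a singular scatter matrix.
    w = np.linalg.solve(Sw + 1e-9 * np.eye(len(m0)), m1 - m0)
    return w / np.linalg.norm(w)
```

Traffic feature vectors projected onto w can then be thresholded to trade detection rate against false alarms, which is how operating points like 90%/5% arise.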

2006

Yannis Pantazis, Detection of Discontinuities in Concatenative Speech Synthesis [PDF] - funded by ICS-FORTH

Abstract: In the last decade, unit selection synthesis became a hot topic in speech synthesis research. Unit selection gives the greatest naturalness because it does not apply a large amount of digital signal processing to the recorded speech, which often makes recorded speech sound less natural. In order to find the best units in the database, unit selection is based on two cost functions, the target cost and the concatenation cost. The concatenation cost refers to how well adjacent units can be joined. The problem of finding a concatenation cost function is broken into two subproblems: finding the proper parameterization of the signal, and finding the right distance measure.
Recent studies have attempted to specify which concatenation distance measures are able to predict audible discontinuities and thus correlate highly with human perception of discontinuity at the concatenation point. However, none of the concatenation costs used so far can measure the similarity (or (dis-)continuity) of two consecutive units efficiently. Many features, such as line spectral frequencies (LSF) and Mel frequency cepstral coefficients (MFCC), have been used for the detection of discontinuities.
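A toy illustration of a concatenation cost: the distance between the boundary feature vectors (e.g. MFCC or LSF) of two units, used to rank join candidates. The names and the plain Euclidean choice are illustrative assumptions; the thesis investigates which such measures actually predict audible discontinuities.

```python
import numpy as np

def concatenation_cost(feat_left, feat_right):
    """Euclidean distance between the feature vector of the last frame
    of the left unit and the first frame of the right unit."""
    return float(np.linalg.norm(np.asarray(feat_left) - np.asarray(feat_right)))

def best_join(left_last_frame, candidates):
    """Pick the candidate unit whose boundary frame minimizes the cost."""
    costs = [concatenation_cost(left_last_frame, c) for c in candidates]
    return int(np.argmin(costs)), costs
```

In a full unit-selection search, this per-join cost would be combined with the target cost inside a Viterbi search over the whole utterance rather than minimized join by join.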