The intellectual property of the theses is jointly owned by the University of Crete and ICS-FORTH or by the University of Crete and Orange Labs, where indicated.
2022
Muhammed Shifas P.V.,
Neural Networks for the Quality and Intelligibility Enhancement of Speech [PDF] - funded by Horizon2020
Abstract: Speech is the most effective way to communicate ideas generated in human minds. However, spoken communication in real life is often affected by noise in the surroundings, which can substantially reduce the intelligibility and perceived quality of the signal. Techniques to enhance communication have been proposed in the past and successfully tested in modern engines like Amazon Alexa, allowing them to operate in adverse conditions. Ambient noise can disrupt both signal acquisition by a device and speech perception by the listener. Speech enhancement (SE) techniques are developed to restore speech from its disrupted observations, while listening enhancement (LE) techniques are designed to improve perceived intelligibility by altering the speech before its presentation in noise, since naturally produced speech is not always highly intelligible. In modern devices, SE and LE systems often operate as two independent modules, which limits their performance. The effort in this thesis is to combine SE and LE techniques into an end-to-end system for communication applications. We approach the problem from a neural network perspective. As such, multiple novel architectures for SE and LE were invented, and the concepts from those models have been used to build the final end-to-end system.
Regarding speech enhancement (SE), three new architectures are proposed: two in the feature domain and one in the waveform domain. The feature-domain architectures formulate the enhancement task in the short-time Fourier transform (STFT) representation of speech and are therefore parametrically less complex. Features from the two-dimensional (2D) representation of speech are extracted with a gruCNN neural cell, which proves effective at isolating noise with high variance. The gruCNN-SE model outperformed state-of-the-art speech enhancement systems built on standard convolutional (CNN) and long short-term memory (LSTM) cells. Subsequently, a bidirectional extension of the gruCNN module (BigruCNN) is proposed, which includes backward dependencies among the 2D frames. In addition, a novel waveform-domain network with a characteristic dilation pattern (SE-FFTNet) is presented. SE-FFTNet proves effective at learning the statistical dissimilarity between speech and noise in a noisy observation.
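For illustration, here is a minimal PyTorch sketch of a convolutional-recurrent cell in the spirit of the gruCNN idea: GRU gating computed with convolutions along the frequency axis, with recurrence over time frames. The layer sizes, kernel width, and gating layout are assumptions for the sketch, not the thesis implementation.

```python
# Illustrative ConvGRU-style cell: GRU gates are convolutions over
# frequency, so the recurrent state keeps the STFT's spectral structure.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.update = nn.Conv1d(in_ch + hid_ch, hid_ch, k, padding=k // 2)
        self.reset = nn.Conv1d(in_ch + hid_ch, hid_ch, k, padding=k // 2)
        self.cand = nn.Conv1d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

    def forward(self, x, h):
        xh = torch.cat([x, h], dim=1)
        z = torch.sigmoid(self.update(xh))   # update gate
        r = torch.sigmoid(self.reset(xh))    # reset gate
        n = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * n           # new hidden state

# One STFT magnitude frame per step: tensors are (batch, channels, freq_bins).
cell = ConvGRUCell(in_ch=1, hid_ch=16)
h = torch.zeros(4, 16, 257)
for x in torch.randn(100, 4, 1, 257):        # 100 time frames
    h = cell(x, h)
```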
Regarding listening enhancement (LE), a novel WaveNet-like architecture (wSSDRC) is proposed to improve the listener's intelligibility in noise. The wSSDRC system performs both spectral shaping (SS) and dynamic range compression (DRC) of its input for intelligibility enhancement. The model produces a median absolute intelligibility boost of 39% for normal-hearing and 38% for hearing-impaired listeners in stationary noise over unprocessed speech.
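To make the DRC half of that pipeline concrete, here is a hedged numpy sketch of envelope-based dynamic range compression (the SS stage is omitted); the attack/release constants and compression ratio are illustrative assumptions, not the wSSDRC parameters.

```python
# Illustrative envelope-based DRC: quiet segments are amplified relative
# to loud ones, flattening the speech envelope before presentation in noise.
import numpy as np

def drc(x, sr, ratio=2.0, attack_ms=5.0, release_ms=20.0, eps=1e-8):
    # One-pole envelope follower with separate attack/release times.
    a_att = np.exp(-1.0 / (sr * attack_ms / 1000.0))
    a_rel = np.exp(-1.0 / (sr * release_ms / 1000.0))
    env = np.zeros_like(x)
    e = eps
    for n, v in enumerate(np.abs(x)):
        a = a_att if v > e else a_rel
        e = a * e + (1 - a) * v
        env[n] = e
    ref = np.mean(env) + eps
    # Compress level deviations from the mean envelope by `ratio`.
    gain = (ref / (env + eps)) ** (1.0 - 1.0 / ratio)
    return x * gain

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 200 * t) * (0.1 + 0.9 * (t > 0.5))  # quiet-then-loud
y = drc(x, sr)
```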
Subsequently, a novel end-to-end system that combines the objectives of SE and LE is proposed to enhance the intelligibility of noisy observations. The end-to-end system increased the listeners' keyword correct rate in stationary noise from 2.5% to 60% at 0 dB input SNR, and from about 10% to 75% at 5 dB input SNR, compared with unprocessed speech, while substantially outperforming a modular setup with SE followed by LE.
2015
Maria C. Koutsogiannaki,
Intelligibility enhancement of Casual speech based on Clear speech properties [PDF] - funded by ICS-FORTH
Abstract: In adverse listening conditions (e.g., presence of noise, a hearing-impaired listener, etc.), people adjust their speech in order to overcome the communication difficulty and successfully deliver their message. This remarkable adjustment produces speaking styles that differ from unobstructed (casual) speech and vary among speakers and conditions, but share a common characteristic: high intelligibility. Developing algorithms that exploit acoustic features of intelligible human speech could be beneficial for speech technology applications that seek to enhance the intelligibility of "speaking devices". Beyond the commercial scope of these applications (e.g., mobile telephony, GPS, customer service systems), their medical context is most important, as they provide assistive communication to people with speech or hearing deficits. However, current speech technology is deaf, meaning that it cannot adjust, as humans do, to dynamically changing real environments or to the listener's specific needs.
This work proposes signal modifications based on the acoustic properties of a highly intelligible human speaking style, clear speech, assisting in the development of smart speech technology systems that "mimic" the way people produce intelligible speech. Unlike other speaking styles, clear speech has a high intelligibility impact on various listening populations (native and non-native listeners, the hearing impaired, cochlear implant users, elderly people, people with learning disabilities, etc.) in many listening conditions (quiet, noise, reverberation). A significant part of this work is devoted to a comparative analysis of casual and clear speech, which reveals differences in prosody, vowel spaces, spectral energy, and modulation depth of the temporal envelopes. Based on these observed and measured differences between the two speaking styles, we propose modifications for enhancing the intelligibility of casual speech. Compared to other state-of-the-art modification systems, our techniques (1) do not require excessive computation, (2) are speaker- and speech-independent, (3) maintain speech quality, and (4) are explicit, since they require neither statistical training nor preexisting clear speech recordings.
Intelligibility and quality are evaluated objectively, using recently proposed objective intelligibility scores, and subjectively, with listening tests conducted with native and non-native listeners in noisy environments (speech-shaped noise, SSN), in reverberation, and in quiet. Results show that our modifications enhance speech intelligibility in SSN and reverberation for native and non-native listeners. Specifically, the proposed spectral modification technique, namely Mix-filtering, increases the intelligibility of speech in noise and reverberation while maintaining the quality of the original signal, unlike other intelligibility boosters. Moreover, a modulation depth enhancement technique called DMod increases speech intelligibility by more than 30% in SSN. The DMod algorithm is inspired both by clear speech properties and by the non-linear phenomena that take place in the basilar membrane. DMod not only enhances speech intelligibility but also introduces a novel method for manipulating the modulation spectrum of the signal. The results of this study indicate a connection between the modulations of the temporal envelopes and speech perception, specifically with processes that take place on the basilar membrane of the human ear, and pave the way for analyzing and comprehending speech in terms of modulations.
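As a rough illustration of the idea of deepening envelope modulations (not the DMod algorithm itself), one can scale the temporal envelope's fluctuations around its slowly varying mean; the cutoff frequency and scaling factor below are assumptions for the sketch.

```python
# Rough illustration of modulation depth enhancement: expand the
# temporal envelope's fluctuations around its slowly varying mean.
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def enhance_modulation_depth(x, sr, alpha=1.5, lp_hz=2.0):
    env = np.abs(hilbert(x))                 # temporal envelope
    fine = x / np.maximum(env, 1e-8)         # carrier (fine structure)
    b, a = butter(2, lp_hz / (sr / 2))       # slowly varying envelope mean
    mean_env = filtfilt(b, a, env)
    # Scale deviations from the mean by alpha > 1 to deepen modulations.
    new_env = np.maximum(mean_env + alpha * (env - mean_env), 0.0)
    return fine * new_env

sr = 16000
t = np.arange(sr) / sr
x = (1 + 0.3 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 500 * t)
y = enhance_modulation_depth(x, sr)
```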
2014
George P. Kafentzis,
Adaptive Sinusoidal Models for Speech with Applications in Speech Modifications and Audio Analysis [PDF] - funded by Orange Labs
Abstract: Sinusoidal modeling is one of the most widely used parametric methods for speech and audio signal processing. The accurate estimation of the sinusoidal parameters (amplitudes, frequencies, and phases) is critical for a close representation of the analyzed signal. In this thesis, building on recent advances in sinusoidal analysis, we propose high-resolution adaptive sinusoidal models for the analysis, synthesis, and modification of speech. Our goal is to provide systems that represent speech in a highly accurate and compact way.
Inspired by the recently introduced adaptive Quasi-Harmonic Model (aQHM) and adaptive Harmonic Model (aHM), we review the theory of adaptive sinusoidal modeling and propose a model named the extended adaptive Quasi-Harmonic Model (eaQHM), a non-parametric model able to adjust the instantaneous amplitudes and phases of its basis functions to the underlying time-varying characteristics of the speech signal, thus significantly relaxing the so-called local stationarity hypothesis. The eaQHM is shown to outperform aQHM in the analysis and resynthesis of voiced speech. Based on the eaQHM, a hybrid analysis/synthesis system of speech is presented (eaQHNM), along with a hybrid version of the aHM (aHNM). Moreover, we motivate a full-band representation of speech using the eaQHM, that is, representing all parts of speech as high-resolution AM-FM sinusoids. Experiments show that adaptation and quasi-harmonicity are sufficient to provide transparent quality in unvoiced speech resynthesis. The full-band eaQHM analysis and synthesis system is presented next; it outperforms state-of-the-art systems, hybrid or full-band, in speech reconstruction, providing transparent quality confirmed by objective and subjective evaluations.
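For reference, the frame model behind this family of methods can be sketched as follows (a sketch in the notation of the aQHM/eaQHM literature; symbols assumed): the quasi-harmonic frame model pairs a complex amplitude with a complex slope per component, and the extended adaptive form replaces the stationary exponential basis with amplitude and phase tracks estimated at the previous iteration.

```latex
% Quasi-harmonic frame model: complex amplitude a_k, complex slope b_k.
x(t) = \sum_{k=-K}^{K} \left( a_k + t\, b_k \right) e^{\,j 2\pi \hat f_k t},
\qquad |t| \le T
% eaQHM: stationary exponentials replaced by time-varying amplitude and
% phase tracks \hat A_k(t), \hat\phi_k(t) from the previous iteration.
x(t) = \sum_{k=-K}^{K} \left( a_k + t\, b_k \right) \hat A_k(t)\,
e^{\,j \hat\phi_k(t)}
```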
Regarding applications, the eaQHM and the aHM are applied to speech modifications (time and pitch scaling). The resulting modifications are of high quality and follow very simple rules compared to other state-of-the-art modification systems. The concepts of relative phase and relative phase delay are crucial for the development of artefact-free, shape-invariant, high-quality modifications. Results show that harmonicity is preferred over quasi-harmonicity in speech modifications due to the inherent simplicity of its representation. Moreover, the full-band eaQHM is applied to the problem of modeling audio signals, specifically musical instrument sounds. The eaQHM is evaluated against state-of-the-art systems and is shown to outperform them in terms of resynthesis quality, successfully representing the attack, transient, and stationary parts of a musical instrument sound. Finally, another application is suggested, namely the analysis and classification of emotional speech. The eaQHM is applied to the analysis of emotional speech, providing its instantaneous parameters as features that can be used in recognition and Vector-Quantization-based classification of the emotional content of speech. Although sinusoidal models are not commonly used in such tasks, the results are promising.
2011
Maria Markaki,
Selection of Relevant Features for Audio Classification tasks [PDF] - funded by ICS-FORTH
Abstract: Advances in time-frequency distributions and spectral analysis techniques (e.g., for the estimation of amplitude and/or frequency modulations) allow a better representation of non-stationary signals like speech, highlighting their fine structure and dynamics. Although such representations are very useful for analysis purposes, they complicate classification tasks due to the large number of parameters extracted from the signal (the "curse of dimensionality"). For such tasks, a significant dimensionality reduction is required.
In this thesis, the problem of dimensionality reduction of these time-frequency representations is studied; selection criteria for the optimal parameters are suggested, based on their relevance to a given classification task, where relevance is defined in terms of mutual information. First, using tools from multilinear algebra, such as the Higher-Order SVD, the initial dimensions and the noise components of the representation are reduced. Then, feature selection proceeds based on a maximum-relevance criterion. It is shown that the suggested process is equivalent to the maximum-dependency criterion for feature selection without, however, requiring the estimation of multivariate probability densities.
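A hedged sketch of maximum-relevance ranking with mutual information, using scikit-learn's estimator as a stand-in for the thesis' own MI computation; the data here are synthetic.

```python
# Illustrative maximum-relevance feature ranking: score each feature by
# its mutual information with the class label, then keep the top few.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)        # binary class labels
X = rng.normal(size=(500, 20))          # 20 candidate features
X[:, 3] += 2.0 * y                      # make feature 3 class-relevant

relevance = mutual_info_classif(X, y, random_state=0)
top = np.argsort(relevance)[::-1][:5]   # indices of the top-5 features
print("most relevant features:", top)
```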
The feature selection approach suggested in the thesis is applied to a number of audio classification tasks, including speech detection in broadcast news and voice pathology detection and discrimination from vowel recordings. The complementarity of the modulation spectral features to the state-of-the-art Mel-frequency cepstral coefficients is shown for these classification tasks. A system for the automatic discrimination of pathological heart murmurs using a high-resolution time-frequency analysis of the phonocardiogram (PCG) is also presented. The classification accuracy of the system is comparable to the diagnostic accuracy of experienced pediatric cardiologists on the same PCG dataset.
2010
Andre Holzapfel,
Similarity methods for computational ethnomusicology [PDF] - funded by ICS-FORTH
Abstract: The field of computational ethnomusicology has drawn growing attention from researchers in the music information retrieval community. In general, it considers subjects related to the processing of traditional forms of music, often with the goal of supporting musicological studies with computational means. Tools have been proposed that ease access to large digital collections of traditional music, for example by automatically detecting a specific kind of similarity between pieces or by automatically segmenting data into partitions that are either relevant or irrelevant for further investigation.
In this thesis, the focus lies on music of the Eastern Mediterranean, and specifically on traditional music of Greece and Turkey. At the beginning of the thesis work, a task was defined that directed the necessary research activities. The task was motivated by the geographical location of the author, the island of Crete in Greece, but in the course of the thesis it proved to have strong relevance for a much wider musical context: given a polyphonic recording of a piece of Cretan traditional dance music, find a recording that is similar to it. Musicological theory provided the way to approach this task.
The traditional music encountered in Greece, as well as in wide parts of the Balkans and Turkey, follows the logic of parataxis: pieces are constructed by aligning short musical phrases in time, without the larger structures present in classical or popular music. Thus, a system designed to cope with the above task must be able to estimate the similarity of such phrases. As we deal with polyphonic audio signals of music that has not been written to a score, at least not before the performance, some simplification is needed.
The exact transcription of the main melody from a polyphonic mixture into a score is still an unsolved problem, and the transcription of traditional music even by human experts is an extremely complex and difficult process. For these reasons, a system has been designed that considers aspects of rhythm, timbre, and melody when approaching the task.
The central aspect considered in this thesis is rhythm. A point of major interest is estimating the time instants within an audio signal at which a musical instrument starts playing a note. This estimation is referred to as onset detection, and it has been approached in this thesis using novel group-delay and fundamental-frequency based approaches, and with a fusion of these characteristics with a spectral amplitude criterion. Building on these findings in onset detection, improved beat trackers and rhythmic similarity estimation techniques are developed. The proposed beat tracker applies the group-delay based onset detection method within a state-of-the-art approach for beat tracking. Results show clear improvements when applying this method for beat tracking on a dataset of traditional music.
The rhythmic similarity estimation is based on the scale transform, which avoids the influence of tempo differences between the pieces of music being compared. On datasets containing Greek and Turkish traditional music, high accuracies are achieved in a classification task, and the validity of the proposed measure as a similarity measure is supported by the results of listening tests.
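A minimal numpy sketch of the tempo-invariance argument, assuming the scale transform is computed (up to interpolation details) as the Fourier transform of an exponentially resampled signal weighted by e^{u/2}; the test signals are synthetic stand-ins for onset-strength periodicity functions.

```python
# Illustrative scale-transform magnitude: resample onto a log-time grid,
# weight by exp(u/2), FFT; the magnitude is invariant to time scaling,
# i.e. to tempo changes of the underlying rhythm.
import numpy as np

def scale_transform_mag(f, t, n_points=512):
    u = np.linspace(np.log(t[1]), np.log(t[-1]), n_points)  # log-time axis
    g = np.interp(np.exp(u), t, f) * np.exp(u / 2.0)
    return np.abs(np.fft.rfft(g))

# Two "periodicity functions" differing only by tempo (a 25% time scale).
t = np.linspace(1e-3, 4.0, 4000)
a = np.exp(-t) * (1 + np.cos(2 * np.pi * 2.0 * t))
b = np.exp(-t / 1.25) * (1 + np.cos(2 * np.pi * 1.6 * t))  # b(t) = a(0.8 t)
print(np.corrcoef(scale_transform_mag(a, t), scale_transform_mag(b, t))[0, 1])
```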
Apart from rhythm, the aspect of instrumental timbre has also been addressed. A novel feature set based on Non-negative Matrix Factorization (NMF) is proposed to describe the characteristic spectral bases of a piece of music. These bases are modelled using statistical methods, and it is shown that these models describe the spectral space of musical genres and instrumental classes in a compact and discriminative way.
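A hedged sketch of the NMF step, using scikit-learn's factorization on a stand-in magnitude spectrogram; the number of components is an assumption for the sketch.

```python
# Illustrative NMF timbre features: factor a magnitude spectrogram
# V (freq x time) into spectral bases W and activations H, V ~= W @ H;
# the columns of W summarize the characteristic spectra of the piece.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((257, 400))                # stand-in magnitude spectrogram
model = NMF(n_components=8, init="nndsvd", max_iter=400, random_state=0)
W = model.fit_transform(V)                # (257, 8) spectral bases
H = model.components_                     # (8, 400) activations
print(W.shape, H.shape)
```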
Finally, melodic aspects have been considered as well, by combining state-of-the-art approaches for cover song detection in popular music with fundamental frequency detection from polyphonic signals. This combination is shown to tackle the central task of the thesis in a satisfying way on a small exemplary dataset. A morphological analysis framework that combines the aspects of rhythm, timbre, and melody is proposed, which can be used to detect similarities in traditional music.
For the development of the algorithms presented in this thesis, evaluation data had to be collected. This was a task of major difficulty, and much effort was made by the author to understand well the musical context investigated in this thesis. For many datasets, the ground truth was obtained in cooperation with local musicians in time-consuming but very informative interviews. The knowledge gained in these interviews and the resulting datasets are another important contribution of this thesis.
Yannis Pantazis,
Decomposition of AM-FM signals with applications in speech processing [PDF] - funded by ICS-FORTH/Orange Labs
Abstract: During the last decades, the sinusoidal model has gained a lot of popularity, since it is able to represent non-stationary signals very accurately. The estimation of the instantaneous components (i.e., instantaneous amplitude, instantaneous frequency, and instantaneous phase) is an active area of research. In this thesis, we develop and test models and algorithms for the estimation of the instantaneous components of the sinusoidal representation. Our goal is to reduce the estimation error due to the non-stationary character of the analyzed signals by taking advantage of time-domain information. Thus, we re-introduce a time-varying model referred to as the Quasi-Harmonic Model (QHM), which is able to adjust its frequency values toward the true frequencies. We further show that an iterative scheme based on QHM produces statistically efficient sinusoidal parameter estimates. Moreover, we extend QHM to the chirp QHM (cQHM), which captures linear evolution of the instantaneous frequency quite satisfactorily.
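In the notation of the QHM literature, the frame model and the frequency-mismatch correction can be sketched as follows (a sketch; symbols assumed):

```latex
% QHM frame model: complex amplitude a_k, complex slope b_k.
s(t) = \sum_{k=1}^{K} \left( a_k + t\, b_k \right) e^{\,j 2\pi \hat f_k t},
\qquad |t| \le T
% The complex slope carries an estimate of the mismatch between the
% assumed frequency \hat f_k and the true one, corrected iteratively:
\hat\eta_k = \frac{1}{2\pi}\,\frac{\mathrm{Im}\{a_k^{*} b_k\}}{|a_k|^{2}},
\qquad \hat f_k \leftarrow \hat f_k + \hat\eta_k
```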
However, neither QHM nor cQHM is able to represent highly non-stationary signals adequately. Thus, we further extend QHM to the adaptive QHM (aQHM), which uses time-domain frequency information. aQHM is able to adjust its non-parametric basis functions to the time-varying characteristics of the signal, which reduces the estimation error of the instantaneous components. Moreover, an adaptive AM-FM decomposition algorithm based on aQHM is proposed. Results on synthetic signals as well as on voiced speech show that aQHM greatly reduces the reconstruction error compared to QHM or the sinusoidal model of McAulay and Quatieri.
Concentrating on speech applications, we develop an analysis/synthesis speech system based on aQHM. Specifically, aQHM is used to represent the quasi-periodic part of speech, while the aperiodic part is modeled as time- and frequency-modulated noise. The resynthesized speech signal produced by the proposed system is indistinguishable from the original. Finally, another application where aQHM can be applied is the extraction of vocal tremor characteristics. Since vocal tremor is defined as modulations of the instantaneous components of speech, aQHM is an appropriate model for representing these modulations. Indeed, results show that the reconstructed signals are close to the original signals, which validates our method.
2007
Yannis Agiomyrgiannakis,
Sinusoidal Coding of Speech for Voice over IP [PDF] - funded by ICS-FORTH
Abstract: It is widely accepted that Voice-over-Internet-Protocol (VoIP) will dominate wireless and wireline voice communications in the near future. Traditionally, a minimum level of Quality-of-Service is achieved by careful traffic monitoring and network fine-tuning. However, this solution is not feasible when there is no possibility of controlling or monitoring the parameters of the network. For example, when speech traffic is routed through the Internet, packet losses increase due to network delays and the strict end-to-end delay requirements of voice communication. Most of today's speech codecs were not initially designed to cope with such conditions. One solution is to introduce channel coding at the expense of end-to-end delay. Another is to perform joint source/channel coding of speech by designing speech codecs that are natively robust to increased packet losses. This thesis proposes a framework for developing speech codecs that are robust to packet losses.
The thesis addresses the problem at two levels: at the basic source/channel coding level, where novel methods are proposed for introducing controlled redundancy into the bitstream, and at the signal representation/coding level, where a novel speech parameterization/model is presented that is amenable to efficient quantization using the proposed source coding methods. The speech codec is designed to facilitate high-quality Packet Loss Concealment (PLC). The speech signal is modeled with harmonically related sinusoids, a representation that enables fine time-frequency resolution, which is vital for high-quality PLC. Furthermore, each packet is encoded independently of the previous packets in order to avoid desynchronization between the encoder and the decoder upon a packet loss; this leaves some redundancy in the bit-stream.
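For reference, the harmonic model of a voiced frame can be written in its standard form (symbols assumed):

```latex
% Harmonic sinusoidal model: all components locked to the fundamental f_0.
s(t) = \sum_{k=1}^{K} A_k \cos\!\left( 2\pi k f_0 t + \phi_k \right)
```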
A number of contributions are made to well-known harmonic speech models. A fast analysis/synthesis method is proposed and used in the construction of an Analysis-by-Synthesis (AbS) pitch detector. Harmonic codecs tend to rely on phase models for the reconstruction of the harmonic phases, introducing artifacts that affect the quality of the reconstructed speech signal. For high-quality speech reconstruction, quantization of the phase is required. Unfortunately, phase quantization is not a trivial problem, because phases are circular variables. A novel phase-quantization algorithm is proposed to address this problem. Harmonic phases are properly aligned and modeled with a Wrapped Gaussian Mixture Model (WGMM) capable of handling parameters that belong to circular spaces. The WGMM is estimated with a suitable Expectation-Maximization (EM) algorithm. Phases are then quantized by extending the efficient GMM-based quantization techniques for linear spaces to WGMM and circular spaces. When packet losses increase, additional redundancy can be introduced using Multiple Description Coding (MDC). In MDC, each frame is encoded into two descriptions; receiving both descriptions provides a high-quality reconstruction, while receiving one description provides a lower-quality reconstruction. Current GMM-based MDC schemes make it possible to quantize the amplitudes of the harmonics, which represent an important portion of the information of the speech signal. A novel WGMM-based MDC scheme is proposed and used for MDC of the harmonic phases.
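For reference, a wrapped Gaussian mixture places, for each component, Gaussian probability mass wrapped onto the circle (a sketch of the standard definition; symbols assumed):

```latex
% Wrapped GMM density on the circle, \theta \in [-\pi, \pi):
p(\theta) = \sum_{m=1}^{M} w_m \sum_{l=-\infty}^{\infty}
\mathcal{N}\!\left(\theta + 2\pi l;\ \mu_m,\ \sigma_m^{2}\right),
\qquad \sum_{m=1}^{M} w_m = 1
```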
It is shown that it is possible to construct high-quality MDC codecs based on harmonic models. Furthermore, it is shown that the redundancy between the MDC descriptions can be used to "correct" bit errors that may have occurred during transmission. At the source coding level, a scheme for Multiple Description Transform Coding (MDTC) of multivariate Gaussians using Parseval frame expansions and a source coding technique referred to as Conditional Vector Quantization (CVQ) are proposed. The MDTC algorithm is extended to generic sources that can be modeled with a GMM. The proposed frame expansion facilitates a computationally efficient Optimal Consistent Reconstruction (OCR) algorithm and Cooperative Encoding (CE). In CE, the two MDTC encoders cooperate in order to provide better central/side distortion tradeoffs. The proposed scheme provides scalability, low complexity and storage requirements, excellent performance at low redundancies, and competitive performance at high redundancies. In CVQ, the focus is on correcting the most frequent types of error: single and double packet losses. Furthermore, CVQ finds application in Bandwidth Expansion (BWE), the extension of narrowband speech to wideband. Concluding, two proof-of-concept harmonic codecs are constructed: a single-description and a multiple-description codec. Both codecs are narrowband and variable-rate, similar in quality to the state-of-the-art iLBC (internet Low Bit-Rate Codec) under perfect channel conditions and better than iLBC when packet losses occur. The single-description codec requires 14 kbps and tolerates 20% packet loss with minimal quality degradation, while the multiple-description codec operates at 21 kbps and tolerates 40% packet loss without significant quality degradation.