The intellectual property of the theses is jointly owned by the University of Crete and ICS-FORTH or by the University of Crete and Orange Labs, where indicated.
2022
Muhammed Shifas P.V.,
Neural Networks for the Quality and Intelligibility Enhancement of Speech [PDF] - funded by Horizon2020
Abstract: Speech is the most effective way to communicate ideas generated in human minds. However, spoken communication in real life is often affected by noise in the surroundings, which can substantially reduce the intelligibility and perceived quality of the signal. Techniques to enhance communication have been proposed in the past and successfully tested in modern engines like Amazon Alexa, allowing them to operate in adverse conditions. Ambient noise can disrupt both signal acquisition by a device and speech perception by the listener. Speech enhancement (SE) techniques are developed to restore speech from its disrupted observations, while listening enhancement (LE) techniques are designed to improve perceived intelligibility by altering the speech before its presentation in noise, since naturally produced speech is not always highly intelligible. In modern devices, SE and LE systems often operate as two independent modules, which limits their performance. The effort in this thesis is to combine SE and LE techniques into an end-to-end system for communication applications. We approach the problem from a neural network perspective. As such, multiple novel architectures for SE and LE were invented, and the concepts from those models have been used to build the final end-to-end system.
Regarding speech enhancement (SE), three new architectures are proposed: two in the feature domain and one in the waveform domain. The feature-domain architectures formulate the enhancement task in the short-time Fourier transform (STFT) representation of speech and are therefore parametrically less complex. Features from the two-dimensional (2D) representation of speech are extracted with a gruCNN neural cell, which proves effective at isolating noise with high variance. The gruCNN-SE model outperformed state-of-the-art speech enhancement systems built on standard convolutional (CNN) and long short-term memory (LSTM) cells. Subsequently, a bidirectional extension of the gruCNN module (BigruCNN) is proposed, which includes backward dependencies among the 2D frames. In addition, a novel waveform-domain network with a characteristic dilation pattern (SE-FFTNet) is presented. SE-FFTNet proves effective at learning the statistical dissimilarity between speech and noise in a noisy observation.
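For illustration, here is a minimal PyTorch sketch of a convolutional-recurrent cell in the spirit of the gruCNN idea: GRU gating computed with convolutions along the frequency axis, with recurrence over time frames. The layer sizes, kernel width, and gating layout are assumptions for the sketch, not the thesis implementation.

```python
# Illustrative ConvGRU-style cell: GRU gates are convolutions over
# frequency, so the recurrent state keeps the STFT's spectral structure.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.update = nn.Conv1d(in_ch + hid_ch, hid_ch, k, padding=k // 2)
        self.reset = nn.Conv1d(in_ch + hid_ch, hid_ch, k, padding=k // 2)
        self.cand = nn.Conv1d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

    def forward(self, x, h):
        xh = torch.cat([x, h], dim=1)
        z = torch.sigmoid(self.update(xh))   # update gate
        r = torch.sigmoid(self.reset(xh))    # reset gate
        n = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * n           # new hidden state

# One STFT magnitude frame per step: tensors are (batch, channels, freq_bins).
cell = ConvGRUCell(in_ch=1, hid_ch=16)
h = torch.zeros(4, 16, 257)
for x in torch.randn(100, 4, 1, 257):        # 100 time frames
    h = cell(x, h)
```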
Regarding listening enhancement (LE), a novel WaveNet-like architecture (wSSDRC) is proposed to improve the listener's intelligibility in noise. The wSSDRC system performs both spectral shaping (SS) and dynamic range compression (DRC) of its input for intelligibility enhancement. The model produces a median absolute intelligibility boost of 39% for normal-hearing and 38% for hearing-impaired listeners in stationary noise over unprocessed speech.
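To make the DRC half of that pipeline concrete, here is a hedged numpy sketch of envelope-based dynamic range compression (the SS stage is omitted); the attack/release constants and compression ratio are illustrative assumptions, not the wSSDRC parameters.

```python
# Illustrative envelope-based DRC: quiet segments are amplified relative
# to loud ones, flattening the speech envelope before presentation in noise.
import numpy as np

def drc(x, sr, ratio=2.0, attack_ms=5.0, release_ms=20.0, eps=1e-8):
    # One-pole envelope follower with separate attack/release times.
    a_att = np.exp(-1.0 / (sr * attack_ms / 1000.0))
    a_rel = np.exp(-1.0 / (sr * release_ms / 1000.0))
    env = np.zeros_like(x)
    e = eps
    for n, v in enumerate(np.abs(x)):
        a = a_att if v > e else a_rel
        e = a * e + (1 - a) * v
        env[n] = e
    ref = np.mean(env) + eps
    # Compress level deviations from the mean envelope by `ratio`.
    gain = (ref / (env + eps)) ** (1.0 - 1.0 / ratio)
    return x * gain

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 200 * t) * (0.1 + 0.9 * (t > 0.5))  # quiet-then-loud
y = drc(x, sr)
```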
Subsequently, a novel end-to-end system that combines the objectives of SE and LE is proposed to enhance the intelligibility of noisy observations. The end-to-end system increased the listeners' keyword correct rate in stationary noise from 2.5% to 60% at 0 dB input SNR, and from about 10% to 75% at 5 dB input SNR, compared with unprocessed speech, while substantially outperforming a modular setup with SE followed by LE.
2015
Maria C. Koutsogiannaki,
Intelligibility enhancement of Casual speech based on Clear speech properties [PDF] - funded by ICS-FORTH
Abstract: In adverse listening conditions (e.g., presence of noise, a hearing-impaired listener, etc.), people adjust their speech in order to overcome the communication difficulty and successfully deliver their message. This remarkable adjustment produces speaking styles that differ from unobstructed (casual) speech and vary among speakers and conditions, but share a common characteristic: high intelligibility. Developing algorithms that exploit acoustic features of intelligible human speech could be beneficial for speech technology applications that seek to enhance the intelligibility of "speaking devices". Beyond the commercial scope of these applications (e.g., mobile telephony, GPS, customer service systems), their medical context is most important, as they provide assistive communication to people with speech or hearing deficits. However, current speech technology is deaf, meaning that it cannot adjust, as humans do, to dynamically changing real environments or to the listener's specific needs.
This work proposes signal modifications based on the acoustic properties of a highly intelligible human speaking style, clear speech, assisting in the development of smart speech technology systems that "mimic" the way people produce intelligible speech. Unlike other speaking styles, clear speech has a high intelligibility impact on various listening populations (native and non-native listeners, the hearing impaired, cochlear implant users, elderly people, people with learning disabilities, etc.) in many listening conditions (quiet, noise, reverberation). A significant part of this work is devoted to a comparative analysis of casual and clear speech, which reveals differences in prosody, vowel spaces, spectral energy, and modulation depth of the temporal envelopes. Based on these observed and measured differences between the two speaking styles, we propose modifications for enhancing the intelligibility of casual speech. Compared to other state-of-the-art modification systems, our techniques (1) do not require excessive computation, (2) are speaker- and speech-independent, (3) maintain speech quality, and (4) are explicit, since they require neither statistical training nor preexisting clear speech recordings.
Intelligibility and quality are evaluated objectively, using recently proposed objective intelligibility scores, and subjectively, with listening tests conducted with native and non-native listeners in noisy environments (speech-shaped noise, SSN), in reverberation, and in quiet. Results show that our modifications enhance speech intelligibility in SSN and reverberation for native and non-native listeners. Specifically, the proposed spectral modification technique, namely Mix-filtering, increases the intelligibility of speech in noise and reverberation while maintaining the quality of the original signal, unlike other intelligibility boosters. Moreover, a modulation depth enhancement technique called DMod increases speech intelligibility by more than 30% in SSN. The DMod algorithm is inspired both by clear speech properties and by the non-linear phenomena that take place in the basilar membrane. DMod not only enhances speech intelligibility but also introduces a novel method for manipulating the modulation spectrum of the signal. The results of this study indicate a connection between the modulations of the temporal envelopes and speech perception, specifically with processes that take place on the basilar membrane of the human ear, and pave the way for analyzing and comprehending speech in terms of modulations.
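As a rough illustration of the idea of deepening envelope modulations (not the DMod algorithm itself), one can scale the temporal envelope's fluctuations around its slowly varying mean; the cutoff frequency and scaling factor below are assumptions for the sketch.

```python
# Rough illustration of modulation depth enhancement: expand the
# temporal envelope's fluctuations around its slowly varying mean.
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def enhance_modulation_depth(x, sr, alpha=1.5, lp_hz=2.0):
    env = np.abs(hilbert(x))                 # temporal envelope
    fine = x / np.maximum(env, 1e-8)         # carrier (fine structure)
    b, a = butter(2, lp_hz / (sr / 2))       # slowly varying envelope mean
    mean_env = filtfilt(b, a, env)
    # Scale deviations from the mean by alpha > 1 to deepen modulations.
    new_env = np.maximum(mean_env + alpha * (env - mean_env), 0.0)
    return fine * new_env

sr = 16000
t = np.arange(sr) / sr
x = (1 + 0.3 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 500 * t)
y = enhance_modulation_depth(x, sr)
```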
2014
George P. Kafentzis,
Adaptive Sinusoidal Models for Speech with Applications in Speech Modifications and Audio Analysis [PDF] - funded by Orange Labs
Abstract: Sinusoidal modeling is one of the most widely used parametric methods for speech and audio signal processing. The accurate estimation of the sinusoidal parameters (amplitudes, frequencies, and phases) is critical for a close representation of the analyzed signal. In this thesis, building on recent advances in sinusoidal analysis, we propose high-resolution adaptive sinusoidal models for the analysis, synthesis, and modification of speech. Our goal is to provide systems that represent speech in a highly accurate and compact way.
Inspired by the recently introduced adaptive Quasi-Harmonic Model (aQHM) and adaptive Harmonic Model (aHM), we review the theory of adaptive sinusoidal modeling and propose a model named the extended adaptive Quasi-Harmonic Model (eaQHM), a non-parametric model able to adjust the instantaneous amplitudes and phases of its basis functions to the underlying time-varying characteristics of the speech signal, thus significantly relaxing the so-called local stationarity hypothesis. The eaQHM is shown to outperform aQHM in the analysis and resynthesis of voiced speech. Based on the eaQHM, a hybrid analysis/synthesis system of speech is presented (eaQHNM), along with a hybrid version of the aHM (aHNM). Moreover, we motivate a full-band representation of speech using the eaQHM, that is, representing all parts of speech as high-resolution AM-FM sinusoids. Experiments show that adaptation and quasi-harmonicity are sufficient to provide transparent quality in unvoiced speech resynthesis. The full-band eaQHM analysis and synthesis system is presented next; it outperforms state-of-the-art systems, hybrid or full-band, in speech reconstruction, providing transparent quality confirmed by objective and subjective evaluations.
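For reference, the frame model behind this family of methods can be sketched as follows (a sketch in the notation of the aQHM/eaQHM literature; symbols assumed): the quasi-harmonic frame model pairs a complex amplitude with a complex slope per component, and the extended adaptive form replaces the stationary exponential basis with amplitude and phase tracks estimated at the previous iteration.

```latex
% Quasi-harmonic frame model: complex amplitude a_k, complex slope b_k.
x(t) = \sum_{k=-K}^{K} \left( a_k + t\, b_k \right) e^{\,j 2\pi \hat f_k t},
\qquad |t| \le T
% eaQHM: stationary exponentials replaced by time-varying amplitude and
% phase tracks \hat A_k(t), \hat\phi_k(t) from the previous iteration.
x(t) = \sum_{k=-K}^{K} \left( a_k + t\, b_k \right) \hat A_k(t)\,
e^{\,j \hat\phi_k(t)}
```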
Regarding applications, the eaQHM and the aHM are applied to speech modifications (time and pitch scaling). The resulting modifications are of high quality and follow very simple rules compared to other state-of-the-art modification systems. The concepts of relative phase and relative phase delay are crucial for the development of artefact-free, shape-invariant, high-quality modifications. Results show that harmonicity is preferred over quasi-harmonicity in speech modifications due to the inherent simplicity of its representation. Moreover, the full-band eaQHM is applied to the problem of modeling audio signals, specifically musical instrument sounds. The eaQHM is evaluated against state-of-the-art systems and is shown to outperform them in terms of resynthesis quality, successfully representing the attack, transient, and stationary parts of a musical instrument sound. Finally, another application is suggested, namely the analysis and classification of emotional speech. The eaQHM is applied to the analysis of emotional speech, providing its instantaneous parameters as features that can be used in recognition and Vector-Quantization-based classification of the emotional content of speech. Although sinusoidal models are not commonly used in such tasks, the results are promising.
2011
Maria Markaki,
Selection of Relevant Features for Audio Classification tasks [PDF] - funded by ICS-FORTH
Abstract: Advances in time-frequency distributions and spectral analysis techniques (e.g., for the estimation of amplitude and/or frequency modulations) allow a better representation of non-stationary signals like speech, highlighting their fine structure and dynamics. Although such representations are very useful for analysis purposes, they complicate classification tasks due to the large number of parameters extracted from the signal (the "curse of dimensionality"). For such tasks, a significant dimensionality reduction is required.
In this thesis, the problem of dimensionality reduction of these time-frequency representations is studied; selection criteria for the optimal parameters are suggested, based on their relevance to a given classification task, where relevance is defined in terms of mutual information. First, using tools from multilinear algebra, such as the Higher-Order SVD, the initial dimensions and the noise components of the representation are reduced. Then, feature selection proceeds based on a maximum-relevance criterion. It is shown that the suggested process is equivalent to the maximum-dependency criterion for feature selection without, however, requiring the estimation of multivariate probability densities.
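A hedged sketch of maximum-relevance ranking with mutual information, using scikit-learn's estimator as a stand-in for the thesis' own MI computation; the data here are synthetic.

```python
# Illustrative maximum-relevance feature ranking: score each feature by
# its mutual information with the class label, then keep the top few.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)        # binary class labels
X = rng.normal(size=(500, 20))          # 20 candidate features
X[:, 3] += 2.0 * y                      # make feature 3 class-relevant

relevance = mutual_info_classif(X, y, random_state=0)
top = np.argsort(relevance)[::-1][:5]   # indices of the top-5 features
print("most relevant features:", top)
```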
The feature selection approach suggested in the thesis is applied to a number of audio classification tasks, including speech detection in broadcast news and voice pathology detection and discrimination from vowel recordings. The complementarity of the modulation spectral features to the state-of-the-art Mel-frequency cepstral coefficients is shown for these classification tasks. A system for the automatic discrimination of pathological heart murmurs using a high-resolution time-frequency analysis of the phonocardiogram (PCG) is also presented. The classification accuracy of the system is comparable to the diagnostic accuracy of experienced pediatric cardiologists on the same PCG dataset.
2010
Andre Holzapfel,
Similarity methods for computational ethnomusicology [PDF] - funded by ICS-FORTH
Abstract: The field of computational ethnomusicology has drawn growing attention from researchers in the music information retrieval community. In general, it considers subjects related to the processing of traditional forms of music, often with the goal of supporting musicological studies with computational means. Tools have been proposed that ease access to large digital collections of traditional music, for example by automatically detecting a specific kind of similarity between pieces or by automatically segmenting data into partitions that are either relevant or irrelevant for further investigation.
In this thesis, the focus lies on music of the Eastern Mediterranean, and specifically on traditional music of Greece and Turkey. At the beginning of the thesis work, a task was defined that directed the necessary research activities. The task was motivated by the geographical location of the author, the island of Crete in Greece, but in the course of the thesis it proved to have strong relevance for a much wider musical context: given a polyphonic recording of a piece of Cretan traditional dance music, find a recording that is similar to it. Musicological theory provided the way to approach this task.
The traditional music encountered in Greece, as well as in wide parts of the Balkans and Turkey, follows the logic of parataxis: pieces are constructed by aligning short musical phrases in time, without the larger structures present in classical or popular music. Thus, a system designed to cope with the above task must be able to estimate the similarity of such phrases. As we deal with polyphonic audio signals of music that has not been written to a score, at least not before the performance, some simplification is needed.
The exact transcription of the main melody from a polyphonic mixture into a score is still an unsolved problem, and the transcription of traditional music even by human experts is an extremely complex and difficult process. For these reasons, a system has been designed that considers aspects of rhythm, timbre, and melody when approaching the task.
The central aspect considered in this thesis is rhythm. A point of major interest is estimating the time instants within an audio signal at which a musical instrument starts playing a note. This estimation is referred to as onset detection, and it has been approached in this thesis using novel group-delay and fundamental-frequency based approaches, and with a fusion of these characteristics with a spectral amplitude criterion. Building on these findings in onset detection, improved beat trackers and rhythmic similarity estimation techniques are developed. The proposed beat tracker applies the group-delay based onset detection method within a state-of-the-art approach for beat tracking. Results show clear improvements when applying this method for beat tracking on a dataset of traditional music.
The rhythmic similarity estimation is based on the scale transform, which avoids the influence of tempo differences between the pieces of music being compared. On datasets containing Greek and Turkish traditional music, high accuracies are achieved in a classification task, and the validity of the proposed measure as a similarity measure is supported by the results of listening tests.
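A minimal numpy sketch of the tempo-invariance argument, assuming the scale transform is computed (up to interpolation details) as the Fourier transform of an exponentially resampled signal weighted by e^{u/2}; the test signals are synthetic stand-ins for onset-strength periodicity functions.

```python
# Illustrative scale-transform magnitude: resample onto a log-time grid,
# weight by exp(u/2), FFT; the magnitude is invariant to time scaling,
# i.e. to tempo changes of the underlying rhythm.
import numpy as np

def scale_transform_mag(f, t, n_points=512):
    u = np.linspace(np.log(t[1]), np.log(t[-1]), n_points)  # log-time axis
    g = np.interp(np.exp(u), t, f) * np.exp(u / 2.0)
    return np.abs(np.fft.rfft(g))

# Two "periodicity functions" differing only by tempo (a 25% time scale).
t = np.linspace(1e-3, 4.0, 4000)
a = np.exp(-t) * (1 + np.cos(2 * np.pi * 2.0 * t))
b = np.exp(-t / 1.25) * (1 + np.cos(2 * np.pi * 1.6 * t))  # b(t) = a(0.8 t)
print(np.corrcoef(scale_transform_mag(a, t), scale_transform_mag(b, t))[0, 1])
```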
Apart from rhythm, the aspect of instrumental timbre has also been addressed. A novel feature set based on Non-negative Matrix Factorization (NMF) is proposed to describe the characteristic spectral bases of a piece of music. These bases are modelled using statistical methods, and it is shown that these models describe the spectral space of musical genres and instrumental classes in a compact and discriminative way.
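A hedged sketch of the NMF step, using scikit-learn's factorization on a stand-in magnitude spectrogram; the number of components is an assumption for the sketch.

```python
# Illustrative NMF timbre features: factor a magnitude spectrogram
# V (freq x time) into spectral bases W and activations H, V ~= W @ H;
# the columns of W summarize the characteristic spectra of the piece.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((257, 400))                # stand-in magnitude spectrogram
model = NMF(n_components=8, init="nndsvd", max_iter=400, random_state=0)
W = model.fit_transform(V)                # (257, 8) spectral bases
H = model.components_                     # (8, 400) activations
print(W.shape, H.shape)
```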
Finally, melodic aspects have been considered as well, by combining state-of-the-art approaches for cover song detection in popular music with fundamental frequency detection from polyphonic signals. This combination is shown to tackle the central task of the thesis in a satisfying way on a small exemplary dataset. A morphological analysis framework that combines the aspects of rhythm, timbre, and melody is proposed, which can be used to detect similarities in traditional music.
For the development of the algorithms presented in this thesis, evaluation data had to be collected. This was a task of major difficulty, and much effort was made by the author to understand well the musical context investigated in this thesis. For many datasets, the ground truth was obtained in cooperation with local musicians in time-consuming but very informative interviews. The knowledge gained in these interviews and the resulting datasets are another important contribution of this thesis.
Yannis Pantazis,
Decomposition of AM-FM signals with applications in speech processing [PDF] - funded by ICS-FORTH/Orange Labs
Abstract: During the last decades, the sinusoidal model has gained a lot of popularity, since it is able to represent non-stationary signals very accurately. The estimation of the instantaneous components (i.e., instantaneous amplitude, instantaneous frequency, and instantaneous phase) is an active area of research. In this thesis, we develop and test models and algorithms for the estimation of the instantaneous components of the sinusoidal representation. Our goal is to reduce the estimation error due to the non-stationary character of the analyzed signals by taking advantage of time-domain information. Thus, we re-introduce a time-varying model referred to as the Quasi-Harmonic Model (QHM), which is able to adjust its frequency values toward the true frequencies. We further show that an iterative scheme based on QHM produces statistically efficient sinusoidal parameter estimates. Moreover, we extend QHM to the chirp QHM (cQHM), which captures linear evolution of the instantaneous frequency quite satisfactorily.
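In the notation of the QHM literature, the frame model and the frequency-mismatch correction can be sketched as follows (a sketch; symbols assumed):

```latex
% QHM frame model: complex amplitude a_k, complex slope b_k.
s(t) = \sum_{k=1}^{K} \left( a_k + t\, b_k \right) e^{\,j 2\pi \hat f_k t},
\qquad |t| \le T
% The complex slope carries an estimate of the mismatch between the
% assumed frequency \hat f_k and the true one, corrected iteratively:
\hat\eta_k = \frac{1}{2\pi}\,\frac{\mathrm{Im}\{a_k^{*} b_k\}}{|a_k|^{2}},
\qquad \hat f_k \leftarrow \hat f_k + \hat\eta_k
```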
However, neither QHM nor cQHM is able to represent highly non-stationary signals adequately. Thus, we further extend QHM to the adaptive QHM (aQHM), which uses time-domain frequency information. aQHM is able to adjust its non-parametric basis functions to the time-varying characteristics of the signal, which reduces the estimation error of the instantaneous components. Moreover, an adaptive AM-FM decomposition algorithm based on aQHM is proposed. Results on synthetic signals as well as on voiced speech show that aQHM greatly reduces the reconstruction error compared to QHM or the sinusoidal model of McAulay and Quatieri.
Concentrating on speech applications, we develop an analysis/synthesis speech system based on aQHM. Specifically, aQHM is used to represent the quasi-periodic part of speech, while the aperiodic part is modeled as time- and frequency-modulated noise. The resynthesized speech signal produced by the proposed system is indistinguishable from the original. Finally, another application where aQHM can be applied is the extraction of vocal tremor characteristics. Since vocal tremor is defined as modulations of the instantaneous components of speech, aQHM is an appropriate model for representing these modulations. Indeed, results show that the reconstructed signals are close to the original signals, which validates our method.
2007
Yannis Agiomyrgiannakis,
Sinusoidal Coding of Speech for Voice over IP [PDF] - funded by ICS-FORTH
Abstract: It is widely accepted that Voice-over-Internet-Protocol (VoIP) will dominate wireless and wireline voice communications in the near future. Traditionally, a minimum level of Quality-of-Service is achieved by careful traffic monitoring and network fine-tuning. However, this solution is not feasible when there is no possibility of controlling or monitoring the parameters of the network. For example, when speech traffic is routed through the Internet, packet losses increase due to network delays and the strict end-to-end delay requirements of voice communication. Most of today's speech codecs were not initially designed to cope with such conditions. One solution is to introduce channel coding at the expense of end-to-end delay. Another is to perform joint source/channel coding of speech by designing speech codecs that are natively robust to increased packet losses. This thesis proposes a framework for developing speech codecs that are robust to packet losses.
The thesis addresses the problem at two levels: at the basic source/channel coding level, where novel methods are proposed for introducing controlled redundancy into the bitstream, and at the signal representation/coding level, where a novel speech parameterization/model is presented that is amenable to efficient quantization using the proposed source coding methods. The speech codec is designed to facilitate high-quality Packet Loss Concealment (PLC). The speech signal is modeled with harmonically related sinusoids, a representation that enables fine time-frequency resolution, which is vital for high-quality PLC. Furthermore, each packet is encoded independently of the previous packets in order to avoid desynchronization between the encoder and the decoder upon a packet loss; this leaves some redundancy in the bit-stream.
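For reference, the harmonic model of a voiced frame can be written in its standard form (symbols assumed):

```latex
% Harmonic sinusoidal model: all components locked to the fundamental f_0.
s(t) = \sum_{k=1}^{K} A_k \cos\!\left( 2\pi k f_0 t + \phi_k \right)
```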
A number of contributions are made to well-known harmonic speech models. A fast analysis/synthesis method is proposed and used in the construction of an Analysis-by-Synthesis (AbS) pitch detector. Harmonic codecs tend to rely on phase models for the reconstruction of the harmonic phases, introducing artifacts that affect the quality of the reconstructed speech signal. For high-quality speech reconstruction, quantization of the phase is required. Unfortunately, phase quantization is not a trivial problem, because phases are circular variables. A novel phase-quantization algorithm is proposed to address this problem. Harmonic phases are properly aligned and modeled with a Wrapped Gaussian Mixture Model (WGMM) capable of handling parameters that belong to circular spaces. The WGMM is estimated with a suitable Expectation-Maximization (EM) algorithm. Phases are then quantized by extending the efficient GMM-based quantization techniques for linear spaces to WGMM and circular spaces. When packet losses increase, additional redundancy can be introduced using Multiple Description Coding (MDC). In MDC, each frame is encoded into two descriptions; receiving both descriptions provides a high-quality reconstruction, while receiving one description provides a lower-quality reconstruction. Current GMM-based MDC schemes make it possible to quantize the amplitudes of the harmonics, which represent an important portion of the information of the speech signal. A novel WGMM-based MDC scheme is proposed and used for MDC of the harmonic phases.
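For reference, a wrapped Gaussian mixture places, for each component, Gaussian probability mass wrapped onto the circle (a sketch of the standard definition; symbols assumed):

```latex
% Wrapped GMM density on the circle, \theta \in [-\pi, \pi):
p(\theta) = \sum_{m=1}^{M} w_m \sum_{l=-\infty}^{\infty}
\mathcal{N}\!\left(\theta + 2\pi l;\ \mu_m,\ \sigma_m^{2}\right),
\qquad \sum_{m=1}^{M} w_m = 1
```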
It is shown that it is possible to construct high-quality MDC codecs based on harmonic models. Furthermore, it is shown that the redundancy between the MDC descriptions can be used to "correct" bit errors that may have occurred during transmission. At the source coding level, a scheme for Multiple Description Transform Coding (MDTC) of multivariate Gaussians using Parseval frame expansions and a source coding technique referred to as Conditional Vector Quantization (CVQ) are proposed. The MDTC algorithm is extended to generic sources that can be modeled with a GMM. The proposed frame expansion facilitates a computationally efficient Optimal Consistent Reconstruction (OCR) algorithm and Cooperative Encoding (CE). In CE, the two MDTC encoders cooperate in order to provide better central/side distortion tradeoffs. The proposed scheme provides scalability, low complexity and storage requirements, excellent performance at low redundancies, and competitive performance at high redundancies. In CVQ, the focus is on correcting the most frequent types of error: single and double packet losses. Furthermore, CVQ finds application in Bandwidth Expansion (BWE), the extension of narrowband speech to wideband. Concluding, two proof-of-concept harmonic codecs are constructed: a single-description and a multiple-description codec. Both codecs are narrowband and variable-rate, similar in quality to the state-of-the-art iLBC (internet Low Bit-Rate Codec) under perfect channel conditions and better than iLBC when packet losses occur. The single-description codec requires 14 kbps and tolerates 20% packet loss with minimal quality degradation, while the multiple-description codec operates at 21 kbps and tolerates 40% packet loss without significant quality degradation.