2. MEASURING SPEECH QUALITY
There are two broad classes of speech quality metrics: subjective and objective. Subjective measures involve humans
listening to a live or recorded conversation and assigning a rating to it. This rating can be either a single overall
quality rating or a rating of a particular characteristic (e.g., clarity or listening effort) or a particular distortion
(e.g., clipping, hum). Because they use human subjects, subjective measures are often very accurate and useful for
evaluating a telephony system. The mean opinion score (MOS) is one such useful metric. Although the MOS is not
the only subjective measure, it is one of the most widely used and recognized. ITU-T Recommendation P.830[10]
describes in detail how to conduct a subjective test experiment, but the procedure can be summed up as follows. A
panel of subjects listens to a set of speech samples, assigning to each sample an overall quality score ranging from 1
(Bad) to 5 (Excellent). The average score of the panel for a given sample is that sample's MOS.
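The averaging step can be illustrated with a minimal sketch. The panel ratings below are hypothetical values invented for illustration, and the function name `mean_opinion_score` is our own; only the 1 to 5 scale and the use of the panel average come from the procedure described above.

```python
def mean_opinion_score(ratings):
    """Return the MOS of one speech sample: the arithmetic mean of the panel's ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        # Scores run from 1 (Bad) to 5 (Excellent).
        if not 1 <= r <= 5:
            raise ValueError(f"rating {r!r} is outside the 1-5 scale")
    return sum(ratings) / len(ratings)


# Hypothetical panel of six listeners rating two speech samples.
panel_ratings = {
    "sample_a": [4, 5, 4, 3, 4, 4],
    "sample_b": [2, 1, 2, 3, 2, 2],
}
mos = {name: mean_opinion_score(scores) for name, scores in panel_ratings.items()}
```

Each sample receives its own MOS, so a test with many samples yields one averaged score per sample rather than a single number for the whole experiment.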
Clearly, a metric such as MOS that uses human subjects can be a good measure of perceived speech quality.
However, subjective metrics have disadvantages, too. In particular, they can be time-consuming and expensive.
Some researchers or organizations may not have the resources to conduct the tests. Certainly, such metrics cannot
be used in any sort of real-time or online application. These shortcomings, among other reasons, have led to the
development of objective metrics. Such measures predict perceived speech quality based typically on a computation
of distortion between the original (clean) signal and a received (noisy) signal. In some algorithms, something other
than the difference between the received and original signals is used, such as a quantitative measure of the distortion.
Typically, the accuracy or effectiveness of an objective metric is determined by its correlation, usually the Pearson
(linear) correlation, with MOS scores for a set of data. If an objective metric has a high correlation with MOS, then
it is deemed to be an effective measure of perceived speech quality, at least for speech data and transmission systems
with the same characteristics as those in the experiment. Indeed, metrics that work well under some conditions are
not necessarily good predictors of perceived voice quality under other conditions.
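As a rough illustration of this evaluation step, the sketch below computes the Pearson correlation between an objective metric's outputs and the corresponding MOS values. The function name and the data are hypothetical and are not results from any experiment. Note that a distortion-type metric is expected to correlate negatively with MOS (more distortion, lower quality), so it is the magnitude of the correlation that indicates predictive power.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson (linear) correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one objective distortion value and one MOS per test condition.
objective_distortion = [0.5, 1.1, 1.9, 2.8, 4.0]
subjective_mos = [4.4, 3.9, 3.1, 2.5, 1.6]

r = pearson_correlation(objective_distortion, subjective_mos)
predictive_strength = abs(r)  # sign only reflects distortion vs. quality polarity
```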
Our goal is to find a speech quality metric that accurately predicts human perception under conditions typical
of VoIP systems. To do this, we compare three types of measures. The first type comprises perceptually weighted distortion
measures, which include the enhanced modified Bark spectral distance (EMBSD) [11][12][13] and measuring normalizing
blocks (MNB) [9][1][2][3] algorithms. The second uses the word-error rates that a continuous speech
recognizer produces for the original and received signals to predict voice quality[4]. The third is the ITU E-model[6][7][8].
2.1. Perceptually Weighted Distortion Measures
Modern objective metrics use knowledge of the human auditory system to compute a perceptually weighted distortion
measure. Distortions that are most significant to the human ear are weighted more heavily while those that are
inaudible or nearly so are weighted lightly or not at all. A number of algorithms exist in this class of measures. We
chose the two best performers, according to the literature: measuring normalizing blocks (MNB), which is found in
Appendix A of ITU-T Recommendation P.861, and enhanced modified Bark spectral distance (EMBSD).
The MNB algorithm comprises two stages: a simple perceptual transformation, and a distance measure that uses
hierarchies of measuring normalizing blocks. For perceptual transformation, the time-aligned, normalized signals,
original and received, are divided into 50% overlapping frames of 128 samples. Each frame is multiplied by a
Hamming window and transformed using the fast Fourier transform (FFT). Only the squared magnitudes of the
FFT coefficients are preserved. The coefficients are transformed to the Bark scale, a psychoacoustic frequency scale
where
$$ b = 6 \sinh^{-1}\left(\frac{f}{600}\right) $$
defines the transformation. This is accomplished by grouping the squared FFT coefficients into bins of equal width
on the Bark scale. The total energy of each frame is computed, and frames below an energy threshold in either
the original or received signals are discarded. All samples in remaining frames are transformed using a logarithm to
model perceived loudness.
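The perceptual transformation described above can be sketched as follows. This is an illustrative outline, not the P.861 reference implementation: the frame length, 50% overlap, Hamming window, squared FFT magnitudes, Bark mapping, energy gating, and logarithm come from the text, while the 8 kHz sampling rate, the 1-Bark bin width, the threshold value, the base-10 logarithm with a small floor, and all function names are assumptions made for the example. A naive DFT stands in for the FFT so that the sketch needs only the standard library.

```python
import cmath
import math

FRAME_LEN = 128          # samples per frame (from the text)
HOP = FRAME_LEN // 2     # 50% overlap (from the text)
SAMPLE_RATE = 8000.0     # Hz; assumed telephone-band sampling rate
BARK_BIN_WIDTH = 1.0     # assumed bin width on the Bark scale
ENERGY_THRESHOLD = 1e-3  # assumed, illustrative frame-energy threshold
LOG_FLOOR = 1e-12        # assumed floor so that log10(0) is never evaluated


def hz_to_bark(f):
    """Bark-scale mapping b = 6 * asinh(f / 600), with f in Hz."""
    return 6.0 * math.asinh(f / 600.0)


def split_frames(signal):
    """Divide a signal into 50% overlapping frames of FRAME_LEN samples."""
    last_start = len(signal) - FRAME_LEN
    return [signal[i:i + FRAME_LEN] for i in range(0, last_start + 1, HOP)]


def power_spectrum(frame):
    """Hamming-window the frame and return squared DFT magnitudes for bins 0..N/2."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)))
                for k, x in enumerate(frame)]
    spectrum = []
    for m in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * m * k / n)
                  for k, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)  # phase is discarded
    return spectrum


def bark_bins(power):
    """Group squared DFT coefficients into bins of equal width on the Bark scale."""
    n_bins = int(hz_to_bark(SAMPLE_RATE / 2) / BARK_BIN_WIDTH) + 1
    bins = [0.0] * n_bins
    for m, p in enumerate(power):
        freq = m * SAMPLE_RATE / FRAME_LEN
        bins[int(hz_to_bark(freq) / BARK_BIN_WIDTH)] += p
    return bins


def perceptual_transform(original, received):
    """Return log Bark spectra for frame pairs that pass the energy threshold."""
    kept_original, kept_received = [], []
    for fo, fr in zip(split_frames(original), split_frames(received)):
        bo = bark_bins(power_spectrum(fo))
        br = bark_bins(power_spectrum(fr))
        # Discard the pair if EITHER signal's frame energy is below threshold.
        if sum(bo) < ENERGY_THRESHOLD or sum(br) < ENERGY_THRESHOLD:
            continue
        kept_original.append([math.log10(max(v, LOG_FLOOR)) for v in bo])
        kept_received.append([math.log10(max(v, LOG_FLOOR)) for v in br])
    return kept_original, kept_received
```

The two signals are assumed to be time-aligned and normalized before this step, as the text requires. Frames are gated in pairs so that the original and received spectra stay index-aligned for the distance computation that follows.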
The distance measure used is a linear combination of the distances computed in the time and frequency MNBs.
There is one frequency MNB (FMNB) for each power spectrum coefficient. A frequency MNB averages the difference
at that coefficient between the original and received signals across all frames that exceed the above-mentioned energy
thresholds. Four measurements covering the lower and upper band edges of telephone band speech are saved in
measurement vector m. There are two different time MNB (TMNB) structures using different frequency scales,