Objective Speech Quality Measures for Internet Telephony
Timothy A. Hall
National Institute of Standards and Technology
100 Bureau Drive, STOP 8920
Gaithersburg, MD 20899-8920
ABSTRACT
Measuring voice quality for telephony is not a new problem. However, packet-switched, best-effort networks such as
the Internet present significant new challenges for the delivery of real-time voice traffic. Unlike the circuit-switched
public switched telephone network (PSTN), Internet protocol (IP) networks guarantee neither sufficient bandwidth
for the voice traffic nor a constant, acceptable delay. Dropped packets and varying delays introduce distortions
not found in traditional telephony. In addition, if a low bitrate codec is used in voice over IP (VoIP) to achieve a
high compression ratio, the original waveform can be significantly distorted. These new potential sources of signal
distortion present significant challenges for objectively measuring speech quality. Measurement techniques designed
for the PSTN may not perform well in VoIP environments.
Our objective is to find a speech quality metric that accurately predicts subjective human perception under
the conditions present in VoIP systems. To do this, we compared three types of measures: perceptually weighted
distortion measures such as enhanced modified Bark spectral distance (EMBSD) and measuring normalizing blocks
(MNB), word-error rates of continuous speech recognizers, and the ITU E-model. We tested the performance of
these measures under conditions typical of a VoIP system. We found that the E-model had the highest correlation
with mean opinion scores (MOS). The E-model is well-suited for online monitoring because it does not require the
original (undistorted) signal to compute its quality metric and because it is computationally simple.
Keywords: speech quality, Internet telephony, voice over IP, network metrology
1. INTRODUCTION
In recent years, there has been growing interest in using the Internet and other Internet protocol (IP) networks
for telephony. Motivations such as reduced cost, simplification of infrastructure through network convergence, and
the opportunity to provide new and programmable services have driven this interest. However, success of Internet
telephony depends upon the reliable delivery of good voice quality, and speech quality metrics are needed for designing,
building, and maintaining such VoIP systems. While the problem of measuring speech quality of telephony systems
is not new, the characteristics of VoIP systems are different in many respects from those of the existing PSTN.
Best-effort IP networks present significant new challenges to the delivery of real-time voice traffic. Whereas the
circuit-switched PSTN guarantees that sufficient bandwidth is reserved and available for the duration of the call, IP
networks, in general, do not. Delay is not guaranteed to be either minimal or constant in an IP network. In addition,
dropped packets and packet delay variation, or jitter, introduce distortions not found in traditional telephony. Low
bitrate (high compression ratio) codecs used to reduce required bandwidth distort the original waveform significantly
before it is even transmitted. The compressed speech produced by such codecs is also more sensitive to packet loss.
These and other characteristics of VoIP make delivery of toll quality speech challenging. These same characteristics
make measuring the speech quality difficult as well. Most existing objective speech quality measures have been
developed for high bit-rate, error-free telephony environments and do not accurately predict subjective voice quality
in the presence of the significant impairments introduced by VoIP systems. In this paper, we evaluate several
objective speech measures to determine their effectiveness in predicting human perception of speech quality in VoIP
networks. We also discuss the suitability of the algorithms for implementation in an online monitoring environment
capable of providing speech quality measures in real time.
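As a rough illustration of how an objective measure can be scored against human perception, the sketch below computes the Pearson correlation coefficient between a measure's per-condition outputs and the corresponding mean opinion scores. This is a minimal sketch under stated assumptions: the choice of Pearson correlation, the function name `pearson_correlation`, and the sample values are illustrative only and are not taken from the experiments reported in this paper.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient between two sequences.

    xs: objective measure outputs, one per test condition (hypothetical).
    ys: mean opinion scores for the same conditions (hypothetical).
    """
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    if len(xs) < 2:
        raise ValueError("at least two points are required")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squared deviations and of cross deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")

    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up values for illustration only; not data from this paper.
    objective_scores = [82.0, 74.5, 61.0, 48.5, 35.0]
    mos_values = [4.1, 3.8, 3.2, 2.6, 1.9]
    print(pearson_correlation(objective_scores, mos_values))
```

For a distance-type measure, in which larger values indicate more distortion, a good predictor would show a correlation close to -1 rather than +1, so the magnitude of the coefficient is what should be compared across measures.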
The paper is organized as follows. We first give a general background on speech quality measurement, along
with brief descriptions of the algorithms we evaluated. Second, we describe the two experiments we conducted to
evaluate them, including the data sets used and the distortions introduced. Third, we present the results of the two
experiments, and, finally, we discuss implications of the results.