Objective Speech Quality Measures for Internet Telephony

2. MEASURING SPEECH QUALITY

There are two broad classes of speech quality metrics: subjective and objective. Subjective measures involve humans listening to a live or recorded conversation and assigning a rating to it. This rating can be either a single overall quality rating or a rating of a particular characteristic (e.g., clarity or listening effort) or a particular distortion (e.g., clipping or hum). Because they use human subjects, subjective measures are often very accurate and useful for evaluating a telephony system.

The mean opinion score (MOS) is one such metric. Although the MOS is not the only subjective measure, it is one of the most widely used and recognized. ITU-T Recommendation P.830 [10] describes in detail how to conduct a subjective test experiment, but the procedure can be summed up as follows. A panel of subjects listens to a set of speech samples, assigning to each sample an overall quality score ranging from 1 (Bad) to 5 (Excellent). The average score of the panel for a given sample is that sample's MOS.

Clearly, a metric such as MOS that uses human subjects can be a good measure of perceived speech quality. However, subjective metrics have disadvantages, too. In particular, they can be time-consuming and expensive. Some researchers or organizations may not have the resources to conduct the tests. Certainly, such metrics cannot be used in any sort of real-time or online application.

These shortcomings, among other reasons, have led to the development of objective metrics. Such measures predict perceived speech quality, typically by computing the distortion between the original (clean) signal and a received (noisy) signal. Some algorithms use something other than the difference between the received and original signals, such as a quantitative measure of the distortion.
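The MOS procedure summarized above reduces to a per-sample average of the panel's ratings. The following is a minimal sketch of that step only; the function name `mean_opinion_score`, the sample names, and the ratings are hypothetical and are not taken from the paper or from P.830.

```python
def mean_opinion_score(ratings):
    """Average one sample's panel ratings on the 1 (Bad) to 5 (Excellent) scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating {r} is outside the 1-5 scale")
    return sum(ratings) / len(ratings)


# Hypothetical panel of five listeners rating two speech samples.
panel_ratings = {
    "sample_a": [4, 5, 4, 3, 4],
    "sample_b": [2, 1, 2, 3, 2],
}

# One MOS per sample: the mean of that sample's ratings.
mos = {name: mean_opinion_score(r) for name, r in panel_ratings.items()}
print(mos)
```

A real subjective test also controls listening conditions and sample presentation; the sketch covers only the final averaging.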
Typically, the accuracy or effectiveness of an objective metric is determined by its correlation, usually the Pearson (linear) correlation, with MOS scores for a set of data. If an objective metric has a high correlation with MOS, then it is deemed to be an effective measure of perceived speech quality, at least for speech data and transmission systems with the same characteristics as those in the experiment. Indeed, metrics that work well under some conditions are not necessarily good predictors of perceived voice quality under other conditions.

Our goal is to find a speech quality metric that accurately predicts human perception under conditions typical of VoIP systems. To do this, we compare three types of measures. The first type is perceptually weighted distortion measures, which include the enhanced modified Bark spectral distance (EMBSD) [11][12][13] and measuring normalizing blocks (MNB) [9][1][2][3] algorithms. The second uses the word-error rates that a continuous speech recognizer produces on the original and received signals to predict voice quality [4]. The third is the ITU E-model [6][7][8].

2.1. Perceptually Weighted Distortion Measures

Modern objective metrics use knowledge of the human auditory system to compute a perceptually weighted distortion measure. Distortions that are most significant to the human ear are weighted more heavily, while those that are inaudible or nearly so are weighted lightly or not at all. A number of algorithms exist in this class of measures. We chose the two best performers, according to the literature: measuring normalizing blocks (MNB), which is found in Appendix A of ITU-T Recommendation P.861, and enhanced modified Bark spectral distance (EMBSD).

The MNB algorithm comprises two stages: a simple perceptual transformation, and a distance measure that uses hierarchies of measuring normalizing blocks.
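Returning briefly to the evaluation criterion described earlier in this section, the Pearson correlation between an objective metric and MOS can be sketched as follows. This is an illustration under stated assumptions: the function name `pearson` and both data lists are hypothetical and are not results from the paper.

```python
import math


def pearson(x, y):
    """Pearson (linear) correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length, at least 2 items each")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: MOS for five test conditions and the distortion
# that some objective measure reported for the same conditions.
mos_scores = [4.2, 3.8, 3.1, 2.5, 1.9]
distortion = [0.5, 0.9, 1.6, 2.4, 3.0]

r = pearson(distortion, mos_scores)
print(f"Pearson r = {r:.3f}")
```

Because a distortion measure rises as perceived quality falls, a useful one would be expected to give a strongly negative coefficient against MOS; it is the magnitude of the correlation that indicates how well the metric tracks listener opinion.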
For perceptual transformation, the time-aligned, normalized signals, original and received, are divided into 50% overlapping frames of 128 samples. Each frame is multiplied by a Hamming window and transformed using the fast Fourier transform (FFT). Only the squared magnitudes of the FFT coefficients are preserved. The coefficients are transformed to the Bark scale, a psychoacoustic frequency scale where $b = 6 \sinh^{-1}(f/600)$, with f in Hz, defines the transformation. This is accomplished by grouping the squared FFT coefficients into bins of equal width on the Bark scale. The total energy of each frame is computed, and frames below an energy threshold in either the original or received signals are discarded. All samples in remaining frames are transformed using a logarithm to model perceived loudness.

The distance measure used is a linear combination of the distances computed in the time and frequency MNBs. There is one frequency MNB (FMNB) for each power spectrum coefficient. A frequency MNB averages the difference at that coefficient between the original and received signals across all frames that exceed the above-mentioned energy thresholds. Four measurements covering the lower and upper band edges of telephone band speech are saved in measurement vector m. There are two different time MNB (TMNB) structures using different frequency scales,
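The perceptual transformation stage described above can be sketched in a few dozen lines. This is an illustrative sketch, not the P.861 reference implementation: the 8 kHz sampling rate, the 1-Bark bin width, the fixed energy floor, the use of 10 log10 as the logarithm, and all function names are assumptions that the text does not specify. A naive DFT from the standard library stands in for the FFT to keep the example dependency-free.

```python
import cmath
import math

FRAME_LEN = 128        # frame length, from the text
HOP = FRAME_LEN // 2   # 50% overlap, from the text
SAMPLE_RATE = 8000     # assumption: telephone-band sampling rate in Hz
BARK_BIN_WIDTH = 1.0   # assumption: width of each bin on the Bark scale
ENERGY_FLOOR = 1e-4    # assumption: placeholder frame-energy threshold

HAMMING = [0.54 - 0.46 * math.cos(2 * math.pi * n / (FRAME_LEN - 1))
           for n in range(FRAME_LEN)]


def hz_to_bark(f):
    """Bark mapping given in the text: b = 6 * asinh(f / 600)."""
    return 6.0 * math.asinh(f / 600.0)


def power_spectrum(frame):
    """Squared DFT magnitudes for bins 0..N/2 (naive DFT, adequate for N = 128)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def bark_bands(power):
    """Group squared DFT coefficients into equal-width bins on the Bark scale."""
    n_bins = int(hz_to_bark(SAMPLE_RATE / 2) / BARK_BIN_WIDTH) + 1
    bands = [0.0] * n_bins
    for k, p in enumerate(power):
        freq = k * SAMPLE_RATE / FRAME_LEN
        bands[int(hz_to_bark(freq) / BARK_BIN_WIDTH)] += p
    return bands


def perceptual_transform(original, received):
    """Return (original, received) log Bark spectra for each retained frame."""
    kept = []
    n = min(len(original), len(received))
    for start in range(0, n - FRAME_LEN + 1, HOP):
        pair = []
        for signal in (original, received):
            frame = [signal[start + i] * HAMMING[i] for i in range(FRAME_LEN)]
            pair.append(bark_bands(power_spectrum(frame)))
        # Discard the frame if either signal is below the energy threshold.
        if min(sum(b) for b in pair) < ENERGY_FLOOR:
            continue
        # Log compression as a loudness model; the small offset avoids log(0).
        kept.append(tuple([10 * math.log10(v + 1e-12) for v in b] for b in pair))
    return kept


# Hypothetical usage: a 440 Hz tone and an attenuated copy of it.
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(1024)]
frames = perceptual_transform(tone, [0.5 * s for s in tone])
print(len(frames), "frames kept,", len(frames[0][0]), "Bark bands per frame")
```

The signals are assumed to be time-aligned and normalized already, as the text requires; the sketch stops before the MNB distance stage.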

Copyright 2005-2010 Cliconnect.com. All Rights Reserved