Intelligibility and Localization of Speech
from Virtual Directions
GILBERT L. RICARD¹ and SUSAN L. MEIRS, Grumman Aerospace Corporation, Bethpage,
New York
We studied the intelligibility of speech and the ability of listeners to localize speech when synthetic direction information was added to the signal. Masked thresholds for single synthesized words were measured for simulated, horizontal angular separations of speech in the presence of a masking noise. Speech recognition was measured using the Modified Rhyme Test; the masker was broad-band white noise. A single set of head-related transfer functions conditioned the speech and noise waveforms presented over headphones, and listeners were free to move their heads relative to the apparent directions of these signals. The masker was fixed at 0 deg, and when speech was presented from other azimuths, its intelligibility typically increased by 4 to 5 dB. Individuals' masked thresholds were variable yet repeatable, and the plot of threshold by azimuth seemed unique to each subject. In a separate test, the same listeners were asked to estimate the azimuth of items from the rhyme test. Localization was accurate and performance was similar to previous work, except that confusions of source locations in the front and rear hemispheres were evenly distributed across azimuth.
Much of the current interest in virtual reality has been stimulated by technologies for human-computer interaction that are "natural" in the sense that they present information in ways to which human beings are accustomed. One such technology is the appropriate real-time filtering of a monaural signal to produce artificial direction cues (Wenzel, 1992). Communication systems of modern aircraft typically carry two types of signals: one is speech from a variety of sources, and the other is a number of warning tones. Sorkin, Wightman, Kistler, and Elvers (1989) showed an advantage of adding movement-linked, directional information to warning signals. In addition, Perrott, Saberi, Brown, and Strybel (1990) and Perrott, Sadralodabai, Saberi, and Strybel (1991) showed that such auditory direction cuing reduces the time needed to locate a visual target. One use for directional filtering, then, is to add information about source location to an auditory display (information that was not there before), but another would be to increase the detectability of signals such as speech. Thus maximizing the accuracy of direction estimates as well as the intelligibility of communications may both become design goals for systems that employ head-related transfer functions (HRTFs).

¹ Requests for reprints should be sent to Gilbert Ricard, Mail Stop K01-14, Grumman Aircraft Systems, Bethpage, NY 11714.

© 1994, Human Factors and Ergonomics Society. All rights reserved.

March 1994-121
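The directional filtering mentioned above, convolving a one-channel signal with a left-ear and right-ear head-related impulse response (HRIR), can be sketched as follows. This is an illustrative sketch only: the impulse responses below are synthetic placeholders with a crude interaural delay and level difference, not measured HRTFs such as the 256-point responses used in this study.

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Spatialize a monaural signal by convolving it with the pair of
    head-related impulse responses measured for one source azimuth."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])  # shape (2, len(mono) + len(hrir) - 1)

# Synthetic placeholder HRIRs (decaying noise); a real display would use
# responses measured at each azimuth to be simulated.
rng = np.random.default_rng(0)
hrir_l = rng.normal(size=256) * np.exp(-np.arange(256) / 32.0)
hrir_r = 0.7 * np.roll(hrir_l, 8)  # crude interaural delay and level cue
mono = rng.normal(size=1000)
stereo = render_binaural(mono, hrir_l, hrir_r)
print(stereo.shape)  # -> (2, 1255)
```

A head-tracked display such as the Convolvotron repeats this convolution in real time, switching impulse responses as the head moves relative to the simulated source.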
Empirical work with artificial direction
cues has, to date, been concerned with validation. After they had measured each subject's HRTF, Wightman and Kistler (1989)
had their listeners localize bursts of noise
presented either free-field or over headphones but filtered with each subject's own
set of transfer functions. Their judgments of
azimuth were quite accurate, and there were
no obvious differences between judgments
made with free-field signals and those made
with the synthesized ones. Judgments of elevation, however, tended to be more variable
with the synthesized signals.
Wenzel, Wightman, and Kistler (1991) and Begault and Wenzel (1991) extended Wightman and Kistler's work to untrained subjects listening with nonindividualized HRTFs. Both studies found variability attributed to individual differences (presumably in how listeners used interaural cues when listening with HRTFs obtained from someone else), and both found the distribution of front/back errors to be asymmetric around the interaural axis.
Because subjects have shown variability in
their localization of signals conditioned by
nonindividualized HRTFs, we decided to see
if similar variability characterizes the intelligibility of speech presented from synthesized azimuths. In part this was to measure
the gain of speech intelligibility provided by
directional filtering; we also wanted to see if
anomalies of localization covary with differences of intelligibility when both are measured within the same subject. We measured
masked thresholds for speech presented from
different simulated directions, as well as subjects' ability to estimate those directions.
This was done using a single set of transfer
functions to add directional information to
the speech, and these signals were presented
over headphones so that our listeners were
free to move relative to the location of the
simulated sources.
Three students and an engineer, whose ages
ranged from 20 to 50 years, served as listeners. Pure-tone audiograms were obtained
from each using a Grason-Stadler Model 40
audiometer. Three subjects displayed normal
sensitivity (within 20 dB of audiometric zero)
and one (GR) showed a moderate high-frequency loss in the 6- and 8-kHz bands.
We measured speech intelligibility with the
Modified Rhyme Test (MRT). A personal computer selected the correct items, changed the
order of rows in a test and the order of items
in a row, passed the correct item to a DECtalk
(Model 2.0) speech synthesizer for presentation to the listeners, displayed the set of possible responses on a computer display, and
scored the listeners' choices. The synthesized
words were added to a continuous white
noise that was band-limited 0 to 5 kHz with a
roll-off of 96 dB per octave and set to a spectrum level of 40 dB SPL. We measured the
speech level by extending the duration of all
of the vowel sounds made by the DECtalk and
then measuring their level in the headset. Averaged across vowels and across subjects, the
masked threshold for the MRT presented at
an azimuth of 0 deg was 58 dB(A).
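The masker described here, white noise low-passed at 5 kHz with a 96 dB/octave roll-off, can be approximated in software with a 16th-order Butterworth filter, since each Butterworth order contributes an asymptotic 6 dB/octave (16 × 6 = 96). A minimal sketch; the sample rate is an assumption, and no calibration to a 40 dB SPL spectrum level is attempted:

```python
import numpy as np
from scipy.signal import butter, sosfilt

fs = 44100  # sample rate in Hz (assumed; not stated in the text)
rng = np.random.default_rng(1)
white = rng.normal(size=fs)  # one second of broadband white noise

# 16th-order Butterworth low-pass: asymptotic roll-off of 16 * 6 = 96
# dB/octave above the 5 kHz cutoff, matching the masker described above.
sos = butter(16, 5000, btype="low", fs=fs, output="sos")
masker = sosfilt(sos, white)
```

Second-order sections (`output="sos"`) keep such a high-order filter numerically stable; a direct transfer-function form of order 16 would not be.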
The speech and noise waveforms were led
to separate "channels" of a Convolvotron
(Crystal River Engineering), where they were
filtered according to azimuth, mixed, and led
to Sennheiser (HD 540 Series 1) stereophonic
headphones. Head position was measured by
a Polhemus Navigation Sciences 3SPACE
Isotrak magnetic tracker. The magnetic
source was mounted facing the subject at ear
level, and its sensor was mounted on the
headband of the subject's headset. Signal and
noise levels were controlled through a mix of
external attenuators and software gains in
the convolution machine. All listening was
done in an Eckel Industries sound isolation
room (Model 665 SA), where the level of background noise was 38 dB(A). We used the high-resolution (256-point) HRTFs of SDO, one of
the listeners for Wightman and Kistler, to
condition the MRT phrases for both the recognition and localization tasks.
Our synthesized locations of speech formed a circle, centered at the listener, on the horizontal plane that includes the interaural axis. During listening these locations
remained fixed, but their azimuths and elevations would change with rotation of the
head. Data are reported at the azimuth of the
source when the head was boresighted to the
magnetic source. Distance to the simulated
source is determined by the source-to-listener
distance when the transfer functions were
measured; in this case, Wightman and Kistler's data were collected at a speaker distance
of about 1.4 m.
For both the intelligibility and localization
tests, the listeners were seated in the sound
isolation room facing a keyboard and computer terminal. These tests were conducted
sequentially, but as we used the same subjects, equipment, and speech signals, they are
described as a single experiment.
Our MRT was automated. During the intelligibility testing, subjects started a series of
50 trials with a keypress and then entered
their choice for each trial via the same keyboard. These responses were numbers between 1 and 6, and they could be changed
before they were "entered" with the return
key. Once a response was recorded, the next
trial started. The MRT uses the lead phrase, "Item number (n) is ____," to introduce each item. After this lead phrase and the test word
were presented, subjects had unlimited time
to make a selection. Once a choice was entered, the next set of response items was displayed and the lead phrase followed within
300 ms.
During the MRT, the interference was kept
at an azimuth of 0 deg (relative to the headtracker), and the synthesized speech appeared about the subject in a random sequence of azimuths in increments of 30 deg.
Testing at a given azimuth consisted of two
presentations of the MRT for each point on a
psychometric function; that is, each point was based on 100 individual trials. Typically
two points per function were measured. All
data for a given azimuth were collected before a new azimuth was tested. At the end of
each set of 50 trials, the percentage correct
was given as feedback, but there was no indication of whether or not a choice was correct
on individual trials. Thresholds for masking
of the speech were linearly extrapolated as
the signal level that produced 70% correct
performance. Typically, one threshold was
measured per day. Performance on the MRT
stabilizes quickly. Several testing sessions
were used for training, until thresholds could be replicated to within 2 dB; data collection then began.
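The extrapolation step, fitting percent correct against signal level and solving for the 70%-correct point, can be sketched as follows; the two data points are hypothetical, not values from the paper:

```python
import numpy as np

def threshold_70(levels_db, pct_correct, criterion=70.0):
    """Fit a line to percent correct vs. signal level (dB) and solve for
    the level that yields the criterion performance. With two points per
    psychometric function, as in the text, this is linear interpolation
    or extrapolation between them."""
    slope, intercept = np.polyfit(levels_db, pct_correct, 1)
    return (criterion - intercept) / slope

# Hypothetical function: two points, 100 trials each, slope 2.5%/dB.
levels = np.array([52.0, 58.0])   # signal levels, dB
scores = np.array([56.0, 71.0])   # percent correct
print(round(threshold_70(levels, scores), 2))  # -> 57.6
```

The hypothetical slope of 2.5% correct per dB is within the range the authors report for individual psychometric functions (0.7% to 4.8%, averaging 2.3%).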
During the localization tests, signal azimuth was changed between the presentation
of each MRT item, and so subjects had to be
notified when the equipment was ready for
the next presentation. They then used the
keyboard to initiate the next trial. Judgments
of azimuth were written on a data sheet, and
subjects were allowed unlimited time to respond. This is essentially the same localization technique used in the other studies that
employed nonindividualized HRTFs. Signals
were filtered to appear at increments of 15
deg and, as with the MRT, both the lead phrase and the test item were directionally filtered.
A set of 24 MRT items (one at each azimuth) presented in random order constituted
a test run. The localization data consisted of
12 such tests, usually given two per day. As
for the MRT, feedback for the localizations
was the percentage correct at the end of a run
of 24 trials. During the test a graphic compass
rose with azimuths labeled in increments of
15 deg was mounted next to the subjects'
computer display to aid them in correctly assigning a direction to the apparent location.
No masking noise was present during the localization test, and the signal level was set at
65 dB(A).
Masked Thresholds
Across subjects, the average change of threshold signal-to-noise ratio varied from less than 1 dB to over 5 dB. Individual thresholds
were more variable; they ranged from -5.5
to 10.5 dB, depending on the azimuth of the
signal. Within a subject, the average gain was
more stable. One subject generated several
negative threshold signal-to-noise ratios for
speech located in the rear hemisphere, so her
average threshold shift was only 1.1 dB, but
the other three showed average gains of 4.0,
4.6, and 5.0 dB for measurements averaged
over signal azimuths other than 0 deg. Because individuals showed different changes of
threshold with signal azimuth, we present individual polar plots of the change of threshold signal-to-noise ratio as a function of signal direction.
In Figure 1, all thresholds are expressed relative to the threshold signal-to-noise ratio for speech presented at an azimuth of 0 deg, plotted on the periphery of the figure. Relative to the outer circle, the inner circle represents a gain of intelligibility of 10 dB.
Occasionally we selected thresholds for replication, and when we did, the new measurements invariably came within a decibel of the original values.
These intelligibility thresholds were measured from psychometric functions with an average slope of 2.3% correct per dB of signal-to-noise ratio. Slopes for individual functions varied between 0.7% correct per dB and 4.8% correct per dB, and the average slopes of our four subjects were 2.7%, 2.3%, 2.2%, and 1.9% correct per dB. Across azimuths the slopes of these psychometric functions were constant, F(11,36) = 1.12, p > 0.05.
Because listeners are occasionally unable
to determine correctly whether or not a simulated source is located in the front or rear
hemisphere, it has become common to correct hemispheric errors of source location. In
studies of free-field localization, typically the
rate of such errors has been low, and the majority of them occur with sources near the
midline. The motivation for correcting front/back confusions has been to prevent measures of localization sensitivity from being inflated by large errors; but because these confusions do occur, we analyzed both corrected and uncorrected data. Oldfield and
Parker (1984) described two kinds of localization errors that were distinguished by whether or not the perceived location varied with
the source's distance from the interaural axis.
One of these, their "defaults-to-90 deg," arose
largely from sources elevated 30 deg or more
above the horizontal plane. Because our simulated locations varied only in azimuth, we
have taken a conservative approach and have
defined only front/back hemispheric confusions. For correction, we used the same rule as a number of other recent studies: if the unsigned error of an azimuth judgment could be made smaller by reflecting the perceived signal location around the interaural axis, the smaller error was recorded as a "corrected" datum and the number of front/back confusions was incremented by one. Such a rule can be expected to produce a bias toward an underestimation of localization error near the interaural axis, along with a related increase in the number of confusions, but these tendencies were not apparent in our results.

Figure 1. Changes of speech threshold relative to the masked threshold of speech presented at an azimuth of 0 deg. Relative to the outer circle, the inner circle is -10 dB.
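The correction rule can be stated compactly in code. A sketch, assuming signed azimuths in degrees with 0 deg straight ahead and the interaural axis at ±90 deg; the example judgment is hypothetical:

```python
def correct_front_back(judged_az, true_az):
    """Apply the front/back correction rule described in the text: if
    reflecting the judged azimuth about the interaural axis (az -> 180 - az
    in signed degrees) reduces the unsigned error, record the reflected
    judgment and count one front/back confusion."""
    def wrap(a):
        return (a + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)
    reflected = wrap(180.0 - judged_az)
    error = abs(wrap(judged_az - true_az))
    reflected_error = abs(wrap(reflected - true_az))
    if reflected_error < error:
        return reflected, True   # corrected judgment; confusion counted
    return judged_az, False

# Hypothetical trial: a source at 30 deg judged to be at 140 deg (rear).
print(correct_front_back(140.0, 30.0))  # -> (40.0, True)
```

Applied to the one left/right case the authors report, a signal at 165 deg judged to be at -15 deg, the rule reflects the judgment to -165 deg and still counts a front/back confusion.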
Figure 2. Individual judgments of azimuth and average front/back confusions plotted within a hemisphere. Judgments of locations in the left hemisphere are "folded" over to the right hemisphere.

Figure 2 presents individual plots of our subjects' azimuth estimates averaged across blocks of trials. Makous and Middlebrooks
(1990) showed the systematic variation of
front-back errors with stimulus azimuth by
first separating them from the localization
judgments and then plotting both the errors
and the remaining judgments along the same
hemispheric axis.
Figure 2 presents a similar plot. The judgments were pooled by "folding" signal azimuths from -15 deg to -165 deg over to the
right hemisphere, and the combined data
were used for the figure. Individual plots
were made of subjects' estimates of signal azimuth, but average judgments and their standard deviations are shown for the front/back
errors. Linear functions were fitted to these
data (before they were folded) and accounted
for over 99% of the variance of each subject's
azimuth judgments. This sort of linear relation has become the expectation for subjects
using nonindividualized
HRTFs when the
sources are located on the horizontal plane
that contains the interaural axis. The y-axis
intercepts of these functions can be a measure
of rotational bias (the extent to which the perceived simulated auditory world is aligned
with the real one), and these intercepts were -2.4, -28.4, 1.39, and -1.7 deg.
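The fits described above can be reproduced with an ordinary least-squares line; the listener below is hypothetical, with judgments uniformly rotated by -5 deg:

```python
import numpy as np

def azimuth_fit(true_az, judged_az):
    """Fit judged azimuth as a linear function of true azimuth. The
    y-intercept estimates rotational bias: how far the perceived
    auditory world is rotated relative to the real one."""
    slope, intercept = np.polyfit(true_az, judged_az, 1)
    return slope, intercept

true_az = np.arange(-165.0, 181.0, 15.0)  # 24 azimuths in 15-deg steps
judged = true_az - 5.0                    # hypothetical -5 deg rotation
slope, bias = azimuth_fit(true_az, judged)
print(round(slope, 3), round(bias, 3))    # -> 1.0 -5.0
```

A slope near 1 with a small intercept, as three of the four listeners showed, indicates judgments aligned with the simulated auditory world.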
The standard deviations of the corrected azimuth judgments did not differ across azimuths, F(23,72) = 1.44, p > 0.11; their average was 15.77 deg. The average standard deviation for the pooled front/back confusions was 16.2 deg, but because these errors are
largest near the midline, uncorrected variability shows maxima at 0 deg and 180 deg.
Our subjects displayed a total of 211 front/back confusions, for an overall error rate of
18.5%. Within a block of 24 judgments, the
range of individual rates of reversal could be
large; we observed rates from as low as 0% to
occasionally as high as 50%. In addition, individuals' error rates were stable yet different; the average rates of front/back reversals for our listeners were 28.4%, 21.3%, 5.2%, and 45.0%, and these were significantly different, χ²(3) = 69.02, p < 0.01.
In previous studies in which subjects used nonindividualized HRTFs, the majority of the front/back confusions came from the front hemisphere. About half (50.7%, or 107) of our errors occurred when the simulated source was between -75 deg and +75 deg. We have no indication that front/back confusions arise selectively from certain azimuths; their distribution was flat, χ²(23) = 17.80, p > 0.30. Finally, only once was an azimuth judgment scored as a front/back confusion when it involved, in addition, a confusion of left and right hemispheres. This occurred for a signal presented at 165 deg that was judged to be from -15 deg.
and the Modeling of Intelligibility
Bronkhorst and Plomp (1988) measured
masked thresholds for speech when the signal
and masker were separated in azimuth, and
they observed a symmetric release from
masking that was maximal at 90 deg and
minimal along the midline. They always presented speech from the front, and the masking noise was located at different azimuths.
As the noise was moved to 90 deg, the increase of speech intelligibility reached 10 dB,
but by 180 deg the speech signal-to-noise ratio was again 0; the level of the masked speech at 180 deg was not different from the baseline condition of signal and masker at 0 deg.
Their subjects listened to signals recorded
from the left and right ears of a dummy head,
and directional cues were created by the
placement of speakers in an anechoic environment. The results of the present study
were essentially similar to the earlier work of
Bronkhorst and Plomp. Our thresholds were
too variable between subjects for such a maximum to be evident, but our average gain was
half that of their study. Their interest was in
the masking of speech by sources off-axis to
the speaker (as in a cocktail party), so they
used complete sentences for the intelligibility
measurement and a masker filtered to have
the long-term spectrum of speech. In addition, they employed a high criterion for performance by requiring that subjects repeat
the entire sentence correctly for it to be
counted as a correct response.
Tonning (1971) measured the intelligibility
of speech at 0 deg with a masker at either
+/- 90 deg or 180 deg, and showed a pattern
much like the results of Bronkhorst and
Plomp. His maximum gain at 90 deg, however, was only 6.5 dB, more like the range of our measurements. In addition, Tonning used a wide-band masker and an intelligibility test whose items were single, unrelated words.
Similar results were obtained by Dirks and
Wilson (1969), who measured masked thresholds for signals located at +/- 90 deg and
180 deg when the masker was presented from
an azimuth of 0 deg. Their threshold for
speech located at -90 deg showed a gain of
about 6 dB relative to when the signal and
masker were colocated at 0 deg.
Last, an anomaly seen in our results is the
wide range of individuals' thresholds (over 13
dB) for signals presented from an azimuth of
180 deg. This was twice the range of thresholds seen at any other azimuth, and at an azimuth where one would expect variability to be like that seen for signals presented at 0 deg. For signals presented at 0 deg, the range of our subjects'
thresholds was 4 dB.
Localization, Individual Differences, and HRTF Technology
Clearly, subjects can accurately judge the
direction of signals with simulated location
information, especially when only differences
of azimuth are present. The linearity of the
estimates, as well as their uniformity, is well
within the range of previous measurements.
Confusions of front and back present a difficulty for those attempting to apply directional sound technology. Our rate of confusions was high, but it is in the range of other
localization studies that used HRTFs. Wightman and Kistler (1989) showed an increase of
the rate of confusions from 5.6% to 11% when
subjects went from judging free-field signals
to using their own HRTFs. In the studies using nonindividualized HRTFs, Wenzel et al.'s
(1991) subjects showed a larger increase, going from 19% with free-field signals to 31%
using HRTFs, and Begault and Wenzel (1991)
noted that over 60% of their subjects' judgments made at azimuths of 0 and -30 deg
were reversals. Their overall rate, though,
also appeared to be about 30%.
Finally, SDO, the contributor of the HRTFs used in these studies, showed front/back error rates of 4% for free-field signals and 11% using her own HRTFs. The rate of front/back confusions, as well as the fact that their magnitude is greatest around the midline, creates the challenge for an applied technology of directional cuing. Until front/back confusions can be controlled, it may be better if virtual auditory displays used direction information as a redundant cue.
Although they were free to do so, our listeners did not move their heads much during
presentation of the MRT items, and we feel
this was a consequence of the nature of the
MRT. Recognition improved if listeners knew
the possible responses before a test item was
presented, and reading the response set kept
our subjects oriented toward the computer
display and magnetic tracker. Perhaps a recognition test with reduced visual requirements would have allowed subjects to experiment more with different head positions, but
these same requirements make the MRT a
good analog of the operational conditions for
which directional cuing is being developed.
Head-up displays enable pilots to look at the
world ahead of them during critical phases of
flight, and there is good reason to believe that
under these conditions, the initial recognition
of auditory direction will be accomplished
with minimal movements of the head.
We would like to thank Gregory Moyer and Russell
Singer for acting as listeners and David Musicant and
Cheo Teng for programming and assistance.
Begault, D. R., and Wenzel, E. M. (1991). Headphone localization of speech stimuli. In Proceedings of the Human Factors Society 35th Annual Meeting (pp. 82-86). Santa Monica, CA: Human Factors and Ergonomics Society.
Bronkhorst, A. W., and Plomp, R. (1988). The effect of head-induced interaural time and level differences on speech intelligibility in noise. Journal of the Acoustical Society of America, 83, 1508-1516.
Dirks, D. D., and Wilson, R. H. (1969). The effect of spatially separated sound sources on speech intelligibility. Journal of Speech and Hearing Research, 12, 5-38.
Makous, J. C., and Middlebrooks, J. C. (1990). Two-dimensional sound localization by human listeners. Journal of the Acoustical Society of America, 87, 2188-2200.
Oldfield, S. R., and Parker, S. P. A. (1984). Sound localisation: A topography of auditory space. I. Normal hearing conditions. Perception, 13, 581-600.
Perrott, D. R., Saberi, K., Brown, K., and Strybel, T. Z. (1990). Auditory psychomotor coordination and visual search behavior. Perception and Psychophysics.
Perrott, D. R., Sadralodabai, T., Saberi, K., and Strybel, T. Z. (1991). Aurally aided visual search in the central visual field: Effects of visual load and visual enhancement of the target. Human Factors, 33, 389-400.
Sorkin, R. D., Wightman, F. L., Kistler, D. S., and Elvers, G. C. (1989). An exploratory study of the use of movement-correlated cues in an auditory head-up display. Human Factors, 31, 161-166.
Tonning, F. M. (1971). Directional audiometry: II. The influence of azimuth on the perception of speech. Acta Otolaryngologica, 72, 352-357.
Wenzel, E. M. (1992). Localization in virtual acoustic displays. Presence: Teleoperators and Virtual Environments.
Wenzel, E. M., Wightman, F. L., and Kistler, D. J. (1991). Localization with non-individualized virtual display cues. In Proceedings of CHI '91, ACM Conference on Computer-Human Interaction (pp. 351-395). New York: Association for Computing Machinery.
Wightman, F. L., and Kistler, D. J. (1989). Headphone simulation of free-field listening: II. Psychophysical validation. Journal of the Acoustical Society of America, 85.
Date received: November 20, 1992
Date accepted: June 25, 1993