Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2007006253
PROBLEM TO BE SOLVED: To accurately estimate the arrival time difference of the direct sound even under the influence of reflected sound, to improve the accuracy of speaker direction detection, and to cope with fluctuation of the pitch frequency by simple processing. SOLUTION: A signal processing device 5 of the present invention comprises: a sound source direction detection unit 6 that detects voice components in all directions of the speaker direction from the voice components in each axial direction of a microphone unit 1, in which microphone arrays 2 and 3, each provided with a plurality of microphones in an array, are arranged on at least two axes; and a voice detection unit 7 that detects the arrival direction of the voice based on the omnidirectional voice components detected by the sound source direction detection unit 6. [Selected figure] Figure 1
Signal processing apparatus, microphone system, speaker direction detection method, and
speaker direction detection program
[0001]
The present invention relates to a signal processing device for detecting the direction of a speaker serving as a sound source, a microphone system, a speaker direction detection method, and a speaker direction detection program.
[0002]
FIG. 7 shows the basic principle of conventional speaker direction detection.
04-05-2019
1
In FIG. 7, a microphone array 71 consists of two or more omnidirectional microphones q−2, q−1, q, q+1, q+2, …, whose sound reception signals are denoted xq−2(t), xq−1(t), xq(t), xq+1(t), xq+2(t), respectively. When the speaker 72 speaks toward the microphone array 71, the direct sound S(t) reaches the microphone array 71 from the speaker 72 at an angle θ, and the primary reflected sound αS(t−τ), reflected from the wall 73, reaches the microphone array 71 at an angle θ′. The received signal of the microphone q at the center position of the microphone array 71 is therefore the sum of the direct sound S(t) and the primary reflected sound αS(t−τ). Noise generated independently at the microphone q, nondirectional noise, and reverberation that reaches the microphone q after being reflected several times are omitted here, because their influence relative to the direct sound S(t) is small.
[0003]
[0004]
Similarly, as shown in Equation 2, the sound reception signal of the microphone q+1 adjacent to the microphone q at the center position of the microphone array 71 is the sum of the direct sound S(t−τd) and the primary reflected sound αS(t−τ−τd′).
[0005]
[0006]
Here, τd and τd′ are the inter-microphone arrival time differences relative to the microphone q when the direct sound S(t−τd) and the primary reflected sound αS(t−τ−τd′) reach the microphone q+1 at the angles θ and θ′, respectively; α is the attenuation rate due to reflection, and τ is the delay time difference between the direct sound and the primary reflection.
When there is only the direct sound and no reflected sound, the arrival time difference τd between microphones at spacing d is uniquely determined by the angle θ, as shown in Equation 3, where c denotes the speed of sound.
[0007]
[0008]
Therefore, if the arrival time difference τd due to the direct sound can be estimated from the time differences of the sound signals of the plurality of microphones, the arrival direction θ of the sound can be obtained.
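Equation 3 itself is not reproduced in this machine translation; under the usual far-field (plane-wave) assumption it is τd = d·sinθ/c. The sketch below inverts that relation to recover θ from an estimated τd (the function name and sample values are illustrative, not from the patent):

```python
import numpy as np

C = 343.0  # assumed speed of sound c in air [m/s]

def arrival_angle(tau_d, d, c=C):
    """Invert the far-field relation tau_d = d*sin(theta)/c (Equation 3)
    to get the arrival direction theta [degrees] from the inter-microphone
    arrival time difference tau_d [s] and the microphone spacing d [m]."""
    s = c * tau_d / d
    # clip guards against |s| slightly exceeding 1 due to estimation noise
    return np.degrees(np.arcsin(np.clip(s, -1.0, 1.0)))

# Example: spacing d = 5 cm; delay corresponding to a 30 degree arrival
d = 0.05
tau = d * np.sin(np.radians(30.0)) / C
```

With this spacing, a 30° arrival corresponds to a delay of roughly 73 µs, which is why the time difference must be estimated carefully in the presence of reflections.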
Further, there has been a technique of obtaining a covariance matrix from the sound reception signals between microphones of a microphone array and multiplying it by a phase rotation vector for each estimation direction to specify the speaker direction (see Patent Document 1). There is also a technique for detecting the speaker direction based on a signal-to-noise ratio, taking noise and reflections into consideration (Japanese Patent Application Laid-Open No. 2003-147118). There is also a technique for estimating the arrival direction of vowel utterances using the harmonic structure of speech (see Non-Patent Document 1). [Patent Document 1] Japanese Patent Application Laid-Open No. 2005-62096 [Patent Document 2] Japanese Patent Application Laid-Open No. 2004-12151 [Non-Patent Document 1] "Signal estimation of speech in an environment where reflected sound exists"
[0009]
However, if there is a reflected sound in which the direct sound is reflected from the wall 73 or the like, the delay relative to the direct sound and the reflection direction are also included in the received signal, so there is a disadvantage that the arrival time difference due to the direct sound cannot be accurately estimated. Further, in the technology described in Patent Document 1, the covariance matrix is obtained from the sound reception signals between microphones of the microphone array, and the phase rotation vector is multiplied for each estimation direction to specify the speaker direction; since a received signal including reflected sound is not assumed, the arrival time difference τd due to the direct sound cannot be accurately estimated, as described above.
[0010]
Further, in the technique described in Patent Document 2, the speaker direction is detected based on the signal-to-noise ratio, taking noise and reflected sound into consideration, but no feature amount for discriminating voice from non-voice is used. Therefore, although the direction of a noise source can also be detected, complicated processing such as calculation of the signal-to-noise ratio is required. Further, in the technique described in Non-Patent Document 1, the arrival direction of vowel utterances is estimated using the harmonic structure of speech, but because a fixed pitch frequency is assumed, the technique cannot cope with the short-term fluctuation of the pitch frequency found in ordinary speech, and the direction detection accuracy deteriorates.
[0011]
Therefore, it is an object of the present invention to provide a signal processing apparatus that can accurately estimate the arrival time difference of the direct sound even under the influence of reflected sound, thereby improving the accuracy of speaker direction detection, and that can cope with fluctuation of the pitch frequency by simple processing, as well as a microphone system using this signal processing apparatus, a speaker direction detection method, and a speaker direction detection program.
[0012]
In order to solve the above problems and achieve the object of the present invention, a signal processing device according to the present invention comprises: a sound source direction detection unit for detecting voice components in all directions of the speaker direction from the voice components in each axial direction of a microphone unit in which microphone arrays, each provided with a plurality of microphones in an array, are arranged on at least two axes; and a voice detection unit for detecting the arrival direction of the voice based on the voice components in all directions of the speaker direction detected by the sound source direction detection unit.
[0013]
According to the signal processing device of the present invention, the sound source direction detection unit calculates the voice component for each estimated direction of each microphone array from the plurality of microphone arrays arranged on the axes of at least two directions, and the voice detection unit detects the speaker direction in all directions.
At this time, for example, when the voice component of a certain microphone array cannot estimate the angle accurately because the angular resolution becomes coarse in certain angular directions, the sound source direction detection unit also uses the estimated angle of the microphone array of another axis.
Further, for example, when detecting a voice component, attention is paid to the harmonic structure of the voice component: when there is an effective harmonic component and it arrives from a specific direction, it is determined to be voice.
[0014]
The microphone system according to the present invention comprises: a microphone unit in which microphone arrays, each provided with a plurality of microphones in an array, are arranged on at least two axes; and a signal processing device comprising a sound source direction detection unit for detecting voice components in all directions of the speaker direction from the voice components in each axial direction of the microphone unit, and a voice detection unit for detecting the arrival direction of the voice based on the omnidirectional voice components of the speaker direction detected by the sound source direction detection unit.
[0015]
According to the microphone system of the present invention, for example, the sound source direction detection unit of the signal processing device calculates the voice component for each estimation direction of each microphone array, using the microphone unit in which the plurality of microphone arrays are arranged so as to intersect at a central point, and the voice detection unit detects the speaker direction in all directions by combining these components for each direction.
At this time, for example, when the voice component of a certain microphone array cannot accurately estimate the angle in certain angular directions, the sound source direction detection unit also uses the estimated angle of the microphone array of another axis.
[0016]
Further, the speaker direction detection method according to the present invention comprises: a step of converting the voice components in the speaker direction in each axial direction, obtained from a microphone unit in which microphone arrays each provided with a plurality of microphones in an array are arranged on at least two axes, into frequency components; a step of averaging the cross-correlations of correlated frequency components in each axial direction of the microphone unit; a step of estimating the voice components in all directions of the speaker direction from the voice components in each axial direction obtained using the correlation component average of each axial direction; and a step of detecting the arrival direction of the voice based on the detected omnidirectional voice components of the speaker direction.
[0017]
According to the speaker direction detection method of the present invention, the influence of the reflected sound is suppressed by averaging the correlations of the voice components between adjacent microphones.
In addition, the voice components for each estimated direction of each microphone array are calculated from the microphone unit in which the plurality of microphone arrays are arranged on at least two axes, and are synthesized for each direction over all directions, whereby the speaker direction can be detected.
[0018]
Further, a speaker direction detection program according to the present invention causes a computer to function as: means for converting the voice components in the speaker direction in each axial direction, obtained from a microphone unit in which microphone arrays each provided with a plurality of microphones in an array are arranged on at least two axes, into frequency components; means for averaging the cross-correlations of correlated frequency components in each axial direction of the microphone unit; means for estimating the voice components in all directions of the speaker direction from the voice components in each axial direction obtained using the correlation component average; and means for detecting the arrival direction of the voice based on the detected omnidirectional voice components of the speaker direction.
[0019]
According to the speaker direction detection program of the present invention, the computer that controls the process of detecting the speaker direction functions to suppress the influence of the reflected sound by averaging the correlations of the voice components between adjacent microphones.
In addition, the computer calculates voice components for each estimated direction of each microphone array from the microphone unit in which the plurality of microphone arrays are arranged on at least two axes, and functions to detect the speaker direction in all directions by combining these components for each direction.
[0020]
According to the present invention, since the influence of the reflected sound can be suppressed by averaging the correlations of the voice components between adjacent microphones, the accuracy of speaker direction detection can be improved. In addition, only the simple processing of averaging correlated voice components is required, and the fluctuation of the pitch frequency can be coped with by the averaging processing of band frequency components.
[0021]
Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing a speaker direction detection system according to an embodiment of the present invention. The speaker direction detection system shown in FIG. 1 comprises a microphone unit 1 in which microphone arrays 2 and 3, provided with a plurality of microphones 2-1, 2-2, 2-3, 2-4, 2-5, 3-1, 3-2, 3-3, 3-4 in array form, are arranged on axes of at least two directions.
[0022]
Here, the microphone unit 1 in which the vertical direction microphone array 2 and the horizontal direction microphone array 3 intersect at the central position is shown as an example, but the arrays may also be arranged not in the horizontal and vertical directions but in other intermediate directions, or in any other directions that are not parallel to each other. Moreover, as long as the arrays lie on a plane, they do not necessarily have to be in two directions, and may be arranged in multiaxial directions of three or more axes. When a microphone array is arranged in another direction, the same signal processing as described later for the horizontal and vertical axis directions may be performed for that axial direction. Also, the number of microphones used in one microphone array may be three or more.
[0023]
Further, the speaker direction detection system shown in FIG. 1 includes the signal processing device 5, which performs the speaker direction detection processing on the voice signals from the microphone unit 1. The signal processing device 5 comprises the sound source direction detection unit 6, which detects the voice components in all directions of the speaker direction from the voice components in each axial direction of the microphone unit 1, and the voice detection unit 7, which detects the arrival direction of the voice based on the omnidirectional voice components of the speaker direction.
[0024]
According to the speaker direction detection system (FIG. 1) configured as described above, the microphone unit 1 arranges the vertical direction microphone array 2 and the horizontal direction microphone array 3 so as to intersect at the microphone 2-3 at the center point. Therefore, in the plane space formed by the vertical direction microphone array 2 and the horizontal direction microphone array 3, the direct sound s(t, θ) and the primary reflected sound s′(t′, θ′) from the speaker 4 arrive at different times and arrival angles, and they are input as a summed sound x(t).
[0025]
The sound source direction detection unit 6 of the signal processing device 5 calculates the power P(φ, t) of the voice component for each estimation direction from the voice components x<LR,FB>q±i,j(t) of the microphone arrays 2 and 3.
Then, the voice detection unit 7 detects the speaker direction θ<−>(t) in all directions from the power P(φ, t) of the voice component for each estimated direction, which is the output of the sound source direction detection unit 6. At this time, for example, when the voice component of the microphone array 2 (or the microphone array 3) cannot accurately estimate the angle in certain angular directions, the sound source direction detection unit 6 also uses the estimated angle of the microphone array 3 (or the microphone array 2).
[0026]
Here, the signal processing device 5 may configure the sound source direction detection unit 6 and the voice detection unit 7 as separate or integrated computers for signal processing, and cause each unit to function by a dedicated speaker direction detection program as described later. Also, the microphone unit 1 and the signal processing device 5 may be configured separately or integrally.
[0027]
FIG. 2 is a block diagram showing the configuration of the sound source direction detection unit.
As in FIG. 1, the nine omnidirectional microphones 2-1, 2-2, 2-3, 2-4, 2-5, 3-1, 3-2, 3-3, 3-4 constitute the microphone arrays 2 and 3, arranged in a cross shape at spacing d, and the arrival direction is the voice arrival direction θ with respect to a certain reference direction, for example, the horizontal direction. The sound reception signals of the individual microphones at time t are denoted x<FB>q−2(t), x<FB>q−1(t), x<FB>q(t), x<FB>q+1(t), x<FB>q+2(t) for the microphone array 2 in the vertical direction.
[0028]
Also, the sound reception signals of the horizontal direction microphone array 3 are denoted x<LR>q−2(t), x<LR>q−1(t), x<LR>q(t), x<LR>q+1(t), x<LR>q+2(t); x<LR>q(t) and x<FB>q(t) are the same signal. From these time-domain microphone signal sequences, the sound source direction detection unit 6 shown in FIG. 2 calculates the power P(φ, t) of the voice component for each scanning direction φ, and based on this, the voice detection unit 7 estimates the voice arrival direction θ<−>(t).
[0029]
The detailed configuration and operation of the sound source direction detection unit 6 will be described below. First, the nine microphone input signal sequences described above are converted into digital signals by an A/D converter (not shown), and a window function corresponding to a processing unit is applied to divide the digital signals into fixed intervals. Then, frequency spectrum analysis is performed by the short time Fourier transform unit 11 to obtain, for each microphone, the frequency spectra of the vertical array in the frequency domain: X<FB>q−2(ω), X<FB>q−1(ω), X<FB>q(ω), X<FB>q+1(ω), and X<FB>q+2(ω).
[0030]
Similarly, the frequency spectra X<LR>q−2(ω), X<LR>q−1(ω), X<LR>q(ω), X<LR>q+1(ω), X<LR>q+2(ω) of the horizontal array in the frequency domain are obtained.
[0031]
In the following, the horizontal frequency components and the vertical frequency components are described together, since the same processing is performed independently with the same configuration.
In the cross power spectrum units 12 and 22, for example, the cross power spectrum Gq,q+1(ω), which represents the cross-correlation in the frequency domain between the microphone q at the center position of the microphone arrays 2 and 3 and the adjacent microphone q+1, is calculated by Equation 4, where * denotes the complex conjugate.
[0032]
[0033]
For example, in the microphone arrays 2 and 3, configured as nine microphones arranged in a cross, four sets of cross power spectra can be obtained for each axis.
Specifically, the cross power spectra in the horizontal direction are G<LR>q−2,q−1(ω), G<LR>q−1,q(ω), G<LR>q,q+1(ω), G<LR>q+1,q+2(ω), and the cross power spectra in the vertical direction are G<FB>q−2,q−1(ω), G<FB>q−1,q(ω), G<FB>q,q+1(ω), G<FB>q+1,q+2(ω).
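The image of Equation 4 is not included in this translation, but from the text it is the standard frequency-domain cross-correlation Gq,q+1(ω) = Xq(ω)·X*q+1(ω). A minimal numpy sketch (the window choice and test signal are assumptions, not from the patent):

```python
import numpy as np

def cross_power_spectrum(x_q, x_q1):
    """Cross power spectrum G_{q,q+1}(w) = X_q(w) * conj(X_{q+1}(w))
    (Equation 4), computed per frequency bin from one windowed frame."""
    win = np.hanning(len(x_q))
    X_q = np.fft.rfft(x_q * win)
    X_q1 = np.fft.rfft(x_q1 * win)
    return X_q * np.conj(X_q1)

# A pure tone delayed between the two microphones shows up as a
# phase of omega*tau in the cross power spectrum at the tone's bin.
fs, n = 16000, 1024
t = np.arange(n) / fs
x0 = np.sin(2 * np.pi * 500 * t)
x1 = np.sin(2 * np.pi * 500 * (t - 2 / fs))  # 2-sample inter-mic delay
G = cross_power_spectrum(x0, x1)
```

The phase slope of G across frequency encodes the inter-microphone delay, which is exactly what the later processing stages average and evaluate per direction.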
[0034]
Next, the pitch extraction unit 19 estimates the pitch frequency ω0 from the power spectrum |Xq(ω)|<2> of the microphone q at the center position of the microphone arrays 2 and 3, configured as the nine cross-arranged microphones. The pitch frequency is estimated using a known estimation method (see, for example, IPSJ SIG Technical Report 99-MUS-31-16, "Melody and Bass Pitch Estimation for Real-World Music Acoustic Signals").
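The cited estimation method is not detailed in this translation; purely as a stand-in, the following crude harmonic-sum search illustrates the idea of choosing the ω0 whose harmonics carry the most spectral power (the search range, step, and test signal are assumptions):

```python
import numpy as np

def estimate_pitch(x, fs, f_lo=80.0, f_hi=400.0, n_harm=5):
    """Crude pitch estimate: pick the candidate f0 whose first n_harm
    harmonics carry the most power in |X(w)|^2.  A stand-in sketch only;
    the patent defers to a published estimation method."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    candidates = np.arange(f_lo, f_hi, 1.0)

    def harmonic_power(f0):
        idx = [np.argmin(np.abs(freqs - k * f0)) for k in range(1, n_harm + 1)]
        return spec[idx].sum()

    return candidates[np.argmax([harmonic_power(f) for f in candidates])]

fs, n = 16000, 4096
t = np.arange(n) / fs
# synthetic voiced frame: 150 Hz fundamental plus decaying harmonics
x = sum(np.sin(2 * np.pi * 150 * k * t) / k for k in range(1, 5))
```

Any estimator that returns a reliable ω0 per frame can feed the frequency vector units described next.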
[0035]
The frequency vector units 13 and 23 use the pitch frequency ω0 estimated by the pitch extraction unit 19 and the frequencies iω0 (i = 1, …, N) at which the spectrum power is maximum around the integer-multiple frequencies, and calculate the frequency vector Gq,q+1(ω0) of the cross power spectrum by Equation 5, where T denotes transposition.
[0036]
Here, N is an integer such that Nω0 ≤ πc/d, and c is the speed of sound.
[0037]
Specifically, the horizontal frequency vectors G<LR>q−2,q−1(ω0), G<LR>q−1,q(ω0), G<LR>q,q+1(ω0), G<LR>q+1,q+2(ω0) and the vertical frequency vectors G<FB>q−2,q−1(ω0), G<FB>q−1,q(ω0), G<FB>q,q+1(ω0), G<FB>q+1,q+2(ω0) are obtained.
[0038]
The inter-microphone average processing units 14 and 24 average, by Equation 6, the (Q−1) frequency vectors obtained for the Q microphones on each axis, to determine the inter-microphone average frequency vector G<LR,FB>SP(ω0).
[0039]
[0040]
FIG. 3 shows an example of the cross power spectrum obtained in this way.
In FIG. 3, the inter-microphone average frequency vector GSP is divided into a plurality of bands centered on iω0 (i = 1, …, N).
[0041]
The band averaging processing units 15 and 25 take a weighted average, according to Equation 7, of the frequency components within each band of the narrow band component group whose center frequencies are the pitch frequency ω0 and its integral multiples, as shown in FIG. 3.
[0042]
[0043]
FIG. 4 is a diagram showing the weighted average of in-band frequency components.
In FIG. 4, in the band centered on ω0, assuming for example a weighting bandwidth R = 5, the components adjacent to the central frequency component, namely ω0−5Δω, ω0−4Δω, ω0−3Δω, ω0−2Δω, ω0−Δω, ω0, ω0+Δω, ω0+2Δω, ω0+3Δω, ω0+4Δω, ω0+5Δω, are weighted with the weighting coefficient δr so as to fall within the averaging range.
Δω is the interval of frequency components in the discrete Fourier transform.
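Equation 7 and the coefficients δr are only named in this translation; the sketch below assumes a triangular weighting over the R bins on either side of a harmonic's center bin, which shows how energy that drifts within the band is still captured:

```python
import numpy as np

def band_weighted_average(G, center_bin, R=5, delta=None):
    """Weighted average of the R bins on each side of a harmonic's center
    bin (sketch of Equation 7).  The triangular weights delta_r are an
    assumption; the patent only names a weighting coefficient delta_r."""
    if delta is None:
        delta = 1.0 - np.abs(np.arange(-R, R + 1)) / (R + 1)  # triangular
    idx = np.arange(center_bin - R, center_bin + R + 1)
    w = delta / delta.sum()
    return np.sum(w * G[idx])

G = np.zeros(64, dtype=complex)
G[20] = 1.0 + 0.0j   # harmonic energy exactly at the center bin
G[22] = 1.0 + 0.0j   # the same energy drifted two bins away
```

Because the drifted component at bin 22 still falls inside the averaging window around bin 20, its contribution survives even when the pitch fluctuates, which is the effect paragraph [0044] describes.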
[0044]
As a result, even if the pitch frequency fluctuates within the above-mentioned band, the power of the voice direction vector, described later, can still be detected: even when the center frequency component shifts to another frequency component within the band, the power of the voice direction vector can be secured from the other frequency components without the peak fluctuating.
[0045]
The harmonic selection units 16 and 26 obtain, by Equation 8, the phase difference of the band-averaged cross power spectrum for the N harmonic components m that are candidates for selection, at the pitch frequency ω0 and its integral-multiple frequency components.
[0046]
[0047]
This phase difference φ(mω0) is converted into the time difference T(m) = φ(mω0)/mω0, which corresponds to the direct sound arrival time difference τd, and M harmonic components m are selected in ascending order of the squared error |Tave − T(m)|<2> with respect to the average value Tave.
Thereby, based on the phase difference of each frequency component between microphones, harmonic components can be selected for virtually arranging each frequency component in the space of the voice arrival direction.
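This selection step can be sketched as follows (the values are synthetic; the images of Equations 8 and 9 are not reproduced in the translation):

```python
import numpy as np

def select_harmonics(phase_diffs, omega0, M):
    """Convert per-harmonic phase differences phi(m*w0) into arrival time
    estimates T(m) = phi(m*w0)/(m*w0) and keep the M harmonics whose T(m)
    lies closest to the mean Tave (smallest squared error)."""
    m = np.arange(1, len(phase_diffs) + 1)
    T = np.asarray(phase_diffs) / (m * omega0)
    T_ave = T.mean()
    order = np.argsort((T - T_ave) ** 2)  # ascending squared error
    return np.sort(order[:M]) + 1         # 1-based harmonic indices

# Harmonics consistent with a single delay tau, plus one outlier
tau, omega0 = 1e-4, 2 * np.pi * 150
phases = np.array([m * omega0 * tau for m in range(1, 6)])
phases[2] *= 3.0  # corrupt the 3rd harmonic, e.g. by a reflection
```

Harmonics whose implied delay disagrees with the consensus (here the corrupted third harmonic) are dropped, so reflected-sound contamination at individual harmonics is excluded from the direction estimate.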
[0048]
The covariance matrix forming units 17 and 27 construct the covariance matrix R(ω0) by Equation 10, using the cross power spectra obtained by band-averaging the frequencies of the M harmonic components m selected by the harmonic selection units 16 and 26.
The covariance matrix R(ω0) is a virtual arrangement, in the space of the voice arrival direction, of the frequency components of the M harmonic components m of the band-averaged cross power spectrum vector.
[0049]
[0050]
[0051]
The vertical direction estimation unit 18 and the horizontal direction estimation unit 28 calculate the voice power for the direction φ from the covariance matrices R<LR>(ω0) and R<FB>(ω0) of the respective microphone array axes in the vertical and horizontal directions, using, for example, the known sound source estimation method MUSIC.
At this time, each element of the covariance matrix R(ω0) is normalized by its magnitude as shown in Equation 11, and the matrix is decomposed into eigenvectors V(ω0).
[0052]
[0053]
Then, for the obtained eigenvectors Vm, the power P(φ) of the voice in the direction φ is determined by Equation 12, in which T(φ) is the delay time between microphones with respect to the direction φ.
[0054]
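The images of Equations 11 and 12 are not reproduced in this translation; the following sketch shows the textbook MUSIC pseudo-spectrum for one linear array, using the inter-microphone delay T(φ) named in the text (the element normalization of Equation 11 is omitted, and the spacing, frequency, and source direction are assumed test values):

```python
import numpy as np

C, D = 343.0, 0.05  # assumed speed of sound [m/s] and mic spacing [m]

def music_power(R, omega, n_src, phis_deg):
    """MUSIC pseudo-spectrum P(phi) = 1 / ||En^H a(phi)||^2, where En spans
    the noise subspace of R and a(phi) is the steering vector of a uniform
    linear array with inter-microphone delay D*sin(phi)/C."""
    n_mics = R.shape[0]
    _, V = np.linalg.eigh(R)         # eigenvalues in ascending order
    En = V[:, : n_mics - n_src]      # noise-subspace eigenvectors
    q = np.arange(n_mics)
    P = []
    for phi in np.radians(phis_deg):
        a = np.exp(-1j * omega * q * D * np.sin(phi) / C)
        P.append(1.0 / (np.linalg.norm(En.conj().T @ a) ** 2 + 1e-12))
    return np.array(P)

# Rank-one covariance for a single source at 30 degrees plus a noise floor
omega = 2 * np.pi * 1000.0
q = np.arange(5)
a_true = np.exp(-1j * omega * q * D * np.sin(np.radians(30.0)) / C)
R = np.outer(a_true, a_true.conj()) + 0.01 * np.eye(5)
phis = [0.0, 15.0, 30.0, 45.0, 60.0]
P = music_power(R, omega, 1, phis)
```

The pseudo-spectrum peaks sharply where the steering vector is orthogonal to the noise subspace, i.e. at the source direction, which is what the per-axis estimation units 18 and 28 output as P(φ).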
[0055]
The omnidirectional estimation unit 21 synthesizes the voice direction powers P<LR>MUSIC(φ) and P<FB>MUSIC(φ) obtained for the horizontal and vertical direction axes with respect to the direction φ.
As the combining method, for example, as shown in Equation 13 for the direction φ, the smaller of the powers on the horizontal and vertical direction axes is taken as the combined value PMUSIC(φ).
This removes the influence of the spatial aliasing components that appear symmetrically in the calculation of the direction components on each direction axis.
[0056]
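A minimal sketch of this per-direction minimum combining (Equation 13); the example values are illustrative, not from the patent:

```python
import numpy as np

def combine_axes(P_lr, P_fb):
    """Equation 13 sketch: per direction phi, take the smaller of the two
    axis powers.  This suppresses the mirror (spatial-aliasing) lobe that
    a single linear array produces symmetrically about its own axis."""
    return np.minimum(P_lr, P_fb)

# A single linear array cannot tell a direction from its mirror image;
# the array on the other axis resolves the ambiguity.
phis   = np.array([45.0, 135.0])
P_lr   = np.array([10.0, 10.0])   # horizontal axis: ambiguous equal peaks
P_fb   = np.array([10.0, 1.0])    # vertical axis: only the true peak
P_comb = combine_axes(P_lr, P_fb)
```

Because a ghost lobe appears on only one axis at a time, the minimum keeps the true peak and attenuates the mirror, which is why the crossed-array layout yields an unambiguous omnidirectional estimate.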
[0057]
FIG. 5 shows the result of omnidirectional estimation of the combined value PMUSIC(φ) by the omnidirectional estimation unit 21 when speech is input from the voice arrival direction θ = 45°, taking the microphone array 2 as the reference.
It can be seen from FIG. 5 that the combined value PMUSIC(φ) is at the maximum level in the θ = 45° direction.
[0058]
The voice detection unit 7 shown in FIG. 1 can estimate the voice direction θ from the direction φ at which the level of the power value P(φ, t) estimated by the sound source direction detection unit 6 becomes maximum, as shown in the equation below.
[0059]
[0060]
However, as shown in FIG. 6, even when there is no speech, the omnidirectional estimation unit 21 sequentially estimates the combined value PMUSIC(φ) in some direction. Therefore, as shown by Equation 15, speech is determined to be detected when the ratio of the power P(θ, t) in the speech direction θ(t) to the average power Pave(φ, t) over the other directions is greater than or equal to a threshold value TSD.
Here, for example, 2-3 [dB] is set as the threshold value TSD.
The threshold value TSD may be any value that can distinguish between the case where there is speech and the case where there is no speech.
[0061]
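The image of Equation 15 is not reproduced; the decision it describes can be sketched as a simple power-ratio test (the example power profiles are illustrative):

```python
import numpy as np

def detect_speech(P_phi, threshold_db=2.5):
    """Equation 15 sketch: declare speech when the peak-direction power
    exceeds the average power over the remaining directions by at least
    threshold_db (the patent suggests a TSD of about 2-3 dB)."""
    k = int(np.argmax(P_phi))
    others = np.delete(P_phi, k)
    ratio_db = 10.0 * np.log10(P_phi[k] / others.mean())
    return ratio_db >= threshold_db, k  # (speech present?, peak index)

# A clear directional peak versus a flat (no-speech) power distribution
peaky = np.array([1.0, 1.0, 4.0, 1.0, 1.0])
flat  = np.array([1.0, 1.1, 1.0, 0.9, 1.0])
```

A flat distribution like FIG. 6 yields a ratio near 0 dB and is rejected, while a directional peak like FIG. 5 clears the threshold and its index gives the detected speaker direction.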
[0062]
In addition, the power values P(φ, t) in all directions in FIGS. 5 and 6 may be used as they are as the directivity pattern of the array microphone.
In this case, the directivity can be made sharper toward the sound source direction by increasing the gain as the level of the power value P(φ, t) in the estimated direction increases.
On the other hand, so that the pattern does not depend too strongly on the sequentially changing power of the voice, a peak hold function e<−μt>, which gradually attenuates with an attenuation time constant μ, may be applied, for example.
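A frame-by-frame sketch of such a decaying peak hold (the per-frame decay factor stands in for e<−μt>; its value and the test frames are assumptions):

```python
import numpy as np

def peak_hold(P_seq, mu=0.9):
    """Sketch of a decaying peak hold: each direction's held power decays
    geometrically per frame (factor `mu` is an assumed stand-in for the
    e^{-mu t} envelope) and is refreshed whenever the instantaneous
    power exceeds the held value."""
    held = np.zeros_like(P_seq[0])
    out = []
    for P in P_seq:
        held = np.maximum(P, held * mu)
        out.append(held.copy())
    return out

# One strong frame in direction 0, then silence there for two frames
frames = [np.array([5.0, 1.0]), np.array([0.0, 1.0]), np.array([0.0, 1.0])]
held = peak_hold(frames, mu=0.5)
```

The held pattern decays smoothly instead of collapsing between syllables, so the directivity pattern does not flutter with the short-term power of the voice.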
[0063]
[0064]
Needless to say, the present invention is not limited to the above-described embodiment, and modifications can be made as appropriate within the scope of the claims of the present invention.
[0065]
FIG. 1 is a block diagram showing a speaker direction detection system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the sound source direction detection unit.
FIG. 3 is a diagram showing an example of the cross power spectrum.
FIG. 4 is a diagram showing the weighted average of in-band frequency components.
FIG. 5 is a diagram showing the omnidirectional estimated distribution (incident angle 45°).
FIG. 6 is a diagram showing the omnidirectional estimated distribution (without voice).
FIG. 7 is a diagram showing the basic principle of conventional speaker direction detection.
Explanation of Reference Signs
[0066]
Reference Signs List: 1 … microphone unit, 2 … vertical direction microphone array, 3 … horizontal direction microphone array, 4 … speaker, 5 … signal processing device, 6 … sound source direction detection unit, 7 … voice detection unit, 11 … short time Fourier transform unit, 12, 22 … cross power spectrum units, 13, 23 … frequency vector units, 14, 24 … inter-microphone average processing units, 15, 25 … band averaging processing units, 16, 26 … harmonic selection units, 17, 27 … covariance matrix forming units, 18 … vertical direction estimation unit, 28 … horizontal direction estimation unit, 19 … pitch extraction unit, 21 … omnidirectional estimation unit