Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2011124872
PROBLEM TO BE SOLVED: To provide a sound source separation device capable of easily
separating a sound source even if there are a plurality of disturbing sounds, and in which the
sound quality of the target sound after separation is good. A sound source separation device
according to the present invention separates a target sound and an interference sound from the
sound reception signals of microphones. First and second target sound dominant spectra are
generated by first and second linear combination processes for target sound enhancement, using
the sound reception signals of two microphones arranged at an interval. In addition, a target
sound suppression spectrum is generated by a linear combination process for target sound
suppression using the same two sound reception signals. Furthermore, a linear combination
process using the same two sound reception signals generates a phase signal that has directivity
in the target sound direction and therefore contains many signal components of the target sound.
Then, the target sound and the interference sound are separated using the first target sound
dominant spectrum, the second target sound dominant spectrum, the target sound suppression
spectrum, and the phase signal. [Selected figure] Figure 1
Sound source separation device, method and program
[0001]
The present invention relates to a sound source separation device, method, and program, and can
be applied, for example, in a portable device such as a portable telephone or an on-vehicle device
such as a car navigation system, when acquiring a desired voice separately from interference
sound arriving from any direction other than the arrival direction of that voice.
[0002]
04-05-2019
1
When voice is input with a microphone for voice recognition or for telephone message recording,
there is a problem that ambient noise may severely degrade voice recognition accuracy, or that
the recorded voice may be difficult to hear because of the noise.
[0003]
For this reason, an attempt has been made to selectively acquire only a desired voice by
controlling the directivity characteristics by using a microphone array.
However, it has been difficult to extract desired speech separately from background noise only
by controlling such directional characteristics.
The technique of directivity control by a microphone array is itself known; for example, there are
techniques related to directivity control by a delayed sum array (DSA: Delayed Sum Array, also
called beam forming, BF) and techniques related to directivity control using a DCMP
(Directionally Constrained Minimization of Power) adaptive array.
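As a point of reference for the delay-and-sum (DSA/BF) approach named above, a minimal sketch follows. The function name, the uniform linear array geometry, and the assumed speed of sound are our illustrative assumptions, not details from any cited document:

```python
import numpy as np

def delay_and_sum(signals, mic_positions, theta, fs, c=340.0):
    """Steer an array of microphone signals toward angle theta (radians)
    by delaying each channel appropriately and summing.

    signals: (num_mics, num_samples) array of time-domain channels
    mic_positions: positions of the mics along a line [m]
    fs: sampling frequency [Hz], c: assumed speed of sound [m/s]
    """
    num_mics, num_samples = signals.shape
    out = np.zeros(num_samples)
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    for i in range(num_mics):
        # time delay that aligns this mic with the steering direction
        tau = mic_positions[i] * np.sin(theta) / c
        spectrum = np.fft.rfft(signals[i])
        # apply the (possibly fractional) delay as a linear phase shift
        spectrum *= np.exp(-2j * np.pi * freqs * tau)
        out += np.fft.irfft(spectrum, n=num_samples)
    return out / num_mics
```

Signals arriving from the steering direction add coherently, while sounds from other directions are attenuated, which is the directivity control the text describes.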
[0004]
On the other hand, as a technology for separating remotely captured speech, a technology has
been proposed that performs narrow-band spectrum analysis of the output signals of a plurality
of fixed microphones and assigns the sound of each frequency band to the microphone giving the
largest amplitude in that band (SAFIA; see Patent Document 1). In this voice separation
technology by band selection (BS: Band Selection), in order to obtain a desired voice, the
microphone closest to the sound source emitting the desired voice is selected, and the voice is
synthesized using the sound of the frequency bands assigned to that microphone.
[0005]
Further, Patent Document 2 proposes a method that improves on this band selection. Hereinafter,
the sound source separation method described in Patent Document 2 will be described with
reference to FIG. 3.
[0006]
In the method of Patent Document 2, the two microphones 321 and 322 are arranged in a
direction perpendicular or substantially perpendicular to the arrival direction of the target sound.
[0007]
In the target sound dominant signal generation unit 330, the first target sound dominant signal
generation means 331 takes, on the time domain or the frequency domain, the difference
between the sound reception signal X1 (t) of the microphone 321 and the signal D (X2 (t))
obtained by delaying the sound reception signal of the microphone 322, and generates the first
target sound dominant signal X1 (t) - D (X2 (t)). The second target sound dominant signal
generation means 332 takes, on the time domain or the frequency domain, the difference
between the sound reception signal X2 (t) of the microphone 322 and the signal D (X1 (t))
obtained by delaying the sound reception signal of the microphone 321, and generates the
second target sound dominant signal X2 (t) - D (X1 (t)).
The target sound inferior signal generation means 340 takes, on the time domain or the
frequency domain, the difference between the sound reception signals X1 (t) and X2 (t) of the
two microphones 321 and 322 to generate the target sound inferior signal X1 (t) - X2 (t). These
three types of signals, X1 (t) - D (X2 (t)), X2 (t) - D (X1 (t)) and X1 (t) - X2 (t), are each
frequency-analyzed in the frequency analysis means 350.
[0008]
Then, in the first separation means 361, band selection (or spectral subtraction) is performed
using the spectrum of the first target sound dominant signal and the spectrum of the target
sound inferior signal, so that the sound coming from the space on the side where the microphone
321 is installed (the left space in FIG. 4B described later) is separated. In the second separation
means 362, band selection (or spectral subtraction) is performed using the spectrum of the
second target sound dominant signal and the spectrum of the target sound inferior signal, so
that the sound coming from the space on the side where the microphone 322 is installed (the
right space in FIG. 4B) is separated. The integration means 363 separates the target sound by a
spectrum integration process using the spectrum output from the first separation means 361
and the spectrum output from the second separation means 362.
[0009]
As the first target sound dominant signal generation means 331, the second target sound
dominant signal generation means 332, and the target sound inferior signal generation means
340 described above, filters called spatial filters are used.
[0010]
The spatial filter will be described with reference to FIG. 4.
In FIG. 4B, consider a sound source whose sound is input at an angle θ with respect to the two
microphones 321 and 322 arranged at an interval d. A path difference of d × sin θ arises
between the distances from the sound source to the two microphones, and as a result a time
difference τ, given by equation (1), occurs in the arrival of the sound from the sound source.
[0011]
τ = {d × sin θ} / (propagation speed of sound) (1)
When the output of one microphone is delayed by the time difference τ and subtracted from the
output of the other microphone, the two cancel each other out, and the sound in the direction of
the suppression angle θ is suppressed. FIG. 4A shows, for each direction of the sound source, the
gain after the suppression processing of a spatial filter set to the suppression angle θ. The first
and second target sound dominant signal generation means 331 and 332 extract the target
sound component and suppress the disturbing sound component by using spatial filters in which
the suppression angle θ is set to, for example, -90 degrees and 90 degrees, respectively. On the
other hand, the target sound inferior signal generation means 340 suppresses the target sound
component and extracts the interference sound component by using a spatial filter whose
suppression angle θ is 0 degrees.
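The delay-and-subtract spatial filter around equation (1) can be sketched in the frequency domain as follows. This is a minimal sketch: the function name, the variable names, and the convention that the wave reaches microphone 1 first are our assumptions, not the patent's notation:

```python
import numpy as np

C = 340.0  # assumed propagation speed of sound [m/s]

def suppress_direction(d1, d2, theta, d, fs, n_fft):
    """Null the sound arriving from suppression angle theta (radians).

    d1, d2: one-sided spectra (np.fft.rfft) of the two microphone signals
    d: microphone spacing [m], fs: sampling frequency [Hz]
    """
    tau = d * np.sin(theta) / C          # equation (1): arrival time difference
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # delaying the mic-1 spectrum by tau aligns it with mic 2 for sound
    # arriving from theta, so the difference cancels that direction
    return d1 * np.exp(-2j * np.pi * freqs * tau) - d2
```

For a plane wave from the suppression angle the two aligned spectra are identical and the output is (ideally) zero, which is the cancellation the paragraph describes.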
[0012]
The band selection process in the first separation means 361 or the second separation means
362 consists of a selection process from the two spectra accompanied by the normalization
process shown in equation (2), and a calculation process of the separated spectrum shown in
equation (3). In equations (2) and (3), S (m) is the m-th spectral element after the band selection
process, M (m) is the m-th spectral element of the first or second target sound dominant signal,
N (m) is the m-th spectral element of the target sound inferior signal, D (m) is the m-th spectral
element of the sound reception signal of the microphone 321 (or the microphone 322)
corresponding to the first separation means 361 (or the second separation means 362), and
H (m) is the m-th spectral element of the separated signal.
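Equations (2) and (3) themselves are not reproduced in this translation. Based on the surrounding description (a normalized comparison of the dominant and inferior spectra, a magnitude derived from √(M(m) − N(m)), and a phase taken from D(m), see paragraph [0015]), one plausible reading can be sketched as follows; this is an illustrative reconstruction, not the patent's exact formula:

```python
import numpy as np

def band_select(M, N, D):
    """Hypothetical band selection: keep a band when the target-dominant
    magnitude exceeds the target-inferior magnitude, take the magnitude
    from sqrt(|M|^2 - |N|^2) (floored at zero), and take the phase from
    the raw microphone spectrum D, as paragraph [0015] describes."""
    mag2 = np.abs(M) ** 2 - np.abs(N) ** 2
    S = np.where(mag2 > 0, np.sqrt(np.maximum(mag2, 0.0)), 0.0)
    phase = D / np.maximum(np.abs(D), 1e-12)   # unit-magnitude phase of D
    return S * phase
```

Note that using the phase of D(m) here is exactly the point the invention later criticizes, since D(m) still contains interference sound.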
[0013]
Patent Document 1: JP-A-10-313497
Patent Document 2: JP-A-2006-197552
[0014]
In the above-mentioned SAFIA, in the situation where two sounds overlap, both can be separated
well.
However, when the number of sound sources is three or more, although separation is
theoretically possible, the separation performance is extremely degraded. Therefore, in the
presence of a plurality of noise sources, it is difficult to accurately separate the target sound from
the received signal including the plurality of noises.
[0015]
On the other hand, the method described in Patent Document 2 calculates frequency
characteristics in which the sound signal (audio signal, acoustic signal) from each sound source is
appropriately emphasized, and eliminates the interference sound by appropriately comparing the
magnitudes of the amplitude values of the same frequency band in the respective frequency
characteristics. Here, from the equations (2) and (3) described above, it can be seen that the
magnitude of the separated spectrum H (m) is obtained from √ (M (m) − N (m)), and that its
phase is obtained using the phase of the signal D (m) input from one of the microphones 321 (or
322). The signal D (m) input from the microphone 321 contains interference sound in addition to
the target sound, and it must be said that it is unsuitable for use near the final stage where the
interference sound is eliminated. This causes the deterioration of the sound quality after the final
sound source separation.
[0016]
Therefore, a sound source separation apparatus, method and program are desired which can
easily separate sound sources even if there are a plurality of disturbing sounds, and which have
good sound quality of the target sound after separation.
[0017]
According to a first aspect of the present invention, there is provided a sound source separation
apparatus for separating a target sound from an interference sound coming from any direction
other than the direction of arrival of the target sound, characterized by comprising: (1) first
target sound dominant spectrum generation means for generating at least one first target sound
dominant spectrum by performing a first linear combination process for target sound
enhancement, on the time axis or on the frequency domain, using the sound reception signals of
two microphones among a plurality of microphones arranged at intervals; (2) second target
sound dominant spectrum generation means for generating at least one second target sound
dominant spectrum by performing a second linear combination process for target sound
enhancement, on the time axis or on the frequency domain, using the sound reception signals of
the two microphones used for generation of the first target sound dominant spectrum; (3) target
sound suppression spectrum generation means for generating at least one target sound
suppression spectrum paired with the first target sound dominant spectrum and the second
target sound dominant spectrum, by performing a linear combination process for target sound
suppression, on the time axis or on the frequency domain, using the sound reception signals of
the two microphones used for generation of the first target sound dominant spectrum; (4) phase
generation means for generating a phase signal by performing a linear combination process on
the frequency domain using the sound reception signals of a plurality of microphones among the
plurality of microphones arranged at intervals; and (5) target sound separation means for
separating the target sound and the interference sound using the first target sound dominant
spectrum, the second target sound dominant spectrum, the target sound suppression spectrum,
and the phase signal.
[0018]
A second aspect of the present invention is a sound source separation method for separating a
target sound from an interference sound coming from any direction other than the direction of
arrival of the target sound, comprising first target sound dominant spectrum generation means,
second target sound dominant spectrum generation means, target sound suppression spectrum
generation means, phase generation means, and target sound separation means, wherein: (1) the
first target sound dominant spectrum generation means generates at least one first target sound
dominant spectrum by performing a first linear combination process for target sound
enhancement, on the time axis or on the frequency domain, using the sound reception signals of
two microphones among a plurality of microphones arranged at intervals; (2) the second target
sound dominant spectrum generation means generates at least one second target sound
dominant spectrum by performing a second linear combination process for target sound
enhancement, on the time axis or on the frequency domain, using the sound reception signals of
the two microphones used for generation of the first target sound dominant spectrum; (3) the
target sound suppression spectrum generation means generates at least one target sound
suppression spectrum paired with the first target sound dominant spectrum and the second
target sound dominant spectrum, by performing a linear combination process for target sound
suppression, on the time axis or on the frequency domain, using the sound reception signals of
the two microphones used for generation of the first target sound dominant spectrum; (4) the
phase generation means generates a phase signal by performing a linear combination process on
the frequency domain using the sound reception signals of a plurality of microphones among the
plurality of microphones arranged at intervals; and (5) the target sound separation means
separates the target sound and the interference sound using the first target sound dominant
spectrum, the second target sound dominant spectrum, the target sound suppression spectrum,
and the phase signal.
[0019]
A third aspect of the present invention is a sound source separation program for separating a
target sound from an interference sound coming from any direction other than the direction of
arrival of the target sound, characterized by causing a computer to function as: (1) first target
sound dominant spectrum generation means for generating at least one first target sound
dominant spectrum by performing a first linear combination process for target sound
enhancement, on the time axis or on the frequency domain, using the sound reception signals of
two microphones among the sound reception signals of a plurality of microphones arranged at
intervals; (2) second target sound dominant spectrum generation means for generating at least
one second target sound dominant spectrum by performing a second linear combination process
for target sound enhancement, on the time axis or on the frequency domain, using the sound
reception signals of the two microphones used for generation of the first target sound dominant
spectrum; (3) target sound suppression spectrum generation means for generating at least one
target sound suppression spectrum paired with the first target sound dominant spectrum and
the second target sound dominant spectrum, by performing a linear combination process for
target sound suppression, on the time axis or on the frequency domain, using the sound
reception signals of the two microphones used for generation of the first target sound dominant
spectrum; (4) phase generation means for generating a phase signal by performing a linear
combination process on the frequency domain using the sound reception signals of a plurality of
microphones among the sound reception signals of the plurality of microphones arranged at
intervals; and (5) target sound separation means for separating the target sound and the
interference sound using the first target sound dominant spectrum, the second target sound
dominant spectrum, the target sound suppression spectrum, and the phase signal.
[0020]
According to the present invention, even if there are a plurality of disturbing sounds, the sound
source can be easily separated, and furthermore, the sound quality of the target sound after
separation can be improved.
[0021]
FIG. 1 is a block diagram showing the entire configuration of a sound source separation device
according to a first embodiment.
FIG. 2 is a block diagram showing the entire configuration of a sound source separation device
according to a second embodiment.
FIG. 3 is a block diagram showing the configuration of a conventional sound source separation
device.
FIG. 4 is an explanatory diagram of a spatial filter.
[0022]
(A) First Embodiment Hereinafter, a first embodiment of a sound source separation device,
method and program according to the present invention will be described with reference to the
drawings.
The application of the sound source separation device according to the first embodiment is not
limited; for example, it may be mounted as a preprocessing device (noise removal device) in the
initial processing stage of captured voice for a voice recognition device or the like.
[0023]
(A-1) Configuration of First Embodiment FIG. 1 is a block diagram showing the overall
configuration of the sound source separation device according to the first embodiment.
The sound source separation apparatus according to the first embodiment may be configured
exclusively by a combination of discrete parts, a semiconductor chip, or the like; it may also be
configured by installing the sound source separation program (including fixed data) of the first
embodiment on an information processing apparatus such as a personal computer including a
processor (not limited to one; the system may be configured to perform distributed processing
over a plurality of machines); furthermore, a digital signal processor into which the sound source
separation program according to the first embodiment is written may be used. There is no
limitation on the method of realization, but functionally the device can be represented as shown
in FIG. 1. Even when focusing on software processing, a hardware configuration applies to the
microphones and the analog / digital converters.
[0024]
In FIG. 1, the sound source separation device 10 according to the first embodiment mainly
includes an input unit 20, an analysis unit 30, a separation unit 40, a removal unit 50, a
generation unit 60, and a phase generation unit 70.
[0025]
The input means 20 comprises two microphones 21, 22 spaced apart and two analog / digital
converters not shown.
Each of the microphones 21 and 22 is nondirectional, or has moderate directivity in the direction
perpendicular to the straight line connecting the microphones 21 and 22. In addition to the
target sound from the target sound source intended by the sound source separation device 10,
each of the microphones 21 and 22 also captures interference sound from other sound sources,
noise whose sound source is not clear, and the like (hereinafter collectively called interference
sound). The analog / digital converters (not shown) convert the sound reception signals obtained
by the corresponding microphones 21 and 22 capturing the voices and sounds in the space into
digital signals.
[0026]
The means for inputting the sound signals to be processed is not limited to the microphones 21
and 22. For example, sound reception signals from two microphones may be reproduced and
input from a recording device that records sound, or sound reception signals from two
microphones provided in a device at the other end of a communication may be acquired through
communication and used as input signals. Such an input signal may be an analog signal, or may
already have been converted into a digital signal. Even in the case of input by recording /
reproduction, communication and the like, since capture is initially performed by a microphone,
the term "microphone" is used in the claims to include such cases.
[0027]
The digital signal relating to the sound reception signal of the microphone 21 is denoted x1 (n),
and the digital signal relating to the sound reception signal of the microphone 22 is denoted
x2 (n), where n represents the n-th data (sample). The digital signals x1 (n) and x2 (n) are
obtained by analog-to-digital conversion, at every sampling period T, of the sound reception
signals consisting of the analog signals captured by the microphones. The sampling period T is
usually about 31.25 microseconds to about 125 microseconds. The subsequent processing is
performed with N consecutive samples of x1 (n) and x2 (n) in one and the same time interval as
one analysis unit (frame). Here, N = 1024 as an example. When the series of sound source
separation processes for the analysis unit being processed is finished, the latter 3N / 4 data of
x1 (n) and x2 (n) are shifted to the first half, and newly input consecutive N / 4 data are
connected to the second half, so that new N consecutive x1 (n) and x2 (n) are generated and
processed as a new analysis unit; this processing is repeated analysis unit by analysis unit.
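The 3N/4 shift described above amounts to framing the input with a hop of N/4 samples, so consecutive analysis units share 3N/4 samples. A minimal sketch (the frame length and hop follow the text; the function name is ours):

```python
import numpy as np

def frames(x, n=1024):
    """Split x into analysis units of n samples that advance by n / 4,
    so consecutive units share their last / first 3n / 4 samples."""
    hop = n // 4
    return [x[i:i + n] for i in range(0, len(x) - n + 1, hop)]
```

Each new analysis unit therefore contains only N/4 newly input samples, which is why the processing deadline per unit is later stated as NT/4.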
[0028]
The analysis unit 30 includes frequency analysis units 31 and 32 corresponding to the
microphones 21 and 22, respectively. The frequency analysis unit 31 performs frequency
analysis on the digital signal x1 (n), and the frequency analysis unit 32 performs frequency
analysis on the digital signal x2 (n). In other words, the frequency analysis units 31 and 32
convert the digital signals x1 (n) and x2 (n), which are signals on the time axis, into signals on the
frequency domain. Here, it is assumed that FFT (Fast Fourier Transform) is applied to frequency
analysis. In the FFT processing, a window function is applied to the digital signals x1 (n) and
x2 (n) in which N data are continuous. Although various window functions can be applied as the
window function w (n), for example, a Hanning window as shown in equation (4) is applied. The
window process is performed in consideration of the connection process of analysis units in the
generation means 60 described later. Applying a window function is preferable, but it is not an
essential process.
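Equation (4) is referenced but not reproduced in this translation; assuming the common Hanning definition w(n) = 0.5 − 0.5 cos(2πn/N), the windowed FFT step of the frequency analysis units can be sketched as:

```python
import numpy as np

N = 1024  # analysis unit length, as in the text

def analyze(x_frame):
    """Apply a Hanning window (one common form of equation (4)) and
    return the one-sided spectrum D(m) of the analysis unit."""
    n = np.arange(N)
    w = 0.5 - 0.5 * np.cos(2 * np.pi * n / N)   # assumed form of eq. (4)
    return np.fft.rfft(x_frame * w)
```

The resulting complex spectrum corresponds to the D1(m) or D2(m) of paragraph [0029], with m the band index on the frequency axis.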
[0029]
The signals on the frequency domain output from the frequency analysis units 31 and 32 are D1
(m) and D2 (m), respectively. The signals in the frequency domain (hereinafter referred to as
spectra as appropriate) D1 (m) and D2 (m) are each represented by a complex number. The
parameter m represents the order on the frequency axis, that is, the m-th band.
[0030]
The frequency analysis method is not limited to FFT; another frequency analysis method such as
DFT (Discrete Fourier Transform) may be applied. Further, depending on the device on which the
sound source separation device 10 of the first embodiment is mounted, a frequency analysis unit
provided in that device for another purpose may be diverted as part of the configuration of the
sound source separation device 10. For example, when the device on which the sound source
separation device 10 is mounted is an IP telephone, such diversion is possible. In the case of the
IP telephone, a coded version of the FFT output is inserted into the payload of the IP packet, and
that FFT output can be diverted as the output of the analysis means 30 described above.
[0031]
The separation means 40 is for extracting the sound whose sound source is located on a plane
perpendicular to the line connecting the two microphones 21 and 22, that is, the target sound.
The separation means 40 has three spatial filters 41, 42 and 43 and a minimum selection unit
44.
[0032]
The processing at each part of the separation means 40 described below may be performed only
in the range of 0 ≦ m ≦ N / 2, because the spectrum D (m) (where D (m) is D1 (m) or D2 (m))
has the property D (m) = D * (N − m) for 1 ≦ m ≦ N / 2 − 1, where D * (N − m) represents
the complex conjugate of D (N − m).
[0033]
The spatial filters 41 and 42 are for emphasizing (making dominant) the target sound with
respect to the disturbance sound.
The spatial filters 41 and 42 are spatial filters having different directivities. The spatial filter 41
is, for example, a spatial filter whose suppression angle θ in FIG. 4 described above is 90 degrees
clockwise with respect to the plane perpendicular to the line connecting the two microphones 21
and 22. On the other hand, the spatial filter 42 is, for example, a spatial filter whose suppression
angle θ in FIG. 4 is 90 degrees counterclockwise with respect to that plane. The process of the
spatial filter 41 can be represented mathematically by equation (5), and the process of the
spatial filter 42 by equation (6). In equations (5) and (6), f is the sampling frequency (for
example, 16000 Hz). Equations (5) and (6) are linear combination equations of the input spectra
D1 (m) and D2 (m) to the spatial filters 41 and 42, respectively.
[0034]
The suppression angle θ in the spatial filters 41 and 42 is not limited to the clockwise 90
degrees and the counterclockwise 90 degrees described above, but may be somewhat different
from this angle.
[0035]
The spatial filter 43 is for making the target sound inferior to the disturbance sound.
The spatial filter 43 corresponds to the spatial filter in the case where the suppression angle θ in
FIG. 4 described above is 0 degrees; it extracts the interference sound from sound sources
located toward the extension direction of the line connecting the two microphones 21 and 22,
while making the target sound inferior. The processing of the spatial filter 43 can be expressed
mathematically by equation (7). Equation (7) is a linear combination equation of the input
spectra D1 (m) and D2 (m) to the spatial filter 43.
[0036]
N (m) = D1 (m) − D2 (m) (7)
The minimum selection unit 44 forms a target sound emphasis spectrum M (m) by integrating
the spectrum E1 (m) in which the target sound is emphasized, output from the spatial filter 41,
and the spectrum E2 (m) in which the target sound is emphasized, output from the spatial filter
42. As shown in equation (8), for each band, the minimum selection unit 44 selects the minimum
of the absolute value of the output spectrum E1 (m) from the spatial filter 41 and the absolute
value of the output spectrum E2 (m) from the spatial filter 42 as the output spectrum M (m) of
the minimum selection unit 44.
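The band-wise minimum selection of equation (8) can be sketched directly. The translation only specifies a comparison of absolute values; keeping the complex value whose magnitude is smaller is one natural reading, and the function name is ours:

```python
import numpy as np

def min_select(E1, E2):
    """Per band, pick the spectral value with the smaller magnitude
    (one reading of equation (8)); the weaker of the two enhanced
    spectra is the one less contaminated by interference from its side."""
    return np.where(np.abs(E1) <= np.abs(E2), E1, E2)
```

Taking the minimum means a band survives only if both directional filters agree it is target-dominated, which suppresses interference leaking through either single filter.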
[0037]
The phase generation means 70 uses the output spectrum D1 (m) from the frequency analysis
unit 31 and the output spectrum D2 (m) from the frequency analysis unit 32 to generate a
spectrum F (m) (hereinafter referred to as a phase spectrum) that contains many target sound
components and is used to give the phase of the separated signal. As shown in equation (9), the
phase generation means 70 adds the output spectrum D1 (m) from the frequency analysis unit
31 and the output spectrum D2 (m) from the frequency analysis unit 32 to generate the phase
spectrum F (m).
[0038]
F (m) = D1 (m) + D2 (m) (9)
The phase generation means 70 that calculates equation (9) is a spatial filter having directivity
in the target sound direction. Since the phase spectrum F (m) has directivity in the direction of
the target sound, it contains many signal components of the target sound, and because no
band-by-band selection processing is performed, its phase component is continuous and does not
have steep characteristics.
[0039]
Incidentally, the phase information used for target sound separation needs to contain many
target sound components, and it is also conceivable to use the phase component of the signal
after band selection. However, the band selection processing causes discontinuity in the phase
component, and when the signal after band selection is used, the sound quality of the separated
target sound is degraded. Therefore, it is appropriate to apply a spatial filter that implements
equation (9).
[0040]
The removing means 50 removes the interference sound using the output spectrum M (m) of the
minimum selection unit 44, the output spectrum N (m) of the spatial filter 43, and the output
spectrum F (m) of the phase generation means 70; in other words, it obtains an output in which
only the target sound is separated and extracted. The removing means 50 comprises a selection
process from the two spectra M (m) and N (m) accompanied by the normalization process shown
in equation (10), and a process of calculating, from the obtained spectrum S (m), the separated
spectrum H (m) shown in equation (11).
[0041]
Here, the processing of equations (10) and (11) is also performed in the range of 0 ≦ m ≦
N / 2, in consideration of the relationship between a complex number and its conjugate described
above. Therefore, the removing means 50 determines the separated spectrum H (m) in the range
of 0 ≦ m ≦ N − 1 from the separated spectrum H (m) in the range of 0 ≦ m ≦ N / 2 obtained
according to equation (11), using the relationship H (m) = H * (N − m) (where N / 2 + 1 ≦ m
≦ N − 1).
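Restoring the full spectrum from the half range via H(m) = H*(N − m), as just described, can be sketched as follows (the function name is ours; the symmetry itself is the standard property of a real signal's spectrum):

```python
import numpy as np

def full_spectrum(H_half, N=1024):
    """Extend H(m), 0 <= m <= N/2, to 0 <= m <= N-1 using the
    conjugate symmetry H(m) = H*(N - m) for N/2 + 1 <= m <= N - 1."""
    H = np.empty(N, dtype=complex)
    H[: N // 2 + 1] = H_half
    # mirror bins 1 .. N/2-1 in reverse order, conjugated
    H[N // 2 + 1:] = np.conj(H_half[1: N // 2][::-1])
    return H
```

This halves the per-band work of the separation stages while still yielding a real-valued signal after the inverse FFT.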
[0042]
The generation means 60 converts the separated spectrum (interference sound removal
spectrum) H (m), which is a signal in the frequency domain, into a signal on the time axis, and
connects the signals of successive analysis units to recover a continuous signal. Note that
digital / analog conversion may be performed as necessary. The generation means 60 performs
N-point inverse FFT processing on the separated spectrum H (m) to obtain the sound source
separated signal h (n), and then, as shown in equation (12), adds the current sound source
separated signal h (n) and the last 3N / 4 data of the sound source separated signal h ′ (n) for
the immediately preceding analysis unit to obtain the final separated signal y (n): y (n) = h (n) +
h ′ (n + N / 4) (12). Here, the above-described processing is performed while shifting by N / 4
data so that data (samples) overlap between consecutive analysis units. This is done to make the
waveform connection smooth, and this method is often used. The time allowed for the
above-described series of processes from the analysis means 30 to the generation means 60 is
NT / 4 for one analysis unit.
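The inverse FFT and the reconnection of analysis units around equation (12) can be sketched as a generic overlap-add with a hop of N/4 samples. This is a simplified sketch that accumulates all overlapping units at once, not necessarily the exact buffering of the generation means 60:

```python
import numpy as np

def generate(spectra, N=1024):
    """Inverse-FFT each separated spectrum H(m) and overlap-add the
    resulting analysis units with a hop of N / 4 samples, recovering a
    continuous time signal in the spirit of equation (12)."""
    hop = N // 4
    out = np.zeros(hop * (len(spectra) - 1) + N)
    for k, H in enumerate(spectra):
        h = np.fft.irfft(H, n=N)   # N-point inverse FFT of one unit
        out[k * hop: k * hop + N] += h
    return out
```

With the Hanning analysis window of equation (4) and a hop of N/4, the overlapped windows sum to a constant in the steady state, which is why the waveform connection is smooth.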
[0043]
Note that depending on the application of the sound source separation device 10, the generation
means 60 can be omitted, and generation portions of other devices can be diverted. For example,
if the sound source separation device is used for a speech recognition device, the generation
means 60 can be omitted by using the separated spectrum H (m) as the recognition feature
amount. Also, for example, in the case where the sound source separation device is used for an IP
telephone, since the IP telephone has a generation unit, the generation unit may be diverted.
[0044]
(A-2) Operation of First Embodiment Next, the operation (sound source separation method) of the
sound source separation device 10 according to the first embodiment will be described.
[0045]
The sound reception signals captured by the microphones 21 and 22 are converted into digital signals x1(n) and x2(n), respectively, cut out into analysis units, and provided to the analysis means 30.
[0046]
In the analysis means 30, the digital signal x1(n) is frequency-analyzed by the frequency analysis unit 31 and the digital signal x2(n) by the frequency analysis unit 32, and the resulting spectra D1(m) and D2(m) are provided to the spatial filters 41, 42, and 43 and to the phase generation means 70.
[0047]
In the spatial filter 41, the calculation of equation (5) is executed on the spectra D1(m) and D2(m), suppressing the interference sound arriving from the direction 90 degrees to the right of the plane perpendicular to the line connecting the two microphones 21 and 22, so that a spectrum E1(m) in which the target sound is emphasized is obtained. Likewise, in the spatial filter 42, the calculation of equation (6) is executed on the spectra D1(m) and D2(m), suppressing the interference sound arriving from the direction 90 degrees to the left of that plane, so that a spectrum E2(m) in which the target sound is emphasized is obtained.
In the minimum selection unit 44, as shown in equation (8), a process of selecting, for each band, the minimum of the absolute value of the output spectrum E1(m) from the spatial filter 41 and the absolute value of the output spectrum E2(m) from the spatial filter 42 is executed to obtain the integrated target sound emphasis spectrum M(m), and this spectrum M(m) is given to the removal means 50.
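The per-band minimum selection described above can be sketched as follows. Equation (8) itself is not reproduced in this excerpt, so this is a plausible reading, not the patent's exact formula: for each bin the spectrum with the smaller magnitude is kept, since each spatial filter has already suppressed interference from one side.

```python
import numpy as np

def minimum_select(E1, E2):
    """Per-band minimum selection in the spirit of equation (8):
    for each frequency bin m, keep whichever of E1(m), E2(m) has
    the smaller magnitude, yielding the integrated M(m)."""
    E1, E2 = np.asarray(E1), np.asarray(E2)
    take_e1 = np.abs(E1) <= np.abs(E2)
    return np.where(take_e1, E1, E2)
```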
[0048]
Further, in the spatial filter 43, the calculation of equation (7) is executed on the spectra D1(m) and D2(m) to extract the interference sound from sound sources located in the extension direction of the line connecting the two microphones 21 and 22, so that a spectrum N(m) in which the target sound is inferior to the interference sound is obtained; this spectrum N(m) is given to the removal means 50.
[0049]
The phase generation means 70 executes the calculation of equation (9) on the spectra D1(m) and D2(m) to generate the phase spectrum F(m), which contains many target sound components and is used for target sound separation; this phase spectrum F(m) is given to the removal means 50.
[0050]
In the removal means 50, the selection process between the two spectra M(m) and N(m), with the normalization process using the phase spectrum F(m) shown in equation (10), is performed; the calculation of the separated spectrum H(m) shown in equation (11) is then executed; further, the range of m in the separated spectrum H(m) is expanded, and the separated spectrum H(m) after range expansion is provided to the generation means 60.
[0051]
In the generation means 60, the separated spectrum H(m), which is a signal in the frequency domain, is converted into a signal on the time axis, and the signals of successive analysis units are connected as shown in equation (12) to obtain the final separated signal y(n).
[0052]
(A-3) Effects of the First Embodiment According to the first embodiment, since band selection is the basic process, the target sound can be separated easily. Furthermore, since the phase information applied to the separated signal is obtained by combining a plurality of sound reception signals, a stable phase component related to the target sound can be used for target sound separation even if the received signals contain many interfering sound components; as a result, the sound quality of the target sound after separation can be improved.
[0053]
(B) Second Embodiment Next, a second embodiment of the sound source separation device,
method and program according to the present invention will be described with reference to the
drawings.
The sound source separation apparatus according to the first embodiment uses two microphones, whereas the second embodiment uses four microphones.
[0054]
FIG. 2 is a block diagram showing the overall configuration of the sound source separation apparatus according to the second embodiment; parts that are the same as or correspond to those in FIG. 1 of the first embodiment are given the same or corresponding reference numerals.
[0055]
In FIG. 2, a sound source separation apparatus 100 according to the second embodiment includes two sound source separation units 80-A and 80-B, a removal means 51, a generation means 60, and a phase generation means 71.
Each sound source separation unit 80-A, 80-B is provided with one each of the input means 20-A, 20-B, the analysis means 30-A, 30-B, and the separation means 40-A, 40-B.
[0056]
The input means 20-A and 20-B, the analysis means 30-A and 30-B, and the separation means 40-A and 40-B are similar to the input means 20, the analysis means 30, and the separation means 40 of the first embodiment, respectively.
However, among the four microphones 21-A, 21-B, 22-A, and 22-B provided in the sound source separation apparatus 100, the microphones 21-A and 22-A are components of the input means 20-A, and the microphones 21-B and 22-B are components of the input means 20-B.
For example, it is preferable that the line connecting the microphones 21-A and 22-A be orthogonal to the line connecting the microphones 21-B and 22-B.
[0058]
The phase generation means 71 of the second embodiment is given the two frequency analysis spectra DA1(m) and DA2(m) output from the analysis means 30-A and the two frequency analysis spectra DB1(m) and DB2(m) output from the analysis means 30-B. The phase generation means 71 adds the four input spectra DA1(m), DA2(m), DB1(m), and DB2(m) as shown in equation (13) to generate a phase spectrum F(m).
[0059]
F(m) = DA1(m) + DA2(m) + DB1(m) + DB2(m)   (13)

Since the phase spectrum F(m) of the second embodiment is also a signal obtained simply by adding the spectra of the four microphones, it contains many signal components of the target sound, and because no per-band selection process is performed, its phase component is continuous and does not have sharp characteristics.
[0060]
In the removal means 51 of the second embodiment, the output spectrum MA(m) of the minimum selection unit 44-A (not shown) of the separation means 40-A, the output spectrum NA(m) of the spatial filter 43-A (not shown), the output spectrum MB(m) of the minimum selection unit 44-B (not shown) of the separation means 40-B, the output spectrum NB(m) of the spatial filter 43-B (not shown), and the output spectrum F(m) of the phase generation means 71 are given.
[0061]
The removal means 51 executes the band selection process with the normalization process shown in equation (14), using these five spectra MA(m), NA(m), MB(m), NB(m), and F(m).
[0062]
The first half of the first condition in equation (14) represents the case where the power of the target sound dominant spectrum of the sound source separation unit 80-A is larger than the power of the target sound dominant spectrum of the sound source separation unit 80-B, and the first half of the second condition in equation (14) represents the opposite case; that is, band selection is performed between the sound source separation units 80-A and 80-B.
[0063]
The removal means 51 applies the spectrum S(m) of the band selection result and the output spectrum F(m) of the phase generation means 71 to calculate the separated spectrum H(m), and then expands the range of m in the separated spectrum H(m); this is the same as in the first embodiment.
[0064]
Also according to the second embodiment, since band selection is the basic process, the target sound can be separated easily, and a stable phase component related to the target sound can be used for target sound separation even when the received signals contain many interfering sound components; as a result, the sound quality of the target sound after separation can be enhanced.
[0065]
(C) Other Embodiments In the second embodiment, the case of using a total of four microphones (the two microphones 21-A and 22-A of the sound source separation unit 80-A and the two microphones 21-B and 22-B of the sound source separation unit 80-B) was shown, but a configuration with three microphones is also possible by sharing one microphone between the sound source separation unit 80-A and the sound source separation unit 80-B.
In this case, since the number of microphones is small and the sound source separation units 80-A and 80-B share some calculations (for example, the frequency analysis calculation), the final calculation amount is small, which is practical.
In this case, the phase generation means may simply add up the frequency analysis spectra corresponding to the three microphones, or the frequency analysis spectrum corresponding to the shared microphone may be weighted more than the other frequency analysis spectra (for example, doubled) before addition.
[0066]
Further, even in the case of using three microphones, a configuration different from the above may be adopted.
For example, three microphones may be disposed at the vertices of an equilateral triangle, and a sound source separation unit using the first and second microphones, a sound source separation unit using the second and third microphones, and a sound source separation unit using the third and first microphones may be provided and used for processing.
[0067]
Furthermore, the number of microphones may be increased to five or more, and similar sound source separation processing may be performed.
In this case, the phase generation means may add the frequency analysis spectra corresponding to the respective microphones.
Further, the removal means may select a sound source separation unit by the same minimum value search as in the second embodiment, and it is sufficient to obtain the band selection spectrum S(m) from the target sound dominant spectrum and the target sound inferior spectrum of the selected sound source separation unit.
[0068]
In the first and second embodiments, most of the processing is performed on signals (spectra) in the frequency domain, but some of the processing may be performed on signals on the time axis.
[0069]
The sound source separation apparatus, method, and program of the present invention can be used, for example, to separate the speech of an arbitrary speaker from the mixed speech of a plurality of speakers speaking at a distance, or to separate a speaker's voice from a mixture of that voice and other sounds. More specifically, they are suitable for use in, for example, dialogue with a robot, voice operation of in-vehicle devices such as a car navigation system, and the creation of meeting minutes.
[0070]
10, 100: sound source separation device; 20, 20-A, 20-B: input means; 21, 21-A, 21-B, 22, 22-A, 22-B: microphone; 30, 30-A, 30-B: analysis means; 31, 32: frequency analysis unit; 40, 40-A, 40-B: separation means; 41 to 43: spatial filter; 44: minimum selection unit; 50, 51: removal means; 60: generation means; 70, 71: phase generation means; 80-A, 80-B: sound source separation unit.