Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2012058360
An object of the present invention is to enable noise removal processing that is independent of the microphone spacing. A target sound emphasizing unit 105 performs target sound emphasis processing on the observation signals of microphones 101a and 101b to obtain a target sound estimation signal. A noise estimation unit 106 performs noise estimation processing on the observation signals of the microphones 101a and 101b to obtain a noise estimation signal. A post-filtering unit 109 removes the noise components remaining in the target sound estimation signal by post-filtering processing using the noise estimation signal, obtaining a noise suppression signal. A correction coefficient calculation unit 107 calculates correction coefficients that correct the post-filtering process, that is, that match the gain of the noise components remaining in the target sound estimation signal to the gain of the noise estimation signal. A correction coefficient changing unit 108 changes, among the correction coefficients calculated by the correction coefficient calculation unit 107, the coefficients of the band in which spatial aliasing occurs, so as to flatten the peaks that can arise at specific frequencies. [Selected figure] Figure 1
Noise removal apparatus and noise removal method
[0001]
The present invention relates to a noise removal device and a noise removal method, and more
particularly to a noise removal device and the like for removing noise by enhancing a target
sound and post-filtering processing.
[0002]
03-05-2019
1
For example, consider a situation in which the user listens to music reproduced by a mobile phone, a personal computer, or the like through noise canceling headphones.
In this situation, when a phone call or chat call comes in, it is very troublesome for the user to set up a microphone before starting to talk. It is desirable for the user to be able to start the call hands-free, as is, without preparing a microphone.
[0003]
A noise canceling microphone is installed at each ear of the noise canceling headphones, and it is conceivable to make calls using these microphones. This makes it possible to hold a call with the headphones on. In this case, ambient noise is a problem, so it is desirable to suppress the noise and transmit only the voice.
[0004]
For example, Patent Document 1 describes a technique for removing noise by target sound emphasis and post-filtering processing. FIG. 31 shows a configuration example of the noise removal device described in Patent Document 1. In this noise removal apparatus, the beamformer (11) emphasizes speech and the blocking matrix (12) emphasizes noise. Because the speech enhancement does not remove all of the noise, the noise reduction means (13) uses the emphasized noise to reduce the remaining noise component.
[0005]
Furthermore, in this noise removal device, the post-filtering means (14) removes the noise that still remains. In this case, although the outputs of the noise reduction means (13) and the processing means (15) are used, spectral errors occur in the filter characteristics. Therefore, correction is performed in the adaptation unit (16).
[0006]
In this case, correction is performed so that the output S1 of the noise reduction means (13) and the output S2 of the adaptation unit (16) become equal in sections where there is no target sound and only noise exists. This is expressed by the following equation (1), in which the left side is the expected value of the output S2 of the adaptation unit (16), and the right side is the expected value of the output S1 of the noise reduction means (13) in sections where there is no target sound.
[0007]
[0008]
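The equation itself did not survive the machine translation; from the surrounding description, equation (1) presumably states that the two expected values agree in sections containing no target sound:

```latex
E\{\,S_2(f,t)\,\} = E\{\,S_1(f,t)\,\}
\quad \text{(sections with no target sound)} \tag{1}
```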
With such correction, in the post-filtering means (14) there is no error between S1 and S2 in noise-only sections, so all of the noise can be removed; in (voice + noise) sections, only the noise component is removed, leaving the voice.
[0009]
This correction can be interpreted as correcting the directivity of the filter.
FIG. 32(a) shows an example of the directivity of the filter before correction, and FIG. 32(b) shows an example after correction.
In these figures, the vertical axis shows the gain; the higher the position, the higher the gain.
[0010]
In FIG. 32(a), the solid line a indicates the directivity characteristic created by the beamformer (11) to emphasize the target sound. This directivity characteristic emphasizes the frontal target sound and lowers the gain of sounds coming from other directions. The broken line b in FIG. 32(a) indicates the directivity characteristic produced by the blocking matrix (12). This directivity characteristic lowers the gain in the target sound direction in order to estimate the noise.
[0011]
Before correction, there is a gain error in the direction of the noise between the directivity of the target sound emphasis (solid line a) and the directivity of the noise estimation (broken line b). Therefore, when the noise estimation signal is subtracted from the target sound estimation signal in the post-filtering means (14), noise remains unerased or is erased excessively.
[0012]
In FIG. 32(b), the solid line a' indicates the directivity characteristic of the target sound emphasis after the correction, and the broken line b' indicates the directivity characteristic of the noise estimation after the correction. By applying the correction coefficient, the gains in the noise direction of the target sound emphasis directivity and of the noise estimation directivity can be matched. Therefore, when the noise estimation signal is subtracted from the target sound estimation signal in the post-filtering means (14), under-erasure and over-erasure of the noise are reduced.
[0013]
JP 2009-49998 A
[0014]
The noise suppression technique described in Patent Document 1 has the problem that the microphone spacing is not taken into consideration.
That is, depending on the microphone spacing, there are cases where the correction coefficient cannot be calculated correctly; if it is calculated incorrectly, the target sound may be distorted. When the microphone spacing is wide, the directivity characteristic suffers spatial aliasing, which amplifies or attenuates the gain in unintended directions.
[0015]
FIG. 33 shows an example of the directivity of the filter when spatial aliasing occurs: the solid line a shows the directivity of the target sound emphasis produced by the beamformer (11), and the broken line b shows the directivity of the noise estimation produced by the blocking matrix (12). In the directivity example shown in FIG. 33, noise is amplified together with the target sound. In this case, it is meaningless to obtain a correction coefficient, and the noise suppression performance is degraded.
[0016]
The noise suppression technique described in Patent Document 1 is premised on the microphone spacing being known in advance and, furthermore, being a spacing at which spatial aliasing does not occur. This premise is a considerable limitation. For example, at the sampling frequency of the telephone band (8000 Hz), the microphone spacing that does not cause spatial aliasing is about 4.3 cm.
[0017]
To avoid spatial aliasing, the distance between the microphones (element spacing) must be set appropriately in advance. Assuming that the velocity of sound is c, the distance between the microphones (element spacing) is d, and the frequency is f, the following equation (2) must be satisfied for spatial aliasing not to occur. d < c/2f ・・・(2)
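As a quick check of equation (2), the largest aliasing-free spacing can be computed for a given sampling frequency. This is a small sketch, assuming a speed of sound of about 343 m/s; the function name is illustrative:

```python
def max_spacing_no_aliasing(fs_hz, c=343.0):
    """Largest microphone spacing d (in meters) satisfying d < c / (2 f)
    for every frequency up to the Nyquist frequency fs / 2."""
    nyquist_hz = fs_hz / 2.0
    return c / (2.0 * nyquist_hz)

# Telephone-band sampling at 8000 Hz gives roughly the 4.3 cm spacing
# mentioned in the text.
d_m = max_spacing_no_aliasing(8000.0)
print(f"{d_m * 100:.2f} cm")
```

Since the constraint must hold for all frequencies up to Nyquist, the worst case f = fs/2 determines the bound.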
[0018]
For example, in the case of noise canceling microphones installed in noise canceling headphones, the microphone spacing is the distance between the left and right ears. In this case, the spacing of about 4.3 cm that avoids spatial aliasing as described above cannot be achieved.
[0019]
Moreover, the noise suppression technique described in Patent Document 1 has the problem that the number of ambient noise sources is not taken into consideration. That is, in a situation where there are innumerable noise sources in the surroundings, the ambient sound arrives randomly in each frame and at each frequency. In this case, the locations where the gain should be matched between the directivity characteristic of the target sound emphasis and that of the noise estimation shift in each frame and at each frequency. The correction coefficient therefore changes constantly over time and becomes unstable, which adversely affects the output sound.
[0020]
FIG. 34 shows a situation where there are innumerable noise sources in the surroundings. The solid line a indicates a directivity characteristic of the target sound emphasis similar to the solid line a in FIG. 32(a), and the dashed line b indicates a directivity characteristic of the noise estimation similar to the dashed line b in FIG. 32(a). If there are innumerable noise sources in the surroundings, there are many places where the gains of the two directivity characteristics should be matched. Since real environments contain innumerable noise sources as described above, the noise suppression technique of Patent Document 1 cannot cope with them.
[0021]
An object of the present invention is to enable noise removal processing that is independent of the microphone spacing. Another object of the present invention is to enable noise removal processing adapted to the ambient noise conditions.
[0022]
The concept of the present invention resides in a noise removal apparatus comprising: a target sound emphasizing unit that performs target sound emphasis processing on the observation signals of a first microphone and a second microphone arranged at a predetermined interval to obtain a target sound estimation signal; a noise estimation unit that performs noise estimation processing on the observation signals of the first microphone and the second microphone to obtain a noise estimation signal; a post-filtering unit that removes the noise components remaining in the target sound estimation signal obtained by the target sound emphasizing unit by post-filtering processing using the noise estimation signal obtained by the noise estimation unit; a correction coefficient calculation unit that calculates, for each frequency, a correction coefficient for correcting the post-filtering process performed by the post-filtering unit, based on the target sound estimation signal obtained by the target sound emphasizing unit and the noise estimation signal obtained by the noise estimation unit; and a correction coefficient changing unit that changes, among the correction coefficients calculated by the correction coefficient calculation unit, the correction coefficients of the band in which spatial aliasing occurs, so as to flatten the peaks that can arise at specific frequencies.
[0023]
In the present invention, the target sound emphasizing unit subjects the observation signals of the first microphone and the second microphone, arranged at a predetermined interval, to target sound emphasis processing to obtain the target sound estimation signal.
As the target sound emphasis processing, conventionally known processing such as DS (Delay and Sum) processing or adaptive beamformer processing is used, for example. Further, the noise estimation unit performs noise estimation processing on the observation signals of the first microphone and the second microphone to obtain the noise estimation signal. As the noise estimation processing, conventionally known processing such as NBF (Null Beamformer) processing or adaptive beamformer processing is used, for example.
[0024]
The post-filtering unit removes the noise components remaining in the target sound estimation signal obtained by the target sound emphasizing unit by post-filtering processing using the noise estimation signal obtained by the noise estimation unit. As the post-filtering processing, conventionally known methods such as spectral subtraction or the MMSE-STSA method are used, for example. Further, the correction coefficient calculation unit calculates, for each frequency, a correction coefficient for correcting the post-filtering process performed by the post-filtering unit, based on the target sound estimation signal obtained by the target sound emphasizing unit and the noise estimation signal obtained by the noise estimation unit.
[0025]
Among the correction coefficients calculated by the correction coefficient calculation unit, the correction coefficient changing unit changes the correction coefficients of the band in which spatial aliasing occurs, so as to flatten the peaks that can arise at specific frequencies. For example, the correction coefficient changing unit smooths the correction coefficients calculated by the correction coefficient calculation unit in the frequency direction within the band causing spatial aliasing, obtaining a changed correction coefficient for each frequency. Alternatively, for example, the correction coefficient changing unit changes the correction coefficient of each frequency in the band causing spatial aliasing to 1.
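The two variants described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the moving-average width and the frequency-bin layout are assumptions.

```python
import numpy as np

def smooth_aliasing_band(beta, alias_bin, width=5):
    """First variant: smooth the correction coefficients in the frequency
    direction, but only from the first spatial-aliasing bin upward, so that
    isolated peaks in the aliasing band are flattened."""
    out = beta.copy()
    half = width // 2
    for k in range(alias_bin, len(beta)):
        lo, hi = max(0, k - half), min(len(beta), k + half + 1)
        out[k] = beta[lo:hi].mean()  # moving average over neighboring bins
    return out

def replace_with_one(beta, alias_bin):
    """Second variant: replace every coefficient in the aliasing band with 1,
    i.e. leave the post-filtering uncorrected in that band."""
    out = beta.copy()
    out[alias_bin:] = 1.0
    return out
```

Coefficients below the aliasing band are left untouched in both variants, matching the description that only the band causing spatial aliasing is changed.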
[0026]
When the distance between the first microphone and the second microphone, that is, the microphone spacing, is wide, spatial aliasing occurs, and the directivity of the target sound emphasis then also emphasizes sounds from directions other than that of the target sound. Among the correction coefficients calculated for each frequency by the correction coefficient calculation unit, peaks can arise at specific frequencies within the band causing spatial aliasing. If these correction coefficients are used as they are, the peaks formed at specific frequencies adversely affect the output sound and degrade the sound quality.
[0027]
In the present invention, the correction coefficients of the band causing spatial aliasing are changed so as to flatten the peaks that can arise at specific frequencies, so the adverse effect of these peaks on the output sound can be reduced and the degradation of the sound quality can be suppressed. This enables noise removal processing that is independent of the microphone spacing.
[0028]
The present invention may further include a target sound section detection unit that detects sections containing the target sound based on, for example, the target sound estimation signal obtained by the target sound emphasizing unit and the noise estimation signal obtained by the noise estimation unit; the correction coefficient calculation unit then calculates the correction coefficient in sections where there is no target sound, based on the target sound section information obtained by the target sound section detection unit. In this case, the target sound estimation signal contains only the noise component, so the correction coefficient can be calculated accurately without being influenced by the target sound.
[0029]
For example, the target sound section detection unit determines the energy ratio of the target sound estimation signal to the noise estimation signal and determines that a section is a target sound section when this energy ratio is larger than a threshold. Also, for example, the correction coefficient calculation unit calculates the correction coefficient β(f, t) of frame t at the f-th frequency using the target sound estimation signal Z(f, t) and the noise estimation signal N(f, t) of frame t at the f-th frequency, together with the correction coefficient β(f, t-1) of frame t-1 at the f-th frequency.
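The translation does not reproduce the update equation itself. As one plausible reconstruction, a recursive update of β(f, t) from β(f, t-1), Z(f, t), and N(f, t) could smooth the instantaneous gain ratio over frames; the exponential form, the smoothing constant alpha, and the eps guard below are all assumptions:

```python
def update_beta(beta_prev, Z_ft, N_ft, alpha=0.9, eps=1e-12):
    """Hypothetical per-frequency update of the correction coefficient:
    beta(f, t) = alpha * beta(f, t-1) + (1 - alpha) * |Z(f, t)| / |N(f, t)|.
    The patent's exact equation is not reproduced in this translation."""
    ratio = abs(Z_ft) / (abs(N_ft) + eps)  # instantaneous gain ratio
    return alpha * beta_prev + (1.0 - alpha) * ratio
```

Whatever the exact form, the recursion on β(f, t-1) matches the text's statement that the previous frame's coefficient enters the calculation, which stabilizes the estimate in noise-only sections.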
[0030]
According to another concept of the present invention, there is provided a noise removal apparatus comprising: a target sound emphasizing unit that performs target sound emphasis processing on the observation signals of a first microphone and a second microphone arranged at a predetermined interval to obtain a target sound estimation signal; a noise estimation unit that performs noise estimation processing on the observation signals of the first microphone and the second microphone to obtain a noise estimation signal; a post-filtering unit that removes the noise components remaining in the target sound estimation signal obtained by the target sound emphasizing unit by post-filtering processing using the noise estimation signal obtained by the noise estimation unit; a correction coefficient calculation unit that calculates, for each frequency, a correction coefficient for correcting the post-filtering process performed by the post-filtering unit, based on the target sound estimation signal obtained by the target sound emphasizing unit and the noise estimation signal obtained by the noise estimation unit; an ambient noise state estimation unit that processes the observation signals of the first microphone and the second microphone to obtain information on the number of ambient noise sources; and a correction coefficient changing unit that, based on the sound source number information obtained by the ambient noise state estimation unit, increases the number of smoothing frames as the number of sound sources increases and smooths the correction coefficients calculated by the correction coefficient calculation unit in the frame direction, obtaining a changed correction coefficient for each frame.
[0031]
In the present invention, the target sound emphasizing unit subjects the observation signals of the first microphone and the second microphone, arranged at a predetermined interval, to target sound emphasis processing to obtain the target sound estimation signal.
As the target sound emphasis processing, conventionally known processing such as DS (Delay and Sum) processing or adaptive beamformer processing is used, for example. Further, the noise estimation unit performs noise estimation processing on the observation signals of the first microphone and the second microphone to obtain the noise estimation signal. As the noise estimation processing, conventionally known processing such as NBF (Null Beamformer) processing or adaptive beamformer processing is used, for example.
[0032]
The post-filtering unit removes the noise components remaining in the target sound estimation signal obtained by the target sound emphasizing unit by post-filtering processing using the noise estimation signal obtained by the noise estimation unit. As the post-filtering processing, conventionally known methods such as spectral subtraction or the MMSE-STSA method are used, for example. Further, the correction coefficient calculation unit calculates, for each frequency, a correction coefficient for correcting the post-filtering process performed by the post-filtering unit, based on the target sound estimation signal obtained by the target sound emphasizing unit and the noise estimation signal obtained by the noise estimation unit.
[0033]
The ambient noise state estimation unit processes the observation signals of the first microphone and the second microphone to obtain information on the number of ambient noise sources. For example, the ambient noise state estimation unit calculates the correlation coefficient of the observation signals of the first microphone and the second microphone and uses the calculated correlation coefficient as the sound source number information of the ambient noise. Based on this information, the correction coefficient changing unit increases the number of smoothing frames as the number of sound sources increases, and smooths the correction coefficients calculated by the correction coefficient calculation unit in the frame direction, obtaining a changed correction coefficient for each frame.
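The estimation step above can be sketched as follows. The correlation coefficient is as described in the text; the mapping from correlation to smoothing frame count is a hypothetical linear stand-in for the calculation function shown in the patent's figures, and the limits gamma_min and gamma_max are assumptions:

```python
import numpy as np

def noise_source_indicator(x1, x2):
    """Correlation coefficient corr of the two observation signals. It tends
    toward 1 for a single directional source and falls as the noise field
    becomes more diffuse (more sources)."""
    return float(np.corrcoef(x1, x2)[0, 1])

def smoothing_frame_count(corr, gamma_min=1, gamma_max=20):
    """Map corr to a smoothing frame number gamma: lower correlation
    (more sources) means more frames are averaged. Linear mapping and
    limits are illustrative only."""
    c = min(max(corr, 0.0), 1.0)  # clamp to [0, 1]
    return int(round(gamma_min + (gamma_max - gamma_min) * (1.0 - c)))
```

With a single source the coefficient barely needs smoothing (gamma near 1); in a diffuse field the correlation drops and gamma grows, stabilizing the correction coefficient over time.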
[0034]
In a situation where there are innumerable noise sources in the surroundings, the sound from each noise source arrives at random in each frame and at each frequency, so the locations where the gain should be matched between the directivity characteristic of the target sound emphasis and that of the noise estimation shift in each frame and at each frequency. That is, the correction coefficient calculated by the correction coefficient calculation unit changes constantly over time and becomes unstable, which adversely affects the output sound.
[0035]
In the present invention, the number of smoothing frames is increased as the number of ambient noise sources increases, and the coefficient smoothed in the frame direction is used as the correction coefficient of each frame. This makes it possible to suppress the variation of the correction coefficient in the time direction and to reduce its influence on the output sound in a situation where there are innumerable noise sources in the surroundings. Noise removal processing can thus be performed in accordance with the ambient noise conditions (a realistic environment containing innumerable ambient noise sources).
[0036]
According to the present invention, the correction coefficients of the band causing spatial aliasing are changed so as to flatten the peaks that can arise at specific frequencies; the adverse effect of these peaks on the output sound can thereby be reduced, the degradation of the sound quality can be suppressed, and noise removal processing independent of the microphone spacing becomes possible. Further, according to the present invention, the number of smoothing frames is increased as the number of ambient noise sources increases, and the coefficient smoothed in the frame direction is used as the correction coefficient of each frame; in a situation where there are innumerable noise sources in the surroundings, the variation of the correction coefficient in the time direction can be suppressed to reduce its influence on the output sound, and noise removal processing in accordance with the ambient noise conditions becomes possible.
[0037]
FIG. 1 is a block diagram showing a configuration example of a voice input system as a first embodiment of the present invention. FIG. 2 is a diagram for explaining the target sound emphasizing unit. FIG. 3 is a diagram for explaining the noise estimation unit. FIG. 4 is a diagram for explaining the post-filtering unit. FIG. 5 is a diagram for explaining the correction coefficient calculation unit. FIG. 6 is a diagram showing an example of the correction coefficients for each frequency calculated by the correction coefficient calculation unit (microphone spacing d = 2 cm, no spatial aliasing). FIG. 7 is a diagram showing an example of the correction coefficients for each frequency (microphone spacing d = 20 cm, with spatial aliasing). FIG. 8 is a diagram showing noise (a female speaker) at an azimuth of 45 degrees. FIG. 9 is a diagram showing an example of the correction coefficients for each frequency (microphone spacing d = 2 cm, no spatial aliasing, two noise sources). FIG. 10 is a diagram showing an example of the correction coefficients for each frequency (microphone spacing d = 20 cm, with spatial aliasing, two noise sources). FIG. 11 is a diagram showing noise (a female speaker) at an azimuth of 45 degrees and noise (a male speaker) at an azimuth of -30 degrees. FIGS. 12 and 13 are diagrams for explaining the first method, smoothing in the frequency direction, for changing the coefficients of the band causing spatial aliasing so as to flatten the peaks that can arise at specific frequencies. FIG. 14 is a diagram for explaining the second method, replacement with 1, for the same purpose. FIG. 15 is a flowchart showing the processing procedure in the correction coefficient changing unit. FIG. 16 is a block diagram showing a configuration example of a voice input system as a second embodiment of the present invention. FIG. 17 is a bar graph showing an example of the relationship between the number of noise sources and the correlation coefficient corr. FIG. 18 is a diagram showing an example of the correction coefficients for each frequency calculated by the correction coefficient calculation unit (microphone spacing d = 2 cm) when noise is present at an azimuth of 45 degrees. FIG. 19 is a diagram showing noise at an azimuth of 45 degrees. FIG. 20 is a diagram showing an example of the correction coefficients for each frequency (microphone spacing d = 2 cm) when noise is present in several directions. FIG. 21 is a diagram showing noise present in several directions.
FIG. 22 is a diagram showing that the correction coefficient calculated by the correction coefficient calculation unit changes at random in every frame. FIG. 23 is a diagram showing an example of the smoothing frame number calculation function used to obtain the smoothing frame number γ based on the correlation coefficient corr (the sound source number information of the ambient noise). FIG. 24 is a diagram for explaining how the changed correction coefficients are obtained by smoothing the correction coefficients calculated by the correction coefficient calculation unit in the frame direction (time direction). FIG. 25 is a flowchart showing the processing procedure in the ambient noise state estimation unit and the correction coefficient changing unit. FIG. 26 is a block diagram showing a configuration example of a voice input system as a third embodiment of the present invention. FIG. 27 is a flowchart showing the processing procedure in the ambient noise state estimation unit and the correction coefficient changing unit of that embodiment. FIG. 28 is a block diagram showing a configuration example of a voice input system as a fourth embodiment of the present invention. FIG. 29 is a diagram for explaining the target sound detection unit. FIG. 30 is a diagram for explaining the principle of the target sound detection unit. FIG. 31 is a block diagram showing a configuration example of a conventional noise removal apparatus. FIG. 32 is a diagram showing examples of the directivity characteristics of the target sound emphasis and of the noise estimation before and after correction in the conventional noise removal apparatus. FIG. 33 is a diagram showing an example of the directivity of the filter when spatial aliasing occurs. FIG. 34 is a diagram showing a situation where innumerable noise sources exist in the surroundings.
[0038]
Hereinafter, modes for carrying out the invention (hereinafter referred to as “embodiments”) will be described. The description will be given in the following order. 1. First embodiment 2. Second embodiment 3. Third embodiment 4. Fourth embodiment 5. Modified examples
[0039]
<1. First Embodiment> [Configuration Example of Voice Input System] FIG. 1 shows a configuration example of a voice input system 100 according to the first embodiment. The voice input system 100 performs voice input using the noise canceling microphones installed on the left and right earpieces of noise canceling headphones.
[0040]
The voice input system 100 includes microphones 101a and 101b, an A/D converter 102, a frame division unit 103, a fast Fourier transform (FFT) unit 104, a target sound emphasizing unit 105, and a noise estimation unit (target sound suppression unit) 106. The voice input system 100 further includes a correction coefficient calculation unit 107, a correction coefficient changing unit 108, a post-filtering unit 109, an inverse fast Fourier transform (IFFT) unit 110, and a waveform synthesis unit 111.
[0041]
The microphones 101a and 101b collect ambient sound to obtain observation signals. The microphones 101a and 101b are arranged side by side at a predetermined interval. In this embodiment, they are the noise canceling microphones installed on the left and right earpieces of the noise canceling headphones.
[0042]
The A/D converter 102 converts the observation signals obtained from the microphones 101a and 101b from analog signals to digital signals. The frame division unit 103 divides the digitized observation signals into frames of a predetermined time length so that processing can be performed on a frame-by-frame basis. The fast Fourier transform unit 104 performs fast Fourier transform (FFT) processing on the framed signal obtained by the frame division unit 103 to transform it into a frequency spectrum X(f, t) in the frequency domain. Here, (f, t) indicates the frequency spectrum of frame t at the f-th frequency; that is, f is the frequency index and t is the time index.
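The framing-plus-FFT stage can be sketched as follows. The frame length, hop size, and window are illustrative assumptions; the patent does not specify them:

```python
import numpy as np

def observation_spectra(x, frame_len=512, hop=256):
    """Divide the digitized observation signal into overlapping frames and
    FFT each one, yielding X(f, t): rows are frequency bins f, columns are
    frames t. Frame parameters and the Hann window are illustrative."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    X = np.empty((frame_len // 2 + 1, n_frames), dtype=complex)
    for t in range(n_frames):
        frame = x[t * hop : t * hop + frame_len] * window
        X[:, t] = np.fft.rfft(frame)  # real-input FFT: bins 0 .. frame_len/2
    return X
```

All downstream processing (target sound emphasis, noise estimation, post-filtering) then operates on these per-frequency, per-frame spectra.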
[0043]
The target sound emphasizing unit 105 performs target sound emphasis processing on the observation signals of the microphones 101a and 101b to obtain a target sound estimation signal for each frequency in each frame. As shown in FIG. 2, with the observation signal of the microphone 101a denoted X1(f, t) and the observation signal of the microphone 101b denoted X2(f, t), the target sound emphasizing unit 105 obtains the target sound estimation signal Z(f, t). It uses, for example, conventionally known DS (Delay and Sum) processing or adaptive beamformer processing as the target sound emphasis processing.
[0044]
DS is a technique that aligns the phases of the signals input to the microphones 101a and 101b toward the direction of the target sound. The microphones 101a and 101b are the noise canceling microphones installed on the left and right earpieces of the noise canceling headphones, and the user's mouth is always to the front as viewed from the microphones 101a and 101b.
[0045]
Therefore, when DS processing is used, the target sound emphasizing unit 105 adds the observation signal X1(f, t) and the observation signal X2(f, t) and divides the sum by two, as in the following equation (3), to obtain the target sound estimation signal Z(f, t). Z(f, t) = {X1(f, t) + X2(f, t)} / 2 ・・・(3)
[0046]
DS is a technique called a fixed beamformer, which controls the directivity by changing the phase of the input signals. When the microphone spacing is known in advance, the target sound emphasizing unit 105 can also obtain the target sound estimation signal Z(f, t) using processing such as adaptive beamformer processing instead of the DS processing described above.
[0047]
Returning to FIG. 1, the noise estimation unit (target sound suppression unit) 106 performs noise estimation processing on the observation signals of the microphones 101a and 101b to obtain a noise estimation signal for each frequency in each frame. The noise estimation unit 106 treats sounds other than the target sound (the user's voice) as noise; that is, it performs processing that removes only the target sound and leaves the noise.
[0048]
As shown in FIG. 3, when the observation signal of the microphone 101a is X1 (f, t) and the
observation signal of the microphone 101b is X2 (f, t), the noise estimation unit 106 obtains the
noise estimation signal N (f, t). The noise estimation unit 106 uses, for example, a conventionally
known NBF (Null-Beam Former) process, an adaptive beamformer process, or the like as the
noise estimation process.
[0049]
As described above, the microphones 101a and 101b are noise canceling microphones installed
on the left and right headphones of the noise canceling headphone, and the user's mouth is
always to the front as viewed from the microphones 101a and 101b. Therefore, when using the
NBF process, the noise estimation unit 106 subtracts the observation signal X2 (f, t) from the
observation signal X1 (f, t) based on the following equation (4) and then divides by two to obtain
the noise estimation signal N (f, t). N (f, t) = {X1 (f, t) - X2 (f, t)} / 2 (4)
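The NBF processing of equation (4) can be sketched in the same way; again the function name and NumPy usage are illustrative assumptions.

```python
import numpy as np

def null_beamformer(x1, x2):
    """Noise estimation per equation (4): half the difference of the
    frequency-domain observation signals. The in-phase target sound
    cancels, leaving an estimate of the off-axis noise components."""
    return (np.asarray(x1) - np.asarray(x2)) / 2
```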
[0050]
NBF is a technique called a fixed beamformer, which controls the directivity by changing the
phase of an input signal. When the microphone interval is known in advance, the noise
estimation unit 106 can also obtain the noise estimation signal N (f, t) using processing such as
adaptive beamformer processing instead of the NBF processing described above.
[0051]
Returning to FIG. 1, the post-filtering unit 109 removes the noise components remaining in the
target sound estimation signal Z (f, t) obtained by the target sound enhancement unit 105 by
post-filtering processing using the noise estimation signal N (f, t) obtained by the noise
estimation unit 106. That is, as shown in FIG. 4, the post-filtering unit 109 obtains the noise
suppression signal Y (f, t) based on the target sound estimation signal Z (f, t) and the noise
estimation signal N (f, t).
The post-filtering unit 109 obtains the noise suppression signal Y (f, t) using a known technique
such as the spectral subtraction method or the MMSE-STSA method. The spectral subtraction
method is described, for example, in S. F. Boll, "Suppression of acoustic noise in speech using
spectral subtraction," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp.
113-120, 1979. The MMSE-STSA method is described in Y. Ephraim and D. Malah, "Speech
enhancement using a minimum mean-square error short-time spectral amplitude estimator,"
IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109-1121, 1984.
[0053]
Referring back to FIG. 1, the correction coefficient calculation unit 107 calculates the correction
coefficient β (f, t) for each frequency in each frame. The correction coefficient β (f, t) is used to
correct the post-filtering process performed by the above-described post-filtering unit 109, that
is, to match the gain of the noise component remaining in the target sound estimation signal Z (f,
t) with the gain of the noise estimation signal N (f, t). As shown in FIG. 5, the correction
coefficient calculation unit 107 calculates the correction coefficient β (f, t) for each frequency in
each frame based on the target sound estimation signal Z (f, t) obtained by the target sound
enhancement unit 105 and the noise estimation signal N (f, t) obtained by the noise estimation
unit 106.
[0054]
In this embodiment, the correction coefficient calculation unit 107 calculates the correction
coefficient β (f, t) based on the following equation (5).
[0055]
Because the correction coefficient would vary from frame to frame if computed from the current
frame alone, the correction coefficient calculation unit 107 obtains a stable correction
coefficient β (f, t) by smoothing with the correction coefficient β (f, t-1) of the previous frame.
The first term on the right side of equation (5) carries over the correction coefficient β (f, t-1)
of the previous frame, and the second term on the right side of equation (5) computes the
coefficient of the current frame. Note that α is a smoothing coefficient, a fixed value such as 0.9
or 0.95, which places the weight on the previous frame.
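Since the equation (5) image is not reproduced in this translation, the sketch below assumes a common form consistent with the description: an exponential smoothing whose current-frame term is taken to be the magnitude ratio |Z (f, t)| / |N (f, t)|. Both that ratio and the eps guard are assumptions, not quotations of the patent's formula.

```python
import numpy as np

def update_correction(beta_prev, Z, N, alpha=0.9, eps=1e-12):
    """Hedged reconstruction of equation (5): the first term carries
    the previous frame's coefficient, the second term is the current
    frame's estimate, here assumed to be the magnitude ratio |Z|/|N|.
    The eps guard against division by zero is also an assumption."""
    current = np.abs(Z) / (np.abs(N) + eps)
    return alpha * np.asarray(beta_prev) + (1 - alpha) * current
```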
[0056]
When the noise suppression signal Y (f, t) is obtained using the known spectral subtraction
technique, the above-described post-filtering unit 109 uses the correction coefficient β (f, t) as
in the following equation (6). In this case, the post-filtering unit 109 corrects the noise
estimation signal N (f, t) by multiplying it by the correction coefficient β (f, t). In equation (6),
when the correction coefficient β (f, t) = 1, no correction is performed.
Y (f, t) = Z (f, t) - β (f, t) * N (f, t) (6)
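Equation (6) can be sketched as follows. Whether the subtraction acts on complex spectra or on magnitudes is not fixed by the equation as written; this sketch subtracts directly.

```python
import numpy as np

def post_filter(Z, N, beta):
    """Post filtering per equation (6): subtract the corrected noise
    estimate from the target sound estimate, per frequency bin."""
    return np.asarray(Z) - np.asarray(beta) * np.asarray(N)
```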
[0057]
In each frame, the correction coefficient changing unit 108 changes, among the correction
coefficients β (f, t) calculated by the correction coefficient calculation unit 107, the coefficients
of the band causing spatial aliasing so as to crush the peak that can arise at a specific frequency.
In practice, the post-filtering unit 109 uses not the correction coefficient β (f, t) itself calculated
by the correction coefficient calculation unit 107 but the changed correction coefficient β ′ (f,
t).
[0058]
As described above, when the microphone interval is wide, spatial aliasing folds back the
directivity curve, so that the directivity characteristic of the target sound emphasis also
emphasizes sound from azimuths other than that of the target sound. As a result, among the
correction coefficients of each frequency calculated by the correction coefficient calculation unit
107, a peak can arise at a specific frequency in the band causing spatial aliasing. If this
correction coefficient is used as it is, the peak formed at a specific frequency adversely affects
the output sound and degrades the sound quality.
[0059]
FIG. 6 and FIG. 7 show an example of the correction coefficient when noise (female speaker)
exists in the 45 ° azimuth as shown in FIG. FIG. 6 shows the case where the microphone spacing
d is 2 cm and there is no spatial aliasing. On the other hand, FIG. 7 shows the case where the
microphone spacing d is 20 cm and there is spatial aliasing, and a peak is generated at a specific
frequency.
[0060]
The example of the correction coefficients in FIGS. 6 and 7 described above shows the case
where there is one noise source. In a real environment, however, there is more than one. FIGS. 9
and 10 show examples of the correction coefficient when, as shown in FIG. 11, noise (a female
speaker) is present at an azimuth of 45° and noise (a male speaker) is present at an azimuth of
−30°.
[0061]
FIG. 9 shows the case where the microphone distance d is 2 cm and there is no spatial aliasing.
On the other hand, FIG. 10 shows the case where the microphone distance d is 20 cm and there
is spatial aliasing, and a peak is generated at a specific frequency. In this case, although the
peaks of the coefficient are more complicated than in the case of one noise source (see FIG. 7),
there is still a frequency at which the value of the coefficient falls, as in the case of one noise
source.
[0062]
The correction coefficient changing unit 108 checks the correction coefficients β (f, t)
calculated by the correction coefficient calculation unit 107 and finds the first frequency Fa (t)
on the low-frequency side at which the value of the coefficient falls. As shown in FIG. 7 and FIG.
10, the correction coefficient changing unit 108 determines that spatial aliasing occurs in the
band of Fa (t) or higher. Then, as described above, the correction coefficient changing unit 108
changes, among the correction coefficients β (f, t) calculated by the correction coefficient
calculation unit 107, the coefficients of the band causing spatial aliasing so as to collapse the
peak that can arise at a specific frequency.
[0063]
The correction coefficient changing unit 108 changes the correction coefficients of the band
causing spatial aliasing using, for example, a first method or a second method. When the first
method is used, the correction coefficient changing unit 108 obtains the changed correction
coefficient β ′ (f, t) of each frequency as follows. As shown in FIGS. 12 and 13, the correction
coefficient changing unit 108 smooths, among the correction coefficients β (f, t) calculated by
the correction coefficient calculation unit 107, the correction coefficients of the band causing
spatial aliasing in the frequency direction to obtain the changed correction coefficient β ′ (f, t)
for each frequency.
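The first method can be sketched as follows; the moving-average window stands in for the arbitrary section length mentioned in the text, and the function name and NumPy usage are assumptions.

```python
import numpy as np

def smooth_aliasing_band(beta, fa_idx, win=5):
    """First method (FIGS. 12 and 13): smooth the correction
    coefficients in the frequency direction, but only in the band at
    and above the aliasing onset bin fa_idx. The moving-average
    window `win` corresponds to the arbitrary section length."""
    beta = np.asarray(beta, dtype=float)
    out = beta.copy()
    smoothed = np.convolve(beta, np.ones(win) / win, mode="same")
    out[fa_idx:] = smoothed[fa_idx:]  # bins below fa_idx stay untouched
    return out
```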
[0064]
By smoothing in the frequency direction in this way, it is possible to crush the coefficient peaks
that appear excessively. The section length of the smoothing can be set arbitrarily; in FIG. 12 a
short arrow indicates that the section length is set short, while in FIG. 13 a long arrow indicates
that the section length is set long.
[0065]
On the other hand, when the second method is used, the correction coefficient changing unit
108 replaces, among the correction coefficients β (f, t) calculated by the correction coefficient
calculation unit 107, the correction coefficients of the band causing spatial aliasing with 1 as
shown in FIG. 14 to obtain the changed correction coefficient β ′ (f, t). Note that FIG. 14 uses
logarithmic notation, so the value appears as 0 rather than 1. The second method exploits the
fact that the correction coefficient approaches 1 when the smoothing in the first method is made
extremely long. This second method has the advantage that the smoothing operation can be
omitted.
[0066]
The flowchart of FIG. 15 shows the procedure of the process (for one frame) in the correction
coefficient changing unit 108. The correction coefficient changing unit 108 starts the process in
step ST1, and then proceeds to the process of step ST2. In step ST2, the correction coefficient
changing unit 108 acquires the correction coefficient β (f, t) from the correction coefficient
calculation unit 107. Then, in step ST3, the correction coefficient changing unit 108 searches the
coefficients of each frequency f from the low band in the current frame t and locates the first
frequency Fa (t) on the low-band side at which the coefficient value drops.
[0067]
Next, in step ST4, the correction coefficient changing unit 108 checks a flag indicating whether
or not to smooth the band higher than Fa (t), that is, the band causing spatial aliasing. Note that
this flag is set in advance by a user operation. When the flag is on, the correction coefficient
changing unit 108 smooths, in step ST5, the coefficients of the band of Fa (t) or higher among
the correction coefficients β (f, t) calculated by the correction coefficient calculation unit 107 in
the frequency direction to obtain the changed correction coefficient β ′ (f, t) for each
frequency f. After the process of step ST5, the correction coefficient changing unit 108 ends the
process in step ST6.
[0068]
On the other hand, when the flag is off in step ST4, the correction coefficient changing unit 108
replaces, in step ST7, the correction coefficients of the band of Fa (t) or higher among the
correction coefficients β (f, t) calculated by the correction coefficient calculation unit 107 with
"1" to obtain the changed correction coefficient β ′ (f, t). After the process of step ST7, the
correction coefficient changing unit 108 ends the process in step ST6.
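The steps ST2 to ST7 above can be sketched as follows. The rule used to detect the first coefficient drop (a fixed ratio threshold) is an assumption, since the text only says "the first frequency at which the coefficient value drops."

```python
import numpy as np

def change_correction(beta, smooth_flag, win=5, drop_ratio=0.5):
    """Sketch of flowchart steps ST2-ST7: locate the first low-band
    bin Fa where the coefficient value drops sharply, then either
    smooth the band above Fa in the frequency direction (flag on,
    ST5) or replace it with 1 (flag off, ST7)."""
    beta = np.asarray(beta, dtype=float)
    fa = len(beta)  # default: no aliasing band found
    for f in range(1, len(beta)):  # ST3: search from the low band
        if beta[f] < drop_ratio * beta[f - 1]:
            fa = f
            break
    out = beta.copy()
    if fa < len(beta):
        if smooth_flag:  # ST5: frequency-direction smoothing
            out[fa:] = np.convolve(beta, np.ones(win) / win, mode="same")[fa:]
        else:            # ST7: replace with 1
            out[fa:] = 1.0
    return out
```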
[0069]
Returning to FIG. 1, the inverse fast Fourier transform (IFFT) unit 110 performs an inverse fast
Fourier transform process on the noise suppression signal Y (f, t) output from the post filtering
unit 109 for each frame. The inverse fast Fourier transform unit 110 performs processing
reverse to that of the above-described Fourier transform unit 104, converts a frequency domain
signal into a time domain signal, and obtains a framing signal.
[0070]
The waveform synthesis unit 111 synthesizes the framed signals of the respective frames
obtained by the inverse fast Fourier transform unit 110 and restores them into a time-series
continuous speech signal. The waveform synthesis unit 111 constitutes a frame synthesis unit.
The waveform synthesis unit 111 outputs the noise-suppressed speech signal SAout as an output
of the speech input system 100.
[0071]
The operation of the voice input system 100 shown in FIG. 1 will be briefly described. Ambient
sound is collected by the microphones 101a and 101b arranged at predetermined intervals, and
an observation signal is obtained. The observation signals obtained by the microphones 101 a
and 101 b are supplied from the A / D converter 102 to the frame division unit 103 after being
converted from analog signals to digital signals. Then, in the frame division unit 103, the
observation signals from the microphones 101a and 101b are divided into frames of a
predetermined time length and framed.
[0072]
The framed signals of each frame obtained by the frame division unit 103 are sequentially
supplied to the fast Fourier transform unit 104. The fast Fourier transform unit 104 performs
fast Fourier transform (FFT) processing on each framed signal to obtain, as signals in the
frequency domain, the observation signal X1 (f, t) of the microphone 101a and the observation
signal X2 (f, t) of the microphone 101b.
[0073]
The observation signals X1 (f, t) and X2 (f, t) obtained by the fast Fourier transform unit 104 are
supplied to the target sound emphasis unit 105. The target sound emphasizing unit 105 subjects
the observation signals X1 (f, t) and X2 (f, t) to conventionally known DS processing, adaptive
beamformer processing, or the like to obtain the target sound estimation signal Z (f, t). For
example, when DS processing is used, the observation signal X1 (f, t) and the observation signal
X2 (f, t) are added and then divided by 2 to obtain the target sound estimation signal Z (f, t) (see
equation (3)).
[0074]
The observation signals X1 (f, t) and X2 (f, t) obtained by the fast Fourier transform unit 104 are
also supplied to the noise estimation unit 106. The noise estimation unit 106 subjects the
observation signals X1 (f, t) and X2 (f, t) to conventionally known NBF processing, adaptive
beamformer processing, or the like to obtain the noise estimation signal N (f, t). For example,
when NBF processing is used, the observation signal X2 (f, t) is subtracted from the observation
signal X1 (f, t) and the result is divided by 2 to obtain the noise estimation signal N (f, t) (see
equation (4)).
[0075]
The target sound estimation signal Z (f, t) obtained by the target sound enhancement unit 105
and the noise estimation signal N (f, t) obtained by the noise estimation unit 106 are supplied to
the correction coefficient calculation unit 107. In the correction coefficient calculation unit 107,
the correction coefficients β (f, t) for correcting the post-filtering process are calculated for
each frequency in each frame based on the target sound estimation signal Z (f, t) and the noise
estimation signal N (f, t) (see equation (5)).
[0076]
The correction coefficients β (f, t) calculated by the correction coefficient calculation unit 107
are supplied to the correction coefficient changing unit 108. Among the correction coefficients
β (f, t) calculated by the correction coefficient calculation unit 107, the correction coefficient
changing unit 108 changes the coefficients of the band causing spatial aliasing so as to crush the
peak that can arise at a specific frequency, thereby obtaining the changed correction coefficient
β ′ (f, t).
[0077]
In the correction coefficient changing unit 108, the correction coefficients β (f, t) calculated by
the correction coefficient calculation unit 107 are checked to find the first frequency Fa (t) on
the low-frequency side at which the coefficient value drops, and it is determined that spatial
aliasing occurs in the band higher than Fa (t). Then, in the correction coefficient changing unit
108, among the correction coefficients β (f, t) calculated by the correction coefficient
calculation unit 107, the coefficients of the band higher than Fa (t) are changed so as to collapse
the peak that can arise at a specific frequency.
[0078]
For example, among the correction coefficients β (f, t) calculated by the correction coefficient
calculation unit 107, the correction coefficients of the band of Fa (t) or higher are smoothed in
the frequency direction, and the changed correction coefficient β ′ (f, t) is obtained for each
frequency (see FIGS. 12 and 13). Alternatively, among the correction coefficients β (f, t)
calculated by the correction coefficient calculation unit 107, the correction coefficients of the
band of Fa (t) or higher are replaced by 1 to obtain the changed correction coefficient β ′ (f, t)
(see FIG. 14).
[0079]
The target sound estimation signal Z (f, t) obtained by the target sound enhancement unit 105
and the noise estimation signal N (f, t) obtained by the noise estimation unit 106 are supplied to
the post-filtering unit 109, together with the correction coefficient β ′ (f, t) changed by the
correction coefficient changing unit 108. In the post-filtering unit 109, the noise components
remaining in the target sound estimation signal Z (f, t) are removed by post-filtering processing
using the noise estimation signal N (f, t). The correction coefficient β ′ (f, t) is used to correct
this post-filtering process, that is, to match the gain of the noise component remaining in the
target sound estimation signal Z (f, t) with the gain of the noise estimation signal N (f, t).
[0080]
In this post-filtering unit 109, for example, a known technique such as a spectral subtraction
method or an MMSE-STSA method is used to obtain the noise suppression signal Y (f, t). For
example, when the spectral subtraction method is used, the noise suppression signal Y (f, t) is
obtained based on the following equation (7). Y (f, t) = Z (f, t)-β '(f, t) * N (f, t) (7)
[0081]
The noise suppression signal Y (f, t) of each frequency output for each frame from the
post-filtering unit 109 is supplied to the inverse fast Fourier transform unit 110. The inverse fast
Fourier transform unit 110 performs inverse fast Fourier transform processing on the noise
suppression signal Y (f, t) of each frequency for each frame to obtain a framed signal converted
into a time-domain signal. The framed signals of each frame are sequentially supplied to the
waveform synthesis unit 111. The waveform synthesis unit 111 synthesizes the framed signals of
the respective frames to obtain the noise-suppressed speech signal SAout, continuous in time
series, as the output of the speech input system 100.
[0082]
As described above, in the voice input system 100 shown in FIG. 1, the correction coefficient β
(f, t) calculated by the correction coefficient calculation unit 107 is changed by the correction
coefficient changing unit 108. In this case, among the correction coefficients β (f, t) calculated
by the correction coefficient calculation unit 107, the coefficients of the band causing spatial
aliasing (the band higher than Fa (t)) are changed so as to collapse the peak that can arise at a
specific frequency, and the changed correction coefficient β ′ (f, t) is obtained. The
post-filtering unit 109 uses this changed correction coefficient β ′ (f, t).
[0083]
Therefore, the adverse effect that a coefficient peak arising at a specific frequency of the band
causing spatial aliasing has on the output sound can be reduced, and the deterioration of the
sound quality can be suppressed. This enables noise removal processing that does not depend on
the microphone spacing. Accordingly, even though the microphones 101a and 101b are noise
cancellation microphones installed in the headphones and the microphone spacing is wide, the
correction can be performed efficiently and noise removal processing with little distortion is
achieved.
[0084]
<2. Second Embodiment> [Configuration Example of Voice Input System] FIG. 16 shows a
configuration example of a voice input system 100A according to a second embodiment.
Similarly to the voice input system 100 shown in FIG. 1 described above, the voice input system
100A is also a system that performs voice input using noise canceling microphones installed in
the left and right headphones of the noise canceling headphone. In FIG. 16, parts corresponding
to FIG. 1 are given the same reference numerals, and the detailed description thereof will be
omitted as appropriate.
[0085]
The voice input system 100A includes microphones 101a and 101b, an A / D converter 102, a
frame division unit 103, a fast Fourier transform (FFT) unit 104, a target sound enhancement
unit 105, and a noise estimation unit 106. In addition, the voice input system 100A includes a
correction coefficient calculation unit 107, a post-filtering unit 109, an inverse fast Fourier
transform (IFFT) unit 110, a waveform synthesis unit 111, an ambient noise state estimation unit
112, and a correction coefficient changing unit 113.
[0086]
The ambient noise state estimation unit 112 processes the observation signals of the
microphones 101a and 101b to obtain information on the number of sources of ambient noise.
The ambient noise state estimation unit 112 calculates, for each frame, the correlation
coefficient corr of the observation signal of the microphone 101a and the observation signal of
the microphone 101b based on the following equation (8), and uses it as the
sound-source-number information of the ambient noise. In equation (8), x1 (n) indicates the
time-axis data of the microphone 101a, x2 (n) indicates the time-axis data of the microphone
101b, and N indicates the number of samples.
[0087]
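Since the equation (8) image is not reproduced in this translation, the sketch below assumes the standard zero-lag normalized cross-correlation consistent with the surrounding description; the exact normalization is an assumption.

```python
import numpy as np

def noise_source_correlation(x1, x2):
    """Hedged reconstruction of equation (8): the zero-lag normalized
    cross-correlation of the two framed time-axis signals x1(n) and
    x2(n) over N samples."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    denom = np.sqrt(np.sum(x1 ** 2) * np.sum(x2 ** 2))
    return float(np.sum(x1 * x2) / denom) if denom > 0 else 0.0
```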
[0088]
The bar graph in FIG. 17 shows an example of the relationship between the number of noise
sources and the correlation coefficient corr.
Generally, as the number of sound sources increases, the correlation between the observation
signals of the microphones 101a and 101b decreases. Theoretically, the correlation coefficient
corr approaches 0 as the number of sound sources increases. Therefore, the number of sources
of ambient noise can be estimated by the correlation coefficient corr.
[0089]
Referring back to FIG. 16, in each frame, the correction coefficient changing unit 113 changes
the correction coefficient β (f, t) calculated by the correction coefficient calculation unit 107
based on the correlation coefficient corr (the sound-source-number information of the ambient
noise) obtained by the ambient noise state estimation unit 112. That is, the correction coefficient
changing unit 113 increases the number of smoothed frames as the number of sound sources
increases, smooths the coefficients calculated by the correction coefficient calculation unit 107
in the frame direction, and obtains the changed correction coefficient β ′ (f, t). In practice, the
post-filtering unit 109 uses not the correction coefficient β (f, t) itself calculated by the
correction coefficient calculation unit 107 but the changed correction coefficient β ′ (f, t).
[0090]
FIG. 18 shows an example of the correction coefficient (microphone distance d is 2 cm) when
noise is present at the 45 ° azimuth as shown in FIG. On the other hand, as shown in FIG. 21,
FIG. 20 shows an example (microphone distance d is 2 cm) of the correction coefficient when
noise exists in a plurality of azimuths. As described above, even if the microphone spacing is a
proper spacing that does not cause spatial aliasing, the correction coefficient becomes unstable
as the number of noise sources increases. As a result, as shown in FIG. 22, the correction
coefficient changes randomly for each frame. If this correction factor is used as it is, it adversely
affects the output sound and degrades the sound quality.
[0091]
The correction coefficient changing unit 113 calculates the number of smoothed frames γ based
on the correlation coefficient corr (the sound-source-number information of the ambient noise)
obtained by the ambient noise state estimation unit 112 in each frame. The correction coefficient
changing unit 113 obtains the number of smoothed frames γ using, for example, a
smoothed-frame-number calculation function as shown in FIG. In this case, when the correlation
of the observation signals of the microphones 101a and 101b is large, that is, when the value of
the correlation coefficient corr is large, the number of smoothed frames γ is determined to be
small.
[0092]
On the other hand, when the correlation between the observation signals of the microphones
101a and 101b is small, that is, when the value of the correlation coefficient corr is small, the
number of smoothed frames γ is determined to be large. Note that the correction coefficient
changing unit 113 need not actually perform the arithmetic processing; the number of smoothed
frames γ may instead be read out according to the correlation coefficient corr from a table
storing the correspondence between the correlation coefficient corr and the number of
smoothed frames γ.
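The FIG. 23 mapping can be sketched as follows; the linear shape and the gamma_min/gamma_max bounds are illustrative assumptions, and only the monotone decrease with corr comes from the text.

```python
def smoothed_frame_count(corr, gamma_min=1, gamma_max=50):
    """Illustrative stand-in for the FIG. 23 function: the number of
    smoothed frames decreases as the correlation coefficient grows."""
    c = min(max(corr, 0.0), 1.0)  # clamp corr to [0, 1]
    return round(gamma_max - c * (gamma_max - gamma_min))
```

A lookup table indexed by quantized corr values, as the text suggests, would avoid even this small computation.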
[0093]
As shown in FIG. 24, the correction coefficient changing unit 113 smooths the correction
coefficient β (f, t) calculated by the correction coefficient calculation unit 107 in each frame in
the frame direction (time direction) to obtain the changed correction coefficient β ′ (f, t). In
this case, the smoothing is performed with the number of smoothed frames γ obtained as
described above. The correction coefficient β ′ (f, t) of each frame changed in this manner
varies gently in the frame direction (time direction).
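The frame-direction smoothing of FIG. 24 can be sketched as follows. A simple moving average over the most recent γ frames is an assumption; the figure does not fix the exact smoothing rule.

```python
from collections import deque

import numpy as np

class FrameSmoother:
    """Frame-direction (time-direction) smoothing as in FIG. 24:
    average each frequency's correction coefficient over the most
    recent gamma frames."""

    def __init__(self):
        self.history = deque()

    def smooth(self, beta, gamma):
        self.history.append(np.asarray(beta, dtype=float))
        while len(self.history) > gamma:  # keep only the last gamma frames
            self.history.popleft()
        return np.mean(np.array(list(self.history)), axis=0)
```

Passing a different gamma each frame, as the per-frame γ calculation implies, simply shrinks or grows the averaging window.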
[0094]
The flowchart in FIG. 25 shows the procedure of processing (one frame) in the ambient noise
state estimation unit 112 and the correction coefficient change unit 113. Each unit starts
processing in step ST11. Thereafter, in step ST12, the ambient noise state estimation unit 112
acquires data frames x1 (t) and x2 (t) of the observation signals of the microphones 101a and
101b. Then, in step ST13, the ambient noise state estimation unit 112 calculates a correlation
coefficient corr (t) indicating the degree of correlation of the observation signals of the
microphones 101a and 101b (see equation (8)).
[0095]
Next, in step ST14, the correction coefficient changing unit 113 calculates the number of
smoothed frames γ from the value of the correlation coefficient corr (t) calculated by the
ambient noise state estimation unit 112 in step ST13, using the smoothed-frame-number
calculation function (see FIG. 23). Then, in step ST15, the correction coefficient changing unit
113 smooths the correction coefficient β (f, t) calculated by the correction coefficient
calculation unit 107 with the number of smoothed frames γ calculated in step ST14 to obtain
the changed correction coefficient β ′ (f, t). After the process of step ST15, each unit ends the
process in step ST16.
[0096]
The rest of the voice input system 100A shown in FIG. 16 is configured in the same manner as
the voice input system 100 shown in FIG. 1 although the detailed description is omitted.
[0097]
The operation of the voice input system 100A shown in FIG. 16 will be briefly described.
Ambient sound is collected by the microphones 101a and 101b arranged at predetermined
intervals, and an observation signal is obtained. The observation signals obtained by the
microphones 101 a and 101 b are supplied from the A / D converter 102 to the frame division
unit 103 after being converted from analog signals to digital signals. Then, in the frame division
unit 103, the observation signals from the microphones 101a and 101b are divided into frames
of a predetermined time length and framed.
[0098]
The framed signals of each frame obtained by the frame division unit 103 are sequentially
supplied to the fast Fourier transform unit 104. The fast Fourier transform unit 104 performs
fast Fourier transform (FFT) processing on each framed signal to obtain, as signals in the
frequency domain, the observation signal X1 (f, t) of the microphone 101a and the observation
signal X2 (f, t) of the microphone 101b.
[0099]
The observation signals X1 (f, t) and X2 (f, t) obtained by the fast Fourier transform unit 104 are
supplied to the target sound emphasis unit 105. The target sound emphasizing unit 105 subjects
the observation signals X1 (f, t) and X2 (f, t) to conventionally known DS processing, adaptive
beamformer processing, or the like to obtain the target sound estimation signal Z (f, t). For
example, when DS processing is used, the observation signal X1 (f, t) and the observation signal
X2 (f, t) are added and then divided by 2 to obtain the target sound estimation signal Z (f, t) (see
equation (3)).
[0100]
The observation signals X1 (f, t) and X2 (f, t) obtained by the fast Fourier transform unit 104 are
also supplied to the noise estimation unit 106. The noise estimation unit 106 subjects the
observation signals X1 (f, t) and X2 (f, t) to conventionally known NBF processing, adaptive
beamformer processing, or the like to obtain the noise estimation signal N (f, t). For example,
when NBF processing is used, the observation signal X2 (f, t) is subtracted from the observation
signal X1 (f, t) and the result is divided by 2 to obtain the noise estimation signal N (f, t) (see
equation (4)).
[0101]
The target sound estimation signal Z (f, t) obtained by the target sound enhancement unit 105
and the noise estimation signal N (f, t) obtained by the noise estimation unit 106 are supplied to
the correction coefficient calculation unit 107. In the correction coefficient calculation unit 107,
the correction coefficients β (f, t) for correcting the post-filtering process are calculated for
each frequency in each frame based on the target sound estimation signal Z (f, t) and the noise
estimation signal N (f, t) (see equation (5)).
[0102]
Also, the framed signals of each frame obtained by the frame division unit 103, that is, the
observation signals x1 (n) and x2 (n) of the microphones 101a and 101b, are supplied to the
ambient noise state estimation unit 112. In the ambient noise state estimation unit 112, the
correlation coefficient corr of the observation signals x1 (n) and x2 (n) of the microphones 101a
and 101b is determined and used as the sound-source-number information of the ambient noise
(see equation (8)).
[0103]
The correction coefficient β (f, t) calculated by the correction coefficient calculation unit 107 is
supplied to the correction coefficient changing unit 113. The correction coefficient changing
unit 113 is also supplied with the correlation coefficient corr obtained by the ambient noise state
estimation unit 112. In each frame, the correction coefficient changing unit 113 changes the
correction coefficient β (f, t) calculated by the correction coefficient calculation unit 107 based
on the correlation coefficient corr (the sound-source-number information of the ambient noise)
obtained by the ambient noise state estimation unit 112.
[0104]
First, the correction coefficient changing unit 113 obtains the smoothed frame number γ based on the correlation coefficient corr. In this case, the smoothed frame number γ is small when the value of the correlation coefficient corr is large, and large when the value of the correlation coefficient corr is small (see FIG. 23). Next, the correction coefficient changing unit 113 smooths the correction coefficient β (f, t) calculated by the correction coefficient calculation unit 107 in the frame direction (time direction) over the smoothed frame number γ, thereby obtaining the changed correction coefficient β ′ (f, t) of each frame (see FIG. 24).
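The two-stage operation of the correction coefficient changing unit 113, mapping corr to a smoothed frame number γ and then averaging in the frame direction, can be sketched as below. The exact mapping of FIG. 23 is not reproduced in this excerpt, so the linear mapping and the γ range here are illustrative assumptions.

```python
import numpy as np

def smoothed_frame_count(corr, gamma_min=1, gamma_max=16):
    """Map the correlation coefficient to a smoothing length gamma.

    FIG. 23 is not reproduced here; a linear mapping is assumed:
    high |corr| (few sources) -> small gamma,
    low |corr| (many sources) -> large gamma.
    """
    c = min(max(abs(corr), 0.0), 1.0)
    return int(round(gamma_max - (gamma_max - gamma_min) * c))

def smooth_in_frame_direction(beta_history, gamma):
    """Average the correction coefficients of the last gamma frames.

    beta_history: array of shape (frames, freqs), most recent frame last.
    """
    recent = beta_history[-gamma:]
    return recent.mean(axis=0)

beta_hist = np.array([[1.0, 4.0],
                      [3.0, 4.0]])  # two frames, two frequency bins
beta_smoothed = smooth_in_frame_direction(beta_hist, gamma=2)
```

With many ambient sources (large γ), frame-to-frame fluctuation of β ′ (f, t) is averaged out, which is the suppression effect described in paragraph [0109].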
[0105]
The target sound estimation signal Z (f, t) obtained by the target sound enhancement unit 105 and the noise estimation signal N (f, t) obtained by the noise estimation unit 106 are supplied to the post filtering unit 109. The correction coefficient β ′ (f, t) changed by the correction coefficient changing unit 113 is also supplied to the post filtering unit 109. The post filtering unit 109 removes the noise components remaining in the target sound estimation signal Z (f, t) by post filtering processing using the noise estimation signal N (f, t). The correction coefficient β ′ (f, t) is used to correct this post-filtering process, that is, to match the gain of the noise component remaining in the target sound estimation signal Z (f, t) with the gain of the noise estimation signal N (f, t).
[0106]
The post filtering unit 109 obtains the noise suppression signal Y (f, t) using, for example, a known technique such as the spectral subtraction method or the MMSE-STSA method. For example, when the spectral subtraction method is used, the noise suppression signal Y (f, t) is obtained based on the following equation (9).

Y (f, t) = Z (f, t) − β ′ (f, t) * N (f, t) (9)
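Equation (9) can be sketched directly. Note that practical spectral subtraction implementations usually subtract magnitudes and floor the result at zero to avoid negative spectra; that refinement is omitted below to stay close to the equation as printed, and the function name is illustrative.

```python
import numpy as np

def post_filter(Z, N, beta):
    """Post filtering by spectral subtraction, a direct reading of
    equation (9): Y(f, t) = Z(f, t) - beta(f, t) * N(f, t).

    A practical variant would subtract magnitudes and clamp at zero;
    this sketch follows the printed equation literally.
    """
    return Z - beta * N

Z = np.array([3.0 + 0j, 1.0 + 0j])   # target sound estimate
N = np.array([1.0 + 0j, 2.0 + 0j])   # noise estimate
beta = np.array([1.0, 1.0])          # corrected coefficients
Y = post_filter(Z, N, beta)
```

The second bin illustrates why flooring is used in practice: when β ′ (f, t) * N (f, t) exceeds Z (f, t), the literal subtraction goes negative.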
[0107]
The noise suppression signal Y (f, t) of each frequency output for each frame from the post filtering unit 109 is supplied to the inverse fast Fourier transform unit 110. The inverse fast Fourier transform unit 110 performs inverse fast Fourier transform processing on the noise suppression signal Y (f, t) of each frequency for each frame to obtain a framed signal converted into a time domain signal. The framed signals of the respective frames are sequentially supplied to the waveform synthesis unit 111. The waveform synthesis unit 111 synthesizes the framed signals of the respective frames to obtain a noise-suppressed speech signal SAout, continuous in time series, as the output of the voice input system 100A.
[0108]
As described above, in the voice input system 100A shown in FIG. 16, the correction coefficient β (f, t) calculated by the correction coefficient calculation unit 107 is changed by the correction coefficient changing unit 113. In this case, the ambient noise state estimation unit 112 obtains the correlation coefficient corr of the observation signals x1 (n) and x2 (n) of the microphones 101a and 101b as the sound source number information of the ambient noise. Then, based on the sound source number information, the correction coefficient changing unit 113 obtains the smoothed frame number γ so as to increase as the number of sound sources increases, and smooths the correction coefficient β (f, t) in the frame direction to obtain the changed correction coefficient β ′ (f, t) of each frame. The post filtering unit 109 uses this changed correction coefficient β ′ (f, t).
[0109]
Therefore, in a situation where there are innumerable noise sources in the surroundings, the change of the correction coefficient in the frame direction (time direction) can be suppressed, and the influence on the output sound can be reduced. This makes it possible to perform noise removal processing in accordance with the ambient noise conditions. Therefore, even when the microphones 101a and 101b are noise canceling microphones installed in headphones and there are many noise sources in the surroundings, the noise can be corrected efficiently, and good noise removal processing with little distortion is performed.
[0110]
<3. Third Embodiment> [Configuration Example of Voice Input System] FIG. 26 shows a configuration example of a voice input system 100B according to a third embodiment. Similarly to the voice input systems 100 and 100A shown in FIGS. 1 and 16 described above, the voice input system 100B is a system that performs voice input using the noise canceling microphones installed on the left and right headphones of noise canceling headphones. In FIG. 26, parts corresponding to FIGS. 1 and 16 are assigned the same reference numerals, and detailed description thereof will be omitted as appropriate.
[0111]
The voice input system 100B includes microphones 101a and 101b, an A/D converter 102, a frame division unit 103, a fast Fourier transform (FFT) unit 104, a target sound enhancement unit 105, a noise estimation unit 106, and a correction coefficient calculation unit 107. In addition, the voice input system 100B includes a correction coefficient changing unit 108, a post filtering unit 109, an inverse fast Fourier transform (IFFT) unit 110, a waveform synthesis unit 111, an ambient noise state estimation unit 112, and a correction coefficient changing unit 113.
[0112]
In each frame, the correction coefficient changing unit 108 changes, among the correction coefficients β (f, t) calculated by the correction coefficient calculation unit 107, the coefficients of the band causing spatial aliasing so as to crush the peak that can occur at a specific frequency, thereby obtaining the changed correction coefficient β ′ (f, t). Although detailed description is omitted, the correction coefficient changing unit 108 is the same as the correction coefficient changing unit 108 of the voice input system 100 shown in FIG. 1. The correction coefficient changing unit 108 constitutes a first correction coefficient changing unit.
[0113]
The ambient noise state estimation unit 112 calculates, for each frame, the correlation coefficient corr between the observation signal of the microphone 101a and the observation signal of the microphone 101b and uses it as the sound source number information of the ambient noise. Although detailed description is omitted, the ambient noise state estimation unit 112 is the same as the ambient noise state estimation unit 112 of the voice input system 100A shown in FIG. 16.
[0114]
In each frame, the correction coefficient changing unit 113 further changes the correction coefficient β ′ (f, t) changed by the correction coefficient changing unit 108, based on the correlation coefficient corr (the sound source number information of the ambient noise) obtained by the ambient noise state estimation unit 112, to obtain a correction coefficient β ′ ′ (f, t). Although detailed description is omitted, the correction coefficient changing unit 113 is the same as the correction coefficient changing unit 113 of the voice input system 100A shown in FIG. 16. The correction coefficient changing unit 113 constitutes a second correction coefficient changing unit. In practice, the post filtering unit 109 uses not the correction coefficient β (f, t) itself calculated by the correction coefficient calculation unit 107 but the correction coefficient β ′ ′ (f, t) after these changes.
[0115]
The other parts of the voice input system 100B shown in FIG. 26 are configured in the same manner as the voice input systems 100 and 100A shown in FIGS. 1 and 16.
[0116]
The flowchart in FIG. 27 shows the procedure of processing (for one frame) in the correction coefficient changing unit 108, the ambient noise state estimation unit 112, and the correction coefficient changing unit 113. Each unit starts processing in step ST21. Thereafter, in step ST22, the correction coefficient changing unit 108 acquires the correction coefficient β (f, t) from the correction coefficient calculation unit 107. Then, in step ST23, the correction coefficient changing unit 108 searches the coefficients of the respective frequencies f from the low band in the current frame t, and locates the first frequency Fa (t) on the low band side at which the value of the coefficient drops.
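The search of step ST23 can be sketched as below. The exact criterion for "the value of the coefficient drops" is not specified in this excerpt, so a first-strictly-decreasing-step rule is an illustrative assumption.

```python
import numpy as np

def find_first_drop(beta_frame):
    """Return the index of the first bin, searching from the low band,
    at which the coefficient value drops below its predecessor.

    The drop criterion is not given in this excerpt; a simple
    'first strictly decreasing step' rule is assumed.
    """
    for f in range(1, len(beta_frame)):
        if beta_frame[f] < beta_frame[f - 1]:
            return f
    return len(beta_frame)  # no drop found: no aliasing band in this frame

# Coefficients rise in the low band, then drop at bin 3: Fa(t) = 3.
beta_frame = np.array([1.0, 1.2, 1.5, 0.8, 2.5, 0.9])
Fa = find_first_drop(beta_frame)
```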
[0117]
Next, in step ST24, the correction coefficient changing unit 108 checks a flag indicating whether or not to smooth the band higher than Fa (t), that is, the band causing spatial aliasing. Note that this flag is set in advance by a user operation. When the flag is on, in step ST25, the correction coefficient changing unit 108 smooths in the frequency direction the coefficients in the band of Fa (t) or more among the correction coefficients β (f, t) calculated by the correction coefficient calculation unit 107, thereby obtaining the changed correction coefficient β ′ (f, t) of each frequency f. When the flag is off in step ST24, in step ST26, the correction coefficient changing unit 108 replaces with "1" the coefficients of the band of Fa (t) or more among the correction coefficients β (f, t) calculated by the correction coefficient calculation unit 107, thereby obtaining the correction coefficient β ′ (f, t).
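Steps ST25 and ST26 can be sketched together. The moving-average window width used for the frequency-direction smoothing is not specified in this excerpt and is an illustrative assumption.

```python
import numpy as np

def change_aliasing_band(beta_frame, Fa, smooth_flag, width=3):
    """Change the coefficients of the spatial-aliasing band (bins >= Fa).

    smooth_flag on  (step ST25): smooth that band in the frequency
        direction (simple moving average; window width is assumed).
    smooth_flag off (step ST26): replace that band's coefficients with 1.
    """
    out = np.asarray(beta_frame, dtype=float).copy()
    if smooth_flag:
        kernel = np.ones(width) / width
        padded = np.pad(out, (width // 2, width // 2), mode="edge")
        smoothed = np.convolve(padded, kernel, mode="valid")
        out[Fa:] = smoothed[Fa:]
    else:
        out[Fa:] = 1.0
    return out

beta_frame = np.array([1.0, 1.0, 1.0, 4.0, 1.0, 1.0])  # peak at bin 3
beta_off = change_aliasing_band(beta_frame, Fa=3, smooth_flag=False)
beta_on = change_aliasing_band(beta_frame, Fa=3, smooth_flag=True)
```

Either branch crushes the isolated peak in the aliasing band, which is the stated purpose of the correction coefficient changing unit 108.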
[0118]
After the process of step ST25 or step ST26, in step ST27, the ambient noise state estimation unit
112 acquires data frames x1 (t) and x2 (t) of observation signals of the microphones 101a and
101b. Then, in step ST28, the ambient noise state estimation unit 112 calculates a correlation
coefficient corr (t) indicating the degree of correlation of the observation signals of the
microphones 101a and 101b (see equation (8)).
[0119]
Next, in step ST29, the correction coefficient changing unit 113 calculates the smoothed frame number γ using the value of the correlation coefficient corr (t) calculated by the ambient noise state estimation unit 112 in step ST28 (see FIG. 23). Then, in step ST30, the correction coefficient changing unit 113 changes the correction coefficient β ′ (f, t) changed by the correction coefficient changing unit 108 by smoothing it over the smoothed frame number γ calculated in step ST29, thereby obtaining the correction coefficient β ′ ′ (f, t). After the process of step ST30, each unit ends the process in step ST31.
[0120]
The operation of the voice input system 100B shown in FIG. 26 will be briefly described. Ambient sound is collected by the microphones 101a and 101b arranged at a predetermined interval, and observation signals are obtained. The observation signals obtained by the microphones 101a and 101b are converted from analog signals to digital signals by the A/D converter 102 and then supplied to the frame division unit 103. Then, in the frame division unit 103, the observation signals from the microphones 101a and 101b are divided into frames of a predetermined time length and framed.
[0121]
The framed signals of each frame obtained by the frame division unit 103 are sequentially supplied to the fast Fourier transform unit 104. The fast Fourier transform unit 104 performs fast Fourier transform (FFT) processing on the framed signals to obtain, as signals in the frequency domain, the observation signal X1 (f, t) of the microphone 101a and the observation signal X2 (f, t) of the microphone 101b.
[0122]
The observation signals X1 (f, t) and X2 (f, t) obtained by the fast Fourier transform unit 104 are supplied to the target sound enhancement unit 105. The target sound enhancement unit 105 subjects the observation signals X1 (f, t) and X2 (f, t) to conventionally known DS processing, adaptive beamformer processing, or the like to obtain the target sound estimation signal Z (f, t). For example, when DS processing is used, the observation signal X1 (f, t) and the observation signal X2 (f, t) are added and then divided by 2 to obtain the target sound estimation signal Z (f, t) (see equation (3)).
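The DS processing of equation (3) can be sketched as below. For a target directly in front, the inter-microphone delay is zero, so no explicit delay term is needed in this sketch.

```python
import numpy as np

def delay_and_sum(X1, X2):
    """Delay-and-sum (DS) target sound enhancement, equation (3):
    the two spectra are added and divided by 2.

    For a front target the delay term is zero and is omitted here.
    """
    return (X1 + X2) / 2.0

# A front target arrives in phase at both microphones and passes
# unchanged; an antiphase (off-axis) component is attenuated to zero.
X1 = np.array([1.0 + 0j, 1.0 + 0j])
X2 = np.array([1.0 + 0j, -1.0 + 0j])
Z = delay_and_sum(X1, X2)
```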
[0123]
Also, the observation signals X1 (f, t) and X2 (f, t) obtained by the fast Fourier transform unit 104 are supplied to the noise estimation unit 106. The noise estimation unit 106 subjects the observation signals X1 (f, t) and X2 (f, t) to conventionally known NBF processing, adaptive beamformer processing, or the like to obtain the noise estimation signal N (f, t). For example, when NBF processing is used, the observation signal X2 (f, t) is subtracted from the observation signal X1 (f, t) and the result is divided by 2 to obtain the noise estimation signal N (f, t) (see equation (4)).
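The NBF processing of equation (4) can likewise be sketched; subtraction places a null toward the front target, so ideally only noise remains in N (f, t).

```python
import numpy as np

def null_beamformer(X1, X2):
    """Null beamformer (NBF) noise estimation, equation (4):
    the two spectra are subtracted and divided by 2, nulling the
    front target so that (ideally) only noise remains.
    """
    return (X1 - X2) / 2.0

# The front target (in phase at both microphones) is nulled out,
# while an off-axis (antiphase) component survives.
X1 = np.array([1.0 + 0j, 1.0 + 0j])
X2 = np.array([1.0 + 0j, -1.0 + 0j])
N = null_beamformer(X1, X2)
```

Together with the DS output, this gives the complementary pair Z (f, t) and N (f, t) that the correction coefficient calculation and post filtering operate on.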
[0124]
The target sound estimation signal Z (f, t) obtained by the target sound enhancement unit 105 and the noise estimation signal N (f, t) obtained by the noise estimation unit 106 are supplied to the correction coefficient calculation unit 107. The correction coefficient calculation unit 107 calculates correction coefficients β (f, t) for correcting the post-filtering process based on the target sound estimation signal Z (f, t) and the noise estimation signal N (f, t). In each frame, the correction coefficient is calculated for each frequency (see equation (5)).
[0125]
The correction coefficient β (f, t) calculated by the correction coefficient calculation unit 107 is supplied to the correction coefficient changing unit 108. The correction coefficient changing unit 108 changes, among the correction coefficients β (f, t) calculated by the correction coefficient calculation unit 107, the coefficients of the band causing spatial aliasing so as to crush the peak that can occur at a specific frequency, thereby obtaining the changed correction coefficient β ′ (f, t).
[0126]
In addition, the framed signals of each frame obtained by the frame division unit 103, that is, the observation signals x1 (n) and x2 (n) of the microphones 101a and 101b, are supplied to the ambient noise state estimation unit 112. The ambient noise state estimation unit 112 obtains the correlation coefficient corr of the observation signals x1 (n) and x2 (n) of the microphones 101a and 101b and uses it as the sound source number information of the ambient noise (see equation (8)).
[0127]
The changed correction coefficient β ′ (f, t) obtained by the correction coefficient changing unit 108 is further supplied to the correction coefficient changing unit 113. The correction coefficient changing unit 113 is also supplied with the correlation coefficient corr obtained by the ambient noise state estimation unit 112. In each frame, the correction coefficient changing unit 113 further changes the correction coefficient β ′ (f, t) obtained by the correction coefficient changing unit 108, based on the correlation coefficient corr (the sound source number information of the ambient noise) obtained by the ambient noise state estimation unit 112.
[0128]
First, the correction coefficient changing unit 113 obtains the smoothed frame number γ based on the correlation coefficient corr. In this case, the smoothed frame number γ is small when the value of the correlation coefficient corr is large, and large when the value of the correlation coefficient corr is small (see FIG. 23). Next, the correction coefficient changing unit 113 smooths the correction coefficient β ′ (f, t) obtained by the correction coefficient changing unit 108 in the frame direction (time direction) over the smoothed frame number γ, thereby obtaining the changed correction coefficient β ′ ′ (f, t) of each frame (see FIG. 24).
[0129]
The target sound estimation signal Z (f, t) obtained by the target sound enhancement unit 105 and the noise estimation signal N (f, t) obtained by the noise estimation unit 106 are supplied to the post filtering unit 109. The correction coefficient β ′ ′ (f, t) changed by the correction coefficient changing unit 113 is also supplied to the post filtering unit 109. The post filtering unit 109 removes the noise components remaining in the target sound estimation signal Z (f, t) by post filtering processing using the noise estimation signal N (f, t). The correction coefficient β ′ ′ (f, t) is used to correct this post-filtering process, that is, to match the gain of the noise component remaining in the target sound estimation signal Z (f, t) with the gain of the noise estimation signal N (f, t).
[0130]
The post filtering unit 109 obtains the noise suppression signal Y (f, t) using, for example, a known technique such as the spectral subtraction method or the MMSE-STSA method. For example, when the spectral subtraction method is used, the noise suppression signal Y (f, t) is obtained based on the following equation (10).

Y (f, t) = Z (f, t) − β ′ ′ (f, t) * N (f, t) (10)
[0131]
The noise suppression signal Y (f, t) of each frequency output for each frame from the post filtering unit 109 is supplied to the inverse fast Fourier transform unit 110. The inverse fast Fourier transform unit 110 performs inverse fast Fourier transform processing on the noise suppression signal Y (f, t) of each frequency for each frame to obtain a framed signal converted into a time domain signal. The framed signals of the respective frames are sequentially supplied to the waveform synthesis unit 111. The waveform synthesis unit 111 synthesizes the framed signals of the respective frames to obtain a noise-suppressed speech signal SAout, continuous in time series, as the output of the voice input system 100B.
[0132]
As described above, in the voice input system 100B shown in FIG. 26, the correction coefficient β (f, t) calculated by the correction coefficient calculation unit 107 is changed by the correction coefficient changing unit 108. In this case, among the correction coefficients β (f, t) calculated by the correction coefficient calculation unit 107, the coefficients of the band causing spatial aliasing (the band higher than Fa (t)) are changed so as to crush the peak that can occur at a specific frequency, and the changed correction coefficient β ′ (f, t) is obtained.
[0133]
Further, in the voice input system 100B shown in FIG. 26, the correction coefficient β ′ (f, t) changed by the correction coefficient changing unit 108 is further changed by the correction coefficient changing unit 113. In this case, the ambient noise state estimation unit 112 obtains the correlation coefficient corr of the observation signals x1 (n) and x2 (n) of the microphones 101a and 101b as the sound source number information of the ambient noise. Then, based on the sound source number information, the correction coefficient changing unit 113 obtains the smoothed frame number γ so as to increase as the number of sound sources increases, and smooths the correction coefficient β ′ (f, t) in the frame direction to obtain the changed correction coefficient β ′ ′ (f, t) of each frame. The post filtering unit 109 uses this changed correction coefficient β ′ ′ (f, t).
[0134]
Therefore, the adverse effect that the coefficient peak that can occur at a specific frequency in the band causing spatial aliasing has on the output sound can be reduced, and deterioration of the sound quality can be suppressed. This enables noise removal processing independent of the microphone spacing. Therefore, even when the microphones 101a and 101b are noise canceling microphones installed in headphones and the microphone spacing is wide, the noise can be corrected efficiently, and good noise removal processing with little distortion is performed.
[0135]
In addition, in a situation where there are innumerable noise sources in the surroundings, the change of the correction coefficient in the frame direction (time direction) can be suppressed, and the influence on the output sound can be reduced. This makes it possible to perform noise removal processing in accordance with the ambient noise conditions. Therefore, even when the microphones 101a and 101b are noise canceling microphones installed in headphones and there are many noise sources in the surroundings, the noise can be corrected efficiently, and good noise removal processing with little distortion is performed.
[0136]
<4. Fourth Embodiment> [Configuration Example of Voice Input System] FIG. 28 shows a configuration example of a voice input system 100C according to a fourth embodiment. Similarly to the voice input systems 100, 100A, and 100B shown in FIGS. 1, 16, and 26, the voice input system 100C is a system that performs voice input using the noise canceling microphones installed on the left and right headphones of noise canceling headphones. In FIG. 28, parts corresponding to FIG. 26 are assigned the same reference numerals, and detailed description thereof will be omitted as appropriate.
[0137]
The voice input system 100C includes microphones 101a and 101b, an A/D converter 102, a frame division unit 103, a fast Fourier transform (FFT) unit 104, a target sound enhancement unit 105, a noise estimation unit 106, and a correction coefficient calculation unit 107C. Further, the voice input system 100C includes correction coefficient changing units 108 and 113, a post filtering unit 109, an inverse fast Fourier transform (IFFT) unit 110, a waveform synthesis unit 111, an ambient noise state estimation unit 112, and a target sound section detection unit 114.
[0138]
The target sound section detection unit 114 detects a section having the target sound. As shown in FIG. 29, the target sound section detection unit 114 determines, in each frame, whether the frame is a target sound section based on the target sound estimation signal Z (f, t) obtained by the target sound enhancement unit 105 and the noise estimation signal N (f, t) obtained by the noise estimation unit 106, and outputs target sound section information.
[0139]
The target sound segment detection unit 114 obtains an energy ratio of the target sound
estimation signal Z (f, t) and the noise estimation signal N (f, t). The following equation (11)
indicates the energy ratio.
[0140]
The target sound section detection unit 114 determines whether the energy ratio is larger than a threshold. When the energy ratio is larger than the threshold, the target sound section detection unit 114 determines that the frame is a target sound section and outputs "1" as the target sound section detection information, as shown in the following equation (12). Otherwise, it determines that the frame is not a target sound section and outputs "0".
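The decision of equations (11) and (12) can be sketched as below. The bodies of those equations are not reproduced in this excerpt, so the full-band energy ratio form and the threshold value are illustrative assumptions.

```python
import numpy as np

def detect_target_sound_section(Z, N, threshold=2.0, eps=1e-12):
    """Target-sound-section decision for one frame (equations (11)/(12)):
    compare the energy ratio of Z(f, t) to N(f, t) against a threshold
    and output 1 (target sound section) or 0 (noise only).

    The threshold value is not given in this excerpt; 2.0 is illustrative.
    """
    ratio = np.sum(np.abs(Z) ** 2) / (np.sum(np.abs(N) ** 2) + eps)
    return 1 if ratio > threshold else 0

Z_speech = np.array([3.0 + 0j, 3.0 + 0j])  # target present: Z well above N
Z_noise = np.array([1.0 + 0j, 1.0 + 0j])   # noise only: Z close to N
N_est = np.array([1.0 + 0j, 1.0 + 0j])
flag_speech = detect_target_sound_section(Z_speech, N_est)
flag_noise = detect_target_sound_section(Z_noise, N_est)
```

This flag is exactly what the correction coefficient calculation unit 107C consumes: it recomputes β (f, t) only in frames flagged 0 (no target sound).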
[0141]
[0142]
In this case, the target sound is in front as shown in FIG. 30. When the target sound is present, the gain difference between the target sound estimation signal Z (f, t) and the noise estimation signal N (f, t) is large, whereas in the case of noise only, the difference in their gains is small. The same processing can be performed when the microphone spacing is known and the target sound is not in front but in an arbitrary direction.
[0143]
The correction coefficient calculation unit 107C calculates the correction coefficient β (f, t) in the same manner as the correction coefficient calculation unit 107 of the voice input systems 100, 100A, and 100B in FIGS. 1, 16, and 26. However, unlike the correction coefficient calculation unit 107, the correction coefficient calculation unit 107C determines whether to calculate the correction coefficient β (f, t) based on the target sound section information from the target sound section detection unit 114. That is, the correction coefficient calculation unit 107C newly calculates and outputs the correction coefficient β (f, t) in a frame having no target sound, and in the other frames does not calculate the correction coefficient β (f, t) but outputs the same correction coefficient β (f, t) as that of the previous frame as it is.
[0144]
The other parts of the voice input system 100C shown in FIG. 28 are configured in the same
manner as the voice input system 100B shown in FIG. Therefore, in the voice input system 100C,
the same effect as the voice input system 100B shown in FIG. 26 can be obtained.
[0145]
Further, in the voice input system 100C, the correction coefficient calculation unit 107C calculates the correction coefficient β (f, t) only in sections where there is no target sound. In this case, since the target sound estimation signal Z (f, t) contains only noise components, the correction coefficient β (f, t) can be accurately calculated without being affected by the target sound, and as a result, good noise removal processing is performed.
[0146]
<5. Modified Example> In the above embodiment, the microphones 101a and 101b are noise
canceling microphones provided on the left and right headphones of the noise canceling
headphone. However, it is also conceivable that the microphones 101a and 101b are
microphones installed in the personal computer main body.
[0147]
Further, also in the voice input systems 100 and 100A shown in FIGS. 1 and 16, as in the voice input system 100C shown in FIG. 28, the target sound section detection unit 114 may be provided so that the correction coefficient calculation unit 107 calculates the correction coefficient β (f, t) only in frames without the target sound.
[0148]
The present invention can be applied to a system for making a call by using a noise canceling
microphone installed in noise canceling headphones or a microphone installed in a personal
computer.
[0149]
100, 100A, 100B, 100C ... voice input system, 101a, 101b ... microphone, 102 ... A/D converter, 103 ... frame division unit, 104 ... fast Fourier transform (FFT) unit, 105 ... target sound enhancement unit, 106 ... noise estimation unit (target sound suppression unit), 107, 107C ... correction coefficient calculation unit, 108 ... correction coefficient changing unit, 109 ... post filtering unit, 110 ... inverse fast Fourier transform (IFFT) unit, 111 ... waveform synthesis unit, 112 ... ambient noise state estimation unit, 113 ... correction coefficient changing unit, 114 ... target sound section detection unit