Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2017516126
Abstract A noise suppressor has first (401) and second (403) transducers that generate first and second frequency domain signals by frequency transformation of first and second microphone signals. A gain unit (405, 407, 409) determines a time-frequency tile gain in response to a difference indicator for an absolute-value time-frequency tile value of the first frequency domain signal and an absolute-value time-frequency tile value of the second frequency domain signal. A scaler (411) generates a third frequency domain signal by scaling the time-frequency tile values of the first frequency domain signal by the time-frequency tile gains. The resulting signal is converted to the time domain by a third converter (413). A designator (405, 407, 415) designates the time-frequency tiles of the first frequency domain signal as speech tiles or noise tiles, and the gain unit (409) determines the gain in accordance with the designation of the time-frequency tile as a speech tile or a noise tile.
Noise suppression
[0001]
The present invention relates to noise suppression, and more particularly, but not exclusively, to
the suppression of non-stationary diffuse noise based on signals captured from two microphones.
[0002]
The capture of audio, especially speech, has become increasingly important in recent decades.
In fact, speech capture has become increasingly important for a variety of applications including
telecommunications, teleconferencing, gaming and the like. However, the problem in many
scenarios and applications is that the desired speech source is typically not the only audio source
in the environment. Rather, in a typical audio environment, there are many other audio / noise
sources captured by the microphone. One of the key issues presented to many speech capture
applications is the question of how best to extract speech in noisy environments. Several
different approaches for noise suppression have been proposed to address this problem.
[0003]
One of the most difficult tasks in speech enhancement is the suppression of non-stationary diffuse noise. Diffuse noise is, for example, the acoustic (noise) sound field in a room where the noise arrives from all directions. A typical example is the so-called "babble" noise in cafeterias and restaurants, where many noise sources are distributed throughout the room.
[0004]
When recording a desired speaker in a room using a microphone or microphone array, background noise is captured in addition to the desired speech. Speech enhancement can be used in an attempt to modify the microphone signal so that the background noise is reduced while the desired speech is affected as little as possible. When the noise is diffuse, one proposed approach estimates the spectral amplitude of the background noise and corrects the spectral amplitude of the microphone signal so that the resulting spectral amplitude of the enhanced signal is as similar as possible to the spectral amplitude of the desired speech signal. In this approach, the phase of the acquired signal is not changed.
[0005]
FIG. 1 shows an example of a noise suppression system according to the prior art. In this example, input signals are received from two microphones. One microphone is considered to be the reference microphone and the other is the main microphone capturing the desired audio source, in particular the speech. Thus, a reference microphone signal x(n) and a main microphone signal are received. These signals are transformed into the frequency domain in the converters 101, 103, and the absolute values of the individual time-frequency tiles are generated by the absolute value units 105, 107. The resulting absolute values are input to unit 109, which calculates the gain. The resulting gain is multiplied by the frequency domain values of the main signal in multiplier 111, thereby producing a spectrally compensated output signal, which is transformed back into the time domain in the further transform unit 113.
[0006]
This approach can best be described in the frequency domain. First, the frequency domain signal is generated by calculating the short time Fourier transform (STFT) of, for example, overlapping Hanning-windowed blocks of the time domain signal. The STFT is generally a function of both time and frequency and is represented by the two arguments t_k and ω_l, where t_k = kB is the discrete time, k is the frame index, B is the frame shift, ω_l = lω_0 is the (discrete) frequency, l is the frequency index, and ω_0 represents the fundamental frequency interval.
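For illustration only, the following Python sketch shows how such an STFT of overlapping Hanning-windowed blocks can be computed; the frame shift B is an arbitrary example value and the function name is not taken from the patent.

```python
import numpy as np

def stft(signal, B=256):
    """STFT of overlapping Hanning-windowed blocks of 2B samples.

    B corresponds to the frame shift in the text; frame k covers samples
    k*B .. k*B + 2B, so consecutive frames overlap by 50%.  The value at
    [k, l] is the time-frequency tile value Z(t_k, omega_l)."""
    window = np.hanning(2 * B)
    n_frames = (len(signal) - 2 * B) // B + 1
    tiles = np.empty((n_frames, B + 1), dtype=complex)
    for k in range(n_frames):
        block = signal[k * B : k * B + 2 * B]
        tiles[k] = np.fft.rfft(window * block)
    return tiles
```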
[0007]
Let Z(t_k, ω_l) be the (complex) microphone signal to be enhanced. It consists of the desired speech signal Z_s(t_k, ω_l) and the noise signal Z_n(t_k, ω_l): Z(t_k, ω_l) = Z_s(t_k, ω_l) + Z_n(t_k, ω_l). This microphone signal is input to a post-processor. The post-processor performs noise suppression by modifying the spectral amplitude of the input signal while leaving the phase unchanged. The operation of the post-processor can be described by a gain function, which for spectral amplitude subtraction typically has the form G(t_k, ω_l) = (|Z(t_k, ω_l)| - |Z_n(t_k, ω_l)|) / |Z(t_k, ω_l)|.
[0008] Here, |·| denotes the absolute value operation. The output signal is calculated as Q(t_k, ω_l) = Z(t_k, ω_l) · G(t_k, ω_l). After conversion back to the time domain, the time domain signal is reconstructed by combining the current and previous frames, taking into account that the original time signal was windowed and overlapped in time (an overlap-add procedure is performed).
[0009] The gain function can be generalized as G(t_k, ω_l) = ((|Z(t_k, ω_l)|^α - |Z_n(t_k, ω_l)|^α) / |Z(t_k, ω_l)|^α)^(1/α). [0010] For α = 1, this describes the gain function for spectral amplitude subtraction. For α = 2, it describes the often used gain function for spectral power subtraction. The following description focuses on spectral amplitude subtraction, but it will be understood that the principles given may in particular also be applied to spectral power subtraction.
[0011] The amplitude spectrum of the noise, |Z_n(t_k, ω_l)|, is generally unknown. Therefore, an estimated value |Ẑ_n(t_k, ω_l)| has to be used instead. [0012] Since the estimate is not always accurate, an oversubtraction factor γ_n for the noise is used (i.e. the noise estimate is scaled by a factor greater than 1). However, this can lead to negative values of the gain function [0013], which is undesirable. For that reason, the gain function is limited to zero or some small positive value θ. [0014] For the above gain function, this results in the following: G(t_k, ω_l) = max((|Z(t_k, ω_l)| - γ_n·|Ẑ_n(t_k, ω_l)|) / |Z(t_k, ω_l)|, θ).
[0015] For stationary noise, |Z_n(t_k, ω_l)| can be estimated by measuring and averaging the amplitude spectrum |Z(t_k, ω_l)| during silences. However, for non-stationary noise, an estimate of |Z_n(t_k, ω_l)| cannot be derived from such an approach, because the characteristics change over time. This tends to prevent an accurate estimate from being generated from a single microphone signal. Instead, it has been proposed to use an additional microphone in order to be able to estimate |Z_n(t_k, ω_l)|. [0016] As a specific example, consider a scenario in which there are two microphones in a room: one microphone is located near the desired speaker (the main microphone) and the other microphone is further from the speaker (the reference microphone). In this scenario, it can be assumed that the main microphone signal contains the desired speech and noise components, and that the reference microphone signal contains no speech at all but only the noise signal recorded at the position of the reference microphone. The microphone signals are then, for the main and reference microphones respectively: Z(t_k, ω_l) = Z_s(t_k, ω_l) + Z_n(t_k, ω_l) and X(t_k, ω_l) = X_n(t_k, ω_l).
[0017] In order to relate the noise components in the two microphone signals, a so-called coherence term C(t_k, ω_l) is defined in terms of the expectation operator E{·}. [0018] The coherence term is a measure of the average correlation between the amplitude of the noise component in the main microphone signal and the amplitude of the reference microphone signal. [0019] Since C(t_k, ω_l) does not depend on the instantaneous audio at the microphones but on the spatial characteristics of the noise field, the variation of C(t_k, ω_l) as a function of time is much smaller than the time variation of Z_n and X_n. [0020] As a result, C(t_k, ω_l) can be estimated relatively accurately by temporally averaging |Z_n(t_k, ω_l)| and |X_n(t_k, ω_l)|. An approach for doing so is disclosed in US Pat. The document specifically describes a method in which explicit speech detection is not required to determine C(t_k, ω_l).
[0021] As in the case of stationary noise, the gain function for the two microphones can be derived as: G(t_k, ω_l) = max((|Z(t_k, ω_l)| - γ_n·C(t_k, ω_l)·|X(t_k, ω_l)|) / |Z(t_k, ω_l)|, θ). [0022] Since X contains no speech, the absolute value of X multiplied by the coherence term C(t_k, ω_l) is considered to give an estimate of the noise component in the main microphone signal. As a consequence, the equation given above can be used, by scaling the frequency domain signal, i.e. by Q(t_k, ω_l) = Z(t_k, ω_l) · G(t_k, ω_l), to shape the spectrum of the first microphone signal to correspond to the (estimated) speech component.
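The prior-art processing described above can be summarized in the following illustrative Python sketch. The gain form with coherence term C, oversubtraction factor γ_n and lower limit θ follows the expressions given above; the default parameter values are arbitrary examples, not values from the patent.

```python
import numpy as np

def prior_art_gain(Z, X, C, gamma_n=1.5, theta=0.1):
    """Two-microphone spectral amplitude subtraction gain (prior-art scheme).

    Z, X : complex STFT tiles of the main and reference microphones.
    C    : coherence term relating the noise amplitudes of the two signals.
    The reference amplitude, scaled by C and the oversubtraction factor,
    serves as the noise estimate; the gain is floored at theta."""
    noise_estimate = gamma_n * C * np.abs(X)
    gain = (np.abs(Z) - noise_estimate) / np.maximum(np.abs(Z), 1e-12)
    return np.maximum(gain, theta)

def suppress(Z, X, C, **kwargs):
    """Apply the gain to the main-microphone tiles; the phase of Z is kept."""
    return prior_art_gain(Z, X, C, **kwargs) * Z
```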
[0023] However, while the described approach may provide advantageous performance in many scenarios, it may provide less than optimal performance in others. In particular, the noise suppression may not be optimal in some scenarios. For diffuse noise in particular, the improvement in signal-to-noise ratio (SNR) may be limited, and the so-called SNR improvement (SNRI) is often practically limited to the order of 6-9 dB. While this may be acceptable for some applications, in many scenarios significant noise components tend to remain, resulting in degradation of the perceived speech quality. Furthermore, although other noise suppression techniques can be used, these too tend to be suboptimal: they tend to be complex, lack flexibility, be impractical, be computationally demanding, require complex hardware (e.g. multiple microphones) and/or provide non-optimal noise suppression.
[0024] Thus, improved noise suppression would be advantageous. In particular, noise suppression that allows reduced complexity, increased flexibility, facilitated implementation, reduced cost (e.g. by not requiring multiple microphones), improved noise suppression and/or improved performance would be advantageous.
[0025] U.S. Pat. No. 7,602,926; U.S. Pat. No. 7,146,012
[0026] Thus, the present invention seeks to preferably mitigate, alleviate or eliminate one or
more of the above mentioned disadvantages singly or in any combination.
[0027] According to an aspect of the invention, a noise suppressor is provided for suppressing noise in a first microphone signal. The noise suppressor comprises: a first converter that generates a first frequency domain signal from a frequency transformation of the first microphone signal, the first frequency domain signal being represented by time-frequency tile values; a second converter that generates a second frequency domain signal from a frequency transformation of a second microphone signal, the second frequency domain signal being represented by time-frequency tile values; a gain unit for determining a time-frequency tile gain as a non-negative monotonic function of a difference indicator indicative of a difference between a first monotonic function of an absolute-value time-frequency tile value of the first frequency domain signal and a second monotonic function of an absolute-value time-frequency tile value of the second frequency domain signal; and a scaler for generating an output frequency domain signal by scaling the time-frequency tile values of the first frequency domain signal by the time-frequency tile gains. The noise suppressor further comprises a designator for designating a time-frequency tile of the first frequency domain signal as a speech tile or a noise tile, and the gain unit is configured to determine the time-frequency tile gain of a time-frequency tile in response to the designation of the time-frequency tile of the first frequency domain signal as a speech tile or a noise tile, such that a lower gain value is determined when the time-frequency tile is designated as a noise tile than when it is designated as a speech tile.
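By way of illustration only, the following Python sketch arranges the claimed elements (converters, designator, gain unit, scaler and inverse converter) into one possible processing chain. The frame-wise designation criterion and the way the designation modifies the gain (scaling the subtracted noise term by α for speech frames) are assumptions made for the sketch, not the specific functions of the claims.

```python
import numpy as np

def analyse(signal, B=256):
    """First/second converter: 50%-overlapping Hanning-windowed FFT frames."""
    w = np.hanning(2 * B)
    n_frames = (len(signal) - 2 * B) // B + 1
    return np.array([np.fft.rfft(w * signal[k * B : k * B + 2 * B]) for k in range(n_frames)])

def designate(Z, X, C, threshold=0.0):
    """Designator: mark a whole frame as speech when the frame-summed
    difference indicator exceeds a threshold (one simple criterion)."""
    d = np.abs(Z) - C * np.abs(X)
    return d.sum(axis=1) > threshold                       # True = speech frame

def tile_gains(Z, X, C, speech, gamma_n=1.5, theta=0.1, alpha=0.5):
    """Gain unit: lower gains for tiles of frames designated as noise."""
    Zm, Xm = np.abs(Z), np.abs(X)
    scale = np.where(speech[:, None], alpha, 1.0)          # smaller noise term for speech tiles
    d = Zm - gamma_n * C * scale * Xm                      # difference indicator
    return np.maximum(d / np.maximum(Zm, 1e-12), theta)

def synthesise(Q, B=256):
    """Third converter: inverse FFT with overlap-add reconstruction."""
    frames = np.fft.irfft(Q, n=2 * B)
    out = np.zeros(B * (len(frames) + 1))
    for k, frame in enumerate(frames):
        out[k * B : k * B + 2 * B] += frame
    return out

def noise_suppressor(z, x, C=1.0, B=256):
    Z, X = analyse(z, B), analyse(x, B)
    speech = designate(Z, X, C)
    Q = tile_gains(Z, X, C, speech) * Z                    # scaler: amplitude only, phase of Z kept
    return synthesise(Q, B)
```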
[0028] The present invention may provide improved and/or facilitated noise suppression in many embodiments. In particular, the present invention may allow improved suppression of non-stationary and/or diffuse noise. An increased signal-to-noise or speech-to-noise ratio can often be achieved. In particular, the approach may in practice increase the upper bound on the achievable SNR improvement. In fact, in many practical scenarios the present invention may allow the SNR improvement of the noise-suppressed signal to be increased from about 6-8 dB to more than 20 dB.
[0029] The approach can typically provide improved noise suppression, and in particular, can
allow improved suppression of noise without corresponding speech suppression. An improved
signal to noise ratio of the suppressed signal can often be achieved.
[0030] The gain unit is configured to separately determine different time-frequency tile gains for at least two time-frequency tiles. In many embodiments, the time-frequency tiles may be divided into multiple sets of time-frequency tiles, and the gain unit may be configured to determine the gain independently and/or separately for each set of time-frequency tiles. In many embodiments, the gains for the time-frequency tiles of a set of time-frequency tiles may depend only on attributes of the first frequency domain signal and the second frequency domain signal in the time-frequency tiles belonging to that set of time-frequency tiles.
[0031] The gain unit may determine, for a time-frequency tile, a different gain if the tile is designated as a speech tile than if it is designated as a noise tile. The gain unit may in particular be arranged to calculate the gain for the time-frequency tile by evaluating a function that depends on the designation of the time-frequency tile. In some embodiments, the gain unit may be configured to calculate the gain for the time-frequency tile by evaluating a different function when the time-frequency tile is designated as a speech tile than when it is designated as a noise tile. The functions, equations, algorithms and/or parameters used in determining the time-frequency tile gain may differ between time-frequency tiles designated as speech tiles and those designated as noise tiles.
[0032] The time frequency tiles may in particular correspond to one bin of frequency transforms
in one time segment / frame. In particular, the first and second converters may use block
processing to convert successive segments of the first and second signals. A time frequency tile
may correspond to a set (typically one) of transform bins in one segment / frame.
[0033] The designation as a speech or noise (time-frequency) tile may be performed for each individual time-frequency tile in some embodiments. However, the designation may often be applied to a group of time-frequency tiles. In particular, the designation may apply to all time-frequency tiles in a certain time segment. Thus, in some embodiments, the first microphone signal may be segmented into time segments/frames that are individually converted to the frequency domain, and the designation of a time-frequency tile as a speech or noise tile may be common to all time-frequency tiles of one segment/frame.
[0034] In some embodiments, the noise suppressor may further comprise a third converter for
generating the output signal from the frequency to time conversion of the output frequency
domain signal. In other embodiments, the output frequency domain signal may be used directly.
For example, speech recognition or speech enhancement may be performed in the frequency
domain, so output frequency domain signals may be used directly without the need for
conversion to the time domain.
[0035] According to an optional feature of the invention, the gain unit is configured to determine
a gain value for a time frequency tile gain of a time frequency tile as a function of the difference
index of the time frequency tile.
[0036] This may provide for efficient noise suppression and / or facilitated implementation. In
particular, many embodiments can lead to efficient noise suppression that can be efficiently
adapted to the signal characteristics and still be implemented without the need for high
computational load or extremely complex processing.
[0037] The function may in particular be a monotonic function of the difference indicator, and
the gain value may in particular be proportional to the difference value.
[0038] According to an optional feature of the invention, at least one of the first monotonic
function and the second monotonic function is dependent on whether the time frequency tile is
designated as a speech tile or a noise tile .
[0039] This may provide for efficient noise suppression and / or facilitated implementation. In
particular, many embodiments can lead to efficient noise suppression that can be efficiently
adapted to the signal characteristics and still be implemented without the need for high
computational load or extremely complex processing.
[0040] The at least one of the first monotonic function and the second monotonic function may, for the same absolute-value time-frequency tile value of the first or second frequency domain signal respectively, provide a different output value for a time-frequency tile when the tile is designated as a speech tile than when it is designated as a noise tile.
[0041] According to an optional feature of the invention, the second monotonic function includes a scaling of the absolute-value time-frequency tile value of the second frequency domain signal for a time-frequency tile by a scale value that depends on whether the time-frequency tile is designated as a speech time-frequency tile or a noise time-frequency tile.
[0042] This may provide for efficient noise suppression and / or facilitated implementation. In
particular, many embodiments can lead to efficient noise suppression that can be efficiently
adapted to the signal characteristics and still be implemented without the need for high
computational load or extremely complex processing.
[0043] According to an optional feature of the invention, the gain unit is configured to generate a noise coherence estimate indicating a correlation between the amplitude of the second microphone signal and the amplitude of the noise component of the first microphone signal, and at least one of the first monotonic function and the second monotonic function depends on the noise coherence estimate.
[0044] This may provide for efficient noise suppression and/or facilitated implementation. The noise coherence estimate may in particular be an estimate of the correlation between the amplitude of the first microphone signal and the amplitude of the second microphone signal in the absence of speech, i.e. when the speech source is inactive. In some embodiments, the noise coherence estimate may be determined based on the first and second microphone signals and/or the first and second frequency domain signals. In some embodiments, the noise coherence estimate may be generated based on a separate calibration or measurement process.
[0045] According to an optional feature of the invention, the first monotonic function and the second monotonic function are such that, for an amplitude relationship between the first microphone signal and the second microphone signal corresponding to the noise coherence estimate, the expected value of the difference indicator is negative if the time-frequency tile is designated as a noise tile.
[0046] According to an optional feature of the invention, the gain unit is configured such that at least one of the first monotonic function and the second monotonic function causes the expected value of the difference indicator, for an amplitude relationship between the first microphone signal and the second microphone signal corresponding to the noise coherence estimate, to vary differently for time-frequency tiles designated as noise tiles than for time-frequency tiles designated as speech tiles.
[0047] According to an optional feature of the invention, the gain difference between time-frequency tiles designated as speech tiles and those designated as noise tiles depends on at least one value from the group consisting of: the signal level of said first microphone signal; the signal level of said second microphone signal; and a signal-to-noise estimate for the first microphone signal.
[0048] This may provide for efficient noise suppression and / or facilitated implementation. In
particular, many embodiments can lead to efficient noise suppression that can be efficiently
adapted to the signal characteristics and still be implemented without the need for high
computational load or extremely complex processing.
[0049] According to an optional feature of the invention, the difference measure for a time
frequency tile depends on whether the time frequency tile is designated as a noise tile or a
speech tile.
[0050] This may provide for efficient noise suppression and / or facilitated implementation.
[0051] According to an optional feature of the invention, the designator is configured to designate the time-frequency tile of the first frequency domain signal as a speech tile or a noise tile in response to a difference value generated in response to the difference indicator, determined for the noise tile designation, for the absolute-value time-frequency tile value of the first frequency domain signal and the absolute-value time-frequency tile value of the second frequency domain signal.
[0052] This may allow a particularly advantageous designation. In particular, a reliable
designation can be achieved, while at the same time allowing reduced complexity. In particular,
corresponding or typically the same functionality may be used for both tile specification and gain
determination.
[0053] In many embodiments, the designator is configured to specify a time frequency tile as a
noise tile if the value of the difference is less than a threshold.
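A minimal sketch of such a threshold-based designation, assuming the simple difference value |Z| - C·|X| per tile (the threshold and the exact difference value used in an embodiment may differ):

```python
import numpy as np

def designate_tiles(Z, X, C=1.0, threshold=0.0):
    """Designate each time-frequency tile as speech (True) or noise (False).

    A tile is marked as noise when its difference value is below a threshold,
    as in the optional feature described above.  The difference value used
    here, |Z| - C*|X|, is one simple choice consistent with the text."""
    difference = np.abs(Z) - C * np.abs(X)
    return difference >= threshold
```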
[0054] According to an optional feature of the invention, the designator is configured to filter
difference values across time frequency tiles. The filtering includes time frequency tiles that
differ in both time and frequency.
[0055] This provides an improved designation of time frequency tiles in many scenarios and
applications, and as a result provides improved noise suppression.
[0056] According to an optional feature of the invention, the gain unit is configured to filter gain
values across multiple time frequency tiles. The filtering includes time frequency tiles that differ
in both time and frequency.
[0057] This can provide substantially improved performance and can typically allow a substantially improved signal-to-noise ratio. The approach may improve noise suppression by applying filtering to the gain values for the time-frequency tiles, where the filtering is performed in both frequency and time.
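As an illustration, gain values arranged as a (frame, bin) matrix can be filtered over a neighbourhood extending in both time and frequency, for example with a simple moving average; the window sizes below are arbitrary example choices.

```python
import numpy as np

def smooth_gains(gains, time_radius=1, freq_radius=2):
    """Average gain values over a neighbourhood of time-frequency tiles.

    gains has shape (frames, bins); the averaging window extends over both
    time and frequency, as described above."""
    padded = np.pad(gains, ((time_radius, time_radius), (freq_radius, freq_radius)), mode="edge")
    smoothed = np.zeros_like(gains)
    for dt in range(2 * time_radius + 1):
        for df in range(2 * freq_radius + 1):
            smoothed += padded[dt : dt + gains.shape[0], df : df + gains.shape[1]]
    return smoothed / ((2 * time_radius + 1) * (2 * freq_radius + 1))
```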
[0058] According to an optional feature of the invention, the gain unit is configured to filter at least one of the absolute-value time-frequency tile values of the first frequency domain signal and the absolute-value time-frequency tile values of the second frequency domain signal. The filtering includes time-frequency tiles that differ in both time and frequency.
[0059] This can provide substantially improved performance and can typically allow a substantially improved signal-to-noise ratio. The approach may improve noise suppression by applying filtering to the signal values for the time-frequency tiles, where the filtering is performed in both frequency and time.
[0060] In many embodiments, a gain unit is configured to filter both an absolute value time
frequency tile value of the first frequency domain signal and an absolute value time frequency
tile value of the second frequency domain signal. Here, the filtering includes time frequency tiles
that differ in both time and frequency.
[0061] According to an optional feature of the invention, the noise suppressor further comprises an audio beamformer configured to generate the first microphone signal and the second microphone signal from signals from a microphone array.
[0062] This can improve performance and can allow an improved signal-to-noise ratio of the noise-suppressed signal. In particular, the approach may allow a reference signal with a reduced contribution from the desired source to be processed by the algorithm, providing improved designation and/or noise suppression.
[0063] According to an optional feature of the invention, the noise suppressor further comprises an adaptive canceller for cancelling, from the first microphone signal, signal components of the first microphone signal that are correlated with the second microphone signal.
[0064] This can improve performance and can allow an improved signal-to-noise ratio of the noise-suppressed signal. In particular, the approach may allow a reference signal with a reduced contribution from the desired source to be processed by the algorithm, providing improved designation and/or noise suppression.
[0065] According to an optional feature of the invention, the difference measure is determined as the difference between a first value given as a monotonic function of the absolute-value time-frequency tile value of the first frequency domain signal and a second value given as a monotonic function of the absolute-value time-frequency tile value of the second frequency domain signal.
[0066] According to an aspect of the present invention, there is provided a method of suppressing noise in a first microphone signal, the method comprising: generating a first frequency domain signal from a frequency transformation of the first microphone signal, the first frequency domain signal being represented by time-frequency tile values; generating a second frequency domain signal from a frequency transformation of a second microphone signal, the second frequency domain signal being represented by time-frequency tile values; determining a time-frequency tile gain in response to a difference measure for an absolute-value time-frequency tile value of the first frequency domain signal and an absolute-value time-frequency tile value of the second frequency domain signal; and generating an output frequency domain signal by scaling the time-frequency tile values of the first frequency domain signal by the time-frequency tile gain. The method further comprises designating a time-frequency tile of the first frequency domain signal as a speech tile or a noise tile, and the time-frequency tile gain is determined in response to the designation of the time-frequency tile of the first frequency domain signal as a speech tile or a noise tile.
[0067] In some embodiments, the method may further include generating an output signal from
frequency to time conversion of the output frequency domain signal.
[0068] These and other aspects, features and advantages of the present invention will be
apparent from, and elucidated with reference to, the embodiments described hereinafter.
[0069] Embodiments of the invention will be described, by way of example only, with reference to the drawings. FIG. 1 shows an example of a noise suppressor according to the prior art. FIGS. 2 and 3 show examples of the noise suppression performance of a prior art noise suppressor. FIG. 4 illustrates an example of a noise suppressor in accordance with some embodiments of the present invention. FIG. 5 illustrates an example of a noise suppressor configuration in accordance with some embodiments of the present invention. FIG. 6 shows an example of a converter from the time domain to the frequency domain. FIG. 7 shows an example of a converter from the frequency domain to the time domain. Further figures show examples of elements of a noise suppressor according to some embodiments of the invention and further examples of noise suppressor configurations in accordance with some embodiments of the present invention.
[0070] The inventor of the present application has recognized that the prior art approach of FIG. 1 gives non-optimum performance for non-stationary/diffuse noise, and that improvements can be made by introducing specific concepts that can mitigate or eliminate the performance limitations experienced by the system of FIG. 1 for non-stationary/diffuse noise.
[0071] Specifically, the inventors have come to realize that the approach of FIG. 1 has a limited range of signal-to-noise ratio improvement (SNRI) for diffuse noise. In particular, the inventors have recognized that increasing the oversubtraction factor γ_n in the conventional gain function described above can introduce other adverse effects, in particular an increase of the speech attenuation during speech.
[0072] This can be understood by looking at the characteristics of an ideal spherically isotropic diffuse noise field. When two microphones are arranged a distance d apart to provide the microphone signals X_1(t_k, ω_l) and X_2(t_k, ω_l) respectively, the following holds with the wavenumber k = ω/c (c being the speed of sound), the real and imaginary parts of X_1(t_k, ω_l) and X_2(t_k, ω_l) being Gaussian distributed with variance σ^2.
[0073] The coherence function between X_1(t_k, ω_l) and X_2(t_k, ω_l) is then given by sin(kd)/(kd). [0074] From this coherence function it follows that X_1(t_k, ω_l) and X_2(t_k, ω_l) become uncorrelated for higher frequencies and large distances. For example, if the distance is greater than 3 meters, X_1(t_k, ω_l) and X_2(t_k, ω_l) are substantially uncorrelated for frequencies above 200 Hz.
[0075] Using these characteristics, C(t_k, ω_l) = 1, and the gain function then reduces accordingly (the coherence term drops out).
[0076] Assuming that there is no speech, i.e. Z(t_k, ω_l) = Z_n(t_k, ω_l), and looking at the numerator, |Z(t_k, ω_l)| and |X(t_k, ω_l)| are Rayleigh distributed, because their real and imaginary parts are Gaussian and independent. It is further assumed that γ_n = 1 and θ = 0. Consider the variable d = |Z(t_k, ω_l)| - |X(t_k, ω_l)|.
[0077] The mean of the difference of the two random variables is equal to the difference of their means: E{d} = 0.
[0078] The variance of the difference of the two random variables is equal to the sum of the individual variances: var(d) = (4 - π)σ^2.
[0079] Limiting d to 0 (i.e. forcing negative values to 0), the power of d is half the value of the variance of d, because the distribution of d is symmetric around 0: E{d^2} = (4 - π)σ^2 / 2.
[0080] Comparing the power of the residual signal with the power of the input signal (2σ^2), the following is obtained for the suppression achieved by the post-processor: A = -10·log10(1 - π/4) = 6.68 dB.
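The 6.68 dB figure can be reproduced numerically; the following sketch draws Rayleigh-distributed amplitudes from complex Gaussian noise and evaluates the clipped difference, matching the analytic bound.

```python
import numpy as np

# Monte-Carlo check of the analytic bound A = -10*log10(1 - pi/4) ~ 6.68 dB
# for noise-only input with gamma_n = 1 and theta = 0 (illustrative sketch).
rng = np.random.default_rng(0)
sigma, n = 1.0, 1_000_000
Z = np.abs(rng.normal(0, sigma, n) + 1j * rng.normal(0, sigma, n))   # Rayleigh amplitudes
X = np.abs(rng.normal(0, sigma, n) + 1j * rng.normal(0, sigma, n))   # Rayleigh amplitudes
d = np.maximum(Z - X, 0.0)                       # residual after clipping at zero
A = -10 * np.log10(np.mean(d ** 2) / (2 * sigma ** 2))
print(A, -10 * np.log10(1 - np.pi / 4))          # both close to 6.68 dB
```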
[0081] Thus, the attenuation is limited to a relatively low value of less than 7 dB for the case
where only background noise is present.
[0082] Considering the bounded variable d_b = max(|Z(t_k, ω_l)| - γ_n·|X(t_k, ω_l)|, 0), which is of interest when trying to increase the noise suppression by increasing γ_n, the attenuation of the post-processor can be derived as A = -10·log10{(γ_n/2)·(-π + 2/γ_n + 2·arctan(γ_n))}.
[0083] The attenuation is a function of the oversubtraction factor γ_n; some exemplary values are shown below.
[0084] As can be seen, a large oversubtraction factor is required to reach a noise suppression of, for example, 10 dB or more.
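Using the attenuation expression derived above, exemplary values can be computed as follows (the chosen γ_n values are arbitrary examples):

```python
import numpy as np

def attenuation_db(gamma_n):
    """Noise-only attenuation of the post-processor as a function of the
    oversubtraction factor, per the expression derived above."""
    r = (gamma_n / 2.0) * (-np.pi + 2.0 / gamma_n + 2.0 * np.arctan(gamma_n))
    return -10.0 * np.log10(r)

for g in (1.0, 1.5, 1.8, 2.0, 3.0):
    print(f"gamma_n = {g:.1f}: A = {attenuation_db(g):.1f} dB")
# gamma_n = 1.0 gives about 6.7 dB; reaching 10 dB or more requires a
# gamma_n of roughly 1.7 or higher.
```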
[0085] Next, considering the influence of the noise subtraction on the remaining speech amplitude, |Z(t_k, ω_l)| ≤ |Z_s(t_k, ω_l)| + |Z_n(t_k, ω_l)|.
[0086] Thus, subtraction of the noise component from | Z (t k, ω l) | leads easily to
oversubtraction, even for γ n as small as one.
[0087] The powers of |Z(t_k, ω_l)| and of (|Z(t_k, ω_l)| - |Z_s(t_k, ω_l)|) can be determined as a function of the speech amplitude v = |Z_s(t_k, ω_l)| and the noise power 2σ^2, for example by simulation or numerical analysis. FIG. 2 shows the result for the case 2σ^2 = 1.
[0088] As can be seen from FIG. 2, for large v, the powers of | Z (tk, ωl) | and | Zs (tk, ωl) | are
close to each other. As a result, subtraction of the noise estimate | X (t k, ω l) | leads to
oversubtraction.
[0089] Speech attenuation: [0090] For v > 2, the speech attenuation is about 2 dB. For smaller v, in particular v < 1, not all noise is suppressed because of the large variance of d_s = |Z(t_k, ω_l)| - |X(t_k, ω_l)|. For those values, d_s can be negative and, as in the noise-only case, the values are clipped to the lower limit θ ≥ 0. For larger v, d_s is not negative and limiting to 0 does not affect performance.
[0091] If the oversubtraction factor γ_n is increased, the speech attenuation increases, as shown in FIG. 3. FIG. 3 corresponds to FIG. 2, but E{(|Z(t_k, ω_l)| - γ_n·|X(t_k, ω_l)|)^2} is given for γ_n = 1 and γ_n = 1.8 respectively and compared to the desired output.
[0092] For v> 2, an increase in speech distortion in the range of 4 to 5 dB is seen. For v <2, the
output increases for γ n = 1.8. This can be prevented by limiting to 0 as discussed above.
[0093] The 4 dB gain of noise suppression when going from γ n = 1 to γ n = 1.8 is canceled by
a 2 to 3 dB greater speech attenuation, thus leading to an SNR improvement of only 1 to 2 dB or
so. This is typical for diffuse-like noise fields. The overall SNR improvement is limited to about 12
dB.
[0094] Thus, while the approach may lead to an effective noise suppression in practice with an
improved SNR, this suppression is still practically limited to a relatively modest SNR
improvement of not more than 10 dB.
[0095] FIG. 4 illustrates an example of a noise suppressor in accordance with some embodiments
of the present invention. The noise suppressor of FIG. 4 may provide substantially higher SNR
improvement for diffusive noise than is typically possible in the system of FIG. In fact,
simulations and practical tests have shown that SNR improvements of more than 20-30 dB are
typically possible.
[0096] The noise suppressor comprises a first transducer 401 that receives a first microphone
signal from a microphone (not shown). The first microphone signal may be captured, filtered,
amplified, etc. as known in the art. Additionally, the first microphone signal may be a digital time
domain signal generated by sampling the analog signal.
[0097] The first transducer 401 is configured to generate a first frequency domain signal by
applying a frequency transform to the first microphone signal. In particular, the first microphone
signal is divided into time segments / intervals. Each time segment / interval contains a set of
samples, which are converted to a set of frequency domain samples, for example by FFT. Thus,
the first frequency domain signal is represented by frequency domain samples, each frequency
domain sample corresponding to a particular time interval and a particular frequency interval.
Each such frequency interval and time interval is typically known in the art as a time frequency
tile. Thus, the first frequency domain signal is represented by the values for each of the plurality
of time frequency tiles, ie, by the time frequency tile values.
[0098] The noise suppressor further comprises a second transducer 403 that receives a second
microphone signal from a microphone (not shown). The second microphone signal may be
captured, filtered, amplified, etc. as known in the art. Additionally, the second microphone signal
may be a digital time domain signal generated by sampling the analog signal.
[0099] The second transducer 403 is configured to generate a second frequency domain signal
by applying a frequency transform to the second microphone signal. In particular, the second
microphone signal is divided into time segments / intervals. Each time segment / interval
contains a set of samples, which are converted to a set of frequency domain samples, for example
by FFT. Thus, the second frequency domain signal is represented by the values for each of the
plurality of time frequency tiles, ie by the time frequency tile values.
[0100] The first and second microphone signals are hereinafter referred to as z(n) and x(n) respectively, and the first and second frequency domain signals are referred to by corresponding vectors, [0101] each vector containing all M frequency tile values for a given processing/transformation time segment/frame. In use, z(n) is assumed to contain noise and speech, while x(n) is assumed to contain noise only. Furthermore, the noise components of z(n) and x(n) are assumed to be uncorrelated. (These components are assumed to be uncorrelated in time; however, it is typically assumed that there is a relationship between their mean amplitudes, which is represented by the coherence term.) Such assumptions tend to be valid in a scenario in which the first microphone (which captures z(n)) is located in close proximity to the speaker, while the second microphone is located at some distance from the speaker and the noise is, for example, distributed throughout the room. Such a scenario is illustrated in FIG. 5, where the noise suppressor is depicted as the SUPP unit.
[0102] Following the transformation to the frequency domain, it is assumed that the real and
imaginary components of the temporal frequency values have a Gaussian distribution. This
assumption is typically accurate, for example, for scenarios where noise originates from diffuse
sound fields, for sensor noise and for some other noise sources experienced in many practical
scenarios.
[0103] FIG. 6 shows an example of functional elements of a possible implementation of the first
and second transformation units 401, 403. In this example, a serial to parallel converter
produces overlapping blocks (frames) of 2B samples, which are then Hanning windowed and
transformed into the frequency domain by a fast Fourier transform (FFT).
[0104] The first converter 401 is coupled to the first absolute value unit 405. A first absolute
value unit 405 determines the absolute value of the time frequency tile value, thereby producing
an absolute value time frequency tile value for the first frequency domain signal.
[0105] Likewise, the second converter 403 is coupled to the second magnitude unit 407. The
second absolute value unit 407 determines the absolute value of the time frequency tile value,
thereby generating an absolute value time frequency tile value for the second frequency domain
signal.
[0106] The outputs of the first and second absolute value units 405, 407 are fed to a gain unit 409. Gain unit 409 is configured to determine a gain for each time-frequency tile based on the absolute-value time-frequency tile value of the first frequency domain signal and the absolute-value time-frequency tile value of the second frequency domain signal. [0107] The gain unit 409 thus calculates a vector of time-frequency tile gains.
[0108] More particularly, the gain unit 409 determines a difference indicator that indicates the difference between a time-frequency tile value of the first frequency domain signal and a prediction of that time-frequency tile value generated from the time-frequency tile value of the second frequency domain signal. Thus, the difference indicator may be a prediction-difference indicator. In some embodiments, the prediction may simply be that the time-frequency tile value of the second frequency domain signal is used directly as the prediction of the time-frequency tile value of the first frequency domain signal.
[0109] The gain is then determined as a function of the difference measure. Specifically, a difference measure may be determined for each time-frequency tile, and the gain may be set such that the higher the difference measure (i.e. the stronger the indication of a difference), the higher the gain. Thus, the gain may be determined as a monotonically increasing function of the difference indicator.
[0110] As a result, time-frequency tile gains are determined such that the gain is relatively low for time-frequency tiles in which the difference indicator is relatively low, i.e. for time-frequency tiles in which the value of the first frequency domain signal can be accurately predicted from the value of the second frequency domain signal, and relatively high for time-frequency tiles in which the difference indicator is high, i.e. for time-frequency tiles in which the value of the first frequency domain signal cannot be effectively predicted from the value of the second frequency domain signal. Thus, the gain for a time-frequency tile in which the first frequency domain signal has a high probability of containing significant speech components is determined to be higher than the gain for a tile in which the first frequency domain signal has a low probability of containing speech components. The generated time-frequency tile gain is a scalar value in this example.
[0111] Gain unit 409 is coupled to scaler 411, which receives the gains and scales the time-frequency tile values of the first frequency domain signal by these time-frequency tile gains. [0112] In particular, in the scaler 411 the signal vector [0113] is multiplied element by element by the gain vector, [0114] giving the resulting output signal vector.
[0115] The scaler 411 thus produces a third frequency domain signal, also called the output frequency domain signal. This corresponds to the first frequency domain signal but with a spectral shaping corresponding to the expected speech component. Because the gain values are scalar values, the individual time-frequency tile values of the first frequency domain signal are scaled in amplitude only, and the time-frequency tile values of the third frequency domain signal have the same phase as the corresponding values of the first frequency domain signal.
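A short illustration of the scaler: multiplying each tile by a real, non-negative gain changes only the amplitude, so the phase of the first frequency domain signal is preserved.

```python
import numpy as np

def apply_tile_gains(Z, gains):
    """Scaler: element-wise scaling of the main-microphone tiles.

    Because the gains are real, non-negative scalars, only the amplitude of
    each tile is changed; the phase of Z is preserved in the output."""
    Q = gains * Z
    assert np.allclose(np.angle(Q[gains > 0]), np.angle(Z[gains > 0]))
    return Q
```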
[0116] The scaler 411 is coupled to an optional third converter 413, which receives the third frequency domain signal. The third converter 413 is configured to generate an output signal from a frequency-to-time conversion of the third frequency domain signal. Specifically, the third converter 413 may perform the inverse of the transformation applied to the first microphone signal by the first converter 401. In some embodiments, the third (output) frequency domain signal may be used directly, for example for frequency domain speech recognition or speech enhancement. In such embodiments, there is no need for a third converter 413.
[0117] Specifically, as shown in FIG. 7, the third frequency domain signal may be converted back to the time domain, [0118] and the time domain signal may then be reconstructed by adding the last B samples of the previous frame to the first B samples of the current (latest) frame (transformed segment), taking into account the overlap and windowing of the first microphone signal applied by the first converter 401. Finally, the resulting blocks [0119] can be converted to a continuous output signal stream q(n) by a parallel-to-serial converter.
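For illustration, the frequency-to-time conversion with overlap-add reconstruction sketched above may look as follows, matching the 2B-sample frames and frame shift B assumed in the earlier analysis sketch:

```python
import numpy as np

def overlap_add(Q, B=256):
    """Frequency-to-time conversion with overlap-add reconstruction.

    Q holds the output tiles, one row per frame of 2B samples with a frame
    shift of B.  The last B samples of each inverse-transformed frame
    overlap the first B samples of the next frame and are added together."""
    frames = np.fft.irfft(Q, n=2 * B)            # back to the time domain
    q = np.zeros(B * (len(frames) + 1))
    for k, frame in enumerate(frames):
        q[k * B : k * B + 2 * B] += frame        # add overlapping halves
    return q                                     # continuous output stream q(n)
```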
[0120] However, the noise suppressor of FIG. 4 does not calculate the time-frequency tile gain based solely on the difference measure. Rather, the noise suppressor is configured to designate each time-frequency tile as being a speech (time-frequency) tile or a noise (time-frequency) tile and to determine the gain in dependence on the designation. Specifically, the function used to determine the gain for a given time-frequency tile as a function of the difference indicator is different when the time-frequency tile is designated as belonging to a speech frame than when it is designated as belonging to a noise frame.
[0121] The noise suppressor of FIG. 4 specifically includes a designator 415 configured to
designate the time-frequency tile of the first frequency-domain signal as an utterance tile or noise
tile.
[0122] It will be appreciated that there are many different techniques and algorithms for determining whether a signal component corresponds to speech, and that any such approach may be used as appropriate. For example, a time-frequency tile belonging to a signal portion may be designated as a speech time-frequency tile if the signal portion is presumed to include speech components, and otherwise it may be designated as a noise tile.
[0123] Thus, in many embodiments, the designation of time-frequency tiles is a designation to
speech and non-speech tiles. In fact, noise tiles may be considered equivalent to non-speech tiles
(in fact, all non-speech can be considered as noise since the desired signal component is the
speech component).
[0124] In many embodiments, the designation of a time-frequency tile as a speech or noise (time-frequency) tile may be based on a comparison of the first and second microphone signals and/or a comparison of the first and second frequency domain signals. In particular, the closer the correlation between the amplitudes of the signals, the less likely it is that the first microphone signal contains significant speech components.
[0125] The designation of time-frequency tiles as speech or noise tiles (where each category may, in some embodiments, be further subdivided into subcategories) may in some embodiments be performed for each time-frequency tile individually, while in many embodiments it may be performed for groups of time-frequency tiles.
[0126] In particular, in the example of FIG. 4, the designator 415 is configured to generate one designation for each time segment/transform block. Thus, for each time segment, it may be estimated whether the first microphone signal contains a significant speech component. If it does, all time-frequency tiles of that time segment are designated as speech time-frequency tiles; otherwise they are designated as noise time-frequency tiles.
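A minimal sketch of such per-segment designation, assuming (purely for illustration) that the frame decision is taken from the frame-summed difference values and then applied to every tile of the frame:

```python
import numpy as np

def designate_per_frame(Z, X, C=1.0, threshold=0.0):
    """Per-segment designation as in the example of FIG. 4.

    One decision is made per frame (here from the frame-summed difference
    values, an illustrative criterion) and then applied to every
    time-frequency tile of that frame."""
    frame_score = np.sum(np.abs(Z) - C * np.abs(X), axis=1)
    speech_frame = frame_score > threshold                    # one flag per frame
    return np.broadcast_to(speech_frame[:, None], Z.shape)    # flag per tile
```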
[0127] In the example of FIG. 4, the designator 415 is coupled to the first and second absolute value units 405, 407 and is configured to designate time-frequency tiles based on the magnitudes of the first and second frequency domain signals. However, it will be understood that in many embodiments the designation may alternatively or additionally be based on, for example, the first and second microphone signals and/or the first and second frequency domain signals themselves.
[0128] Designator 415 is coupled to gain unit 409, which receives the designations of the time-frequency tiles. That is, gain unit 409 receives information about which time-frequency tiles are designated as speech tiles and which are designated as noise tiles.
[0129] The gain unit 409 is configured to calculate a time frequency tile gain in response to
specifying the time frequency tile of the first frequency domain signal as an utterance tile or a
noise tile.
[0130] Thus, the gain calculation depends on the designation, and the resulting gain is different for time-frequency tiles designated as speech tiles than for time-frequency tiles designated as noise tiles. This difference or dependency may, for example, be implemented by the gain unit 409 having two alternative algorithms or functions for calculating the gain value from the difference indicator and being configured to select between the two functions for a time-frequency tile based on the designation. Alternatively or additionally, gain unit 409 may use different parameter values for a single function, the parameter values depending on said designation.
[0131] Gain unit 409 is configured to determine a lower gain value for the time frequency tile
gain when the corresponding time frequency tile is specified as a noise tile than when it is
specified as a speech tile. Thus, if all other parameters used to determine gain are invariant, gain
unit 409 calculates lower gain values for noise tiles than for speech tiles.
[0132] In the example of FIG. 4, the designation is segment/frame based. That is, the same designation applies to all time-frequency tiles of a time segment/frame. Thus, the gain for time segments/frames estimated to contain significant speech is set higher (all other parameters being equal) than for time segments estimated not to contain significant speech.
[0133] In many embodiments, the difference value for a given time frequency tile may depend on
whether the time frequency tile is designated as a noise tile or an utterance tile. Thus, in some
embodiments, the same function may be used to calculate the gain from the difference measure,
but the calculation of the difference measure itself may depend on the designation of the time
frequency tile.
[0134] In many embodiments, the difference measure may be determined as a function of an
absolute value time frequency tile value of each of the first and second frequency domain signals.
[0135] In fact, in many embodiments, the difference measure may be determined as the difference between a first value and a second value, where the first value is generated as a function of at least one time-frequency tile value of the first frequency domain signal and the second value is generated as a function of at least one time-frequency tile value of the second frequency domain signal. However, the first value may be independent of the at least one time-frequency tile value of the second frequency domain signal, and the second value may be independent of the at least one time-frequency tile value of the first frequency domain signal.
[0136] The first value for the first time-frequency tile may in particular be generated as a monotonically increasing function of the absolute-value time-frequency tile value of the first frequency domain signal in the first time-frequency tile. Similarly, the second value for the first time-frequency tile may in particular be generated as a monotonically increasing function of the absolute-value time-frequency tile value of the second frequency domain signal in the first time-frequency tile.
[0137] At least one of the functions for calculating the first and second values may depend on
whether the time frequency tile is designated as a speech time frequency tile or a noise time
frequency tile. For example, the first value may be higher if the temporal frequency tile is an
utterance tile than if it is a noise tile. Alternatively or additionally, the second value may be lower
if the temporal frequency tile is an utterance tile than if it is a noise tile.
[0138] An example of a function for calculating the gain may in particular be the following: [0139] where α is a factor less than 1, C(t_k, ω_l) is an estimated coherence term representing the correlation between the amplitude of the first frequency domain signal and the amplitude of the second frequency domain signal, and the oversubtraction factor γ_n is a design parameter. For some applications, C(t_k, ω_l) can be approximated as one. The oversubtraction factor γ_n is typically in the range of 1 to 2.
[0140] Typically, the gain function is limited to positive values, and typically a minimum gain value is set. [0141] Thus, the above function may be limited from below by a minimum gain value θ. [0142] Thereby, the maximum attenuation of the noise suppression can be set by θ, which must be greater than or equal to zero. For example, if the minimum gain value is set to θ = 0.1, the maximum attenuation is 20 dB. Since the attenuation of the unconstrained gain function would be larger (in practice between 30 and 40 dB), this results in background noise that sounds more natural. This is particularly appreciated for communication applications.
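The following sketch illustrates a gain of this general shape. Since the exact equation of [0138] is not reproduced here, the placement of the factor α is an assumption made for illustration: it scales the subtracted noise term for speech tiles, so that speech tiles receive higher gains than noise tiles for the same tile values.

```python
import numpy as np

def tile_gain(Z_abs, X_abs, is_speech, C=1.0, gamma_n=1.5, alpha=0.5, theta=0.1):
    """Designation-dependent gain, floored at the minimum gain theta.

    Assumed form for illustration: the subtracted noise term gamma_n*C*|X|
    is scaled by alpha (< 1) for tiles designated as speech, so speech tiles
    receive higher gains than noise tiles.  The exact role of alpha in the
    patented gain function may differ."""
    noise_term = gamma_n * C * X_abs * np.where(is_speech, alpha, 1.0)
    gain = (Z_abs - noise_term) / np.maximum(Z_abs, 1e-12)
    return np.maximum(gain, theta)    # theta = 0.1 limits the attenuation to 20 dB
```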
[0143] In the present example, the gain is thus determined as a function of the numerator, which is the difference indicator. Furthermore, the difference indicator is determined as the difference between two terms (values). The first term/value is a function of the absolute value of the time-frequency tile value of the first frequency domain signal. The second term/value is a function of the absolute value of the time-frequency tile value of the second frequency domain signal. Furthermore, the function for calculating the second value depends on whether the time-frequency tile is designated as a noise or a speech time-frequency tile (i.e. on whether the time-frequency tile belongs to a noise or a speech frame).
[0144] In the present example, gain unit 409 is configured to determine a noise coherence estimate C(t_k, ω_l) indicative of a correlation between the amplitude of the second microphone signal and the amplitude of the first microphone signal. The function for determining the second value (or possibly the first value) in this case depends on the noise coherence estimate. This allows a more appropriate gain value to be determined, because the second value then more accurately reflects the expected or estimated noise component in the first frequency domain signal.
[0145] It will be appreciated that any suitable technique for determining the noise coherence estimate C(t_k, ω_l) may be used. For example, in one possible calibration the speaker is asked not to speak, the first and second frequency domain signals are compared, and the noise coherence estimate C(t_k, ω_l) for each time-frequency tile is determined simply as an average of the ratio of the time-frequency tile values of the first frequency domain signal and the second frequency domain signal.
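A sketch of such a calibration-style estimate, assuming that tiles recorded while the speaker is silent are available:

```python
import numpy as np

def estimate_coherence(Z_noise, X_noise, eps=1e-12):
    """Calibration-style noise coherence estimate per frequency bin.

    Z_noise, X_noise: STFT tiles recorded while the speaker is silent
    (shape: frames x bins).  C is the time-averaged ratio of the tile
    amplitudes, one value per frequency bin, as sketched in the text."""
    ratio = np.abs(Z_noise) / np.maximum(np.abs(X_noise), eps)
    return ratio.mean(axis=0)
```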
[0146] In many embodiments, the dependence of the gain on whether a time-frequency tile is designated as a speech tile or a noise tile is not constant but is itself dependent on one or more parameters. For example, the factor α may not be constant in some embodiments, but may rather be a function of characteristics (whether direct or derived) of the received signals.
[0147] In particular, the gain difference may depend on at least one of the signal level of the first
microphone signal; the signal level of the second microphone signal; and the signal to noise
estimate for the first microphone signal. These values may be average values over time frequency
tiles, in particular over frequency values and segments. These may in particular be (relatively
long-term) indicators for the signal as a whole.
[0148] In some embodiments, the factor α may be given as α = f(-v^2 / 2σ^2), where v is the amplitude of the first microphone signal and σ^2 is the energy/variance of the second microphone signal. Thus, in this example, α depends on the signal-to-noise ratio for the first microphone signal. This may provide improved perceived noise suppression. In particular, for low signal-to-noise ratios, strong noise suppression is performed, thereby improving e.g. the intelligibility of the resulting signal, while for higher signal-to-noise ratios the effect is reduced, thereby reducing distortion.
[0149] Thus, the function f(-v^2 / 2σ^2) can be determined and used to adapt the calculation of the gain for the speech signal. The function depends on (-v^2 / 2σ^2), which corresponds to the SNR, i.e. the energy v^2 of the speech signal relative to the noise energy 2σ^2.
[0150] It will be understood that various functions and techniques for determining the gain, based on the difference between the absolute values of the first and second microphone signals and on the designation of the tile as speech or noise, may be used in various embodiments.
[0151] In fact, while the particular approach described above may provide particularly advantageous performance in many embodiments, in other embodiments many other functions and techniques may be used, depending on the particular characteristics of the application.
[0152] The difference measure may be calculated as: d(tk, ωl) = f1(|Z(tk, ωl)|) − f2(|X(tk, ωl)|), where f1(x) and f2(x) can be chosen as any monotonic functions that meet the preferences and requirements of the individual embodiment. Typically, the functions f1(x) and f2(x) are monotonically increasing functions.
[0153] Thus, the difference measure indicates the difference between a first monotonic function f1(x) of the absolute time-frequency tile value of the first frequency domain signal and a second monotonic function f2(x) of the absolute time-frequency tile value of the second frequency domain signal. In some embodiments, the first and second monotonic functions may be the same function. However, in most embodiments, the two functions are different.
[0154] Furthermore, one or both of the functions f1(x) and f2(x) may depend on various other parameters and indicators, such as the overall averaged power level of the microphone signal, the frequency, etc.
[0155] In many embodiments, one or both of the functions f1(x) and f2(x) may depend on signal values for other time-frequency tiles. For example, they may depend on one or more averages of Z(tk, ωl), |Z(tk, ωl)|, f1(|Z(tk, ωl)|), X(tk, ωl), |X(tk, ωl)| or f2(|X(tk, ωl)|) (i.e. averages of the values over various indices k and/or l). In many embodiments, averaging over neighborhoods extending in both the time and frequency dimensions may be performed. Although specific examples based on the specific difference measure equation given above are described later, it will be understood that the corresponding approach may be applied to other algorithms or functions that determine the difference measure.
[0156] An example of a possible function for determining the difference measure is: d(tk, ωl) = |Z(tk, ωl)|^α − γ·|X(tk, ωl)|^β, where α and β are design parameters, typically with α = β.
[0157] Here, σ(ωl) is a suitable weighting function used to provide the desired spectral characteristics of the noise suppression. For example, it may be used to increase the noise suppression for higher frequencies, which typically contain relatively large amounts of noise energy but are likely to contain relatively little speech energy, and to reduce the noise suppression for mid frequencies, which are likely to contain relatively large amounts of speech energy but relatively little noise energy. In particular, σ(ωl) may be used to provide the desired spectral characteristics of the noise suppression while keeping the spectral shaping of the speech at a low level.
[0158] It will be understood that these functions are merely exemplary, and that many other
equations and algorithms can be envisioned to calculate a distance indicator that indicates the
difference between the absolute values of the two microphone signals.
[0159] In the above equation, the factor γ represents the factor introduced to bias the difference
indicator towards negative values. While these examples introduce this bias as a simple scale
factor applied to the time-frequency tile of the second microphone signal, it will be appreciated
that many other approaches are possible.
[0160] In fact, any suitable way of constructing the first and second functions f1(x) and f2(x) may be used to provide a bias towards negative values, at least for the noise tiles. The bias is in particular one that produces a negative expected value of the difference measure when there is no speech, as in the previous examples. Indeed, if both the first and second microphone signals contain only random noise (e.g. with sample values distributed symmetrically and randomly around the mean value), the expected value of the difference measure is not zero but negative. In the previous example, this was achieved by the oversubtraction factor γ, which in the absence of speech leads to negative results.
[0161] In order to compensate for differences in the signal levels of the first and second microphone signals when speech is present, the gain unit may, as mentioned earlier, determine a noise coherence estimate that indicates a correlation between the amplitude of the second microphone signal and the amplitude of the noise component of the first microphone signal. The noise coherence estimate may, for example, be generated as an estimate of the ratio between the amplitudes of the first and second microphone signals. Noise coherence estimates may be determined for individual frequency bands, in particular for each time-frequency tile. Various techniques for estimating the amplitude/magnitude relationship between two microphone signals are known to those skilled in the art and will not be described in further detail. For example, average amplitude estimates for different frequency bands may be determined during periods in which there is no speech (e.g. identified by dedicated manual measurement or by automatic detection of speech pauses).
[0162] In the present system, at least one of the first and second monotonic functions f1(x) and f2(x) may compensate for the amplitude difference. In the previous example, the second monotonic function compensated for the amplitude difference by scaling the absolute value of the second microphone signal by the value C(tk, ωl). In other embodiments, the compensation may alternatively or additionally be performed by the first monotonic function, for example by scaling the absolute value of the first microphone signal by 1/C(tk, ωl).
[0163] Furthermore, in most embodiments, the first monotonic function and the second monotonic function are such that, when the amplitude relationship between the first microphone signal and the second microphone signal corresponds to the estimated correlation and the time-frequency tile is designated as a noise tile, a negative expected value of the difference measure is generated.
[0164] In particular, the noise coherence estimate may be calculated such that the estimated or expected absolute-value difference (in particular for a given frequency band) between the first microphone signal and the second microphone signal corresponds to the ratio given by C(tk, ωl). In such a case, the first monotonic function and the second monotonic function are chosen such that, when the corresponding time-frequency tile values have an absolute-value ratio equal to C(tk, ωl) (and the time-frequency tile is designated as a noise tile), the generated difference measure is negative.
[0165] For example, the noise coherence estimate may be determined as
[0166] C(tk, ωl) = E{|Zn(tk, ωl)|} / E{|X(tk, ωl)|}. (In practice, the values may be generated, for example, by averaging a suitable number of values over different time frames.) In such cases, the first and second monotonic functions f1(x) and f2(x) are
[0167] selected to have the property that the difference measure d(tk, ωl) has a negative value. That is, the first and second monotonic functions f1(x) and f2(x) are, for noise tiles,
[0168] chosen such that the expected value of f1(|Z(tk, ωl)|) − f2(|X(tk, ωl)|) is negative.
[0169] In the previous example, this was achieved by the difference measure d(tk, ωl) = |Z(tk, ωl)| − γn·C(tk, ωl)·|X(tk, ωl)| having an oversubtraction factor γn with a value larger than one.
[0170] In this example, f1(x) = x and f2(x) = γn·C(tk, ωl)·x, but it will be understood that infinitely many other monotonic functions may be used instead. Furthermore, in this example, both the compensation for the noise level difference between the first and second microphone signals and the bias towards negative difference values are achieved by including them in the second monotonic function f2(x). However, it will be appreciated that in other embodiments this may alternatively or additionally be achieved by including a compensation factor in the first monotonic function f1(x).
[0171] Furthermore, in the described approach, the gain depends on whether the time-frequency tile is designated as a speech tile or a noise tile. In many embodiments, this may be achieved by making the difference measure depend on whether the time-frequency tile is designated as a speech or a noise tile.
[0172] Specifically, the gain unit may be configured to change at least one of the first monotonic function and the second monotonic function such that the expected value of the difference measure, when the time-frequency tile absolute values actually correspond to the noise coherence estimate, differs depending on whether the time-frequency tile is designated as a speech tile or as a noise tile.
[0173] As an example, when the relative noise level between the two microphone signals is as expected according to the noise coherence estimate, the expected value of the difference measure may be negative if the tile is designated as a noise tile, and less negative if the tile is designated as a speech tile.
[0174] In many embodiments, the expected value may be negative for both speech and noise tiles, but more negative for noise tiles than for speech tiles (i.e. greater in magnitude/absolute value).
[0175] In many embodiments, the first and second monotonic functions f1(x) and f2(x) may also include bias values that are modified depending on whether the tile is a speech tile or a noise tile. As a specific example, the previous example used the difference measure given by d(tk, ωl) = |Z(tk, ωl)| − γn·C(tk, ωl)·|X(tk, ωl)| for noise frames and d(tk, ωl) = |Z(tk, ωl)| − γs·α·C(tk, ωl)·|X(tk, ωl)| for speech frames, with γn > γs.
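The noise/speech dependent bias can be written as a single selection between the two oversubtraction factors. The sketch below assumes γn > γs, a boolean speech/noise designation per tile, and a default α of 1; names and defaults are assumptions of this sketch.

import numpy as np

def biased_difference_measure(Z, X, C, is_speech, gamma_n, gamma_s, alpha=1.0):
    """d(tk, wl) = |Z| - gamma_n * C * |X|            for noise tiles
       d(tk, wl) = |Z| - gamma_s * alpha * C * |X|    for speech tiles
    with gamma_n > gamma_s, so noise tiles are biased more strongly
    towards negative difference values.

    is_speech: boolean array, True where the tile is designated as speech.
    """
    gamma = np.where(is_speech, gamma_s * alpha, gamma_n)  # per-tile factor
    return np.abs(Z) - gamma * C * np.abs(X)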
[0176] Alternatively, in this example, the difference measure may be written as d(tk, ωl) = |Z(tk, ωl)| − γ(D(tk, ωl))·C(tk, ωl)·|X(tk, ωl)|, where D(tk, ωl) is a value indicating whether the tile is a noise tile or a speech tile.
[0177] For completeness, note that the requirement that the calculated difference measure has specific attributes for specific values/attributes of the input signal values provides an objective criterion for the actual function used, and that it does not depend on the actual signal values being processed. In particular,
[0178] the requirement provides a limiting criterion for the function used.
[0179] It will be appreciated that many different functions and techniques may be used to
determine the gain based on the difference measure. The gain is generally constrained to a nonnegative value to avoid phase inversion and associated degradation. In many embodiments, it
may be advantageous to constrain the gain not to fall below a certain minimum gain (thereby
ensuring that any particular frequency band / tile is not completely attenuated).
[0180] For example, in many embodiments, the gain may simply be determined by scaling the difference measure while ensuring that the gain is kept above a minimum gain (which may be zero), so that it is not negative, e.g. G(tk, ωl) = MAX(φ·d(tk, ωl), θ), where φ is a scale factor selected as preferred for the particular embodiment (e.g. determined by trial and error) and θ is a non-negative value.
[0181] In many embodiments, the gain may be a function of other parameters. For example, in many embodiments, the gain may be dependent on attributes of at least one of the first and second microphone signals. In particular, a scale factor may be used to normalize the difference measure. As a specific example, the gain
[0182] may be determined as G(tk, ωl) = MAX(d(tk, ωl)/|Z(tk, ωl)|, θ), that is, with φ(tk, ωl) = 1/|Z(tk, ωl)|. For example, inserting d(tk, ωl) = |Z(tk, ωl)| − γ(D(tk, ωl))·C(tk, ωl)·|X(tk, ωl)| (i.e. d(tk, ωl) = |Z(tk, ωl)| − γn·C(tk, ωl)·|X(tk, ωl)| for noise frames and d(tk, ωl) = |Z(tk, ωl)| − γs·α·C(tk, ωl)·|X(tk, ωl)| for speech frames) corresponds to the above example.
[0183] Thus, the gain calculation may include normalization.
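A minimal sketch of such a normalized and clipped gain, assuming NumPy arrays of time-frequency tile values; the function name and the epsilon guard against division by zero are assumptions of this sketch.

import numpy as np

def tile_gain(d, Z, theta=0.0, eps=1e-12):
    """G(tk, wl) = max(d(tk, wl) / |Z(tk, wl)|, theta): the difference
    measure is normalized by the absolute value of the first frequency
    domain signal and clipped from below by a non-negative minimum gain
    theta, so the gain never becomes negative (avoiding phase inversion)
    and, for theta > 0, no band is attenuated completely.
    """
    return np.maximum(d / (np.abs(Z) + eps), theta)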
[0184] In other embodiments, more complex functions may be used. For example, a non-linear function may be used to determine the gain as a function of the difference measure, for example G(tk, ωl) = MAX(δ·log d(tk, ωl), θ), where δ may be a constant.
[0185] In general, the gain can be determined as a non-negative function of the difference measure: G(tk, ωl) = f3(d(tk, ωl)).
[0186] Typically, the gain can be determined as a monotonic function of the difference measure, in particular as a monotonically increasing function. Thus, a higher gain typically results when the difference measure indicates a greater difference between the first and second microphone signals, reflecting an increased probability that the time-frequency tile contains a large amount of speech (which is mainly captured by the microphone signal of the microphone located near the speaker).
[0187] Similar to the algorithm or function for determining the difference measure, the function
for determining the gain may further depend on other parameters or characteristics. In fact, in
many embodiments, the gain function may depend on the characteristics of one or both of the
first and second microphone signals. For example, as noted above, the function may include
normalization based on the absolute value of the first microphone signal.
[0188] Another example of a possible function for calculating the gain from the difference measure
[0189] may include a suitable weighting function σ(ωl).
[0190] It will be understood that the exact approach for determining the gain, depending on the time-frequency tile values and on the designation as speech or noise tile, may be chosen to provide the desired operating characteristics and performance for the particular embodiment and application.
[0191] Thus, the gain may be determined as G(tk, ωl) = f4(α(tk, ωl), d(tk, ωl)), where α(tk, ωl) reflects whether the tile is designated as a speech tile or a noise tile, and f4 may be any suitable function or algorithm that includes a component reflecting the difference between the absolute time-frequency tile values of the first and second microphone signals.
[0192] Thus, the gain value for a time-frequency tile depends on whether the tile is designated as a speech time-frequency tile or a noise time-frequency tile. In fact, for a given time-frequency tile, the gain is determined such that a lower gain value results when the tile is designated as a noise tile than when it is designated as a speech tile.
[0193] The gain value may be determined by first determining the difference measure and then
determining the gain value from the difference measure. The dependence on noise / speech
designation may be included in the determination of the difference measure, in the determination
of the gain from the difference measure, or in the determination of both the difference measure
and the gain.
[0194] Thus, in many embodiments, the difference measure may depend on whether the time-frequency tile is designated as a noise time-frequency tile or a speech time-frequency tile. For example, one or both of the above functions f1(x) and f2(x) may depend on a value indicating whether the time-frequency tile is designated as noise or speech. The dependency may be such that, for the same microphone signal values, a larger difference measure is calculated when the time-frequency tile is designated as a speech tile than when it is designated as a noise tile.
[0195] For example, in the example given above for the calculation of the gain G(tk, ωl), the numerator may be considered as a difference measure, and so the difference measure differs depending on whether the tile is designated as a speech tile or as a noise tile.
[0196] More generally, the difference measure may be given by d(tk, ωl) = f5(α(tk, ωl), f1(|Z(tk, ωl)|) − f2(|X(tk, ωl)|)), where α(tk, ωl) depends on whether the tile is designated as a speech tile or a noise tile, and f5 depends on α such that the difference measure is larger when α indicates that the tile is a speech tile than when it indicates a noise tile.
[0197] Alternatively or additionally, the function for determining the gain value from the difference measure may depend on the speech/noise designation. Specifically, the following function may be used: G(tk, ωl) = f6(d(tk, ωl), α(tk, ωl)), where α(tk, ωl) depends on whether the tile is designated as a speech tile or a noise tile, and f6 depends on α such that the gain is higher when α indicates that the tile is a speech tile than when it indicates a noise tile. As mentioned earlier, any suitable approach may be used to designate the time-frequency tiles as speech or noise tiles. However, in some embodiments, the designation is advantageously also based on the value obtained by calculating the difference measure under the assumption that the time-frequency tile is a noise tile. Thus, the difference measure for a noise time-frequency tile can be calculated; if this difference measure is low enough, it indicates that the time-frequency tile value of the first frequency domain signal is predictable from the time-frequency tile value of the second frequency domain signal, which is typically the case if the first frequency domain signal does not contain significant speech components. Thus, in some embodiments, a tile is designated as a noise tile if the difference measure calculated using the noise-tile calculation is below a threshold; otherwise, the tile is designated as a speech tile.
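A sketch of this designation rule, reusing the noise-hypothesis difference measure of the earlier sketches; the function name, the fixed threshold default and the boolean output convention are assumptions of this sketch.

import numpy as np

def designate_tiles(Z, X, C, gamma_n, threshold=0.0):
    """Designate each time-frequency tile as noise (False) or speech (True).

    The difference measure is first evaluated under the assumption that
    the tile is a noise tile (oversubtraction factor gamma_n). If the
    result stays below the threshold, |Z| is well predicted by the noise
    reference and the tile is designated as a noise tile; otherwise it is
    designated as a speech tile.
    """
    d_noise = np.abs(Z) - gamma_n * C * np.abs(X)  # noise-hypothesis measure
    return d_noise >= threshold                    # True -> speech tile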
[0198] An example of such an approach is shown in FIG. As shown, the designator 415 of FIG. 4 may calculate the difference value for a time-frequency tile by evaluating the difference measure under the assumption that the time-frequency tile is actually a noise tile. The resulting difference value is input to the tile designator 803, which designates the tile as a noise tile if the difference value is below a given threshold, and otherwise designates it as a speech tile.
[0199] This approach provides very efficient and accurate detection and designation of tiles as
speech or noise tiles. Furthermore, facilitated implementation and operation are achieved by
reusing the function for calculating gains as part of the designator. For example, for all timefrequency tiles designated as noise tiles, the calculated difference measure can be used directly
to determine the gain. Recalculation of the difference measure is only required by the gain unit
409 for time frequency tiles designated as speech tiles.
[0200] In some embodiments, low-pass filtering/smoothing (averaging) may be included in the designation based on the difference value. The filtering may in particular be across different time-frequency tiles in both the frequency domain and the time domain. Thus, filtering may be performed over a plurality of time-frequency tiles within at least one of the time segments, as well as over the difference values of time-frequency tiles belonging to different (nearby) time segments/frames. The inventors have realized that such filtering can provide a substantial performance improvement and a substantially improved designation, and thus substantially improved noise suppression.
[0201] In some embodiments, low-pass filtering/smoothing (averaging) may also be included in the gain calculation. The filtering may in particular be across different time-frequency tiles in both the frequency domain and the time domain. Thus, filtering may be performed over time-frequency tile values belonging to different (nearby) time segments/frames and over a plurality of time-frequency tiles within at least one of said time segments. The inventors have realized that such filtering can provide a substantial performance improvement and substantially improved perceived noise suppression.
[0202] Smoothing (ie low pass filtering) may in particular be applied to the calculated gain
values. Alternatively or additionally, filtering may be applied to the first and second frequency
domain signals prior to gain calculation. In some embodiments, filtering may be applied to the
parameters of the gain calculation, eg, to the difference measure.
[0203] Specifically, in some embodiments, gain unit 409 may be configured to filter gain values
across multiple time frequency tiles. Here, the filtering includes time frequency tiles that differ in
both time and frequency.
[0204] Specifically, the output value may be calculated using an averaged/smoothed version of the unclipped gain.
[0205] In some embodiments, the lower limit on the gain may be applied after the gain averaging, for example
[0206] by calculating the gain as the maximum of the averaged unclipped gain and the minimum gain θ. Here, G(tk, ωl) is calculated as a monotonic function of the difference measure but is not restricted to non-negative values; in fact, the unclipped gain may take negative values for difference measures that are negative.
[0207] In some embodiments, the gain unit may be configured to filter at least one of the absolute time-frequency tile values of the first frequency domain signal and the absolute time-frequency tile values of the second frequency domain signal before they are used to calculate the gain values. Thus, in this example, filtering is effectively performed on the input to the gain calculation rather than on its output.
[0208] An example of this approach is shown in FIG. This example corresponds to the example of FIG. 8, but a low-pass filter 901 is added which performs low-pass filtering of the absolute values of the time-frequency tile values of the first and second frequency domain signals.
[0209] In this example, the absolute time-frequency tile values
[0210] are replaced by their filtered and smoothed versions
[0211] (as indicated in the figure).
[0212] In this example, the previously described functions for determining the gain values for noise and speech tiles, respectively,
[0213] may be replaced by their smoothed counterparts, where ¯ denotes smoothing (averaging) over neighboring values in the (t, ω) plane.
[0214] The filtering may in particular use a uniform window such as a rectangular window in
time and frequency or a window based on characteristics of human hearing. In the latter case,
the filtering may in particular be in accordance with so-called critical bands. The critical band
refers to the frequency bandwidth of the "hearing filter" created by the cochlea. For example, an
octave band or a Bark critical band may be used.
[0215] The filtering may be frequency dependent. In particular, at low frequencies, the average
may be over only a few frequency bins. On the other hand, more frequency bins may be used at
higher frequencies.
[0216] Smoothing/filtering may be performed by averaging over neighboring values, for example:
[0217] Here, for example, N = 1 and W(m, n) is a 3 × 3 matrix with each weight equal to 1/9. N can also depend on the critical band, in which case it depends on the frequency index l. For higher frequencies, N will typically be larger than for lower frequencies.
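A minimal sketch of this neighborhood averaging, using N = 1 and a uniform (2N+1) × (2N+1) window with weights 1/9; the use of SciPy's 2-D convolution and the symmetric boundary handling are implementation assumptions of this sketch.

import numpy as np
from scipy.signal import convolve2d

def smooth_tiles(values, N=1):
    """Average each time-frequency value over its (2N+1) x (2N+1)
    neighborhood in the (t, w) plane, with uniform weights
    W(m, n) = 1 / (2N+1)^2 (1/9 for N = 1).

    values: real array of shape (num_frames, num_bins), e.g. |Z|, |X|,
            the difference measure d, or the unclipped gain.
    """
    size = 2 * N + 1
    W = np.full((size, size), 1.0 / size ** 2)  # uniform averaging window
    return convolve2d(values, W, mode="same", boundary="symm")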
[0218] In some embodiments, the filtering is performed by filtering the difference measure itself,
[0219] for example by calculating it as an average over neighboring values.
[0220] As discussed below, filtering / smoothing may provide substantial performance
improvement.
[0221] Specifically, when filtering in the (tk, ωl) plane, the variance of the noise components, particularly in |Z(tk, ωl)| and |X(tk, ωl)|, is substantially reduced.
[0222] If there is no speech, i.e. |Z(tk, ωl)| = |Zn(tk, ωl)|, and assuming C(tk, ωl) = 1, we obtain
[0223] the corresponding expression for the smoothed difference measure, in which |Z(tk, ωl)| and |X(tk, ωl)| are smoothed over L independent values.
[0224] Smoothing does not change the mean. Thus,
[0225] the expected value of the difference measure is unchanged by the smoothing.
[0226] The variance of the difference of two independent stochastic signals is equal to the sum of the individual variances.
[0227] If the smoothed difference measure is limited from below at 0, then, since its distribution is symmetrical around 0, the power of the clipped signal is half of its variance.
[0228] Comparing the power of the residual signal with the power of the input signal (2σ²), the following noise suppression caused by the post-processor is obtained: A = −10·log10((4 − π)/(4L)) = 6.68 + 10·log10(L) dB.
[0229] As an example, when averaged over nine independent values, an additional 9.5 dB of
suppression is obtained.
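For reference, the expression A = −10·log10((4 − π)/(4L)) can be evaluated directly for a few smoothing sizes L; the short sketch below only evaluates the formula given above (the chosen values of L are illustrative).

import math

def postprocessor_suppression_db(L):
    """A = -10*log10((4 - pi) / (4*L)) = 6.68 + 10*log10(L) dB."""
    return -10.0 * math.log10((4.0 - math.pi) / (4.0 * L))

for L in (1, 9, 25):
    print(L, round(postprocessor_suppression_db(L), 2))
# L = 1  ->  6.68 dB (no smoothing)
# L = 9  -> 16.23 dB, i.e. about 9.5 dB of additional suppression
# L = 25 -> 20.66 dB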
[0230] Oversubtraction combined with smoothing further increases the attenuation.
[0231] If we consider the variance of the smoothed difference measure as compared to the unsmoothed value,
[0232] the smoothing causes a decrease in the variance, so that the
[0233] distribution becomes more concentrated around the expected value. The expected value is negative and is
[0234] given, for Rayleigh-distributed amplitudes with energy 2σ², by (1 − γn)·σ·√(π/2).
[0235] Closed-form expressions for the sum (or difference) of independent Rayleigh random variables cannot be obtained for L ≥ 3. However, simulation results for the attenuation in dB for various smoothing factors L and oversubtraction factors γn are presented in the table below. In this table, the rows correspond to different oversubtraction factors (whose values are given in the first column) and the columns correspond to different averaging areas (the number of tiles averaged over is given in the first row); the first data column corresponds to no smoothing.
[0236] As can be seen, very high attenuation is achieved.
[0237] For speech, the filtering / smoothing effect is very different than for noise.
[0238] First, |X(tk, ωl)| carries no speech information, and hence d does not include a "negative" speech contribution. Furthermore, the speech components in neighboring time-frequency tiles in the (tk, ωl) plane will not be independent. As a result, smoothing will not have much effect on the speech energy in d. Thus, the overall effect of smoothing is an increase in SNR, as the filtering results in a substantially reduced variance for the noise but has much less impact on the speech components. This may be exploited for the gain value determination and/or for the designation of time-frequency tiles as described above.
[0239] As an example, in many embodiments, the difference measure
[0240] may be determined as d(tk, ωl) = fa( Σ(m = −K1..K2) Σ(n = −K3..K4) |Z(tk+m, ωl+n)| ) − fb( Σ(m = −K5..K6) Σ(n = −K7..K8) |X(tk+m, ωl+n)| ), where fa and fb are monotonic functions and K1 through K8 are integer values that define the averaging neighborhood for the time-frequency tile. Typically, the values K1 to K8 are such that the total number of time-frequency tile values in each of the two sums is identical. However, in instances where the number of values differs between the two sums, the corresponding functions fa(x) and fb(x) may include compensation for the difference in the number of values.
[0241] The functions fa(x) and fb(x) may, in some embodiments, include a weighting of the values in the sum, i.e. the weighting may depend on the summation indices.
[0242] Thus, in this example, the time-frequency tile values for both the first and second frequency domain signals are averaged/filtered over a neighborhood of the current tile.
[0243] Examples of such functions include the exemplary functions given above. In many embodiments, f1(x) or f2(x) may further depend on a noise coherence estimate indicating an average difference between the noise levels of the first and second microphone signals. One or both of the functions f1(x) and f2(x) may specifically include scaling by a scale factor that reflects the estimated average noise level difference between the first and second microphone signals. One or both of the functions f1(x) and f2(x) may in particular depend on the coherence term C(tk, ωl) described above.
[0244] As mentioned earlier, the difference measure is calculated as the difference between a first value, generated as a monotonic function of the absolute time-frequency tile value of the first microphone signal, and a second value, generated as a monotonic function of the absolute time-frequency tile value of the second microphone signal, that is, d(tk, ωl) = f1(|Z(tk, ωl)|) − f2(|X(tk, ωl)|), where f1(x) and f2(x) are monotonic (typically monotonically increasing) functions of x. In many embodiments, the functions f1(x) and f2(x) may simply be scalings of the absolute value.
[0245] A particular advantage of such an approach is that a difference measure based on the subtraction of absolute values can take both positive and negative values when only noise is present. This is particularly suitable for averaging/smoothing/filtering, since in that case fluctuations around the zero mean tend to cancel each other. However, when speech is present, it appears mainly in the first microphone signal only, i.e. mainly in |Z(tk, ωl)|. Thus, smoothing or filtering over neighboring time-frequency tiles tends to reduce the noise contribution in the difference measure but not the speech component. A particularly advantageous synergy can therefore be achieved by combining the averaging with the absolute-value-based difference measure.
[0246] The above description has focused on a scenario in which only one of the microphones captures speech while the other microphone captures only diffuse noise without speech components (e.g. as illustrated in FIG. 5, where the speaker is relatively close to the first microphone and the reference microphone picks up (almost) no speech).
[0247] Thus, in this example, it is assumed that the reference microphone signal x (n) has almost
no speech, and the noise components in z (n) and x (n) originate from the diffuse sound field. The
distance between the microphones is relatively large, and the coherence between the noise
components of the plurality of microphones is approximately zero.
[0248] However, in practice the microphones are often placed in close proximity, and as a result two effects can become more significant: both microphones may begin to capture elements of the desired speech, and the coherence between the microphone signals at low frequencies can no longer be ignored.
[0249] In some embodiments, the noise suppressor may further comprise an audio beamformer configured to generate the first microphone signal and the second microphone signal from the signals of a microphone array. This example is shown in FIG.
[0250] The microphone array may have only two microphones in some embodiments, but typically has more. The beamformer, depicted as a BMF unit, may generate a plurality of beams directed in different directions, each of which may generate one of the first and second microphone signals.
[0251] The beamformer may in particular be an adaptive beamformer in which one beam can be
directed towards the speech source using a suitable adaptive algorithm. At the same time, other
beams can be adapted to generate a notch (or in particular a null) in the direction of the speech
source.
[0252] For example, while U.S. Pat. Nos. 5,985,859 and 6,086,095 disclose examples of adaptive
beamformers that focus on speech, they also provide (almost) speech-free reference signals. Such
an approach may be used to generate a first microphone signal as the beamformer's primary
output and a second microphone signal as the beamformer's secondary output.
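As a much simplified, non-adaptive illustration of producing a speech-focused primary output and a speech-cancelling reference from two array microphones (this is not the adaptive beamformers of the cited patents), a delay-and-sum / delay-and-subtract pair can be formed for a known speech direction; the function name, the integer-delay assumption and the wrap-around alignment are assumptions of this sketch.

import numpy as np

def fixed_beam_pair(mic1, mic2, delay_samples):
    """Form a primary (speech-focused) and a reference (speech-cancelling)
    signal from two microphones, assuming the speech arrives at mic2
    'delay_samples' later than at mic1 (integer delay for simplicity).

    primary   z(n) = 0.5 * (mic1(n - delay) + mic2(n))  -> speech adds coherently
    reference x(n) = mic1(n - delay) - mic2(n)          -> speech (largely) cancels
    """
    mic1_aligned = np.roll(mic1, delay_samples)  # crude integer alignment (wrap-around ignored)
    primary = 0.5 * (mic1_aligned + mic2)
    reference = mic1_aligned - mic2
    return primary, reference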
[0253] This may address the problem of speech being present in more than one microphone of the system. Noise components are present in both beamformer output signals and, for diffuse noise, are also Gaussian distributed. The coherence function between the noise components in z(n) and x(n) again depends on sinc(kd) as described above. That is, at higher frequencies the coherence is nearly zero, and the noise suppressor of FIG. 4 can be used effectively.
[0254] Because of the smaller distance between the microphones, sinc (kd) will not be zero for
lower frequencies, and as a result the coherence between z (n) and x (n) will not be zero.
[0255] In some embodiments, the noise suppressor may further comprise an adaptive canceller for cancelling from the first microphone signal the signal component of the first microphone signal that is correlated with the second microphone signal.
[0256] An example of a noise suppressor with both the suppressor of FIG. 4 and the beamformer
and adaptive canceller of FIG. 10 is shown in FIG.
[0257] In this example, the adaptive canceller implements an additional adaptive noise cancellation algorithm that removes from z(n) the noise that is correlated with the noise in x(n). For such an approach, the coherence between x(n) and the residual signal r(n) is (by definition) zero.
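A minimal time-domain sketch of one standard choice for such an adaptive canceller, a normalised LMS (NLMS) filter that removes from z(n) the component correlated with the reference x(n); the filter length, step size and names are assumptions of this sketch, not the specific algorithm of the description.

import numpy as np

def nlms_canceller(z, x, num_taps=128, mu=0.1, eps=1e-8):
    """Adaptively cancel from z(n) the part correlated with x(n).

    z: primary signal (speech + noise), x: noise reference (float arrays).
    Returns the residual r(n) = z(n) - w^T x_vec(n), which keeps the
    speech (uncorrelated with x) and removes the correlated noise.
    """
    w = np.zeros(num_taps)
    r = np.zeros(len(z))
    for n in range(len(z)):
        # most recent num_taps reference samples, newest first, zero-padded
        x_vec = x[max(0, n - num_taps + 1):n + 1][::-1]
        x_vec = np.pad(x_vec, (0, num_taps - len(x_vec)))
        y = np.dot(w, x_vec)                                    # estimate of correlated noise
        r[n] = z[n] - y                                         # residual output
        w += (mu / (eps + np.dot(x_vec, x_vec))) * r[n] * x_vec  # NLMS weight update
    return r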
[0258] It will be appreciated that the above description has described embodiments of the
invention with reference to various functional circuits, units and processors for clarity. However,
it will be apparent that any suitable distribution of functionality between different functional
circuits, units or processors may be used without detracting from the invention. For example,
functionality illustrated to be performed by separate processors or controllers may be performed
by the same processor or controller. Thus, reference to a specific functional unit or circuit should
only be viewed as referring to the preferred means for providing the described functionality,
rather than indicating the exact logical or physical structure or organization.
[0259] The invention can be implemented in any suitable form including hardware, software,
firmware or any combination of these. The invention may optionally be implemented at least
partially as computer software running on one or more data processors and / or digital signal
processors. The elements and components of an embodiment of the present invention may be
implemented physically, functionally and logically in any suitable manner. In fact, functions may
be implemented in a single unit, in multiple units, or as part of other functional units. Thus, the
present invention may be implemented in a single unit, or physically and functionally distributed
between different units, circuits and processors.
[0260] Although the present invention has been described in the context of some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present invention is limited only by the appended claims. Furthermore, although a feature may appear to be described in the context of a particular embodiment, one skilled in the art will recognize that various features of the described embodiments may be combined in accordance with the present invention. In the claims, the term comprising does not exclude the presence of other elements or steps.
[0261] Furthermore, although individually listed, a plurality of means, elements, circuits or method steps may be implemented by e.g. a single circuit, unit or processor. Furthermore, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Also, the inclusion of a feature in one category of claims does not imply a limitation to this category, but rather indicates that the feature is equally applicable to other claim categories as appropriate. Furthermore, the order of features in the claims does not imply any specific order in which the features must be worked; in particular, the order of individual steps in a method claim does not imply that the steps must be performed in this order. Rather, the steps may be performed in any suitable order. In addition, singular references do not exclude a plurality. Thus, references to "a", "an", "first", "second" etc. do not preclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.