Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2017530396
Abstract: A recording is usually a mixture of signals from several sources. The directions of the dominant sound sources in the recording may be known, or may be determined using a sound source localization algorithm. Multiple beamformers may be used to separate, or focus on, the target sound source. In one embodiment, each beamformer points in the direction of one dominant source, and the beamformer outputs are post-processed to focus on the target source. Depending on whether the beamformer pointing at the target sound source has greater power than the other beamformers, the signal corresponding to the target sound source can be set either to a reference signal or to a scaled output of the beamformer pointing at the target source. The scaling factor may depend on the ratio of the output of the beamformer pointing at the target source to the maximum of the other beamformer outputs.
Method and apparatus for emphasizing a sound source
[0001]
TECHNICAL FIELD The present invention relates to a method and apparatus for enhancing a
sound source, and more particularly to a method and apparatus for enhancing a sound source
from a noisy recording.
[0002]
(Background) A recording usually contains a mixture of several sources, which prevents the listener from recognizing or concentrating on the source of interest (e.g., target speech or music) because of interference from environmental noise and other speech.
03-05-2019
1
The ability to separate and focus sound sources of interest from noisy recordings is sought in
applications such as, but not limited to, audio / video conferencing, speech recognition, hearing
aids and audio zoom.
[0003]
SUMMARY In accordance with an embodiment of the present principles, a method for processing an audio signal is presented, the audio signal being a mixture of at least a first signal from a first audio source and a second signal from a second audio source. The method comprises: processing the audio signal to produce a first output using a first beamformer pointing in a first direction, the first direction corresponding to the first audio source; processing the audio signal to produce a second output using a second beamformer pointing in a second direction, the second direction corresponding to the second audio source; and processing the first output and the second output to generate an enhanced first signal. According to another embodiment of the present principles, an apparatus for performing these steps is also presented.
[0004]
In accordance with an embodiment of the present principles, a method for processing an audio signal is presented, the audio signal being a mixture of at least a first signal from a first audio source and a second signal from a second audio source. The method comprises: processing the audio signal to produce a first output using a first beamformer pointing in a first direction, the first direction corresponding to the first audio source; processing the audio signal to produce a second output using a second beamformer pointing in a second direction, the second direction corresponding to the second audio source; determining whether the first output is dominant between the first output and the second output; and processing the first output and the second output to generate an enhanced first signal, wherein the enhanced first signal is generated based on a reference signal if the first output is determined to be dominant, and based on the first output weighted by a first factor otherwise. According to another embodiment of the present principles, an apparatus for performing these steps is also presented.
[0005]
According to an embodiment of the present principles, a computer readable storage medium is presented having stored therein instructions for processing, according to the methods described above, an audio signal that is a mixture of at least a first signal from a first audio source and a second signal from a second audio source.
[0006]
FIG. 1 illustrates an exemplary audio system that enhances a target sound source.
FIG. 2 illustrates an exemplary audio enhancement system in accordance with an embodiment of the present principles.
FIG. 3 illustrates an exemplary method for performing audio enhancement in accordance with an embodiment of the present principles.
FIG. 4 illustrates an exemplary audio enhancement system in accordance with an embodiment of the present principles.
FIG. 5 shows an exemplary audio zoom system with three beamformers according to an embodiment of the present principles.
FIG. 6 shows an exemplary audio zoom system with five beamformers according to an embodiment of the present principles.
FIG. 7 shows a block diagram of an exemplary system that can use an audio processor in accordance with an embodiment of the present principles.
[0007]
Detailed Description: FIG. 1 shows an exemplary audio system that enhances a target sound source. An audio capture device 105, for example a mobile phone, makes a noisy recording (e.g., a mixture of speech from a man in direction θ1, a loudspeaker playing music in direction θ2, background noise, and a musical instrument playing in direction θk, where θ1, θ2, ..., θk represent the spatial directions of the sound sources relative to the microphone array). Based on a user request, e.g., a request from the user interface to focus on the male speech, the audio enhancement module 110 performs enhancement for the requested sound source and outputs an enhanced signal. It should be noted that the audio enhancement module 110 may be located on a separate device from the audio capture device 105, or may be incorporated as a module of the audio capture device 105.
[0008]
Several approaches can be used to enhance the target audio source in noisy recordings. For example, audio source separation is known as a powerful way of separating multiple sound sources from their mixtures. Separation approaches still need improvement, for example in challenging cases with high reverberation or with an unknown number of sources exceeding the number of sensors. Also, separation techniques are currently not well suited to real-time applications with limited processing power.
[0009]
Another approach, known as beamforming, enhances the target sound source using a spatial beam pointing at it, and is often used in conjunction with post-filtering techniques for further suppression of diffuse noise. One advantage of beamforming is that its computational requirements are low when the number of microphones is small, which makes it suitable for real-time applications. However, when the number of microphones is small (e.g., two or three microphones in current mobile devices), the generated beam patterns are not narrow, which makes it difficult to suppress background noise and interference from unwanted sources. Several existing studies have proposed combining beamforming with spectral subtraction for speech recognition and speech enhancement in mobile devices. In these studies, the target source direction is usually assumed to be known, and the null beamforming they consider may not be robust to reverberation effects. Furthermore, the spectral subtraction step may add artifacts to the output signal.
[0010]
The present principles relate to methods and systems for enhancing sound sources in noisy recordings. In accordance with novel aspects of the present principles, the proposed method combines several signal processing techniques, such as, but not limited to, source localization, several beamformers pointing at different source directions in space, and post-processing based on the beamformer outputs, so that any target source can be enhanced efficiently. In general, enhancement improves the quality of the signal from the target source. The proposed method has a light computational load and can be used in real-time applications, such as, but not limited to, voice conferencing and audio zoom, even in mobile devices with limited processing power. According to another novel aspect of the present principles, progressive audio zoom (0% to 100%) may be performed based on an enhanced sound source.
[0011]
FIG. 2 shows an exemplary audio enhancement system 200 according to an embodiment of the
present principles. System 200 receives an audio recording as an input and provides an
enhanced signal as an output. To perform audio enhancement, system 200 uses several signal
processing modules, including a sound source localization module 210 (optional), multiple
beamformers (220, 230, 240) and a post processor 250. In the following, we describe each
signal processing block in more detail.
[0012]
(Source Localization) Given an audio recording, if the directions of the dominant sources are unknown, they can be estimated using source localization algorithms such as generalized cross-correlation with phase transform (GCC-PHAT). As a result, the directions θ1, θ2, ..., θK of the different sources (also known as directions of arrival, DoA) can be determined, where K is the total number of dominant sources. If the DoA is known in advance, for example when we point a smartphone in one direction to capture video and therefore know that the source of interest is directly in front of the microphone array (θ1 = 90 degrees), we do not need to perform sound source localization to detect the DoA, or we perform sound source localization only to detect the DoAs of the dominant interference sources.
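GCC-PHAT is named but not detailed in the text; the following is a minimal sketch of the idea for a single microphone pair, assuming a far-field, single-delay model (the function name and parameters are illustrative, not from the patent). The estimated delay can then be mapped to a DoA given the array geometry.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the time delay between two microphone signals with GCC-PHAT."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    # Phase transform: keep only the phase of the cross-spectrum.
    R = X1 * np.conj(X2)
    R /= np.abs(R) + 1e-12
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    # Reorder so index max_shift corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs  # delay in seconds (sign depends on lag convention)

# Example: the same source signal arrives 5 samples later at microphone 2.
fs = 16000
sig = np.random.RandomState(0).randn(1024)
delay = 5
x1 = sig
x2 = np.concatenate((np.zeros(delay), sig[:-delay]))
tau = gcc_phat(x1, x2, fs)
```

The magnitude of `tau` recovers the 5-sample offset; the sign convention of the lag axis determines which microphone is treated as leading.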
[0013]
(Beamforming) Given the DoAs of the dominant sources, beamforming can be used as a powerful technique to emphasize a specific source direction in space while suppressing signals from other directions. In one embodiment, we enhance the corresponding sound sources using several beamformers that point in the different directions of the dominant sound sources. The short-time Fourier transform (STFT) coefficients (the time-frequency domain signal) of the observed time-domain mixture x(t) are denoted x(n, f), where n is the time frame index and f is the frequency bin index. The output of the j-th beamformer (enhancing the source in direction θj) can be calculated as s_j(n, f) = w_j(n, f)^H x(n, f), where w_j(n, f) is a steering vector pointing in the target direction of beamformer j, and H denotes the conjugate transpose. w_j(n, f) may be calculated in different ways for different types of beamformers, e.g., minimum variance distortionless response (MVDR), robust MVDR, delay-and-sum (DS), and generalized sidelobe canceller (GSC).
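The per-bin output s_j(n, f) = w_j(n, f)^H x(n, f) can be sketched as follows, using delay-and-sum weights for a linear array; the function names, array geometry, and plane-wave delay model are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def ds_steering_vector(theta, freq, mic_positions, c=343.0):
    """Delay-and-sum weights for a linear array steered to angle theta (radians)."""
    # Relative delay of each microphone for a plane wave arriving from theta.
    delays = mic_positions * np.cos(theta) / c
    M = len(mic_positions)
    return np.exp(-2j * np.pi * freq * delays) / M

def beamformer_output(X, theta, freqs, mic_positions):
    """Apply s_j(n, f) = w_j(f)^H x(n, f) at every time-frequency bin.

    X: STFT mixture, shape (mics, frames, freq_bins).
    """
    M, N, F = X.shape
    S = np.zeros((N, F), dtype=complex)
    for fi, f in enumerate(freqs):
        w = ds_steering_vector(theta, f, mic_positions)
        S[:, fi] = np.conj(w) @ X[:, :, fi]  # w^H x for every frame
    return S

# Toy check: a plane wave from broadside (theta = 90 deg) passes undistorted.
mics = np.array([0.0, 0.05, 0.10, 0.15])   # 4-mic array, 5 cm spacing
freqs = np.array([500.0, 1000.0])
X = np.ones((4, 3, 2), dtype=complex)      # broadside wave: equal phase at all mics
S = beamformer_output(X, np.pi / 2, freqs, mics)
```

For a broadside source the delays are zero, so the weights reduce to a simple average and the unit-amplitude wave is passed with unit gain.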
[0014]
(Post-processing) The beamformer output alone is usually not good enough to suppress the interference, and applying post-filtering directly to this output can lead to strong signal distortion. One reason is that the enhanced source usually contains a large amount of musical noise (artifacts), due to (1) non-linear signal processing in the beamforming and (2) errors in estimating the dominant source directions. These errors can lead to more signal distortion at high frequencies, since DoA errors cause large phase differences there. Therefore, we propose to apply post-processing to the outputs of several beamformers. In one embodiment, the post-processing can be based on a reference signal x_I and the beamformer outputs, where the reference signal is an input microphone signal, e.g., from a microphone facing the target sound source in a smartphone, a camera microphone in a smartphone, or one of the microphones close to the mouth in Bluetooth® headphones. The reference signal may also be a linear combination of several microphone signals, or a more complex signal generated from multiple microphone signals. In addition, time-frequency masking (and optionally spectral subtraction) can be used to generate the enhanced signal.
[0015]
In one embodiment, the enhanced signal for source j is generated as ŝ_j(n, f) = x_I(n, f) if |s_j(n, f)| ≥ α·max_{i≠j} |s_i(n, f)|, and ŝ_j(n, f) = β·s_j(n, f) otherwise (equation (2)), where x_I(n, f) are the STFT coefficients of the reference signal, and α and β are tuning constants, for example α = 1, 1.2 or 1.5 and β = 0.05-0.3. The values of α and β may be adapted to the application. One fundamental assumption behind equation (2) is that the sources rarely overlap in the time-frequency domain, so that if source j is dominant at time-frequency point (n, f) (i.e., the output of beamformer j is greater than the outputs of all other beamformers), the reference signal can be considered a good approximation of the target source. Thus, we set the enhanced signal to the reference signal x_I(n, f) to reduce the distortion (artifacts) caused by beamforming. Otherwise, we assume that the signal at (n, f) is noise, or a mixture of noise and the target sound source, and we set the enhanced signal to a small value β·s_j(n, f) to suppress that noise or mixture.
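The bin-separation rule described above can be sketched as follows; since the original equation image is not reproduced, the interpretation of α as a scale on the dominance test against the strongest competing beamformer is an assumption, and all names are illustrative.

```python
import numpy as np

def bin_separation(S, x_ref, j, alpha=1.2, beta=0.1):
    """Time-frequency masking for target beamformer j (a sketch of eq. (2)).

    S: beamformer outputs, shape (num_beams, frames, freq_bins).
    x_ref: reference microphone STFT, shape (frames, freq_bins).
    Where beamformer j dominates the others (scaled by alpha), keep the
    reference signal; elsewhere attenuate the beamformer output by beta.
    """
    others = np.delete(np.abs(S), j, axis=0).max(axis=0)
    dominant = np.abs(S[j]) >= alpha * others
    return np.where(dominant, x_ref, beta * S[j])

# Toy example: beam 0 dominates at the first bin only.
S = np.array([[[4.0, 1.0]],
              [[1.0, 5.0]]], dtype=complex)   # (2 beams, 1 frame, 2 bins)
x_ref = np.array([[2.0, 3.0]], dtype=complex)
out = bin_separation(S, x_ref, j=0, alpha=1.0, beta=0.1)
```

In the dominant bin the output equals the reference signal; in the other bin the beamformer output is scaled down by β.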
[0016]
In another embodiment, the post-processing can also use spectral subtraction for noise suppression. Mathematically, equation (3) subtracts a noise magnitude estimate from |x_I(n, f)| and restores the phase of the reference signal, where phase(x_I(n, f)) denotes the phase information of the signal x_I(n, f), and the subtracted term is the frequency-dependent spectral power of the noise affecting sound source j, which can be updated continuously. In one embodiment, if a frame is detected as a noise frame, the noise level can be set to the signal level of that frame, or it can be updated smoothly with a forgetting factor that takes previous noise values into account.
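The spectral-subtraction step and the forgetting-factor noise update can be sketched as follows; the function names, the max(·, 0) floor, and the recursive update form are common choices assumed here, not details reproduced from the patent.

```python
import numpy as np

def spectral_subtraction(x_ref, noise_psd, floor=0.0):
    """Magnitude spectral subtraction keeping the phase of the reference signal.

    x_ref: reference STFT coefficients (complex array).
    noise_psd: frequency-dependent noise spectral power (same trailing shape).
    """
    mag = np.abs(x_ref)
    clean_mag = np.maximum(mag - np.sqrt(noise_psd), floor)
    return clean_mag * np.exp(1j * np.angle(x_ref))

def update_noise_psd(noise_psd, frame, is_noise, forget=0.9):
    """Recursive noise update with a forgetting factor on detected noise frames."""
    if is_noise:
        return forget * noise_psd + (1.0 - forget) * np.abs(frame) ** 2
    return noise_psd

x = np.array([3.0 + 0.0j, 0.5 + 0.0j])
noise = np.array([1.0, 1.0])   # noise power 1 -> subtract magnitude 1
y = spectral_subtraction(x, noise)
```

The first bin keeps its phase with magnitude reduced from 3 to 2; the second bin is floored at zero because the noise estimate exceeds its magnitude.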
[0017]
In another embodiment, the post-processing performs a "cleaning" on the beamformer output to obtain a more robust beamformer. This can be done adaptively with a filter of the form ŝ_j(n, f) = β_j·s_j(n, f), where the factor β_j depends on a quantity that can be regarded as a time-frequency signal-to-interference ratio. For example, we can set β_j so as to perform a "soft" post-processing "cleaning", with ε a small constant, for example ε = 1. Thus, if |s_j(n, f)| is much larger than all the other |s_i(n, f)|, the cleaned output is close to s_j(n, f), and if it is much smaller than the other |s_i(n, f)|, the cleaned output is close to zero.
[0018]
We can also set β as follows to do a “hard” (binary) cleaning.
[0019]
β_j can also be set in an intermediate way between "soft" cleaning and "hard" cleaning, by adjusting its value according to the level difference between |s_j(n, f)| and the |s_i(n, f)|, i ≠ j.
[0020]
The approaches described above ("soft"/"hard"/intermediate cleaning) can also be extended to filter x_I(n, f) instead of s_j(n, f).
Note that in this case the β factor is still calculated from the beamformer outputs s_j(n, f) (instead of the original microphone signals) in order to take advantage of the beamforming.
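Since equations (4)-(6) are not reproduced in the translation, the following is one plausible sketch of "soft" and "hard" cleaning gains driven by the ratio of the target beam's power to its strongest competitor; all names and the exact gain formulas are assumptions.

```python
import numpy as np

def soft_gain(S, j, eps=1.0):
    """'Soft' cleaning gain: power of beam j relative to its strongest competitor."""
    target = np.abs(S[j]) ** 2
    interf = np.delete(np.abs(S), j, axis=0).max(axis=0) ** 2
    # Near 1 where beam j dominates, near 0 where another beam dominates.
    return target / (target + interf + eps)

def hard_gain(S, j):
    """'Hard' (binary) cleaning gain: 1 where beam j dominates, 0 elsewhere."""
    interf = np.delete(np.abs(S), j, axis=0).max(axis=0)
    return (np.abs(S[j]) > interf).astype(float)

S = np.array([[[10.0, 0.1]],
              [[0.1, 10.0]]])   # (2 beams, 1 frame, 2 bins)
g_soft = soft_gain(S, 0)
g_hard = hard_gain(S, 0)
```

The cleaned output would then be the gain times `S[j]`; the soft gain interpolates smoothly between the two binary extremes of the hard gain.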
[0021]
For the approaches above, we can also add a memory effect to avoid spurious decisions or glitches in the enhanced signal.
For example, we may average the quantities involved in the post-processing decision over the last M frames, where M is the number of frames considered for the decision.
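The memory effect can be sketched as a moving average of the power terms over the last M frames before the dominance decision; the name and array shapes here are assumptions.

```python
import numpy as np

def smoothed_power(S_hist):
    """Average |s(n, f)|^2 over the last M frames before the dominance decision.

    S_hist: array of shape (M, freq_bins) holding the last M frames of one beam.
    """
    return np.mean(np.abs(S_hist) ** 2, axis=0)

# A single-frame glitch is diluted by averaging over M = 4 frames.
hist = np.array([[1.0], [1.0], [9.0], [1.0]])   # one outlier frame
p = smoothed_power(hist)
```

The outlier's power (81) is averaged with three quiet frames, so one glitchy frame cannot flip the decision on its own.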
[0022]
In addition, after signal enhancement as described above, other post-filtering techniques can be
used to further suppress diffuse background noise.
[0023]
In the following, to simplify the notation, we refer to the methods of equations (2), (4) and (7) as bin separation, and to the method of equation (3) as spectral subtraction.
[0024]
FIG. 3 shows an exemplary method 300 for performing audio enhancement in accordance with
an embodiment of the present principles.
Method 300 begins at step 305.
In step 310, the method performs initialization and determines, for example, whether it is
necessary to determine the direction of the dominant sound source using a sound source
localization algorithm. If necessary, the method selects an algorithm for sound source localization
and sets its parameters. The method may also determine which beamforming algorithm to use or
the number of beamformers, eg, based on user configuration.
[0025]
At step 320, sound source localization is used to determine the dominant sound source directions. It should be noted that step 320 can be omitted if the dominant sound source directions are known. At step 330, the method uses multiple beamformers, each pointing in a different direction to enhance the corresponding sound source. The direction for each beamformer may be determined from source localization. If the direction of the target sound source is known, we may also sample the directions in the 360° field of view. For example, if the direction of the target sound source is known to be 90°, we can sample the 360° field of view using 90°, 0° and 180°. Different methods can be used for the beamforming, such as, but not limited to, minimum variance distortionless response (MVDR), robust MVDR, delay-and-sum (DS) and generalized sidelobe canceller (GSC). At step 340, the method performs post-processing on the beamformer outputs. The post-processing may be based on an algorithm as shown in equations (2)-(7), and may also be performed with spectral subtraction and/or other post-filtering techniques.
[0026]
FIG. 4 shows a block diagram of an exemplary system 400 that can utilize audio enhancement in accordance with an embodiment of the present principles. The microphone array 410 makes the noisy recordings that need to be processed. The microphones may record audio from one or more speakers or devices. The noisy recordings may also be pre-recorded and stored on a storage medium. The sound source localization module 420 is optional; when used, it determines the directions of the dominant sound sources. The beamforming module 430 applies multiple beamformers pointing in different directions. Based on the beamformer outputs, the post processor 440 performs post-processing, e.g., using one of the methods shown in equations (2)-(7). After post-processing, the enhanced sound source can be played by the loudspeaker 450. The output sound may also be stored on a storage medium or transmitted to a receiver through a communication channel.
[0027]
The various modules shown in FIG. 4 may be implemented in one device or distributed across several devices. For example, all modules may be included in a single device such as, but not limited to, a tablet or cell phone. In another example, the sound source localization module 420, the beamforming module 430 and the post processor 440 may be located on a computer or in the cloud, separately from the other modules. In yet another embodiment, the microphone array 410 or the loudspeaker 450 can be a stand-alone module.
[0028]
FIG. 5 shows an exemplary audio zoom system 500 that can use the present principles. In audio zoom applications, the user may be interested in only one sound source direction in space. For example, if the user points a mobile device in a particular direction, it can be assumed that this direction is the DoA of the target sound source. In the audio-video capture example, the DoA can be assumed to be the direction the camera faces. The interferers are then the out-of-view sources (at the sides of and behind the audio capture device). Thus, in audio zoom applications, sound source localization can be optional, as the DoA can usually be inferred from the audio capture device.
[0029]
In one embodiment, the main beamformer is set to point in the target direction θ, while several other beamformers point, by way of example, in other, non-target directions (e.g., θ−90°, θ−45°, θ+45°, θ+90°) to capture the noise and interference used during post-processing.
[0030]
The audio system 500 uses four microphones m1 to m4 (510, 512, 514, 516). The signals from each microphone are transformed from the time domain to the time-frequency domain using, for example, FFT modules (520, 522, 524, 526). Beamformers 530, 532 and 534 perform beamforming based on the time-frequency signals. In one example, beamformers 530, 532 and 534 may point in directions 0°, 90° and 180°, respectively, and thereby sample the sound field (360°). The post processor 540 performs post-processing based on the outputs of beamformers 530, 532 and 534, using, for example, one of the methods shown in equations (2)-(7). If a reference signal is used by the post processor, post processor 540 may use the signal from a microphone (e.g., m4) as the reference signal.
[0031]
The output of post processor 540 is transformed back from the time-frequency domain to the time domain using, for example, IFFT module 550. Based on an audio zoom factor α (with a value from 0 to 1) provided by a user request through the user interface, mixers 560 and 570 generate the right output and the left output, respectively.
The output of the audio zoom is a linear mixture of the enhanced output from the IFFT module 550 and the left and right microphone signals (m1 and m4), according to the zoom factor α. The output is stereo, with a left output and a right output. In order to maintain the stereo effect, the maximum value of α should be less than 1 (e.g., 0.9).
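The zoom mixing described above can be sketched as a linear crossfade per channel between the enhanced mono signal and the raw left/right microphone signals, with the sub-unity cap on α mentioned in the text; the function name and signature are illustrative.

```python
import numpy as np

def audio_zoom_mix(enhanced, mic_left, mic_right, zoom):
    """Linear mix of the enhanced signal with the left/right microphone signals.

    zoom in [0, 1): 0 keeps the original stereo; values near 1 favor the
    enhanced (zoomed-in) source.
    """
    zoom = min(zoom, 0.9)  # keep the maximum below 1 to retain the stereo effect
    left = zoom * enhanced + (1.0 - zoom) * mic_left
    right = zoom * enhanced + (1.0 - zoom) * mic_right
    return left, right

enh = np.array([1.0, 1.0])
mL = np.array([0.0, 2.0])
mR = np.array([2.0, 0.0])
L, R = audio_zoom_mix(enh, mL, mR, zoom=0.5)
```

At zoom = 0.5 each output channel is the midpoint between the enhanced signal and its microphone signal, so some stereo difference between the channels survives.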
[0033]
Frequency masking and spectral subtraction can be used in the post processor in addition to the methods shown in equations (2)-(7). A psychoacoustic frequency mask can be calculated from the bin separation output; the principle is that frequency bins with levels outside the psychoacoustic mask are not used to generate the output of the spectral subtraction.
[0034]
FIG. 6 shows another exemplary audio zoom system 600 that can use the present principles. In system 600, five beamformers are used instead of three; they point in the directions 0°, 45°, 90°, 135° and 180°, respectively.
[0035]
Audio system 600 also uses four microphones m1 to m4 (610, 612, 614, 616). The signals from each microphone are transformed from the time domain to the time-frequency domain using, for example, FFT modules (620, 622, 624, 626). Beamformers 630, 632, 634, 636 and 638 perform beamforming based on the time-frequency signals, pointing in directions 0°, 45°, 90°, 135° and 180°, respectively. The post processor 640 performs post-processing based on the outputs of beamformers 630, 632, 634, 636 and 638, for example using one of the methods shown in equations (2)-(7). If a reference signal is used by the post processor, post processor 640 may use the signal from a microphone (e.g., m3) as the reference signal. The output of post processor 640 is transformed back from the time-frequency domain to the time domain using, for example, IFFT module 660. Based on the audio zoom factor, the mixer 670 generates an output.
[0036]
The subjective quality of the post-processing approaches varies with the number of microphones. In one embodiment, bin separation alone is preferred if only two microphones are used, while bin separation combined with spectral subtraction is preferred if four microphones are used.
[0037]
The present principles can be applied whenever multiple microphones are available. In systems 500 and 600, we assume signals from four microphones. If only two microphones are present, the mean value (m1 + m2)/2 can be used in place of m3 in the post-processing, with spectral subtraction applied if necessary. It should be noted that the reference signal may come from one microphone close to the target sound source, or may be an average of the microphone signals. For example, if there are three microphones, the reference signal for spectral subtraction can be (m1 + m2 + m3)/3, or directly m3 if m3 faces the source of interest.
[0038]
In general, the present embodiments enhance the signal in the target direction using beamformer outputs for several directions. By beamforming in several directions, we sample the sound field (360°) in multiple directions, and the beamformer output can then be "cleaned" by post-processing to obtain the signal from the target direction.
[0039]
An audio zoom system such as system 500 or 600 is also fully applicable to audio conferencing, where it can enhance speaker speech from different locations using multiple beamformers pointing in multiple directions. In audio conferencing, the position of the recording device is often fixed (e.g., placed at a fixed position on a table), while the different speakers are located at arbitrary places. Source localization and tracking (e.g., for tracking moving speakers) can be used to learn the locations of the sources before directing the beamformers at them. To improve the accuracy of the source localization and beamforming, dereverberation techniques can be used to preprocess the input mixture so as to reduce reverberation effects.
[0040]
FIG. 7 shows an audio system 700 that can use the present principles. The input to system 700 can be an audio stream (e.g., an mp3 file), an audiovisual stream (e.g., an mp4 file), or a signal from a different input. The input may also come from storage or be received from a communication channel. If the audio signal is compressed, it is decoded before being enhanced. The audio processor 720 performs audio enhancement using, for example, method 300 or system 500 or 600. The request for audio zoom may be separate from, or included in, a request for video zoom.
[0041]
Based on a user request from user interface 740, system 700 may receive an audio zoom factor, which may control the mixing ratio between the microphone signal and the enhanced signal. In one embodiment, the audio zoom factor can also be used to adjust the weighting value β_j to control the amount of noise remaining after post-processing. Subsequently, audio processor 720 may mix the enhanced audio signal and the microphone signal to produce an output. The output module 730 may play the audio, store it, or transmit it to a receiver.
[0042]
The implementations described herein may be implemented in, for example, a method or process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (e.g., discussed only as a method), the implementation of the features discussed may also be implemented in other forms (e.g., an apparatus or program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus such as a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants ("PDAs"), and other devices that facilitate communication of information between end-users.
[0043]
Reference to "one embodiment" or "an embodiment" or "one implementation" or "an implementation" of the present principles, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment of the present principles. Thus, the appearances of the phrase "in one embodiment" or "in an embodiment" or "in one implementation" or "in an implementation", as well as any other variations, appearing in various places throughout the specification, are not necessarily all referring to the same embodiment.
[0044]
Additionally, this application or its claims may refer to "determining" various pieces of information. Determining the information may include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.
[0045]
Furthermore, this application or its claims may refer to "accessing" various pieces of information. Accessing the information may include one or more of, for example, receiving the information, retrieving the information (e.g., from memory), storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
[0046]
Additionally, this application or its claims may refer to "receiving" various pieces of information. Receiving is, as with "accessing", intended to be a broad term. Receiving the information may include one or more of, for example, accessing the information or retrieving the information (e.g., from memory). Further, "receiving" is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
[0047]
As will be evident to one of skill in the art, implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted. The information may include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal may be formatted to carry the bitstream of a described embodiment. Such a signal may be formatted, for example, as an electromagnetic wave (e.g., using a radio frequency portion of the spectrum) or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links, as is known. The signal may be stored on a processor-readable medium.