Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2008141718
PROBLEM TO BE SOLVED: Conventional adaptive methods for acoustic echo cancellers have low adaptation accuracy and cannot generate a filter that sufficiently suppresses acoustic echo.
SOLUTION: The present invention uses a microphone array having a plurality of microphone elements, determines from the inter-microphone phase differences the bands in which only the loudspeaker sound is present, and adapts the filter only on those bands, thereby providing means for generating a filter with high acoustic echo suppression performance. [Selected figure] Figure 2
Acoustic echo canceller system
[0001]
The present invention relates to acoustic echo cancellation technology for video conferencing or teleconferencing systems equipped with speakers and microphones.
[0002]
Video conferencing and teleconferencing systems have a loudspeaker and a microphone at each site, are connected over a network, and allow voice conversation with a person at a remote location. In such systems, there is the problem that the sound output from the loudspeaker mixes into the microphone. It has therefore been common practice to remove the loudspeaker output sound (acoustic echo) picked up by the microphone using acoustic echo canceller technology. If the acoustic
04-05-2019
1
environment of the conference room is fixed, it suffices to learn once how sound propagates through the room (the impulse response) and use that impulse response to remove the acoustic echo completely. However, when a conference participant changes seats, the acoustic path of the echo changes, so the learned impulse response no longer matches the actual one and the acoustic echo can no longer be fully removed. In the worst case, the residual echo loops around and its volume gradually grows, causing a phenomenon called howling that makes conversation nearly impossible.
[0003]
Therefore, methods have been proposed that aim to keep the acoustic echo fully cancelled by learning the impulse response sequentially and tracking variations in the acoustic path (for example, Non-Patent Document 1).
[0004]
A method of cancelling acoustic echo using a microphone array has also been proposed (for example, Patent Document 1).
In the prior art, because echo canceller performance is insufficient, when the near-end and far-end talkers speak at the same time, the quieter talker's voice is completely shut out and the call becomes one-way; this is done to prevent howling. One-way calling, however, makes conversation difficult.
[0005]
Patent Document 1: JP-A-2005-136701. Non-Patent Document 1: Peter Heitkämper, "An Adaptation Control for Acoustic Echo Cancellers," IEEE Signal Processing Letters, vol. 4, no. 6, June 1997. Non-Patent Document 2: R. O. Schmidt, "Multiple Emitter Location and Signal Parameter Estimation," IEEE Trans. Antennas and Propagation, vol. 34, no. 3, pp. 276-280, 1986. Non-Patent Document 3: Togami Masahito, Amano Akio, et al., "Hearing Function of the Human-Symbiotic Robot EMIEW," 22nd AI Challenge Study Group, pp. 59-64, 2005.
[0006]
In the conventional approach of sequentially learning the impulse response and tracking changes in the acoustic path, learning is possible while sound is emitted only from the loudspeaker; but when the loudspeaker is playing and a talker in the conference room speaks at the same time, learning becomes impossible, and in the worst case the impulse-response learning fails and the acoustic echo cannot be removed at all. It is therefore necessary to determine whether sound is coming only from the loudspeaker or a talker in the conference room is also speaking (double-talk detection).
[0007]
The present invention detects situations in which the sound from the loudspeaker is dominant and controls the adaptation of the echo canceller accordingly. As one configuration, the direction of arrival of a sound source can be estimated using a microphone array having a plurality of microphone elements. In a more preferable aspect, the phase differences of the sounds arriving at the microphone elements are detected to determine when the loudspeaker sound is dominant; this determination can be made by comparison with a pre-stored threshold. In a preferred embodiment, the acoustic echo canceller adaptation unit extracts only those band-division signals whose direction of arrival is the loudspeaker direction and adapts the acoustic echo canceller on those signals.
[0008]
The acoustic echo canceller cancels the echo by synthesizing a replica of the loudspeaker sound and subtracting it from the input speech.
[0009]
A typical configuration of the present invention is a conference system having: a microphone for inputting voice; an AD converter that converts the microphone signal into a digital signal; an information processing device that processes the digital signal from the AD converter to suppress the acoustic echo component; an output interface that sends the signal from the information processing device to the network; an input interface that receives a signal from the network; a DA converter that converts the signal from the input interface to analog; and a speaker that outputs the signal from the DA converter as sound. The information processing device controls its own optimization timing based on the state of the voice input to the microphone.
The AD converter and the DA converter may be integrated.
[0010]
Preferably, the information processing apparatus is optimized at times when the sound input to the microphone comes mainly from the direction of the loudspeaker. This determination can be made, for example, by setting an appropriate threshold.
[0011]
More preferably, the information processing apparatus has an adaptive filter, an acoustic echo canceller adaptation unit that optimizes the adaptive filter, and an acoustic echo cancellation unit that uses the adaptive filter to suppress the acoustic echo component, i.e., the loudspeaker sound mixed into the digital signal.
[0012]
More preferably, the microphone is a microphone array having a plurality of microphone elements, the AD converter is a plurality of AD converters that digitize the signal of each microphone element, and the information processing apparatus has a phase difference calculation unit that calculates, from the signals of the AD converters, the phase differences between the voices input to the microphone elements, and a frequency distribution unit that determines, from the phase differences output by the phase difference calculation unit, whether the voice input to the microphone array is the voice output from the loudspeaker.
[0013]
More preferably, the information processing apparatus includes a band division unit that divides the digital signal into frequency bands. The band division unit splits the digitized signal of each microphone element into bands; the phase difference calculation unit calculates, for each band, the phase differences between the voices input to the microphone elements; and the frequency distribution unit determines, from the per-band phase differences, whether each band-division signal is a loudspeaker output signal or a talker signal. The acoustic echo canceller adaptation unit adapts the adaptive filter only on the bands determined by the frequency distribution unit to be loudspeaker output signals, and the acoustic echo canceller unit uses that adaptive filter to remove the echo component mixed into the microphone-element signals.
In the band division unit, for example, the frequency range from 20 Hz to 16 kHz is divided into 20 Hz bands. By performing control for each frequency band in this way, highly accurate echo cancellation can be performed.
[0014]
More preferably, in order to determine whether a signal is a loudspeaker output signal, the transfer function of the sound propagating from the loudspeaker to the microphone array is measured in advance, the per-band phase differences that the loudspeaker sound produces at the microphone array are calculated from the measured transfer function and stored in the external storage device, and the frequency distribution unit determines that a band-division signal is a loudspeaker output signal when the difference between the stored phase difference and the observed inter-microphone phase difference of that band is equal to or less than a predetermined threshold.
[0015]
More preferably, the system has a user interface in which the user specifies in advance the number of loudspeakers and their physical positions relative to the microphone array. An echo phase difference calculation processing unit calculates, from the number and positions specified in the user interface, the per-band phase differences that the loudspeaker sound produces at the microphone array, and stores them in the external storage device. If the difference between a stored phase difference and the observed inter-microphone phase difference of a band-division signal is equal to or less than a predetermined threshold, that band-division signal is determined to be a loudspeaker output signal.
[0016]
More preferably, a sound source localization unit is provided that computes a histogram of the sound source direction across bands from the inter-microphone phase differences and estimates the sound source direction from the histogram. The magnitude of the signal estimated to originate from the talker is calculated, and when the magnitude of a band-division signal determined to be a loudspeaker output signal, or of the loudspeaker component remaining in the band-division signal after the acoustic echo canceller, is judged to be at or below a predetermined magnitude, that band-division signal is attenuated or set entirely to zero.
[0017]
The echo canceller can be controlled dynamically according to the conditions of the conference
room.
[0018]
Hereinafter, specific embodiments of the present invention are described with reference to the drawings.
The present invention applies, for example, to a teleconferencing system over an IP network, in which two (or more) sites connected by the network communicate using conferencing equipment consisting of a microphone array, a loudspeaker, and so on, enabling conversation between talkers at the sites.
Hereinafter the two sides are referred to as the near end and the far end.
[0019]
FIG. 1 shows the hardware configuration of the present invention, deployed at both the near-end and far-end sites.
It comprises: a microphone array 1 with at least two microphone elements; an A/D-D/A converter 2 that converts analog sound pressure values input from the microphone array into digital data, and digital data into analog data; a central processing unit 3 that processes the output of the converter 2; a memory 4, for example volatile; a hub 5 connected to the network, for transmitting and receiving data to and from the far end; a speaker 6 that converts the D/A-converted analog data into sound pressure; and an external storage medium 7, for example non-volatile.
[0020]
The sound pressure values of multiple channels recorded by the microphone array 1 are sent to
the AD / DA converter 2 and converted into digital data of multiple channels.
The converted digital data is stored in the memory 4 via the CPU 3.
[0021]
The far-end voice sent through the hub 5 is passed via the CPU 3 to the AD/DA converter 2 and output from the speaker 6. The far-end voice output from the speaker 6 mixes with the near-end talker's voice in the recording made by the microphone array 1. The far-end sound output from the speaker is therefore also mixed into the digital sound pressure data stored in the memory 4. The CPU 3 performs echo cancellation processing to suppress the mixed far-end voice in the digital sound pressure data stored in the memory 4, and transmits only the near-end talker's voice to the far end via the hub 5. The echo cancellation process suppresses the far-end voice using data stored in advance in the external storage medium 7 on how sound propagates from the speaker to the microphone array, together with information such as the number and positions of loudspeakers.
[0022]
FIG. 2 shows the software configuration of the present invention. The CPU 3 digitally executes all components other than the A/D conversion unit 8, which converts analog data into digital data; this is the most convenient arrangement when CPU processing capacity is sufficient. Alternatively, equivalent hardware may be used, with processing performed digitally or in analog.
[0023]
The main functional blocks are: a band division unit 9 that converts digital data into band-divided data; a phase difference calculation unit 10 that calculates phase differences between the microphone channels of the band-division signals; a frequency distribution unit 11 that determines from the phase differences, for each band of the band-division signal, whether the acoustic echo or the talker's voice is dominant; an acoustic echo canceller adaptation unit 12 that adapts the adaptive filter for acoustic echo cancellation; a pseudo echo generation unit 13 that artificially generates, from the loudspeaker reference signal, the acoustic echo that reaches the microphone array; and an acoustic echo cancellation unit 14 that suppresses the acoustic echo in the input signal using the adaptive filter. The analog multi-channel sound pressure data recorded by the microphone array 1 is converted by the A/D conversion unit 8 into multi-channel digital data x(t). The converted multi-channel digital data is sent to the band division unit 9 and converted into multi-channel band-divided data x(f, τ). Since the microphone array includes a plurality of microphone elements, as many A/D conversion units 8 and band division units 9 as there are microphone elements may be arranged in parallel.
[0024]
For band division, a short-time Fourier transform, wavelet transform, band-pass filter bank, or the like is used. In the band division unit, for example, the frequency range from 20 Hz to 20 kHz is divided into 20 Hz bands. τ is the frame index of the short-time frequency analysis. The band-divided data is sent to the phase difference calculation unit 10, which calculates the phase difference for each microphone channel pair by [Equation 1].
[0025]
x_i(f, τ) is the f-th band-division datum of channel i; likewise, x_j(f, τ) is the f-th band-division datum of channel j. φ_i,j(f, τ) is the phase difference for the f-th band between channels i and j. The calculated per-channel-pair phase differences are sent to the frequency distribution unit 11. The frequency distribution unit 11 computes e_i,j(f, τ), defined by [Equation 2], from the pre-set phase difference Sp_i,j(f) of the echo component from the loudspeaker to the microphone array and the observed per-channel phase differences. If the sum of e_i,j(f, τ) over the index pairs i, j is equal to or less than a preset threshold, the f-th band is determined to be echo-dominant; if the sum is greater than the threshold, the band is determined to be near-end speech.
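[Equation 2] itself is not reproduced in this translation; one plausible reading of the decision rule, taking the mismatch e_i,j as a wrapped absolute difference (an assumption), is:

```python
import math

def is_echo_dominant(phase_diffs, stored_echo_diffs, threshold):
    """Decide whether a band is echo-dominant (sketch of [Equation 2]).

    phase_diffs / stored_echo_diffs: dicts mapping a channel pair (i, j)
    to the observed phase difference and to the pre-stored
    speaker-to-array phase difference Sp_ij(f), respectively.
    """
    total = 0.0
    for pair, phi in phase_diffs.items():
        e = abs(phi - stored_echo_diffs[pair])
        e = min(e, 2 * math.pi - e)  # wrap the mismatch into [0, pi]
        total += e
    return total <= threshold
```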
[0026]
The frequency components determined to be echo are sent to the acoustic echo canceller adaptation unit 12, which stores the adaptive filter settings for each of the divided frequency bands. For the frequency bands that the frequency distribution unit 11 determined to be echo, the acoustic echo canceller adaptation unit 12 adapts the adaptive filter h_i,τ(f, T) by [Equation 3], using the pseudo echo component Echo_i(f, τ) output from the pseudo echo generation unit 13.
[0027]
Echo_i(f, τ) is the pseudo echo component of the i-th channel microphone. h_i,τ(f, T) is the T-th tap of the adaptive filter for the f-th band of the i-th channel microphone, adapted to the signals up to frame τ-1. L is the tap length of the adaptive filter. Adaptation may be performed per frequency in this way, or, when the number of bands determined to be in the loudspeaker direction at time τ is equal to or greater than a predetermined threshold, adaptation may be performed for all frequency components at that time τ. Further, per-frequency sound source localization may be based on the MUSIC method (see Non-Patent Document 2) or the modified delay-and-sum array method (see Non-Patent Document 3). The pseudo echo generation unit 13 generates the pseudo echo component ê(f, τ) defined by [Equation 4].
[0028]
d(f, τ) is the band-division signal of the original signal output to the loudspeaker. Furthermore, the pseudo echo generation unit 13 updates the echo phase difference DB from the pseudo echo by [Equation 5].
[0029]
The inter-microphone phase differences of the adaptive filter are stored in the echo phase difference DB. The acoustic echo cancellation unit 14 uses the adaptive filter adapted by the acoustic echo canceller adaptation unit 12 to generate and output the voice digital data x̂_i(f, τ) after acoustic echo suppression, by [Equation 6].
[0030]
As described above, in the present embodiment, a microphone array having a plurality of microphone elements is used to determine, from the inter-microphone phase differences, the bands in which the loudspeaker sound is dominant, and adaptive control is performed only on those bands, so a filter with high echo suppression performance can be generated. In addition, since double-talk detection is possible, the acoustic echo canceller can be adapted only while sound is produced solely by the loudspeaker. It is therefore possible to track fluctuations in the acoustic path continuously, and because impulse-response learning is temporarily suspended while a talker is speaking, the learning rarely fails.
[0031]
FIG. 3 is a block diagram of a system that sets the initial values of the echo phase difference DB using information on the number of loudspeakers and a GUI for specifying their positions. It is realized by a speaker number and position setting GUI 15 for specifying the number of loudspeakers used in the video conference and their physical positions, an echo phase difference calculation processing unit 16 that calculates the phase differences of the acoustic echo from the set number and positions, the echo phase difference DB, and the CPU and storage means.
[0032]
In the speaker number and position setting GUI 15, the number of loudspeakers and their positions relative to the microphone array 1 are set; setting the loudspeaker direction relative to the microphone array 1 is essential. The number and position information set in the GUI is sent to the echo phase difference calculation processing unit 16, which estimates the inter-microphone phase difference Sp_i,j(f) of the acoustic echo for channels i and j from the loudspeaker positions under the far-field assumption. The estimated echo phase differences are stored in the echo phase difference DB.
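Under the far-field assumption, the expected phase difference follows from simple geometry: a plane wave from direction θ reaches one microphone later than the other by d·sin(θ)/c. A sketch with illustrative parameter names (the patent only states that Sp_i,j(f) is estimated from the loudspeaker positions):

```python
import math

def farfield_phase_difference(freq_hz, mic_spacing_m, theta_rad, c=343.0):
    """Expected phase difference Sp_ij(f) for a far-field source.

    theta_rad is measured from broadside; c is the speed of sound.
    The inter-microphone delay d*sin(theta)/c becomes a per-frequency
    phase offset of 2*pi*f*delay.
    """
    delay_s = mic_spacing_m * math.sin(theta_rad) / c
    return 2 * math.pi * freq_hz * delay_s
```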
[0033]
FIG. 4 shows a configuration that, by switching between the echo canceller and a voice switch using the adaptive algorithm and sound source direction of the present invention, realizes a video conference system that does not howl even in a large conference room exceeding the performance limit of the echo canceller. In addition to the acoustic echo cancellation unit 14, it has a sound source localization unit 17 that estimates the power of the talker's voice, a VoiceSwitch determination unit 18 that decides whether to use the voice switch from the magnitude of the acoustic echo and the talker power, and an output signal generation unit 19 that outputs the signal after acoustic echo suppression.
[0034]
The A/D conversion unit, band division unit, phase difference calculation unit, frequency distribution unit, acoustic echo canceller adaptation unit, pseudo echo generation unit, and acoustic echo cancellation unit perform the same processing as in FIG. 2. The sound source localization unit 17 computes a histogram of the phase differences of the frequency components that the frequency distribution unit 11 did not regard as echo components, and identifies sound source directions from the peaks of that histogram. The number of directions to identify is either fixed in advance, or every histogram peak whose frequency count exceeds a threshold is treated as a sound source direction. The sum of the powers over the identified directions is defined as the near-end talker power, which the sound source localization unit 17 outputs. The VoiceSwitch determination unit 18 computes, as the acoustic echo power, the sum of the post-cancellation powers over the frequencies that the frequency distribution unit determined to be echo-dominant. If the ratio of the acoustic echo power to the near-end talker power is equal to or greater than a predetermined threshold, the frame is judged to be mainly acoustic echo with no talker present, and the VoiceSwitch is used; otherwise a talker is judged to be present in the frame and the VoiceSwitch is not used. When the VoiceSwitch is used, the output signal generation unit 19 generates and outputs an all-zero signal; when it is not used, the signal after acoustic echo cancellation output from the acoustic echo canceller unit is output. Thus, if a large residual echo remains after cancellation, the VoiceSwitch determination unit 18 decides to use the VoiceSwitch, and the signal containing the residual echo is not transmitted. Transmitting a signal containing residual echo can close the loop and cause feedback. Using the VoiceSwitch therefore prevents the echo from looping, but using it at all times prevents near-end and far-end talkers from speaking simultaneously. In the VoiceSwitch determination unit 18 of the present invention, the VoiceSwitch is used only for frames in which residual echo is present, so near-end and far-end talkers can speak simultaneously when there is no residual echo; and when residual echo does occur, switching to the VoiceSwitch dramatically reduces the possibility of howling. In this embodiment, the acoustic echo power used in the determination of the VoiceSwitch determination unit 18 is obtained from the signal after acoustic echo cancellation, but it may instead be calculated from the power of the signal before cancellation.
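The switching rule described above can be sketched as follows; the threshold value and function names are illustrative, not values given in the patent:

```python
def use_voice_switch(acoustic_echo_power, near_end_power,
                     ratio_threshold=4.0):
    """VoiceSwitch decision: treat the frame as echo-only when the
    echo-to-near-end power ratio is at or above the threshold."""
    eps = 1e-12  # guard against division by zero
    return acoustic_echo_power / (near_end_power + eps) >= ratio_threshold

def output_frame(cancelled_frame, voice_switch_on):
    """All-zero output when the switch is on, else the cancelled signal."""
    if voice_switch_on:
        return [0.0] * len(cancelled_frame)
    return cancelled_frame
```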
[0035]
Here the residual echo over all frequencies is compared with the near-end talker power over all frequencies to decide whether to use the VoiceSwitch, but the residual echo and near-end talker power may instead be compared for each sub-band consisting of several frequency bins, deciding per sub-band whether to use the VoiceSwitch. In that case, for sub-bands where the VoiceSwitch is used, the output signal generation unit 19 outputs values replaced with 0; for sub-bands where it is not used, the unit outputs the signal after acoustic echo cancellation.
[0036]
FIG. 5 shows an overall view of a video conference system to which the present invention is applied. This system is a video conference system in which the adaptation of the acoustic echo canceller is controlled on the computer 101 using the phase difference and sound source direction information calculated by the phase difference calculation unit 10.
[0037]
FIG. 5 shows the system configuration of one site. The video conference system 100 performs acoustic signal processing, image processing, and communication processing on the computer 101. An A/D-D/A device 102 is connected to the computer 101; the audio signal recorded by the microphone array 105 is converted into a digital audio signal by the A/D-D/A device 102 and sent to the computer 101. The microphone array 105 has a plurality of microphone elements.
[0038]
The computer 101 performs acoustic signal processing on the digital audio signal, and the
processed audio signal is sent via the hub 103 onto the network. Here, the computer 101
includes the CPU 3, the memory 4 and the external storage medium 7 shown in FIG.
[0039]
The external storage medium 7 may be inside or outside the computer 101. The CPU 3 in the computer 101 has the band division unit 9, phase difference calculation unit 10, frequency distribution unit 11, acoustic echo canceller adaptation unit 12, pseudo echo generation unit 13, and acoustic echo canceller unit 14 shown in FIG. 2, or, as shown in FIG. 9 described later, an audio transmission unit 201, acoustic echo canceller adaptation unit 204, acoustic echo canceller unit 205, audio recording unit 203, audio reception unit 207, and audio reproduction unit 208; these realize the acoustic echo canceller.
[0040]
The image signal of the other site sent to the video conference system 100 via the hub 103 is
sent to the image display device 104 and displayed on the screen. The audio signal of the other
site sent through the hub 103 is output from the speaker 106.
[0041]
The sound received by the microphone array 105 includes the acoustic echo transmitted from the speaker 106 to the microphone array 105, which needs to be removed. The digital cables 110 and 113 are, for example, USB cables.
[0042]
FIG. 6 shows the configuration of prior-art acoustic echo suppression: an acoustic echo model for sound transmitted from the loudspeaker to the microphone element, and an acoustic echo canceller using an adaptive filter.
[0043]
All signals are expressed by their z-transforms.
The received signal d(z) is emitted from the loudspeaker and arrives at the microphone element convolved with the room impulse response H(z). The impulse response H(z) includes the direct sound from the loudspeaker to the microphone and the reflections (acoustic echo) from walls, floor, ceiling, and so on.
[0044]
In the microphone element, the talker's voice N(z) is mixed in addition to the acoustic echo. If the microphone signal X(z) is transmitted as-is, the transmitted voice contains acoustic echo, the signal loops, and in the worst case howling occurs and communication becomes impossible. It is therefore necessary to suppress only the acoustic echo in the transmitted voice.
[0045]
The adaptive filter W(z) adaptively learns the room impulse response H(z), and a pseudo acoustic echo can be generated by applying W(z) to the received signal. The adaptation of W(z) is performed using, for example, the NLMS method, which updates the filter as W(z) = W(z) + 2μ X(z) N̂(z)* / |X(z)|², where μ is the step size and N̂(z) is the residual (transmission) signal.
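The same NLMS update can be sketched in the sample domain; the following is a minimal illustration (the variable names and the step size mu are assumptions, not values from the patent):

```python
def nlms_step(w, x_hist, mic_sample, mu=0.5, eps=1e-9):
    """One time-domain NLMS update in the spirit of the z-domain
    formula quoted above.

    w          : filter taps approximating the room response H
    x_hist     : most recent reference (loudspeaker) samples, newest first
    mic_sample : microphone sample containing echo (plus near speech N)
    Returns (updated taps, residual transmission sample).
    """
    y = sum(wi * xi for wi, xi in zip(w, x_hist))  # pseudo echo
    e = mic_sample - y                             # transmission signal
    norm = sum(xi * xi for xi in x_hist) + eps     # normalization
    w_new = [wi + 2 * mu * e * xi / norm for wi, xi in zip(w, x_hist)]
    return w_new, e
```

With N(z) = 0, repeated steps drive the residual toward zero, i.e. W toward H, which is exactly the behaviour the surrounding paragraphs describe.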
[0046]
When W(z) = H(z), only the talker's voice N(z) can be extracted by subtracting the pseudo acoustic echo from the microphone signal.
[0047]
When W(z) = H(z) and N(z) = 0, the transmitted voice is 0.
That is, the NLMS update described above adapts W(z) so that the transmitted speech becomes zero.
[0048]
However, if N(z) is not 0, W(z) is still adapted so that the transmitted speech becomes 0, and as a result the adaptation of W(z) fails. Therefore, when N(z) is not 0, adaptation must be suppressed.
[0049]
FIG. 7 shows a system applying the present invention that has a double-talk detector with the function of disabling adaptation when N(z) is not zero. The double-talk detector determines whether N(z) is zero and runs the adaptive filter only when N(z) is close to zero.
[0050]
This system applying the present invention is characterized in that the double-talk detection unit 702 makes its determination using the sound-source direction-of-arrival information that the phase difference calculation unit 701 obtains from the sound received by the microphone array.
[0051]
If N(z) is not 0, updating the acoustic echo canceller causes the adaptation to fail and risks divergence of the filter; the double-talk detector is therefore essential to avoid that risk.
[0052]
FIG. 8 shows the flow of the audio streams in a two-site video conference system and in a conference system with three or more sites.
Here, the phase difference calculation unit may reside in the server or in the CPU of each site.
[0053]
In the two-site case, the transmission signal after acoustic echo cancellation is sent from the video conference system at site A over the network to the video conference system at site B and reproduced at site B.
The voice of site B is likewise sent to site A and reproduced.
[0054]
In the case of three or more sites, the data is first collected by a server or CPU, then redistributed to each site and reproduced.
[0055]
FIG. 9 shows the block configuration of the video conference system when the present invention is applied.
The received voice transmitted over the network is received by the voice reception unit 207 and sent to the voice reproduction unit 208, which reproduces it through the speaker.
[0056]
The received voice is sent to the acoustic echo canceller unit 205. The voice recording unit 203
records voice signals of the microphone array. The recorded audio signal is sent to the acoustic
echo canceller unit 205.
[0057]
The acoustic echo canceller unit 205 generates a pseudo echo from the acoustic echo
cancellation filter stored in the acoustic echo cancellation filter DB 211 and the reception voice,
and subtracts the pseudo echo from the audio signal of the microphone array. As a result of
subtraction, the remaining error signal is sent to the acoustic echo canceller adaptation unit 204.
[0058]
The acoustic echo canceller adaptation unit 204 adapts the acoustic echo canceller to zero the
error signal. The adapted result is stored in the acoustic echo cancellation filter DB 211. The
error signal output from the acoustic echo cancellation unit 205 is sent to the voice transmission
unit 201.
[0059]
The voice transmission unit 201 transmits an error signal to another site. The image capturing
unit 210 captures an image with a camera. The photographed image is sent to the image
transmission unit 202 and transmitted to another site.
[0060]
An image reception unit 209 receives an image sent from another site. The received image is sent
to the image display unit 206. The image display unit 206 displays the sent image on the screen.
[0061]
FIG. 10 shows the processing flow of the video conference system. In the acoustic echo canceller
adaptation processing S1, a learning signal is played from the speaker to adapt the acoustic
echo canceller. The learning signal is preferably white noise, and its length is desirably
several seconds to several tens of seconds or more. If the learning length is too short, the
acoustic echo canceller may not learn the room impulse response sufficiently; setting the
learning signal length to several seconds to several tens of seconds or more allows the impulse
response to be learned sufficiently.
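As a concrete sketch of this adaptation step, a learning signal of the kind described (white noise, several seconds or more, at an assumed 32 kHz sampling rate) could be generated as follows; the function name and parameters are illustrative, not part of the patent:

```python
import numpy as np

def make_learning_signal(duration_s=10.0, fs=32000, seed=0):
    """White-noise learning signal, several seconds long as the text
    recommends, normalised so loudspeaker playback does not clip."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(int(duration_s * fs))
    return x / np.abs(x).max()

learning_signal = make_learning_signal()
```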
[0062]
After the end of learning, it is determined whether a connection request has been issued from
another location S2. If there is a connection request from another base, connection S4 is made
with the other base.
[0063]
If a connection request is not issued from another site, it is determined whether a connection
request is issued from the own site S3. The connection request from the own site is issued by the
user through the GUI.
[0064]
If there is a connection request from the own site, connection S4 with other sites is performed. If
a connection request is not issued from the own site, the connection is not made with another
site, and the process returns to the determination of S2 whether the connection request is issued
from another site.
[0065]
That is, the video conference system waits until a connection request is issued from either the
own site or another site.
[0066]
After connection with another site (S4), reproduction from the speaker (S6), image display (S7),
voice recording (S8), echo cancellation (S9), and voice transmission (S10) are repeated until
the connection is disconnected.
[0067]
In the reproduction S6 from the speaker, the reception voice sent from the other base is
reproduced.
[0068]
In the image display S7, an image sent from another site is displayed on the monitor.
[0069]
In voice recording S8, the voice of the microphone array at the own site is recorded.
[0070]
In echo cancellation S9, an acoustic echo component is suppressed from the sound of the
recorded microphone array.
[0071]
In audio transmission S10, the audio signal after the acoustic echo component suppression is
transmitted.
If it is determined in S11 that the connection has been disconnected, disconnection from the
other site (S13) is performed and the video conference system ends.
[0072]
If it is determined that the connection has not been disconnected, it is determined in S12
whether there is a disconnection request from the user of the own site through the GUI. If there
is a disconnection request, the connection with the other site is disconnected (S13) and the
video conference system ends.
[0073]
FIG. 11 shows sparsity which is a basic concept of double talk processing which is a main
element of the present invention.
[0074]
In the present invention, the voice signal from the microphone array and the received signal used
for pseudo echo generation are all subjected to short time Fourier transform, wavelet transform
or subband processing, and converted into frequency domain signals.
It is desirable that the frame size at the time of the short time Fourier transform is a point
number corresponding to about 50 ms.
[0075]
For example, at a sampling rate of 32 kHz, a frame size of 2048 points is desirable.
Speech is stationary for roughly several tens of ms, so with such a frame size the sparsity
assumption holds best in the frequency domain and the adaptive processing of the acoustic echo
canceller can operate with high accuracy.
[0076]
In addition, it is desirable to perform short-time Fourier transformation after applying a
Hamming window, a Hanning window, a Blackman window, or the like.
The short-time Fourier transform assumes that the signal repeats with a period equal to the
analysis length.
If no window function is applied, the values at the two ends of the frame differ, and frequency
components that do not actually exist are observed after the short-time Fourier transform.
Applying a window function in this manner makes such non-existent frequency components
difficult to observe and improves the frequency analysis accuracy.
[0077]
The frame shift is preferably about 1/4 or 1/8 of the frame size, and the finer the frame shift, the
better the sound quality of the output voice.
However, as the frame shift is made finer, the processing amount becomes larger, so it is
necessary to make the frame shift as fine as possible within the range that can be processed in
real time at the processing speed of the installed computer.
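The time-frequency analysis described in the preceding paragraphs can be sketched as below: 2048-point frames, a window function (Hanning here; Hamming or Blackman would also fit the text), and a hop of a quarter of the frame size. This is an illustrative reading of the text, not the patent's implementation:

```python
import numpy as np

def stft(x, frame=2048, hop=None, window=np.hanning):
    """Short-time Fourier transform: windowed frames with a hop of a
    quarter of the frame size. Returns (n_frames, frame//2 + 1) spectra."""
    if hop is None:
        hop = frame // 4          # finer hops give better quality, more CPU
    w = window(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop : i * hop + frame] * w
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

# One second of audio at 32 kHz -> 59 frames of 1025 frequency bins.
spectra = stft(np.random.default_rng(0).standard_normal(32000))
```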
[0078]
FIG. 11 shows a grid in which the horizontal axis is time (frame number) and the vertical axis is
frequency (frequency bin number).
[0079]
In the double-talk processing of the present invention, it is determined for each time-frequency
component whether the component is an acoustic echo component or a non-acoustic-echo component,
and the adaptation processing of the acoustic echo canceller is performed only for the
time-frequency components determined to be acoustic echo components.
[0080]
Speech is a sparse signal when viewed in the time-frequency domain, and it is known that mixing
of a plurality of speech on the same time-frequency is rare.
[0081]
When both the received signal and the transmitted signal are voice signals, as in a
teleconferencing system, classifying each time-frequency component as an acoustic echo component
or a non-acoustic-echo component on the basis of sparsity makes it possible to extract only the
acoustic echo component with high accuracy.
[0082]
FIG. 12 shows the basic configuration of an acoustic echo canceller to which the present
invention is applied.
[0083]
First, a band division unit performs frequency decomposition S101 on an audio signal input to
the computer 101 from a microphone array having a plurality of microphone elements, and
converts recorded audio into a signal in the frequency domain.
[0084]
Next, the phase difference calculation unit calculates the inter-element phase difference of the
recorded voice.
[0085]
Next, the frequency distribution unit determines, from the phase difference of each band output
by the phase difference calculation unit, to which sound each band division signal belongs.
That is, it determines whether the signal is a loudspeaker output signal or a talker's voice signal.
[0086]
Then, the acoustic echo canceller unit removes the acoustic echo contained in the band division
signal (S102).
[0087]
In S102, W(z) is multiplied by the reference signal d(z) to generate a pseudo echo W(z)d(z).
Acoustic echo can be eliminated from the microphone input signal by subtracting W(z)d(z) from
the microphone input signal x(z).
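Interpreting the subtraction above per frequency bin, a minimal sketch (names illustrative) is:

```python
import numpy as np

def cancel_echo(X, D, W):
    """S102 per frequency bin: the pseudo echo W*D is subtracted from the
    microphone spectrum X; the residual N' is what remains."""
    return X - W * D

rng = np.random.default_rng(1)
W = rng.standard_normal(4) + 1j * rng.standard_normal(4)   # echo-path filter
D = rng.standard_normal(4) + 1j * rng.standard_normal(4)   # reference d(z)
X = W * D                      # microphone input that is pure acoustic echo
residual = cancel_echo(X, D, W)
```

When the microphone picks up only the acoustic echo, the residual is zero.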
[0088]
The acoustic echo canceller adaptation unit judges the double-talk state from the band division
signals: when many of the components are loudspeaker output signals, it performs the adaptive
processing S104 of the acoustic echo canceller.
[0089]
The adaptive processing of the acoustic echo canceller updates the filter W(z) of the acoustic
echo canceller by the NLMS method or the like.
In the NLMS method, W(z) = W(z) + 2μX(z)N′(z)* / |X(z)|².
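A runnable sketch of a per-bin NLMS update of this kind is shown below. Note that this uses the common conjugate-reference form, with the received (far-end) spectrum as the regressor, which differs slightly in notation from the patent's formula; the step size μ, dimensions, and names are illustrative:

```python
import numpy as np

def nlms_update(W, D, X, mu=0.5, eps=1e-12):
    """One NLMS step per frequency bin. D: reference (received) spectrum,
    X: microphone spectrum, W: current echo-path filter. The residual is
    N' = X - W*D; eps guards against division by a tiny |D|^2."""
    residual = X - W * D
    W_new = W + mu * np.conj(D) * residual / (np.abs(D) ** 2 + eps)
    return W_new, residual

# With a fixed true per-bin transfer function the update converges to it.
rng = np.random.default_rng(2)
W_true = rng.standard_normal(8) + 1j * rng.standard_normal(8)
W = np.zeros(8, dtype=complex)
for _ in range(200):
    D = rng.standard_normal(8) + 1j * rng.standard_normal(8)
    W, _ = nlms_update(W, D, W_true * D)
```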
[0090]
In the double-talk state, the echo canceller adaptation (S104) is not performed, and the acoustic
echo canceller processing ends.
[0091]
FIG. 13 shows a processing flow of processing in which only time-frequency components coming
from the direction of the speaker are regarded as acoustic echoes using sound source direction
information.
[0092]
In frequency decomposition S201, the recorded voice is converted into a time-frequency domain
signal.
[0093]
In sound source localization S202, sound source direction estimation is performed based on the
modified delay-and-sum array method.
The modified delay-sum array method used for sound source direction estimation in the present
invention determines from which direction the component comes from for each time-frequency.
[0094]
As shown in FIG. 11, since speech is a sparse signal when viewed per time-frequency component,
it can be assumed that each time-frequency component has either an acoustic echo or a
non-acoustic-echo sound as its main component.
Therefore, if the time-frequency components coming from the direction of the loudspeaker are
selected from the sound source localization result estimated for each time-frequency, those
components can be regarded as having acoustic echo as their main component.
[0095]
In the modified delay-sum array method, sound source localization is performed using a steering
vector Aθ(f) for the sound source direction θ.
When the number of microphones is M, Aθ(f) is an M-dimensional complex vector.
[0096]
Here, f represents a frequency bin number.
The time-frequency representation of the input signal is denoted X(f, τ).
τ is the frame number of the short-time Fourier transform.
X(f, τ) is an M-dimensional complex vector whose elements are the frame-τ, frequency-f
components of each microphone element.
[0097]
In the modified delay-sum array method, the virtual sound source direction θ that maximizes
|Aθ(f)*X(f, τ)| is estimated to be the sound source direction of frame τ and frequency f.
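This maximization can be sketched as follows for one time-frequency component; the linear array geometry, the speed of sound, and all names are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def localize(X_fm, mic_pos, f_hz, thetas, c=340.0):
    """Modified delay-and-sum rule for one (frame, frequency) component:
    return the virtual direction theta maximising |A_theta(f)^H X(f, tau)|.
    X_fm: (M,) complex microphone spectra; mic_pos: (M,) positions in
    metres along a line (an assumed geometry)."""
    best_theta, best_power = thetas[0], -1.0
    for theta in thetas:
        delays = mic_pos * np.sin(theta) / c        # plane-wave delays
        a = np.exp(-2j * np.pi * f_hz * delays)     # steering vector
        power = np.abs(np.vdot(a, X_fm))            # |A_theta(f)^H X|
        if power > best_power:
            best_theta, best_power = theta, power
    return best_theta

# A plane wave from 20 degrees observed on a 4-element, 5 cm spaced array:
mics = np.array([0.0, 0.05, 0.10, 0.15])
f_hz = 1000.0
x_true = np.exp(-2j * np.pi * f_hz * mics * np.sin(np.deg2rad(20.0)) / 340.0)
est = localize(x_true, mics, f_hz, np.deg2rad(np.arange(-90, 91)))
```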
[0098]
In the speaker direction identification step S203, for each virtual sound source direction θ,
the number of frequencies f whose sound source direction is estimated to be θ, or
log|Aθ(f)*X(f, τ)|, is accumulated to create a histogram.
Then, the peak of the histogram within a predetermined range of loudspeaker directions (for
example, −30° to 30°) is found, and that direction is set as the loudspeaker direction θsp.
[0099]
In step S204 it is determined whether the sound source direction is the loudspeaker direction:
if the estimated sound source direction θ̂ at frequency f satisfies |θ̂ − θsp| < Δ, the frequency
f is regarded as a frequency component that has arrived from the loudspeaker direction.
Such a component is regarded as an acoustic echo component, and echo canceller adaptation S205
is performed.
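The histogram-based speaker-direction step could look like the following sketch; the search range, the tolerance, and all names are illustrative assumptions:

```python
import numpy as np

def speaker_direction_mask(theta_hat_deg, search=(-30, 30), delta=5.0):
    """Histogram the per-frequency direction estimates, take the peak
    inside the assumed loudspeaker range as theta_sp, and flag every
    frequency within delta degrees of it as an acoustic echo component."""
    th = np.asarray(theta_hat_deg)
    bins = np.arange(search[0], search[1] + 1)
    in_range = th[(th >= search[0]) & (th <= search[1])]
    counts = np.array([(in_range == b).sum() for b in bins])
    theta_sp = bins[np.argmax(counts)]
    return np.abs(th - theta_sp) < delta, theta_sp

# Estimates clustered near 10 degrees, with two outliers (talker, noise):
mask, theta_sp = speaker_direction_mask([10, 10, 11, -40, 60, 9, 10])
```

Only the frequencies flagged by the mask would then feed the echo canceller adaptation.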
[0100]
After S204 and S205 are performed on all frequencies, the adaptation processing of the acoustic
echo canceller is ended.
[0101]
Next, FIG. 14 shows an acoustic echo canceller adaptation process that uses information, similar
to the sound source direction, which can be calculated from the pseudo echo.
[0102]
Frequency decomposition S301 is performed to convert the recorded voice into a signal in the
frequency domain.
[0103]
An acoustic echo filter is applied to the converted signal in the frequency domain to calculate a
pseudo echo S302.
[0104]
The similarity between the pseudo echo calculated for each frequency f and the input signal is
calculated S303.
In the similarity calculation process, the pseudo echo E(f, τ) is used.
E(f, τ) is an M-dimensional complex vector whose elements are the frame-τ, frequency-f pseudo
echo components of each microphone element.
The 0-th element of E(f, τ) is written E0(f, τ).
Define E′(f, τ) = E(f, τ) / E0(f, τ).
Further, E″(f, τ) = E′(f, τ) / |E′(f, τ)|.
Let the similarity be |E″(f, τ)*X(f, τ)| / |X(f, τ)|.
[0105]
This similarity compares the sound source direction of the acoustic echo component with that of
the input signal; when the input signal contains only the acoustic echo component, the
similarity is 1. A value obtained by multiplying the similarity by a different weight α(f) for
each frequency is taken as the final similarity, where α(f) = 1 / Σ|E″(f, τ)*Aθ(f)|.
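The similarity measure described in this paragraph can be sketched as follows, reading |E′| as the vector norm so that a pure-echo input scores exactly 1; the function name and toy dimensions are illustrative:

```python
import numpy as np

def echo_similarity(E, X):
    """Similarity between pseudo echo E(f, tau) and input X(f, tau), both
    (M,) complex vectors for one time-frequency component. E is normalised
    by its 0-th element and then to unit length; |E''^H X| / |X| equals
    1.0 when X is purely the acoustic echo component."""
    E_p = E / E[0]
    E_pp = E_p / np.linalg.norm(E_p)
    return np.abs(np.vdot(E_pp, X)) / np.linalg.norm(X)

rng = np.random.default_rng(3)
E = rng.standard_normal(4) + 1j * rng.standard_normal(4)
sim_echo_only = echo_similarity(E, 0.7 * E)   # input = scaled echo only
```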
[0106]
When the similarity exceeds the predetermined threshold th (S304), the adaptation S305 of the
echo canceller is performed, and when the similarity falls below the threshold th, the adaptation
S305 of the echo canceller is not performed.
[0107]
After S303 to S305 are performed for every frequency, the adaptation process of the acoustic
echo canceller is finished.
[0108]
FIG. 15 shows acoustic echo canceller adaptation processing using information similar to the
sound source direction that can be calculated from the pseudo echo.
In this process, whether to adapt the acoustic echo canceller is the same for all frequency
components in the frame.
[0109]
Frequency decomposition S401 is performed to convert the recorded voice into a frequency
domain signal.
An acoustic echo filter is applied to the converted signal in the frequency domain to calculate a
pseudo echo S402.
[0110]
The degree of similarity between the pseudo echo calculated for each frequency f and the input
signal is calculated S403.
[0111]
In the similarity calculation process, the pseudo echo E(f, τ) is used.
E(f, τ) is an M-dimensional complex vector whose elements are the frame-τ, frequency-f pseudo
echo components of each microphone element. The 0-th element of E(f, τ) is written E0(f, τ).
Define E′(f, τ) = E(f, τ) / E0(f, τ). Further, E″(f, τ) = E′(f, τ) / |E′(f, τ)|. Let the
similarity be |E″(f, τ)*X(f, τ)| / |X(f, τ)|.
[0112]
This similarity compares the sound source direction of the acoustic echo component with that of
the input signal; when the input signal contains only the acoustic echo component, the
similarity is 1. A value obtained by multiplying the similarity by a different weight α(f) for
each frequency is taken as the final similarity, where α(f) = 1 / Σ|E″(f, τ)*Aθ(f)|. The
similarity is calculated for every frequency. Then, the similarities of all frequencies are
added (S404).
[0113]
If the added similarity exceeds the predetermined threshold th (S405), echo canceller adaptation
S406 is performed for all frequency components; if it is below th, the adaptation is not
performed and the process ends. When adaptation is performed, the acoustic echo canceller
adaptation processing ends after S406.
[0114]
FIG. 16 shows a processing flow of processing in which only time-frequency components coming
from the direction of the speaker are regarded as acoustic echoes using sound source direction
information.
[0115]
Whether to adapt the acoustic echo canceller in this process is the same for all frequency
components in the frame.
[0116]
In the frequency decomposition S501, the recorded voice is converted into a time-frequency
domain signal.
In sound source localization S502, sound source direction estimation is performed based on the
modified delay-and-sum array method.
[0117]
In the modified delay-sum array method, sound source localization is performed using a steering
vector Aθ(f) for the sound source direction θ.
When the number of microphones is M, Aθ(f) is an M-dimensional complex vector.
[0118]
Here, f represents a frequency bin number. The time-frequency representation of the input signal
is denoted X(f, τ). τ is the frame number of the short-time Fourier transform. X(f, τ) is an
M-dimensional complex vector whose elements are the frame-τ, frequency-f components of each
microphone element.
[0119]
In the modified delay-sum array method, the virtual sound source direction θ that maximizes
|Aθ(f)*X(f, τ)| is estimated to be the sound source direction of frame τ and frequency f.
[0120]
In the speaker direction identification step S503, for each virtual sound source direction θ,
the number of frequencies f whose source direction is estimated to be θ, or log|Aθ(f)*X(f, τ)|,
is accumulated to create a histogram.
Then, the peak of the histogram within a predetermined range of loudspeaker directions (for
example, −30° to 30°) is found, and that direction is set as the loudspeaker direction θsp.
[0121]
In step S504 it is determined whether the sound source direction is the loudspeaker direction:
if the estimated sound source direction θ̂ at frequency f satisfies |θ̂ − θsp| < Δ, the frequency
f is regarded as a frequency component that has arrived from the loudspeaker direction.
Then, for the frequencies f regarded as the loudspeaker direction, log|Aθ(f)*X(f, τ)| is
accumulated to obtain a power spectrum added in the frequency direction (S505). Whether the
power spectrum added over all frequencies is equal to or greater than a predetermined value is
determined in S506. If it is equal to or greater than the threshold, echo canceller adaptation
S507 is performed for all frequency components, and the process ends.
[0122]
FIG. 17 is a diagram showing the effect of the adaptive processing of the present acoustic echo
canceller. This is the result of suppressing the acoustic echo in the input signal in which only the
acoustic echo is present, using the echo cancellation filter adapted with the voice (double-talk
voice) spoken by the speaker at the local site.
[0123]
In this case, it is desirable that all input signals be suppressed to be silent. The upper part shows
the result in the case of performing adaptive control according to the present invention.
[0124]
The lower part shows the result when adaptive control is not performed. In the figure,
time-frequency components with smaller power are shown brighter and those with greater power are
shown darker. The horizontal axis is time, and the vertical axis is frequency. When adaptive
control is performed, the signal power after acoustic echo suppression is particularly small at
high frequencies, showing that the acoustic echo suppression performance is high.
[0125]
In the video conference system using the present invention, a configuration may be adopted in
which the acoustic echo canceller is adapted by playing a white-noise signal or the like from
the loudspeaker of the own site before connecting to another site.
[0126]
FIG. 18 shows a process flow in the case where the acoustic echo canceller is adapted in advance.
[0127]
Before connecting to another site, adaptive processing S601 of the acoustic echo canceller is
performed using the data of all frames while sound is played from the loudspeaker at the own site.
[0128]
This corresponds to unconditionally performing the adaptive processing of the acoustic echo
canceler without performing the double talk detection of the present invention.
[0129]
Then, waiting for connection with another site (S602) is performed; the system waits until the
user at the own site issues a connection request to another site or a connection request arrives
from another site.
[0130]
After connecting to another site, acoustic echo canceller adaptation processing S603 with
adaptive control by the double talk detection processing of the present invention is repeated.
Then, after disconnection of the video conference system, the process ends.
[0131]
FIG. 19 shows a flow of processing for performing control of non-linear processing by
VoiceSwitch using the similarity between the input signal and the pseudo echo.
[0132]
Frequency decomposition S701 is performed to convert the recorded voice into a signal in the
frequency domain.
[0133]
An acoustic echo filter is applied to the converted signal in the frequency domain to calculate a
pseudo echo S702.
[0134]
The degree of similarity between the pseudo echo calculated for each frequency f and the input
signal is calculated S703.
[0135]
In the similarity calculation process, the pseudo echo E(f, τ) is used.
E(f, τ) is an M-dimensional complex vector whose elements are the frame-τ, frequency-f pseudo
echo components of each microphone element.
The 0-th element of E(f, τ) is written E0(f, τ).
Define E′(f, τ) = E(f, τ) / E0(f, τ).
Further, E″(f, τ) = E′(f, τ) / |E′(f, τ)|.
Let the similarity be |E″(f, τ)*X(f, τ)| / |X(f, τ)|.
This similarity compares the sound source direction of the acoustic echo component with that of
the input signal; when the input signal contains only the acoustic echo component, the
similarity is 1.
A value obtained by multiplying the similarity by a different weight α(f) for each frequency is
taken as the final similarity, where α(f) = 1 / Σ|E″(f, τ)*Aθ(f)|.
[0136]
If the similarity exceeds the predetermined threshold th (S704) and the power of the input
signal is equal to or higher than its threshold (S705), the corresponding frequency component of
the transmission voice is set to zero (S706). Otherwise, the signal after echo cancellation is
used as the corresponding frequency component of the transmission voice, and the processing ends.
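The per-frequency VoiceSwitch decision just described can be sketched as below; both thresholds and all names are illustrative assumptions:

```python
import numpy as np

def voice_switch(residual, similarity, power, sim_th=0.9, pow_th=1e-3):
    """Bins whose echo similarity and input power both exceed their
    thresholds are zeroed in the transmission signal; other bins pass
    the echo-cancelled residual through unchanged."""
    mute = (similarity > sim_th) & (power >= pow_th)
    return np.where(mute, 0.0, residual)

# Bin 0: strong echo-like, loud -> muted. Bin 1: dissimilar -> kept.
# Bin 2: echo-like but nearly silent -> kept.
out = voice_switch(np.array([1.0, 2.0, 3.0]),
                   np.array([0.95, 0.5, 0.99]),
                   np.array([1.0, 1.0, 1e-6]))
```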
[0137]
FIG. 20 shows a flow of processing for performing control of non-linear processing by
VoiceSwitch using the similarity between the input signal and the pseudo echo. The
determination of whether to use VoiceSwitch is common to all frequencies.
[0138]
Frequency decomposition S801 is performed to convert the recorded voice into a signal in the
frequency domain. An acoustic echo filter is applied to the converted signal in the frequency
domain to calculate pseudo echo S802. The similarity between the pseudo echo calculated for
each frequency f and the input signal is calculated S803.
[0139]
In the similarity calculation process, the pseudo echo E(f, τ) is used. E(f, τ) is an
M-dimensional complex vector whose elements are the frame-τ, frequency-f pseudo echo components
of each microphone element. The 0-th element of E(f, τ) is written E0(f, τ). Define
E′(f, τ) = E(f, τ) / E0(f, τ). Further, E″(f, τ) = E′(f, τ) / |E′(f, τ)|. Let the similarity be
|E″(f, τ)*X(f, τ)| / |X(f, τ)|. This similarity compares the sound source direction of the
acoustic echo component with that of the input signal; when the input signal contains only the
acoustic echo component, the similarity is 1. A value obtained by multiplying the similarity by
a different weight α(f) for each frequency is taken as the final similarity, where
α(f) = 1 / Σ|E″(f, τ)*Aθ(f)|. The similarity is calculated for every frequency, and the
similarities of all frequencies are added (S804).
[0140]
When the similarity after addition exceeds the predetermined threshold th (S805) and the power
of the input signal is equal to or higher than the threshold, the transmission voice signal is set to
0. Otherwise, the processing is terminated using the signal after echo cancellation as the
transmission voice.
[0141]
FIG. 21 shows a flow of processing for controlling the non-linear suppression coefficient of the
residual echo after acoustic echo cancellation using the similarity between the input signal and
the pseudo echo.
[0142]
Frequency decomposition S901 is performed to convert the recorded voice into a signal in the
frequency domain.
[0143]
An acoustic echo filter is applied to the converted frequency domain signal to calculate pseudo
echo S902.
[0144]
The similarity between the pseudo echo calculated for each frequency f and the input signal is
calculated S903.
[0145]
In the similarity calculation process, the pseudo echo E(f, τ) is used.
E(f, τ) is an M-dimensional complex vector whose elements are the frame-τ, frequency-f pseudo
echo components of each microphone element.
The 0-th element of E(f, τ) is written E0(f, τ).
Define E′(f, τ) = E(f, τ) / E0(f, τ).
Further, E″(f, τ) = E′(f, τ) / |E′(f, τ)|. Let the similarity be |E″(f, τ)*X(f, τ)| / |X(f, τ)|.
This similarity compares the sound source direction of the acoustic echo component with that of
the input signal; when the input signal contains only the acoustic echo component, the
similarity is 1. A value obtained by multiplying the similarity by a different weight α(f) for
each frequency is taken as the final similarity, where α(f) = 1 / Σ|E″(f, τ)*Aθ(f)|.
[0146]
When the similarity exceeds the predetermined threshold th (S904) and the power of the input
signal is equal to or higher than its threshold (S905), the nonlinear suppression coefficient β
is set to a predetermined value β0; otherwise β is set to β1 (S906). It is determined in advance
that β0 > β1.
[0147]
The signal after acoustic echo cancellation (S907) is written n′(f, τ), and the pseudo echo
component e(f, τ). In the nonlinear suppression processing S908,
n″(f, τ) = Floor(|n′(f, τ)| − β|e(f, τ)|)·arg(n′(f, τ)) is computed, and n″(f, τ) is output.
Here, Floor(x) is a function that returns x when x is 0 or more and 0 when x is less than 0, and
arg(x) is a function that returns the phase component of x. After the nonlinear suppression
processing has been performed for all frequencies, the processing ends.
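A sketch of this spectral-subtraction-style suppression is given below, reading Floor as max(·, 0) and arg(·) as the unit phasor of its argument (an interpretation, since the text says arg returns the phase component); β and the toy values are illustrative:

```python
import numpy as np

def nonlinear_suppress(n_res, e_pseudo, beta):
    """S908 sketch: subtract beta times the pseudo-echo magnitude from the
    residual magnitude, floor at zero, and keep the phase of n'."""
    mag = np.maximum(np.abs(n_res) - beta * np.abs(e_pseudo), 0.0)
    phase = n_res / np.maximum(np.abs(n_res), 1e-12)   # unit phasor of n'
    return mag * phase

# First bin: |n'| = 5 shrinks to 3, phase kept. Second bin: floored to 0.
n_pp = nonlinear_suppress(np.array([3 + 4j, 0.1 + 0j]),
                          np.array([1 + 0j, 1 + 0j]), beta=2.0)
```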
[0148]
As described above, the present invention realizes suppression of acoustic echo in a
teleconference system, a video conference system, or the like, and can be applied to acoustic
echo cancellation in the double-talk state.
[0149]
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1: Hardware configuration of the present invention.
FIG. 2: Block diagram of the software of the present invention.
FIG. 3: Block diagram of the speaker-count and position designation by the GUI of the present invention.
FIG. 4: Block diagram of the VoiceSwitch switching processing by sound source localization of the present invention.
FIG. 5: Overall apparatus in which the present invention is applied to a video conference system.
FIG. 6: Diagram explaining a prior-art acoustic echo canceller.
FIG. 7: Diagram explaining an acoustic echo canceller using double talk detection processing.
FIG. 8: Diagram showing the structure of a two-site video conference.
FIG. 9: Block diagram of a video conference system to which the present invention is applied.
FIG. 10: Processing flow diagram of the video conference system to which the present invention is applied.
FIG. 11: Diagram explaining the acoustic echo component identification method in the present invention.
FIG. 12: Processing flow diagram of the echo canceller, double talk judgment, and echo canceller adaptation processing of the present invention.
FIG. 13: Per-frequency echo canceller adaptive control flow diagram using the sound source direction.
FIG. 14: Per-frequency echo canceller adaptive control flow diagram using information similar to the sound source direction.
FIG. 15: Echo canceller adaptive control flow diagram for all frequency components using information similar to the sound source direction.
FIG. 16: Echo canceller adaptive control flow diagram for all frequency components using the sound source direction.
FIG. 17: Diagram showing the effect of the present invention.
FIG. 18: Processing flow diagram when adaptation processing is performed before a video conference starts.
FIG. 19: Per-frequency VoiceSwitch control flow diagram using information similar to the sound source direction.
FIG. 20: VoiceSwitch control flow diagram for all frequencies using information similar to the sound source direction.
FIG. 21: Per-frequency suppression coefficient control flow diagram of the nonlinear acoustic echo suppression processing using information similar to the sound source direction.
Explanation of sign
[0150]
1: Microphone array, 2: A/D converter and D/A converter, 3: Central processing unit, 4: Memory,
5: Hub, 6: Speaker, 7: External storage medium, 8: A/D converter, 9: band division unit,
10 ... phase difference calculation unit, 11 ... frequency distribution unit, 12 ... acoustic echo
canceller adaptation unit, 13 ... pseudo echo generation unit, 14 ... acoustic echo cancellation
unit, 15 ... number of speakers and position setting GUI, 16 ... echo phase difference calculation
processing unit, 17 ... sound source localization unit, 18 ... Voice Switch determination unit, 19 ...
output signal generation unit, 100 ... video conference system, 101 ... computer, 102 ... A / DD /
A Device 103 103 Hub 104 Image display device 105 Microphone array 106 Speaker 107
Camera 108 Audio cable 109 Audio cable 1 0 ... digital cable, 111 ... monitor cable, 112 ... LAN
cable, 113 ... digital cable, 201 audio transmission unit, 202 ... image transmission unit, 203 ...
audio recording unit, 204 Acoustic echo canceller adaptation unit 205 Acoustic echo canceller
unit 206 Image display unit 207 Audio reception unit 208 Audio reproduction unit 209 Image
reception unit 210: image capturing unit, 211: acoustic echo canceller filter DB, 701: phase
difference calculating unit, 702: double talk detecting unit, S1: acoustic echo canceller adaptive
processing, S2: Determination of whether a connection request has been received from another
site, S3: determination of whether a connection request has been received from the own site, S4:
connection to another site, S6: reproduction processing from a speaker, S ... Image display, S8 ...
Voice recording, S9 ... Echo cancellation, S10 ... Voice transmission, S11 ... Determination of
disconnection, S12 ... Whether there is a disconnection request from the own site S13 ...
Disconnect from other sites, S101 ... Frequency resolution, S102 ... Echo canceller execution,
S103 ... Double-talk state determination, S104 ... Echo canceller adaptation, S201 и и и Frequency
decomposition, S202: sound source localization, S203: speaker direction identification, S204:
determination of whether the sound source direction is the speaker direction, S205: echo
canceller adaptation, S301: frequency decomposition, S302: pseudo echo calculation, S303:
calculation of similarity with the input signal, S304: determination of whether the similarity
is above the threshold, S305: echo canceller adaptation,
S401: frequency decomposition, S402: pseudo echo calculation, S403: similarity calculation with
input signal, S404: similarity addition processing, S405: similarity after addition is a threshold
S406: all-frequency echo canceller adaptation, S501: frequency decomposition, S502: sound source localization, S503: speaker direction identification, S504: determination of whether the source direction is the speaker direction, S505: power spectrum addition processing, S506: determination of whether the speaker-direction spectrum is above the threshold, S507: echo canceller adaptation, S601: all-frame adaptation processing, S602: waiting for connection to another site, S603: acoustic echo canceller adaptive processing with adaptive control, S701: frequency decomposition, S702: pseudo echo calculation, S703: calculation of similarity with the input signal, S704: determination of whether the similarity is above the threshold, S705: determination of whether the power is above the threshold, S706: VoiceSwitch determination, S801: frequency decomposition, S802: pseudo echo calculation, S803: similarity calculation with the input signal, S804: similarity addition processing, S805: determination of whether the similarity after addition is above the threshold, S806: determination of whether the input power is above the threshold, S807: all-band VoiceSwitch determination, S901: frequency decomposition, S902: pseudo echo calculation, S903: similarity calculation with the input signal, S904: determination of whether the similarity is above the threshold, S905: determination of whether the power is above the threshold, S906: …, S907: adjustment of the nonlinear suppression coefficient, S908: nonlinear suppression.
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2008141718
PROBLEM TO BE SOLVED: Conventional adaptation methods for acoustic echo cancellers have low adaptation accuracy and cannot generate a filter that sufficiently suppresses the acoustic echo.
SOLUTION: The present invention uses a microphone array having a plurality of microphone elements, determines from the inter-microphone phase differences the bands in which only the speaker sound is present, and adapts the filter only in those bands, thereby providing means for generating a filter with high acoustic echo suppression performance. [Selected figure] Figure 2
Acoustic echo canceller system
[0001]
The present invention relates to acoustic echo cancellation technology for telephone conference systems or video conference systems having speakers and microphones.
[0002]
There are telephone and video conferencing systems in which each side has a speaker and a microphone, the sides are connected over a network, and participants can talk by voice with people at remote locations.
In such a system, the sound output from the speaker is picked up by the microphone. Acoustic echo canceller technology has therefore been used to remove the speaker output sound (the acoustic echo) mixed into the microphone signal. If the acoustic environment of the conference room were invariant, it would suffice to learn only once how sound is transmitted through the room (the impulse response) and then use that impulse response to remove the acoustic echo completely. In practice, however, when a conference participant moves seats, the acoustic path of the echo changes, the learned impulse response no longer matches the actual one, and the acoustic echo cannot be removed completely. In the worst case, the residual echo loops around and its volume gradually increases, causing a phenomenon called howling that makes conversation quite difficult.
[0003]
Therefore, methods have been proposed that learn the impulse response sequentially, following the variation of the acoustic path, so that the acoustic echo is always removed completely (for example, Non-Patent Document 1).
[0004]
In addition, a method of canceling the acoustic echo using a microphone array has been proposed (for example, Patent Document 1).
In the prior art, because the performance of the echo canceller is insufficient, when the near-end talker and the far-end talker speak at the same time, the voice of the quieter talker is completely shut out and a one-way call results; this is done to prevent howling. The problem is that conversation is difficult in such a one-way call.
[0005]
Patent Document 1: JP-A-2005-136701. Non-Patent Document 1: Peter Heitkamper, "An Adaptation Control for Acoustic Echo Cancellers," IEEE Signal Processing Letters, vol. 4, no. 6, June 1997. Non-Patent Document 2: R. O. Schmidt, "Multiple Emitter Location and Signal Parameter Estimation," IEEE Trans. Antennas and Propagation, vol. 34, no. 3, pp. 276-280, 1986. Non-Patent Document 3: Togami Mahito, Amano Akio, Shinjohiro, Kushida Yuta, Tamamoto Shinichi, Karakawa Nori, "Hearing Function of Human Symbiosis Robot 'EMIEW'," 22nd AI Challenge Study Group, pp. 59-64, 2005.
[0006]
In the conventional method of sequentially learning the impulse response to follow changes in the acoustic path, learning is possible while sound is emitted only from the speaker. But when a talker in the conference room speaks while the speaker is emitting sound, learning becomes impossible; in the worst case, learning of the impulse response fails and the acoustic echo cannot be removed at all. It is therefore necessary to determine whether sound is coming only from the speaker or a talker in the conference room is also speaking (double-talk detection).
[0007]
In the present invention, a situation in which the sound from the speaker is dominant is detected in order to control the adaptation of the echo canceller. As one configuration for this, the arrival direction of the sound source can be estimated using a microphone array having a plurality of microphone elements. In a more preferable aspect, the phase differences of the sounds input to the plurality of microphone elements are detected to determine the situation in which the sound from the speaker is dominant; the determination can be performed by comparison with a threshold stored in advance. In a preferred embodiment, the acoustic echo canceller adaptation unit extracts only the band division signals whose direction of arrival is the direction of the speaker and performs adaptation of the acoustic echo canceller with those band division signals.
[0008]
The acoustic echo canceller cancels the echo by artificially generating the sound from the speaker (a pseudo echo) and subtracting it from the input speech.
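As an illustrative sketch, not part of the original disclosure, this subtraction can be written as follows; the impulse response, signal lengths, and the assumption that the filter already equals the room response are all hypothetical:

```python
import numpy as np

def cancel_echo(mic, ref, w):
    """Subtract the pseudo echo (reference convolved with the filter w) from the mic signal."""
    pseudo_echo = np.convolve(ref, w)[:len(mic)]  # replica of the speaker sound at the mic
    return mic - pseudo_echo

rng = np.random.default_rng(0)
ref = rng.standard_normal(1000)      # far-end signal driving the speaker
h = np.array([0.5, 0.3, 0.1])        # assumed room impulse response (illustrative)
mic = np.convolve(ref, h)[:1000]     # here the microphone observes only the echo
out = cancel_echo(mic, ref, h)       # with w equal to h, the echo cancels exactly
```

In practice the filter w is not known and must be learned adaptively, which is the subject of the embodiments described below.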
[0009]
As a typical example of the configuration of the present invention, a conference system has: a microphone for inputting voice; an AD converter for converting the signal from the microphone into a digital signal; an information processing device for processing the digital signal from the AD converter to suppress the acoustic echo component; an output interface for sending the signal from the information processing device to the network; an input interface for receiving a signal from the network; a DA converter for converting the signal from the input interface to analog; and a speaker for outputting the signal from the DA converter as voice. The information processing device controls its own optimization timing based on the state of the voice input to the microphone.
The AD converter and the DA converter may be integrated.
[0010]
Preferably, the information processing apparatus is optimized at timings when the sound input to the microphone comes mainly from the direction of the speaker. The determination can be made, for example, by setting an appropriate threshold.
[0011]
More preferably, the information processing apparatus has an adaptive filter, an acoustic echo canceller adaptation unit that optimizes the adaptive filter, and an acoustic echo cancellation unit that uses the adaptive filter to suppress the acoustic echo component, that is, the speaker sound mixed into the digital signal.
[0012]
More preferably, the microphone is a microphone array having a plurality of microphone elements, and the AD converter is a plurality of AD converters that digitize the signal of each microphone element. The information processing apparatus has a phase difference calculation unit that calculates, from the signals of the plurality of AD converters, the phase differences between the voices input to the plurality of microphone elements, and a frequency distribution unit that determines, from the phase differences output by the phase difference calculation unit, whether the voice input to the microphone array is voice output from the speaker.
[0013]
More preferably, the information processing apparatus includes a band division unit that divides the digital signal into bands. The band division unit divides the digitized signal of each microphone element into bands, and the phase difference calculation unit calculates, for each band, the phase difference between the voices input to the plurality of microphone elements. From the per-band phase differences output by the phase difference calculation unit, the frequency distribution unit determines whether each band division signal is a speaker output signal or a talker signal. Only for the bands that the frequency distribution unit determined to be speaker output signals, the acoustic echo canceller adaptation unit adapts the adaptive filter used to suppress the speaker voice mixed into the microphone element signals, and the acoustic echo canceller unit uses the adaptive filter to remove the echo component.
In the band division unit, for example, the frequency range from 20 Hz to 16 kHz is divided every 20 Hz.
By performing control for each frequency band in this way, highly accurate echo cancellation can be performed.
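As a rough sketch of such band division, a short-time Fourier transform (one realization mentioned later in paragraph [0024]) can be used; the frame length, hop, and test tone below are illustrative, not values from the patent:

```python
import numpy as np

def band_divide(x, frame_len=512, hop=256):
    """Split a signal into band-divided data x(f, tau) with a short-time Fourier transform."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frames.append(np.fft.rfft(window * x[start:start + frame_len]))
    return np.array(frames)  # rows: frames tau, columns: frequency bands f

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)           # a 1 kHz tone
X = band_divide(x)
peak_band = int(np.argmax(np.abs(X[0])))   # the tone falls exactly in band 32
```

Here the band spacing is fs / frame_len = 31.25 Hz rather than the 20 Hz of the example in the text; a longer frame gives narrower bands.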
[0014]
More preferably, in order to determine whether a signal is a speaker output signal, the transfer function of the sound transmitted from the speaker to the microphone array is measured in advance, the phase difference for each band of the microphone array when sound is output from the speaker is calculated from the measured transfer function, and the per-band phase differences are stored in the external storage device. The frequency distribution unit then determines that a band division signal is a speaker output signal if the difference between the stored phase difference and the inter-microphone-element phase difference of that band division signal is equal to or less than a predetermined threshold.
[0015]
More preferably, the system has a user interface in which the user specifies in advance the number of speakers and their physical positions relative to the microphone array, and an echo phase difference calculation processing unit that calculates, from the number and positions specified in the user interface, the phase difference for each band of the microphone array when sound is output from the speakers, and stores the per-band phase differences in the external storage device. If the difference between the stored phase difference and the inter-microphone-element phase difference of a band division signal is equal to or less than a predetermined threshold, that band division signal is determined to be a speaker output signal.
[0016]
More preferably, a sound source localization unit is provided that calculates a histogram of the sound source direction across the bands using the inter-microphone phase differences and estimates the sound source direction from the histogram. The magnitude of the signal estimated to come from the speaker direction is calculated, and if the magnitude of a band division signal determined to be a speaker output signal, or the magnitude of the speaker output component remaining in the band division signal after the acoustic echo canceller, is determined to be equal to or less than a predetermined magnitude, that band division signal is reduced in size or set entirely to 0.
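Such histogram-based localization can be sketched as follows, assuming a two-element far-field model; the microphone spacing, histogram bin width, and test direction are illustrative, not values from the patent:

```python
import numpy as np

C = 343.0   # speed of sound in m/s
D = 0.05    # assumed spacing of the two microphone elements in m

def localize(phase_diff, freqs, n_bins=36):
    """Vote each band's inter-microphone phase difference into a direction
    histogram and return the center of the peak bin (degrees)."""
    votes = []
    for phi, f in zip(phase_diff, freqs):
        s = phi * C / (2 * np.pi * f * D)    # implied sin(theta) for this band
        if -1.0 <= s <= 1.0:                 # discard physically impossible values
            votes.append(np.degrees(np.arcsin(s)))
    hist, edges = np.histogram(votes, bins=n_bins, range=(-90.0, 90.0))
    k = int(hist.argmax())
    return 0.5 * (edges[k] + edges[k + 1])

# A far-field source at 28 degrees yields phase 2*pi*f*D*sin(28 deg)/C in every band.
freqs = np.linspace(100.0, 4000.0, 50)
phase = 2 * np.pi * freqs * D * np.sin(np.radians(28.0)) / C
est = localize(phase, freqs)
```

The estimate is only as fine as the histogram bin width (5 degrees here); real signals would spread votes across neighboring bins.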
[0017]
The echo canceller can be controlled dynamically according to the conditions of the conference
room.
[0018]
Hereinafter, specific embodiments of the present invention will be described with reference to
the drawings.
The present invention applies, for example, to a teleconferencing system over an IP network, in which two (or more) sites connected by the network communicate using teleconferencing equipment consisting of a microphone array, a speaker, and so on, achieving conversation between talkers at the sites.
Hereinafter the two sides are referred to as the near end and the far end.
[0019]
FIG. 1 shows the hardware configuration of the present invention, deployed at both the near-end and far-end sites.
It consists of: a microphone array 1 comprising at least two microphone elements; an A/D-D/A conversion unit 2 that converts the analog sound pressure values input from the microphone array into digital data and converts digital data into analog data; a central processing unit 3 that processes the output of the conversion unit 2; a memory 4, for example volatile; a hub 5 connected to the network for transmitting and receiving data to and from the far end; a speaker 6 that converts the D/A-converted analog data into sound pressure; and an external storage medium 7, for example non-volatile.
[0020]
The sound pressure values of multiple channels recorded by the microphone array 1 are sent to
the AD / DA converter 2 and converted into digital data of multiple channels.
The converted digital data is stored in the memory 4 via the CPU 3.
[0021]
The far-end voice sent through the hub 5 is passed via the CPU 3 to the AD/DA converter 2 and output from the speaker 6. The far-end voice output from the speaker 6 is picked up by the microphone array 1 together with the near-end talker's voice; the far-end sound output from the speaker is therefore mixed into the digital sound pressure data stored in the memory 4 as well. The CPU 3 performs echo cancellation processing to suppress the mixed far-end voice in the digital sound pressure data stored in the memory 4, and transmits only the near-end talker's voice to the far end via the hub 5. The echo cancellation processing suppresses the far-end voice using data, stored in advance in the external storage medium 7, on how sound is transmitted from the speaker to the microphone array, together with information such as the number and positions of the speakers.
[0022]
FIG. 2 shows the software configuration of the present invention. The CPU 3 executes digitally all components other than the A/D conversion unit 8, which converts analog data into digital data; this is the most convenient arrangement if the CPU processing capacity is sufficient. Alternatively, an equivalent hardware configuration may be used, with processing performed digitally or in analog.
[0023]
The main functional blocks are: a band division unit 9 that converts the digital data into band-divided data; a phase difference calculation unit 10 that calculates the phase differences between the microphone channels of the band division signals; a frequency distribution unit 11 that determines, for each band of the band division signal, from the phase difference, whether the acoustic echo or the talker's voice is dominant; an acoustic echo canceller adaptation unit 12 that adapts the adaptive filter for acoustic echo cancellation; a pseudo echo generation unit 13 that artificially generates, from the speaker reference signal, the acoustic echo transmitted to the microphone array; and an acoustic echo cancellation unit 14 that suppresses the acoustic echo from the input signal using the adaptive filter. The analog sound pressure data of the multiple channels recorded by the microphone array 1 are converted by the A/D conversion unit 8 into multi-channel digital data x(t). The converted multi-channel digital data are sent to the band division unit 9 and converted into multi-channel band-divided data x(f, τ). Since the microphone array includes a plurality of microphone elements, as many A/D conversion units 8 and band division units 9 as there are microphone elements may be arranged in parallel.
[0024]
For band division, a short-time Fourier transform, a wavelet transform, a band-pass filter, or the like is used. In the band division unit, for example, the frequency range from 20 Hz to 20 kHz is divided every 20 Hz. τ is the frame index of the short-time frequency analysis. The band-divided data are sent to the phase difference calculation unit 10, which calculates the phase difference for each microphone channel pair by [Equation 1].
[0025]
xi(f, τ) is the f-th band division data of channel i, and xj(f, τ) is that of channel j. φi,j(f, τ) is the phase difference of the f-th band between channel i and channel j. The calculated per-channel-pair phase differences are sent to the frequency distribution unit 11. The frequency distribution unit 11 calculates ei,j(f, τ), defined by [Equation 2], from the preset phase difference Spi,j(f) of the echo component from the speaker to the microphone array and the phase difference for each microphone channel pair. If the sum of ei,j(f, τ) over the indices i, j is equal to or less than a preset threshold, the f-th band is determined to be a band in which the echo is dominant; if the sum is greater than the threshold, it is determined to be near-end voice.
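Since the image of [Equation 2] is not reproduced in this text, the sketch below takes ei,j(f, τ) to be the wrapped deviation between the observed and stored phase differences, one plausible reading; the phase values and threshold are illustrative:

```python
import numpy as np

def is_echo_band(phase_obs, phase_db, threshold=0.3):
    """Judge a band echo-dominant when the observed inter-microphone phase
    differences stay close to the stored echo phase differences Sp."""
    diff = np.asarray(phase_obs) - np.asarray(phase_db)
    dev = np.angle(np.exp(1j * diff))          # wrap each deviation into (-pi, pi]
    return bool(np.sum(np.abs(dev)) <= threshold)

sp_db = [0.50, -1.20, 2.00]        # stored Sp for three mic pairs (i, j)
echo_like = [0.52, -1.18, 1.97]    # observed: close to the DB, echo dominant
talker = [1.40, -0.10, 0.60]       # observed: far from the DB, near-end voice
```

Wrapping before summing avoids spurious large deviations when a phase crosses the ±π boundary.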
[0026]
The frequency components determined to be echo are sent to the acoustic echo canceller adaptation unit 12, which stores the setting conditions of the adaptive filter for each divided frequency band. For the frequency bands that the frequency distribution unit 11 determined to be echo, the acoustic echo canceller adaptation unit 12 adapts the adaptive filter hi,τ(f, T) by [Equation 3], using the pseudo echo component Echoi(f, τ) output from the pseudo echo generation unit 13.
[0027]
Echoi(f, τ) is the pseudo echo component of the i-th channel microphone. hi,τ(f, T) is the T-th tap of the f-th band adaptive filter of the i-th channel microphone, adapted on signals up to frame τ-1, and L is the tap length of the adaptive filter. The adaptation may be performed for each frequency in this way; alternatively, when the number of bands determined to be in the speaker direction at time τ is equal to or greater than a predetermined threshold, all frequency components of that time τ may be adapted. Sound source localization for each frequency may also be performed based on the MUSIC method (see Non-Patent Document 2) or the modified delay-and-sum array method (see Non-Patent Document 3). The pseudo echo generation unit 13 generates the pseudo echo component ê(f, τ) defined by [Equation 4].
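[Equation 3] is likewise not reproduced in this text, so the sketch below substitutes a standard per-band NLMS update with a single-tap filter per band; the step size and the assumed room response are illustrative:

```python
import numpy as np

def nlms_band(x_f, d_f, h_f, mu=0.5, eps=1e-9):
    """One per-band NLMS step: the pseudo echo is h*d, and the filter h moves
    so that the residual (x - h*d) goes to zero."""
    err = x_f - h_f * d_f
    h_f = h_f + mu * np.conj(d_f) * err / (abs(d_f) ** 2 + eps)
    return h_f, err

rng = np.random.default_rng(1)
true_h = 0.8 - 0.3j          # assumed room response in this one band (illustrative)
h = 0j
for _ in range(200):
    d = rng.standard_normal() + 1j * rng.standard_normal()  # speaker band signal
    x = true_h * d                                          # echo-only observation
    h, err = nlms_band(x, d, h)
```

Because the adaptation runs only on frames judged echo-dominant, the echo-only assumption in the loop mirrors the condition under which the patent applies [Equation 3].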
[0028]
d(f, τ) is the band division signal of the original signal output to the speaker. The pseudo echo generation unit 13 furthermore updates the echo phase difference DB from the pseudo echo by [Equation 5].
[0029]
The inter-microphone phase differences of the adaptive filter are stored in the echo phase difference DB. The acoustic echo cancellation unit 14 uses the adaptive filter adapted by the acoustic echo canceller adaptation unit 12 to generate and output the audio digital data x̂i(f, τ) after acoustic echo suppression by [Equation 6].
[0030]
As described above, in the present embodiment, using a microphone array having a plurality of microphone elements, the bands in which the speaker sound is dominant are determined from the inter-microphone phase differences, and adaptive control is performed only for those bands, so a filter with high echo suppression performance can be generated. In addition, since double-talk detection is possible, the acoustic echo canceller can be adapted only while sound is produced from the speaker alone. Fluctuations of the acoustic path can therefore always be followed, and because the learning of the impulse response is temporarily stopped while a talker in the room speaks, learning of the impulse response rarely fails.
[0031]
FIG. 3 is a block diagram of a system that sets the initial values of the echo phase difference DB using a GUI for specifying the number and positions of the speakers. It consists of a speaker number and position setting GUI 15 for specifying the number of speakers used in the video conference and their physical positions, an echo phase difference calculation processing unit 16 that calculates the phase difference of the acoustic echo from the set number and positions of the speakers, and a database; these functional blocks are realized by the CPU and storage means.
[0032]
In the speaker number and position setting GUI 15, the number of speakers and their positions relative to the microphone array 1 are set; at minimum, the speaker direction relative to the microphone array 1 must be set. The number and position information set in the speaker number and position setting GUI is sent to the echo phase difference calculation processing unit 16, which estimates the inter-microphone phase difference Spi,j(f) of the acoustic echo between the i channel and the j channel from the speaker positions under the FarField assumption. The estimated echo phase differences are stored in the echo phase difference DB.
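Under the FarField assumption, the expected phase difference for a microphone pair spaced d metres apart, with the speaker at angle θ from broadside, is 2πf·d·sin(θ)/c; a sketch with illustrative geometry (spacing, direction, and band list are assumptions, not patent values):

```python
import math

C = 343.0  # speed of sound in m/s

def echo_phase_diff(freq_hz, mic_spacing_m, speaker_deg):
    """Far-field estimate of Sp_ij(f): the phase lead of one microphone over
    the other for a speaker at the given direction."""
    delay = mic_spacing_m * math.sin(math.radians(speaker_deg)) / C  # extra path time
    return 2 * math.pi * freq_hz * delay

# Fill a small echo phase difference DB for one mic pair, speaker at 45 degrees.
sp_db = {f: echo_phase_diff(f, 0.05, 45.0) for f in (250, 500, 1000)}
```

The phase difference grows linearly with frequency, which is why the DB stores one value per band.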
[0033]
FIG. 4 shows a configuration that, by switching between the echo canceller and a voice switch using the adaptive algorithm and the sound source direction of the present invention, realizes a video conference system that does not cause howling even in a large conference room exceeding the performance limit of the echo canceller. In addition to the acoustic echo cancellation unit 14, it has a sound source localization unit 17 that estimates the power of the talker's voice, a VoiceSwitch determination unit 18 that determines whether to use the VoiceSwitch from the magnitude of the acoustic echo and the power of the talker's voice, and an output signal generation unit 19 that outputs the signal after acoustic echo suppression.
[0034]
The A/D conversion unit, band division unit, phase difference calculation unit, frequency distribution unit, acoustic echo canceller adaptation unit, pseudo echo generation unit, and acoustic echo cancellation unit perform the same processing as in FIG. 2. The sound source localization unit 17 calculates the histogram of the phase differences of the frequency components that the frequency distribution unit 11 did not regard as echo components, and identifies the sound source directions from the peaks of the calculated phase difference histogram. The number of sound source directions to identify is either fixed in advance, or every histogram peak whose frequency count is at least a certain threshold is regarded as a sound source direction. The sum of the powers over all identified sound source directions is defined as the near-end talker power, which the sound source localization unit 17 outputs. The VoiceSwitch determination unit 18 calculates, as the acoustic echo power, the sum of the powers after acoustic echo cancellation over the frequencies that the frequency distribution unit determined to be echo-dominant. If the ratio of the calculated acoustic echo power to the near-end talker power is equal to or greater than a predetermined threshold, the frame is judged to consist mainly of acoustic echo with no near-end talker, and the VoiceSwitch is used; if the ratio is below the threshold, a talker is judged to be present in the frame, and the VoiceSwitch is not used. When the VoiceSwitch is used, the output signal generation unit 19 generates and outputs a signal in which all values are set to 0; when it is not used, the signal after acoustic echo cancellation output from the acoustic echo canceller unit is output. Thus, when a large residual echo remains in the signal after acoustic echo cancellation, the VoiceSwitch determination unit 18 decides to use the VoiceSwitch, and a signal containing the residual echo is not transmitted.
If a signal containing residual echo were transmitted, the system loop would close and the residual echo could feed back. It is therefore desirable to use the VoiceSwitch to keep the residual echo from looping; however, always using the VoiceSwitch prevents the near-end and far-end talkers from speaking simultaneously. In the VoiceSwitch determination unit 18 of the present invention, the VoiceSwitch is used only for frames in which residual echo is present, so the near-end and far-end talkers can speak at the same time when no residual echo occurs, and when residual echo does occur, switching to the VoiceSwitch dramatically reduces the possibility of howling. In this embodiment, the acoustic echo power used in the determination of the VoiceSwitch determination unit 18 is obtained from the signal after acoustic echo cancellation, but it may also be calculated from the power of the signal before acoustic echo cancellation.
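The determination rule of the VoiceSwitch determination unit can be sketched as follows; the ratio threshold is illustrative:

```python
def use_voice_switch(echo_power, near_end_power, ratio_threshold=10.0, eps=1e-12):
    """VoiceSwitch on: residual echo power dominates the near-end talker power."""
    return echo_power / (near_end_power + eps) >= ratio_threshold

def output_frame(frame, echo_power, near_end_power):
    """Zero the frame while the VoiceSwitch is on, else pass the cancelled frame."""
    if use_voice_switch(echo_power, near_end_power):
        return [0.0] * len(frame)
    return frame
```

Because muting happens only on echo-dominant frames, double talk passes through unmuted, which is the point of this hybrid scheme.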
[0035]
The above determination compares the residual echo over all frequencies with the near-end talker power over all frequencies, but the residual echo and the near-end talker power may instead be compared for each subband consisting of several frequency bins, with the use of the VoiceSwitch decided per subband. In that case, for subbands determined to use the VoiceSwitch, the output signal generation unit 19 outputs values replaced with 0; for subbands determined not to use the VoiceSwitch, it outputs the signal after acoustic echo cancellation.
[0036]
FIG. 5 shows an overall view of a video conference system to which the present invention is applied. This system is characterized in that the adaptation of the acoustic echo canceller is controlled on the computer 101 using the phase differences calculated by the phase difference calculation unit 10 and the sound source direction information.
[0037]
FIG. 5 shows the system configuration of one site. The video conference system 100 performs acoustic signal processing, image processing, and communication processing with the computer 101. An A/D-D/A device 102 is connected to the computer 101; the audio signal recorded by the microphone array 105 is converted into a digital audio signal by the A/D-D/A device 102 and sent to the computer 101. The microphone array 105 has a plurality of microphone elements.
[0038]
The computer 101 performs acoustic signal processing on the digital audio signal, and the
processed audio signal is sent via the hub 103 onto the network. Here, the computer 101
includes the CPU 3, the memory 4 and the external storage medium 7 shown in FIG.
[0039]
The external storage medium 7 may be inside or outside the computer 101. The CPU 3 in the computer 101 has either the band division unit 9, phase difference calculation unit 10, frequency distribution unit 11, acoustic echo canceller adaptation unit 12, pseudo echo generation unit 13, and acoustic echo canceller unit 14 shown in FIG. 2, or, as shown in FIG. 9 described later, an audio transmission unit 201, acoustic echo canceller adaptation unit 204, acoustic echo canceller unit 205, audio recording unit 203, audio reception unit 207, and audio reproduction unit 208; the acoustic echo canceller is realized by these.
[0040]
The image signal of the other site sent to the video conference system 100 via the hub 103 is
sent to the image display device 104 and displayed on the screen. The audio signal of the other
site sent through the hub 103 is output from the speaker 106.
[0041]
The sound received by the microphone array 105 includes the acoustic echo transmitted from the speaker 106 to the microphone array 105, which needs to be removed. The digital cables 110 and 113 are, for example, USB cables.
[0042]
FIG. 6 shows the configuration of prior-art acoustic echo suppression: a model of the acoustic echo for sound transmitted from the speaker to a microphone element, and an acoustic echo canceller using an adaptive filter.
[0043]
All signals are expressed by the z-transform.
The received signal d(z) is emitted from the speaker and arrives at the microphone element with the room impulse response H(z) convolved onto it. The impulse response H(z) includes the direct sound from the speaker to the microphone and the reflections (acoustic echoes) from the walls, floor, ceiling, and so on.
[0044]
In the microphone element, the talker's voice N(z) is mixed in along with the acoustic echo. If the microphone element signal X(z) were transmitted as it is, the transmitted voice would contain the acoustic echo, the signal would loop, and in the worst case howling would occur and communication would become impossible. Therefore, only the acoustic echo must be suppressed in the transmitted voice.
[0045]
The adaptive filter W(z) adaptively learns the room impulse response H(z); applying W(z) to the received signal generates the pseudo acoustic echo. The adaptation of W(z) is performed using, for example, the NLMS method, which updates the filter as W(z) ← W(z) + 2μ·d(z)*·N̂(z) / |d(z)|², where d(z) is the received (reference) signal, N̂(z) = X(z) − W(z)d(z) is the residual transmission signal, and μ is the step size.
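A time-domain sketch of this NLMS adaptation follows; the step size, tap length, and signals are illustrative, and the loop corresponds to the echo-only case N(z) = 0:

```python
import numpy as np

def nlms_step(w, x_ref, mic_sample, mu=0.5, eps=1e-9):
    """One NLMS update: w are the filter taps, x_ref the newest reference
    (speaker) samples, mic_sample the microphone sample at this instant."""
    err = mic_sample - np.dot(w, x_ref)              # transmission signal N_hat
    w = w + mu * err * x_ref / (np.dot(x_ref, x_ref) + eps)
    return w, err

rng = np.random.default_rng(2)
true_h = np.array([0.6, 0.2, -0.1])  # assumed room impulse response (illustrative)
ref = rng.standard_normal(4000)      # far-end reference signal d
w = np.zeros(3)
for n in range(3, len(ref)):
    x_ref = ref[n:n - 3:-1]          # last three reference samples, newest first
    mic = np.dot(true_h, x_ref)      # echo-only microphone signal (N(z) = 0)
    w, err = nlms_step(w, x_ref, mic)
```

With near-end speech present, the same update would drag w away from the room response, which is exactly the failure mode the following paragraphs address.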
[0046]
When W(z) = H(z), subtracting the pseudo acoustic echo from the microphone signal extracts only the talker's voice N(z).
[0047]
When N(z) = 0, the transmission voice becomes 0 exactly when W(z) = H(z).
That is, the NLMS update described above changes W(z) adaptively so that the transmission voice becomes zero.
[0048]
However, if N(z) is not 0, W(z) is still adapted so that the transmission voice becomes 0, which drives W(z) away from H(z), and the adaptation fails. Therefore, when N(z) is not 0, the filter must be controlled so that it does not adapt.
[0049]
FIG. 7 shows a system to which the present invention is applied, having a double-talk detector that controls the filter so that it does not adapt when N(z) is not zero. The double-talk detector determines whether N(z) is zero and allows the adaptive filter to adapt only when N(z) is close to zero.
[0050]
This system, to which the present invention is applied, is characterized in that the double-talk detection unit 702 performs its determination using information on the direction of arrival of the sound source, obtained by the phase difference calculation unit 701 from the sound received by the microphone array.
[0051]
If N(z) is not 0, updating the acoustic echo canceller fails to adapt and the filter risks diverging, so the double-talk detector is essential to avoid that risk.
[0052]
FIG. 8 shows the flow of the audio streams in a two-site video conference system and in a conference system with three or more sites.
The phase difference calculation unit may reside in the server or in the CPU of each site.
[0053]
In the case of two sites, the transmission signal after acoustic echo cancellation is sent from the video conference system at site A over the network to the video conference system at site B and reproduced at site B.
The voice of site B is likewise sent to site A and reproduced there.
[0054]
Also, in the case of three or more sites, data is collected once by the server or CPU, redistributed
to each site, and reproduced.
[0055]
FIG. 9 shows the block configuration of the video conference system when the present invention is applied.
The received voice transmitted over the network is received by the voice reception unit 207 and sent to the voice reproduction unit 208, which reproduces it through the speaker.
[0056]
The received voice is sent to the acoustic echo canceller unit 205. The voice recording unit 203
records voice signals of the microphone array. The recorded audio signal is sent to the acoustic
echo canceller unit 205.
[0057]
The acoustic echo canceller unit 205 generates a pseudo echo from the acoustic echo
cancellation filter stored in the acoustic echo cancellation filter DB 211 and the reception voice,
and subtracts the pseudo echo from the audio signal of the microphone array. As a result of
subtraction, the remaining error signal is sent to the acoustic echo canceller adaptation unit 204.
[0058]
The acoustic echo canceller adaptation unit 204 adapts the acoustic echo canceller to zero the
error signal. The adapted result is stored in the acoustic echo cancellation filter DB 211. The
error signal output from the acoustic echo cancellation unit 205 is sent to the voice transmission
unit 201.
[0059]
The voice transmission unit 201 transmits an error signal to another site. The image capturing
unit 210 captures an image with a camera. The photographed image is sent to the image
transmission unit 202 and transmitted to another site.
[0060]
An image reception unit 209 receives an image sent from another site. The received image is sent
to the image display unit 206. The image display unit 206 displays the sent image on the screen.
[0061]
FIG. 10 shows a processing flow of the video conference system. In the acoustic echo canceller adaptation processing S1, a learning signal is played from the speaker to adapt the acoustic echo canceller. The learning signal is preferably white noise, and its length is desirably several seconds to several tens of seconds or more. If the learning length is short, the acoustic echo canceller may not sufficiently learn the room impulse response; a learning signal length of several seconds to several tens of seconds or more allows the impulse response to be learned sufficiently.
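A learning signal of this kind might be generated as follows (a minimal sketch, not part of the specification; the 32 kHz rate and 10 s length are assumed for illustration, and amplitude scaling and playback are omitted):

```python
import numpy as np

fs, seconds = 32000, 10  # sampling rate and training length (several seconds or more)
rng = np.random.default_rng(0)
# Uniform white noise in [-1, 1): a flat spectrum, so every band is excited.
learning_signal = rng.uniform(-1.0, 1.0, size=fs * seconds).astype(np.float32)
print(learning_signal.shape)  # (320000,)
```

The flat spectrum of white noise is what lets every frequency band of the impulse response be observed during the adaptation.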
[0062]
After the end of learning, it is determined whether a connection request has been issued from
another location S2. If there is a connection request from another base, connection S4 is made
with the other base.
[0063]
If a connection request is not issued from another site, it is determined whether a connection
request is issued from the own site S3. The connection request from the own site is issued by the
user through the GUI.
[0064]
If there is a connection request from the own site, connection S4 with other sites is performed. If
a connection request is not issued from the own site, the connection is not made with another
site, and the process returns to the determination of S2 whether the connection request is issued
from another site.
[0065]
That is, the video conference system waits until a connection request is issued from either the
own site or another site.
[0066]
After connection S4 with another site, reproduction S6 from the speaker, image display S7, voice recording S8, echo cancellation S9, and voice transmission S10 are repeated until the connection is disconnected.
[0067]
In the reproduction S6 from the speaker, the reception voice sent from the other base is
reproduced.
[0068]
In the image display S7, an image sent from another site is displayed on the monitor.
[0069]
In voice recording S8, the voice of the microphone array at the own site is recorded.
[0070]
In echo cancellation S9, an acoustic echo component is suppressed from the sound of the
recorded microphone array.
[0071]
In audio transmission S10, the audio signal after the acoustic echo component suppression is
transmitted.
If it is determined in S11 that the connection has been disconnected, disconnection S13 from the other site is performed and the video conference system ends.
[0072]
If it is determined that the connection has not been disconnected, it is determined at S12 whether there is a disconnection request from the user of the own site through the GUI. If there is a disconnection request, disconnection S13 from the other site is performed and the video conference system ends.
[0073]
FIG. 11 illustrates sparsity, the basic concept of the double talk processing that is a main element of the present invention.
[0074]
In the present invention, the voice signal from the microphone array and the received signal used
for pseudo echo generation are all subjected to short time Fourier transform, wavelet transform
or subband processing, and converted into frequency domain signals.
It is desirable that the frame size of the short time Fourier transform be a number of points corresponding to about 50 ms.
[0075]
For example, at a sampling rate of 32 kHz, a frame size of 2048 points is desirable. Speech is stationary for about several tens of ms; with such a frame size, sparsity can be assumed to hold best in the frequency domain, and the adaptive processing of the acoustic echo canceller can operate with high accuracy.
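As an illustration of this frame-size choice (a sketch, not part of the specification; the helper name is invented here), the smallest power-of-two frame length covering about 50 ms can be computed as follows:

```python
def stft_frame_size(sample_rate_hz, target_ms=50.0):
    """Smallest power-of-two frame length covering about target_ms of audio."""
    target_samples = sample_rate_hz * target_ms / 1000.0  # 1600 samples at 32 kHz
    size = 1
    while size < target_samples:
        size *= 2  # powers of two keep the FFT efficient
    return size

print(stft_frame_size(32000))  # 2048, the frame size recommended above
```

At 32 kHz this yields the 2048-point frame mentioned in the text; at 16 kHz it would give 1024 points.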
[0076]
In addition, it is desirable to perform short-time Fourier transformation after applying a
Hamming window, a Hanning window, a Blackman window, or the like.
The short time Fourier transform assumes that the signal repeats with a period equal to the analysis length. If a window function is not applied, the values at the two ends of the frame differ, and frequencies that do not actually exist are observed after the short time Fourier transform. Applying a window function in this manner makes such non-existent frequency components difficult to observe and improves the frequency analysis accuracy.
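The leakage effect described above can be checked with a small sketch (the 2048-point frame and the test tone placed between FFT bins are assumptions for illustration, not taken from the specification):

```python
import numpy as np

N = 2048
n = np.arange(N)
# A tone between FFT bins: the implicit periodic extension of the frame
# is then discontinuous at the frame edges, as described above.
x = np.sin(2 * np.pi * 100.5 * n / N)

spec_rect = np.abs(np.fft.rfft(x))                  # no window (rectangular)
spec_hann = np.abs(np.fft.rfft(x * np.hanning(N)))  # Hanning window applied first

# Energy far from the true frequency: "non-existent" frequency components.
far = slice(400, 1000)
print(spec_rect[far].sum() > 10 * spec_hann[far].sum())  # True
```

Without a window, leakage decays slowly away from the tone; with a Hanning window the far-off bins are orders of magnitude smaller.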
[0077]
The frame shift is preferably about 1/4 or 1/8 of the frame size; the finer the frame shift, the better the sound quality of the output voice. However, a finer frame shift increases the processing amount, so the frame shift should be made as fine as possible within the range that can be processed in real time at the processing speed of the installed computer.
[0078]
FIG. 11 shows a grid in which the horizontal axis is time (frame number) and the vertical axis is
frequency (frequency bin number).
[0079]
In the double-talk processing of the present invention, it is determined for each time-frequency component whether the component is an acoustic echo component or a non-acoustic echo component, and the adaptation processing of the acoustic echo canceller is performed only for the time-frequency components determined to be acoustic echo components.
[0080]
Speech is a sparse signal when viewed in the time-frequency domain, and it is known that multiple voices rarely overlap on the same time-frequency point.
[0081]
When both the received signal and the transmission signal are voice signals, as in a teleconferencing system, classifying each time-frequency component as an acoustic echo component or a non-acoustic echo component on the basis of sparsity makes it possible to extract only the acoustic echo component with high accuracy.
[0082]
FIG. 12 shows the basic configuration of an acoustic echo canceller to which the present
invention is applied.
[0083]
First, a band division unit performs frequency decomposition S101 on an audio signal input to
the computer 101 from a microphone array having a plurality of microphone elements, and
converts recorded audio into a signal in the frequency domain.
[0084]
Next, the phase difference calculation unit calculates the inter-element phase difference of the
recorded voice.
[0085]
Next, the frequency distribution unit determines, from the phase difference of each band output by the phase difference calculation unit, which sound each band-divided signal belongs to. That is, it determines whether the signal is a loudspeaker output signal or a talker's voice signal.
[0086]
Then, the acoustic echo canceller unit removes the echo contained in the band-divided signal S102. In S102, W(z) is multiplied by the reference signal d(z) to generate a pseudo echo W(z)d(z). The acoustic echo can be eliminated from the microphone input signal by subtracting W(z)d(z) from the microphone input signal x(z).
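The subtraction in S102 can be sketched per frequency bin as follows (an illustrative toy with synthetic spectra and an already matched filter; the variable names are invented, not code from the specification):

```python
import numpy as np

rng = np.random.default_rng(0)
bins = 8

d = rng.normal(size=bins) + 1j * rng.normal(size=bins)       # reference signal d(z)
w_true = rng.normal(size=bins) + 1j * rng.normal(size=bins)  # true echo path per bin
x = w_true * d                    # microphone input x(z): echo only, no near-end talk

w = w_true.copy()                 # assume W(z) has already adapted to the path
pseudo_echo = w * d               # W(z) d(z)
error = x - pseudo_echo           # x(z) - W(z) d(z)

print(np.allclose(error, 0))      # True: the echo is fully cancelled
```

When W(z) matches the acoustic path exactly, the residual is zero; any mismatch leaves residual echo for the adaptation unit to reduce.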
[0088]
The acoustic echo canceller adaptation unit performs double-talk determination on the band-divided signals: when many of the components are loudspeaker output signals, it determines that the state is not double talk and performs the adaptive processing S104 of the acoustic echo canceller.
[0089]
The adaptive processing of the acoustic echo canceller updates the filter W(z) of the acoustic echo canceller by the NLMS method or the like. In the NLMS method, the filter is updated as W(z) = W(z) + 2μX(z)N′(z)* / |X(z)|².
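A per-bin sketch of such a normalized update is given below. Since conjugation conventions for the formula above vary, the sketch follows the standard NLMS form (error times conjugated reference); the signals are synthetic and the step size is invented:

```python
import numpy as np

rng = np.random.default_rng(1)
bins, frames, mu = 4, 60, 0.25

w_true = rng.normal(size=bins) + 1j * rng.normal(size=bins)  # unknown echo path
w = np.zeros(bins, dtype=complex)                            # adaptive filter W(z)

for _ in range(frames):
    x = rng.normal(size=bins) + 1j * rng.normal(size=bins)   # reference spectrum X(z)
    err = w_true * x - w * x                                 # residual N'(z) after cancellation
    # Normalized LMS step per frequency bin: W <- W + 2*mu*err*conj(X)/|X|^2
    w += 2 * mu * err * np.conj(x) / (np.abs(x) ** 2 + 1e-12)

print(np.allclose(w, w_true))  # True: the filter converges to the echo path
```

The normalization by |X(z)|² makes the effective step size independent of the input power, which is what distinguishes NLMS from plain LMS.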
[0090]
In the double-talk state, the echo canceller adaptation S104 is not performed, and the acoustic echo canceller processing ends.
[0091]
FIG. 13 shows a processing flow of processing in which only time-frequency components coming
from the direction of the speaker are regarded as acoustic echoes using sound source direction
information.
[0092]
In frequency decomposition S201, the recorded voice is converted into a time-frequency domain
signal.
[0093]
In sound source localization S202, sound source direction estimation is performed based on the
modified delay-and-sum array method.
The modified delay-and-sum array method used for sound source direction estimation in the present invention determines, for each time-frequency, from which direction the component arrives.
[0094]
As shown in FIG. 11, since speech is a sparse signal when viewed per time-frequency, it can be assumed that each time-frequency component has either an acoustic echo or a non-acoustic-echo signal as its main component. Therefore, if the time-frequency components arriving from the direction of the speaker are selected from the sound source localization results estimated for each time-frequency, those components can be regarded as having the acoustic echo as their main component.
[0095]
In the modified delay-and-sum array method, sound source localization is performed using a steering vector Aθ(f) for the sound source direction θ. When the number of microphones is M, Aθ(f) is an M-dimensional complex vector.
[0096]
Here, f represents a frequency bin number. The time-frequency representation of the input signal is denoted X(f, τ), where τ is the frame number of the short time Fourier transform. X(f, τ) is an M-dimensional complex vector whose elements are the frame-τ, frequency-f components of the respective microphone elements.
[0097]
In the modified delay-and-sum array method, the virtual sound source direction θ at which |Aθ(f)*X(f, τ)| is maximum is estimated to be the sound source direction of frame τ and frequency f.
[0098]
In the speaker direction identification step S203, the number of frequencies f whose sound source direction is estimated to be θ, or log|Aθ(f)*X(f, τ)|, is accumulated for each virtual sound source direction θ to create a histogram. Then, the peak of the histogram within a predetermined range of the speaker direction (for example, -30° to 30°) is found, and that direction is set as the speaker direction θsp.
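Steps S202 to S203 might be sketched as follows for a far-field uniform linear array (the array geometry, angle grid, and frequency grid are invented for illustration and are not from the specification):

```python
import numpy as np

c, M, d = 343.0, 4, 0.05            # sound speed [m/s], 4 mics, 5 cm spacing (assumed)
angles = np.arange(-90, 91, 5)      # candidate virtual source directions [deg]
freqs = np.arange(200, 3000, 100)   # frequencies to accumulate over [Hz]

def steering(theta_deg, f):
    """Far-field steering vector A_theta(f) for a uniform linear array."""
    delays = np.arange(M) * d * np.sin(np.radians(theta_deg)) / c
    return np.exp(-2j * np.pi * f * delays)

true_dir = 20.0                     # direction of the loudspeaker (ground truth)
hist = np.zeros(len(angles))
for f in freqs:
    x = steering(true_dir, f)       # observed X(f, tau): a clean plane wave
    scores = [np.abs(np.conj(steering(a, f)) @ x) for a in angles]
    hist[int(np.argmax(scores))] += 1   # per-frequency direction estimate

theta_sp = int(angles[int(np.argmax(hist))])  # histogram peak = speaker direction
print(theta_sp)  # 20
```

Each frequency votes for the direction that maximizes |Aθ(f)*X(f, τ)|, and the histogram peak recovers the loudspeaker direction θsp.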
[0099]
In the determination S204 of whether the sound source direction is the speaker direction, if the estimated sound source direction θ̂ at frequency f satisfies |θ̂ − θsp| < Φ for a predetermined threshold Φ, the frequency f is regarded as a frequency component that arrived from the speaker direction. It is then regarded as an acoustic echo component, and echo canceller adaptation S205 is performed.
[0100]
After S204 and S205 are performed on all frequencies, the adaptation processing of the acoustic
echo canceller is ended.
[0101]
Next, FIG. 14 shows an acoustic echo canceller adaptive process using information, similar to the sound source direction, that can be calculated from the pseudo echo.
[0102]
Frequency decomposition S301 is performed to convert the recorded voice into a signal in the
frequency domain.
[0103]
An acoustic echo filter is applied to the converted signal in the frequency domain to calculate a
pseudo echo S302.
[0104]
The similarity between the pseudo echo calculated for each frequency f and the input signal is
calculated S303.
In the similarity calculation process, the pseudo echo E(f, τ) is used. E(f, τ) is an M-dimensional complex vector whose elements are the pseudo echo components of frame τ and frequency f for the respective microphone elements. The 0-th element of E(f, τ) is written E0(f, τ). Define E′(f, τ) = E(f, τ) / E0(f, τ). Further, let E″(f, τ) = E′(f, τ) / |E′(f, τ)|. The similarity is then |E″(f, τ)*X(f, τ)| / |X(f, τ)|.
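The definitions above can be sketched as follows (an illustrative toy with synthetic vectors; the variable names are invented and this is not code from the specification):

```python
import numpy as np

rng = np.random.default_rng(2)
M = 4  # number of microphone elements

def similarity(E, X):
    """|E''(f,tau)* X(f,tau)| / |X(f,tau)| with E' = E/E0 and E'' = E'/|E'|."""
    E1 = E / E[0]                   # E'(f,tau): normalize by the 0-th element
    E2 = E1 / np.linalg.norm(E1)    # E''(f,tau): unit-norm direction vector
    return np.abs(np.conj(E2) @ X) / np.linalg.norm(X)

E = rng.normal(size=M) + 1j * rng.normal(size=M)      # pseudo echo vector of one bin
other = rng.normal(size=M) + 1j * rng.normal(size=M)  # unrelated near-end component

print(round(float(similarity(E, 3.0 * E)), 6))  # 1.0: input contains only the echo
print(bool(similarity(E, other) < 1.0))         # True: direction mismatch lowers it
```

When the input is a scaled copy of the pseudo echo the similarity is exactly 1; by the Cauchy-Schwarz inequality any directional mismatch pushes it below 1.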
[0105]
This similarity expresses how close the sound source direction of the input signal is to that of the acoustic echo component; when the input signal contains only the acoustic echo component, the similarity is 1. A value obtained by multiplying the similarity by a frequency-dependent coefficient α(f) is taken as the final similarity, where α(f) = 1 / Σθ |E″(f, τ)*Aθ(f)|.
[0106]
When the similarity exceeds the predetermined threshold th (S304), the adaptation S305 of the
echo canceller is performed, and when the similarity falls below the threshold th, the adaptation
S305 of the echo canceller is not performed.
[0107]
After S303 to S305 are performed for every frequency, the adaptation process of the acoustic
echo canceller is finished.
[0108]
FIG. 15 shows acoustic echo canceller adaptation processing using information similar to the
sound source direction that can be calculated from the pseudo echo.
In this process, whether to adapt the acoustic echo canceller is the same for all frequency
components in the frame.
[0109]
Frequency decomposition S401 is performed to convert the recorded voice into a frequency
domain signal.
An acoustic echo filter is applied to the converted signal in the frequency domain to calculate a
pseudo echo S402.
[0110]
The degree of similarity between the pseudo echo calculated for each frequency f and the input
signal is calculated S403.
[0111]
In the similarity calculation process, the pseudo echo E(f, τ) is used. E(f, τ) is an M-dimensional complex vector whose elements are the pseudo echo components of frame τ and frequency f for the respective microphone elements. The 0-th element of E(f, τ) is written E0(f, τ). Define E′(f, τ) = E(f, τ) / E0(f, τ). Further, let E″(f, τ) = E′(f, τ) / |E′(f, τ)|. The similarity is then |E″(f, τ)*X(f, τ)| / |X(f, τ)|.
[0112]
This similarity expresses how close the sound source direction of the input signal is to that of the acoustic echo component; when the input signal contains only the acoustic echo component, the similarity is 1. A value obtained by multiplying the similarity by a frequency-dependent coefficient α(f) is taken as the final similarity, where α(f) = 1 / Σθ |E″(f, τ)*Aθ(f)|. The similarity is calculated for every frequency. Then, the similarities of all frequencies are added S404.
[0113]
If the added similarity exceeds the predetermined threshold th (S405), the echo canceller adaptation S406 is performed for all frequency components; if it falls below the threshold, the adaptation is not performed and the acoustic echo canceller adaptation process ends. When adaptation is performed, the process ends after the echo canceller adaptation S406.
[0114]
FIG. 16 shows a processing flow of processing in which only time-frequency components coming
from the direction of the speaker are regarded as acoustic echoes using sound source direction
information.
[0115]
Whether to adapt the acoustic echo canceller in this process is the same for all frequency
components in the frame.
[0116]
In the frequency decomposition S501, the recorded voice is converted into a time-frequency
domain signal.
In sound source localization S502, sound source direction estimation is performed based on the
modified delay-and-sum array method.
[0117]
In the modified delay-and-sum array method, sound source localization is performed using a steering vector Aθ(f) for the sound source direction θ. When the number of microphones is M, Aθ(f) is an M-dimensional complex vector.
[0118]
Here, f represents a frequency bin number. The time-frequency representation of the input signal is denoted X(f, τ), where τ is the frame number of the short time Fourier transform. X(f, τ) is an M-dimensional complex vector whose elements are the frame-τ, frequency-f components of the respective microphone elements.
[0119]
In the modified delay-and-sum array method, the virtual sound source direction θ at which |Aθ(f)*X(f, τ)| is maximum is estimated to be the sound source direction of frame τ and frequency f.
[0120]
In the speaker direction identification step S503, the number of frequencies f whose sound source direction is estimated to be θ, or log|Aθ(f)*X(f, τ)|, is accumulated for each virtual sound source direction θ to create a histogram. Then, the peak of the histogram within a predetermined range of the speaker direction (for example, -30° to 30°) is found, and that direction is set as the speaker direction θsp.
[0121]
In the determination S504 of whether the sound source direction is the speaker direction, if the estimated sound source direction θ̂ at frequency f satisfies |θ̂ − θsp| < Φ for a predetermined threshold Φ, the frequency f is regarded as a frequency component that arrived from the speaker direction. Then, the number of frequencies f regarded as the speaker direction, or log|Aθ(f)*X(f, τ)|, is accumulated to obtain a power spectrum added in the frequency direction S505. Whether the power spectrum added over all frequencies is equal to or greater than a predetermined threshold is determined in S506. If it is equal to or greater than the threshold, echo canceller adaptation S507 is performed for all frequency components, and the process ends.
[0122]
FIG. 17 is a diagram showing the effect of the adaptive processing of the present acoustic echo canceller. It shows the result of suppressing the acoustic echo in an input signal containing only acoustic echo, using an echo cancellation filter adapted while a talker at the local site was speaking (double-talk voice).
[0123]
In this case, it is desirable that the entire input signal be suppressed to silence. The upper part shows the result when the adaptive control of the present invention is performed.
[0124]
The lower part shows the result when adaptive control is not performed. In the figure, the larger the power of a time-frequency component, the brighter it is drawn, and the smaller the power, the darker. The horizontal axis is time, and the vertical axis is frequency. When adaptive control is performed, the signal power after acoustic echo suppression is small, particularly at high frequencies, showing that the acoustic echo suppression performance is high.
[0125]
In the video conference system using the present invention, a configuration may be adopted in which the acoustic echo canceller is adapted by playing a white noise signal or the like from the speaker of the own site before connecting to another site.
[0126]
FIG. 18 shows a process flow in the case where the acoustic echo canceller is adapted in advance.
[0127]
Before connecting to another site, adaptive processing S601 of the acoustic echo canceller is performed using the data of all frames while the speaker at the own site is playing.
[0128]
This corresponds to unconditionally performing the adaptive processing of the acoustic echo canceller without performing the double talk detection of the present invention.
[0129]
Then, in the other-site connection waiting S602, the system waits until the user of the own site sends a connection request to another site or a connection request arrives from another site.
[0130]
After connecting to another site, acoustic echo canceller adaptation processing S603 with
adaptive control by the double talk detection processing of the present invention is repeated.
Then, after disconnection of the video conference system, the process ends.
[0131]
FIG. 19 shows a flow of processing for performing control of non-linear processing by
VoiceSwitch using the similarity between the input signal and the pseudo echo.
[0132]
Frequency decomposition S701 is performed to convert the recorded voice into a signal in the
frequency domain.
[0133]
An acoustic echo filter is applied to the converted signal in the frequency domain to calculate a
pseudo echo S702.
[0134]
The degree of similarity between the pseudo echo calculated for each frequency f and the input
signal is calculated S703.
[0135]
In the similarity calculation process, the pseudo echo E(f, τ) is used. E(f, τ) is an M-dimensional complex vector whose elements are the pseudo echo components of frame τ and frequency f for the respective microphone elements. The 0-th element of E(f, τ) is written E0(f, τ). Define E′(f, τ) = E(f, τ) / E0(f, τ). Further, let E″(f, τ) = E′(f, τ) / |E′(f, τ)|. The similarity is then |E″(f, τ)*X(f, τ)| / |X(f, τ)|.
This similarity expresses how close the sound source direction of the input signal is to that of the acoustic echo component; when the input signal contains only the acoustic echo component, the similarity is 1. A value obtained by multiplying the similarity by a frequency-dependent coefficient α(f) is taken as the final similarity, where α(f) = 1 / Σθ |E″(f, τ)*Aθ(f)|.
[0136]
If the similarity exceeds the predetermined threshold th (S704) and the power of the input signal is equal to or higher than a threshold S705, the corresponding frequency component of the transmission voice is set to zero S706. Otherwise, the signal after echo cancellation is used as the corresponding frequency component of the transmission voice, and the processing ends.
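This per-bin VoiceSwitch decision might be sketched as follows (the thresholds and signal values are invented for illustration; this is not code from the specification):

```python
import numpy as np

def voice_switch(residual, similarity, power, sim_th=0.9, pow_th=1e-6):
    """Zero every frequency bin judged to be dominated by residual echo."""
    out = residual.copy()
    echo_like = (similarity > sim_th) & (power >= pow_th)
    out[echo_like] = 0.0            # mute echo-dominated bins of the transmit voice
    return out

residual = np.array([0.3 + 0.1j, 0.5 - 0.2j, 0.05 + 0.0j])  # after echo cancellation
similarity = np.array([0.95, 0.40, 0.99])  # per-bin similarity to the pseudo echo
power = np.abs(residual) ** 2

out = voice_switch(residual, similarity, power)
print(out)  # bins 0 and 2 muted, bin 1 (near-end speech) passed through
```

Because the decision is made per frequency bin, near-end speech in other bins survives even while echo-dominated bins are muted.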
[0137]
FIG. 20 shows a flow of processing for performing control of non-linear processing by
VoiceSwitch using the similarity between the input signal and the pseudo echo. The
determination of whether to use VoiceSwitch is common to all frequencies.
[0138]
Frequency decomposition S801 is performed to convert the recorded voice into a signal in the
frequency domain. An acoustic echo filter is applied to the converted signal in the frequency
domain to calculate pseudo echo S802. The similarity between the pseudo echo calculated for
each frequency f and the input signal is calculated S803.
[0139]
In the similarity calculation process, the pseudo echo E(f, τ) is used. E(f, τ) is an M-dimensional complex vector whose elements are the pseudo echo components of frame τ and frequency f for the respective microphone elements. The 0-th element of E(f, τ) is written E0(f, τ). Define E′(f, τ) = E(f, τ) / E0(f, τ). Further, let E″(f, τ) = E′(f, τ) / |E′(f, τ)|. The similarity is then |E″(f, τ)*X(f, τ)| / |X(f, τ)|. This similarity expresses how close the sound source direction of the input signal is to that of the acoustic echo component; when the input signal contains only the acoustic echo component, the similarity is 1. A value obtained by multiplying the similarity by a frequency-dependent coefficient α(f) is taken as the final similarity, where α(f) = 1 / Σθ |E″(f, τ)*Aθ(f)|. The similarities at all frequencies are then added.
[0140]
When the similarity after addition exceeds the predetermined threshold th (S805) and the power of the input signal is equal to or higher than a threshold S806, the transmission voice signal is set to 0 S807. Otherwise, the signal after echo cancellation is used as the transmission voice, and the processing ends.
[0141]
FIG. 21 shows a flow of processing for controlling the non-linear suppression coefficient of the
residual echo after acoustic echo cancellation using the similarity between the input signal and
the pseudo echo.
[0142]
Frequency decomposition S901 is performed to convert the recorded voice into a signal in the
frequency domain.
[0143]
An acoustic echo filter is applied to the converted frequency domain signal to calculate pseudo
echo S902.
[0144]
The similarity between the pseudo echo calculated for each frequency f and the input signal is
calculated S903.
[0145]
In the similarity calculation process, the pseudo echo E(f, τ) is used. E(f, τ) is an M-dimensional complex vector whose elements are the pseudo echo components of frame τ and frequency f for the respective microphone elements. The 0-th element of E(f, τ) is written E0(f, τ). Define E′(f, τ) = E(f, τ) / E0(f, τ). Further, let E″(f, τ) = E′(f, τ) / |E′(f, τ)|. The similarity is then |E″(f, τ)*X(f, τ)| / |X(f, τ)|. This similarity expresses how close the sound source direction of the input signal is to that of the acoustic echo component; when the input signal contains only the acoustic echo component, the similarity is 1. A value obtained by multiplying the similarity by a frequency-dependent coefficient α(f) is taken as the final similarity, where α(f) = 1 / Σθ |E″(f, τ)*Aθ(f)|.
[0146]
When the similarity exceeds the predetermined threshold th (S904) and the power of the input signal is equal to or higher than a threshold S905, the non-linear suppression coefficient β is set to a predetermined value β0 in S906. Otherwise, the non-linear suppression coefficient β is set to β1. It is determined in advance that β0 > β1.
[0147]
Let the signal after acoustic echo cancellation S907 be n′(f, τ), and let the pseudo echo component be e(f, τ). In the non-linear suppression processing S908, n″(f, τ) = Floor(|n′(f, τ)| − β|e(f, τ)|) · exp(j·arg(n′(f, τ))) is computed, and n″(f, τ) is output. Here, Floor(x) is a function that returns x when x is 0 or more and 0 otherwise, and arg(x) is a function that returns the phase component of x. After the non-linear suppression processing has been performed on all frequencies, the processing ends.
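The suppression rule can be sketched as follows, reading the magnitude-phase expression above as max(|n′| − β|e|, 0)·e^{j·arg(n′)} (the values are synthetic and the function names invented; this is not code from the specification):

```python
import numpy as np

def floor0(x):
    """Floor(x): returns x when x >= 0 and 0 otherwise."""
    return np.maximum(x, 0.0)

def nonlinear_suppress(n1, e, beta):
    """n''(f,tau) = Floor(|n'| - beta*|e|) * exp(j*arg(n'))."""
    mag = floor0(np.abs(n1) - beta * np.abs(e))
    return mag * np.exp(1j * np.angle(n1))   # keep the phase of n'(f,tau)

n1 = np.array([1.0 + 1.0j, 0.1 + 0.0j])  # residual after echo cancellation n'(f,tau)
e = np.array([0.5 + 0.0j, 1.0 + 0.0j])   # pseudo echo component e(f,tau)

out = nonlinear_suppress(n1, e, beta=1.0)
print(np.abs(out))  # the second bin is driven to zero, the first only attenuated
```

Only the magnitude is suppressed while the phase of the residual is preserved, and a larger β (β0 in echo-like frames) suppresses more aggressively.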
[0148]
As described above, according to the present invention, control of acoustic echo can be realized in a teleconference system, a video conference system, or the like, and the invention can be applied to acoustic echo cancellation technology in the double-talk state.
[0149]
Brief Description of the Drawings
FIG. 1 shows the hardware configuration of the present invention. FIG. 2 is a block diagram of the software of the present invention. FIG. 3 is a block diagram of speaker-number and position designation by the GUI of the present invention. FIG. 4 is a block diagram of VoiceSwitch switching processing by sound source localization of the present invention. FIG. 5 shows the whole apparatus in which the present invention is applied to a video conference system. FIG. 6 is a diagram explaining a prior-art acoustic echo canceller. FIG. 7 is a diagram explaining an acoustic echo canceller using double talk detection processing. FIG. 8 shows the configuration of a two-site video conference. FIG. 9 is a block diagram of a video conference system to which the present invention is applied. FIG. 10 is a processing flow diagram of the video conference system to which the present invention is applied. FIG. 11 is a diagram explaining the acoustic echo component identification method in the present invention. FIG. 12 is a processing flow diagram of the echo canceller, double-talk determination, and echo canceller adaptation processing of the present invention. FIG. 13 is a per-frequency echo canceller adaptive control flow diagram using the sound source direction. FIG. 14 is a per-frequency echo canceller adaptive control flow diagram using information similar to the sound source direction. FIG. 15 is an echo canceller adaptive control flow diagram for all frequency components using information similar to the sound source direction. FIG. 16 is an echo canceller adaptive control flow diagram for all frequency components using the sound source direction. FIG. 17 is a diagram showing the effect of the present invention. FIG. 18 is a processing flow diagram in the case where adaptation processing is performed before the video conference starts. FIG. 19 is a per-frequency VoiceSwitch control flow diagram using information similar to the sound source direction. FIG. 20 is a VoiceSwitch control flow diagram for all frequencies using information similar to the sound source direction. FIG. 21 is a suppression coefficient control flow diagram of per-frequency non-linear acoustic echo suppression processing using information similar to the sound source direction.
Explanation of sign
[0150]
1: microphone array; 2: A/D converter and D/A converter; 3: central processing unit; 4: memory; 5: hub; 6: speaker; 7: external storage medium; 8: A/D converter; 9: band division unit; 10: phase difference calculation unit; 11: frequency distribution unit; 12: acoustic echo canceller adaptation unit; 13: pseudo echo generation unit; 14: acoustic echo cancellation unit; 15: speaker-number and position setting GUI; 16: echo phase difference calculation processing unit; 17: sound source localization unit; 18: VoiceSwitch determination unit; 19: output signal generation unit; 100: video conference system; 101: computer; 102: A/D-D/A device; 103: hub; 104: image display device; 105: microphone array; 106: speaker; 107: camera; 108: audio cable; 109: audio cable; 110: digital cable; 111: monitor cable; 112: LAN cable; 113: digital cable; 201: audio transmission unit; 202: image transmission unit; 203: audio recording unit; 204: acoustic echo canceller adaptation unit; 205: acoustic echo canceller unit; 206: image display unit; 207: audio reception unit; 208: audio reproduction unit; 209: image reception unit; 210: image capturing unit; 211: acoustic echo canceller filter DB; 701: phase difference calculation unit; 702: double talk detection unit; S1: acoustic echo canceller adaptive processing; S2: determination of whether a connection request has been received from another site; S3: determination of whether a connection request has been issued from the own site; S4: connection to another site; S6: reproduction from the speaker; S7: image display; S8: voice recording; S9: echo cancellation; S10: voice transmission; S11: determination of disconnection; S12: determination of whether there is a disconnection request from the own site; S13: disconnection from the other site; S101: frequency decomposition; S102: echo canceller execution; S103: double-talk state determination; S104: echo canceller adaptation; S201: frequency decomposition; S202: sound source localization; S203: speaker direction identification; S204: determination of whether the sound source direction is the speaker direction; S205: echo canceller adaptation; S301: frequency decomposition; S302: pseudo echo calculation; S303: calculation of similarity with the input signal; S304: determination of whether the similarity is above the threshold; S305: echo canceller adaptation; S401: frequency decomposition; S402: pseudo echo calculation; S403: calculation of similarity with the input signal; S404: similarity addition processing; S405: determination of whether the similarity after addition is above the threshold; S406: all-frequency echo canceller adaptation; S501: frequency decomposition; S502: sound source localization; S503: speaker direction identification; S504: determination of whether the sound source direction is the speaker direction; S505: power spectrum addition processing; S506: determination of whether the speaker direction spectrum is above the threshold; S507: echo canceller adaptation; S601: all-frame adaptation processing; S602: waiting for connection to another site; S603: acoustic echo canceller adaptive processing with adaptive control; S701: frequency decomposition; S702: pseudo echo calculation; S703: calculation of similarity with the input signal; S704: determination of whether the similarity is above the threshold; S705: determination of whether the power is above the threshold; S706: VoiceSwitch determination; S801: frequency decomposition; S802: pseudo echo calculation; S803: calculation of similarity with the input signal; S804: similarity addition processing; S805: determination of whether the similarity after addition is above the threshold; S806: determination of whether the input power is above the threshold; S807: all-band VoiceSwitch determination; S901: frequency decomposition; S902: pseudo echo calculation; S903: calculation of similarity with the input signal; S904: determination of whether the similarity is above the threshold; S905: determination of whether the power is above the threshold; S906: setting of the non-linear suppression coefficient; S907: acoustic echo cancellation; S908: non-linear suppression processing.