Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2010054728
[PROBLEMS] To extract a specific sound source component from among a plurality of mixed
voices in a reverberant environment. According to the multi-channel space prediction and
distortion correction processing of the present invention, it is possible to accurately separate, for
each sound source, a sound in which a plurality of sounds are mixed in an indoor reverberation
environment. [Selected figure] Figure 4
Sound source extraction device
[0001]
The present invention relates to a sound source extraction apparatus that extracts only the signal of a specific sound source from a mixture of various sound sources.
[0002]
Sound source separation techniques that extract only a specific sound from a mixture of various sounds using a plurality of microphones have long been studied actively.
Applications such as extracting a driver's voice from in-vehicle recordings on which traveling noise is superimposed have been considered (see, for example, Patent Document 1).
Conventional sound source separation techniques are roughly classified into two types: blind source separation techniques based on independent component analysis, and beamforming methods based on SNR maximization criteria (see, for example, Non-Patent Document 2).
[0003]
JP 2007-10897 A (Patent Document 1).
J. Chen, J. Benesty, and Y. Huang, "A minimum distortion noise reduction algorithm with multiple microphones," IEEE Trans. ASLP, vol. 16, pp. 481-493, 2008 (Non-Patent Document 1).
S. Araki, H. Sawada, and S. Makino, "Blind speech separation in a meeting situation with maximum SNR beamformers," Proc. ICASSP 2007, vol. I, pp. 41-44, 2007 (Non-Patent Document 2).
M. Togami, T. Sumiyoshi, and A. Amano, "Stepwise phase difference restoration method for sound source localization using multiple microphone pairs," Proc. ICASSP 2007, vol. I, pp. 117-120, 2007 (Non-Patent Document 3).
[0004]
Although blind source separation technology has the advantage of not requiring information on the microphone arrangement or the target sound direction, its performance is insufficient in environments where reverberation exists. Beamforming methods based on the SNR maximization criterion suffer from poor performance when the signal band is wide. It is therefore common to apply such a beamformer to signals that have been converted into narrow-band signals by time-frequency decomposition. However, converting to narrow-band signals generally requires a long frame length, and when the frame length is long the assumption of speech stationarity breaks down, so the performance rather deteriorates. As a method applicable to wide-band signals in the time domain there is the minimum distortion beamformer method (see, for example, Non-Patent Document 1). This method is highly effective at suppressing stationary noise, such as the fan noise of a projector, but has the problem that, in principle, its suppression effect is low for non-stationary noise whose volume changes from moment to moment, such as speech.
[0005]
The sound source extraction apparatus according to the present invention has multi-channel spatial prediction, which can estimate the spatial transfer characteristics of noise using a plurality of microphone-element channels, and a correction process for the distortion of the target sound that accompanies multi-channel spatial prediction. In multi-channel spatial prediction, the spatial transfer characteristics of noise can be estimated regardless of whether the noise is stationary or non-stationary. Therefore, even non-stationary noise can be suppressed by using the estimated spatial transfer characteristics. Furthermore, the present invention has a noise removal filter with a plurality of taps, and can suppress noise while taking reverberation into account. Likewise, since the reverberation of the target sound can also be taken into account, the reverberant component of the target sound can be extracted without distortion.
[0006]
A sound source extraction device according to the present invention includes a microphone array composed of a plurality of microphone elements, an AD conversion device that converts the analog signal output from the microphone array into a digital signal, a calculation device, and a storage device. The calculation device performs digital signal processing that suppresses the noise component in the digital signal converted by the AD conversion device to extract a noise-suppressed signal, and then corrects the distortion of the target sound contained in the noise-suppressed signal. The corrected signal is reproduced or stored in the storage device.
[0007]
The calculation device includes a multi-channel spatial prediction unit that approximates the noise signal contained in one of the plurality of microphone elements by the sum of the noise signals contained in the other elements after each has passed through a first FIR filter, and that determines the coefficients of the first FIR filter so that the sum of the squared approximation errors is minimized. The noise-suppressed signal is obtained by subtracting, from the signal of that one microphone element, the signal produced by superimposing the first FIR filter predicted by the multi-channel spatial prediction unit on the signals contained in the other elements.
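As an illustration of this adaptation step, the following NumPy sketch estimates such a first FIR filter from a noise-only recording by solving the normal equations; the function name, the array layout and the regularization term are assumptions made for the example and are not taken from the patent.

    import numpy as np

    def noise_spatial_prediction_filter(x_noise, m, L, D, reg=1e-6):
        # x_noise: (M, T) noise-only samples, one row per microphone element
        # m: index of the microphone to be predicted; L: taps per predicting channel
        # D: causality delay applied to channel m; reg: diagonal loading for stability
        M, T = x_noise.shape
        others = [i for i in range(M) if i != m]
        V, d = [], []
        for t in range(max(L - 1, D), T):
            # V_m(t): the L most recent samples of every channel except m, stacked into one vector
            V.append(np.concatenate([x_noise[i, t - L + 1:t + 1][::-1] for i in others]))
            d.append(x_noise[m, t - D])          # delayed sample of the predicted channel
        V = np.asarray(V)                        # (frames, (M-1)*L)
        d = np.asarray(d)
        R = V.T @ V + reg * np.eye(V.shape[1])   # plays the role of the noise covariance matrix
        r = V.T @ d                              # plays the role of the noise correlation vector
        return np.linalg.solve(R, r)             # coefficients of the first FIR filter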
[0008]
Furthermore, it is preferable to include a multi-channel distortion correction unit that generates a noise-suppressed signal individually for the outputs of all the microphone elements of the microphone array and applies a second FIR filter to the plurality of generated noise-suppressed signals to obtain a one-channel distortion-corrected signal. The second FIR filter of the multi-channel distortion correction unit is preferably determined so as to minimize the sum of the squared error between the distortion-corrected signal and the output signal of a specific microphone element in the microphone array (or its delayed signal), and a constant multiple of the squared distortion-corrected signal obtained when the input signals of the microphone elements contain only noise.
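A minimal sketch of how such a second FIR filter could be obtained by regularized least squares is shown below; it assumes that the noise-suppressed signals and the residual-noise-only signals have already been stacked frame by frame, and all names and defaults are illustrative.

    import numpy as np

    def multichannel_distortion_filter(Y, Yv, x_target_delayed, mu=1.0, reg=1e-6):
        # Y:  (frames, M*L2) stacked noise-suppressed signals of all microphone elements
        # Yv: (frames, M*L2) the same stacking when the inputs contain noise only
        # x_target_delayed: (frames,) delayed output of the specific (target) microphone
        # Minimizes ||Y H - x_target||^2 + mu * ||Yv H||^2 over the filter H.
        R_noiseless = Y.T @ Y                    # covariance of the noise-suppressed signal
        R_residual = Yv.T @ Yv                   # covariance of the residual noise
        r_corr = Y.T @ x_target_delayed          # correlation with the delayed target microphone
        A = R_noiseless + mu * R_residual + reg * np.eye(Y.shape[1])
        return np.linalg.solve(A, r_corr)        # coefficients of the second FIR filter H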
[0009]
Furthermore, it is preferable to provide a noise signal estimation unit that estimates a noise signal, and a one-channel distortion correction unit that determines a third FIR filter so as to minimize the squared error between the sum of the estimated noise signal and the distortion-corrected signal, each superimposed with its own part of the third FIR filter, and the output signal of a specific microphone element in the microphone array (or its delayed signal), and that outputs the distortion-corrected signal after the third FIR filter has been superimposed.
[0010]
The noise section can be identified on the basis of the mixing degree, calculated for each short time section from the ratio of the target sound power to the noise power, which is in turn computed from information on the target sound position identified by the user's designation operation of the target sound position.
[0011]
With the noise suppression method of the present invention, as long as the spatial transfer characteristics of the noise are invariant, it is possible in principle to eliminate even non-stationary noise such as speech.
Therefore, a specific voice can be extracted from a sound in which a plurality of voices are mixed, and a high-accuracy voice monitoring system can be realized.
Furthermore, the present invention is applicable to wide-band signals in the time domain or the subband domain, and there is no need to convert the signal into the time-frequency domain.
The stationarity problem of speech in the time-frequency domain need not be considered, so a noise-suppressed signal of higher performance can be obtained than with time-frequency-domain techniques.
[0012]
Hereinafter, specific embodiments of the present invention will be described with reference to
the drawings.
FIG. 1 shows the hardware configuration of the first embodiment of the present invention.
The analog sound pressure taken in by the microphone array 101 having a plurality of
microphone elements is sent to the AD converter 102 and converted from analog to digital data.
The conversion process to digital data is performed for each microphone element. The converted
digital sound pressure data of each microphone element is sent to the central processing unit
103 and subjected to digital signal processing. At this time, software for performing digital signal
processing and necessary data are stored in advance in the non-volatile memory 105, and a work
area necessary for the processing is secured on the volatile memory 104. The sound pressure
data processed by the digital signal processing is sent to the DA converter 106 and converted
from digital data to analog sound pressure. After conversion, the signal is output from the
speaker 107 and reproduced. All software blocks in the first embodiment of the present
invention are assumed to be executed on the central processing unit 103.
[0013]
FIG. 2 shows a software block diagram of the first embodiment. FIG. 20 shows the correspondence between the software blocks and the hardware configuration shown in FIG. 1. The waveform capture unit 201 expands the digital data of each microphone element captured by the AD conversion device onto the volatile memory 104. The acquired sound pressure data is expressed as the following equation (1).
[0014]
xm(t)   (1)
m represents the index of the microphone element and takes a value from 1 to M, where M is the number of microphone elements used for the noise suppression processing. t is a time index in units of the sampling interval.
[0015]
The acquired waveform is sent to the filter adaptation processing unit 202, which performs adaptation of the noise suppression filter. The adapted filter coefficients are stored in the filter data 204, held in the volatile memory 104 or the non-volatile memory 105. The filtering unit 203 reads the stored filter data 204, superimposes the noise suppression filter on the microphone input signal captured by the waveform capture unit 201, and obtains a noise-suppressed signal. The noise-suppressed signal is sent to the waveform reproduction unit 205, output from the speaker 107, and reproduced. Alternatively, the noise-suppressed signal may be stored in the volatile memory 104 or the non-volatile memory 105, transmitted to an external system using a network device or the like, or read and reproduced by another system.
[0016]
It is assumed that the sound captured by the waveform capture unit 201 is either noise alone that is unnecessary to the user, or a sound in which the target sound that the user wants to hear is mixed with such noise. The present invention aims at suppressing the noise in such sounds and extracting the target sound that the user wants to hear. One of the M microphone elements is called the target microphone, and the target sound component is extracted from the input signal of the target microphone. The filter adaptation processing unit 202 determines whether the sound data obtained by the waveform capture unit contains only noise unnecessary to the user or also contains the target sound that the user wants to hear, and performs filter adaptation using the determination result. The adaptation of the filter is done in a so-called batch process; in other words, filter adaptation is performed using recorded data covering a relatively long time. The filtering unit 203, on the other hand, can operate each time a waveform is obtained, as long as the filter data 204 is present.
[0017]
FIG. 3 shows a flowchart of the processing in the filter adaptation processing unit 202. In the filter adaptation processing, it is first determined whether the sound obtained by the waveform capture unit is a noise-only sound that is unnecessary to the user or a sound (mixed sound) in which the target sound that the user wants to hear is mixed. In the noise capture S301, data from the time zones determined to be noise is captured and expanded on the volatile memory. In the mixed sound capture S302, data from the time zones determined to be mixed sound is captured and expanded on the volatile memory. The obtained noise is expressed by equation (2), and the obtained mixed sound by equation (3).
[0018]
The superscript T is an operator representing the conjugate transpose of a vector or matrix. It is assumed that data of time lengths Ln and Ls, respectively, are obtained.
[0019]
In the present invention, processing may also be performed after the signal obtained by the microphones has been divided into a plurality of subbands using filter bank processing or the like. In that case, analysis filter bank processing is applied immediately after the signal is taken in from the microphones to divide it into subbands, the noise suppression processing of the present invention is performed for each subband, and synthesis filter bank processing is then performed to synthesize the signals of the respective subbands. When a DFT (Discrete Fourier Transform) modulated filter bank is used, the signal after subband division is complex-valued, but the processing of the present invention is applicable whether the input signal is complex or real.
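A rough sketch of such an analysis / per-subband processing / synthesis wrapper is given below; it is realized here with a windowed STFT and overlap-add rather than a true DFT-modulated filter bank, and the frame size, window and callback interface are assumptions for the example.

    import numpy as np

    def process_per_subband(x, process_band, frame=512, hop=256):
        # x: (M, T) multichannel time signal
        # process_band: callable taking an (M, n_frames) complex array for one band
        #               and returning the processed 1-D complex array (n_frames,)
        M, T = x.shape
        win = np.hanning(frame)
        n_frames = 1 + (T - frame) // hop
        # analysis: (M, n_frames, frame//2+1) complex subband signals
        X = np.array([[np.fft.rfft(win * x[m, k * hop:k * hop + frame]) for k in range(n_frames)]
                      for m in range(M)])
        # per-subband processing (complex-valued input is fine, as noted above)
        Y = np.array([process_band(X[:, :, f]) for f in range(frame // 2 + 1)]).T
        # synthesis by overlap-add
        out = np.zeros(T)
        for k in range(n_frames):
            out[k * hop:k * hop + frame] += win * np.fft.irfft(Y[k], n=frame)
        return out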
[0020]
The obtained noise and mixed sound are processed in the noise multi-channel spatial prediction S303. As noise statistics, the noise covariance matrix expressed by equation (4) and the noise correlation matrix expressed by equation (5) are obtained. Here, Vm(t) is defined by equation (6): it is a vector that does not contain the m-th microphone input signal and has (M−1)L elements, where L is the filter length. D is a delay introduced to satisfy causality.
[0021]
A filter (FIR, Finite Impulse Response filter) that approximates the m-th microphone signal from the microphone elements other than the m-th element so as to minimize the squared error is expressed by equation (7).
[0022]
Hereinafter, this filter is referred to as an FIR filter.
In noise multi-channel spatial prediction, this filter is determined for each microphone. In the conventional single-channel spatial prediction method (see, for example, Non-Patent Document 1), the signal of a certain microphone element (the prediction destination) is approximated by the signal of another single microphone element (the prediction source). Its prediction accuracy deteriorates when the amplitude characteristics of the prediction destination and the prediction source differ greatly due to the influence of reverberation. In contrast, in the multi-channel spatial prediction of the present invention, even if a notch forms in the amplitude characteristic of one microphone element, it can be covered by another microphone, so high-accuracy prediction is possible. The noise multi-channel spatial prediction S303 outputs the multi-channel spatial prediction filter of the obtained noise. In the target sound estimation S304, a noise-suppressed signal ym(t) is obtained for each microphone by equation (8), where Xm(t) is defined by equation (9).
[0023]
In this signal the noise is suppressed and only components originating from the target sound remain, but the target sound is distorted by the spatial prediction filter.
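The sketch below shows how the adapted filter might be applied to a mixed recording to obtain this noise-suppressed signal; it mirrors the structure of equation (8), but the exact ordering of samples inside Xm(t) is an assumption of the example.

    import numpy as np

    def noise_suppressed_signal(x, m, w, L, D):
        # x: (M, T) mixed recording (target sound + noise); m: target microphone index
        # w: ((M-1)*L,) spatial prediction filter obtained in the adaptation step
        # Returns y_m(t) = x_m(t-D) - w^T X_m(t), i.e. delayed input minus predicted noise.
        M, T = x.shape
        others = [i for i in range(M) if i != m]
        y = np.zeros(T)
        for t in range(max(L - 1, D), T):
            X_m = np.concatenate([x[i, t - L + 1:t + 1][::-1] for i in others])
            y[t] = x[m, t - D] - w @ X_m
        return y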
[0024]
The target sound estimation S304 outputs, together with the noise-suppressed signal, expression (10), the covariance matrix of the noise-suppressed signal, and expression (11), the correlation matrix between the target microphone and the noise-suppressed signal. Here, target is the microphone index of the target microphone, Y(t) is defined by equation (12), and L2 is the filter length of the distortion correction processing in the subsequent stage.
[0025]
In the residual noise estimation S305, the output signal yv,m(t) obtained when a noise-only signal is subjected to the noise suppression processing by multi-channel spatial prediction is calculated using equation (13). The obtained residual noise component and its noise covariance matrix, equation (14), are output. Yv(t) is defined by equation (15).
[0026]
In the spatial / F-characteristic distortion correction S306, the distortion of the target sound contained in ym(t) is corrected by the two-stage correction process described later, and the corrected signal is output. The F characteristic refers to the amplitude and phase characteristics at each frequency. By applying the two-stage correction, a signal whose spatial, amplitude, and phase characteristics have been corrected can be obtained.
[0027]
FIG. 4 shows the processing flow of the filtering unit 203 of the present invention. The multi-channel spatial prediction unit 401 superimposes the spatial prediction filter wm on the input signals of the microphones other than the m-th one, in which the target sound and the noise are mixed. The delay processing unit 402 delays the m-th microphone input signal by D points in order to satisfy causality. The noise-suppressed signal is obtained by subtracting the signal produced by the multi-channel spatial prediction filter from the delayed microphone input signal. The multi-channel distortion correction unit 403 applies the multi-channel distortion correction filter H defined by expression (16) to the obtained multi-channel noise-suppressed signal.
[0028]
The signal sdistorted(t) after distortion correction is a monaural signal. The one-channel distortion correction unit 404 superimposes the frequency distortion correction filter g on sdistorted(t) and obtains equation (17) as the signal after distortion correction.
[0029]
FIG. 5 shows the effect of the present invention. The waveform at the top of FIG. 5 is the target sound signal contained in the target microphone; the purpose is to obtain a noise-suppressed waveform close to this waveform. The second waveform is the waveform after noise has been mixed in; it can be seen that it differs from the original target sound signal. The third waveform is the waveform after applying a method based on single-channel spatial prediction (see, for example, Non-Patent Document 1); the noise component is reduced and the waveform approaches the top one, but the distortion is large and the shape differs. The fourth waveform is the waveform after noise suppression by the processing of the present invention; it can be seen that it is very close to the target sound. Thus, according to the present invention, a noise-suppressed signal with little distortion can be obtained.
[0030]
The determination of the noise interval in the noise capture S301 of FIG. 3 may take the form of the user dragging over a waveform display tool to designate a time interval in which only noise exists. Alternatively, it may take a form in which the noise interval is identified automatically from the spatial position of the target sound specified by the user, based on signals separated by the time-frequency-domain sound source separation described later, which relies on conventional independent component analysis or on sparsity in the time-frequency distribution.
[0031]
The specific processing flow of the latter form is shown in FIG. 6. The mixed sound capture 601 outputs the signals with which a plurality of microphone elements receive a sound in which a plurality of sound sources are mixed. The time-frequency-domain sound source separation 602 clusters the sound source direction estimation results obtained for each time-frequency point, using sound source direction estimation in the time-frequency domain (see, for example, Non-Patent Document 3), and restores the original signal for each sound source.
[0032]
In the target sound specification 603, the user selects the sound to be extracted from the restored original signals. The selection may be configured such that the sounds of the restored original signals are reproduced one by one from a speaker and the user makes a selection, or such that the sound source direction is estimated (see, for example, Non-Patent Document 3) for each restored original signal, the estimated sound source directions are displayed on the screen, and the user selects the direction to be extracted from among them. In this way, the target sound specification 603 finishes its processing by outputting information indicating which of the plurality of restored signals output by the time-frequency-domain sound source separation 602 is the target sound the user wants to extract. The number of target sounds does not have to be one; there may be several.
[0033]
In the processing for each section 604, the restored signals are cut into short sections of a few seconds and loop processing is performed. After the target sound specification 603, each restored signal can be assigned either to the target sound or to the noise. All signals assigned to the target sound are added together, and likewise all signals assigned to the noise are added together. The time series of the short-time power of the target sound and of the noise after addition have shapes like those shown in the top and second rows of FIG. 7. The power of the target sound in each short section is denoted Ps(τ) and the power of the noise Pn(τ), where τ is a variable representing the index of the short section.
[0034]
In the mixing degree processing 605, the ratio of Ps(τ) to Ps(τ) + Pn(τ) is calculated for each short section as an estimate of the power ratio of the target sound to the noise (the mixing degree). The sound source mixing degree forms a time series like that shown in the third row of FIG. 7. In the sorting 606, in order to identify short sections with a small mixing degree, the above ratios are rearranged in ascending order. The processing for each section 607 moves the processing to the next short section. The noise segment estimation 608 takes the predetermined top N segments from the short segments with the smallest mixing degree, outputs the extracted segments as noise sections, and ends the processing.
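As an illustration of steps 605 to 608, the sketch below computes the mixing degree per short section, sorts the sections, and returns the N least-mixed ones as noise sections; the section length, the value of N and the variable names are assumptions of the example.

    import numpy as np

    def select_noise_sections(target_wave, noise_wave, fs, sec_len_s=1.0, top_n=10):
        # target_wave, noise_wave: time-aligned 1-D signals of the summed target and noise sources
        sec = int(sec_len_s * fs)
        n_sec = min(len(target_wave), len(noise_wave)) // sec
        mix = []
        for k in range(n_sec):
            ps = np.mean(target_wave[k * sec:(k + 1) * sec] ** 2)   # target sound power Ps(tau)
            pn = np.mean(noise_wave[k * sec:(k + 1) * sec] ** 2)    # noise power Pn(tau)
            mix.append(ps / (ps + pn + 1e-12))                      # mixing degree Ps / (Ps + Pn)
        order = np.argsort(mix)                          # ascending: smallest mixing degree first
        return sorted(int(k) for k in order[:top_n])     # indices of the estimated noise sections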
[0035]
FIG. 8 shows an example of performing sound source separation, as the sound source separation processing in the time-frequency domain, from a histogram of the sound source directions calculated for each time-frequency point. In the processing 801 for each time frequency, the microphone input signals of the plurality of elements are first processed every short time interval (frame shift). The start of the waveform to be processed is advanced by the frame shift each time; the frame shift is predetermined to have a time length of about several milliseconds. The time length from the beginning to the end of the waveform to be processed is called the frame size, and is set to a value longer than the frame shift. Direct-current component removal, Hanning windowing, and a short-time Fourier transform are applied to the frame-size data of each microphone element to obtain a signal in the time-frequency domain. The processing unit of this short-time processing is called a frame, and the frame index is written τ. The signal of frame τ at the f-th frequency obtained at microphone element m is written xm(f, τ), and X(f, τ) = [x1(f, τ), ..., xm(f, τ), ..., xM(f, τ)]^T. In the processing 801 for each time frequency, a loop that performs processing for each frequency f and each frame τ is started.
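The following sketch covers the framing, DC removal, Hanning windowing and short-time Fourier transform that produce xm(f, τ); the frame size and frame shift values are illustrative only and do not come from the patent.

    import numpy as np

    def stft_frames(x_multichannel, frame_size=1024, frame_shift=256):
        # x_multichannel: (M, T) time-domain samples
        # returns: (M, n_frames, frame_size//2+1) complex spectra x_m(f, tau)
        M, T = x_multichannel.shape
        n_frames = 1 + (T - frame_size) // frame_shift
        win = np.hanning(frame_size)
        X = np.empty((M, n_frames, frame_size // 2 + 1), dtype=complex)
        for m in range(M):
            for tau in range(n_frames):
                seg = x_multichannel[m, tau * frame_shift:tau * frame_shift + frame_size]
                seg = seg - seg.mean()                 # direct-current component removal
                X[m, tau] = np.fft.rfft(win * seg)     # Hanning window + short-time Fourier transform
        return X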
[0036]
In the phase difference analysis 802, the sound source direction at frequency f and frame τ is estimated by GCC-PHAT or the SPIRE method (see, for example, Non-Patent Document 3). The histogram generation 803 builds a histogram of the estimated sound source directions: one vote is added, for each frequency f and frame τ, to the histogram bin corresponding to the sound source direction obtained for that frequency and frame. The processing for each time frequency 804 transfers the processing to the next frequency or the next frame. The histogram peak search 805 searches for peaks in the resulting histogram of sound source directions. Histogram bins whose values are larger than those of the neighboring bins are detected as peaks, and a predetermined number of peaks are extracted and output in descending order of their vote counts; the number P of peaks is at most the number of microphones. In the steering vector generation 806, the direction estimated for frequency f and each frame τ is compared with each peak obtained in the histogram peak search 805, and the peak with the smallest direction difference is selected. The input vectors X(f, τ) at frequency f whose selected peak number is p are collected into a set, and from it a steering vector ap(f), one per peak and frequency, is determined by equation (18). The magnitude of the calculated steering vector is normalized to 1, and the normalized steering vector is denoted âp(f). A matrix A(f) generated from these steering vectors is given by expression (19). In the inverse filtering 807, the filter defined by the generalized inverse matrix of A(f) (equation (20)) is superimposed on the microphone input signal at each time-frequency point; the resulting vector has the separated signal of each source at that time-frequency point as its elements.
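A compact sketch of the steering-vector construction and generalized-inverse filtering, in the spirit of equations (18) to (20), follows; it assumes the per-time-frequency peak assignment has already been computed, and the simple averaging used to form each steering vector is an assumption of the example.

    import numpy as np

    def separate_by_steering_vectors(X, peak_assignment, n_peaks):
        # X: (M, n_frames, F) complex STFT of the microphone inputs
        # peak_assignment: (n_frames, F) integer array; index of the nearest
        #                  histogram peak for each time-frequency point
        M, n_frames, F = X.shape
        S = np.zeros((n_peaks, n_frames, F), dtype=complex)
        for f in range(F):
            A = np.zeros((M, n_peaks), dtype=complex)
            for p in range(n_peaks):
                sel = peak_assignment[:, f] == p
                if np.any(sel):
                    a = X[:, sel, f].mean(axis=1)                 # average input vector for this peak
                    A[:, p] = a / (np.linalg.norm(a) + 1e-12)     # normalized steering vector
            W = np.linalg.pinv(A)                                 # generalized inverse filter
            S[:, :, f] = W @ X[:, :, f]                           # separated signals per time-frequency point
        return S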
[0037]
In the time-domain waveform generation 808, all time-frequency components are collected for each sound source, an inverse short-time Fourier transform and overlap-add processing are performed, and a time-domain waveform for each sound source is obtained and output.
[0038]
FIG. 9 shows a configuration that performs dereverberation in real time in addition to noise elimination.
The blocks from the waveform acquisition unit 901 to the filter data 904 perform the same processing as the blocks from the waveform acquisition unit 201 to the filter data 204 of FIG. 2. In the configuration of FIG. 2 the target microphone is a specific one of the M microphones, but in FIG. 9 the noise-suppressed waveforms of all the microphones are extracted; that is, the target microphone is varied from 1 to M, noise suppression is performed for each, and the noise-suppressed waveforms are extracted.
[0039]
The target sound segment extraction unit 905 calculates the power time series of the noise-suppressed M-channel signal output from the filtering unit 903. Voice segments are then extracted using power-based VAD (voice activity detection). Furthermore, voice segments are selected in descending order of power until a predetermined number is reached or the total extracted length reaches a predetermined time. The extracted voice sections are output as the target sound sections. By extracting voice sections with large power in this way, the spatial transfer characteristics can be learned with high accuracy.
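The sketch below illustrates this power-based selection of target sound sections: a crude power VAD followed by keeping the loudest frames up to a time budget. The threshold, frame length and budget are assumptions made for the example.

    import numpy as np

    def extract_high_power_segments(y, fs, frame_s=0.02, thresh_db=-30, max_total_s=30.0):
        # y: 1-D noise-suppressed signal; returns indices of the retained short frames
        frame = int(frame_s * fs)
        n = len(y) // frame
        power = np.array([np.mean(y[k * frame:(k + 1) * frame] ** 2) for k in range(n)])
        active = power > np.max(power) * 10 ** (thresh_db / 10)      # crude power-based VAD
        order = np.argsort(power)[::-1]                              # loudest frames first
        budget = int(max_total_s / frame_s)                          # total time budget in frames
        keep = set(int(k) for k in order[:budget] if active[k])
        return sorted(keep)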
[0040]
The target sound transfer characteristic learning unit 906 learns, from the target sound section waveforms extracted by the target sound segment extraction unit 905, the various statistics used in multi-channel dereverberation based on second-order statistics, and calculates the dereverberation filter after learning. The calculated dereverberation filter is written out as the dereverberation filter 907. While the processing up to this point is so-called batch processing, noise suppression and dereverberation are subsequently performed on waveforms acquired in real time.
[0041]
The real-time waveform acquisition unit 908 outputs the minimum data required to filter sound
data of a plurality of channels each time the data is obtained. The output data is sent to the
filtering unit 903, noise-suppressed, and then sent to the dereverberation unit 909.
[0042]
The dereverberation unit 909 reads the dereverberation filter 907 adapted by batch processing
and performs dereverberation processing. The dereverberated data is sent to the real time
waveform reproduction unit 910, subjected to DA conversion, and emitted from the speaker.
[0043]
In general, adaptation of a dereverberation filter requires long-term observation data, so it is desirable to use a batch-adapted filter. To handle the case where a plurality of target sounds exist, the configuration may be as follows: the target sound segment extraction unit 905 performs sound source direction estimation for each obtained segment and clusters the segments based on the direction estimation results; for each cluster, a target sound signal of a predetermined time length is extracted based on power; from the extracted sections the target sound transfer characteristic learning unit 906 obtains a dereverberation filter for each direction; and sound source direction estimation is performed before the dereverberation unit 909, which then performs dereverberation using the dereverberation filter of the direction closest to the estimated one.
[0044]
FIG. 10 shows a configuration example of the spatial distortion correction part of the spatial / F-characteristic distortion correction S306 of FIG. 3. The spatial distortion correction filter H is defined by equation (21) and calculated by equation (22).
[0045]
The residual noise estimation unit 1001 calculates yv,m(t) defined by equation (13). The target sound estimation unit 1002 calculates ym(t) defined by equation (8). The delay processing unit 1007 adds the delay D required to satisfy causality to the input signal of the target microphone and outputs the delayed signal. The target sound covariance estimation unit 1005 calculates Rcov(noiseless). The residual noise covariance estimation unit 1003 calculates Rcov(noise, noiseless). The μ multiplication unit 1004 multiplies all elements of Rcov(noise, noiseless) by μ. The inverse matrix operation unit 1006 calculates the inverse matrix invR of Rcov(noiseless) + μRcov(noise, noiseless). The target sound correlation matrix estimation unit 1008 calculates the correlation vector Rcor(noiseless) defined by equation (11), and the matrix multiplication unit 1009 calculates the matrix product Rcor(noiseless) invR. This product is output as the distortion correction filter H.
[0046]
FIG. 11 shows one configuration of the F-characteristic distortion correction. The multi-channel distortion correction unit 1101 calculates the signal after multi-channel distortion correction defined by expression (16). The delay processing unit 1102 delays the input signal of the target microphone by the delay D required to satisfy causality and outputs the delayed signal. The noise covariance estimation unit 1103 calculates Rcov(noise) defined by equation (24), where V(t) is defined by equation (23).
[0047]
The μ multiplication unit 1104 multiplies all elements of Rcov (noise) by a predetermined
coefficient μ. The target sound covariance estimation unit 1105 calculates a matrix Rcov (input)
defined by the following equation (26). Here, X (t) is defined by equation (25).
[0048]
The noise correlation estimation unit 1107 calculates the correlation matrix Rcor (noise) defined
by equation (27).
[0049]
The inverse matrix operation unit 1106 calculates the inverse matrix invR2 of Rcov(input) + μRcov(noise).
The matrix multiplication unit 1108 calculates the matrix product Rcor(noise) invR2 and sets it as the noise estimation filter R. The noise estimation unit 1109 estimates the noise component n(t) as n(t) = R X(t) from the multi-channel signal X(t) in which the target sound and the noise are mixed; n(t) is a one-channel noise signal.
[0050]
The least squares filter estimation unit 1110 calculates, by equation (29), the filters g and q for which the squared error between the estimated input signal x̂target(t−D) given by equation (28) and xtarget(t−D) is minimized. In the formula, "*" is the convolution operator. The distortion correction filter g thus obtained is output and the processing ends.
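The joint least-squares fit of g and q described above could be sketched as follows; the stacking of filter taps and the regularization are assumptions made for the example, and only g is returned because only g is applied to obtain the corrected output of equation (17).

    import numpy as np

    def one_channel_correction_filter(s_distorted, n_est, x_target_delayed, L2, reg=1e-6):
        # s_distorted: 1-D distortion-corrected signal; n_est: 1-D estimated noise n(t)
        # x_target_delayed: 1-D delayed target-microphone signal x_target(t-D)
        T = len(x_target_delayed)
        rows, targets = [], []
        for t in range(L2 - 1, T):
            s_win = s_distorted[t - L2 + 1:t + 1][::-1]   # taps applied by g
            n_win = n_est[t - L2 + 1:t + 1][::-1]         # taps applied by q
            rows.append(np.concatenate([s_win, n_win]))
            targets.append(x_target_delayed[t])
        A = np.asarray(rows)
        b = np.asarray(targets)
        coeffs = np.linalg.solve(A.T @ A + reg * np.eye(2 * L2), A.T @ b)
        g, q = coeffs[:L2], coeffs[L2:]
        return g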
[0051]
FIG. 12 is a diagram showing a hardware configuration of the second embodiment of the present
invention. Sound data captured by the microphone array 1201 is converted from analog sound
pressure to digital sound pressure data by an AD converter 1203. After the converted data is
processed on the computer 1204, the data is transmitted to the computer on the server via the
HUB 1205. In addition, the image data captured by the camera 1202 is also transmitted together
with the audio data. The server receives the transmitted data via the HUB 1206. The received
data is subjected to signal processing by the computer 1207 on the server. The sound data
subjected to the signal processing is recorded by the large scale storage 1211.
[0052]
Also, in response to a request from a user who browses the conference data, the server transmits the data to that browsing user. The data is sent to the browsing user's computer 1208 via the browsing-side HUB 1211, processed on the computer 1208, and reproduced from the speaker 1209. In addition, part of the acoustic information is displayed on the display device 1210.
[0053]
FIG. 13 shows the configuration of the screen displayed on the display device 1210 of the browsing user. The screen 1301 of the display device 1210 consists of four sub-screens. A moving image captured by the camera 1202 during the conference is displayed on the camera image display unit 1301-1. The sound source position display unit 1301-2 displays the sound source positions estimated from the sound captured by the microphone array during the conference. The sound source positions may be obtained by a peak search of a direction histogram created from all the audio of the conference, or of a direction histogram generated, in synchronization with the camera image, from the audio waveforms before and after the video time. The screen of 1301-2 is a scaled-down plan view of the meeting room, on which the planar positions of the sound sources are displayed; the color and shape of the markers may be varied for each sound source position.
[0054]
The speech timing display unit 1301-3 marks speech parts, varying the density according to the speech volume. The speech position of each sound source may be marked with the color or shape used for that sound source in the sound source position display unit 1301-2. The thumbnail image display unit 1301-4 displays, for each utterance part, one camera image from a time zone included in that utterance part. When there are a plurality of cameras, the image of the camera that captures the sound source direction of the utterance part may be displayed. In addition, when the user clicks a specific point of the camera image display unit 1301-1 with the mouse attached to the computer, or clicks a sound source position in the sound source position display unit 1301-2, the sound of the clicked position is reproduced. The reproduction location of the sound source may be displayed on the speech timing display unit 1301-3, and when an utterance location on the speech timing display unit 1301-3 is clicked, that location may be reproduced.
[0055]
FIG. 14 is a diagram showing the software configuration of the second embodiment of the present invention. The multi-channel sound information taken in by the sound capture unit 1401 and the image data taken in by the image capture unit 1403 are sent to the data transmission unit 1404 and transmitted to the server. Information 1402 on the arrangement of each microphone element of the microphone array at the conference site and on the arrangement and orientation of the camera is also transmitted together with the sound information and the image data. On the server, the data reception unit 1405 receives the sound information, the image data, and the data on the microphone element arrangement and the camera arrangement and orientation, and stores them in the per-site data 1413, a data area on the large-scale storage.
[0056]
At the viewing site, the user I/F processing unit 1412 recognizes the user's click and drag positions and converts them into information on the sound source position to be reproduced. The voice waveform of the corresponding sound source position stored in the per-site data 1410 is reproduced. If there is no voice waveform of the corresponding sound source position in the per-site data 1410, the conference data request unit 1406 may transmit to the server a request for the voice waveform of that sound source position. The request transmitted to the server is received by the data reception unit 1407, and a command to extract the sound waveform of the reproduction sound source position included in the request is sent to the acoustic information generation unit 1409.
[0057]
Based on the multi-channel audio waveforms stored in the per-site data 1413 and the information on the spatial arrangement of the microphone array with which they were recorded, the acoustic information generation unit 1409 separates and extracts the speech waveform of the requested sound source position by the method of the first embodiment of the present invention. The data transmission unit 1408 transmits the extracted voice waveform to the browsing site. Camera images and information on the sound source direction at each time may also be sent. The image display unit 1415 displays the camera image on the camera image display unit of the display device; at the time of display, the reproduced image may be changed in accordance with the reproduced sound source waveform. The audio reproduction unit 1411 reproduces the designated reproduction portion of the waveform of the sound source position selected by the user and outputs the sound from the speaker.
[0058]
FIG. 15 shows the processing flow, involving the user I/F processing unit, the audio reproduction unit, and the image display unit, for the user's click and drag operations. In the selection 1501 of the direction to be heard, the direction desired by the user is identified from the user's click and drag positions. It is then determined whether a sound source is present in the identified direction (1502). If not, a message indicating that no sound source exists in that direction is presented (1507) and the processing ends. If a sound source is present, the noise segments are extracted by the noise segment identification processing 1503, using the noise segment extraction processing of FIG. 6 shown in the first embodiment. In the target sound extraction 1504, the noise-suppressed target sound is extracted from the noise section information by the noise suppression method of FIG. 3 shown in the first embodiment. In the selection of the playback section 1505, the speech sections of the noise-suppressed target sound are displayed on the speech timing display unit, and the user is asked to select from them the section he or she wants to hear. In the reproduction of sound and image 1506, the audio reproduction unit reproduces the voice of the selected utterance section, and the camera image corresponding to the reproduced utterance section is displayed on the camera image display unit 1301-1 of the display device 1210 in synchronization with it. After the reproduction ends, the processing ends.
[0059]
FIG. 16 is a diagram showing the abnormal sound detection block of the monitoring system of the third embodiment of the present invention. The target abnormal sound is, for example, the operating sound of a machine in a factory at the time of an abnormality, or the sound of breaking glass in an office or home. The hardware configuration is the same as that of the second embodiment shown in FIG. 12, and the software block configuration is the same as that shown in FIG. 14. The sound source information generation unit 1601 corresponds to the acoustic information generation unit of FIG. 14.
[0060]
The abnormal sound database 1603 is assumed to store acoustic feature amounts of abnormal sounds, such as the amplitude spectrum and the cepstrum, and state transition information of their transition patterns described in Hidden Markov Model format. The pattern matching unit 1602 performs pattern matching between the information of the extracted sound source waveform and the abnormal sound information described in the abnormal sound database. A short-time Fourier transform is applied to the sound source waveform to extract an acoustic feature such as the amplitude spectrum or cepstrum, and the distance between the extracted acoustic feature and the transition pattern of the acoustic feature of the abnormal sound, i.e. the spectral pattern of the abnormal sound described by the Hidden Markov Model in the abnormal sound database, is calculated. From the result of the distance calculation, the likelihood of the existence of the abnormal sound is calculated. When the spectral pattern of the abnormal sound is described by a Hidden Markov Model, the distance calculation can be performed at high speed by the Viterbi algorithm or the like.
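As a simplified stand-in for the HMM-based distance calculation described above, the sketch below scores a sequence of acoustic features against diagonal-Gaussian HMM states with a Viterbi recursion; the model parameters are assumed to come from the abnormal sound database, and the uniform initial-state prior is an assumption of the example.

    import numpy as np

    def hmm_log_likelihood(features, state_means, state_vars, log_trans):
        # features: (T, D) acoustic features (e.g. amplitude spectrum or cepstrum frames)
        # state_means, state_vars: (S, D) per-state Gaussian parameters
        # log_trans: (S, S) log transition probabilities
        T, D = features.shape
        diff = features[:, None, :] - state_means[None, :, :]
        # frame-wise log emission probability of each state (diagonal Gaussian)
        log_emit = -0.5 * np.sum(diff ** 2 / state_vars + np.log(2 * np.pi * state_vars), axis=2)
        delta = log_emit[0].copy()                      # uniform (unnormalized) initial prior
        for t in range(1, T):
            delta = log_emit[t] + np.max(delta[:, None] + log_trans, axis=0)   # Viterbi step
        return np.max(delta) / T                        # per-frame log-likelihood score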
[0061]
The abnormal sound determination unit 1604 determines, for each short time interval, whether or not an abnormal sound is present from the calculated likelihood. When the determination indicates that an abnormal sound exists, the alert transmission unit 1605 transmits warning information. The warning takes the form of emitting a predetermined warning sound from a speaker at the browsing site and displaying on the screen the place where the abnormal sound occurred and the time zone.
[0062]
FIG. 17 is a diagram showing the specific processing flow of the abnormal sound detection processing. The mixed sound capture 1701 captures multi-channel sound data in which various sounds are mixed. The time-frequency-domain sound source separation 1702 generates a signal for each sound source. Since time-frequency-domain sound source separation cannot completely separate the signals of the individual sound sources, processing that improves the separation accuracy is added next. In the processing for each sound source 1703, a processing loop over the separated sound sources is started. In the processing 1704 for each section, a processing loop over the waveform of each short time section of the sound source signal being processed is started. In the mixing degree processing 1705, the mixing degree Ps(t) / (Pn(t) + Ps(t)) is calculated for each section t, using the power Ps(t) of the sound source waveform being processed and the power Pn(t) of the sound sources other than the one being processed. The calculated mixing degrees are rearranged in ascending order in the sorting 1706. The processing 1707 for each section transfers the processing to the next section. In the noise segment extraction 1708, sections are extracted from the sorted mixing-degree information, starting from the smallest mixing degree, until the total time reaches a predetermined length, and the extracted sections are output as noise sections. In the noise removal 1709, the signal of only the target sound, from which the noise has been removed, is extracted by the processing flow shown in FIG. 3 of the first embodiment of the present invention. In the abnormal sound detection 1710, pattern matching against the abnormal sound information is performed; when an abnormal sound is detected, the processing moves to the alert transmission 1711, an alert is transmitted to the browsing site, and the processing then moves to the next sound source. If no abnormal sound is detected, the processing proceeds to the next sound source without doing anything.
[0063]
FIG. 18 shows the processing flow of the speech-speed conversion processing applied when reproducing the sound of the sound source position designated by the user according to the present invention. This flow is processed by the audio reproduction unit 1411 in FIG. 14. The purpose of this processing is to reproduce the sound of the sound source specified by the user slowly, at an easy-to-hear speed, and to reproduce the sounds of the other speakers at high speed, so that only the sound the user wants to hear can be heard easily, while the other sounds are played back quickly and therefore take little time.
[0064]
In the target sound / noise extraction 1801, the sections in which the target sound is present and the sections containing only noise are extracted according to the first embodiment of the present invention. In the processing 1802 for each section, the extracted audio is divided into short time sections and loop processing over the sections is started. The speech detection 1803 based on the SNR calculates SNR = Ps(t) / Pn(t) from the short-time power Ps(t) of the target sound and the short-time power Pn(t) of the noise. In the voice determination 1804, if the SNR is equal to or greater than a predetermined threshold, it is determined that voice is present in that short time section and the playback speed of the section is set to the speech speed for target sound sections (1806). If the SNR is below the threshold, the section is determined to be a noise section and, in 1805, its playback speed is set to the predetermined speech speed for noise sections. The speech speed for noise sections is set in advance to be faster than that for target sound sections. After the speed is set, the processing is transferred to the next section in the processing 1807 for each section. In the reproduction 1808, speech-speed conversion processing is performed according to the speed actually set for each section, the converted voice is reproduced from the speaker, and the processing then ends.
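A very rough sketch of this section-wise speed selection is shown below; the time scaling used here is plain resampling (which also changes pitch), a stand-in for a proper speech-speed conversion algorithm, and the section length, threshold and speed values are illustrative assumptions.

    import numpy as np

    def variable_speed_playback(target, noise, fs, sec_s=0.5, snr_thresh=2.0,
                                speed_speech=1.0, speed_noise=2.0):
        # target, noise: time-aligned 1-D signals; returns the speed-converted waveform
        sec = int(sec_s * fs)
        out = []
        for k in range(min(len(target), len(noise)) // sec):
            seg = target[k * sec:(k + 1) * sec]
            ps = np.mean(seg ** 2)
            pn = np.mean(noise[k * sec:(k + 1) * sec] ** 2) + 1e-12
            speed = speed_speech if ps / pn >= snr_thresh else speed_noise   # SNR-based decision
            idx = np.arange(0, len(seg), speed)                              # naive time scaling
            out.append(np.interp(idx, np.arange(len(seg)), seg))
        return np.concatenate(out)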
[0065]
FIG. 19 is a flowchart of the processing that extracts and reproduces only the information of the sound source direction selected by the user. Steps 1901 to 1904 are the same as the corresponding processing in FIG. 18. In this flow, in the section deletion 1905, sections not determined to be target sound sections are removed from the reproduction sections, and in the section retention 1906, sections determined to be target sound sections are kept as reproduction sections. The processing for each section 1907 moves the processing to the next section. In the reproduction 1908 of the set reproduction sections, the set reproduction sections are reproduced from the speaker and the processing then ends.
[0066]
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a hardware block diagram of the noise suppression apparatus of the present invention.
FIG. 2 is a software block diagram of the noise suppression apparatus of the present invention.
FIG. 3 is a processing flow diagram of the noise suppression apparatus of the present invention.
FIG. 4 is a detailed block diagram of the filtering unit of the noise suppression apparatus of the present invention.
FIG. 5 shows the effect of the noise suppression method of the present invention.
FIG. 6 is a processing flow diagram of the blind noise suppression apparatus of the present invention.
FIG. 7 shows an example of the mixing degree processing in the blind noise suppression apparatus of the present invention.
FIG. 8 shows a configuration example of the time-frequency-domain sound source separation of the blind noise suppression apparatus of the present invention.
FIG. 9 is a block diagram of the signal processing apparatus that performs noise suppression and dereverberation simultaneously.
FIG. 10 is a detailed block diagram of the multi-channel distortion correction processing in the present invention.
FIG. 11 is a detailed block diagram of the one-channel distortion correction processing in the present invention.
FIG. 12 is a hardware block diagram in the case of applying the present invention to a meeting assistance system or a voice monitoring system.
FIG. 13 shows an example of the screen display of the meeting assistance system.
FIG. 14 shows the software block configuration of the meeting assistance system.
FIG. 15 is a flowchart of the user interface of the meeting assistance system and its internal processing.
FIG. 16 is a block diagram of the abnormal sound detection apparatus that applies the present invention to a voice monitoring system.
FIG. 17 is a processing flow diagram of the voice monitoring system.
FIG. 18 is a processing flow diagram using speech-speed conversion processing in the reproduction processing of the present invention.
FIG. 19 is a processing flow diagram using silence deletion processing in the reproduction processing of the present invention.
FIG. 20 shows the correspondence between the software blocks and the hardware of the noise suppression apparatus of the present invention.
Explanation of sign
[0067]
101: microphone array, 102: AD conversion device, 103: central processing unit, 104: volatile memory, 105: non-volatile memory, 106: DA conversion device, 107: speaker, 201: waveform acquisition unit, 202: filter adaptation processing unit, 203: filtering unit, 204: filter data, 205: waveform reproduction unit, 401: multi-channel spatial prediction unit, 402: delay processing unit, 403: multi-channel distortion correction unit, 404: 1-channel distortion correction unit, 901: waveform acquisition unit, 902: filter adaptation processing unit, 903: filtering unit, 904: filter data, 905: target sound section extraction unit, 906: target sound transfer characteristic learning unit, 907: dereverberation filter, 908: real-time waveform capture unit, 909: dereverberation unit, 910: real-time waveform reproduction unit, 1001: residual noise estimation unit, 1002: target sound estimation unit, 1003: residual noise covariance estimation unit, 1004: μ multiplication unit, 1005: target sound covariance estimation unit, 1006: inverse matrix operation unit, 1007: delay processing unit, 1008: target sound correlation matrix estimation unit, 1009: matrix multiplication unit, 1101: multi-channel distortion correction unit, 1102: delay processing unit, 1103: noise covariance estimation unit, 1104: μ multiplication unit, 1105: target sound covariance estimation unit, 1106: inverse matrix operation unit, 1107: noise correlation estimation unit, 1108: matrix multiplication unit, 1109: noise estimation unit, 1110: least squares filter estimation unit, 1201: microphone array, 1202: camera, 1203: AD converter, 1204: computer, 1205: HUB, 1206: HUB, 1207: computer, 1208: computer, 1209: speaker, 1210: display device, 1301: screen, 1301-1: camera image display unit, 1301-2: sound source position display unit, 1301-3: speech timing display unit, 1301-4: thumbnail image display unit, 1401: sound capture unit, 1403: image capture unit, 1404: data transmission unit, 1405: data reception unit, 1406: conference data request unit, 1407: data reception unit, 1408: data transmission unit, 1409: acoustic information generation unit, 1410: per-site data, 1411: audio reproduction unit, 1412: user I/F processing unit, 1601: sound source extraction unit, 1602: pattern matching unit, 1603: abnormal sound database, 1604: abnormal sound determination unit, 1605: alert transmission unit