Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2015226104
PROBLEM TO BE SOLVED: To stably separate sound sources even when the relative positional relationship between a sound source and the sound collection device changes. SOLUTION: The device includes a sound pickup unit that picks up acoustic signals of a plurality of channels; a detection unit that detects a change in the relative positional relationship between a sound source and the sound pickup unit; a phase adjustment unit that adjusts the phase of the acoustic signal according to the amount of relative position change detected by the detection unit; a parameter estimation unit that estimates, from the phase-adjusted signal, the sound source separation parameters, namely the variance of each source signal and its spatial correlation matrix; and a sound source separation unit that generates a separation filter from the estimated parameters and performs sound source separation. [Selected figure] Figure 1
Sound source separation device and sound source separation method
[0001]
The present invention relates to sound source separation technology.
[0002]
Video cameras, and more recently digital cameras, can shoot moving pictures, and with this the opportunities to collect (record) sound have increased. A problem is that voices other than those of the subject being photographed are mixed in when sound is collected. Research on extracting only a desired signal from an acoustic signal in which voices from multiple sound sources are mixed, for example sound source separation by array signal processing using multiple microphone signals, such as beamforming and independent component analysis (ICA), is therefore widely practiced.
[0003]
However, conventional sound source separation technology using array signal processing has the problem that it cannot simultaneously separate more sound sources than the number of microphones (the underdetermined problem). A sound source separation method using a multichannel Wiener filter is known as a method for solving this problem (Non-Patent Document 1).
[0004]
Non-Patent Document 1 is briefly described here. Consider a situation where sound source signals sj (j = 1, 2, ..., J) emitted from J sound sources are picked up by M (≥ 2) microphones. Here, the number of microphones is set to 2 for simplicity. The observation signal X observed by the two microphones can be written as follows, where [·]^T represents the transpose of a matrix and t represents time. Subjecting the observation signal to time-frequency conversion gives X(n, f), where f denotes a frequency bin and n the frame index (n = 1, 2, ..., N).
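The equations referenced here are images in the original patent and do not survive the translation. A plausible reconstruction under standard array-processing notation (an assumption, not the patent's own typography) is:

\[ \mathbf{x}(t) = [x_1(t)\ \ x_2(t)]^{\mathsf{T}}, \qquad \mathbf{x}(n,f) = [x_1(n,f)\ \ x_2(n,f)]^{\mathsf{T}}, \]

where x1 and x2 are the two microphone signals and x(n, f) is the short-time Fourier transform of x(t).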
[0005]
Assuming that the transfer characteristic from sound source j to the microphones is hj(f) and that the per-source signal observed at the microphones (hereinafter referred to as a source image) is cj(n, f), the observation signal can be written as the following superposition of source images. Here it is assumed that the sound source positions do not move during the sound collection time, so that the transfer characteristic hj(f) from each sound source to the microphones does not change with time.
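Reconstructing the missing equation from the definitions just stated, the superposition reads:

\[ \mathbf{c}_j(n,f) = \mathbf{h}_j(f)\, s_j(n,f), \qquad \mathbf{x}(n,f) = \sum_{j=1}^{J} \mathbf{c}_j(n,f). \]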
[0006]
Furthermore, let Rcj(n, f) be the correlation matrix of the source image, vj(n, f) the variance of the source signal in each time-frequency bin, and Rj(f) the time-independent spatial correlation matrix of each source. These are assumed to satisfy the following relation, where (·)^H denotes the Hermitian transpose.
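Written out from the definitions in this paragraph, the assumed relation is:

\[ R_{c_j}(n,f) = v_j(n,f)\, R_j(f), \qquad R_{c_j}(n,f) = E\bigl[\mathbf{c}_j(n,f)\, \mathbf{c}_j(n,f)^{\mathsf{H}}\bigr]. \]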
[0007]
Using the above relationships, the probability that the observation signal is observed as the superposition of all source images is formulated, and from it the parameters are estimated using the EM algorithm.
[0008]
By repeating the above calculation, the parameters Rcj(n, f) (= vj(n, f) · Rj(f)) and Rx(n, f) for generating a multichannel Wiener filter that performs sound source separation can be obtained. An estimate of the source image cj(n, f), the observation signal attributable to each sound source, is then output using the calculated parameters as follows.
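The output equation is missing from the translation; in the model of Non-Patent Document 1, the multichannel Wiener estimate takes the form (a reconstruction following that paper's notation):

\[ \hat{\mathbf{c}}_j(n,f) = R_{c_j}(n,f)\, R_x(n,f)^{-1}\, \mathbf{x}(n,f), \qquad R_x(n,f) = \sum_{j=1}^{J} R_{c_j}(n,f). \]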
[0009]
N. Q. K. Duong, E. Vincent, and R. Gribonval, "Under-Determined Reverberant Audio Source Separation Using a Full-Rank Spatial Covariance Model," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1830-1840, September 2010.
[0010]
The above conventional method assumes that the sound source positions do not move during the sound collection time, in order to obtain the spatial correlation matrix stably. It therefore has the problem that stable sound source separation cannot be performed when the relative position between a sound source and the sound collection device changes (for example, when the sound source itself is moving, or when a sound collection device such as a microphone array rotates or moves).
[0011]
The present invention has been made to solve the above problems, and its object is to provide a technology that enables stable sound source separation even when the relative position between a sound source and the sound collection device changes.
[0012]
To solve this problem, the sound source separation device of the present invention has, for example, the following configuration. That is, it comprises: sound collection means for collecting acoustic signals of a plurality of channels; detection means for detecting a change in the relative positional relationship between a sound source and the sound collection means; phase adjustment means for adjusting the phase of the acoustic signal according to the amount of relative position change detected by the detection means; parameter estimation means for estimating sound source separation parameters from the phase-adjusted acoustic signal; and sound source separation means for generating a separation filter from the parameters estimated by the parameter estimation means and performing sound source separation.
[0013]
According to the present invention, even when the relative positional relationship between the
sound source and the sound collection device changes, sound source separation can be
performed stably.
[0014]
FIG. 1 is a block diagram of a sound source separation device according to a first embodiment. FIG. 2 is a diagram for explaining phase adjustment. FIG. 3 is a flowchart showing the processing procedure in the first embodiment. FIG. 4 is a block diagram of the sound source separation device in a second embodiment. FIG. 5 is a diagram for explaining rotation of the sound collection unit. FIG. 6 is a flowchart showing the processing procedure in the second embodiment. FIG. 7 is a block diagram of the sound source separation device in a third embodiment. FIG. 8 is a flowchart showing the processing procedure in the third embodiment.
[0015]
Hereinafter, embodiments of the present invention will be described in detail with reference to
the accompanying drawings. In addition, the structure shown in the following embodiment is
only an example, and this invention is not limited to the illustrated structure.
[0016]
First Embodiment FIG. 1 is a block diagram of a sound source separation apparatus 1000
according to the first embodiment. The sound source separation apparatus 1000 includes a
sound collection unit 1010, an imaging unit 1020, a frame division unit 1030, an FFT unit 1040,
a relative position change detection unit 1050, and a phase adjustment unit 1060. The apparatus
1000 further includes a parameter estimation unit 1070, a separation filter generation unit
1080, a sound source separation unit 1090, an inverse phase adjustment unit 1100, an inverse
FFT unit 1110, a frame combination unit 1120, and an output unit 1130.
[0017]
The sound collection unit 1010 is a microphone array composed of a plurality of microphones, and collects sound source signals generated from a plurality of sound sources. The collected audio signals of the plurality of channels are A/D converted and output to the frame division unit 1030.
[0018]
The imaging unit 1020 is a camera that captures moving images or still images, and outputs the captured image signal to the relative position change detection unit 1050. Here, the imaging unit 1020 is, for example, a camera capable of turning 360 degrees, and can therefore monitor the sound source position at all times. Further, the positional relationship between the imaging unit 1020 and the sound collection unit 1010 is fixed. That is, the direction of the sound collection unit 1010 changes together with a change in the imaging direction of the imaging unit 1020 (a change in the pan-tilt values).
[0019]
The frame division unit 1030 applies a window function to the input signal while shifting it in time little by little, cuts out the signal in predetermined time intervals, and outputs each segment as a frame signal to the FFT unit 1040. The FFT unit 1040 performs an FFT (Fast Fourier Transform) on each input frame signal. That is, a spectrogram obtained by time-frequency conversion of the input signal is output for each channel to the phase adjustment unit 1060.
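As a concrete illustration of these two stages, here is a minimal sketch in Python/numpy, assuming a Hann window; the function name and the default frame and hop sizes are illustrative, not from the patent.

import numpy as np

def stft(x, frame_len=1024, hop=256):
    # Frame-divide a mono signal with a Hann window and FFT each frame.
    # Returns shape (n_frames, frame_len // 2 + 1): one complex spectrum
    # per frame, i.e. the spectrogram of one channel.
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

# Usage: one spectrogram per microphone channel.
# multi = A/D-converted signals, shape (n_channels, n_samples)
# specs = [stft(ch) for ch in multi]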
[0020]
The relative position change detection unit 1050 detects, from the input image signal, the relative positional relationship between the sound source and the sound collection unit 1010, which changes with time, using for example an image recognition technique. For example, the position of the face of a subject serving as a sound source is detected by face recognition within the frame of the image captured by the imaging unit 1020. Alternatively, the amount of change between the sound source and the sound collection unit 1010 may be detected by acquiring the amount of change in the imaging direction of the imaging unit 1020 (the change in pan and tilt values) over time. It is desirable that the rate at which the sound source position is detected match the shift amount of the cutout interval in the frame division unit 1030. When they differ, the detected positional relationship may, for example, be interpolated or resampled so that the sound source position signal matches the shift amount of the cutout interval. The detected relative positional relationship between the sound collection unit 1010 and the sound source is output to the phase adjustment unit 1060. Here, the relative positional relationship indicates, for example, the direction (angle) of the sound source with respect to the sound collection unit 1010.
[0021]
The phase adjustment unit 1060 performs phase adjustment on the input frequency spectrum. An example of the phase adjustment will be described using FIG. 2. The microphones are assumed to be the two channels L0 and R0, and, as shown in FIG. 2(a), the relative position of the sound source A and the sound collection unit 1010 changes with time by θ(t). Assuming that the sound source is far enough away compared with the distance between the microphones L0 and R0, the phase difference Pdiff(n) of the signals arriving at the microphones L0 and R0 can be expressed as follows, where f is the frequency, d the distance between the microphones, c the speed of sound, and tn the time corresponding to the n-th frame.
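The phase-difference equation is an image in the original; under the stated far-field assumption for a two-microphone array it would take the form (a reconstruction; the sign convention depends on the chosen geometry):

\[ P_{\mathrm{diff}}(n,f) = \frac{2\pi f\, d \sin\theta(t_n)}{c}. \]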
[0022]
The phase adjustment unit 1060 corrects the signal of the microphone R0 so as to cancel Pdiff, so that the phase difference between L0 and R0 disappears. Here, XR represents the observation signal at the microphone R0, and XRcomp the phase-adjusted signal. By performing this phase adjustment for each frame, the inter-channel phase difference no longer changes with time; as shown in FIG. 2(b), the moving sound source can therefore be treated as if it were a sound source AFIX fixed in the front direction.
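A sketch of this per-frame correction, assuming the Pdiff reconstruction above; the mic spacing, sampling rate, and the sign of the correction are illustrative assumptions.

import numpy as np

def compensate_phase(spec_r, thetas, d=0.05, c=343.0, fs=16000):
    # spec_r : complex spectrogram of R0, shape (n_frames, n_bins)
    # thetas : detected source direction theta(t_n) per frame, in radians
    # d, c   : mic spacing [m] and speed of sound [m/s] (illustrative values)
    n_frames, n_bins = spec_r.shape
    freqs = np.fft.rfftfreq(2 * (n_bins - 1), d=1.0 / fs)  # bin centres [Hz]
    # P_diff(n, f) = 2*pi*f*d*sin(theta(t_n)) / c
    p_diff = 2 * np.pi * freqs[None, :] * d * np.sin(thetas)[:, None] / c
    # Multiply by exp(+j*P_diff) so the R0 phase lines up with L0.
    return spec_r * np.exp(1j * p_diff)

The inverse phase adjustment unit 1100 described later would multiply by the conjugate factor exp(-j*P_diff) to undo this correction.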
[0023]
When there are multiple sound sources, phase adjustment is performed for each sound source. That is, when sound sources A and B are present, a signal in which the relative position change of sound source A is corrected and a signal in which the relative position change of sound source B is corrected are both generated. The phase-adjusted signals are output to the parameter estimation unit 1070 and the sound source separation unit 1090, and the applied phase adjustment amounts are output to the inverse phase adjustment unit 1100.
[0024]
The parameter estimation unit 1070 estimates, from the input phase-adjusted signal, the spatial correlation matrix Rj(f), the variance vj(n, f), and the correlation matrix Rxj(n, f) for each sound source using the EM algorithm.
[0025]
Here, parameter estimation will be briefly described. The sound collection unit 1010 has two microphones L0 and R0 placed in free space, and the case of two sound sources (A and B) is considered. It is assumed that sound source A has a positional relationship of θ(tn) with respect to the sound collection unit 1010 at time tn, and sound source B a positional relationship of Φ(tn). The signals phase-adjusted for each sound source, input from the phase adjustment unit 1060, are XA and XB respectively. By the phase adjustment, sound sources A and B are each treated as fixed in the front direction (0 degrees).
[0026]
First, parameter estimation is performed using the phase-adjusted signal XA. Since sound source A is fixed in the 0-degree direction, the spatial correlation matrix RA is initialized as follows, where hA represents the array manifold vector in the front direction. The array manifold vector takes the first microphone as a reference point and the sound source direction as Θ. Since sound source A is in the 0-degree direction, hA = [1 1]^T. The matrix for sound source B, on the other hand, is initialized as follows. h′B is the array manifold vector of sound source B in the state where sound source A is fixed in the 0-degree direction, and can be written as follows. For example, the following value is used for δ(f).
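The initialization equations are images in the original and cannot be recovered exactly; one plausible form, assuming a rank-1 initialization regularized by a small weight δ(f) on the orthogonal direction, and with h′B evaluated from the relative angle Φ(tn) − θ(tn) of source B in the A-fixed frame, would be:

\[ R_A(f) = \mathbf{h}_A \mathbf{h}_A^{\mathsf{H}}, \qquad R_B(f) = \mathbf{h}'_B {\mathbf{h}'_B}^{\mathsf{H}} + \delta(f)\, \mathbf{h}'^{\perp}_B {\mathbf{h}'^{\perp}_B}^{\mathsf{H}}, \qquad \mathbf{h}'_B = \bigl[1\ \ e^{-i 2\pi f d \sin(\Phi(t_n)-\theta(t_n))/c}\bigr]^{\mathsf{T}}. \]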
[0027]
Further, the variance vA of sound source A and the variance vB of sound source B are initialized with random values such that vA > 0 and vB > 0.
[0028]
The parameters for sound source A are then estimated using the EM algorithm as follows. Here, tr(·) denotes the trace, the sum of the diagonal components of a matrix.
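The patent's update equations are missing images; the corresponding EM updates for the full-rank spatial covariance model of Non-Patent Document 1 are, as a reconstruction (M is the number of microphones, I the identity matrix, Wj the Wiener gain):

\[
\begin{aligned}
W_j(n,f) &= R_{c_j}(n,f)\,R_x(n,f)^{-1}, \qquad \hat{\mathbf{c}}_j(n,f) = W_j(n,f)\,\mathbf{x}(n,f),\\
\hat{R}_{c_j}(n,f) &= \hat{\mathbf{c}}_j(n,f)\,\hat{\mathbf{c}}_j(n,f)^{\mathsf{H}} + \bigl(I - W_j(n,f)\bigr)\,R_{c_j}(n,f),\\
v_j(n,f) &= \tfrac{1}{M}\,\mathrm{tr}\!\bigl(R_j(f)^{-1}\,\hat{R}_{c_j}(n,f)\bigr), \qquad
R_j(f) = \tfrac{1}{N}\sum_{n=1}^{N}\frac{\hat{R}_{c_j}(n,f)}{v_j(n,f)}.
\end{aligned}
\]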
[0029]
Subsequently, the calculated spatial correlation matrix RA(f) is subjected to eigenvalue decomposition. Let DA1 and DA2 be its eigenvalues in descending order.
[0030]
Subsequently, parameter estimation is performed using the phase-adjusted signal XB. Since sound source B is fixed in the 0-degree direction, its matrix is initialized as follows, where hB represents the array manifold vector in the front direction and hB = [1 1]^T. Sound source A is initialized as follows; here the array manifold vector h′A of sound source A can be written as follows, and h′A⊥ denotes a vector orthogonal to h′A.
[0031]
After that, vB(n, f) and RB(f) are calculated using the EM algorithm, as in the case of sound source A.
[0032]
Thus, the parameters are estimated by iterative calculation using the signals (XA, XB) that were phase-adjusted differently for each sound source. The iteration is performed a predetermined number of times, or until the change in likelihood becomes sufficiently small.
[0033]
The estimated variance vj(n, f), spatial correlation matrix Rj(f), and correlation matrix Rxj(n, f) are output to the separation filter generation unit 1080. Here j denotes the sound source number; in the present embodiment, j = A, B.
[0034]
The separation filter generation unit 1080 generates a separation filter for separating the input signal using the input parameters. For example, the following multichannel Wiener filter WFj is generated from the spatial correlation matrix Rj(f), the variance vj(n, f), and the correlation matrix Rxj(n, f) of each sound source.
[0035]
The sound source separation unit 1090 applies the separation filter generated by the separation filter generation unit 1080 to the signal output from the FFT unit 1040. The signal Yj(n, f) obtained by the filtering is output to the inverse phase adjustment unit 1100.
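A compact sketch of the filter generation and application of [0034] and [0035] in Python/numpy, assuming the WFj = Rcj Rx^{-1} form reconstructed earlier; array shapes and names are illustrative.

import numpy as np

def apply_mwf(x, v, R):
    # x : observed spectra, shape (N frames, F bins, M mics), complex
    # v : per-source variances vj(n, f), shape (J, N, F)
    # R : per-source spatial correlation matrices Rj(f), shape (J, F, M, M)
    # Rcj(n, f) = vj(n, f) * Rj(f); Rx(n, f) is the sum over sources.
    Rc = v[..., None, None] * R[:, None]   # (J, N, F, M, M)
    Rx = Rc.sum(axis=0)                    # (N, F, M, M)
    # Multichannel Wiener gain WFj = Rcj Rx^{-1}, per time-frequency bin.
    # np.linalg.inv assumes Rx is well conditioned; in practice a small
    # diagonal loading term is usually added.
    W = Rc @ np.linalg.inv(Rx)
    # Source-image estimates Yj(n, f) = WFj x(n, f).
    return np.einsum('jnfab,nfb->jnfa', W, x)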
[0036]
The inverse phase adjustment unit 1100 performs phase adjustment on the input separated sound signal so as to cancel the phase adjustment applied by the phase adjustment unit 1060. That is, the phase of the signal is adjusted so that the fixed sound source moves again. For example, if the phase of the R0-side signal was adjusted by γ in the phase adjustment unit 1060, the phase of the R0-side signal is adjusted by −γ in the inverse phase adjustment unit 1100. The phase-adjusted signal is output to the inverse FFT unit 1110.
[0037]
The inverse FFT unit 1110 performs an IFFT (Inverse Fast Fourier Transform) to convert the input phase-adjusted frequency spectrum into a time waveform signal. The converted time waveform signal is output to the frame combination unit 1120. The frame combination unit 1120 combines the input per-frame time waveform signals while overlapping them, and outputs the combined signal to the output unit 1130. The output unit 1130 outputs the input separated sound signal to, for example, a recording device.
[0038]
Next, the flow of signal processing will be described using FIG. 3. First, the sound collection unit 1010 and the imaging unit 1020 perform sound collection and image capture processing (S1010). The sound collection unit 1010 outputs the collected sound signal to the frame division unit 1030, and the imaging unit 1020 outputs an image signal of the surroundings of the sound collection unit 1010 to the relative position change detection unit 1050.
[0039]
Subsequently, the frame division unit 1030 performs frame division processing of the acoustic
signal, and outputs the frame-divided acoustic signal to the FFT unit 1040 (S1020). The FFT unit
1040 performs FFT processing on the frame-divided signal, and outputs the FFT-processed signal
to the phase adjustment unit 1060 (S1030).
[0040]
The relative position change detection unit 1050 detects the relative positional relationship between the sound collection unit 1010 and the sound source for each time, and outputs information indicating the detected relative positional relationship for each time to the phase adjustment unit 1060 (S1040). The phase adjustment unit 1060 adjusts the phase of the signal (S1050). The signal phase-adjusted for each sound source is output to the parameter estimation unit 1070 and the sound source separation unit 1090, and the phase adjustment amount is output to the inverse phase adjustment unit 1100.
[0041]
The parameter estimation unit 1070 estimates the parameters for generating the sound source separation filter (S1060). The parameter estimation in S1060 is repeated until the iteration end determination in S1070 decides to stop; when the iteration is complete, the parameter estimation unit 1070 outputs the estimated parameters to the separation filter generation unit 1080. The separation filter generation unit 1080 generates a separation filter from the input parameters, and outputs the generated multichannel Wiener filter to the sound source separation unit 1090 (S1080).
[0042]
Subsequently, the sound source separation unit 1090 performs sound source separation processing (S1090). That is, the sound source separation unit 1090 applies multichannel Wiener filtering to the input phase-adjusted signal to separate the signals. The separated signals are output to the inverse phase adjustment unit 1100.
[0043]
Subsequently, the inverse phase adjustment unit 1100 performs inverse phase adjustment processing to restore the phase adjusted in the phase adjustment unit 1060 for the input separated sound signal, and outputs the inverse-phase-adjusted signal to the inverse FFT unit 1110 (S1100). The inverse FFT unit 1110 performs inverse FFT processing (IFFT processing), and outputs the processing result to the frame combination unit 1120 (S1110).
[0044]
The frame combination unit 1120 performs frame combination processing to combine the per-frame time waveform signals input from the inverse FFT unit 1110, and outputs the combined time waveform signal of the separated sound to the output unit 1130 (S1120). The output unit 1130 outputs the input time waveform signal of the separated sound (S1130).
[0045]
As described above, even when the relative position of a sound source and the sound collection unit changes, the relative position is detected and the phase of the input signal is adjusted for each sound source, which makes it possible to separate the sound sources stably.
[0046]
Although the sound collection unit 1010 has two channels in the present embodiment, this is for simplicity of description, and the number of microphones may be two or more. Further, in the present embodiment, the imaging unit 1020 is an omnidirectional camera capable of capturing an omnidirectional image, but it may be a normal camera as long as it can constantly monitor the subject serving as the sound source. If the shooting location is a space bounded by walls, such as a room, a camera installed in a corner of the room with an angle of view covering the entire room suffices, and an omnidirectional camera is not necessary.
[0047]
Although the sound collection unit and the imaging unit are fixed to each other in the present embodiment, they may move independently. In that case, means for detecting the positional relationship between the sound collection unit and the imaging unit is further provided, and the relative positional relationship is corrected according to the detection result. For example, when the imaging unit is installed on a rotary camera platform and the sound collection unit is fixed to the (non-rotating) pedestal of that platform, it suffices to correct the sound source position using the rotation amount of the rotary camera platform.
[0048]
In the present embodiment, the relative position change detection unit 1050 assumes that a person's speech is the sound source, and detects the positional relationship between the sound source and the sound collection unit by face recognition. However, the sound source may be something other than a person, such as a loudspeaker or a car; in that case, the relative position change detection unit 1050 may detect the positional relationship between the sound source and the sound collection unit by performing object recognition on the input image.
[0049]
In the present embodiment, the acoustic signal is input from the sound collection unit, and the relative position change is detected from the image input from the imaging unit. However, if both the acoustic signal and the relative positional relationship between the sound collection device that recorded it and the sound source are stored on a recording medium such as a hard disk, the data may instead be read from the recording medium. That is, an acoustic signal input unit may be provided instead of the sound collection unit of the present embodiment, a relative positional relationship input unit may be provided instead of the imaging unit, and the acoustic signal and the relative positional relationship may be read from the storage device.
[0050]
In the present embodiment, the relative position change detection unit 1050 includes the imaging unit 1020, and detects the positional relationship between the sound collection unit 1010 and the sound source from the image acquired from the imaging unit 1020. However, any means may be used as long as it can detect the relative positional relationship between the sound collection unit 1010 and the sound source. For example, each of the sound source and the sound collection unit may be provided with GPS (Global Positioning System), and the relative position change detected from them.
[0051]
In the present embodiment, the phase adjustment unit operates after the FFT unit, but it may instead be placed before the FFT unit, in which case the phase adjustment unit adjusts the delay of the time-domain signal. Likewise, the order of the inverse phase adjustment unit and the inverse FFT unit may be reversed.
[0052]
In the present embodiment, phase adjustment is applied only to the R0-side signal in the phase adjustment unit, but it may be applied to the L0-side signal, or to both signals. Further, in the phase adjustment unit, the sound source position is fixed in the 0-degree direction when fixing the position of the sound source; however, phase adjustment may be performed so that the sound source position is fixed at another angle.
[0053]
In the present embodiment, the sound collection unit is assumed to consist of microphones placed in free space, but it may be placed in an environment that includes the influence of a housing. In that case, it is preferable to measure in advance the transfer characteristics, including the influence of the housing, for each direction, and to use these transfer characteristics as the array manifold vector. In that case, not only the phase but also the amplitude is adjusted in the phase adjustment unit and the inverse phase adjustment unit.
[0054]
In this embodiment, the array manifold vector is defined with the first microphone as the reference point, but any reference point may be used; for example, the midpoint between the first and second microphones may serve as the reference point.
[0055]
Second Embodiment FIG. 4 is a block diagram of a sound source separation apparatus 2000 according to a second embodiment. The apparatus 2000 includes a sound collection unit 1010, a frame division unit 1030, an FFT unit 1040, a phase adjustment unit 1060, a parameter estimation unit 1070, a separation filter generation unit 1080, a sound source separation unit 1090, an inverse FFT unit 1110, a frame combination unit 1120, and an output unit 1130. The apparatus 2000 further includes a rotation detection unit 2050 and a parameter adjustment unit 2140.
[0056]
The sound collection unit 1010, frame division unit 1030, FFT unit 1040, sound source separation unit 1090, inverse FFT unit 1110, frame combination unit 1120, and output unit 1130 are substantially the same as in the first embodiment, and their description is therefore omitted.
[0057]
In the second embodiment, it is assumed that the sound sources do not move during the sound collection time, while the sound collection unit 1010 is rotated, for example by the user's handling, so that the relative position of the sound collection unit 1010 and the sound sources changes with time. Here, rotation of the sound collection unit 1010 refers to rotation of the microphone array by pan, tilt, and roll operations of the sound collection unit 1010. For example, as shown in FIG. 5(a), when the microphone array serving as the sound collection unit rotates from the state (L0, R0) to the state (L1, R1) with respect to the position-fixed sound source C1, it appears from the microphone array as if the sound source had moved from C2 to C3, as shown in FIG. 5(b).
[0058]
The rotation detection unit 2050 is, for example, an acceleration sensor, and detects the rotation
of the sound collection unit 1010 during the sound collection time. The rotation detection unit
2050 outputs the detected rotation amount to the phase adjustment unit 1060 as, for example,
angle information.
[0059]
The phase adjustment unit 1060 performs phase adjustment based on the input rotation amount of the sound collection unit 1010 and the sound source direction input from the parameter estimation unit 1070. For the sound source direction, an arbitrary direction is given as an initial value for each sound source only at the very beginning. For example, if the sound source direction is α and the rotation amount of the sound collection unit 1010 is β(n), the inter-channel phase difference is as follows. The phase adjustment unit 1060 adjusts the phase so as to cancel this inter-channel phase difference, outputs the phase-adjusted signal to the parameter estimation unit 1070, and outputs the phase adjustment amount to the parameter adjustment unit 2140. The parameter estimation unit 1070 performs parameter estimation on the phase-adjusted signal.
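The equation is missing; by analogy with the first embodiment's far-field phase difference, a plausible reconstruction is:

\[ P_{\mathrm{diff}}(n,f) = \frac{2\pi f\, d \sin\bigl(\alpha + \beta(n)\bigr)}{c}. \]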
[0060]
The parameter estimation method is almost the same as in the first embodiment. However, in the second embodiment, principal component analysis of the estimated spatial correlation matrix Rj(f) is additionally performed to estimate the sound source direction γ′. Here, with γ the direction in which the sound source was fixed by the phase adjustment unit 1060, α + γ′ − γ is output to the phase adjustment unit 1060 as the sound source direction. The estimated variance vj(n, f) and spatial correlation matrix Rj(f) are output to the parameter adjustment unit 2140.
[0061]
The parameter adjustment unit 2140 calculates a time-varying spatial correlation matrix Rj_new(n, f) using the input spatial correlation matrix Rj(f) and the phase adjustment amount. For example, if the phase adjustment amount of the R channel is η(n, f), the parameters used for filter generation are adjusted as follows.
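One plausible form of this adjustment (an assumption; the patent's equation is not recoverable from the translation) is to reapply the per-frame R-channel phase to the time-independent matrix by conjugation with a diagonal phase matrix:

\[ R_j^{\mathrm{new}}(n,f) = Q(n,f)\, R_j(f)\, Q(n,f)^{\mathsf{H}}, \qquad Q(n,f) = \mathrm{diag}\bigl(1,\ e^{-i\eta(n,f)}\bigr). \]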
[0062]
The parameter adjustment unit 2140 outputs the adjusted spatial correlation matrix Rj_new(n, f) and the variance vj(n, f) to the separation filter generation unit 1080. Using these, the separation filter generation unit 1080 generates a separation filter as follows.
[0063]
Then, the separation filter generation unit 1080 outputs the generated filter to the sound source
separation unit 1090.
[0064]
Subsequently, the signal processing flow in the second embodiment will be described with reference to FIG. 6. First, the sound collection unit 1010 performs sound collection processing, and the rotation detection unit 2050 detects the rotation amount of the sound collection unit 1010 (S2010). The sound collection unit 1010 outputs the collected sound signal to the frame division unit 1030. The rotation detection unit 2050 outputs information indicating the detected rotation amount of the sound collection unit 1010 to the phase adjustment unit 1060. The subsequent frame division (S2020) and FFT processing (S2030) are substantially the same as in the first embodiment, and their description is omitted.
[0065]
The phase adjustment unit 1060 performs phase adjustment processing (S2040). That is, the phase adjustment unit 1060 calculates the phase adjustment amount from the sound source position input from the parameter estimation unit 1070 and the rotation amount of the sound collection unit 1010, and applies phase adjustment to the signal input from the FFT unit 1040. The phase adjustment unit 1060 then outputs the phase-adjusted signal to the parameter estimation unit 1070.
[0066]
Subsequently, the parameter estimation unit 1070 estimates the sound source separation parameters (S2050), and then determines whether the iteration should end (S2060). If the iteration does not end, the parameter estimation unit 1070 outputs the estimated sound source position to the phase adjustment unit 1060, and phase adjustment (S2040) and parameter estimation (S2050) are performed again. When it is determined that the iteration ends, the phase adjustment unit 1060 outputs the phase adjustment amount to the parameter adjustment unit 2140, and the parameter estimation unit 1070 outputs the estimated parameters to the parameter adjustment unit 2140.
[0067]
Subsequently, the parameter adjustment unit 2140 adjusts the parameters (S2070). That is, the parameter adjustment unit 2140 adjusts the estimated spatial correlation matrix Rj(f), which is a sound source separation parameter, using the input phase adjustment amount. The adjusted spatial correlation matrix Rj_new(n, f) and the variance vj(n, f) are output to the separation filter generation unit 1080.
[0068]
The subsequent sound source separation filter generation (S2080), sound source separation processing (S2090), inverse FFT processing (S2100), frame combination processing (S2110), and output (S2120) are substantially the same as in the first embodiment, and their description is therefore omitted.
[0069]
As described above, even when the relative position between a sound source and the sound collection unit changes, the sound sources can be separated stably by detecting the relative position of the sound source and the sound collection unit. That is, the sound source separation filter can be generated stably by estimating the parameters from the phase-adjusted signal and then correcting the estimated parameters to account for the amount of phase that was adjusted.
[0070]
In the second embodiment the rotation detection unit 2050 is an acceleration sensor, but any device capable of detecting a rotation amount may be used: a gyro sensor, an angular velocity sensor, or a magnetic sensor that detects orientation. Further, as in the first embodiment, an imaging unit may be provided and the rotation angle detected from the image. When the sound collection unit is fixed to a rotating camera platform or the like, the rotation angle of the platform may be detected.
[0071]
Third Embodiment FIG. 7 is a block diagram of a sound source separation apparatus 3000 according to a third embodiment. The apparatus 3000 includes a sound collection unit 1010, a frame division unit 1030, an FFT unit 1040, a rotation detection unit 2050, a parameter estimation unit 3070, a separation filter generation unit 1080, a sound source separation unit 1090, an inverse FFT unit 1110, a frame combination unit 1120, and an output unit 1130.
[0072]
The blocks other than the parameter estimation unit 3070 are substantially the same as those in the first embodiment, and their description is therefore omitted. In the third embodiment, as in the second embodiment, it is assumed that the sound sources do not move during the sound collection time.
[0073]
The parameter estimation unit 3070 performs parameter estimation using the information indicating the rotation amount of the sound collection unit 1010 from the rotation detection unit 2050 and the signal input from the FFT unit 1040. In the EM algorithm for estimation, equations (3) to (6) of the E step and M step are calculated as usual.
[0074]
The method of calculating the spatial correlation matrix is shown below. The time-varying spatial correlation matrix Rj(n, f) is calculated according to the following equation. By subjecting the calculated Rj(n, f) to eigenvalue decomposition (principal component analysis), the sound source direction θj(n, f) can be calculated for each time: the direction is computed from the phase difference between the elements of the eigenvector corresponding to the largest of the eigenvalues obtained by the decomposition. Subsequently, the influence of the rotation of the sound collection unit 1010, input from the rotation detection unit 2050, is removed from the calculated sound source direction θj(n, f). For example, if the rotation amount of the sound collection unit 1010 is ω(n), the relative change of the sound source position is −ω(n); that is, θj_comp(n, f) = θj(n, f) + ω(n) is the sound source direction that would be observed without rotation. Subsequently, a weighted average of the calculated θj_comp(n, f) is taken in the time direction as follows. Since the calculated direction θj_comp(n, f) is more likely to be wrong as the variance vj(n, f) decreases (that is, as the signal amplitude decreases), the average is weighted by vj(n, f).
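The equations here are missing; assuming the two-microphone far-field model of the earlier embodiments, with u = [u1 u2]^T the eigenvector of the largest eigenvalue of Rj(n, f), the direction estimate and its weighted average could plausibly read:

\[ \theta_j(n,f) = \arcsin\!\Bigl(\frac{c}{2\pi f d}\,\angle\frac{u_2(n,f)}{u_1(n,f)}\Bigr), \qquad \theta_j^{\mathrm{ave}}(f) = \frac{\sum_{n} v_j(n,f)\,\theta_j^{\mathrm{comp}}(n,f)}{\sum_{n} v_j(n,f)}. \]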
[0075]
The apparent movement of the sound source due to rotation is then reapplied to the averaged direction θj_ave(f), and the time-varying sound source direction is calculated as follows.
[0076]
Subsequently, with the eigenvalues obtained by the eigenvalue decomposition of Rj(n, f) denoted D1(n, f) and D2(n, f) in descending order, the ratio gj(f) is calculated as follows. Then, the spatial correlation matrix Rj(n, f) is updated from gj(f) as follows, where the updated spatial correlation matrix is expressed using the array manifold vector for the estimated direction.
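A plausible reconstruction of the missing update, assuming gj(f) is the time-averaged eigenvalue ratio and h, h⊥ are the array manifold vector for the estimated direction and its orthogonal complement (the exact averaging is not recoverable from the translation):

\[ g_j(f) = \frac{\sum_n D_2(n,f)}{\sum_n D_1(n,f)}, \qquad R_j(n,f) \leftarrow \mathbf{h}_j(n,f)\,\mathbf{h}_j(n,f)^{\mathsf{H}} + g_j(f)\,\mathbf{h}_j^{\perp}(n,f)\,\mathbf{h}_j^{\perp}(n,f)^{\mathsf{H}}. \]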
[0077]
Since the spatial correlation matrix is a Hermitian matrix, its eigenvectors are orthogonal to each other. Therefore the second basis vector is orthogonal to the array manifold vector, and the following relationship holds.
[0078]
As described above, the parameter estimation unit 3070 calculates the spatial correlation matrix as a time-varying parameter. The parameter estimation unit 3070 then outputs the calculated spatial correlation matrix and the variance vj(n, f) to the separation filter generation unit 1080.
[0079]
Subsequently, the signal processing flow in the third embodiment will be described according to FIG. 8. The detection of the collected sound and the rotation amount (S3010) through the FFT processing (S3030), and the generation of the separation filter (S3060) through the output (S3100), are substantially the same as in the second embodiment, and their description is therefore omitted.
[0080]
The parameter estimation unit 3070 performs parameter estimation processing (S3040), and repeats it until the iteration end determination (S3050) decides that the iteration has ended. If it is determined that the iteration has ended, the parameter estimation unit 3070 outputs the parameters estimated at that stage to the separation filter generation unit 1080.
[0081]
Subsequently, the separation filter generation unit 1080 performs generation processing of the
separation filter, and outputs the generated separation filter to the sound source separation unit
1090 (S3060).
[0082]
As described above, even when the relative position between a sound source and the sound collection unit changes, the relative position is detected and a parameter estimation method that takes the sound source position into account is used, which makes it possible to separate the sound sources stably.
[0083]
In the third embodiment, the parameter estimation unit calculates the sound source direction θj(n) in order to estimate the spatial correlation matrix. However, instead of adjusting the sound source direction, phase adjustment may be performed on the first principal component so as to cancel the rotation of the sound collection unit 1010, and its average value taken.
[0084]
Also, although a weighted average using the variance vj(n, f) is taken when calculating the sound source position, a simple average may be taken instead. In the present embodiment, the sound source direction was calculated independently for each frequency. However, since it is unlikely that the same sound source lies in different directions at different frequencies, it may also be treated as a frequency-independent parameter by averaging in the frequency direction.
[0085]
[Other Embodiments] Although the embodiments have been described in detail, the present invention can take the form of a system, an apparatus, a method, a control program, or a recording medium (storage medium), as long as it has sound collecting means for collecting acoustic signals of a plurality of channels. Specifically, the present invention may be applied to a system configured of a plurality of devices (for example, a host computer, an interface device, an imaging device, a web application, and the like), or to an apparatus consisting of a single device.
[0086]
Further, it goes without saying that the object of the present invention can also be achieved as follows. That is, a recording medium (or storage medium) storing the program code (computer program) of software that realizes the functions of the above-described embodiments is supplied to a system or apparatus. Such a storage medium is, of course, a computer-readable storage medium. The computer (or CPU or MPU) of the system or apparatus then reads out and executes the program code stored in the recording medium. In this case, the program code itself read from the recording medium realizes the functions of the above-described embodiments, and the recording medium on which the program code is recorded constitutes the present invention.
[0087]
DESCRIPTION OF SYMBOLS 1000: sound source separation apparatus, 1010: sound collection unit, 1020: imaging unit, 1030: frame division unit, 1040: FFT unit, 1050: relative position change detection unit, 1060: phase adjustment unit, 1070: parameter estimation unit, 1080: separation filter generation unit, 1090: sound source separation unit, 1100: inverse phase adjustment unit, 1110: inverse FFT unit, 1120: frame combination unit, 1130: output unit