Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2010250152
An object of the present invention is to realize an utterance detection apparatus that requires no calibration work at the installation site, while reducing its dependence on the quality of the microphones and amplifiers and on the noise environment of the installation site.
An utterance detection apparatus detects an utterance in a predetermined direction from acoustic signals acquired by two sound collectors. Low frequency components of the acoustic signals are cut, an artificial signal is added in place of the cut low frequency components, and a cross-correlation is then calculated between the acoustic signals from the two sound collectors to detect an utterance in the predetermined direction.
[Selected figure] Figure 1
Utterance detection device
[0001]
The present invention relates to an utterance detection apparatus that detects an utterance in a predetermined direction from acoustic signals collected by microphones, and in particular to an utterance detection apparatus that calculates normalized cross-correlation values of the acoustic signals from two microphones and detects an utterance based on the calculation result.
[0002]
Conventionally, apparatuses have been proposed in which sound signals emitted from a sound source are collected by a plurality of microphones, the sound signals are processed, a normalized cross-correlation value between the sound signals of the microphones is calculated, and the presence of a sound source in a predetermined direction is detected based on the result.
04-05-2019
1
Patent Document 1 is an example of a conventionally used cross-correlation circuit. In this conventional circuit, the sound signals emitted from a sound source are collected by two microphones on the left and right, and the cross-correlation function is normalized by the average level of the two signals, whereby a normalized cross-correlation value is obtained as the output.
[0003]
JP-A 64-1984
[0004]
However, since the above-mentioned prior art normalizes by the average level of the acoustic signal, a sound source may be reported even when no voice is uttered, due to the influence of environmental noise.
That is, since normalization is performed using the power information of the input acoustic signal, if the background noise happens to contain components whose phases coincide by chance, a high normalized cross-correlation value can arise even when the power of the input signal is small, and there is a risk of falsely detecting an utterance even though none has occurred.
[0005]
Also, as a measure against such background noise, the fact that the power information of the target sound, such as voice, is larger than that of the background noise may be used. Specifically, it is determined that no signal source of the target sound exists when the power information of the input acoustic signal is at or below a predetermined level, which prevents the reduction in accuracy caused by background noise. In this case, it is important to set a threshold that separates the power information of the target sound from that of the background noise. If the power information of the target sound and that of the background noise differ obviously, they can be separated relatively easily. However, the power information fluctuates depending on the variation in sensitivity of the microphones used and on the amplification factor of the amplifier.
[0006]
For this reason, the prior art requires that the sensitivity of the microphones and the amplification factor of the amplifier be constant, forcing the apparatus to be built from very expensive parts. Alternatively, when expensive components of guaranteed quality cannot be used, a calibration operation is required after installation: a reference sound must be generated from a specific place and the threshold separating the target sound from the noise adjusted. This requires a person highly trained in handling the device and increases the cost of installation work.
[0007]
Furthermore, in a general environment the background noise level varies with the installation location and also with time, and because of various factors such as temporary noise other than the target sound, it is not easy to set the threshold separating the background noise from the target sound even if the above-mentioned hardware calibration is performed.
[0008]
Therefore, in order to solve these problems, the present invention aims to realize an utterance detection apparatus that detects an utterance in a predetermined direction without calibration work at the installation site, while reducing its dependence on the quality of the microphones and amplifiers and on the noise environment of the installation site.
[0009]
The present invention is an utterance detection apparatus that detects an utterance in a predetermined direction using acoustic signals acquired by two sound collectors, comprising: a low band cut unit that cuts predetermined low frequency components from the acoustic signals of the two sound collectors; an instantaneous power correlation calculation unit that calculates, for each voice frame, a data set containing the instantaneous power information of each of the two acoustic signals and a specific direction instantaneous cross-correlation value; a storage unit that stores the data set for each voice frame; a frame sorting unit that sorts, from the data sets stored in the storage unit, a predetermined number of acoustic frames whose specific direction instantaneous cross-correlation values have small absolute values; an addition coefficient calculation unit that calculates an artificial sound addition coefficient from the sorted frames; an artificial sound addition unit that adds, to each acoustic signal from which the predetermined low frequency components have been cut, an artificial low frequency signal amplified using the coefficient; a cross correlation calculation unit that calculates a normalized cross-correlation value for the two acoustic signals to which the artificial signal has been added by the artificial sound addition unit; and an utterance detection unit that detects an utterance from the output of the cross correlation calculation unit.
[0010]
In a preferred aspect of the present invention, the addition coefficient calculation unit uses the data sets of the sorted acoustic frames to calculate an artificial sound addition coefficient such that the normalized cross-correlation value in the predetermined direction at the cross correlation calculation unit substantially equals a set target value.
[0011]
Further, in a preferred aspect of the present invention, the artificial signal is a signal of opposite phase for each of the two sound collectors, and the target value is a predetermined normalized cross-correlation value between -1 and 0.
[0012]
According to the present invention, even if there is a variation in hardware quality such as a
variation in microphone sensitivity or an amplification factor of an amplifier, an utterance from a
predetermined direction can be detected accurately.
In addition, no special calibration operation is required as long as the input level of the audio signal falls properly within the range of the A/D converter.
In addition, it is possible to set a threshold for speech detection that is less dependent on the
noise environment at the installation site.
[0013]
FIG. 1 is a block diagram of the utterance detection apparatus 1 to which the present invention is applied. FIG. 2 is an arrangement diagram of the utterance detection apparatus for recognizing an ATM user in a financial institution as a speaker. FIG. 3 is a diagram illustrating an example of processing in the frame cutting out unit 13. FIG. 4 is a diagram explaining the processing transitions of an acoustic frame in a voice section. FIG. 5 is a diagram explaining the relationship between the acoustic frame waveform of a voice section, the acoustic frame waveform of a voiceless section, and their normalized cross-correlation values. FIG. 6 is a functional block diagram of the background noise frame selection unit 16. FIG. 7 is a processing flow of the background noise frame selection unit 16.
[0014]
Hereinafter, preferred embodiments of an utterance detection apparatus to which the present invention is applied will be described with reference to the drawings. FIG. 2 shows an installation example in which the utterance detection apparatus according to the present embodiment is used to extract only the voice of the operator 4 of a CD/ATM 3 of a financial institution. In wire transfer fraud, a criminal may call the victim on a mobile phone and induce the victim to operate the ATM 3 while on the phone, thereby transferring the victim's money to the criminal's account. Under normal circumstances, the operator 4 of the ATM 3 rarely utters a voice while operating the ATM 3. On the other hand, an operator who may be a victim of wire transfer fraud often talks to the other party on a mobile phone while operating at the front of the ATM 3, and therefore often emits a voice. Accordingly, detecting an utterance from the front of the ATM 3 is an important first factor in preventing wire transfer fraud. The utterance detection apparatus according to the present invention therefore accurately detects that the victim is uttering in front of the ATM 3 by analyzing the acoustic signals from the two microphones 2 installed at the upper left and right ends of the ATM 3.
[0015]
FIG. 2 is a diagram showing an arrangement example of the utterance detection apparatus for detecting the utterance of the user 4 of the ATM 3 in a financial institution. The main body device 1 is installed on the wall surface, and two microphones 2 are installed at the left and right ends of the upper part of the ATM 3, separated by a predetermined distance. Although two microphones 2 are used in the present embodiment, the present invention is not limited to this; three or more microphones may be used, in an appropriate number and arrangement, and the processing described later may be executed for each pair of microphones.
[0016]
FIG. 1 shows a block diagram of the utterance detection apparatus to which the present invention is applied. The utterance detection apparatus includes two microphones 2 serving as sound collectors, an amplifier 10, an A/D converter 11, a low band cut processing unit 12, a frame cutting out unit 13, a whitening processing unit 14, an instantaneous power correlation calculation unit 15, a background noise frame selection unit 16, a pure tone addition coefficient calculation unit 17, a pure tone addition unit 18, a cross correlation calculation unit 19, and an utterance detection unit 20.
[0017]
The microphones 2 are omnidirectional, because it is desirable to collect sound from all directions. They are installed separated by a predetermined distance, determined according to the sampling period, the distance to the speaker, and the like. The microphones 2 do not need to be of especially high quality.
[0018]
The amplifier 10 is an amplifier that amplifies the sound collected by the microphone 2 so that
the A / D converter 11 can process it. The voice, which is the amplified analog signal, is sampled
at 6000 Hz or more and converted into a discrete time signal (digital signal) by the A / D
converter 11. The amplifier 10 and the A / D converter 11 are both well-known components, so
detailed description will be omitted.
[0019]
Next, the low band cut processing unit 12 is configured as a digital filter that cuts low band signals irrelevant to the audio signal, for example frequency components of 70 Hz or less. The same configuration must be used in both the left and right channels, but either an FIR (Finite Impulse Response) or IIR (Infinite Impulse Response) type may be used. Alternatively, processing on the frequency axis using an FFT (Fast Fourier Transform) may be used.
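As an aid to understanding, the frequency-axis (FFT) variant of the low band cut can be sketched as follows. This is a minimal illustration, not the patented implementation; the 8 kHz sampling rate and the function name are assumptions.

```python
import numpy as np

def low_band_cut(signal, fs=8000, cutoff_hz=70.0):
    """Remove frequency components at or below cutoff_hz (FFT method)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum[freqs <= cutoff_hz] = 0.0          # zero out the low band
    return np.fft.irfft(spectrum, n=len(signal))

# A 50 Hz hum plus a 440 Hz tone: only the tone should survive the cut.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 440 * t)
y = low_band_cut(x, fs)
```

In practice an FIR or IIR high-pass filter would be applied sample by sample; the FFT form above is simply the easiest to verify.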
[0020]
Next, the frame cutting out unit 13 cuts the acoustic signal into frames of a fixed length at a fixed period. Specifically, for example, a Hamming window with a frame length of 30 ms and a shift length of 20 ms is used as the window function: the acoustic signal is multiplied by the window function to cut out each frame. The window function is not limited to the Hamming window; a Hanning window or the like may also be used.
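The framing step can be sketched as follows, assuming a sampling rate of 8 kHz (the document only requires 6000 Hz or more); the function name is illustrative.

```python
import numpy as np

def cut_frames(signal, fs=8000, frame_ms=30, shift_ms=20):
    """Cut the signal into overlapping frames multiplied by a Hamming window."""
    flen = int(fs * frame_ms / 1000)    # 240 samples at 8 kHz
    shift = int(fs * shift_ms / 1000)   # 160 samples at 8 kHz
    win = np.hamming(flen)
    n = (len(signal) - flen) // shift + 1
    return np.stack([signal[i * shift : i * shift + flen] * win
                     for i in range(n)])

frames = cut_frames(np.random.randn(8000))  # 1 s of noise -> 49 frames
```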
[0021]
The processing result of the frame cutting out unit 13 will be described with reference to FIG. 3. FIG. 3 is a graph of the acoustic signal, with the horizontal axis representing time and the vertical axis representing amplitude. An example of the acoustic signal processed by the A/D converter 11 and the low band cut processing unit 12 is shown in the upper diagram of FIG. 3, and the frame to be processed, cut out from this acoustic signal by the frame cutting out unit 13, is shown in the lower diagram of the figure.
[0022]
The whitening processing unit 14 flattens the frequency characteristics of the extracted frame. The intention of this flattening, that is, whitening, is to reduce the variation in the shape of the normalized cross-correlation value sequence caused by differences in phoneme (e.g., /a/, /i/, /e/) in the cross correlation calculation unit 19 described later.
[0023]
Specific processing of the whitening processing unit 14 will be described. The whitening processing unit 14 calculates LPC cepstrum coefficients from the acoustic signal of the frame cut out by the frame cutting out unit 13 (lower diagram in FIG. 3), and then calculates the frequency response of the LPC cepstrum coefficients to obtain a spectral envelope. Separately, FFT (Fast Fourier Transform) processing is performed on the acoustic signal of the extracted frame. The acoustic signal is then whitened by dividing the result of the FFT processing by the spectral envelope.
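A minimal sketch of the whitening step, substituting a moving-average envelope for the LPC-cepstrum-derived envelope described above (the function and parameter names are hypothetical):

```python
import numpy as np

def whiten(frame, env_width=11, floor=1e-10):
    """Flatten the frame spectrum by dividing it by a spectral envelope.

    The patent derives the envelope from LPC cepstrum coefficients; here a
    moving average of the magnitude spectrum stands in as a simple envelope.
    """
    spectrum = np.fft.rfft(frame)
    mag = np.abs(spectrum)
    kernel = np.ones(env_width) / env_width
    envelope = np.convolve(mag, kernel, mode="same") + floor
    return spectrum / envelope  # whitened (flattened) spectrum

white = whiten(np.hamming(240) * np.random.randn(240))
```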
[0024]
FIGS. 4A to 4C illustrate the whitening process. FIG. 4A shows the frequency spectrum of the frame cut out by the frame cutting out unit 13 from the input acoustic signal. The horizontal axis in FIG. 4 is frequency, and the vertical axis is spectral intensity. The level on the low band side is low because the low band cut processing unit 12 has cut the low band signals. LPC cepstrum analysis is performed on this acoustic signal to determine its envelope (spectral envelope; dashed line in FIG. 4B). FIG. 4C shows the spectrum whitened on the basis of this envelope. Needless to say, the whitening process is not limited to this method; a known whitening process such as filtering on the time axis can also be applied.
[0025]
Furthermore, as an option, the whitening processing unit 14 may follow the whitening with a tilt so that the spectral intensity slopes downward along the frequency axis. This is shown in FIG. 4D; the broken line in FIG. 4D is an auxiliary line shown to make the downward slope of the spectral intensity easy to see. Adding a downward slope to the spectral intensity corresponds to widening the pulse width in the normalized cross-correlation value sequence in the processing of the cross correlation calculation unit 19 described later. In particular, when the sampling frequency of the A/D converter 11 is low, the pulse width becomes too narrow and evaluation in the instantaneous power correlation calculation unit 15 and the cross correlation calculation unit 19 becomes difficult; the tilt makes the pulse width adjustable. FIG. 4E shows an acoustic signal to which a pure tone, a kind of artificial signal, has been added by the pure tone addition unit 18 described later.
[0026]
The instantaneous power correlation calculation unit 15 receives the acoustic signals input from the left and right microphones 2 and processed by the amplifier 10, the low band cut processing unit 12, the frame cutting out unit 13, and the whitening processing unit 14, and calculates Y11(t), the power information in frame t from the left microphone 2; Y22(t), the power information in frame t from the right microphone 2; and Y12(t), the instantaneous cross-correlation value of the acoustic signals of the left and right microphones 2. It outputs these to the background noise frame sorting unit 16. Here, when the two outputs (in the frequency domain) of the whitening processing unit 14 are X1(k, t) and X2(k, t), the power and the specific direction instantaneous cross-correlation value are calculated as Y11(t) = Σk {X1(k, t) · X1*(k, t)}, Y22(t) = Σk {X2(k, t) · X2*(k, t)}, and Y12(t) = Σk {X1(k, t) · X2*(k, t)}. Here, k is the discrete frequency of the FFT, (·)* is the complex conjugate, and Σk {·} denotes summation over the discrete frequencies k.
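The three quantities can be computed from the whitened frequency-domain frames as a direct transcription of the formulas above; the variable names are assumptions.

```python
import numpy as np

def power_and_correlation(X1, X2):
    """Compute Y11, Y22 (frame powers) and Y12 (the specific direction
    instantaneous cross-correlation value) from two frequency-domain
    frames X1(k), X2(k)."""
    Y11 = np.sum(X1 * np.conj(X1)).real   # power of left channel
    Y22 = np.sum(X2 * np.conj(X2)).real   # power of right channel
    Y12 = np.sum(X1 * np.conj(X2))        # instantaneous cross-correlation
    return Y11, Y22, Y12

# Identical inputs: Y12 must equal the (real, positive) channel power.
X = np.fft.rfft(np.random.randn(240))
Y11, Y22, Y12 = power_and_correlation(X, X)
```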
[0027]
The background noise frame sorting unit 16 sorts out background noise frames, that is, acoustic frames in which no utterance is made, using the history of past values Y11(t), Y22(t), and Y12(t) obtained by the instantaneous power correlation calculation unit 15. Specifically, as an index for determining that no voice is uttered, it uses the fact that in a voiceless frame the absolute value of the specific direction instantaneous cross-correlation value Y12(t) is small. In a voiceless frame, uncorrelated sound signals appear at the inputs of the left and right microphones 2, so the cross-correlation value is relatively small; in a voiced frame, on the other hand, the voice from the front direction, for example when the operator 4 of the ATM 3 speaks, appears in phase at the inputs of both microphones 2, so the cross-correlation value becomes relatively large.
[0028]
With reference to FIG. 6, the background noise frame selection unit 16, which selects background noise frames representative of acoustic frames that do not contain voice uttered from the front of the ATM 3, will be described in detail. FIG. 6 shows a functional block diagram of the background noise frame sorting unit 16. The background noise frame selection unit 16 includes a data set generation unit 161, an update unit 162, an estimation unit 163, and a storage unit 164.
[0029]
The data set generation unit 161 combines the power information Y11(t) of the acoustic signal in frame t from the left microphone, the power information Y22(t) of the acoustic signal from the right microphone, and the specific direction instantaneous cross-correlation value Y12(t) of the acoustic signals from the two microphones, all calculated by the instantaneous power correlation calculation unit 15, into a data set with a validity period. The validity period used here is longer than the duration of the sudden noises that occur at the installation site. In the present embodiment, where the influence of a sudden noise lasting up to 15 seconds is to be eliminated and 50 frames are analyzed per second, the validity period is set to 1000 frames (corresponding to 20 seconds). Since this value depends on the analysis period of the acoustic signal and other factors, it must be determined appropriately. The longer the validity period is set, the longer the observation section becomes, so the influence of temporally continuous sudden noise can be reduced. Conversely, the shorter the validity period is set, the shorter the background noise observation section becomes, so fluctuations in the background noise level can be followed quickly. The validity period therefore takes an appropriate value depending on the environment in which the microphones 2 are installed, the purpose of the application, and the like. As will be described later, setting the validity period in the data set generation unit 161 eliminates the need to increase hardware and processing cost in the background noise frame sorting unit 16.
[0030]
The storage unit 164 is a memory with a capacity permitted by the hardware for storing data sets, and the data sets generated by the data set generation unit 161 are stored in ascending order of the absolute value of the specific direction instantaneous cross-correlation value Y12(t). The number of data sets that can be stored in the storage unit 164 is referred to as the first predetermined number. In this embodiment, 100 data sets, corresponding to two seconds of data, can be stored as the first predetermined number. The first predetermined number must be a capacity that can be provided as hardware, and at least a number on which the estimation unit 163 can reliably base its statistical processing of the background noise section. For example, in the present embodiment, a storage capacity of one hundred data sets (two seconds) in the storage unit 164 is sufficient for an observation period with a validity period of 1000 frames (20 seconds). Once the first predetermined number of storage locations is prepared, the observation section of the background noise can easily be widened or narrowed by setting the validity period to an appropriate value. This makes it possible to set the observation section freely without increasing hardware.
[0031]
04-05-2019
10
The update unit 162 includes a comparison means 1621 and a validity period confirmation means 1622, and updates the data sets stored in the storage unit 164. If there is enough vacant area in the storage unit 164 to store an additional data set, the input data set is stored in the storage unit 164 in ascending order of the absolute value of the specific direction instantaneous cross-correlation value Y12(t); otherwise, the processing of the comparison means 1621, described below, is performed.
[0032]
The validity period confirmation means 1622 subtracts one from the validity period of every data set stored in the storage unit 164 each time a data set is generated by the data set generation unit 161, and deletes a data set from the storage unit 164 when its validity period reaches zero. That is, since any data set recorded in the storage unit 164 is deleted when its validity period expires, no old data set remains. As a result, the validity period limits the observation section on the time axis, and an appropriate observation section is realized. In the present embodiment, every data set recorded in the storage unit 164 is forcibly deleted after inputs for 1000 frames, that is, after about 20 seconds have elapsed.
[0033]
When there is no free space in the storage unit 164 to additionally store a data set, the comparison means 1621 compares the absolute value of the specific direction instantaneous cross-correlation value Y12(t) of the input data set with the largest absolute value of Y12(t) among the data sets stored in the storage unit 164. If the absolute value for the input data set is larger, the input data set is discarded. On the other hand, if the absolute value for the input data set is smaller, the data set having the largest absolute value of Y12(t) is deleted from the storage unit 164, and the input data set is inserted at the position that keeps the data sets arranged in ascending order of the absolute value of Y12(t). As a result, the data sets in the storage unit 164 are always stored sorted in ascending order of the absolute value of the specific direction instantaneous cross-correlation value Y12(t). In this embodiment, the calculation load is reduced by comparing only against the largest absolute value of Y12(t) among the stored data sets. If there is a margin in processing capacity, the input data set may instead be compared against the average value of the stored data sets and used for updating when it is exceeded. Alternatively, while allowing a certain degree of performance degradation, the comparison may be made against the absolute value of Y12(t) given a predetermined weight according to the validity period, or against the second or third largest absolute value of Y12(t).
[0034]
As described above, the comparison means 1621 stores in the storage unit 164 the data sets whose specific direction instantaneous cross-correlation values Y12(t) have small absolute values within the observation section needed for background noise section estimation, and does not store data sets whose Y12(t) absolute values are relatively large and which are unnecessary for the estimation. This makes it possible to estimate the background noise section over the entire observation section accurately even with a reduced number of data sets in the storage unit 164.
[0035]
The estimation unit 163 outputs an appropriate group of the data sets stored in the storage unit 164 to the pure tone addition coefficient calculation unit 17.
Specifically, a second predetermined number of data sets are selected from those stored in the storage unit 164, in ascending order of the absolute value of the specific direction instantaneous cross-correlation value Y12(t). For example, 20 data sets, corresponding to 0.4 seconds of data, are used as the second predetermined number. The smaller the second predetermined number, the smaller the amount of calculation and thus the processing cost, but the greater the influence of any data set unsuitable for the background noise section. Conversely, if the second predetermined number is increased, the influence of data sets unsuitable as background noise can be reduced, but the calculation amount and the capacity of the storage unit 164 must be increased. In the present embodiment, the calculation amount is reduced by using, as the processing target of the average value, the second predetermined number of data sets extracted in ascending order starting from the smallest power information; however, the present invention is not limited to this, and the second predetermined number and the data sets to be selected may be determined appropriately so as to increase the reliability of the statistical processing of the background noise section.
[0036]
Next, the processing flow of the background noise frame sorting unit 16 will be described with reference to FIG. 7. The background noise frame selection unit 16 starts processing upon receiving power and correlation information from the instantaneous power correlation calculation unit 15. First, when the power and correlation information Y11(t), Y22(t), and Y12(t) are input from the instantaneous power correlation calculation unit 15, the data set generation unit 161 generates a data set in which Y11(t), Y22(t), and Y12(t) are associated with the validity period 1000 (step S1).
[0037]
Next, the validity period confirmation means 1622 of the update unit 162 subtracts one from the validity period of all the data sets stored in the storage unit 164, and any data set for which the result is 0 is deleted from the storage unit 164 (step S2).
[0038]
Next, in step S3, it is determined whether the storage capacity of the storage unit 164 is full.
If it is full, the comparison means 1621 compares the absolute value of the specific direction instantaneous cross-correlation value Y12(t) of the input data set with the largest absolute value of Y12(t) among the data sets stored in the storage unit 164 (step S4).
[0039]
On the other hand, if there is space in the storage unit 164 in step S3, the process proceeds to step S7. If, in step S4, the absolute value of the specific direction instantaneous cross-correlation value Y12(t) of the input data set is smaller than the largest absolute value of Y12(t) among the data sets stored in the storage unit 164, that stored data set is deleted from the storage unit 164 in step S5. Then, the input data set is inserted at the position that keeps the data sets sorted in ascending order of the absolute value of Y12(t) (step S7). If, in step S4, the absolute value of Y12(t) of the input data set is equal to or larger than the largest absolute value of Y12(t) stored in the storage unit 164, the input data set is discarded.
[0040]
Then, in step S8, the estimation unit 163 selects the background noise frames using the data sets stored in the storage unit 164 and outputs them to the pure tone addition coefficient calculation unit 17. Specifically, a predetermined number of data sets from the beginning of the sorted data sets stored in the storage unit 164 are output.
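The flow of steps S1 to S8 can be sketched as follows. This is an illustrative model, not the patented implementation; the class and field names are assumptions, and the toy example uses a capacity of 5 and a validity period of 10 in place of 100 and 1000.

```python
from collections import namedtuple

DataSet = namedtuple("DataSet", "y11 y22 y12 ttl")

class NoiseFrameSelector:
    """Sketch of the background noise frame sorting unit 16 (steps S1-S8):
    keep at most `capacity` data sets, sorted by |Y12|, each expiring after
    `ttl` new inputs, and output the `pick` sets with the smallest |Y12|."""

    def __init__(self, capacity=100, ttl=1000, pick=20):
        self.capacity, self.ttl, self.pick = capacity, ttl, pick
        self.store = []  # kept sorted by |y12|, ascending

    def push(self, y11, y22, y12):
        # S2: age every stored set and drop the expired ones
        self.store = [d._replace(ttl=d.ttl - 1) for d in self.store]
        self.store = [d for d in self.store if d.ttl > 0]
        new = DataSet(y11, y22, y12, self.ttl)
        if len(self.store) >= self.capacity:           # S3/S4: storage full
            if abs(y12) >= abs(self.store[-1].y12):
                return self.select()                   # discard the input
            self.store.pop()                           # S5: drop the largest
        # S7: insert keeping ascending order of |Y12|
        self.store.append(new)
        self.store.sort(key=lambda d: abs(d.y12))
        return self.select()

    def select(self):
        # S8: the `pick` frames with the smallest |Y12| model background noise
        return self.store[: self.pick]

sel = NoiseFrameSelector(capacity=5, ttl=10, pick=2)
for v in [5.0, 1.0, 3.0, 2.0, 4.0, 0.5]:
    picked = sel.push(1.0, 1.0, v)
```

After the sixth push the full store discards its largest value (5.0) in favor of the new smallest one (0.5), so the two selected frames are those with |Y12| of 0.5 and 1.0.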
[0041]
The pure tone addition coefficient calculation unit 17 calculates the pure tone addition
coefficient α (t) from the frame group selected by the background noise frame selection unit 16
by executing the recursive equation of Equation 1 multiple times.
[0042]
[0043]
In Equation 1, φ (0, t, β) is a normalized cross-correlation value in the front direction (0-order
direction) when the pure tone additional coefficient β in t frame is used.
Further, a frame group selected by the background noise frame sorting unit 16 is represented by
Ω (t), and E (Ω (t)) [φ (0, t, β)] is represented by φ using the frame group. The averaging
process of (0, t, β) is performed.
The initial value is set as β0 = α(t−1), using the pure tone addition coefficient α(t−1) calculated for the previous frame t−1, and βi is then updated. Here, μ is the update step width, a small positive number on the order of 10^−4 to 10^−5, and σ is the target value of the normalized cross-correlation value in the predetermined direction, described later. By repeating the update a predetermined number of times H, the pure tone addition coefficient in frame t is determined as α(t) = βH. H may be one to several. Here, the new pure tone addition coefficient is calculated recursively from the pure tone addition coefficient of the previous frame so as to approach the desired normalized cross-correlation value asymptotically, but other update methods, such as the steepest descent method, the Newton method, or the LMS (Least Mean Squares) method of adaptive signal processing, can also be applied. Since the statistical properties of the background noise frame group output from the background noise frame selection unit 16 do not change rapidly between adjacent frames, particularly fast convergence is not needed, and the above update method is sufficiently practical.
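Since Equation 1 itself is not reproduced in this text, the following sketch assumes a plain LMS-style step that nudges β toward the target σ, and evaluates φ(0, t, β) in the closed form suggested by the later description of Equation 3. Both expressions are assumptions, not quotations from the patent.

```python
import math

def phi_front(Y11, Y22, Y12, beta, delta):
    """Front-direction normalized cross-correlation when antiphase pure
    tones with coefficient beta are added (assumed closed form)."""
    p = beta * beta * delta            # power contributed by the pure tone
    return (Y12 - p) / math.sqrt((Y11 + p) * (Y22 + p))

def update_coefficient(alpha_prev, frames, delta, sigma=-0.5, mu=1e-4, H=3):
    """Recursively update the pure tone addition coefficient so that the
    average of phi(0, t, beta) over the background-noise frames Omega(t)
    approaches the target sigma (LMS-style sketch of Equation 1)."""
    beta = alpha_prev                  # beta_0 = alpha(t-1)
    for _ in range(H):                 # H is one to several iterations
        avg = sum(phi_front(y11, y22, y12, beta, delta)
                  for y11, y22, y12 in frames) / len(frames)
        beta = beta + mu * (avg - sigma)   # larger beta drives phi downward
    return beta                        # alpha(t) = beta_H
```

The positive sign of the step reflects that adding more antiphase pure tone makes φ more negative, so β is raised while the average still lies above σ.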
[0044]
Here, the target value σ will be described with reference to FIG. 5. FIG. 5 illustrates acoustic signals and the results of their normalized cross-correlation. In (a) and (c), which show the acoustic signals to be processed, the horizontal axis represents time and the vertical axis represents amplitude. In (b) and (d), which show the processing results of the cross-correlation calculation unit 19 described later, the horizontal axis represents the time difference indicating the direction of arrival of the acoustic signal and the vertical axis represents the normalized cross-correlation value. The vertical dotted lines in (b) and (d) indicate the normalized cross-correlation value of sound from the front direction, and the horizontal dotted lines indicate the positions where the normalized cross-correlation value is zero.
[0045]
FIG. 5(a) shows an acoustic signal in a state where no voice is uttered, that is, one of the waveforms obtained by adding a pure tone (an artificial sound) to the background noise in opposite phase between the left and right acoustic signals. As can be seen from the figure, since the power of the added pure tone is dominant, a waveform dominated by the pure tone component appears. Although not illustrated, the waveform of the other acoustic signal is likewise dominated by the pure tone, in opposite phase, since the added pure tones are in antiphase. At this time, as shown in FIG. 5(b), the normalized cross-correlation value of the two signals appears as a negative value because the power of the pure tone is larger than the power of the background noise. The pure tone to be added is a signal that does not overlap the band of the audio signal, for example a 40 Hz tone signal; it may also be a direct current, or an artificial sound frequency-modulated at 70 Hz or below.
[0046]
The target value σ is a normalized cross-correlation value at which an acoustic signal can be judged to be dominated by the pure tone when the pure tone is added to a background noise frame. The target value is determined appropriately according to the installation environment, the detection target, and so on. Specifically, as shown in FIG. 5(b), for the frame group Ω(t) selected by the background noise frame selection unit 16, in which no voice is uttered, the added pure tone is dominant in the acoustic signal, and the normalized cross-correlation value in the front direction, which is the target direction, is negative. In the present embodiment, it is set to "−0.5". However, if the target value σ is set too low, the pure tone addition coefficient becomes large, the pure tone component remains dominant even when a voice signal is included, and the utterance cannot be detected. On the other hand, if the target value σ is set too high, the pure tone addition coefficient becomes small, the control by the pure tone weakens, and noise that does not amount to speech may be detected as speech. Therefore, the target value σ should be set appropriately according to how loud an utterance is to be detected.
[0047]
FIG. 5(c) shows the acoustic signal in a state where voice is uttered from the front; that is, when the speaker utters, the voice signal, which arrives at both microphones in phase, becomes dominant over the added antiphase pure tone. As shown in (d), the value of the normalized cross-correlation in the predetermined direction (the value at lag 0 in the case of the front) then swings in the positive direction. As described above, by setting the level of the added pure tone appropriately, it becomes possible to enhance the dominance of the pure tone in the background noise state and to control the normalized cross-correlation value to a constant negative level. This facilitates thresholding to determine whether the speaker has uttered. In other words, the cross-correlation value is normalized to account for the characteristics of the microphones 2 and the amplifier 10 as well as the environmental sound of the installation site. For example, when there is no utterance from the front direction, the normalized cross-correlation value stays in the vicinity of the target value "−0.5"; when it exceeds the utterance detection threshold of, for example, "0.2", it can be determined that an utterance has occurred.
Next, returning to Equation 1, an efficient method of calculating φ(0, t, α(t)) will be described. φ(0, t, α(t)) corresponds to the zeroth term of the inverse Fourier transform of the normalized cross spectrum Φ(k, t, α(t)) of Equation 2. Therefore, in general, to calculate the normalized cross-correlation value φ(n, t, α(t)) for a specific α(t), it is necessary, as in the upper part of Equation 2, to compute the Fourier transforms X1(k, t) and X2(k, t) and an inverse Fourier transform. However, since there is no frequency overlap between the Fourier transforms X1(k, t), X2(k, t) of the two signals and D(k), the Fourier transform of the pure tone, the approximation in the second stage of Equation 2 is obtained.
[0049]
[0050]
In Equation 2, Φ(k, t, α(t)) is the normalized cross spectrum at discrete frequency k, frame number t, and pure tone addition coefficient α(t); D(k) is the Fourier transform of the pure tone; M is the size of the FFT (Fast Fourier Transform); X1(k, t) and X2(k, t) are the Fourier transforms of the left and right acoustic signals; and (·)* denotes the complex conjugate.
As described above, the low-frequency components of X1(k, t) and X2(k, t) have been cut by the low-pass cutting processing unit 12, while D(k) has only low-frequency components, so the approximation holds with high precision. The zeroth term of the inverse Fourier transform of this expression is φ(0, t, α(t)); since the numerator is summed over the frequency k by the definition of the Fourier transform, Equation 3 is obtained.
[0051]
In Equation 3, φ(0, t, α(t)) is the normalized cross-correlation value in direction 0 (the front direction), for frame t and pure tone addition coefficient α(t). Δ is the power of the added pure tone. Y11(t) and Y22(t) are the powers of the acoustic frames from the left and right microphones 2 in frame t, respectively, and Y12(t) is the instantaneous cross-correlation value of the left and right microphones 2 in the front direction. Note that the variables required for the calculation are Y11(t), Y22(t), Y12(t), and α(t), and the only constant is the pure tone power Δ; the computation requires only three multiplications, three additions and subtractions, one square root operation, and one division. The value E(Ω(t))[φ(0, t, β)] can be calculated from the data sets stored in the storage unit 164, that is, from the history of the three real numbers Y11(t), Y22(t), and Y12(t) of past frames. By contrast, in general, calculating the cross-correlation value requires maintaining the history of X1(k, t) and X2(k, t).
If the size of the FFT is 256, storage of 256 real numbers is required per frame, and an inverse FFT operation is required to calculate the cross-correlation for one frame. Here, k denotes a discrete frequency, t a frame number, and X1(k, t) and X2(k, t) the Fourier transforms of the left and right acoustic signals in frame t. From the above, the difference in storage capacity and amount of calculation is clear. This is because the calculation of the correlation function takes advantage of the fact that the approximate expression holds when the acoustic signal and the added pure tone do not overlap spectrally. Of course, if there is headroom in storage capacity and computing power, it is possible to keep the history of the FFT results and update Equation 1 using many inverse Fourier transforms, but a comparison of the results shows no significant difference from the low-storage, low-computation approximate method.
[0054]
By determining the pure tone addition coefficient using the efficient storage and arithmetic processing described above, an appropriate pure tone addition coefficient can be calculated for each frame from the background noise section, which is estimated retroactively over a long period.
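The claim that the approximation holds when the spectra do not overlap can be checked numerically. The sketch below assumes the closed form φ(0) = (Y12 − α²Δ) / √((Y11 + α²Δ)(Y22 + α²Δ)), which is consistent with the stated operation count (three multiplications, three additions and subtractions, one square root, one division) but is not quoted from the patent; the test signals use disjoint integer FFT bins so the orthogonality is exact.

```python
import numpy as np

M = 256
n = np.arange(M)
d  = np.cos(2 * np.pi * 2  * n / M)   # low-frequency "pure tone" (bin 2)
s1 = np.cos(2 * np.pi * 20 * n / M)   # high-passed acoustic signals (bin 20)
s2 = np.sin(2 * np.pi * 20 * n / M)
alpha = 1.0

# Pure tone added in opposite phase to the two channels.
x1, x2 = s1 + alpha * d, s2 - alpha * d

# Direct normalized cross-correlation at lag 0 (full computation).
direct = np.dot(x1, x2) / np.sqrt(np.dot(x1, x1) * np.dot(x2, x2))

# Closed-form evaluation from the three stored reals plus the tone power.
Y11, Y22, Y12 = np.dot(s1, s1), np.dot(s2, s2), np.dot(s1, s2)
delta = np.dot(d, d)
p = alpha * alpha * delta
closed = (Y12 - p) / np.sqrt((Y11 + p) * (Y22 + p))
```

With these bin choices both routes give −0.5, matching the target value used in the embodiment; only Y11, Y22, Y12, α, and Δ are needed for the closed form, which is the source of the storage saving.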
Next, the pure tone addition unit 18 uses the whitened acoustic signals from the whitening processing unit 14 and the pure tone addition coefficient α(t) obtained by the pure tone addition coefficient calculation unit 17 to add pure tones of mutually opposite phase, scaled by the magnitude of α(t), to the signals from the left and right microphones 2. The cross-correlation calculation unit 19 receives the outputs of the pure tone addition unit 18 for the left and right microphones 2, calculates the normalized cross spectrum according to Equation 2, applies an inverse FFT to obtain the normalized cross-correlation value sequence, and outputs it to the utterance detection unit 20. In Equation 2, Φ(k, t, α(t)) is the normalized cross spectrum, which is equal to the Fourier transform of the normalized cross-correlation value sequence; k is the discrete frequency, t the analysis frame number, M the size of the FFT (Fast Fourier Transform), X1(k, t) and X2(k, t) the Fourier transforms of the left and right acoustic signals in frame t, and (·)* the complex conjugate. By applying an inverse Fourier transform to Φ(k, t), the normalized cross-correlation value sequence in frame t can be obtained.
[0057]
Next, the utterance detection unit 20 determines, based on the peak height and peak width of the normalized cross-correlation value sequence calculated by the cross-correlation calculation unit 19, whether an utterance has occurred from the designated direction. Specifically, when the peak giving the maximum of the normalized cross-correlation value sequence has a height of at least a constant value and a width of no more than a constant value, the peak position is near the predetermined direction, and these conditions are satisfied over a plurality of frames, it is determined that speech has been uttered.
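The per-frame peak conditions can be sketched as follows; the threshold values and the half-height width measure are illustrative assumptions, not the patent's.

```python
def detect_utterance(seq, lag0_index, min_height=0.2, max_width=5,
                     max_offset=2):
    """Check one frame's normalized cross-correlation sequence:
    the peak must be high enough, narrow enough, and near the target lag."""
    peak = max(range(len(seq)), key=lambda i: seq[i])
    if seq[peak] < min_height:              # peak height condition
        return False
    if abs(peak - lag0_index) > max_offset: # peak position condition
        return False
    # Peak width: consecutive samples above half the peak height.
    half = seq[peak] / 2.0
    left = peak
    while left > 0 and seq[left - 1] > half:
        left -= 1
    right = peak
    while right < len(seq) - 1 and seq[right + 1] > half:
        right += 1
    return (right - left + 1) <= max_width  # peak width condition
```

The full decision would additionally require this per-frame result over a plurality of consecutive frames, as the text states.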
[0058]
In the present embodiment, the normalized cross-correlation value sequence is calculated by the cross-correlation calculation unit 19. However, the present invention is not limited to this; another, simpler method may be adopted if some accuracy can be sacrificed. That is, the normalized cross-correlation value in the predetermined direction may be calculated by Equation 3 without calculating the whole normalized cross-correlation value sequence. In this case, the utterance detection unit 20 determines that speech has been uttered if the calculated normalized cross-correlation value exceeds the utterance detection threshold over a plurality of frames.
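The simplified variant reduces to a threshold test over consecutive frames; the threshold and required frame count below are illustrative.

```python
def detect_by_threshold(phi_values, threshold=0.2, required_frames=3):
    """Simplified detection using only Equation 3's front-direction value:
    declare an utterance when it exceeds the threshold over a plurality
    of consecutive frames."""
    run = 0
    for phi in phi_values:
        run = run + 1 if phi > threshold else 0  # count consecutive hits
        if run >= required_frames:
            return True
    return False
```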
[0059]
In the present embodiment, detection of an utterance from one specific direction, for example the front of the two microphones, has been described; however, by sharing the whitening processing unit 14 and setting plural directions in the processing from the instantaneous power correlation calculation unit 15 onward, it is possible to determine whether sound is emitted from multiple directions. Let the cross-correlation index be n0, the speed of sound c, the distance between the microphones d, the angle between the sound source and the center line of the microphones θ, and the sampling frequency fs; then
[0060]
[0061] Here, θ is the oblique incidence angle of the voice input (unit: radians).
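The equation of [0060] is not reproduced in this text. For a far-field source, the standard relation between the variables listed above would be the following; this is a reconstruction, not the patent's verbatim formula:

```latex
n_0 = \frac{f_s \, d \sin\theta}{c}
```

With θ = 0 rad (a source on the center line, i.e., the front), this gives n0 = 0, consistent with the lag-0 detection described in [0062].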
[0062]
Further, in the present embodiment, since it is desired to distinguish the voice uttered by the user 4 located in front of the ATM 3 from other voices, voice from the direction equidistant from the left and right microphones 2, that is, θ = 0 rad, is detected. For example, the number of times an acoustic signal from the front is detected is counted by a counter, and if a predetermined count is reached within a predetermined time, it is determined that the user is talking while facing the ATM, and this is indicated on a lamp, a display, a buzzer, or the like. As a result, store personnel can be notified that the ATM 3 is being operated under instructions given over a mobile phone, which helps to alert a person who is unknowingly involved in transfer fraud. Although not described in detail in the present embodiment, the output to a lamp or buzzer need not be triggered by every voice from a speaker in front of the ATM 3; it may be produced only when recognition processing of the voice signal indicates that a transfer may be being induced.
[0063]
The features of this method are summarized as follows. The background noise section is estimated over an interval reaching back in time, and based on the background noise level, the normalized cross-correlation value in the predetermined direction is adjusted to indicate an appropriate target value. Therefore, as long as the left and right microphone signals fall within an appropriate A/D converter range, no special hardware or calibration work is required, and a speaker's utterance from a specific direction can be detected with high accuracy under various noise environments without changing the threshold setting.
[0064]
Furthermore, the method is not affected by sudden noise that lasts for a shorter time than the set background noise estimation time; that is, the utterance detection performance does not deteriorate immediately after such sudden noise disappears. Moreover, the amount of calculation and the storage capacity required to estimate the background noise level and adjust the normalized cross-correlation value are both significantly smaller than those of conventional methods.
[0065]
DESCRIPTION OF SYMBOLS 1... Main part of utterance detection apparatus, 10... Amplifier, 11... A/D converter, 12... Low-pass cutting processing unit, 13... Frame cutting unit, 14... Whitening processing unit, 15... Instantaneous power correlation calculation unit, 16... Background noise frame selection unit, 17... Pure tone addition coefficient calculation unit, 18... Pure tone addition unit, 19... Cross-correlation calculation unit, 20... Utterance detection unit, 2... Microphone, 3... ATM, 4... Speaker