Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2008064892
The present invention combines a microphone array with speech recognition means having a function of correcting residual noise that cannot be eliminated by microphone-array processing, thereby realizing highly accurate speech recognition even in a general noisy environment containing stationary or moving directional noise sources or nondirectional noise sources. Abstract: The speech recognition method comprises a step 1 of collecting input speech with a microphone array in which a plurality of microphones are arranged; a step 2 of creating a sound-wave arrival-direction estimation signal by estimating, from the collected input speech signal, the arrival direction of sound from a distant source; a step 3 of creating a position estimation signal by estimating, from the collected input speech signal, the position of a nearby sound source; a step 4 of detecting, separating, and outputting the user's voice from the collected input speech on the basis of the arrival-direction estimation signal and the position estimation signal; a step 5 of correcting features of the user's speech signal; and a step 6 of performing speech recognition on the corrected speech signal. [Selected figure] Figure 1
Speech recognition method and speech recognition apparatus using the same
[0001]
The present invention relates to a speech recognition method that detects a user's speech in a noise environment in which various environmental noises and other people's voices are present, separates the user's speech from the noise, and recognizes the separated speech, and to a speech recognition apparatus using the method.
[0002]
04-05-2019
1
To record the user's voice at a high SNR in a noise environment where various environmental noises and other people's voices are present, a close-talking headset in which a single microphone is placed near the mouth (see, for example, Patent Document 1) or a close-talking microphone (see, for example, Patent Document 2) has been used.
When speech recognition is performed, it is essential to make the recognition robust against noise. Conventionally, close-talking microphones such as headset microphones have been widely used to suppress the mixing-in of noise.
[0003]
Patent Document 1: Japanese Patent Application Publication No. 2003-076393 Patent Document
2: Japanese Patent Application Publication No. 2002-152365
[0004]
However, when speech recognition is incorporated into, for example, a ticket vending machine, the user would have to put on a headset microphone every time the machine is used, which is cumbersome and impractical.
To avoid this problem, the microphone must be fixed to the ticket vending machine so that the user can operate it without wearing any microphone.
However, as the distance between the user and the microphone increases, ambient noise is more likely to be mixed in, degrading the speech recognition accuracy or causing the ticket vending machine to malfunction. In addition, when several ambient noises other than the user's voice are present, it is difficult to determine which sound is the speech to be recognized. A stationary directional noise source can be sufficiently suppressed with a microphone array, but in a real environment there are not a few moving noise sources, such as a person talking while walking or a car sounding its horn while driving. For such a moving directional noise source, a sufficient suppression effect cannot be obtained even with a microphone array, particularly when the source moves quickly, and the influence of residual noise cannot be ignored. Furthermore, although microphone-array processing can suppress directional noise to some extent, it cannot sufficiently suppress omnidirectional noise.
[0005]
An object of the present invention is to combine a microphone array with speech recognition means having a function of correcting residual noise that cannot be eliminated by microphone-array processing, and thereby to provide a speech recognition method, and a speech recognition apparatus using the same, that realize highly accurate speech recognition even in a general noise environment in which stationary or moving directional noise sources and nondirectional noise sources are mixed.
[0006]
A speech recognition apparatus according to the present invention comprises a microphone-array processing unit that separates only the speech signal to be recognized from an input speech signal containing stationary or moving noise sources, and a speech recognition processing unit that performs speech recognition while correcting the noise distortion remaining in the separated speech signal.
The microphone-array processing unit includes speech input means and sound-source separation processing means that suppress ambient noise and emphasize only the user's voice. The speech recognition processing unit has speech recognition means with a function of correcting the noise distortion remaining in the separated speech.
[0007]
The speech recognition apparatus comprises a microphone-array processing unit and a speech recognition processing unit. The microphone-array processing unit comprises: a microphone-array speech input device that receives input sound (the user's voice and so on) with the microphone array; sound-source position estimation means that, from the multi-channel audio data of the input device, estimates the position of a nearby sound source such as the user's voice or the voices of others and noise around the user (hereinafter, ambient noise); sound-wave arrival-direction estimation means that estimates the arrival direction of sound from a source located at a long distance; sound-source separation processing means that separates the speech to be recognized on the basis of the estimated source position information; user-speech detection means that detects the user's speech on the basis of the source position information; and switching means that switches and outputs the speech signal from the sound-source separation processing means according to the detection signal from the user-speech detection means. The speech recognition processing unit comprises feature correction processing means that applies feature correction processing to the speech signal from the switching means, and speech recognition means that performs speech recognition on the feature-corrected speech signal from the feature correction processing means and outputs a recognition result.
[0008]
Specifically, the invention is as follows. (1) A speech recognition method comprises: a step 1 of collecting input speech with a microphone array in which a plurality of microphones are arranged; a step 2 of creating a sound-wave arrival-direction estimation signal by estimating, from the collected input speech signal, the arrival direction of sound from a distant source; a step 3 of creating a position estimation signal by estimating, from the collected input speech signal, the position of a nearby sound source; a step 4 of separating and outputting only the user's voice from the collected input speech on the basis of the arrival-direction estimation signal and the position estimation signal; a step 5 of correcting features of the user's speech signal; and a step 6 of performing speech recognition on the corrected speech signal. (2) The speech recognition method according to (1) is characterized in that step 4, in which only the user's voice is separated from the collected input speech and output, is performed according to a signal that detects the user's speech state on the basis of the arrival-direction estimation signal and the position estimation signal.
[0009]
(3) The speech recognition method according to (1) is characterized in that the procedure of detecting the user's speech state on the basis of the arrival-direction estimation signal and the position estimation signal is a procedure of selecting, on the basis of those signals, a sound source that falls within a user-speech region assumed in advance. (4) The speech recognition method according to any one of (1) to (3) is characterized in that directional noise is suppressed in step 1 of collecting the input speech with the microphone array, and distortion due to nondirectional noise or sudden noise that cannot be removed by the microphone-array processing is removed in step 5 of correcting the features of the user's speech signal. (5) A speech recognition device comprises a microphone-array processing unit that collects input speech with a microphone array in which a plurality of microphones are arranged, generates a sound-wave arrival-direction estimation signal by estimating, from the collected input speech signal, the arrival direction of sound from a distant source, generates a position estimation signal by estimating, from the collected input speech signal, the position of a nearby sound source, and separates and outputs only the user's voice from the collected input speech on the basis of the arrival-direction estimation signal and the position estimation signal; and a speech recognition processing unit that corrects features of the user's speech signal and performs speech recognition on the corrected signal.
[0010]
(6) The speech recognition device according to (5) is characterized in that it comprises a microphone-array processing unit that collects the input speech with the microphone array in which a plurality of microphones are arranged, generates a sound-wave arrival-direction estimation signal by estimating, from the collected input speech signal, the arrival direction of sound from a distant source, generates a position estimation signal by estimating, from the collected input speech signal, the position of a nearby sound source, separates the speech from the collected input speech using the arrival-direction estimation signal and the position estimation signal, and switches transmission of the separated speech according to a user-speech detection signal obtained on the basis of the arrival-direction estimation signal and the position estimation signal; and a speech recognition processing unit that applies correction processing to the features of the switched separated speech signal and performs speech recognition on the corrected speech signal.
[0011]
(7) The speech recognition device according to (5) or (6) comprises: a microphone-array speech input device that collects input speech with a microphone array in which a plurality of microphones are arranged; sound-wave arrival-direction estimation means that estimates, from the output signal of the microphone-array speech input device, the arrival direction of sound from a source located at a long distance; position estimation means that receives a signal from the microphone-array speech input device and estimates the position of a nearby sound source; sound-source separation processing means that separates the sound signal of a source from the output signal of the microphone-array speech input device on the basis of the output signals of the arrival-direction estimation means and the position estimation means; speech detection means that detects the user's speech state on the basis of those output signals; a switch that transmits or blocks the separated sound from the sound-source separation processing means on the basis of the output signal of the speech detection means; feature correction processing means that takes in the separated speech signal from the switch and corrects its features; and speech recognition means that recognizes speech on the basis of the feature-corrected speech signal from the feature correction processing means.
[0012]
The present invention makes it possible to estimate the source positions and sound-wave arrival directions of the user's voice and of ambient noise by using a microphone array.
By determining in advance the relative position and direction of the user with respect to the system, the presence or absence of the user's speech can be correctly detected, even when multiple ambient noises other than the user's voice are present, from the position and direction estimated for each source, and malfunction of the system due to ambient noise can be prevented.
In addition, even when the user's voice and ambient noise occur simultaneously, sound-source separation processing can emphasize only the user's voice on the basis of the estimated source positions and sound-wave arrival directions, so that robust speech recognition is realized in a noise environment containing various noises.
[0013]
Similarly, the speech recognition apparatus of the present invention can estimate the source positions and sound-wave arrival directions of the user's speech and of ambient noise by using the microphone array as the speech input means. By determining in advance the user's position relative to the system and the arrival direction of the user's speech, the presence or absence of the user's speech can be correctly detected, even when multiple ambient noises other than the user's voice are present, from the estimated position and arrival direction of each source, and malfunction of the system due to ambient noise can be avoided. In addition, even when the user's voice and ambient noise occur simultaneously, sound-source separation processing can emphasize only the user's voice on the basis of the estimated source positions and sound-wave arrival directions, so that robust speech recognition is realized in a noise environment containing various noises.
[0014]
In the present invention, the microphones can be disposed at any position, but in the following, the headset microphone-array speech input device shown in FIG. 6 will be described as an example. The shape of the microphone array of the present invention is, however, not limited to the headset microphone array of FIG. 6. A conventional headset microphone has a structure in which a post is fixed to only one side of the headset, left or right, and a single microphone is disposed at its tip. In contrast, the headset microphone-array speech input device according to the present invention has a structure in which posts are fixed to both the left and right sides of the headset, with microphones disposed on them.
[0015]
The headset 1 according to the present invention includes a headband 3 to be worn on the head, storage cases 2R and 2L with ear pads attached to both ends of the headband 3, a substantially rod-shaped column 4R provided on the ear-pad-equipped storage case 2R, and a substantially rod-shaped column 4L provided on the ear-pad-equipped storage case 2L. The ear-pad-equipped cases 2R and 2L are each composed of a case body 2Ra, 2La and an ear pad 2Rb, 2Lb.
[0016]
On the columns 4R and 4L, equal numbers of microphones 5 (one or more on each, the number being arbitrary) are disposed spaced apart from one another. Preferably, three identical microphones 5 are provided on each of the columns 4R and 4L. The headband 3 can be made slide-adjustable, as described below, to allow its length to be adjusted.
[0017]
The case bodies 2Ra and 2La of the ear-pad-equipped storage cases 2R and 2L house, as necessary, a battery box, a wireless transmission/reception circuit, and a processing circuit that performs the necessary processing on the input signals from the microphones 5 of the microphone array 6.
The distance between the case bodies 2Ra, 2La and the ear pads 2Rb, 2Lb is adjusted, for example, by screwing a hollow bolt provided on the ear pad into a nut integrated with the case; the distance between the two may also be adjusted by other means.
[0018]
In the present invention, the microphone array 6 is formed by fixing the columns 4R and 4L to the case bodies 2Ra and 2La of the ear-pad-equipped storage cases 2R and 2L, respectively, and arranging a plurality of microphones 5 on the columns. Very small microphones about 5 mm × 3 mm in size, such as silicon microphones, are used as the microphones 5. The number of microphones placed on the columns 4R and 4L and the spacing between the microphones are arbitrary, because they can be adjusted by software. In a microphone array 6, the relative positional relationship between the microphones 5 usually has to be kept constant; in the case of a headset, however, the distance between the left and right microphone arrays 6 may change with the size of the wearer's head. To cope with this, as shown in FIG. 2, the distance between the left and right microphone arrays 6, 6 is adjusted by adjusting the spacing between the case bodies 2Ra, 2La and the ear pads 2Rb, 2Lb of the storage cases 2R and 2L to which the columns 4R and 4L are attached.
[0019]
(Parallel Microphone-Array Speech Input Device) The speech input means comprises sound receiving means consisting of a plurality of microphone arrays 6 spaced apart from one another to receive the user's voice. The configuration of the parallel microphone-array speech input device shown in FIG. 6 is described below. As shown in FIG. 6, two fittings to which the microphones are attached are each fixed at one end to the headband and extend in parallel past the user's mouth, separated by, for example, 20 cm. An arbitrary number of microphones (for example, two in total) are disposed on them at arbitrary intervals, for example 3 cm.
[0020]
FIG. 7 is a block diagram of the processing circuit housed in the case body. As shown in FIG. 7, the speech input means includes the parallel microphone arrays 30a and 30b, microphone amplifiers, and an ADC (analog-to-digital converter) 32. The sound receiving means comprises at least a plurality of microphones, preferably a microphone array in which many microphones are arranged in an array. The microphones are positioned at least far enough apart from one another that the vectors from the sound source to them differ. More preferably, the microphones are disposed on both sides of the user's mouth; arranged in this way, the user's voice is input easily and clearly.
[0021]
FIG. 7 is a block diagram of the processing circuit housed in the case body of the present invention; in particular, it is an example of a processing circuit that performs the necessary processing on the input signals from the microphones of the microphone array. In the processing circuit of the present invention, the parallel microphone arrays 30a and 30b are connected to a CPU (central processing unit) board 33 via the microphone amplifiers and the ADC 32, and the CPU board 33 is connected to a storage unit 34 by a bus. The CPU board 33 is connected to the display 31 for output display, to the earphone speaker 35 in the ear pad, and further to the transmission/reception device 36 in the case bodies 2Ra and 2La. The transmission/reception device 36 can use any communication means, wired or wireless.
[0022]
The CPU (central processing unit) board 33 is a board on which a CPU is mounted, and constitutes the speech recognition device and control means together with the storage unit 34 connected to it. The speech recognition device estimates the user's speech signal on the basis of the multi-channel audio data received by the parallel microphone arrays 30a and 30b, and outputs a recognition result. The sampling rate of the parallel microphone arrays 30a and 30b can be set arbitrarily, for example to 8 kHz, and the number of quantization bits can likewise be set arbitrarily, for example to 16 bits. To increase the processing accuracy, the sampling rate and the number of quantization bits are increased.
[0023]
(Image Display Means) The headset-type microphone-array speech input device may be provided with a small, thin display (for example, a liquid crystal, EL (electroluminescence), or plasma display), a head-mounted display, or the like as image display means, which visually shows the result of the speech position estimation processing.
[0024]
(Speech Recognition Device) FIG. 1 is a block diagram of the speech recognition device according to the present invention.
This speech recognition device is implemented by the CPU board 33 and the storage unit 34 of FIG. 7. The speech recognition device 40 includes a microphone-array processing unit 41 and a speech recognition processing unit 42. The microphone-array processing unit 41 comprises: a microphone-array speech input device 43 for the input sound; arrival-direction estimation means 45 that estimates, from the output of the device 43, the sound-wave arrival direction of a source located at a long distance; position estimation means 46 that estimates, from the output of the device 43, the position of a nearby source; sound-source separation processing means 44 that separates the sound of the target source from the output of the device 43 on the basis of the source information from the means 45 and 46; user-speech detection means 47 that detects the speech of the user (the wearer of the headset-type microphone-array speech input device) on the basis of the source position information from the means 45 and 46; and switching means 48 that switches and outputs the speech signal from the sound-source separation processing means 44 according to the detection signal from the user-speech detection means 47. The speech recognition processing unit 42 comprises feature correction processing means 49 that applies feature correction processing to the speech signal from the switching means 48, and speech recognition means 50 that performs speech recognition on the feature-corrected speech signal from the means 49 and outputs a recognition result.
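As an illustration of the block structure just described (units 41-50), the following Python sketch wires the same components together. All class and variable names, the box-shaped user-speech region, and the stand-in signal processing (channel averaging as "separation", an energy test as "recognition") are hypothetical simplifications for illustration, not the patent's actual algorithms.

```python
import numpy as np

class MicArrayProcessingUnit:
    """Sketch of unit 41: gates and separates the user's voice."""

    def __init__(self, user_region):
        # user_region: (min_xyz, max_xyz) box where the user's mouth is assumed
        self.user_region = user_region

    def estimate_near_position(self, frames):
        # Stand-in for position estimation means 46
        return frames["source_position"]

    def user_is_speaking(self, position):
        # Stand-in for user-speech detection means 47: the user is speaking
        # iff the estimated source position falls inside the assumed region
        lo, hi = self.user_region
        return all(l <= p <= h for p, l, h in zip(position, lo, hi))

    def process(self, frames):
        position = self.estimate_near_position(frames)
        if not self.user_is_speaking(position):   # switching means 48 blocks
            return None                           # ambient noise only
        # Stand-in for separation means 44: simple channel averaging
        return np.mean(frames["channels"], axis=0)

class SpeechRecognitionUnit:
    """Sketch of unit 42: feature correction (49) + recognition (50)."""

    def recognize(self, separated):
        corrected = separated - np.mean(separated)  # crude feature correction
        return "speech" if np.std(corrected) > 0.1 else "silence"

# usage: one block of 6-channel input with the source inside the user region
unit41 = MicArrayProcessingUnit(user_region=((-0.2, -0.2, 0.0), (0.2, 0.2, 0.3)))
unit42 = SpeechRecognitionUnit()
frames = {
    "channels": np.random.default_rng(0).normal(0, 1, (6, 160)),
    "source_position": (0.0, 0.0, 0.1),  # inside the assumed user region
}
separated = unit41.process(frames)
result = unit42.recognize(separated)
```

A source position outside the user region makes `process` return `None`, which models the switching means suppressing ambient noise before recognition.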
[0025]
The speech recognition apparatus using the microphone array of the present invention is composed of the following five elemental technologies, as also shown in FIG. 1: 1. position estimation of sound sources located near the microphone array; 2. estimation of the sound-wave arrival direction of sources located far from the microphone array; 3. user-speech detection; 4. sound-source separation processing; 5. speech recognition processing (Japanese Patent Application No. 2003-320183). These elemental technologies are described in detail below.
[0026]
(Sound Source Position Estimation) FIG. 8 is an explanatory view of the functions of the microphone array of the present invention. The microphones 1, 2, 3, 4 and the microphones 5, 6, 7, 8 are disposed facing one another as shown in FIG. 8, and the positions of the microphones and the sound source are assumed to be as shown in the figure. The method by which the microphone array estimates the position of a sound source located within a short distance of about 1 m from the array is described below.
[0027]
The plurality of microphones can be arranged at any position in the three-dimensional space.
Sound signals output from a sound source placed at an arbitrary position in a three-dimensional
space are received by Q microphones arranged at an arbitrary position in the three-dimensional
space. The distance Rq between the sound source and each microphone can be obtained by the
following equation.
[0028]
The propagation time τ q from the sound source to each microphone can be obtained by the
following equation, where the sound velocity is v. The gain gq for the narrow band signal of the
center frequency ω received by each microphone to that of the sound source is generally defined
as a function of the distance Rq between the sound source and the microphone and the center
frequency ω.
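The two equations referred to here are missing from the extraction; from the stated definitions (sound velocity v, distance Rq), they take the form:

```latex
\tau_q = \frac{R_q}{v}, \qquad g_q = g(\omega, R_q)
```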
[0029]
For example, a function such as the following, obtained experimentally with the gain as a function of the distance Rq, is used.
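The experimentally fitted function itself is not reproduced in this extraction. A typical inverse-distance attenuation model of the kind described would be (with α(ω) a fitted constant; both the form and the notation are assumptions, not the patent's equation):

```latex
g(\omega, R_q) = \frac{\alpha(\omega)}{R_q}
```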
[0030]
The transfer characteristic between the sound source and each microphone for a narrow-band signal of center frequency ω is expressed as in the following equation. A position vector a(ω, P0) representing the source at position P0 is then defined, as in the following equation, as a complex vector whose elements are the transfer characteristics between the source and each microphone for the narrow-band signal.
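The defining equation is missing from the extraction. With the gains g_q and propagation times τ_q defined above, the complex position vector whose elements are the source-to-microphone transfer characteristics takes the standard form (sign convention assumed):

```latex
a(\omega, P_0) = \left[\, g_1 e^{-j\omega\tau_1},\; g_2 e^{-j\omega\tau_2},\; \dots,\; g_Q e^{-j\omega\tau_Q} \,\right]^{\mathsf T}
```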
[0031]
The sound source position is estimated using the MUSIC method (a method in which the correlation matrix is eigendecomposed to obtain the signal subspace and the noise subspace, and candidate positions are checked via the reciprocal of the inner product between an arbitrary source position vector and the noise subspace), by the following procedure, based on the short-time Fourier transform of the q-th microphone input.
[0032]
The observation vector is defined as follows using these short-time Fourier transforms. Here, n is the index of the frame time. The correlation matrix is determined by the following equation from N consecutive observation vectors.
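The equations are missing from the extraction. Writing X_q(ω, n) for the short-time Fourier transform of the q-th microphone input (symbol assumed), the standard forms are:

```latex
x(\omega, n) = \left[\, X_1(\omega, n), \dots, X_Q(\omega, n) \,\right]^{\mathsf T},
\qquad
R(\omega) = \frac{1}{N} \sum_{n=1}^{N} x(\omega, n)\, x^{\mathsf H}(\omega, n)
```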
[0033]
Let the eigenvalues of this correlation matrix, arranged in descending order, and the corresponding eigenvectors, be
[0034]
so denoted. Then, the number of sound sources S is estimated by the following equation. Alternatively, a threshold may be provided for the eigenvalues, and the number of eigenvalues exceeding the threshold may be taken as the number of sound sources S. The matrix R n (ω) is defined from the basis vectors of the noise subspace as follows.
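The equations referred to in [0033]-[0034] are missing from the extraction. In the usual MUSIC formulation, with eigenvalues λ1 ≥ … ≥ λQ and eigenvectors e1, …, eQ of R(ω) (notation assumed), the S largest eigenvalues span the signal subspace, and the noise-subspace matrix is built from the remaining eigenvectors:

```latex
R_n(\omega) = \sum_{i=S+1}^{Q} e_i(\omega)\, e_i^{\mathsf H}(\omega)
```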
[0035]
Let the search range U of the frequency band and the search range of the source position estimate be defined.
[0036]
The following function is then calculated, and the coordinate vectors at which the function F(P) takes maximal values are determined. Here, it is assumed that P1, P2, ..., Ps are estimated as the coordinate vectors giving the S maximal values. Next, the power of the source at each coordinate vector is determined by the following equation.
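The two equations in [0035]-[0036] are missing from the extraction. In the standard MUSIC formulation, the position spectrum summed over the frequency search range U, and the corresponding source-power estimate, take forms such as (an assumed reconstruction, not the patent's exact equations):

```latex
F(P) = \sum_{\omega \in U} \frac{1}{a^{\mathsf H}(\omega, P)\, R_n(\omega)\, a(\omega, P)},
\qquad
P(P_s) = \sum_{\omega \in U} a^{\mathsf H}(\omega, P_s)\, R(\omega)\, a(\omega, P_s)
```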
[0037]
Then, two thresholds Fthr and Pthr are prepared, and when F (Ps) and P (Ps) in each position
vector satisfy the following condition,
[0038]
It is determined that an utterance has occurred at the coordinate vector P1 within the N consecutive frame times.
The sound-source position estimation processes N consecutive frames as one block. To estimate the source position more stably, the number of frames N is increased, and/or it is determined that an utterance has occurred only if the condition of equation (30) is satisfied in all of Nb consecutive blocks. The number of blocks is set arbitrarily.
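As a concrete illustration of the MUSIC-based near-field position estimation described above, the following Python sketch simulates one narrow-band source, builds the correlation matrix from N snapshots, eigendecomposes it, and scans a coarse position grid for the spectrum maximum. The array geometry (two parallel three-microphone bars), the frequency, the 1/R gain model, and all numeric values are illustrative assumptions, not the patent's parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
v = 340.0                        # speed of sound [m/s]
omega = 2 * np.pi * 2000.0       # one narrow-band bin at 2 kHz
# Two parallel 3-microphone bars, loosely like the headset arrays (illustrative)
mics = np.array([[0.00, 0.0, 0.0], [0.06, 0.0, 0.0], [0.12, 0.0, 0.0],
                 [0.00, 0.2, 0.0], [0.06, 0.2, 0.0], [0.12, 0.2, 0.0]])
Q = len(mics)

def steering(p):
    """Near-field position vector a(omega, P): 1/Rq gain, exp(-j*omega*Rq/v) phase."""
    r = np.linalg.norm(mics - p, axis=1)
    a = (1.0 / r) * np.exp(-1j * omega * r / v)
    return a / np.linalg.norm(a)

# Simulate N snapshots of a single source at p_true plus small sensor noise
p_true = np.array([0.06, 0.10, 0.15])
N = 200
s = rng.normal(size=N) + 1j * rng.normal(size=N)
X = np.outer(steering(p_true), s) + 0.01 * (
    rng.normal(size=(Q, N)) + 1j * rng.normal(size=(Q, N)))

R = X @ X.conj().T / N               # correlation matrix from N observation vectors
eigval, eigvec = np.linalg.eigh(R)   # eigenvalues in ascending order
S = 1                                # number of sources (could be thresholded)
En = eigvec[:, :Q - S]               # noise-subspace eigenvectors

def music_spectrum(p):
    """Reciprocal of the projection of a(P) onto the noise subspace."""
    a = steering(p)
    return 1.0 / np.real(a.conj() @ En @ En.conj().T @ a)

# Coarse grid search within ~20 cm of the array; the maximum gives the estimate
grid = [np.array([x, y, z])
        for x in (0.0, 0.06, 0.12)
        for y in (0.0, 0.05, 0.10, 0.15, 0.20)
        for z in (0.05, 0.10, 0.15, 0.20)]
p_hat = max(grid, key=music_spectrum)
print("estimated source position:", p_hat)
```

The block-wise decision rule of the text would then be applied on top of this: the grid maximum counts as an utterance only if the spectrum and power thresholds are satisfied for all Nb consecutive blocks.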
When the sound source is moving slowly enough that it can be regarded as approximately stationary within the time of N consecutive frames, its movement trajectory can be tracked by the above method. (Sound-Wave Arrival Direction Estimation of Ambient Noise)
[0039]
A method of estimating, with the microphone array, the direction from which the sound waves of a source located far from the array arrive is described below. The microphones can be arranged at any positions in three-dimensional space. Sound waves arriving from a long distance are regarded as being observed as plane waves.
[0040]
FIG. 2 is an explanatory view of the sound receiving function using the microphone array of the present invention. As an example, FIG. 2 shows three microphones m1, m2, and m3, arranged at arbitrary positions, receiving sound waves arriving from a source. The point c indicates the reference point around which the arrival direction of the sound waves is estimated, and the plane s shows a cross-section of the plane wave containing the reference point c. The normal vector n of the plane s is defined by the following equation, with its direction opposite to the propagation direction of the sound wave.
[0041]
The sound-wave arrival direction of a source in three-dimensional space is represented by the two parameters (θ, φ). Sound waves arriving from the direction (θ, φ) are received by each microphone; the Fourier transform is computed to resolve each received signal into narrow-band signals, and the gain and phase of each narrow-band signal are expressed as a complex number. A vector whose elements are these values over all of the received signals, for each narrow-band signal, is defined as the position vector of the source. In the subsequent processing, a sound wave arriving from the direction (θ, φ) is represented by this position vector, which is determined concretely as follows. The distance rq between the q-th microphone and the plane s is determined by the following equation.
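The equations in [0040]-[0041] are missing from the extraction. With the arrival direction parameterized by azimuth θ and elevation φ (a spherical-coordinate convention assumed here, not taken from the patent), the unit normal n of the plane s pointing back toward the source, and the signed microphone-to-plane distance, would be:

```latex
n(\theta, \varphi) = \left(\cos\varphi\cos\theta,\; \cos\varphi\sin\theta,\; \sin\varphi\right),
\qquad
r_q = n \cdot (P_q - c)
```

where P_q is the q-th microphone position and c the reference point; this sign choice makes r_q positive on the source side of the plane, consistent with the text that follows.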
[0042]
The distance rq is positive if the microphone is located on the source side of the plane s and, conversely, takes a negative value if it is on the side away from the source. With the sound velocity v, the propagation time Tq between the microphone and the plane s is expressed by the following equation.
[0043]
The gain relative to the amplitude in the plane s, at a distance rq from it, is defined as follows as a function of the center frequency ω of the narrow-band signal and the distance rq. The phase difference at a position separated by the distance rq from the plane s is expressed by the following equation.
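The equations in [0042]-[0043] are missing from the extraction; from the stated definitions they take the form (notation assumed; for an ideal far-field plane wave the gain is commonly taken as constant):

```latex
T_q = \frac{r_q}{v},
\qquad
g_q = g(\omega, r_q),
\qquad
\Delta\phi_q = \omega T_q = \frac{\omega\, r_q}{v}
```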
[0044]
From the above, the gain and the phase difference of the narrow band signal observed by each
microphone are represented by the following equation with reference to the plane s.
[0045]
When sound waves arriving from the direction (θ, φ) are observed with Q microphones, the position vector of the source is defined, as in the following equation, as the vector whose elements are the values obtained from equation (26) for each microphone.
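Equation (26) and this defining equation are missing from the extraction. Combining the per-microphone gain and phase above, the far-field position vector takes the standard steering-vector form (the sign of the exponent is an assumed convention, chosen so that microphones on the source side of the plane s lead in phase):

```latex
a(\omega, \theta, \varphi) = \left[\, g_1 e^{\,j\omega T_1},\; \dots,\; g_Q e^{\,j\omega T_Q} \,\right]^{\mathsf T}
```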
[0046]
Once the position vector of the sound source is defined, the direction of arrival estimation of the
sound wave is performed using the MUSIC method.
Using the matrix Rn(ω) given by equation (15), the following function is calculated over the
search region I of the sound wave arrival direction estimation.
[0047]
Then, the directions (θ, φ) at which the function J(θ, φ) takes local maxima are
determined. Here, assuming that there are K sound sources, K sound wave arrival
directions ((θ1, φ1), ..., (θK, φK)) giving maxima are estimated. Next, the power of the
sound source in each sound wave arrival direction is determined by the following equation.
[0048]
Then, two thresholds Jthr and Qthr are prepared, and when J(θk, φk) and Q(θk, φk) for each
direction of arrival satisfy the following conditions,
[0049]
it is determined that there is an utterance in the arrival direction (θk, φk) within the N
consecutive frame times.
The process of estimating the direction of arrival of sound waves treats N consecutive frames as
one block. To estimate the arrival direction more stably, the number of frames N is increased,
and/or it is decided that a sound wave has arrived from a given direction only when the
condition of equation (31) is satisfied in all of Nb consecutive blocks. The number of blocks is
set arbitrarily. If the sound source is moving slowly enough that it can be regarded as
approximately stationary within the time of N consecutive frames, the above method can also
track the trajectory of the arrival direction of the sound wave.
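The MUSIC search described above (the function J over the candidate directions, with the noise subspace obtained from the correlation matrix) can be sketched as follows. This is a generic narrow-band MUSIC implementation under the assumption that the noise subspace is spanned by the eigenvectors of the Q − K smallest eigenvalues; it is not the patent's exact formulation.

```python
import numpy as np

def music_spectrum(R, steering, K):
    """MUSIC pseudo-spectrum over a grid of candidate directions.

    R        : (Q, Q) spatial correlation matrix of one narrow-band signal
    steering : (D, Q) position vectors, one row per candidate direction
    K        : assumed number of sound sources
    """
    # Eigen-decompose R; eigenvectors belonging to the Q - K smallest
    # eigenvalues span the noise subspace (eigh returns ascending order).
    w, E = np.linalg.eigh(R)
    En = E[:, : R.shape[0] - K]
    # The spectrum is large where a candidate position vector is (nearly)
    # orthogonal to the noise subspace, i.e. where a sound source lies.
    proj = steering.conj() @ En                      # (D, Q - K)
    return 1.0 / np.maximum(np.sum(np.abs(proj) ** 2, axis=1), 1e-12)
```

Peaks of the returned spectrum correspond to the estimated arrival directions; the thresholding and Nb-block voting of the text would then be applied on top of these per-frame estimates.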
[0050]
The position estimation result of the near-field sound source and the sound wave arrival
direction estimation result of the far-field sound source play an important role in the
subsequent speech detection processing and sound source separation processing. However, when a
near-field sound source and a far-field sound source occur simultaneously and the power of the
near-field sound source becomes significantly larger than that of the sound wave coming from the
far-field sound source, the arrival direction of the far-field sound source may not be estimated
well. Such a case is handled by using the arrival direction estimation result of the far-field
sound source obtained immediately before the near-field sound source occurs.
[0051]
(Utterance Detection Process) When there are a plurality of sound sources, it is generally
difficult to specify which sound source is the voice to be recognized. On the other hand, in a
system employing a speech interface, a user utterance area, representing where the user of the
system speaks relative to the system, can be determined in advance. In this case, even if a
plurality of sound sources exist around the system, once the position of each sound source and
the arrival direction of each sound wave have been estimated by the above-described method, the
system can easily identify the user's voice by selecting the sound source that falls within the
user utterance area assumed in advance.
[0052]
The presence of a sound source is detected when the conditions of equations (20) and (31) are
satisfied, and the user's utterance is detected when, in addition, the conditions on the sound
source position or the arrival direction of the sound wave are satisfied. This detection result
plays an important role in the speech recognition processing as speech-section information. When
speech recognition is performed, it is necessary to detect the start point and the end point of
the speech section from the input signal. However, it is not always easy to detect a speech
section in a noise environment where ambient noise is present, and in general, if the start point
of the speech section is shifted, the speech recognition accuracy is significantly degraded. On
the other hand, even if there are a plurality of sound sources, the function represented by
equation (18) or (29) shows a sharp peak at the position of each sound source or at the arrival
direction of each sound wave. Therefore, the speech recognition apparatus of the present
invention, which performs speech-section detection using this information, has the advantage of
being able to perform robust speech-section detection even in the presence of a plurality of
ambient noises, and of maintaining high speech recognition accuracy.
[0053]
For example, a user utterance area as shown in FIG. 3 can be defined. FIG. 3 is a functional
explanatory diagram of the speech detection processing according to the present invention.
Although the drawing shows only the XY plane for simplicity, an arbitrary user utterance area
can in general be defined in three-dimensional space in the same way. In FIG. 3, assuming that
processing is performed using eight microphones m1 to m8 arranged at arbitrary positions, a
user utterance area is defined in each of the search area of the near-field sound source and
the search area of the far-field sound source. The search area for the near-field sound source
is the rectangular region whose diagonal is the straight line connecting the two points (PxL,
PyL) and (PxH, PyH), and within it the two rectangular regions whose diagonals are the straight
lines connecting (PTxL1, PTyL1) with (PTxH1, PTyH1) and (PTxL2, PTyL2) with (PTxH2, PTyH2) are
defined as the user utterance area. Therefore, among the sound source positions determined by
equation (20) to contain an utterance, the user's voice can be identified simply by selecting
the one whose coordinate vector lies within the user utterance area.
[0054]
On the other hand, for the search area of the far-field sound source, the range of directions
from angle θL to θH with respect to the point C is defined as the search area, and within it
the range from angle θTL1 to θTH1 is defined as the user utterance area. Therefore, among the
arrival directions of sound waves determined by equation (31) to contain an utterance, the
user's voice can be identified by selecting the one that lies within the user utterance area.
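The selection steps above amount to simple membership tests on the estimated position or arrival direction. A minimal sketch follows; the function names and the rectangle representation are illustrative assumptions, not taken from the text.

```python
def in_near_field_area(px, py, areas):
    """True if the estimated source position (px, py) falls inside one of the
    rectangular user utterance areas.

    areas : list of rectangles ((x_lo, y_lo), (x_hi, y_hi)), e.g. the two
    regions (PTxL1, PTyL1)-(PTxH1, PTyH1) and (PTxL2, PTyL2)-(PTxH2, PTyH2)
    of FIG. 3.
    """
    return any(xl <= px <= xh and yl <= py <= yh
               for (xl, yl), (xh, yh) in areas)

def in_far_field_area(theta, theta_lo, theta_hi):
    """True if the estimated arrival direction theta lies inside the angular
    user utterance area [theta_lo, theta_hi] (all in degrees)."""
    return theta_lo <= theta <= theta_hi
```

A detected sound source passes the utterance test only if both the power/peak conditions and one of these membership tests hold.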
[0055]
(Sound Source Separation Process) The sound source separation process, which emphasizes the
user's voice and suppresses ambient noise using the position estimation result of the detected
sound source or the arrival direction estimation result of the sound wave, is described below.
The utterance position or arrival direction of the user's voice is obtained by the utterance
detection process, and the source positions or arrival directions of the ambient noises have
already been estimated. Using these estimation results, the sound source position vectors of
equations (8) and (27), and σ representing the variance of the omnidirectional noise, the
matrix V(ω) is defined by the following equation.
[0056]
The eigenvalues of this correlation matrix, arranged in descending order, are denoted as follows,
[0057]
and the eigenvectors corresponding to each of them are denoted accordingly.
[0058]
Here, since the correlation matrix V(ω) contains (S + K) sound sources, S near-field sources
and K far-field sources, the matrix Z(ω) is defined from the (S + K) eigenvalues and
eigenvectors counted from the largest eigenvalue.
Then, a separation filter W(ω) for emphasizing the voice of the user present at the near-field
coordinate vector P is given by the following equation.
[0059]
By multiplying the observation vector of equation (10) by the separation filter of equation
(36), the voice v(ω) of the user present at the coordinate vector P can be obtained.
[0060]
The waveform signal of the emphasized user speech can be obtained by calculating the inverse
Fourier transform of equation (37).
[0061]
On the other hand, the separation filter M(ω) for emphasizing the voice of a user located in
the far-field direction (θ, φ) is given by the following equation.
[0062]
By multiplying the separation filter of equation (38) by the observation vector of equation (10),
the emphasized speech v (ω) of the user in the direction (θ, φ) is obtained.
[0063]
The waveform signal of the emphasized user speech can be obtained by calculating the inverse
Fourier transform of equation (39).
If the sound source is moving at such a speed that it can be regarded
as approximately stationary within the time of N consecutive frames, the enhanced voice of the
moving user can also be obtained by the above method.
[0064]
(Speech Recognition Process) Although the sound source separation process is effective against
directional noise, some noise remains in the case of nondirectional noise.
In addition, little suppression effect can be expected for noise generated in a short time, such
as sudden noise.
Therefore, for recognition of the user voice emphasized by the sound source separation process,
the influence of residual noise is reduced by using a speech recognition engine incorporating,
for example, the feature correction method described in Japanese Patent Application No.
2003-320183, "Correction method of background noise distortion and speech recognition system
using it."
The speech recognition engine of the present invention is not limited to that of Japanese Patent
Application No. 2003-320183; speech recognition engines implementing various other noise-robust
methods are also conceivable.
[0065]
The feature correction method described in Japanese Patent Application No. 2003-320183 corrects
the feature amounts of noise-superimposed speech based on the Hidden Markov Model (HMM) that the
speech recognition engine already holds as a template model for speech recognition.
The HMM is trained on Mel-Frequency Cepstrum Coefficients (MFCC) obtained from clean speech
without noise. For this reason, there is the advantage that it is relatively easy to incorporate
the feature correction method into an existing recognition engine, without having to prepare new
parameters for feature correction. In this method, noise is modeled as a stationary component
plus a non-stationary component that changes temporarily, and the
stationary component of the noise is estimated from the several frames immediately before the
speech.
[0066]
A copy of each distribution possessed by the HMM is generated, and the stationary component of
the estimated noise is added to it to generate the feature amount distribution of
stationary-noise-superimposed speech. By evaluating the posterior probability of the feature
amount of the observed noise-superimposed speech against this distribution, distortion due to
the stationary component of the noise is absorbed. However, since distortion due to the
non-stationary component of the noise is not taken into account by this processing alone, the
posterior probability obtained by the above means is not accurate when a non-stationary noise
component is present. On the other hand, by using the HMM for the feature correction, the
temporal structure of the feature amount time series and the accumulated output probability
obtained along it can be used. By assigning a weight calculated from the accumulated output
probability to the above-mentioned posterior probability, it is possible to improve the
reliability of the posterior probability degraded by the temporarily changing non-stationary
component of the noise.
[0067]
When speech recognition is performed, it is necessary to detect the start point and the end
point of the speech section from the input signal. However, it is not always easy to detect a
speech section in a noise environment where ambient noise is present. In particular, since the
speech recognition engine incorporating the feature correction estimates the stationary features
of the ambient noise from the several frames immediately before the start of speech, recognition
accuracy is significantly degraded if the start point of the speech section is shifted. On the
other hand, even if there are a plurality of sound sources, the function represented by equation
(18) or (29) shows a sharp peak at the position of each sound source or at the arrival direction
of each sound wave. Therefore, the speech recognition apparatus according to the present
invention, which performs speech-section detection using this information, can perform robust
speech-section detection even in the presence of a plurality of ambient noises, and can maintain
high speech recognition accuracy.
[0068]
An example in which voice recognition experiments were performed in a noise environment will now
be described with reference to FIG. 8, using a microphone array in which eight microphones are
linearly arranged at intervals of 2 cm so as to be symmetrical with respect to the point C on
the X axis. In this example, it is assumed that all sound sources are located at a long
distance; the search area for the arrival direction was set to θL = 0°, θH = 180°, and the
user utterance area to θTL1 = 70°, θTH1 = 110°. The user uttered five simple command voices
a total of 19 times from a position 1.5 m in front of the microphone array (θ = 90°). As
ambient noise, different television sounds were played from two loudspeakers placed at positions
1.5 m away in the directions θ = 20° and θ = 160°, respectively. In addition, there were
noises such as the fans of several computers located about 5 m from the microphone array and
their reflections.
[0069]
FIG. 4 shows the waveform signal received by one of the eight microphones; the horizontal axis
represents time and the vertical axis amplitude. FIG. 5 shows the waveform signal of the user
voice emphasized by performing the sound wave arrival direction estimation, utterance detection,
and sound source separation processing; again, the horizontal axis represents time and the
vertical axis amplitude. When the emphasized speech was recognized with an ordinary speech
recognition decoder that does not include the correction processing of the speech feature
amounts, only 11 of the 19 utterances were recognized correctly. This is mainly because
omnidirectional noise, which cannot be eliminated by the microphone array processing, remains in
the emphasized sound of FIG. 5. On the other hand, a speech recognition decoder incorporating
the above-described feature correction, together with the utterance detection signal obtained by
the utterance detection processing, recognized all 19 utterances correctly when the same
emphasized speech was recognized.
[0070]
The present invention is applicable to voice recognition in ticket vending machines and various
other vending machines, realization of voice remote control by incorporation into home
appliances, voice recognition in car navigation systems, voice control of vehicles such as
electric wheelchairs, and voice control of equipment in noisy environments such as plants.
[0071]
FIG. 1 is a block diagram of the speech recognition apparatus of the present invention.
FIG. 2 is an explanatory diagram of the sound reception function using the microphone array of
the present invention. FIG. 3 is a functional explanatory diagram of the speech detection
processing according to the present invention. FIG. 4 shows the waveform signal received by one
of the eight microphones. FIG. 5 shows the waveform signal of the user voice emphasized by
performing the sound wave arrival direction processing, speech detection processing, and sound
source separation processing. FIG. 6 is a schematic view of the headset microphone array voice
input device of the present invention. FIG. 7 is a block diagram of the processing circuit
stored in the case main body of the present invention. FIG. 8 is a functional explanatory
diagram of the microphone array of the present invention.
Explanation of Reference Signs
[0072]
1 Headset; 2R, 2L Earpad storage cases; 3 Headband; 4R, 4L Supports; 5 Microphone; 6 Microphone
array; 30a, 30b Parallel microphone arrays; 31 Display; 32 Microphone amplifier and ADC; 33 CPU
board; 34 Memory device; 35 Earphone speaker; 36 Transmitter/receiver; 40 Voice recognition
device; 41 Microphone array processing unit; 42 Speech recognition processing unit; 43
Microphone array voice input device; 44 Sound source separation processing means; 45 Sound wave
arrival direction estimation means for far-field sound sources; 46 Position estimation means for
near-field sound sources; 47 User speech detection means; 48 Switching device; 49 Feature
correction processing means; 50 Voice recognition means; m1, m2, m3, m4, m5, m6, m7, m8
Microphones