Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2017153065
Abstract: The present invention provides a speech recognition method and the like that improve the responsiveness of speech recognition while suppressing loss of the beginning of speech. According to the speech recognition method, sound information is acquired through a plurality of microphones, a sound source section containing sound is detected from the sound information, and a direction is estimated for a voice section within the sound source section. The amount of sound information to be held in the buffer is determined based on the information on the sound source section, the information on the estimated direction, and the information on the convergence state of the adaptive processing; sound information is held in the buffer according to that buffer amount; beamforming processing is performed using the sound information held in the buffer and the filter coefficients to acquire voice information; and speech recognition is performed on the voice information subjected to the beamforming processing. In this method, immediately after processing of the sound information starts, a buffer amount sufficient for convergence of the adaptive processing is determined as the amount to be held in the buffer. [Selected figure] Figure 9
Speech recognition method, speech recognition apparatus and program
[0001]
The present disclosure relates to a speech recognition method, a speech recognition device, and a
program.
[0002]
03-05-2019
1
A technology for recognizing a specific sound signal using a sound collection device such as a
microphone, that is, a speech recognition technology has been proposed.
For example, Patent Document 1 discloses a microphone system which picks up a specific sound
signal using a plurality of directional microphones in a sound field having a plurality of speakers.
In the microphone system described in Patent Document 1, each microphone is disposed toward a speaker. Further, among the audio signals of the microphones, an audio signal whose level rises above the noise level in the silent state is selected and output.
[0003]
Japanese Patent Application Laid-Open No. 7-336790
[0004]
The microphone system described in Patent Document 1 includes a delay element that delays the audio signal in order to suppress truncation of the beginning of the speech caused by the processing time of the audio signal. The audio signals of the plurality of microphones are amplified by amplifiers, passed through the delay elements, and then selected for output. The delay amount is set to the largest of the delays caused by the various processing elements for the audio signal in the microphone system.
[0005]
Thus, the present disclosure provides a speech recognition method, a speech recognition device, and a program that improve the responsiveness of speech recognition while suppressing loss of the beginning of speech.
[0006]
A speech recognition method according to an aspect of the present disclosure is a speech recognition method for recognizing speech from sound information acquired by a plurality of microphones, and includes: (a1) a process of acquiring sound information through the plurality of microphones; (a2) a process of detecting a sound source section including sound from the acquired sound information; (a3) a process of acquiring an estimated direction of speech by estimating the direction of a voice section within the detected sound source section; (a4) adaptive processing for estimating, using the acquired sound information, filter coefficients for extracting voice information in the estimated direction; (a5) a process of determining the buffer amount of sound information to be held in a buffer based on the information on the sound source section, the information on the estimated direction, and the information on the convergence state of the adaptive processing; (a6) a process of holding the acquired sound information in the buffer according to the determined buffer amount; (a7) a process of acquiring voice information by beamforming processing using the sound information held in the buffer and the filter coefficients estimated by the adaptive processing; and (a8) a process of performing speech recognition on the voice information subjected to the beamforming processing. In the process (a5), immediately after processing of the acquired sound information starts, a buffer amount sufficient for convergence of the adaptive processing is determined as the amount to be held in the buffer.
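The steps (a1)-(a8) above can be sketched as a single processing loop. The following is a minimal illustration only; the function name, the energy threshold, the stubbed direction estimate, the simulated convergence, and the frame-count buffer sizes are all hypothetical placeholders, not the patent's implementation:

```python
import numpy as np

def recognize(frames, converged_threshold=0.01):
    """Illustrative sketch of steps (a1)-(a8); names and thresholds are assumptions."""
    buffer = []          # (a6) sound information held according to the buffer amount
    results = []
    err = 1.0            # expected error J of the adaptive filter (monitored for convergence)
    for frame in frames:                          # (a1) acquire sound information
        is_source = np.mean(frame**2) > 1e-4      # (a2) detect sound source section by energy
        direction = 0.0 if is_source else None    # (a3) direction estimation (stubbed)
        err *= 0.5                                # (a4) adaptive processing reduces error J
        converged = err < converged_threshold     # convergence state of the adaptive processing
        # (a5) determine buffer amount: large until convergence, small afterwards
        buf_amount = 8 if not converged else 2
        buffer.append(frame)
        buffer = buffer[-buf_amount:]             # (a6) keep only buf_amount frames
        enhanced = np.mean(buffer, axis=0)        # (a7) beamforming (stubbed as averaging)
        results.append(enhanced)                  # (a8) would feed a speech recognizer
    return results
```

The point of the sketch is the interaction in step (a5): the buffer is kept large immediately after processing starts, and shrunk only once the simulated adaptive processing has converged.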
[0007]
A speech recognition apparatus according to an aspect of the present disclosure is a speech recognition apparatus that recognizes speech from sound information acquired by a plurality of microphones, and includes: a sound information acquisition unit that acquires sound information from the plurality of microphones; a sound source section detection unit that detects a sound source section including sound from the sound information; a direction estimation unit that acquires an estimated direction of a voice section within the sound source section by direction estimation; an adaptive processing unit that performs adaptive processing to estimate, using the sound information, filter coefficients for extracting voice in the estimated direction; an adaptive processing convergence monitoring unit that acquires information on the convergence state of the adaptive processing; a buffer amount determination unit that determines the buffer amount of sound information to be held in a buffer based on the information on the sound source section, the information on the estimated direction, and the information on the convergence state of the adaptive processing; a buffer that holds the sound information according to the determined buffer amount; a beamforming processing unit that acquires voice information by beamforming processing using the sound information held in the buffer and the filter coefficients; and a speech recognition unit that performs speech recognition on the voice information subjected to the beamforming processing. The buffer amount determination unit determines a buffer amount sufficient for convergence of the adaptive processing as the amount to be held in the buffer immediately after the start of processing of the sound information.
[0008]
A program according to an aspect of the present disclosure is a program to be executed by a computer that: (c1) acquires sound information from a plurality of microphones; (c2) detects a sound source section including sound from the sound information; (c3) obtains an estimated direction of a voice section within the sound source section by direction estimation; (c4) estimates, by adaptive processing using the sound information, filter coefficients for extracting voice information in the estimated direction; (c5) determines the buffer amount of sound information to be held in a buffer based on the information on the sound source section, the information on the estimated direction, and the information on the convergence state of the adaptive processing; (c6) holds the sound information in the buffer according to the determined buffer amount; (c7) performs beamforming processing using the sound information held in the buffer and the filter coefficients to obtain voice information; and (c8) performs speech recognition on the voice information after the beamforming processing. Immediately after the start of processing of the sound information, the program holds in the buffer an amount of sound information sufficient for convergence of the adaptive processing.
[0009]
The comprehensive or specific aspects described above may be realized by a system, a device, a method, an integrated circuit, a computer program, or a non-transitory recording medium such as a computer-readable CD-ROM, and may be implemented by any combination of an apparatus, a method, an integrated circuit, a computer program, and a recording medium.
[0010]
According to the speech recognition method and the like of the present disclosure, it is possible to improve the responsiveness of speech recognition while suppressing loss of the beginning of speech.
[0011]
FIG. 1 is a diagram showing the relationship between the delay of the speech waveform input to the microphone and the waveform of the signal output by the conventional speech recognition method.
FIG. 2 is a block diagram showing a schematic configuration of the speech recognition apparatus according to the first embodiment.
FIG. 3 is a diagram showing an example of the process flow of detection of a sound source section.
FIG. 4 is a diagram showing an example of the process flow of sound direction estimation.
FIG. 5 is a diagram showing an example of the process flow of filter coefficient estimation in adaptive processing.
FIG. 6 is a diagram showing an example of the expected value of the error for each update of the filter coefficient in the estimation of the filter coefficients.
FIG. 7 is a diagram illustrating an example of the flow of the process of determining the buffer amount.
FIG. 8 is a diagram showing an example of a list of cases of buffer amounts determined by the process of FIG. 7.
FIG. 9 is a flowchart showing an example of the flow of the operation of the speech recognition apparatus according to the first embodiment.
FIG. 10 is a diagram for explaining an example of holding of sound information in the buffer.
FIG. 11 is a diagram comparing the voice waveform input to the microphone, the waveform of the signal output by the conventional voice recognition method, and the waveform of the signal output by the voice recognition apparatus according to the first embodiment.
FIG. 12 is a block diagram showing a schematic configuration of the speech recognition apparatus according to the second embodiment.
FIG. 13 is a flowchart showing an example of the operation flow of the speech recognition apparatus according to the second embodiment.
FIG. 14 is a diagram comparing the voice waveform input to the microphone with the waveform of the first output signal output by the first beamforming processor and the waveform of the second output signal output by the second beamforming processor.
[0012]
[Findings of the Inventors] The inventors of the present disclosure examined speech recognition technology and arrived at the following findings. Speech recognition technology may be applied to an environment in which speakers are assumed to exist in various directions, as in a speech interface of an interactive robot. In such applications, in order to recognize speech with high accuracy even in a noisy environment, it is necessary to extract, with high quality, speech uttered by speakers from various directions.
[0013]
Prior art for this purpose includes microphone-array technology, such as techniques that estimate the speaker direction based on the arrival time differences of sound waves at a plurality of microphones, and beamformers that emphasize a sound source in an arbitrary direction by controlling the directivity of the sound collection area.
[0014]
Methods of forming a beam in a desired direction include a method using a fixed filter, in which predetermined filter coefficients are set in the beamformer, and a method using an adaptive filter, in which the filter coefficients are sequentially estimated by adaptive processing using the observation signal, forming a beam in the desired direction while forming a blind spot of the beam in an unwanted noise direction.
[0015]
Here, the adaptive processing is processing for estimating filter coefficients by iterative
calculation so that a signal obtained as a result of processing an observation signal with a filter
becomes a desired signal.
In the adaptive processing, when the conditions of the sound sources, such as noise and the desired sound source, change with time, the filter coefficients fluctuate largely at each estimation by the adaptive processing, approaching a different optimum value for each condition of the sound sources. When the conditions of the sound sources do not change, the fluctuation of the filter coefficients at each estimation gradually decreases, and the coefficients converge to the optimum value. Thus, the adaptive processing converges as the filter coefficients approach or converge to the optimum value.
[0016]
In speech enhancement using a beamformer with an adaptive filter, the filter coefficients do not converge sufficiently at the unstable beginning of an utterance immediately after the speaker direction changes. Because filter coefficients optimal for emphasizing the speaker's speech have not yet been estimated, the beginning of the speech may be lost and cannot be picked up.
[0017]
Since the conditions of the sound sources, such as the speaker and noise, usually change with time, a method that obtains and holds filter coefficients in advance would have to consider all sound source directions and their combinations. A large amount of buffer space would therefore be required to hold the filter coefficients, which is not realistic.
[0018]
Therefore, the prior art uses a buffer that holds the sound information acquired by the microphones. If the buffer amount is sufficient for the adaptive processing in the beamformer to converge, the beamformer processes the sound information held in the buffer after the adaptive processing has converged, so that high-quality speech without loss of its beginning can be obtained, and speech recognition with few recognition errors can be realized.
[0019]
However, since the processing by the beam former is performed after the sound information
acquired by the microphone is held in the buffer, a delay occurs before the sound information
acquired by the microphone is processed by the beam former. Specifically, this delay is a delay
corresponding to a time during which processing by the beam former is performed on sound
information held in the buffer. Referring to FIG. 1, there is shown a relationship of delay between
an audio waveform which is an input signal of a microphone and a waveform of an output signal
in processing using a buffer according to the prior art. As shown in FIG. 1, a delay corresponding to the time for the beamformer to process the sound information held in the buffer is always required before the result of speech recognition is obtained, so there was a problem that the speech recognition response becomes slow. In order to solve this problem of the prior art, the present inventors examined a technology that has high speech recognition performance and a quick speech recognition response even when the speaker direction changes in various ways.
[0020]
To solve the above problem, the present inventors found that the amount of buffer used for the speech enhancement processing by the beamformer can be increased or decreased based on the direction of the speech, the convergence state of the adaptive processing, and the information on the sound source section. Specifically, for example, immediately after the direction of the speech, such as the speaker direction, changes, the buffer amount is increased regardless of the convergence state of the adaptive processing and the information on the sound source section, and the filter coefficients used for the processing in the beamformer are obtained using a buffer amount sufficient for the convergence of the adaptive processing. When the direction of the speech, such as the speaker direction, is constant, whether to decrease the buffer amount is determined based on the information on the convergence state of the adaptive processing and on the sound source section. By reducing the amount of buffer used for the adaptive processing, the real-time performance of the adaptive processing is improved. For example, when the adaptive processing has converged and a silent section, such as a breath or a break in the speech, has been detected, the speech enhancement processing by the beamformer is not performed on the sound information of the detected silent section. Furthermore, the buffer amount used in the processing of the silent section and of the sections after it is reduced. With the above configuration, the amount of buffer used for the processing by the beamformer can be increased or decreased according to the direction of the speech and the convergence state of the adaptive processing.
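The buffer-amount policy described in this paragraph can be sketched as a small decision function. The concrete sizes and the zero-buffer treatment of silent sections are illustrative assumptions, not values from the patent:

```python
def decide_buffer_amount(direction_changed, converged, in_silent_section,
                         large=16, small=4):
    """Illustrative sketch of the buffer-amount decision described above.

    direction_changed: the estimated speech direction just changed
    converged:         the adaptive processing has converged
    in_silent_section: the current frames fall in a detected silent section
    large/small:       hypothetical buffer sizes (in frames)
    """
    if direction_changed:
        # Immediately after a direction change, use a buffer large enough
        # for the adaptive processing to converge, regardless of other state.
        return large
    if converged and in_silent_section:
        # Converged and silent: skip enhancement and shrink the buffer.
        return 0
    if converged:
        # Direction constant and converged: a small buffer suffices.
        return small
    return large
```

The ordering of the checks mirrors the text: a direction change overrides the convergence state and sound-source-section information, while buffer reduction applies only once the direction is constant.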
[0021]
From this, the inventors found that high-quality speech without loss of its beginning can be obtained by using a buffer amount sufficient for the adaptive processing immediately after the direction of the speech changes. Furthermore, they found that when the direction of the speech is constant, the speech recognition response can be made faster by omitting the speech enhancement processing for silent sections and by reducing the buffer amount used in the processing of the silent section and subsequent sections. The inventors thus found that speech recognition with high recognition performance and quick response is possible even when the direction of the speech changes.
[0022]
Hereinafter, embodiments disclosed by the present inventors based on the above findings will be described concretely with reference to the drawings. Note that all of the embodiments described below show general or specific examples. The numerical values, shapes, materials, components, arrangement positions and connection forms of the components, the processing, the order of the processing, and the like described in the following embodiments are merely examples and are not intended to limit the present disclosure. Further, among the components in the following embodiments, components not described in the independent claim representing the highest-level concept are described as optional components. In addition, ordinal numbers such as first, second, and third may be added to components and the like as appropriate.
[0023]
Further, in the following description of the embodiments, expressions qualified with "substantially", such as substantially parallel or substantially orthogonal, may be used. For example, "substantially parallel" means not only completely parallel but also approximately parallel, that is, including a difference of several percent or so. The same applies to other expressions qualified with "substantially". Each drawing is a schematic view and is not necessarily illustrated exactly. In the drawings, substantially identical components are denoted by the same reference numerals, and overlapping descriptions may be omitted or simplified.
[0024]
First Embodiment [Configuration of Speech Recognition Device] The configuration of the speech
recognition device 100 according to the first embodiment will be described with reference to FIG.
FIG. 2 is a block diagram showing a schematic configuration of the speech recognition apparatus
100 according to the first embodiment. The speech recognition apparatus 100 is connected to
the plurality of microphones 1 and is configured to receive an electrical signal generated by the
plurality of microphones 1 collecting sound. The microphone 1 acquires sound information of
the collected sound as an electrical signal. Hereinafter, the electrical signal generated by the
microphone 1 is also referred to as an observation signal or a sound signal. The plurality of
microphones 1 are arranged such that at least one of the arrangement position and the sound
collection direction is different from each other. For example, the plurality of arranged
microphones 1 may constitute a microphone array.
[0025]
The voice recognition apparatus 100 recognizes a specific sound such as voice using the sound information acquired through the plurality of microphones 1. The speech recognition apparatus 100 may constitute one system together with the plurality of microphones 1, or may constitute one apparatus alone. Alternatively, the voice recognition device 100 may be incorporated, as hardware or software, into an information processing device such as a computer, a device equipped with a means for acquiring voice such as a microphone, or other various devices, and may constitute a part of such a device.
[0026]
The speech recognition apparatus 100 includes a sound information acquisition unit 10, a sound source section detection unit 11, a direction estimation unit 12, an adaptive processing unit 13, an adaptive processing convergence monitoring unit 14, a buffer amount determination unit 15, a buffer 16, a beamforming processing unit 17, and a voice recognition unit 18.
[0027]
Each of the components, the sound information acquisition unit 10, the sound source section detection unit 11, the direction estimation unit 12, the adaptive processing unit 13, the adaptive processing convergence monitoring unit 14, the buffer amount determination unit 15, the beamforming processing unit 17, and the speech recognition unit 18, is an element that executes its respective process. Each component may individually constitute one element, or a plurality of components may together constitute one element.
[0028]
Each component may be configured by dedicated hardware or may be realized by executing a software program suitable for each component. In this case, each component may include, for example, an arithmetic processing unit (not shown) and a storage unit (not shown) that stores a control program. Examples of the arithmetic processing unit include an MPU (Micro Processing Unit) and a CPU (Central Processing Unit); an example of the storage unit is a memory. Each component may be configured as a single element performing centralized control, or as a plurality of elements performing distributed control in cooperation with one another. The software program may be provided as an application through communication via a communication network such as the Internet, communication based on a mobile communication standard, or the like.
[0029]
Each component may be a circuit such as an LSI (Large Scale Integration: large scale integrated
circuit) or a system LSI. A plurality of components may constitute one circuit as a whole, or may
constitute separate circuits. Also, each circuit may be a general-purpose circuit or a dedicated
circuit.
[0030]
The system LSI is a super-multifunctional LSI manufactured by integrating a plurality of components on one chip; specifically, it is a computer system comprising a microprocessor, a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. A computer program is stored in the RAM, and the system LSI achieves its functions by the microprocessor operating according to the computer program. The system LSI and the LSI may each be a field programmable gate array (FPGA) that can be programmed after manufacture, and may include a reconfigurable processor in which the connections and settings of circuit cells inside the LSI can be reconfigured.
[0031]
In addition, part or all of the above components may be composed of a removable IC card or a
single module. The IC card or module is a computer system including a microprocessor, a ROM, a
RAM, and the like. The IC card or module may include the above LSI or system LSI. The IC card or
module achieves its function by the microprocessor operating according to the computer
program. These IC cards and modules may be tamper resistant.
[0032]
The buffer 16 is an element that temporarily stores information. For example, the buffer 16 may be configured of a semiconductor memory or the like, and may be a volatile memory or a non-volatile memory. The buffer 16 holds the sound information acquired by the microphones 1 according to the buffer amount determined by the buffer amount determination unit 15, as described later. Holding the sound information according to the buffer amount may mean that the buffer 16 stores that amount of sound information, that memory for that amount of sound information is reserved in the buffer 16, or that the buffer 16 storing the sound information releases memory corresponding to the buffer amount.
[0033]
The sound information acquisition unit 10 acquires observation signals from the plurality of
microphones 1 as sound information. The sound information acquisition unit 10 sends the
acquired observation signal to various components of the speech recognition apparatus 100 such
as the sound source section detection unit 11 and the buffer 16. The sound information
acquisition unit 10 may include a communication interface for communicating with the microphone 1 and the like.
[0034]
The sound source section detection unit 11 detects a sound source section using the observation
signal acquired by the microphone 1. The sound source section is a section in which the sound
from the sound source is included in the observation signal generated by the microphone 1. For
example, the sound source section may be composed of time sections. The sound source segment
detection unit 11 detects a sound source segment, for example, by the method described below.
Referring to FIG. 3, an example of a process flow of detection of a sound source section is shown.
In the example shown in FIG. 3, the energy of the observation signal is taken as the feature amount of the observation signal, and this feature amount is extracted (step S41). Next, it is determined whether or not the energy of the observation signal is larger than the energy of the noise signal (step S42). An observation signal whose energy is larger than that of the noise signal is detected, and the information of the section formed by the detected observation signal is used as sound source section information (step S43). For example, for the observation signal of one of the plurality of microphones 1, the short-time energy p of the observation signal is calculated by Equation 1 below.
[0035]
[0036]
Here, in Equation 1 above, t represents the discrete time of the observation signal, x(t) represents the sound information of the observation signal acquired by one microphone 1, and T represents the number of samples used for the calculation of the short-time energy. Note that x(t) is a discrete-time signal of the observation signal. If p is larger than a threshold value, the section formed by the observation signal constituting the short-time energy p is a sound source section; if p is smaller than the threshold value, it is a non-sound-source section. By setting another threshold, it is also possible to further detect, within the sound source section, a voice section in which a voice signal is included. In this way, it is possible to distinguish between the voice section and the non-voice section within the sound source section. The method of detecting the sound source section implemented by the sound source section detection unit 11 may be a method other than the above. In addition, although a detection method using only the sound information of one microphone 1 has been described above, all of the sound information acquired by the plurality of microphones 1 may be used for the detection of the sound source section.
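The thresholding described above (steps S41-S43) can be sketched as follows. Since Equation 1 is not reproduced in this translation, the sketch assumes it computes the mean squared amplitude of the T samples ending at time t; the two threshold values are illustrative assumptions:

```python
import numpy as np

def short_time_energy(x, t, T):
    """Assumed form of Equation 1: mean squared amplitude of the T samples ending at t."""
    frame = x[max(0, t - T + 1):t + 1]
    return np.mean(frame**2)

def detect_sections(x, T=160, source_thresh=0.01, voice_thresh=0.05):
    """Label each sample by thresholding the short-time energy p.

    A higher second threshold separates voice sections from the rest of the
    sound source section, as the text describes (thresholds are illustrative).
    """
    labels = []
    for t in range(len(x)):
        p = short_time_energy(x, t, T)
        if p > voice_thresh:
            labels.append("voice")       # voice section within the sound source section
        elif p > source_thresh:
            labels.append("source")      # sound source section
        else:
            labels.append("non-source")  # non-sound-source section
    return labels
```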
[0037]
The direction estimation unit 12 performs direction estimation on the signal of the voice section
in the sound source section detected by the sound source section detection unit 11, and acquires
the estimated direction of the voice. The direction estimation unit 12 acquires the estimated
direction regarding the observation signal of the voice section, for example, by the method
described below. Referring to FIG. 4, an example of a process flow of sound direction estimation
is shown. In the example shown in FIG. 4, from the observation signals acquired by the plurality
of microphones 1, a correlation matrix representing the arrival time difference of the signals
between the microphones 1 is calculated (step S51). When M microphones 1 are used, the sound
information X ω of the observation signal acquired by the M microphones 1 is expressed by
Equation 2 below.
[0038]
[0039]
Here, in Equation 2, ω represents the discrete frequency of the observation signal, and Xω,m (m = 1, ..., M) represents the sound information acquired by the m-th microphone 1. The correlation matrix Rω is then calculated by Equation 3 below. Note that Xω<H> is the adjoint (conjugate transpose) of Xω.
[0040]
[0041]
Further, when the sound source direction is expressed using the two variables of the azimuth angle θ in the horizontal direction and the elevation angle φ in the vertical direction, a vector dω(θ, φ) representing the arrival time differences of the observation signals between the microphones 1 at the discrete frequency ω is calculated by Equation 4 below (step S52).
[0042]
[0043]
Here, in Equation 4, j represents the imaginary unit, and τm(θ, φ) (m = 1, ..., M) represents the relative delay time when the m-th microphone 1 receives the sound wave arriving from the sound source direction (θ, φ), with the first microphone 1 as reference.
Although in the present embodiment the first microphone 1 is the microphone that first receives the sound wave arriving from the sound source direction (θ, φ), the first microphone 1 may be selected arbitrarily from the M microphones 1.
Next, the similarity Pω(θ, φ) between the correlation matrix Rω, which represents the arrival time differences of the observation signals between the microphones 1, and the vector dω(θ, φ) is calculated by Equation 5 below (step S53).
[0044]
[0045]
Then, the sound source direction (θ, φ) that maximizes the similarity Pω(θ, φ) is searched for (step S54), and that sound source direction (θ, φ) is output as the estimation result of the sound source direction, that is, as sound source direction information (step S55).
The method of estimating the direction of the sound used by the direction estimation unit 12 may be a method other than that described above.
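The direction search in steps S51-S55 can be sketched as follows. Since Equations 2-5 are not reproduced in this translation, the sketch assumes a common formulation: Rω = Xω Xω^H, a steering vector dω whose m-th entry is exp(-jωτm), and a similarity Pω = dω^H Rω dω (a conventional steered-response measure). The microphone delays and the candidate-direction grid are illustrative:

```python
import numpy as np

def steering_vector(omega, delays):
    """Assumed form of Equation 4: phase shifts for the relative delays tau_m."""
    return np.exp(-1j * omega * delays)

def estimate_direction(X, omega, candidate_delays):
    """Pick the candidate direction maximizing P = d^H R d (steps S52-S55).

    X: (M,) complex observation at discrete frequency omega (Equation 2)
    candidate_delays: dict mapping a direction label -> (M,) relative delays tau_m
    """
    R = np.outer(X, X.conj())               # Equation 3 (assumed R = X X^H)
    best, best_p = None, -np.inf
    for direction, delays in candidate_delays.items():
        d = steering_vector(omega, delays)
        p = np.real(d.conj() @ R @ d)       # Equation 5 (assumed similarity d^H R d)
        if p > best_p:
            best, best_p = direction, p     # step S54: search for the maximum
    return best                             # step S55: sound source direction information
```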
[0046]
The adaptive processing unit 13 estimates the filter coefficient for extracting the speech in the
speaker direction acquired by the direction estimation unit 12 using the observation signal
acquired by the microphone 1.
The adaptive processing unit 13 estimates filter coefficients, for example, by the method
described below. Referring to FIG. 5, an example of a process flow of filter coefficient estimation
in adaptive processing is shown. In the example shown in FIG. 5, first, a signal d (t) obtained by
emphasizing the signal of the sound source in the estimated direction (θ, φ) acquired by the
direction estimation unit 12 is calculated by the following equation 6 (step S61).
[0047]
[0048]
Here, in Equation 6, M represents the number of microphones 1, t represents the discrete time of the observation signal, xm(t) (m = 0, 1, ..., M−1) represents the input signal from the microphone 1, that is, the observation signal, and τm(θ, φ) (m = 0, 1, ..., M−1) represents the delay time when there is a sound source in the direction (θ, φ).
As described above, the delay time τm(θ, φ) represents the relative delay time when the sound wave arriving from the sound source direction (θ, φ) is received by the m-th microphone 1, with the first microphone 1 as reference. For example, xm(t) may be the signal of the voice section detected by the sound source section detection unit 11. Further, the signals ym(t) (m = 0, 1, ..., M−2), in which the signal from the desired direction is blocked, are calculated by Equation 7 below (step S62). The desired direction is the estimated direction of the speech.
[0049]
[0050]
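Steps S61 and S62 can be sketched as follows. This is a minimal generalized-sidelobe-canceller-style illustration, not the patent's exact formulas: since Equations 6 and 7 are not reproduced in this text, the averaging of delay-aligned channels for d(t), the pairwise subtraction for y m (t), and the integer-sample circular shifts are all assumptions.

```python
import numpy as np

def enhance_and_block(x, delays):
    """Delay-and-sum enhancement d(t) and blocked signals y_m(t).

    x      : 2-D array (M, T) of observation signals x_m(t)
    delays : integer sample delays tau_m for the estimated direction (θ, φ)
    Returns (d, y), where d is the enhanced signal and y holds the M-1
    signals in which the desired-direction component is blocked.
    """
    M, T = x.shape
    # Align each channel to the estimated direction by removing its delay.
    # np.roll wraps around, which is an approximation at the signal edges.
    aligned = np.array([np.roll(x[m], -delays[m]) for m in range(M)])
    # Equation 6 style: averaging the aligned channels emphasizes the target.
    d = aligned.mean(axis=0)
    # Equation 7 style: subtracting adjacent aligned channels cancels the
    # desired-direction signal, leaving reference noise for adaptation.
    y = aligned[:-1] - aligned[1:]
    return d, y
```

A source exactly in the steered direction appears identically in every aligned channel, so its contribution to every y m (t) is zero.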
Next, the estimated noise signal n (t) is calculated by filtering and adding to y m (t) as shown in
the following equation 8 (step S63).
[0051]
[0052]
Here, in Equation 8, L indicates the number of taps of the adaptive filter, and w m (k) indicates
the coefficient of the adaptive filter.
Next, the output signal y (t) is calculated by the following equation 9 (step S64).
[0053]
[0054]
Here, in Equation 9, τ represents a delay for aligning the phase between d (t) and n (t).
The coefficient w m (k) of the adaptive filter is sequentially updated such that the expected value
J of the error e (t) shown in the following Equation 10 and Equation 11 becomes smaller.
[0055]
03-05-2019
16
[0056]
Here, in Equation 10, E [·] represents expected value calculation.
The adaptive processing method used for the adaptive processing unit 13 may be a method other
than the above.
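Steps S63 and S64, together with the sequential coefficient update, can be sketched as a sample-wise LMS-style loop. Since Equations 8 to 11 are not reproduced in this text, the gradient step below, which decreases the sample estimate of the expected value J = E[e(t)^2], is an assumption, and the step size mu is illustrative.

```python
import numpy as np

def lms_step(d, y, w, mu=0.01, tau=0):
    """One pass of adaptive noise cancellation with coefficient updates.

    d  : enhanced signal d(t), 1-D array
    y  : blocked signals y_m(t), 2-D array (M-1, T)
    w  : adaptive filter coefficients w_m(k), 2-D array (M-1, L)
    Returns the error signal e and the updated coefficients.
    The estimated noise n(t) is the filtered sum of the blocked signals
    (Equation 8 style); the output is d(t - tau) - n(t) (Equation 9 style);
    w is nudged along the negative gradient of e(t)^2, the sample estimate
    of the expected value J.
    """
    Mm1, L = w.shape
    T = y.shape[1]
    e = np.zeros(T)
    for t in range(L - 1, T):
        past = y[:, t - L + 1:t + 1][:, ::-1]  # y_m(t - k), k = 0 .. L-1
        n = np.sum(w * past)                   # estimated noise n(t)
        e[t] = d[t - tau] - n                  # output / error e(t)
        w = w + mu * e[t] * past               # LMS coefficient update
    return e, w
```

With a noise reference correlated with the noise in d(t), the coefficients converge and the residual error shrinks, which is exactly the behavior the convergence monitoring described next relies on.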
[0057]
The adaptive processing convergence monitoring unit 14 determines convergence / nonconvergence of the adaptive processing from the state of the filter coefficient update in the
adaptive processing unit 13.
Specifically, when the expectation value J of the error e (t) calculated by the adaptive processing
unit 13 becomes smaller than a predetermined threshold, the adaptive processing convergence
monitoring unit 14 determines that the adaptive processing has converged, If the expected value
J is greater than or equal to the threshold value, it is determined that the adaptive processing has
not converged.
Referring to FIG. 6, an example of the expected value J of the error for each update of the filter
coefficient in the estimation of the filter coefficient is shown. In FIG. 6, the vertical axis indicates
the magnitude of the expected value J, and the horizontal axis indicates the number of times the
filter coefficient w m (k) has been updated. As shown in FIG. 6, it can be seen that the expected
value J gradually decreases as the number of updates of the filter coefficient w m (k) increases.
Then, the adaptive processing convergence monitoring unit 14 determines that the adaptive
processing is not converged when the expected value J is equal to or larger than the threshold
J1, and determines that the adaptive processing has converged when the expected value J is
smaller than the threshold J1. The adaptive processing convergence monitoring unit 14
determines that the adaptive processing has converged when the number of filter coefficient
updates reaches the predetermined number N1, and the adaptive processing convergence
monitoring unit 14 performs the adaptive processing when the number of filter coefficient
updates is less than the predetermined number N1. However, it may be determined that
Furthermore, the adaptive processing convergence monitoring unit 14 may determine the
convergence of the adaptive processing using both the expectation value J and the number of
03-05-2019
17
updates.
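The three convergence policies described above (a threshold J1 on the expected value J, a predetermined update count N1, or both) can be sketched in one helper; the parameter values in the usage example are illustrative assumptions.

```python
def adaptation_converged(expected_j, update_count,
                         j_threshold=None, count_threshold=None):
    """Convergence decision of the adaptive processing.

    Mirrors the three policies in the text: pass j_threshold (J1) alone,
    count_threshold (N1) alone, or both to require both conditions.
    """
    by_j = j_threshold is not None and expected_j < j_threshold
    by_count = count_threshold is not None and update_count >= count_threshold
    if j_threshold is not None and count_threshold is not None:
        return by_j and by_count
    return by_j or by_count
```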
[0058]
The buffer amount determination unit 15 includes information on the sound source section
acquired by the sound source section detection unit 11, information on the estimated direction
estimated by the direction estimation unit 12, and information on the convergence state of the
adaptive processing acquired by the adaptive processing convergence monitoring unit 14. The
amount of sound information buffer used for processing in the beamforming processing unit 17
is determined based on the above. Specifically, the buffer amount of sound information held in
the buffer 16 is determined. In addition, since the buffer amount is the amount of sound
information arranged in time series, it may be expressed using a length. For example, a long
buffer amount indicates that the buffer amount is large, and a short buffer amount indicates that
the buffer amount is small.
[0059]
The buffer amount determination unit 15 determines the buffer amount, for example, by the
method described below. Referring to FIG. 7, an example of the flow of the process of
determining the buffer amount is shown. Further, FIG. 8 shows an example of a list of cases of
the buffer amount determined by the process of FIG. 7. In the example shown in FIGS. 7 and 8, the buffer amount determination unit 15 first checks whether the estimated direction information (θ, φ) acquired by the direction estimation unit 12 has changed from the previously acquired estimated direction information (θ t−1, φ t−1); if it has changed (Yes in step S81), the amount of sound information buffered in the buffer 16 is increased (step S85). The determination as to whether or not the estimated direction of the speech, such as the speaker direction, has changed can be made, for example, using Equation 12 below. The buffer amount determination unit 15 determines that there is a change if the change amount Δ in the estimated direction calculated by Equation 12 is larger than a threshold, and that there is no change if it is equal to or less than the threshold.
[0060]
[0061]
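The change determination against Equation 12 can be sketched as follows. The exact form of the change amount Δ is not reproduced in this text, so the angular distance between the two direction vectors used here, and the threshold value, are assumptions (angles in radians).

```python
import math

def direction_changed(theta, phi, theta_prev, phi_prev, threshold=0.1):
    """Detect a change in the estimated direction (Equation 12 style).

    Computes the angle Delta between the unit vectors for (theta, phi)
    and (theta_prev, phi_prev); Delta larger than the threshold means
    the estimated direction has changed.
    """
    cos_delta = (math.sin(theta) * math.sin(theta_prev)
                 * math.cos(phi - phi_prev)
                 + math.cos(theta) * math.cos(theta_prev))
    # Clamp against floating-point overshoot before taking acos.
    delta = math.acos(max(-1.0, min(1.0, cos_delta)))
    return delta > threshold
```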
Immediately after the estimation direction information changes, the buffer amount determination
unit 15 returns the buffer amount to the initial value Q regardless of the convergence state of the
adaptive processing and the information of the sound source section.
The initial value Q is the maximum amount of buffer required for the processing in the
beamforming processing unit 17 between the start of the adaptive processing in the adaptive
processing unit 13 and the convergence. Therefore, at this time, the buffer amount is maximum.
As a result, since an amount of buffer sufficient to cover the period from the start of the adaptive processing to its convergence is used to obtain the filter coefficient used for processing in the beamformer, high-quality speech without speech loss can be obtained.
[0062]
Next, when the estimated direction information has not changed from the previously acquired estimated direction information (No in step S81) and the adaptive processing convergence monitoring unit 14 determines that the adaptive processing has not converged (No in step S82), the buffer amount determination unit 15 does not change the buffer amount (step S86). Even
when the voice section changes to a non-voice section, it is assumed that the estimated direction
information has not changed from the previously acquired estimated direction information.
[0063]
Further, when the estimated direction information has not changed from the previous estimated direction information (No in step S81), the adaptive processing convergence monitoring unit 14 determines that the adaptive processing has converged (Yes in step S82), and the sound source section information newly acquired by the sound source section detection unit 11 is a voice section (Yes in step S83), the buffer amount determination unit 15 does not change the buffer amount (step S87).
[0064]
Next, when the estimated direction information has not changed from the previous estimated direction information (No in step S81), the adaptive processing has converged (Yes in step S82), the information of the newly detected sound source section is a non-voice section (No in step S83), and the current buffer amount is equal to or less than a preset lower limit (No in step S84), the buffer amount determination unit 15 does not change the buffer amount (step S88).
[0065]
Finally, when the estimated direction information has not changed from the previous estimated direction information (No in step S81), the adaptive processing has converged (Yes in step S82), the information of the newly detected sound source section is a non-voice section (No in step S83), and the current buffer amount is larger than the preset lower limit (Yes in step S84), the buffer amount determination unit 15 decreases the buffer amount (step S89).
When the estimated direction information is constant, the adaptive processing has converged, and a non-voice section is detected as the information of the sound source section, reducing the buffer amount makes the response of the speech recognition faster.
[0066]
Here, when the lower limit value of the buffer amount is not set, whenever the estimated direction information has not changed from the previously acquired estimated direction information (No in step S81), the adaptive processing has converged (Yes in step S82), and the information of the newly detected sound source section is a non-voice section (No in step S83), the buffer amount may be reduced as shown in FIG.
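The decision flow of steps S81 to S89 (FIGS. 7 and 8) can be summarized in one function. The fixed decrement used when the buffer amount is reduced is an illustrative assumption; the patent does not specify how large each reduction is.

```python
def decide_buffer_amount(direction_changed, converged, is_voice,
                         current, initial, lower_limit, step):
    """Buffer amount decision following FIG. 7 / FIG. 8.

    - direction change                              -> reset to initial Q (S85)
    - not converged                                 -> keep current (S86)
    - converged, voice section                      -> keep current (S87)
    - converged, non-voice, at/below lower limit    -> keep current (S88)
    - converged, non-voice, above lower limit       -> decrease (S89)
    """
    if direction_changed:
        return initial
    if not converged or is_voice:
        return current
    if current <= lower_limit:
        return current
    return max(lower_limit, current - step)
```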
[0067]
The beamforming processing unit 17 performs beamforming processing on the input signal held by the buffer 16, that is, the sound information, using the sound information such as the observation signal acquired by the microphone 1 and the filter coefficient estimated by the adaptive processing unit 13.
The beam forming processing unit 17 outputs the audio information of the sound information
acquired by the microphone 1 as an output signal by performing the beam forming processing.
Here, the flow of processing for obtaining an output signal by beamforming processing is the same as the procedure in which the adaptive processing unit 13 performs the processing of steps S61 to S64 shown in FIG. 5 to calculate the output signal y (t). That is, the beamforming processing unit 17 calculates the output signal y (t). The speech recognition unit 18 performs
speech recognition on the speech information processed by the beamforming processing unit 17.
For example, the voice recognition unit 18 converts voice information into a voice signal.
[0068]
[Operation of Speech Recognition Device] An example of the operation of the speech recognition
device 100 according to the first embodiment will be described with reference to FIGS. 2 and 9.
FIG. 9 is a flowchart showing an example of an operation flow of the speech recognition
apparatus 100 according to the first embodiment. The sound information acquisition unit 10 of
the speech recognition apparatus 100 acquires an observation signal of sound, that is, sound
information, through the plurality of microphones 1 (step S101). The observation signal acquired by the sound information acquisition unit 10 is also used, as described later, by the sound source section detection unit 11, the direction estimation unit 12, and the adaptive processing unit 13 of the speech recognition apparatus 100, and is temporarily stored in the buffer 16.
[0069]
The sound source segment detection unit 11 detects a sound source segment and a voice
segment in the sound source segment from the observation signals of the plurality of
microphones 1 (step S102). That is, the voice section and the non-voice section in the sound
source section are specified. The sound source segment detection unit 11 sends information such
as a sound source segment and a voice segment to the direction estimation unit 12 as sound
information. As described later, the information such as the sound source section and the voice
section is also used by the adaptive processing unit 13, the buffer amount determination unit 15,
and the beam forming processing unit 17.
[0070]
The direction estimation unit 12 estimates the direction of the voice section in the information
received from the sound source section detection unit 11, and calculates the estimated direction
of the voice (step S103). The direction estimation unit 12 sends, as sound information,
information in which the calculated estimated direction is associated with the voice section to the
adaptive processing unit 13. The information of such estimated direction and voice section is also
used by the buffer amount determination unit 15.
[0071]
The adaptive processing unit 13 estimates a filter coefficient for extracting the voice in the estimated direction received from the direction estimation unit 12, using the observation signal acquired by the microphone 1 and the information of the voice section detected by the sound source section detection unit 11 (step S104). The adaptive processing unit 13 sequentially
estimates and updates the filter coefficients, and sends the filter coefficients to the adaptive
processing convergence monitoring unit 14 of the speech recognition apparatus 100 each time
the updating is performed.
[0072]
The adaptive processing convergence monitoring unit 14 determines whether the adaptive
processing is converged or not converged from the state of the filter coefficient sequentially
received from the adaptive processing unit 13 (step S105). That is, it is determined whether the
filter coefficients have converged or not. The adaptive processing convergence monitoring unit 14 may make this determination based on the expected value J of the error e (t), which is defined from the signal d (t), in which the observation signal in the estimated direction is emphasized, and the output signal y (t) obtained by subtracting the noise signal n (t) from the signal d (t); it may make the determination based on the number of updates of the filter coefficient; or it may make the determination based on both the expected value J and the number of updates of the filter coefficient. The adaptive processing convergence monitoring unit 14 sends the determination result of convergence or non-convergence of the adaptive processing to the buffer amount determination unit 15.
[0073]
The buffer amount determining unit 15 determines the buffer amount of sound information such
as an observation signal held in the buffer 16 based on the convergence state of the adaptive
processing received from the adaptive processing convergence monitoring unit 14 (step S106).
More specifically, the buffer amount determination unit 15 determines the buffer amount based on the information such as the sound source section and the voice section detected by the sound source section detection unit 11, the estimated direction estimated by the direction estimation unit 12, and the convergence state of the adaptive processing received from the adaptive processing convergence monitoring unit 14.
[0074]
Then, the buffer amount determination unit 15 determines whether the sound information to be
subjected to the speech recognition process in the speech recognition apparatus 100, that is, the
observation signal is a signal immediately after the change of the estimation direction (step
S107). That is, it is determined whether or not the estimated direction for the target observation signal has changed with respect to the observation signal observed immediately before it. The estimated direction also changes when a sound is emitted from a sound source after a silent state. Therefore, the observation signal immediately after the process of speech recognition is started for the first time is also included among the signals immediately after a change of the estimated direction.
[0075]
If the observation signal is a signal immediately after the estimation direction has changed (Yes
in step S107), the buffer amount determination unit 15 sets the buffer amount to an initial value
(step S108). Specifically, the buffer amount determination unit 15 determines the buffer amount
to an initial value that is an amount sufficient for convergence of the adaptive processing. For
example, the amount of buffer sufficient for convergence of the adaptive processing is the maximum buffer amount required from the start of the adaptive processing to convergence in the beamforming processing performed on observation signals before the current processing. Alternatively, the amount of buffer sufficient for convergence of the adaptive processing may be
a preset amount. The buffer amount determination unit 15 changes the buffer amount held by
the buffer 16 based on the determined buffer amount. For example, FIG. 10 shows a diagram for
explaining an example of holding in the buffer 16. Method 1 shows an example in which the buffer 16 holds the sound information by securing, among the stored sound information, the sound information of a predetermined buffer amount. Specifically, in the buffer 16, memory for the maximum buffer amount is secured, and holding of sound information at a predetermined buffer amount, or a change of the buffer amount, is realized by limiting the reference range within the secured memory. Method 2 shows an example in which the buffer 16 holds the sound information by securing and releasing memory corresponding to the predetermined buffer amount. Specifically, by securing and releasing memory for the required buffer amount in the buffer 16, holding of sound information at a predetermined buffer amount, or a change of the buffer amount, is realized.
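Method 1 of FIG. 10 can be sketched with a fixed-capacity store whose visible range is adjusted instead of reallocating memory. The deque-based store and the choice to expose the newest samples are assumptions made for the sketch.

```python
from collections import deque

class SoundBuffer:
    """Method 1 of FIG. 10: memory for the maximum buffer amount is
    secured once, and a smaller buffer amount is realized by limiting
    the reference range rather than reallocating memory.
    """
    def __init__(self, max_amount):
        self._store = deque(maxlen=max_amount)  # fixed maximum memory
        self.amount = max_amount                # current reference range

    def push(self, sample):
        self._store.append(sample)              # oldest sample drops out

    def set_amount(self, amount):
        # Changing the buffer amount only moves the reference range.
        self.amount = min(amount, self._store.maxlen)

    def read(self):
        # Only the newest 'amount' samples are visible to the beamformer.
        return list(self._store)[-self.amount:]
```

Method 2 would instead allocate and free memory on each `set_amount` call; Method 1 trades peak memory for avoiding that allocation cost.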
[0076]
If the observation signal is not a signal immediately after the estimated direction has changed (No in step S107), the buffer amount determination unit 15 determines the buffer amount so as to keep it unchanged from, or reduce it relative to, the buffer amount adopted in the processing of the observation signal preceding the target observation signal. The buffer amount determination unit 15 adjusts the buffer amount held by the buffer 16 based on the determined buffer amount.
[0077]
Specifically, the buffer amount determination unit 15 determines whether the adaptive
processing has converged and the sound source section detected from the target observation
signal is a non-speech section (step S109). If the adaptation process is not converged or the
detected sound source section is a voice section (No in step S109), the buffer amount
determining unit 15 determines the buffer amount so as not to change the buffer amount (step
S111). The buffer amount determination unit 15 maintains the buffer amount held by the buffer
16 based on the determined buffer amount.
[0078]
If the adaptive processing has converged and the detected sound source section is a non-voice section (Yes in step S109), the buffer amount determination unit 15 determines whether the current buffer amount is larger than a preset lower limit value (step S110). For example, the case where the estimated direction of the voice does not change and the detected sound source section is a non-voice section arises when the speaker who was speaking stops emitting voice because of a breath, a pause between utterances, or the like.
[0079]
Here, the lower limit value of the buffer amount will be described. For example, even when the
filter coefficient converges to the optimum value by the adaptive processing, the optimum value
of the filter coefficient changes due to the change in the sound environment around the
microphone 1 such as noise change. For this reason, another adaptation process is required. In
the sound environment where the change is small, the change in the optimum value of the filter
coefficient is also small, and in the sound environment where the change is large, the change in
the optimum value of the filter coefficient is also large. For example, if the optimum value of the
filter coefficient changes significantly while the buffer amount is set small, the noise suppression
performance in the beamforming process may be degraded. The lower limit value of the buffer
amount is set to a buffer amount that can suppress a decrease in beamforming processing
capacity even if the optimum value of the filter coefficient changes. For example, in a sound
environment where the change is large, the lower limit of the buffer amount is set to a large
value, and in a sound environment where the change is small, the lower limit of the buffer is set
to a small value or zero. That is, the lower limit value of the buffer amount is set according to the
sound environment around the microphone 1.
[0080]
If the current buffer amount is larger than the lower limit (Yes in step S110), the buffer amount
determining unit 15 determines the buffer amount so as to reduce the buffer amount (step
S112). When the current buffer amount is equal to or less than the lower limit (No in step S110), the buffer amount determining unit 15 determines the buffer amount so as not to change
the buffer amount (step S111).
[0081]
The beamforming processing unit 17 performs beamforming processing on the sound information such as the observation signal held by the buffer 16, as described above, using the observation signal acquired by the microphone 1 and the filter coefficient estimated by the adaptive processing unit 13 (step S113). As a result, the beamforming processing unit 17 outputs voice
information of the observation signal acquired by the microphone 1.
[0082]
The speech recognition unit 18 performs speech recognition on the speech information
processed by the beamforming processing unit 17 (step S114). Thereby, the speech recognition
apparatus 100 recognizes speech from the observation signal acquired by the microphone 1.
[0083]
By the processes of steps S107 and S108, immediately after the time when the direction of the voice changes, such as at the start of a speaker's utterance, a change of the speaker's position, or a change to a different speaker, the speech recognition apparatus 100 performs speech recognition by beamforming processing or the like using the sound information of the initial buffer amount. Thus, high-quality speech recognition can be achieved without the beginning of the utterance being cut off.
[0084]
When a silent section occurs, such as a speaker's breath or a pause, the speech recognition apparatus 100 performs speech recognition by beamforming processing or the like using the sound information of the reduced buffer amount, by the processes of steps S107, S109, S110, and S112. Thus, the processing relating to speech recognition in a silent section is speeded up.
[0085]
By the processes of steps S107, S109, S110, and S111, the speech recognition apparatus 100
suppresses the excessive reduction of the buffer amount, thereby suppressing the deterioration
of the speech recognition accuracy and speed due to the beam forming process or the like.
[0086]
By the processing of steps S107, S109, and S111, the speech recognition apparatus 100
maintains the buffer amount at the time of processing of the speech section, thereby suppressing
the decrease in speech recognition accuracy and speed due to the beam forming process or the
like.
Alternatively, the speech recognition apparatus 100 maintains the buffer amount when the adaptive processing has not converged, thereby preventing the convergence of the adaptive processing from being delayed.
[0087]
In the process of step S110, the current buffer size is compared with the preset lower limit value,
but this process may be omitted. In this case, in the case of Yes in step S109, the buffer amount
determination unit 15 determines the buffer amount so as to reduce the buffer amount.
[0088]
An example of a signal output by the speech recognition apparatus 100 as described above is
shown in FIG. 11. FIG. 11 compares the voice waveform input to the microphone 1, the waveform of the signal output by the conventional voice recognition method, and the waveform of the signal output by the voice recognition apparatus 100 according to the first embodiment. In FIG. 11,
the horizontal axis is a time axis. In the output signal of the speech recognition apparatus 100,
the processing speed for the silent section of the input signal is significantly improved over the
speech recognition method of the prior art. As a result, in the speech recognition device 100, the delay of the output for the second input signal with respect to the output for the first input signal is significantly shorter than in the speech recognition method of the prior art, and the start of speech recognition of the two input signals in the speech recognition device 100 is faster than in prior-art speech recognition methods. Therefore, according to the speech recognition apparatus 100, real-time speech recognition can be performed on the speaker's pronunciation.
[0089]
Second Embodiment [Configuration of Speech Recognition Device] The configuration of the
speech recognition device 200 according to the second embodiment will be described with
reference to FIG. 12. The speech recognition apparatus 200 according to the second embodiment is configured to compare, when the direction of the voice changes, the processing with the already calculated existing filter coefficients against the processing with the newly calculated filter coefficients, and to perform the speech recognition process after the change of the direction of the speech using the selected filter coefficients. Hereinafter, the speech
recognition apparatus 200 according to the second embodiment will be described focusing on
differences from the speech recognition apparatus 100 according to the first embodiment, and
the description of the same points will be omitted.
[0090]
As shown in FIG. 12, the speech recognition apparatus 200 according to the second embodiment includes, like the speech recognition apparatus 100 according to the first embodiment, the sound information acquisition unit 10, the sound source section detection unit 11, the direction estimation unit 12, the adaptive processing unit 13, the adaptive processing convergence monitoring unit 14, the buffer amount determination unit 15, and the buffer 16. The
speech recognition apparatus 200 includes a beamforming processing unit and a speech
recognition unit having the same configuration as that of the speech recognition apparatus 100
according to the first embodiment as a first beamforming processing unit 17 and a first speech
recognition unit 18. Furthermore, the speech recognition apparatus 200 includes a second
beamforming processing unit 217, a second speech recognition unit 218, a coefficient holding
unit 219, and a recognition result selection unit 220. The second beamforming processing unit
217 and the second speech recognition unit 218 have the same functions as the first
beamforming processing unit 17 and the first speech recognition unit 18, respectively.
[0091]
The coefficient holding unit 219 holds the existing filter coefficients of the adaptive filter, that is,
the filter coefficients used in the speech recognition process. For example, like the buffer 16, the
coefficient holding unit 219 holds the filter coefficient by temporarily storing the filter
coefficient. The coefficient holding unit 219 holds the filter coefficient before the change when
the estimated direction of speech changes due to the change of the speaker direction or the like.
The filter coefficients held by the coefficient holding unit 219 may not be the filter coefficients in
use, but may be filter coefficients used in the past. The coefficient holding unit 219 may hold a
plurality of filter coefficients. The filter coefficients held by the coefficient holding unit 219 may
be filter coefficients after convergence of the adaptive processing.
[0092]
The second beam forming processing unit 217 outputs a second output signal, which is audio
information, using the filter coefficient held by the coefficient holding unit 219 and the
observation signal acquired by the microphone 1. That is, the second beamforming processing
unit 217 performs the beamforming processing without using the sound information of the
buffer 16. The audio information output by the first beam forming processing unit 17 is referred
to as a first output signal. The second speech recognition unit 218 performs speech recognition
on the second output signal processed by the second beamforming processing unit 217.
[0093]
The recognition result selection unit 220 compares the first output signal from the first
beamforming processing unit 17 with the second output signal from the second beamforming
processing unit 217. Specifically, the recognition result selection unit 220 determines which of
speech recognition results of the first output signal and the second output signal is reliable. Then,
the recognition result selection unit 220 outputs the reliable speech recognition result out of the
speech recognition results of the first output signal and the second output signal.
[0094]
[Operation of Speech Recognition Device] An example of the operation of the speech recognition
device 200 according to the second embodiment will be described with reference to FIGS. 12 and
13. FIG. 13 is a flowchart showing an example of the operation flow of the speech recognition
apparatus 200 according to the second embodiment. In the following description, an example in
which the direction of the voice changes while the voice recognition device 200 operates in the
same manner as the voice recognition device 100 according to the first embodiment will be
described. Before the direction of the voice changes, the speech recognition apparatus 200 performs the speech recognition process using the first beamforming processing unit 17 and the first speech recognition unit 18, without using the second beamforming processing unit 217 and the second speech recognition unit 218.
[0095]
During the speech recognition process by the speech recognition apparatus 200, the coefficient
holding unit 219 holds the filter coefficient used for the process (step S201). Further, the
coefficient holding unit 219 acquires the estimated direction of the voice detected by the
direction estimation unit 12 through the adaptive processing unit 13.
[0096]
When the direction of the voice estimated by the direction estimation unit 12 changes (Yes in step S202), the coefficient holding unit 219 outputs the holding filter coefficient, which is the held filter coefficient, to the second beamforming processing unit 217. If there is no change in
the estimated direction of the voice (No in step S202), the voice recognition process by the
processing method currently being performed is continued.
[0097]
When the estimated direction of the voice changes in the direction estimation unit 12, the second beamforming processing unit 217 starts beamforming processing using the holding filter coefficient, and the first beamforming processing unit 17 continues its beamforming processing. That is, the first beamforming processing unit 17 and the second beamforming processing unit 217 simultaneously perform beamforming processing in parallel
(step S203). Then, the first beamforming processing unit 17 and the second beamforming
processing unit 217 output the respective output signals to the first speech recognition unit 18,
the second speech recognition unit 218, and the recognition result selection unit 220.
[0098]
The first beamforming processing unit 17 performs the beamforming process using the filter
coefficient newly calculated and the sound information of the buffer amount returned to the
initial value. For this reason, the first output signal from the first beamforming processing unit
17 is output with a delay due to the buffer amount from the change time of the estimation
direction of the voice.
[0099]
The second beam forming processing unit 217 performs the beam forming processing using the
holding filter coefficient and the observation signal acquired from the microphone 1. Therefore,
the processing amount in the second beam forming processing unit 217 is significantly smaller
than the processing amount in the first beam forming processing unit 17. Furthermore, it is not
necessary to calculate the filter coefficients. Therefore, the second output signal from the second
beamforming processing unit 217 is output with a delay that is significantly smaller than that of
the first beamforming processing unit 17 from the change of the estimation direction of the
voice.
[0100]
For example, an output example of the first output signal and the second output signal is shown in FIG. 14. FIG. 14 shows the sound waveform input to the microphone 1, the waveform of the first output signal output by the first beamforming processing unit 17, and the waveform of the second output signal output by the second beamforming processing unit 217. In FIG.
14, the horizontal axis is a time axis. As shown in FIG. 14, the output time of the first output
signal is significantly delayed from the output time of the second output signal.
[0101]
The first speech recognition unit 18 and the second speech recognition unit 218 perform speech recognition on the first output signal and the second output signal received from the first beamforming processing unit 17 and the second beamforming processing unit 217, respectively (step S204).
[0102]
The recognition result selection unit 220 compares the first output signal and the second output signal received from the first beamforming processing unit 17 and the second beamforming processing unit 217, and determines whether or not the second output signal is reliable (step S205). That is, the recognition result selection unit 220 examines the reliability of the second output signal.
[0103]
For example, the recognition result selection unit 220 compares the two output signals in the following manner. Specifically, from the first output signal ya(t) and the second output signal yb(t), the recognition result selection unit 220 extracts the portions of ya(t) and yb(t) that correspond to the same observation signal, and calculates the sum of the magnitudes of each output signal as in the following Equation 13 and Equation 14. Furthermore, the recognition result selection unit 220 calculates the difference between the two as shown in the following Equation 15. For example, the same observation signal may be an observation signal acquired by one microphone 1 at the same time, or a plurality of observation signals acquired by the plurality of microphones 1 at the same time.
[0104]
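The equation images for Equations 13 to 15 did not survive this machine translation. Based on the description in [0103] (the sum of the magnitudes of each extracted output signal, then the difference between the two sums), a plausible reconstruction is given below; the symbols Ya, Yb, and D are assumed names, not taken from the patent.

```latex
Y_a = \sum_{t} \left| y_a(t) \right| \quad \text{(Equation 13)}

Y_b = \sum_{t} \left| y_b(t) \right| \quad \text{(Equation 14)}

D = \left| Y_a - Y_b \right| \quad \text{(Equation 15)}
```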
[0105]
The recognition result selection unit 220 determines that the second output signal is reliable when the difference is smaller than a threshold.
The target from which the first output signal ya(t) and the second output signal yb(t) are extracted may be the entire output waveforms of the first output signal and the second output signal shown in FIG. 14, or only a part of them. When only a part of the output waveform is used, the same section of each output waveform is selected as the target.
[0106]
If the second output signal is reliable (Yes in step S205), the recognition result selection unit 220 outputs the speech recognition result for the second output signal (step S206). Thereby, the speech recognition processing by the first beamforming processing unit 17 and the first speech recognition unit 18 is stopped, and the speech recognition processing by the second beamforming processing unit 217 and the second speech recognition unit 218 is continued. This processing is continued until the estimated direction of the speech next changes. After the next change in the estimated direction of the speech, the speech recognition processing by the first beamforming processing unit 17 and the first speech recognition unit 18 is resumed, and the processes of steps S202 to S205 are performed again.
[0107]
If the second output signal is not reliable (No in step S205), the recognition result selection unit 220 outputs the speech recognition result for the first output signal (step S207). Thereby, the speech recognition processing by the second beamforming processing unit 217 and the second speech recognition unit 218 is stopped, and the speech recognition processing by the first beamforming processing unit 17 and the first speech recognition unit 18 is continued. This processing is continued until the estimated direction of the speech next changes. After the next change in the estimated direction of the speech, the speech recognition processing by the second beamforming processing unit 217 and the second speech recognition unit 218 is resumed, and the processes of steps S201 to S205 are performed again.
[0108]
When the estimated direction of the speech changes and the second output signal is reliable, the speech recognition apparatus 200 outputs the second output signal, which reduces the output delay compared to outputting the first output signal and allows speech recognition to start earlier. Therefore, the speech recognition result can be obtained quickly, and the response of speech recognition is improved.
[0109]
[Effects, Etc.] As described above, the speech recognition apparatus 100 according to the first embodiment recognizes speech from the sound information acquired by the plurality of microphones 1. The speech recognition apparatus 100 includes: a sound information acquisition unit 10 that acquires sound information from the plurality of microphones 1; a sound source section detection unit 11 that detects, from the sound information, a sound source section including sound; a direction estimation unit 12 that obtains an estimated direction by performing direction estimation on a voice section within the sound source section; an adaptive processing unit 13 that performs adaptive processing for estimating, using the sound information, filter coefficients for extracting speech in the estimated direction; an adaptive processing convergence monitoring unit 14 that acquires information on the convergence state of the adaptive processing; a buffer 16 that holds sound information according to a determined buffer amount; a buffer amount determination unit 15 that determines the buffer amount of the sound information to be held in the buffer 16 based on the information on the sound source section, the information on the estimated direction, and the information on the convergence state of the adaptive processing; a beamforming processing unit 17 that acquires voice information by beamforming processing using the sound information held in the buffer 16 and the filter coefficients; and a speech recognition unit 18 that performs speech recognition on the beamformed voice information. Immediately after the start of the processing of the sound information, the buffer amount determination unit 15 determines, as the buffer amount to be held in the buffer 16, a buffer amount sufficient for convergence of the adaptive processing.
[0110]
In the above-described configuration, the buffer amount determination unit 15 determines, as the buffer amount to be held in the buffer 16 immediately after the start of the processing of the sound information, a buffer amount sufficient for convergence of the adaptive processing. As a result, the first part of the sound information is processed using a buffer amount sufficient for convergence of the adaptive processing, so that defects in the speech recognition result for the first part of the sound information are suppressed and a high-quality speech recognition result is obtained. In other words, the dead end of speech is suppressed. For portions other than the first portion of the sound information, the buffer amount to be held in the buffer 16 is determined based on the information on the sound source section, the information on the estimated direction, and the information on the convergence state of the adaptive processing. This makes it possible to reduce the buffer amount used for beamforming processing, and therefore to accelerate the response of speech recognition. It is thus possible to improve the response of speech recognition while suppressing the dead end. The first part of the sound information may be, for example, the sound information at the beginning of a speaker's utterance, or the sound information immediately after the direction of the speaker's voice changes due to a change of speaker, movement of the speaker, or the like.
[0111]
The voice recognition method according to the first embodiment includes: (a1) a process of acquiring sound information through the plurality of microphones 1; (a2) a process of detecting, from the acquired sound information, a sound source section including sound; (a3) a process of obtaining an estimated direction of the voice by performing direction estimation on a voice section within the detected sound source section; (a4) a process of estimating, by adaptive processing using the sound information, filter coefficients for extracting the voice information in the estimated direction; (a5) a process of determining the buffer amount of the sound information to be held in the buffer 16 based on the information on the sound source section, the information on the estimated direction, and the information on the convergence state of the adaptive processing; (a6) a process of holding the acquired sound information in the buffer 16 according to the determined buffer amount; (a7) a process of acquiring voice information by beamforming processing using the sound information held in the buffer 16 and the filter coefficients estimated by the adaptive processing; and (a8) a process of performing speech recognition on the beamformed voice information. In this speech recognition method, in process (a5), immediately after the start of the processing of the acquired sound information, a buffer amount sufficient for convergence of the adaptive processing is determined as the buffer amount to be held in the buffer. For example, holding the sound information in the buffer according to the buffer amount means securing sound information of the buffer amount within the sound information stored in the buffer, and/or releasing the memory corresponding to the sound information exceeding the buffer amount in the buffer storing the sound information. According to the above-described method, the same effects as the speech recognition apparatus 100 according to the first embodiment can be obtained.
[0112]
In the speech recognition method according to the first embodiment, in process (a5), when the information on the estimated direction of the speech has changed from the information on the estimated direction acquired previously, the buffer amount is returned to its initial value. According to the above method, the buffer amount is returned to the initial value when the estimated direction of the speech changes. For example, the initial value may be the buffer amount used immediately after the start of the processing of the sound information. As a result, even when the estimated direction of the speech changes, a high-quality speech recognition result in which the dead end is suppressed can be obtained. The buffer amount determination unit 15 of the speech recognition apparatus 100 may also carry out the above process.
[0113]
In the speech recognition method according to the first embodiment, in process (a5), when the information on the estimated direction of the speech has not changed from the information on the estimated direction acquired previously, the adaptive processing has converged, and the information on the detected sound source section is information on a non-speech section, the buffer amount is decreased. According to the above method, the amount of processing for sound information in non-speech sections can be reduced. Furthermore, since the estimated direction is constant and the adaptive processing has converged, the adverse influence of the reduced buffer amount on the speed of each process can be kept small. Therefore, the speed of the speech recognition processing can be improved. The buffer amount determination unit 15 of the speech recognition apparatus 100 may also carry out the above process.
[0114]
In the speech recognition method according to the first embodiment, in process (a5), when the information on the estimated direction of the speech has not changed from the information on the estimated direction acquired previously, the adaptive processing has converged, the information on the detected sound source section is information on a non-speech section, and the current buffer amount is larger than a preset lower limit, the buffer amount is decreased. According to the above method, the buffer amount is prevented from becoming smaller than the lower limit. If the buffer amount is too small, the accuracy of the speech recognition processing is reduced. In addition, when the buffer amount is too small and the information on the detected sound source section changes to information on a voice section, the buffer amount must first be increased before speech recognition processing can be performed, which reduces the response of the processing. By keeping the buffer amount at or above the lower limit, deterioration in the accuracy and response of the speech recognition processing can be suppressed. The buffer amount determination unit 15 of the speech recognition apparatus 100 may also carry out the above process.
[0115]
In the speech recognition method according to the first embodiment, in process (a5), when the information on the estimated direction of the speech has not changed from the information on the estimated direction acquired previously and the adaptive processing has not converged, the buffer amount is not changed. According to the above method, a decrease in the processing speed of the adaptive processing is avoided, and an increase in the time required for the adaptive processing to converge can be suppressed. The buffer amount determination unit 15 of the speech recognition apparatus 100 may also carry out the above process.
[0116]
In the speech recognition method according to the first embodiment, in process (a5), when the information on the estimated direction of the speech has not changed from the information on the estimated direction acquired previously, the adaptive processing has converged, and the information on the detected sound source section is information on a voice section, the buffer amount is not changed. In the above method, a certain buffer amount is required to obtain a clear, high-quality speech recognition result for voice-section information. By not changing the buffer amount, the accuracy of the speech recognition processing achieved so far is maintained. The buffer amount determination unit 15 of the speech recognition apparatus 100 may also carry out the above process.
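Taken together, the rules of paragraphs [0111] to [0116] amount to a small decision procedure for process (a5). The sketch below is a hypothetical rendering; the function name, the numeric `step`, and the argument convention are assumptions and not part of the patent.

```python
# Hypothetical sketch of the buffer-amount decision of process (a5),
# combining the rules of paragraphs [0111]-[0116].
def decide_buffer_amount(current, initial, lower_limit, step,
                         direction_changed, converged, is_speech_section):
    """Return the next buffer amount to hold in the buffer 16."""
    if direction_changed:
        # [0112]: the estimated direction changed -> return to initial value
        return initial
    if not converged:
        # [0115]: adaptive processing not converged -> keep the buffer amount
        return current
    if is_speech_section:
        # [0116]: voice section while converged -> keep the buffer amount
        return current
    # [0113]/[0114]: non-speech section while converged -> decrease,
    # but only while above the preset lower limit
    return max(current - step, lower_limit) if current > lower_limit else current
```

The lower-limit guard reflects [0114]: shrinking the buffer below the limit would hurt both accuracy and response when a voice section next appears.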
[0117]
In the speech recognition method according to the second embodiment, in addition to the method according to the first embodiment, the method further includes: (b1) a process of holding the filter coefficients estimated by the adaptive processing; (b2) a process of acquiring voice information by beamforming processing using the held filter coefficients; (b3) a process of performing speech recognition on the voice information acquired in process (b2); (b4) a process of, when the estimated direction changes, acquiring beamformed voice information by performing processes (a1) to (a7) on the sound information acquired after the estimated direction changed; (b5) a process of determining, using the voice information acquired in process (b2) and the voice information acquired in process (b4), whether or not the speech recognition result obtained in process (b3) can be trusted; and (b6) a process of outputting the speech recognition result from process (b3) when the result of the determination in process (b5) is that it can be trusted, and otherwise outputting a speech recognition result obtained by performing speech recognition on the voice information acquired in process (b4).
[0118]
In the above method, comparing process (b2) and process (b4), speech recognition using process (b2) achieves a faster response than speech recognition using process (b4). By performing speech recognition using process (b2) when the result is reliable, the response of the speech recognition processing can be accelerated without reducing the accuracy of the speech recognition. The recognition result selection unit 220 or the like of the speech recognition apparatus 200 may also carry out the above process.
[0119]
The above method may be realized by an MPU, a CPU, a processor, a circuit such as an LSI, an IC
card, or a single module.
[0120]
Also, the processing in the embodiments may be realized by a software program or a digital signal consisting of a software program. For example, the processing in the embodiments is realized by the following program.
[0121]
That is, this program is a program to be executed by a computer, which: (c1) acquires sound information from the plurality of microphones 1; (c2) detects, from the sound information, a sound source section including sound; (c3) acquires, by direction estimation, the estimated direction of the voice section within the sound source section; (c4) estimates, by adaptive processing using the sound information, filter coefficients for extracting the voice information in the estimated direction; (c5) determines, based on the information on the sound source section, the information on the estimated direction, and the information on the convergence state of the adaptive processing, the buffer amount of the sound information to be held in the buffer 16; (c6) holds the sound information in the buffer 16 according to the determined buffer amount; (c7) acquires voice information by beamforming processing using the sound information and filter coefficients held in the buffer 16; and (c8) performs speech recognition on the beamformed voice information. Furthermore, immediately after the start of the processing of the sound information, this program holds in the buffer 16 an amount of sound information sufficient for convergence of the adaptive processing. According to the above program, the same effects as the speech recognition apparatus 100 and the speech recognition method according to the first embodiment can be obtained.
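As a structural sketch only, steps (c1) to (c8) can be laid out as a loop over acquired frames. Every signal-processing function below is a trivial stand-in (an energy threshold, a single scalar coefficient) chosen for illustration, not the patent's actual processing.

```python
# Minimal, self-contained toy of the control flow of steps (c1)-(c8).
from collections import deque

def detect_sound_source(frame):
    # (c2): treat any non-zero sample as a sound source section (stand-in)
    return any(abs(s) > 0 for s in frame)

def beamform(buffered_frames, coeff):
    # (c7): stand-in for filtering: scaled column-wise sum over the buffer
    return [coeff * sum(col) for col in zip(*buffered_frames)]

def speech_recognize(voice):
    # (c8): stand-in energy-based decision in place of real recognition
    return "speech" if sum(abs(s) for s in voice) > 1.0 else "non-speech"

def run_pipeline(frames, buffer_amount, coeff=0.5):
    # (c6): the buffer holds sound according to the determined buffer amount
    buffer16 = deque(maxlen=buffer_amount)
    results = []
    for frame in frames:                 # (c1): acquired sound information
        if not detect_sound_source(frame):
            continue
        buffer16.append(frame)           # (c5)/(c6): buffered sound
        results.append(speech_recognize(beamform(buffer16, coeff)))
    return results
```

For instance, `run_pipeline([[0, 0], [2, 2], [0, 0], [3, 1]], 2)` skips the silent frames and recognizes the two sounding frames as speech.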
[0122]
Note that the program and the digital signal formed of the program may be recorded on a computer-readable recording medium, such as a flexible disk, a hard disk, a CD-ROM, an MO, a DVD, a DVD-ROM, a DVD-RAM, a BD (Blu-ray (registered trademark) Disc), or a semiconductor memory.
[0123]
Further, the program and the digital signal including the program may be transmitted via a
telecommunication line, a wireless or wired communication line, a network represented by the
Internet, data broadcasting, and the like.
[0124]
In addition, the program and the digital signal including the program may be implemented by another independent computer system, by being recorded on a recording medium and transported, or by being transferred via a network or the like.
[0125]
As described above, the comprehensive or specific aspects of the present disclosure may be
realized by a system, method, integrated circuit, computer program, or recording medium such as
a computer readable CD-ROM.
In addition, the comprehensive or specific aspects of the present disclosure may be realized by
any combination of a system, a method, an integrated circuit, a computer program, and a
recording medium.
[0126]
As described above, the embodiment has been described as an example of the technology
disclosed in the present application.
However, the technology in the present disclosure is not limited to these, and is also applicable to
a modification of the embodiment or another embodiment in which changes, replacements,
additions, omissions, and the like are appropriately made.
Moreover, it is also possible to combine the components described in the embodiments and modifications to form a new embodiment or modification.
[0127]
The voice recognition devices 100 and 200 according to the embodiments do not include the microphones 1, but may be configured to include them. Furthermore, an image capturing unit such as a camera and a processing unit that processes the captured image may be provided. For example, the speech recognition apparatus may be configured to combine and output the speech recognition result of the speaker and the photographed image obtained by the camera. Furthermore, the voice recognition device, or a device such as a robot including the voice recognition device, may be configured to recognize a person by collating the speech recognition result of the speaker with the image recognition result of the speaker obtained from the captured image.
[0128]
The technology of the present disclosure implements speech recognition using microphones without requiring the user to be aware of the microphone positions, and can be used as a voice recognition device in various devices that have a hands-free voice operation function, such as televisions, interactive robots, portable terminals, and wearable terminals.
[0129]
Reference Signs List 1 microphone 10 sound information acquisition unit 11 sound source section detection unit 12 direction estimation unit 13 adaptive processing unit 14 adaptive processing convergence monitoring unit 15 buffer amount determination unit 16 buffer 17 beamforming processing unit, first beamforming processing unit 18 speech recognition unit, first speech recognition unit 100, 200 speech recognition apparatus 217 second beamforming processing unit 218 second speech recognition unit 219 coefficient holding unit 220 recognition result selection unit