JP2003177777

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2003177777
[0001]
BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to technology used in the field of speech recognition, and more particularly to a speech feature extraction method and speech feature extraction apparatus suitable for extracting speech features in a real sound field, and to a speech recognition method and speech recognition apparatus using them.
[0002]
2. Description of the Related Art In speech recognition technology, the mainstream approach analyzes an input speech signal over short analysis intervals (frames) that overlap at a constant time interval to obtain a feature vector of the speech signal, and performs speech matching based on the time series of those feature vectors.
[0003]
Various methods have been proposed for computing this feature vector; representative ones include cepstrum analysis and spectrum analysis.
[0004]
By the way, the various analysis methods such as cepstrum analysis and spectrum analysis all converge, despite differences in detail, on the problem of how to estimate the spectrum of the speech signal. Since the characteristics of the speech signal appear in the structure of the spectrum, these methods can be effective means, but they have the following problems.
[0005]
(1) Since an audio signal contains frequency information over a wide range, complex parameters are required to reproduce its spectrum. Moreover, many of these parameters are of little auditory importance, which can give rise to prediction errors.
[0006]
(2) The conventional analysis methods are susceptible to noise, and there is a limit to analyzing a spectrum whose shape is strongly influenced by background noise, reverberation, and the like.
[0007]
(3) To realize speech recognition in a real environment, it is necessary to cope with movement of the speaker and with multiple sound sources, including the so-called "cocktail party effect". However, the conventional analysis methods take little account of the spatial information of the sound field, so it is difficult to perform speech feature extraction that reflects human perception in a real sound field.
[0008]
The present invention has been made to solve these problems, and its object is to provide a speech feature extraction method and a speech feature extraction device that can extract speech features in a real sound field using minimal parameters corresponding to human auditory characteristics, without performing spectrum analysis, and to provide a speech recognition method and a speech recognition device using that extraction method and device.
[0009]
Means for Solving the Problems First, through research the present inventors found that the autocorrelation function of the speech signal contains important information about speech features. Specifically, it was found that the value Φ(0) of the autocorrelation function at delay time 0 is a factor representing loudness; that the delay time τ1 and amplitude φ1 of the first peak of the autocorrelation function are factors representing the pitch of the voice and its strength; and that the effective duration τe of the autocorrelation function is a factor representing the repetitive and reverberant components contained in the signal itself. Furthermore, it was also found that the local peaks appearing before the first peak of the autocorrelation function are factors containing information on timbre (details will be described later).
[0010]
We also found that the interaural cross-correlation function of binaurally measured audio signals contains important information related to spatial characteristics such as directional localization, the sense of spread, and the width of the sound source. Specifically, it was found that the maximum value IACC of the interaural cross-correlation function is a factor related to subjective diffuseness; that the peak delay time τIACC of the interaural cross-correlation function is an important factor for the horizontal perception of the sound source; and that the maximum value IACC together with the width WIACC of the maximum amplitude of the interaural cross-correlation function are factors related to the perception of the apparent source width (ASW) (details will be described later).
[0011]
The present invention focuses on these points and, instead of performing spectrum analysis, uses the factors contained in the autocorrelation function and the interaural cross-correlation function, that is, the minimum parameters corresponding to human auditory characteristics, to realize a speech feature extraction method and a speech feature extraction device capable of extracting speech features in a real sound field, and a speech recognition method and a speech recognition device based on them. The specific configurations are shown below.
[0012]
The speech feature extraction method according to the present invention is a method of extracting the speech features necessary for speech recognition, in which an autocorrelation function of a speech signal is determined and, from that autocorrelation function, the value Φ(0) at delay time 0, the delay time τ1 and amplitude φ1 of the first peak of the autocorrelation function, and the effective duration τe of the autocorrelation function are extracted.
[0013]
In the speech feature extraction method of the present invention, the local peaks preceding the first peak of the autocorrelation function may be extracted in addition to the speech features described above.
[0014]
The speech feature extraction device according to the present invention is a device for extracting the speech features necessary for speech recognition, comprising: a microphone; computing means for obtaining an autocorrelation function of the speech signal collected by the microphone; and extraction means for extracting, from that autocorrelation function, the value Φ(0) at delay time 0, the delay time τ1 and amplitude φ1 of the first peak of the autocorrelation function, and the effective duration τe of the autocorrelation function.
[0015]
In the speech feature extraction device of the present invention, the local peaks preceding the first peak of the autocorrelation function may be extracted in addition to the speech features described above.
[0016]
According to the speech recognition method of the present invention, the data for the value Φ(0) of the autocorrelation function at delay time 0, the delay time τ1 and amplitude φ1 of the first peak of the autocorrelation function, and the effective duration τe of the autocorrelation function, extracted by the above-described speech feature extraction method, are each compared with a template for speech recognition to recognize speech.
[0017]
In the speech recognition method according to the present invention, the local peaks preceding the first peak of the autocorrelation function may be extracted in addition to the speech features described above, and the data including those local peaks compared with the template to recognize speech.
[0018]
A speech recognition apparatus according to the present invention comprises the speech feature extraction apparatus described above, and recognition means for recognizing speech by comparing with a template for speech recognition each of the data extracted by that extraction apparatus: the value Φ(0) of the autocorrelation function at delay time 0, the delay time τ1 and amplitude φ1 of the first peak of the autocorrelation function, and the effective duration τe of the autocorrelation function.
[0019]
In the speech recognition apparatus according to the present invention, the local peaks preceding the first peak of the autocorrelation function may be extracted in addition to the speech features described above, and the apparatus may be configured to compare the data including those local peaks with the template to recognize speech.
[0020]
The speech feature extraction method according to the present invention is a method of extracting the speech features necessary for speech recognition, in which an autocorrelation function and an interaural cross-correlation function of a binaurally measured speech signal are obtained and, from them, the following are extracted: the delay time τ1 and amplitude φ1 of the first peak of the autocorrelation function; the effective duration τe of the autocorrelation function; the maximum value IACC of the interaural cross-correlation function; the peak delay time τIACC of the interaural cross-correlation function; the width WIACC of the maximum amplitude of the interaural cross-correlation function; and the value Φ(0) of the autocorrelation function or the interaural cross-correlation function at delay time 0.
[0021]
In this speech feature extraction method of the present invention, the local peaks preceding the first peak of the autocorrelation function may be extracted in addition to the speech features described above.
[0022]
The speech feature extraction apparatus according to the present invention is an apparatus for extracting the speech features necessary for speech recognition, comprising: a binaural microphone; computing means for obtaining an autocorrelation function and an interaural cross-correlation function of the speech signal collected by the microphone; and extraction means for extracting, from the autocorrelation function and the interaural cross-correlation function, the delay time τ1 and amplitude φ1 of the first peak of the autocorrelation function, the effective duration τe of the autocorrelation function, the maximum value IACC of the interaural cross-correlation function, the peak delay time τIACC of the interaural cross-correlation function, the width WIACC of the maximum amplitude of the interaural cross-correlation function, and the value Φ(0) of the autocorrelation function or the interaural cross-correlation function at delay time 0.
[0023]
In the speech feature extraction apparatus of the present invention, the local peaks preceding the first peak of the autocorrelation function may be extracted in addition to the speech features described above.
[0024]
According to the speech recognition method of the present invention, the data for the delay time τ1 and amplitude φ1 of the first peak of the autocorrelation function, the effective duration τe of the autocorrelation function, the maximum value IACC of the interaural cross-correlation function, the peak delay time τIACC of the interaural cross-correlation function, the width WIACC of the maximum amplitude of the interaural cross-correlation function, and the value Φ(0) of the autocorrelation function or the interaural cross-correlation function at delay time 0, extracted by the above-described speech feature extraction method, are each compared with a template for speech recognition to recognize speech.
[0025]
In the speech recognition method according to the present invention, the local peaks preceding the first peak of the autocorrelation function may be extracted in addition to the speech features described above, and the data including those local peaks compared with the template to recognize speech.
[0026]
The speech recognition apparatus according to the present invention comprises the speech feature extraction apparatus described above, and recognition means for recognizing speech by comparing with a template for speech recognition each of the data extracted by that extraction apparatus: the delay time τ1 and amplitude φ1 of the first peak of the autocorrelation function, the effective duration τe of the autocorrelation function, the maximum value IACC of the interaural cross-correlation function, the peak delay time τIACC of the interaural cross-correlation function, the width WIACC of the maximum amplitude of the interaural cross-correlation function, and the value Φ(0) of the autocorrelation function or the interaural cross-correlation function at delay time 0.
[0027]
In the speech recognition apparatus according to the present invention, the local peaks preceding the first peak of the autocorrelation function may be extracted in addition to the speech features described above, and the apparatus may be configured to compare the data including those local peaks with the template to recognize speech.
[0028]
Here, the template for speech recognition used in the present invention is, for example, a set of feature quantities (ACF factors) of the autocorrelation function computed in advance for all syllables. The template may also include a set of feature quantities (IACF factors) of the interaural cross-correlation function computed in advance.
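For concreteness, one can picture such a template as a mapping from each syllable to its pre-computed per-frame factor arrays. The sketch below is only an illustration of this data structure under assumed names; the patent does not specify a storage format, and all numeric values are dummies.

    # Hypothetical in-memory layout for the recognition template (not specified
    # by the patent): each syllable maps to per-frame arrays of pre-computed
    # ACF factors [Phi(0), tau1, phi1, tau_e] and, optionally, IACF factors
    # [IACC, tauIACC, WIACC]. All numeric values below are dummies.
    import numpy as np

    template = {
        "a": {
            "acf": np.array([[1.0e-3, 4.5, 0.8, 35.0],    # frame 1
                             [1.1e-3, 4.6, 0.8, 36.0]]),  # frame 2
            "iacf": np.array([[0.95, 0.0, 0.04],
                              [0.94, 0.0, 0.05]]),
        },
        # ... one entry per syllable
    }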
[0029]
Hereinafter, the present invention will be described in detail.
[0030]
First, an analysis method of an audio signal used in the present invention will be described.
[0031]
In the present invention, the analysis of the speech signal is based on the model of human auditory function shown in FIG. 1. This model consists of neural mechanisms that calculate the ACF in each of the left and right pathways and the IACF between the ears, and it also takes into account the processing characteristics of the left and right cerebral hemispheres.
[0032]
In FIG. 1, r0 denotes the position of the sound source p(t) in three-dimensional space, and r denotes the position of the center of the listener's head. hl,r(r|r0, t) are the impulse responses between r0 and the entrances of the left and right ear canals. The impulse responses of the ear canal and the ossicles are represented by el,r(t) and cl,r(t), respectively. The velocity of the basilar membrane is represented by Vl,r(x, ω).
[0033]
The effectiveness of such ACF and IACF models has been substantiated by studies of the perception of the basic attributes of sound sources and of the subjective evaluation of sound fields, including preferences (Y. Ando (1998), Architectural Acoustics: Blending Sound Sources, Sound Fields, and Listeners, AIP Press/Springer-Verlag, New York).
[0034]
Furthermore, recent research in physiology has revealed that the firing patterns of the auditory nerve exhibit behavior close to the ACF of the input signal, and the existence of an ACF-like model in the neural mechanism is becoming clear (P. A. Cariani (1996), Neural correlates of the pitch of complex tones. I. Pitch and pitch salience, Journal of Neurophysiology, 76(3), 1698-1716).
[0035]
The factors extracted from the ACF allow evaluation of the basic attributes of a sound, such as its loudness, pitch, and timbre, and the factors extracted from the IACF allow evaluation of the spatial characteristics of the sound field, such as the sense of spread, directional localization, and the width of the sound source.
[0036]
In the sound field, the ACF of the sound source signal reaching the human ear is obtained from the following equation:

[0037]
Φp(τ) = lim_{T→∞} (1/2T) ∫_{−T}^{+T} p′(t) p′(t + τ) dt … (1)

[0038]
where p′(t) = p(t) * s(t), and s(t) is the sensitivity of the ear. Usually, an impulse response with the A-weighting characteristic is used for s(t).

[0039]
The power spectrum of the source signal can also be obtained from the ACF as follows:

[0040]
Pd(ω) = ∫_{−∞}^{+∞} Φp(τ) e^(−jωτ) dτ

Thus, the ACF and the power spectrum contain mathematically the same information.
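This equivalence is easy to check numerically: by the Wiener-Khinchin theorem, the ACF is the inverse Fourier transform of the power spectrum. The following is a minimal illustrative sketch (not part of the patent), using a discrete random signal in place of the continuous definitions above.

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.standard_normal(4096)          # stand-in for the signal p'(t)

    # ACF computed directly in the time domain (biased estimate, lags 0..N-1)
    acf_direct = np.correlate(p, p, mode="full")[len(p) - 1:] / len(p)

    # The same ACF as the inverse FFT of the power spectrum (Wiener-Khinchin);
    # zero-padding to 2N avoids circular-correlation wrap-around.
    power = np.abs(np.fft.rfft(p, n=2 * len(p))) ** 2
    acf_from_spectrum = np.fft.irfft(power)[: len(p)] / len(p)

    assert np.allclose(acf_direct, acf_from_spectrum)  # same information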
[0041] One of the important properties of the ACF is that it takes its maximum value at delay time τ = 0 in equation (1). This value is defined as Φ(0). Since Φ(0) represents the energy of the signal, the normalized ACF φ(τ) = Φ(τ)/Φ(0), obtained by dividing by this value, is normally used for analysis of the signal. Furthermore, the relative listening sound pressure level LL at the head position is obtained by taking the geometric mean of the left and right Φ(0) values and converting it to decibels (10 times the base-10 logarithm).
[0042] In the analysis of the ACF, the most important factor (feature quantity), overlooked until now, is the effective duration τe, defined by the envelope of the normalized ACF. The effective duration τe is defined as the delay time at which the envelope decays to 10 percent, as shown in FIG. 5, and represents the repetitive and reverberant components contained in the signal itself.

[0043] In addition, the fine structure of the ACF, including its peaks and dips, contains much information about the periodicity of the signal. Most effective for analyzing the speech signal is the pitch information: the delay time τ1 and the amplitude φ1 of the first peak of the ACF (FIG. 6) are factors representing the frequency corresponding to the pitch of the speech and its strength.
[0044] Here, the first peak is usually the maximum peak of the ACF, and the subsequent periodic peaks appear at multiples of its delay. Further, the local peaks appearing at delays before the first peak represent the time structure of the high-frequency band of the signal and contain information on the timbre. In the case of speech in particular, they represent the characteristics of the resonance frequencies of the vocal tract, called formants.

[0045] The above ACF factors include all the speech features necessary for recognition. That is, a voice can be specified by the delay time and amplitude of the first ACF peak, corresponding to the pitch and its strength, and by the local peaks of the ACF, corresponding to the formants, while the effective duration τe makes it possible to consider the impact of noise and reverberation in the real sound field.

[0046] Next, the IACF will be described.

[0047] The long-time IACF can be determined by the following equation:

[0048]
Φlr(τ) = lim_{T→∞} (1/2T) ∫_{−T}^{+T} p′l(t) p′r(t + τ) dt

[0049] where p′l,r(t) = pl,r(t) * s(t) are the sound pressures at the entrances of the left and right ear canals.

[0050] Spatial information, including the perception of the horizontal direction of the sound source, is expressed by the following factor.
[0051] It is defined by IACC = |φlr(τ)|max for |τ| ≤ 1 ms, where φlr(τ) is the normalized IACF.

[0052] τIACC and WIACC are the delay time and the width of the IACF peak, as defined in FIG. 7. Among these IACF factors, τIACC, in the range of −1 ms to +1 ms, is an important factor for the horizontal perception of the sound source.
[0053] A clear sense of direction is obtained when IACC, the maximum value of the IACF, is large and the normalized IACF has a single sharp peak. The direction is to the left of the listener when τIACC has a negative value, and to the right when it has a positive value. Conversely, when IACC is small, the subjective sense of spaciousness increases and the sense of direction becomes vague. The width of the perceived apparent sound source can be determined from IACC and WIACC.
[0054] As described above, if the value Φ(0) of the ACF at delay time 0, the delay time τ1 and amplitude φ1 of the first peak of the ACF, and the effective duration τe of the ACF are extracted from the audio signal, the magnitude of the sound can be determined from Φ(0) of the ACF, and the pitch of the voice and its strength can be determined from the delay time τ1 and amplitude φ1 of the first peak of the ACF. In addition, the effective duration τe of the ACF makes it possible to take into consideration the effects of noise and reverberation in a real sound field.
[0055] Furthermore, by extracting the local peaks that appear before the first peak of the ACF of the audio signal, it becomes possible to identify the timbre of the voice from those peaks.
[0056] In addition, by extracting the maximum value IACC of the IACF, the peak delay time τIACC of the IACF, and the width WIACC of the maximum amplitude of the IACF from the speech signal, the subjective sense of spread can be obtained from IACC, the horizontal perception of the sound source can be obtained from τIACC, and the perceived apparent source width (ASW) can be determined from IACC and WIACC.
[0057] Therefore, by adding these IACF factors, that is, spatial information of the sound field, to
speech recognition, highly accurate recognition that reflects human senses in real sound fields
becomes possible.
[0058] Here, in the present invention, it is not necessary to extract all of the ACF and IACF factors described above. If at least the four factors are available, namely the value Φ(0) of the ACF at delay time 0, the delay time τ1 and amplitude φ1 of the first peak of the ACF, and the effective duration τe of the ACF, speech features can be extracted and speech recognition can be performed reliably.
[0059] BEST MODE FOR CARRYING OUT THE INVENTION Embodiments of the present invention
will be described below with reference to the drawings.
[0060] FIG. 2 is a block diagram showing the configuration of the embodiment of the present
invention.
[0061] The speech recognition apparatus shown in FIG. 2 mainly comprises a binaural microphone 2 mounted on a head model 1 of a listener, a low-pass filter (LPF) 3 that applies an A-weighting filter to the audio signal collected by the microphone 2, an A/D converter 4, and a computer 5. The A-weighting filter corresponds to the sensitivity s(t) of the ear.
[0062] The computer 5 includes a storage device 6, an ACF calculating unit 7, an IACF calculating unit 8, an ACF factor extracting unit 9, an IACF factor extracting unit 10, a speech recognition unit 11, and a database 12.

[0063] The storage device 6 stores the audio signal collected by the binaural microphone 2.

[0064] The ACF calculating unit 7 reads out the audio signals (two channels, left and right) stored in the storage device 6 and calculates the ACF (autocorrelation function). Details of the calculation process will be described later.
[0065] The IACF operation unit 8 reads the audio signal stored in the storage device 6 and
calculates an IACF (interaural cross correlation function). Details of the calculation process will
be described later.
[0066] From the ACF calculated by the ACF calculation unit 7, the ACF factor extraction unit 9 derives each of the ACF factors: the value Φ(0) of the ACF at delay time 0, the delay time τ1 and amplitude φ1 of the first peak of the ACF, and the effective duration τe of the ACF. Furthermore, the local peaks preceding the first peak of the ACF ((τ′1, φ′1), (τ′2, φ′2), ..., shown in FIG. 6) are derived. Details of the calculation process will be described later.
[0067] From the IACF calculated by the IACF calculation unit 8, the IACF factor extraction unit 10 derives each of the IACF factors: the maximum value IACC of the IACF, the peak delay time τIACC of the IACF, and the width WIACC of the maximum amplitude of the IACF. Details of the calculation process will be described later.
[0068] The speech recognition unit 11 recognizes (identifies) syllables by comparing the ACF
factor and the IACF factor of the speech signal obtained by the above processing with the speech
recognition template stored in the database 12. Details of the syllable recognition process will be
described later.
[0069] The template stored in the database 12 is a set of ACF factors pre-computed for all syllables. The template also contains a set of pre-computed IACF factors.
[0070] Next, the operation of the syllable identification process executed in the present
embodiment will be described with reference to the flowchart shown in FIG.
[0071] First, an audio signal is collected by the binaural microphone 2 (step S1). The collected audio signal is led through the low-pass filter 3 to the A/D converter 4 and converted into a digital signal, and the converted digital signal is stored in the storage device 6 of the computer 5 (step S2).
[0072] The ACF calculating unit 7 and the IACF calculating unit 8 read out the audio signal
(digital signal) stored in the storage device 6 (step S3), and calculate the ACF and IACF of the
audio signal (step S4).
[0073] The calculated ACF and IACF are supplied to the ACF factor extraction unit 9 and the IACF factor extraction unit 10, respectively, and the ACF factors and IACF factors are calculated (step S5).
[0074] Then, the ACF factor and IACF factor of the voice signal obtained by the above processing
are compared with the template stored in the database 12, and syllables are recognized
(identified) by the processing described later (steps S6 and S7).
[0075] Here, in the apparatus configuration shown in FIG. 2, by combining the head model 1, the binaural microphone 2, the low-pass filter 3, the A/D converter 4, the storage device 6 of the computer 5, the ACF operation unit 7, the IACF operation unit 8, the ACF factor extraction unit 9, and the IACF factor extraction unit 10, a speech feature extraction device that extracts the ACF factors and IACF factors can be realized.
[0076] Further, by combining the head model 1, the binaural microphone 2, the low-pass filter 3, the A/D converter 4, the storage device 6 of the computer 5, the ACF operation unit 7, and the ACF factor extraction unit 9, a speech feature extraction device that extracts only the ACF factors can be realized.
[0077] Next, specific calculation methods of ACF and IACF will be described.
[0078] As shown in FIG. 4, a running ACF and a running IACF are calculated for short-time segments (hereinafter, frames) Fk(t) within the duration of the target audio signal. In this way, the temporal changes in the characteristics of the audio signal can be followed. The integration interval 2T of the ACF is chosen to be 20 to 40 times the minimum value of τe [ms] extracted from the ACF.
[0079] When analyzing speech, the frame length is set to several milliseconds to several tens of milliseconds, with adjacent frames overlapping each other. In the present embodiment, the frame length is set to 30 ms, with the frames overlapping at 5 ms intervals.
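As an illustration only (the patent gives no code), frame extraction with these parameters might be sketched as follows, assuming a 16 kHz sampling rate and reading "overlapping at 5 ms intervals" as a 5 ms step between successive 30 ms frames.

    import numpy as np

    def split_frames(signal, fs, frame_ms=30.0, step_ms=5.0):
        # Split the signal into overlapping analysis frames: 30 ms long,
        # starting every 5 ms (assumed interpretation of the overlap).
        frame_len = int(fs * frame_ms / 1000.0)
        step = int(fs * step_ms / 1000.0)
        n_frames = 1 + max(0, (len(signal) - frame_len) // step)
        return np.stack([signal[i * step:i * step + frame_len]
                         for i in range(n_frames)])

    fs = 16000                               # assumed sampling rate
    frames = split_frames(np.zeros(fs), fs)  # 1 s of signal -> shape (195, 480)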
[0080] The short-time running ACF, as a function of the delay time τ, is calculated as follows:

[0081]
φp(τ; t, T) = Φp(τ; t, T) / Φp(0; t, T) … (7)

[0082]
where
Φp(τ; t, T) = (1/2T) ∫_{t}^{t+2T} p′(s) p′(s + τ) ds … (8)

[0083] Here p′(t) in equation (8) is the signal obtained by applying the A-weighting filter to the collected audio signal p(t).
[0084] The Φ(0) in the denominator of equation (7) is the value of the ACF at delay time τ = 0 and represents the average energy within the frame of the sampled speech signal. Since the ACF takes its maximum value at delay time τ = 0, the ACF normalized in this way has the maximum value 1 at τ = 0.
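A direct per-frame implementation of equations (7) and (8) might look like the following sketch (an assumption-level example; the constant 1/2T factor cancels in the normalization).

    import numpy as np

    def running_acf(frame, max_lag):
        # Equations (7)/(8): Phi(tau) is the average lagged self-product of the
        # (A-weighted) frame; dividing by Phi(0), the frame energy, gives the
        # normalized ACF with its maximum value 1 at tau = 0.
        n = len(frame)
        phi = np.array([np.dot(frame[:n - lag], frame[lag:]) / n
                        for lag in range(max_lag + 1)])
        return phi / phi[0], phi[0]   # normalized ACF phi(tau), and Phi(0)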
[0085] The sound pressure level (SPL) at the position of the head is expressed as follows, where the ACFs of the signals captured at the positions of the left and right ears are written Φll(τ) and Φrr(τ), respectively:

[0086]
SPL = 10 log10 [ √(Φll(0) Φrr(0)) / Φref(0) ]

[0087] Φref(0) is the Φ(0) corresponding to the reference sound pressure of 20 μPa.
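With this definition, the binaural listening level reduces to one line; a hedged sketch, assuming Φref(0) has been obtained from a calibration at the 20 μPa reference.

    import numpy as np

    def listening_level(phi0_left, phi0_right, phi0_ref):
        # SPL at the head: 10*log10 of the geometric mean of the left and
        # right Phi(0), relative to the reference Phi_ref(0) (assumed known).
        return 10.0 * np.log10(np.sqrt(phi0_left * phi0_right) / phi0_ref)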
[0088] From the calculated ACF, we derive the factors needed to recognize syllables. The definitions and derivation methods of those factors are described below.
[0089] The effective duration τe is defined as the delay time τ at which the amplitude of the normalized ACF decays to 0.1.
[0090] FIG. 5 is a graph in which the absolute value of the ACF is plotted logarithmically on the vertical axis. Since the initial part of the ACF is generally observed to decay linearly on this scale, τe can easily be determined by linear regression. Specifically, τe is determined by applying a least-mean-squares (LMS) fit to the ACF peaks obtained within a predetermined interval Δτ.
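One possible realization of this fit is sketched below; the peak selection and the use of scipy's local-maximum finder are assumptions, not the patent's prescription.

    import numpy as np
    from scipy.signal import argrelmax

    def effective_duration_ms(phi, fs):
        # Estimate tau_e: fit a straight line (least squares) to log10|phi|
        # at the local peaks of the normalized ACF (cf. FIG. 5) and solve for
        # the delay at which the envelope decays to 0.1.
        lags = argrelmax(np.abs(phi))[0]
        slope, intercept = np.polyfit(lags / fs, np.log10(np.abs(phi[lags])), 1)
        tau_e = (np.log10(0.1) - intercept) / slope   # envelope = 0.1 here
        return 1000.0 * tau_e                          # in milliseconds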
[0091] FIG. 6 shows an example of the calculation of the normalized ACF. Here, the maximum peak of the normalized ACF is determined, and its delay time and amplitude are defined as τ1 and φ1, respectively. Furthermore, the local peaks preceding the maximum peak are obtained, and their delay times and amplitudes are defined as (τ′k, φ′k), k = 1, 2, ....
[0092] The interval in which peaks are sought extends from delay time τ = 0 to the appearance of the maximum peak of the ACF, and corresponds to one period of the ACF. As mentioned above, the maximum peak of the ACF corresponds to the pitch of the sound source, and the local peaks before it correspond to the formants.
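The extraction of τ1, φ1 and the local peaks can be sketched as below, under the assumption that peaks are simple local maxima of the normalized ACF searched from lag 0 up to the maximum peak.

    import numpy as np
    from scipy.signal import argrelmax

    def pitch_and_formant_peaks(phi, fs):
        # tau1 [ms] and phi1: delay and amplitude of the maximum ACF peak
        # (pitch period and pitch strength); the local peaks (tau'_k, phi'_k)
        # before it carry the formant (timbre) information.
        peaks = argrelmax(phi)[0]
        main = peaks[np.argmax(phi[peaks])]
        tau1, phi1 = 1000.0 * main / fs, phi[main]
        local = [(1000.0 * k / fs, phi[k]) for k in peaks if k < main]
        return tau1, phi1, local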
[0093] Next, the calculation method of the IACF and the factors derived from it are described.

[0094] The IACF is defined by the following formula:

[0095]
φlr(τ; t, T) = Φlr(τ; t, T) / √(Φll(0; t, T) Φrr(0; t, T))

[0096] Here, the subscripts l and r denote the signals arriving at the left and right ears, respectively.
[0097] FIG. 7 shows an example of the normalized IACF. For the maximum delay between the two ears, it is sufficient to consider the range of −1 ms to +1 ms. The maximum amplitude of the IACF, IACC, is a factor associated with subjective diffuseness.
[0098] Next, the value of τIACC is a factor indicating the direction of arrival of the sound source. For example, if τIACC takes a positive value, the sound source is perceived as being located to the right of the listener. When τIACC = 0, the sound source is perceived as being in front of the listener.
[0099] Also, the width WIACC of the maximum amplitude is defined as the width of the peak at a level 0.1 below its maximum value. The factor 0.1 is an experimentally obtained value and is used as an approximation.
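Gathering [0097] to [0099], the three IACF factors can be read off the normalized IACF of one binaural frame pair; the sketch below assumes discrete frames and treats the WIACC measurement approximately.

    import numpy as np

    def iacf_factors(left, right, fs):
        # Normalized IACF over -1 ms <= tau <= +1 ms, then:
        # IACC = maximum amplitude, tauIACC = its delay [ms], and
        # WIACC = width [ms] of the peak at 0.1 below the maximum (approximate).
        n, max_lag = len(left), int(fs * 1e-3)
        norm = np.sqrt(np.dot(left, left) * np.dot(right, right))
        lags = np.arange(-max_lag, max_lag + 1)
        iacf = np.array([np.dot(left[max(0, -lag):n - max(0, lag)],
                                right[max(0, lag):n - max(0, -lag)])
                         for lag in lags]) / norm
        peak = np.argmax(np.abs(iacf))
        iacc = np.abs(iacf[peak])
        tau_iacc = 1000.0 * lags[peak] / fs
        near = np.flatnonzero(np.abs(iacf) >= iacc - 0.1)
        w_iacc = 1000.0 * (near.max() - near.min()) / fs
        return iacc, tau_iacc, w_iacc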
[0100] Next, a method of recognizing syllables based on the distance between syllables of the
input signal and the template will be described.
[0101] The inter-syllable distance is computed between the ACF and IACF factors determined for the sampled speech signal and those of the templates stored in the database. A template is a pre-computed set of ACF factors for all syllables. Since the ACF factors represent the perceived characteristics of a sound, this method exploits the fact that sounds that are aurally similar naturally yield similar factors.
[0102] The distance D(x) (x: Φ(0), τe, τ1, φ1, τ′k, φ′k, k = 1, 2, ..., I) between the target input data (denoted by a) and a template (denoted by b) is calculated by the following equation:

[0103]
D(x) = (1/N) Σ_{i=1}^{N} | log x_a(i) − log x_b(i) | … (11)

[0104] Equation (11) gives the distance for Φ(0), where N denotes the number of analysis frames. Logarithms are taken in the calculation because human senses have a logarithmic sensitivity to physical quantities. Similar equations are used to determine the distances for the other, independent factors.
[0105] The total distance D is expressed by the following equation, in which the distances D(x) of the respective factors are summed:

[0106]
D = Σ_{m=1}^{M} W_m D(x_m) … (12)

[0107] In equation (12), M is the number of factors and W is a weighting factor. The template with the smallest calculated distance D is determined to be the syllable of the input signal. As will be described later, in a real sound field, recognition accuracy can be improved by adding the IACF factors when obtaining D. In that case, D(x) is calculated according to equation (11) for the IACF factors IACC, τIACC, and WIACC, and added into equation (12).
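As a worked sketch of equations (11) and (12) (the factor names, weights, and data layout are assumptions; factors are assumed to be stored per name as per-frame arrays):

    import numpy as np

    def factor_distance(a, b):
        # Equation (11): mean absolute log-difference of one factor track
        # over the N analysis frames common to input and template.
        n = min(len(a), len(b))
        return np.mean(np.abs(np.log(a[:n]) - np.log(b[:n])))

    def total_distance(x_in, x_tpl, weights):
        # Equation (12): weighted sum of the per-factor distances D(x).
        return sum(w * factor_distance(x_in[x], x_tpl[x])
                   for x, w in weights.items())

    def recognize(x_in, templates, weights):
        # The template with the smallest total distance D gives the syllable.
        return min(templates, key=lambda s: total_distance(x_in, templates[s],
                                                           weights))

    # Assumed equal weights over four ACF factors; the IACF factors (IACC,
    # tauIACC, WIACC) can be added to the sum in the same way.
    weights = {"phi0": 1.0, "tau1": 1.0, "phi1": 1.0, "tau_e": 1.0}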
[0108] As described above, according to the present embodiment, the value Φ(0) of the ACF at delay time 0, the delay time τ1 and amplitude φ1 of the first peak of the ACF, and the effective duration τe of the ACF are extracted, so the magnitude of the sound can be determined from the extracted Φ(0), and the pitch of the voice and its strength can be determined from the delay time τ1 and amplitude φ1 of the first peak of the ACF. In addition, the effective duration τe of the ACF makes it possible to take into consideration the effects of noise and reverberation in a real sound field.
[0109] As described above, according to the present embodiment, the features of speech can be extracted using four parameters corresponding to human auditory characteristics, so spectrum analysis is unnecessary, and the speech recognition apparatus can be realized with an extremely simple configuration compared with conventional ones.
[0110] Moreover, in the present embodiment, since the local peaks appearing before the first peak of the ACF are also extracted from the audio signal, the timbre of the voice can be identified from those peaks.
[0111] Further, in the present embodiment, since the maximum value IACC of the IACF, the peak delay time τIACC of the IACF, and the width WIACC of the maximum amplitude of the IACF are extracted from the audio signal, the subjective sense of spaciousness can be obtained from IACC, and the horizontal perception of the sound source can be determined from τIACC. Furthermore, from IACC and WIACC, the perceived apparent source width (ASW) can be determined.
[0112] Therefore, by adding these IACF factors, that is, spatial information of the sound field, to
speech recognition, highly accurate recognition that reflects human senses in real sound fields
becomes possible.
[0113] In the above embodiment, the value Φ(0) of the ACF at delay time 0 is extracted as the information regarding loudness, but instead the value Φ(0) of the IACF at delay time 0 may be extracted and used for recognition.
[0114] In the above embodiment, both the ACF factors and the IACF factors are extracted, but the present invention is not limited to this; only the ACF factors may be extracted. When only the ACF factors are extracted, either a binaural microphone or a monaural microphone may be used to acquire the audio signal.
[0115] Here, in the embodiment shown in FIG. 2, the speech recognition apparatus of the present invention is shown as a hardware configuration of functional blocks, but the present invention is not limited to this. For example, a speech recognition program that performs the speech recognition processing shown in FIG. 3 may be recorded on a recording medium readable by a computer such as a personal computer, and the stored program executed by the computer to realize the speech recognition method of the present invention.
[0116] Likewise, a speech feature extraction program for performing the speech feature extraction processing of steps S1 to S5 in FIG. 3 may be recorded on a recording medium readable by a computer such as a personal computer, and the stored program executed by the computer to realize the speech feature extraction method of the present invention.
[0117] The computer-readable recording medium may be a memory built into the computer, such as a ROM, or a recording medium readable by a reader (external storage device) connected to the computer: for example, a tape system such as magnetic tape or cassette tape; a magnetic disk system such as a floppy (registered trademark) disk or hard disk; an optical disc system such as CD-ROM, MO, MD, or DVD; a card system such as an IC card (including memory cards) or optical card; or a semiconductor memory such as a mask ROM, EPROM, EEPROM, or flash ROM.
[0118] DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS As an example showing the specific operation of the apparatus shown in FIG. 2, prediction results for speech intelligibility in a real sound field are presented.
[0119] In this example, an experiment was conducted in which a monosyllabic target sound was presented from in front of the subject while a disturbing sound (white noise or another monosyllable) was presented simultaneously from the side, and the subject was asked to identify the target sound. Intelligibility is expressed as the subject's rate of correct answers. The presentation angles of the disturbing sound were 30, 60, 120, and 180 degrees.
[0120] To predict the intelligibility, the ACF and IACF factors obtained when only the target sound was presented were used as the template (database), and the distance for each factor under the above experimental conditions was determined with the apparatus shown in FIG. 2. The results (measured values) and predicted values are shown in FIG. 8. When obtaining the distance D by equation (12), the predicted values were computed without including the delay times τ′k and amplitudes φ′k of the local peaks of the normalized ACF as factors.
[0121] From the results in FIG. 8, the experimental results of this example agree closely with the values predicted by calculation (correlation r = 0.86), showing that by adding the spatial information of the sound field, recognition reflecting human senses in a real sound field is possible. It can also be seen that, with the apparatus of FIG. 2, prediction is possible even under adverse conditions where strong disturbing sounds are present in the sound field.
[0122] As described above, according to the present invention, the ACF (autocorrelation function) of an audio signal is determined, and the value Φ(0) at delay time 0, the delay time τ1 and amplitude φ1 of the first peak of the ACF, and the effective duration τe of the ACF are derived from it, so the features of speech can be extracted using the minimum parameters corresponding to auditory characteristics, without complex spectrum analysis. Moreover, since these ACF factors contain information important for speech recognition, speech recognition can be performed with high accuracy.
[0123] Furthermore, in the present invention, the IACF (interaural cross-correlation function) of the speech signal is obtained, and the maximum value IACC of the IACF, the peak delay time τIACC of the IACF, and the width WIACC of the maximum amplitude of the IACF are extracted from it. Therefore, by adding these IACF factors, that is, the spatial information of the sound field, to speech recognition, highly accurate recognition reflecting human senses in real sound fields becomes possible. Moreover, by introducing the IACF factors, speech recognition that is robust to noise can be realized.