Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2010154260
The present invention is intended to obtain a good voice by attenuating noise even for a sound source that generates voice intermittently. SOLUTION: The device comprises an image/voice feature information storage unit 31 that stores image/voice feature information; an object detection unit 24a that detects feature information of a subject image; voice detection units 14a and 14b that detect voice feature information; an object position detection unit 24b that calculates the distance and direction to the subject; a voice position detection unit 12 that calculates the distance and direction to the sound source; an associating unit 40a that associates the subject and the sound source as the same object based on the feature information of the subject image, the feature information of the voice, the distance and direction to the subject, and the distance and direction to the sound source; a feature information determination unit 40b that determines whether the feature information of the object matches the image/voice feature information; a tracking control unit 40d that tracks the subject image; and directivity characteristic adjustment units 13a and 13b that adjust the directivity characteristic of the microphone array 11 based on the tracking result and the distance and direction of the subject or the sound source. [Selected figure] Figure 1
Voice recognition device
[0001]
The present invention relates to a device for identifying a voice emitted from an object, and more
particularly to a voice recognition device capable of attenuating noise and obtaining a good voice
even for a sound source that generates voice intermittently.
[0002]
In general, a video camera or the like may shoot a large number of people. In such a case, it is necessary to detect the direction of the speaker who emits a voice, increase the directivity of the microphone in the detected direction, and thereby attenuate the noise.
[0003]
Therefore, Patent Document 1 proposes an on-vehicle speech recognition apparatus including a direction detection unit that detects the direction of the speaker, a microphone that detects the voice of the speaker, a gain adjustment unit that enhances the directivity characteristic of the microphone in the direction of the speaker detected by the direction detection unit, and a speech recognition unit that recognizes the speech of the speaker adjusted by the gain adjustment unit.
[0004]
Further, Patent Document 2 proposes a vehicle voice recognition device in which the operator's lip position is specified from the image captured by a camera, the directivity characteristics of a plurality of microphones are adjusted based on the specified lip position, and the sound signals of the plurality of microphones are synthesized.
[0005]
Further, Patent Document 3 proposes an audiovisual cooperation recognition apparatus that detects the direction of the speaker based on the voice input from a microphone, adjusts the directivity characteristic of the microphone, directs a camera toward the detected speaker, performs face detection based on the image captured by this camera, and performs interactive processing when a face is detected.
Patent Document 1: JP-A-11-219193; Patent Document 2: JP-A-2000-10589; Patent Document 3: JP-A-2006-251266
[0006]
However, in the technique of Patent Document 1, since the directional characteristic of the microphone is adjusted based on the voice emitted by the speaker, it was difficult to adjust the directional characteristic of the microphone while the speaker is not speaking.
[0007]
Further, in the technique described in Patent Document 2, the directional characteristics of the microphone are adjusted based on the operator's lip position specified from the captured image, so it was difficult to adjust the directional characteristics of the microphone while the operator's lips are not moving, that is, while the operator is not speaking.
[0008]
Furthermore, in the technique described in Patent Document 3, the direction of the speaker is detected based on the voice input from the microphone. For example, when shooting a large number of people who utter at random, the directional characteristics of the microphone must be readjusted each time a voice is input, so the load on the device is large, and it is difficult to detect the voice immediately after utterance with high sensitivity.
[0009]
The present invention has been made in view of the above problems, and an object of the present invention is to provide a voice identification device capable of attenuating noise and obtaining a good voice even for a sound source that generates voice intermittently.
[0010]
In order to achieve the above object, according to a first feature of the voice identification device according to the present invention, there is provided a voice identification device for identifying a voice emitted from an object, comprising: an imaging unit that converts light collected by an optical system into an electric signal to generate image data; a microphone array in which a plurality of microphones that convert sound emitted from a sound source into electric signals to generate voice data are arranged at predetermined intervals; an image/voice feature information storage unit that associates feature information of a subject image included in the image data with feature information of a voice emitted from the sound source and stores them as image/voice feature information; an object detection unit that detects feature information of a subject image from the image data generated by the imaging unit; a voice detection unit that detects voice feature information from the voice data generated by the microphone array; an object position detection unit that calculates the distance from the voice identification device to the subject and the direction of the subject with respect to the voice identification device based on the image data generated by the imaging unit; a voice position detection unit that calculates the distance from the voice identification device to the sound source and the direction of the sound source with respect to the voice identification device based on the voice data generated by the microphone array; an associating unit that associates the subject and the sound source as the same object based on the feature information of the subject image detected by the object detection unit, the feature information of the voice detected by the voice detection unit, the distance and direction of the subject calculated by the object position detection unit, and the distance and direction of the sound source calculated by the voice position detection unit; a feature information determination unit that determines whether or not the feature information of the object associated by the associating unit matches the image/voice feature information stored in the image/voice feature information storage unit; a tracking control unit that tracks, on the image data, the subject whose feature information the feature information determination unit has determined to match the image/voice feature information; and a directivity characteristic adjustment unit that adjusts the directivity characteristic of the microphone array based on the tracking result of the tracking control unit and the distance and direction of the subject calculated by the object position detection unit or the distance and direction of the sound source calculated by the voice position detection unit.
[0011]
In order to achieve the above object, according to a second feature of the voice identification device according to the present invention, the device further comprises a storage control unit that, when the feature information determination unit determines that the feature information does not match, stores the feature information of the object associated by the associating unit in the image/voice feature information storage unit as new image/voice feature information.
[0012]
In order to achieve the above object, according to a third feature of the voice identification device according to the present invention, the object position detection unit calculates the distance from the voice identification device to the subject and the direction of the subject with respect to the voice identification device based on the angle of view of the imaging unit and focus information to the subject.
[0013]
In order to achieve the above object, according to a fourth feature of the voice identification device according to the present invention, the device further comprises an image reference feature information storage unit that associates feature information of the subject image with reference dimensions of the subject and stores them as image reference feature information, and the object position detection unit extracts, based on the image reference feature information, a reference dimension of the subject corresponding to the feature information of the subject image included in the image data, and calculates the distance from the voice identification device to the subject and the direction of the subject with respect to the voice identification device based on the extracted reference dimension of the subject and the angle of view of the imaging unit.
[0014]
In order to achieve the above object, according to a fifth feature of the voice identification device according to the present invention, the voice position detection unit calculates the distance from the voice identification device to the sound source and the direction of the sound source with respect to the voice identification device based on the time differences of the voice reaching the plurality of microphones.
[0015]
In order to achieve the above object, according to a sixth feature of the voice identification device according to the present invention, when the feature information determination unit determines that the feature information of the subject image matches, the tracking control unit divides the image displayed based on the image data into a plurality of blocks and tracks the movement of the subject by detecting the movement of each block.
[0016]
In order to achieve the above object, according to a seventh feature of the voice identification device according to the present invention, the directivity characteristic adjustment unit superimposes the voice data generated by the plurality of microphones so as to eliminate the time differences of the voice reaching the plurality of microphones, based on the tracking result of the tracking control unit and the distance and direction of the subject calculated by the object position detection unit or the distance and direction of the sound source calculated by the voice position detection unit.
[0017]
According to the voice identification device of the present invention, it is possible to obtain good
voice by attenuating noise even for a sound source that generates voice intermittently.
[0018]
Hereinafter, embodiments of the present invention will be described with reference to the
drawings.
[0019]
In an embodiment of the present invention, a voice recognition apparatus that attenuates noise
even for a sound source that generates voice intermittently and obtains good voice will be
described as an example.
[0020]
<Configuration of Voice Identification Device> FIG. 1 is a configuration diagram showing a
configuration of a voice identification device according to an embodiment of the present
invention.
[0021]
The voice identification device 1 according to an embodiment of the present invention includes a microphone array 11, a voice position detection unit 12, a first directivity characteristic adjustment unit 13a, a second directivity characteristic adjustment unit 13b, a first voice detection unit 14a, a second voice detection unit 14b, a camera 21 and a camera processing unit 22 constituting an imaging unit, a motion sensor 23, a detection unit 24, a motion vector detection unit 25, an image/voice feature information storage unit 31, an image reference feature information storage unit 32, a voice reference feature information storage unit 33, a directivity characteristic priority storage unit 34, a CPU 40, an operation unit 41, and a display unit 42.
[0022]
The microphone array 11 includes a first microphone 11a, a second microphone 11b, and a third microphone 11c, which are disposed at predetermined intervals of, for example, about 10 mm, and which convert the sound emitted from a sound source into voice data.
[0023]
The voice position detection unit 12 calculates the distance from the voice identification device 1
to the sound source and the direction of the sound source with respect to the voice identification
device 1 based on the voice data generated by the microphone array 11.
[0024]
The first directivity characteristic adjustment unit 13a adjusts the directivity characteristics by superimposing the voice data generated by the first microphone 11a, the second microphone 11b, and the third microphone 11c so as to eliminate the time differences between the voices reaching these microphones, based on the tracking result of the tracking control unit 40d of the CPU 40 described later and the distance and direction of the subject calculated by the object position detection unit 24b of the detection unit 24 or the distance and direction of the sound source calculated by the voice position detection unit 12.
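The superposition performed here is, in effect, delay-and-sum beamforming. The following is a minimal sketch of that idea in Python with NumPy; the function name, the assumed speed of sound, and the geometry handling are illustrative assumptions, not details taken from the patent.

import numpy as np

SOUND_SPEED = 343.0  # m/s, assumed speed of sound in air

def delay_and_sum(channels, mic_positions, source_position, sample_rate):
    """Align and sum multi-microphone recordings toward a source position.

    channels: list of equal-length 1-D arrays, one per microphone
    mic_positions, source_position: coordinates in meters
    """
    # Propagation delay from the source to each microphone, in samples,
    # relative to the closest microphone.
    dists = [np.linalg.norm(np.asarray(p) - np.asarray(source_position))
             for p in mic_positions]
    delays = [(d - min(dists)) / SOUND_SPEED * sample_rate for d in dists]

    # Advance each channel by its relative delay so the target's wavefront
    # lines up; off-axis sounds add incoherently and are attenuated.
    out = np.zeros(len(channels[0]))
    for ch, delay in zip(channels, delays):
        shift = int(round(delay))
        out[:len(ch) - shift] += ch[shift:]
    return out / len(channels)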
[0025]
The second directivity characteristic adjustment unit 13b has the same configuration as the first directivity characteristic adjustment unit 13a.
[0026]
The first voice detection unit 14a extracts voice feature information from the voice data whose
directivity characteristic has been adjusted by the first directivity characteristic adjustment unit
13a.
Specifically, the first voice detection unit 14a extracts a component waveform, a formant, and the
like from the voice whose directivity characteristic has been adjusted, and supplies these to the
CPU 40 as voice feature information.
[0027]
The second voice detection unit 14b has the same configuration as the first voice detection unit 14a.
[0028]
The camera 21 converts the light collected by the built-in lens into an electrical signal.
[0029]
The camera processing unit 22 converts the electric signal supplied from the camera 21 into image data such as an RGB signal, a luminance signal Y, and color difference signals Cr and Cb.
[0030]
The motion sensor 23 includes, for example, a gyro sensor, and detects the motion of the voice
recognition device 1.
[0031]
The detection unit 24 includes an object detection unit 24a and an object position detection unit 24b.
[0032]
The object detection unit 24a detects feature information of a subject image from the image data generated by the camera processing unit 22.
For example, the object detection unit 24a detects the shape and color of the subject image from the image data as feature information, and identifies the type of the subject based on the detected shape and color and the image reference feature information stored in the image reference feature information storage unit 32.
The object detection unit 24a then supplies the identified type of the subject and the feature information of the subject image to the CPU 40.
[0033]
The object position detection unit 24b calculates the distance from the voice recognition device 1 to the subject of the image data and the direction of the subject with respect to the voice recognition device 1 based on the image data generated by the camera processing unit 22.
[0034]
The motion vector detection unit 25 detects the motion of the image data generated by the
camera 21.
[0035]
The image / sound feature information storage unit 31 associates the feature information of the
subject image of the image data with the feature information of the sound emitted from the
sound source and stores the result as the image / sound feature information.
[0036]
FIG. 2 is a diagram showing an example of the image / sound feature information stored in the
image / sound feature information storage unit 31 included in the sound identification apparatus
1 according to an embodiment of the present invention.
[0037]
As shown in FIG. 2, a column name "feature information ID" (symbol 51), a column name "type" (symbol 52), a column name "subject image feature information" (symbol 53), and a column name "voice data feature information" (symbol 54) are associated with each other and stored as image/voice feature information.
[0038]
Further, the subject image feature information 53 includes a column name "shape" (symbol 53a) and a column name "color" (symbol 53b).
The voice data feature information 54 includes a column name "component waveform" (symbol 54a) and a column name "formant" (symbol 54b).
[0039]
The image reference feature information storage unit 32 stores the type of the subject and the
image reference feature information in association with each other.
[0040]
FIG. 3 is a diagram showing an example of the image reference feature information stored in the
image reference feature information storage unit 32 provided in the voice identification device 1
according to the embodiment of the present invention.
[0041]
As shown in FIG. 3, the column name "type" (reference numeral 61) and the column name "image reference feature information" (reference numeral 62) are stored in association with each other.
The image reference feature information 62 includes a column name "shape" (reference numeral 62a), a column name "color" (reference numeral 62b), and a column name "reference dimension" (reference numeral 62c).
[0042]
The voice reference feature information storage unit 33 stores the type of sound source and the
voice reference feature information in association with each other.
[0043]
FIG. 4 is a diagram showing an example of the voice reference feature information stored in the
voice reference feature information storage unit 33 included in the voice identification device 1
according to the embodiment of the present invention.
[0044]
As shown in FIG. 4, the column name "type" (symbol 71) and the column name "voice reference feature information" (symbol 72) are stored in association with each other.
The voice reference feature information 72 includes a column name "power spectrum" (symbol 72a) and a column name "sound spectrogram" (symbol 72b).
[0045]
The directivity characteristic priority storage unit 34 stores the priorities of the types of subjects and sound sources supplied from the operation unit 41 described later.
Until a priority is designated via the operation unit 41, the CPU 40 described later performs processing in accordance with a predetermined priority stored in advance in the directivity characteristic priority storage unit 34.
[0046]
The CPU 40 centrally controls the voice identification device 1.
Further, the CPU 40 functionally includes an association unit 40a, a feature information
determination unit 40b, a storage control unit 40c, a tracking control unit 40d, and a directivity
adjustment control unit 40e.
[0047]
The associating unit 40a associates the subject and the sound source as the same object based on the feature information of the subject image detected by the object detection unit 24a, the feature information of the voice detected by the first voice detection unit 14a or the second voice detection unit 14b, the distance and direction of the subject calculated by the object position detection unit 24b, and the distance and direction of the sound source calculated by the voice position detection unit 12.
[0048]
The feature information determination unit 40b determines whether the feature information of
the object associated by the association unit 40a matches the image / voice feature information
stored in the image / voice feature information storage unit 31.
[0049]
If the feature information determination unit 40b determines that the feature information does not match, the storage control unit 40c stores the feature information of the object associated by the associating unit 40a in the image/voice feature information storage unit 31 as new image/voice feature information.
[0050]
When the feature information determination unit 40b determines that the feature information of the subject image matches, the tracking control unit 40d divides the image displayed on the display unit 42 into a plurality of blocks based on the image data and tracks the movement of the subject by detecting the movement of each block.
[0051]
The directivity adjustment control unit 40e causes the first directivity characteristic adjustment unit 13a or the second directivity characteristic adjustment unit 13b to adjust the directivity characteristic based on the tracking result of the tracking control unit 40d and the distance and direction of the subject calculated by the object position detection unit 24b or the distance and direction of the sound source calculated by the voice position detection unit 12.
[0052]
The operation unit 41 generates various operation signals based on the operation of the user, such as an operation signal requesting the start and end of shooting and an operation signal requesting storage of the priority of the type of the subject in the directivity characteristic priority storage unit 34, and supplies the generated operation signals to the CPU 40.
[0053]
The display unit 42 includes an image output device such as an organic EL (electroluminescence)
display or a liquid crystal display, and displays various screens based on image data supplied
from the CPU 40.
[0054]
<Operation of Speech Identification Device 1> Next, the operation of the speech identification
device 1 according to an embodiment of the present invention will be described.
[0055]
FIG. 5 is a flow chart showing the processing flow of the speech recognition apparatus 1
according to an embodiment of the present invention.
[0056]
First, when an electric signal is supplied from the camera 21 (step S101), the camera processing unit 22 of the voice identification device 1 converts the supplied electric signal into an RGB signal, a luminance signal Y, color difference signals Cr and Cb, and the like to generate image data.
[0057]
Next, the object position detection unit 24b corrects the shake based on the movement of the voice identification device 1 detected by the motion sensor 23 (step S102).
For example, the object position detection unit 24b selects a range of image data to be cut out from the image data supplied from the camera processing unit 22 so as to cancel the movement of the voice identification device 1 detected by the motion sensor 23, and supplies the cut-out image data to the object detection unit 24a.
[0058]
Then, the object detection unit 24a detects the feature information of the subject image from the image data whose shake has been corrected (step S103).
For example, the object detection unit 24a detects the shape and color of the subject image from the image data as feature information, and identifies the type of the subject based on the detected shape and color and the image reference feature information stored in the image reference feature information storage unit 32.
The object detection unit 24a then supplies the identified type of the subject and the feature information of the subject image to the CPU 40.
[0059]
FIG. 6 is a diagram for explaining detection processing by the object detection unit 24a included in the speech recognition apparatus 1 according to the embodiment of the present invention.
[0060]
As shown in FIG. 6, subject A and subject B, who are men, appear on the screen captured by the camera 21, and the object detection unit 24a detects the shapes and colors of subject A and subject B as feature information.
Then, the object detection unit 24a extracts, from the image reference feature information stored in the image reference feature information storage unit 32, the type of subject matching the detected shapes and colors, and supplies the extracted types of subject A and subject B and the feature information of each subject image to the CPU 40.
In the example shown in FIG. 6, the object detection unit 24a extracts "male" as the type of both subject A and subject B, and supplies the extracted type "male" and the feature information of each subject image to the CPU 40.
[0061]
Next, the object position detection unit 24b calculates the distance from the voice identification device 1 to the subject and the direction of the subject with respect to the voice identification device 1 based on the image data whose shake has been corrected (step S104).
For example, the object position detection unit 24b calculates the distance from the voice identification device 1 to the subject of the image data and the direction of the subject with respect to the voice identification device 1 based on the angle of view of the camera 21 and focus information to the subject.
[0062]
FIG. 7 is a diagram for explaining the process of calculating the direction of the subject by the
object position detection unit 24b included in the speech recognition apparatus 1 according to
an embodiment of the present invention.
[0063]
As shown in FIG. 7, subject A and subject B shown in FIG. 6 appear on the screen captured by the camera 21.
Assuming that the angle of view of the camera 21 is ±Φ, the object position detection unit 24b determines that subject A detected by the object detection unit 24a is in the +θ3 direction on the xy plane when the voice recognition device 1 is viewed from above, that is, that subject A exists on the straight line 201 in the +θ3 direction.
[0064]
Then, the object position detection unit 24b calculates the distance from the voice recognition
device 1 to the subject based on the image data whose shake has been corrected.
[0065]
FIG. 8 is a diagram for explaining the process of calculating the distance of the subject by the
object position detection unit 24b included in the speech recognition apparatus 1 according to
an embodiment of the present invention.
[0066]
When subject A or B is within the focusing range of the camera 21, the object position detection unit 24b calculates the distance from the focus information.
[0067]
As shown in FIG. 8, when subject A is within the focusing range, the object position detection unit 24b calculates the distance d1 between the camera 21 and subject A from the focus information.
[0068]
When subject A or B is out of the focusing range of the camera 21, the object position detection unit 24b extracts the reference dimension of the subject corresponding to the feature information of the subject image of the image data based on the image reference feature information stored in the image reference feature information storage unit 32, and calculates the distance from the camera 21 to the subject of the image data based on the extracted reference dimension of the subject and the angle of view of the camera 21.
[0069]
Specifically, when subject B shown in FIG. 8 is out of the focusing range, the object position detection unit 24b extracts, from the image reference feature information stored in the image reference feature information storage unit 32, the reference dimension L2 corresponding to the type of the subject specified in step S103.
[0070]
Then, assuming that the height of the screen shown in FIG. 6 is Hc, the vertical length of the face of subject B on the screen is H2, and the angle of view is θc, the object position detection unit 24b calculates the angle θ2 using Equation 1 below.
[0071]
θ2 = θc × H2 / Hc (Equation 1)

Next, the object position detection unit 24b calculates the distance d2 from the extracted reference dimension L2 and the calculated angle θ2 using Equation 2 below.
[0072]
d2 = L2 / tan θ2 (Equation 2)

Thus, the object position detection unit 24b can calculate the distance from the voice identification device 1 to the subject and the direction of the subject with respect to the voice identification device 1 based on the image data whose shake has been corrected.
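As a worked sketch of Equations 1 and 2, the calculation might look as follows in Python; the sample values (angle of view, pixel counts, reference face height) are hypothetical, not taken from the patent.

import math

def subject_distance(theta_c_deg, h2_px, hc_px, l2_m):
    """Estimate distance to a subject from its apparent size (Equations 1 and 2).

    theta_c_deg: vertical angle of view of the camera, in degrees
    h2_px:      vertical size of the subject's face on screen, in pixels
    hc_px:      total screen height, in pixels
    l2_m:       reference dimension of the subject (e.g. face height), in meters
    """
    theta2_deg = theta_c_deg * h2_px / hc_px          # Equation 1
    return l2_m / math.tan(math.radians(theta2_deg))  # Equation 2

# Hypothetical example: 40-degree view, face occupies 54 of 1080 lines,
# reference face height 0.24 m -> roughly 6.9 m away.
print(subject_distance(40.0, 54, 1080, 0.24))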
[0073]
Next, when voice data is supplied from the first microphone 11a, the second microphone 11b, and the third microphone 11c (step S105), the voice position detection unit 12 corrects the shake based on the movement of the voice identification device 1 detected by the motion sensor 23 (step S106).
[0074]
Next, the first voice detection unit 14a or the second voice detection unit 14b detects the feature information of the shake-corrected voice supplied from the first directivity characteristic adjustment unit 13a or the second directivity characteristic adjustment unit 13b (step S107).
For example, the first voice detection unit 14a extracts a component waveform, a formant, or the like as voice feature information from the shake-corrected voice data, and ranks the sound source types based on the extracted component waveform or formant and the voice reference feature information stored in the voice reference feature information storage unit 33.
The first voice detection unit 14a then supplies the ranked sound source types and the voice feature information to the CPU 40.
[0075]
FIG. 9 is a view for explaining detection processing by the first voice detection unit 14a or the second voice detection unit 14b included in the voice identification device 1 according to the embodiment of the present invention.
(a) shows an example of the waveform of the corrected voice data, (b) shows the power spectrum generated based on (a), and (c) shows an example of a power spectrum of the voice reference feature information stored in the voice reference feature information storage unit 33.
[0076]
Since the first voice detection unit 14a and the second voice detection unit 14b have the same configuration, only the first voice detection unit 14a will be described.
[0077]
As shown in FIG. 9, the first speech detection unit 14a generates the power spectrum 302 shown
in FIG. 9 (b) from the speech waveform 301 shown in FIG. 9 (a).
[0078]
Then, the first voice detection unit 14a calculates the degree of coincidence between the generated power spectrum 302 and the power spectrum 303 of the voice reference feature information stored in the voice reference feature information storage unit 33 shown in FIG. 9(c), and performs ranking based on the calculated degree of coincidence.
[0079]
Specifically, the first voice detection unit 14a calculates the value of each of the frequency components (A1 to A7) of the power spectrum shown in FIG. 9(b) and the value of each of the frequency components (A1 to A7) of the power spectrum shown in FIG. 9(c), and calculates the absolute value of the difference between the calculated values for each of the frequency components (A1 to A7).
[0080]
The smaller the sum of the absolute values of the differences calculated for the frequency components (A1 to A7), the better the generated power spectrum 302 matches the power spectrum 303 of the voice reference feature information stored in the voice reference feature information storage unit 33.
The first voice detection unit 14a therefore ranks the types of sound sources by rearranging them in ascending order of this sum.
[0081]
For example, the first voice detection unit 14a calculates an evaluation point that increases as the sum of the absolute values of the differences calculated for the frequency components (A1 to A7) decreases, and rearranges the types of sound sources in descending order of evaluation points.
[0082]
As a result, the detected sound source types and evaluation points are, for example, "male" (90 points), "female" (70 points), "dog" (50 points), and "car" (20 points), and the types of sound sources are rearranged in descending order of evaluation points.
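The ranking described above amounts to a sum-of-absolute-differences match between spectra. Below is a minimal sketch, assuming the spectra are already reduced to the seven band values A1 to A7; the reference dictionary contents and the scoring formula are hypothetical stand-ins for what the voice reference feature information storage unit 33 would hold.

import numpy as np

# Hypothetical reference power-spectrum bands (A1..A7) per sound-source type.
REFERENCE_SPECTRA = {
    "male":   np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1]),
    "female": np.array([0.5, 0.7, 0.8, 0.6, 0.4, 0.3, 0.2]),
    "dog":    np.array([0.3, 0.4, 0.6, 0.7, 0.6, 0.5, 0.3]),
    "car":    np.array([0.8, 0.5, 0.2, 0.1, 0.1, 0.1, 0.1]),
}

def rank_sound_sources(observed_bands):
    """Rank source types: the smaller the sum of absolute band differences,
    the better the match, hence the higher the evaluation point."""
    scores = []
    for source_type, ref in REFERENCE_SPECTRA.items():
        sad = float(np.sum(np.abs(observed_bands - ref)))
        scores.append((source_type, 100.0 / (1.0 + sad)))  # assumed scoring
    return sorted(scores, key=lambda s: s[1], reverse=True)

observed = np.array([0.85, 0.75, 0.55, 0.40, 0.30, 0.20, 0.10])
print(rank_sound_sources(observed))  # "male" should rank first here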
[0083]
The first speech detection unit 14a may perform prioritization based on a sound spectrogram
instead of ranking based on the power spectrum.
[0084]
In this case, similarly, the first voice detection unit 14a calculates the degree of coincidence between the sound spectrogram generated based on the voice data and the sound spectrogram of the voice reference feature information stored in the voice reference feature information storage unit 33, and performs ranking based on the calculated degree of coincidence.
[0085]
As shown in FIG. 5, next, the voice position detection unit 12 calculates the distance from the voice identification device 1 to the sound source and the direction of the sound source with respect to the voice identification device 1 based on the corrected voice data (step S108).
[0086]
FIG. 10 is a view for explaining calculation processing of the direction and distance of the sound
source by the audio position detection unit 12 provided in the speech recognition apparatus 1
according to an embodiment of the present invention.
[0087]
As shown in FIG. 10, since the first microphone 11a, the second microphone 11b, and the third microphone 11c are disposed apart from each other by a predetermined distance, the voice uttered by the sound source A reaches each microphone with a different delay time.
[0088]
Specifically, as shown in FIG. 10, assuming that the time from when the sound source A emits a voice to when it reaches the first microphone 11a is t0, the time to reach the second microphone 11b is (t0 + t1), and the time to reach the third microphone 11c is (t0 + t2).
[0089]
Therefore, the voice position detection unit 12 compares the phases of the voices input to the first microphone 11a, the second microphone 11b, and the third microphone 11c to calculate the delay times t1 and t2 of the voice input to the microphones, and calculates the distance from the voice identification device 1 to the sound source and the direction of the sound source with respect to the voice identification device 1 based on the calculated delay times t1 and t2.
[0090]
FIG. 11 is a diagram showing an example of the phase comparison of the voice waveforms input to the first microphone 11a, the second microphone 11b, and the third microphone 11c included in the voice identification device 1 according to the embodiment of the present invention.
[0091]
As shown in FIG. 11, at time T10, the voice that has been emitted from the sound source A and reached the first microphone 11a has a peak, so the voice position detection unit 12 sets this peak time T10 as a reference.
Then, the voice position detection unit 12 sets the time from T10 until a similar peak appears at time T11 in the voice waveform that has reached the second microphone 11b as the delay time t1.
Further, the voice position detection unit 12 sets the time from T10 until a similar peak appears at time T12 in the voice waveform that has reached the third microphone 11c as the delay time t2.
[0092]
Then, based on the calculated delay times t1 and t2, the voice position detection unit 12 calculates the distance from the voice identification device 1 to the sound source and the direction of the sound source with respect to the voice identification device 1.
Specifically, assuming that the sound velocity is v, the distance from the sound source A to the first microphone 11a is v·t0, the distance from the sound source A to the second microphone 11b is v·(t0 + t1), and the distance from the sound source A to the third microphone 11c is v·(t0 + t2).
The voice position detection unit 12 then finds the point separated from the first microphone 11a, the second microphone 11b, and the third microphone 11c by v·t0, v·(t0 + t1), and v·(t0 + t2), respectively; that is, when circles are drawn centered on the first microphone 11a, the second microphone 11b, and the third microphone 11c with radii v·t0, v·(t0 + t1), and v·(t0 + t2), the point where the circles overlap is defined as the position of the sound source A.
[0093]
Thereby, the voice position detection unit 12 can calculate the distance from the voice
identification device 1 to the sound source and the direction of the sound source with respect to
the voice identification device 1 based on the corrected voice data.
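A minimal sketch of the two steps above: cross-correlation is used here as a stand-in for the peak-based phase comparison of [0091] (an assumption; the patent compares waveform peaks directly), and the circle intersection of [0092] is solved numerically from the relative delays t1 and t2. All names and the 2-D setup are illustrative.

import numpy as np
from scipy.optimize import least_squares

SOUND_SPEED = 343.0  # m/s

def estimate_delay(ref, sig, sample_rate):
    """Relative delay (seconds) of sig vs. ref via the cross-correlation peak."""
    corr = np.correlate(sig, ref, mode="full")
    lag = np.argmax(corr) - (len(ref) - 1)
    return lag / sample_rate

def locate_source(mic_positions, t1, t2):
    """Find the 2-D point whose distance differences to the three microphones
    match the measured delays t1 and t2 (the circle intersection of [0092])."""
    m1, m2, m3 = (np.asarray(m, dtype=float) for m in mic_positions)

    def residuals(p):
        d1 = np.linalg.norm(p - m1)
        d2 = np.linalg.norm(p - m2)
        d3 = np.linalg.norm(p - m3)
        return [d2 - d1 - SOUND_SPEED * t1,
                d3 - d1 - SOUND_SPEED * t2]

    # Initial guess in front of the array; returns (x, y) of the source.
    return least_squares(residuals, x0=np.array([0.0, 1.0])).x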
[0094]
Note that, for example, when the sound source A and the sound source B emit sound simultaneously, the voice position detection unit 12 calculates the distance from the voice identification device 1 to each sound source and the direction of each sound source with respect to the voice identification device 1 using, for example, the technology described in Japanese Patent Laid-Open No. 2006-227328.
Specifically, the voice position detection unit 12 determines whether each band division signal obtained by band division is a signal in which a plurality of sound sources overlap or a signal consisting of only one sound source, and calculates the sound source direction using only the frequency components in which the sound sources do not overlap.
[0095]
Next, the associating unit 40a of the CPU 40 determines whether the sound source and the subject can be associated based on the distance from the voice identification device 1 to the subject and the direction of the subject with respect to the voice identification device 1 calculated in step S104, the distance from the voice identification device 1 to the sound source and the direction of the sound source with respect to the voice identification device 1 calculated in step S108, the type of the subject specified in step S103, and the ranking of the sound source types determined in step S107 (step S109).
[0096]
For example, when the predetermined peripheral range of the position specified by the distance from the voice identification device 1 to the subject and the direction of the subject with respect to the voice identification device 1 calculated in step S104 overlaps the predetermined peripheral range of the position specified by the distance from the voice identification device 1 to the sound source and the direction of the sound source with respect to the voice identification device 1 calculated in step S108, and the type of the subject specified in step S103 is included among the types of sound sources whose evaluation points are 80 or more, the associating unit 40a determines that the subject and the sound source can be associated as the same object.
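A minimal sketch of this association test, under the assumption that "predetermined peripheral range" means a fixed radius around each estimated position; the radius value and data structures are hypothetical, while the 80-point threshold is as stated in the text.

import math

PERIPHERAL_RADIUS = 0.5  # m, assumed size of the "predetermined peripheral range"

def polar_to_xy(distance, direction_rad):
    return (distance * math.cos(direction_rad), distance * math.sin(direction_rad))

def can_associate(subject_pos, source_pos, subject_type, ranked_sources):
    """Subject and sound source are treated as the same object when their
    peripheral ranges overlap and the subject's type appears among sound
    source types scoring 80 evaluation points or more."""
    sx, sy = polar_to_xy(*subject_pos)
    qx, qy = polar_to_xy(*source_pos)
    # Two circles of radius r overlap when the centers are within 2r.
    ranges_overlap = math.hypot(sx - qx, sy - qy) <= 2 * PERIPHERAL_RADIUS
    type_matches = any(t == subject_type and pts >= 80
                       for t, pts in ranked_sources)
    return ranges_overlap and type_matches

# Hypothetical inputs: both estimates near 3 m at ~30 degrees, "male" at 90 pts.
print(can_associate((3.0, 0.52), (3.1, 0.50), "male",
                    [("male", 90), ("female", 70), ("dog", 50), ("car", 20)]))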
[0097]
If it is determined in step S109 that the sound source and the subject can be associated, the associating unit 40a associates the feature information of the subject image detected in step S103, the distance from the voice identification device 1 to the subject and the direction of the subject with respect to the voice identification device 1 calculated in step S104, the feature information of the sound source detected in step S107, and the distance from the voice identification device 1 to the sound source and the direction of the sound source with respect to the voice identification device 1 calculated in step S108 (step S110).
[0098]
Next, the feature information determination unit 40b of the CPU 40 determines whether the feature information of the subject image and the feature information of the sound source associated in step S110 match the image/voice feature information stored in the image/voice feature information storage unit 31 (step S111).
[0099]
If it is determined in step S111 that the feature information does not match the image/voice feature information (NO), the storage control unit 40c of the CPU 40 stores the feature information of the subject image and the feature information of the sound source associated in step S110 in the image/voice feature information storage unit 31 as new image/voice feature information (step S112).
[0100]
Next, the tracking control unit 40d of the CPU 40 divides the image displayed on the display unit 42 into a plurality of blocks based on the image data, and tracks the movement of the subject by detecting the movement of each block (step S113).
[0101]
Specifically, the tracking control unit 40d divides the screen displayed based on the image data into a plurality of blocks, and detects whether the subject has moved based on the motion vector detected for each block by the motion vector detection unit 25.
The motion vector may be detected from either the luminance signal or the color signal.
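A minimal sketch of block-based motion detection of this kind, assuming grayscale frames and simple sum-of-absolute-differences block matching; the block size and search range are illustrative assumptions, not values from the patent.

import numpy as np

def block_motion_vectors(prev, curr, block=16, search=4):
    """Return one (dy, dx) motion vector per block by SAD block matching."""
    h, w = prev.shape
    vectors = {}
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            ref = prev[by:by + block, bx:bx + block].astype(np.int32)
            best, best_v = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue
                    cand = curr[y:y + block, x:x + block].astype(np.int32)
                    sad = int(np.abs(ref - cand).sum())
                    if best is None or sad < best:
                        best, best_v = sad, (dy, dx)
            vectors[(by, bx)] = best_v
    return vectors  # blocks with nonzero vectors indicate a moving subject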
[0102]
Further, even when there is no moving object in the screen, the tracking control unit 40d always performs image recognition on the entire screen and estimates subjects based on contour, color, and the like.
Each estimated subject is compared, based on its feature information, with the subject detected so far, and if the difference between their feature information is smaller than a predetermined value, they are determined to be the same subject.
Thereby, the tracking control unit 40d can track the subject within the screen.
[0103]
Then, in accordance with an instruction from the directivity adjustment control unit 40e of the CPU 40, the first directivity characteristic adjustment unit 13a or the second directivity characteristic adjustment unit 13b adjusts the directivity characteristics by superimposing the voice data generated by the first microphone 11a, the second microphone 11b, and the third microphone 11c so as to eliminate the time differences between the voices that have reached these microphones (step S114).
The directivity characteristic adjustment process will be described later.
[0104]
Next, the CPU 40 determines whether an operation signal requesting the end of shooting has been supplied from the operation unit 41 (step S115), and ends the process when it is determined that such an operation signal has been supplied (YES).
[0105]
FIG. 12 is a flow chart showing a processing flow of directivity characteristic adjustment
processing in the speech recognition apparatus 1 according to an embodiment of the present
invention.
[0106]
As shown in FIG. 12, the directivity adjustment control unit 40e of the CPU 40 determines whether at least one of the first directivity characteristic adjustment unit 13a and the second directivity characteristic adjustment unit 13b is usable (step S201).
Specifically, the CPU 40 determines whether there is a first directivity characteristic adjustment unit 13a or second directivity characteristic adjustment unit 13b that is not currently performing directivity adjustment.
[0107]
When it is determined in step S201 that neither can be used, that is, that both the first directivity characteristic adjustment unit 13a and the second directivity characteristic adjustment unit 13b are performing directivity adjustment (NO), the directivity adjustment control unit 40e extracts the directivity characteristic priorities stored in the directivity characteristic priority storage unit 34 (step S202).
Specifically, the directivity adjustment control unit 40e extracts, from the image/voice feature information storage unit 31, the type of the subject whose movement is being tracked in step S113 and the types of the subjects whose directivity characteristics are being adjusted by the first directivity characteristic adjustment unit 13a and the second directivity characteristic adjustment unit 13b.
Then, the directivity adjustment control unit 40e extracts, from the directivity characteristic priority storage unit 34, the directivity characteristic priorities corresponding to the extracted types of subjects.
[0108]
Next, the directivity adjustment control unit 40e determines whether the directivity characteristic priority of the subject whose movement is being tracked in step S113 is higher than the directivity characteristic priority of a subject whose directivity characteristics are being adjusted by the first directivity characteristic adjustment unit 13a or the second directivity characteristic adjustment unit 13b (step S203).
[0109]
If it is determined in step S203 that the directivity characteristic priority of the subject whose movement is being tracked in step S113 is higher than that of a subject whose directivity characteristics are being adjusted by the first directivity characteristic adjustment unit 13a or the second directivity characteristic adjustment unit 13b (YES), the first directivity characteristic adjustment unit 13a or the second directivity characteristic adjustment unit 13b performs directivity adjustment based on an instruction from the directivity adjustment control unit 40e (step S204).
Specifically, based on the tracking result of the tracking control unit 40d, the first directivity characteristic adjustment unit 13a or the second directivity characteristic adjustment unit 13b adjusts the directivity characteristics by superimposing the voice data generated by the first microphone 11a, the second microphone 11b, and the third microphone 11c so as to eliminate the time differences between the voices reaching these microphones.
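A minimal sketch of the allocation logic of FIG. 12, assuming each adjustment unit is represented by a record of whether it is busy and which subject type it is serving; the priority table and all names are hypothetical.

DIRECTIVITY_PRIORITY = {"male": 3, "female": 3, "dog": 2, "car": 1}  # assumed

def assign_adjustment_unit(units, new_subject_type):
    """Pick a free adjustment unit (step S201); if none is free, preempt the
    unit serving the lowest-priority subject when the new subject outranks
    it (steps S202 to S204). Returns the chosen unit or None."""
    for unit in units:
        if not unit["busy"]:                      # step S201: a unit is usable
            return unit
    # Steps S202-S203: compare priorities of the currently served subjects.
    lowest = min(units, key=lambda u: DIRECTIVITY_PRIORITY[u["subject_type"]])
    if (DIRECTIVITY_PRIORITY[new_subject_type]
            > DIRECTIVITY_PRIORITY[lowest["subject_type"]]):
        return lowest                             # step S204: re-point this unit
    return None

units = [{"name": "13a", "busy": True, "subject_type": "car"},
         {"name": "13b", "busy": True, "subject_type": "dog"}]
print(assign_adjustment_unit(units, "male"))  # preempts the unit tracking "car"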
[0110]
As described above, according to the voice identification device 1 of the embodiment of the present invention, when the subject and the sound source are associated as the same object based on the feature information of the subject image, the feature information of the voice, the distance and direction of the subject, and the distance and direction of the sound source, and the feature information of the associated object matches the image/voice feature information, the tracking control unit 40d tracks the subject image on the image data, and the first directivity characteristic adjustment unit 13a or the second directivity characteristic adjustment unit 13b adjusts the directivity characteristics of the microphone array 11 based on the tracking result of the tracking control unit 40d and the distance and direction of the subject or the distance and direction of the sound source.
Therefore, even when the sound source is out of the angle of view of the camera 21 or generates voice intermittently, the directivity characteristics of the microphone array 11 can be adjusted to attenuate noise and obtain a good voice without the voice position detection unit 12 and the object position detection unit 24b recalculating the position.
[0111]
The voice identification device 1 according to an embodiment of the present invention includes two directivity characteristic adjustment units (the first directivity characteristic adjustment unit 13a and the second directivity characteristic adjustment unit 13b) and two voice detection units (the first voice detection unit 14a and the second voice detection unit 14b).
However, the present invention is not limited to this configuration, and may be configured to include more directivity characteristic adjustment units and more voice detection units.
[0112]
FIG. 1 is a block diagram showing the configuration of the voice identification device according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the image/voice feature information stored in the image/voice feature information storage unit included in the voice identification device according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example of the image reference feature information stored in the image reference feature information storage unit included in the voice identification device according to an embodiment of the present invention.
FIG. 4 is a diagram showing an example of the voice reference feature information stored in the voice reference feature information storage unit included in the voice identification device according to an embodiment of the present invention.
FIG. 5 is a flowchart showing the processing flow of the voice identification device according to an embodiment of the present invention.
FIG. 6 is a diagram explaining the detection processing by the object detection unit included in the voice identification device according to an embodiment of the present invention.
FIG. 7 is a diagram explaining the calculation process of the direction of the subject by the object position detection unit included in the voice identification device according to an embodiment of the present invention.
FIG. 8 is a diagram explaining the calculation process of the distance of the subject by the object position detection unit included in the voice identification device according to an embodiment of the present invention.
FIG. 9 is a diagram explaining the detection processing by the first voice detection unit or the second voice detection unit included in the voice identification device according to an embodiment of the present invention; (a) shows an example of the waveform of the corrected voice data, (b) shows the power spectrum generated based on (a), and (c) shows an example of the power spectrum of the voice reference feature information stored in the voice reference feature information storage unit 33.
FIG. 10 is a diagram explaining the calculation process of the direction and distance of the sound source by the voice position detection unit included in the voice identification device according to an embodiment of the present invention.
FIG. 11 is a diagram showing an example of the phase comparison of the voice waveforms input to the first microphone, the second microphone, and the third microphone included in the voice identification device according to an embodiment of the present invention.
FIG. 12 is a flowchart showing the processing flow of the directivity characteristic adjustment processing in the voice identification device according to an embodiment of the present invention.
Explanation of sign
[0113]
DESCRIPTION OF SYMBOLS 1... Voice identification device, 11... Microphone array, 12... Voice position detection unit, 13a... First directivity characteristic adjustment unit, 13b... Second directivity characteristic adjustment unit, 14a... First voice detection unit, 14b... Second voice detection unit, 21... Camera, 22... Camera processing unit, 23... Motion sensor, 24... Detection unit, 24a... Object detection unit, 24b... Object position detection unit, 25... Motion vector detection unit, 31... Image/voice feature information storage unit, 32... Image reference feature information storage unit, 33... Voice reference feature information storage unit, 34... Directivity characteristic priority storage unit, 40... CPU, 40a... Association unit, 40b... Feature information determination unit, 40c... Storage control unit, 40d... Tracking control unit, 40e... Directivity adjustment control unit, 41... Operation unit