Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2017059951
Abstract: To provide a microphone system capable of shortening processing time while maintaining the accuracy of emphasizing a target sound. A microphone system (100A) generates audio signals F 1 to F 3 having directivity in mutually different directions, and includes a detection unit (30) that identifies the audio signal having a relatively high audio intensity from among the audio signals F 1 to F 3 and detects the direction corresponding to that audio signal as the sound source direction, a generation unit that synthesizes the microphone signal pair f A, f B used to generate the identified audio signal and generates an audio signal f having directivity in the sound source direction, and a setting unit (60) that sets parameters for adjusting the sound quality at the time of generating the audio signals F 1 to F 3 and at the time of generating the audio signal f. The setting unit (60) sets the parameters so that the sound quality of the audio signals F 1 to F 3 is lower than the sound quality of the audio signal f. [Selected figure] Figure 1
Microphone system, speech recognition device, speech processing method, and speech
processing program
[0001]
The present disclosure relates to control of a microphone system, and more particularly to
control of a microphone system provided with a plurality of microphones.
[0002]
In recent years, speech processing technology has been developed that can emphasize only the sound emitted by a speaker or the like (hereinafter also referred to as the "target sound") using a microphone array provided with a plurality of microphones. The speech
processing technology is applied to, for example, speech recognition technology for recognizing
the content of a target sound. By emphasizing the target sound, the accuracy of speech
recognition is improved.
[0003]
As a technique for emphasizing the target sound, Japanese Patent Laid-Open No. 2006-332736
(Patent Document 1) discloses a delay-and-sum process. In the delay-and-sum process, the audio
signals from the microphones are delayed according to the difference in the distance between the
sound collection target emitting the target sound and the microphones, and the delayed audio
signals are added. Thereby, the directivity to the target sound direction is enhanced.
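As an illustration only (not part of the cited document), the delay-and-sum idea can be sketched as follows; the array geometry, sampling rate, steering angle, and the integer-sample alignment are all illustrative assumptions.

```python
import numpy as np

def delay_and_sum(mic_signals, mic_positions, angle_rad, fs, c=343.0):
    """Sketch of delay-and-sum: delay each channel according to the
    path-length difference toward `angle_rad`, then add the channels.

    mic_signals:   (n_mics, n_samples) time-domain microphone signals
    mic_positions: (n_mics,) microphone positions along one axis [m]
    """
    out = np.zeros(mic_signals.shape[1])
    for sig, pos in zip(mic_signals, mic_positions):
        # Far-field delay of this microphone relative to the array origin.
        delay = int(round(pos * np.sin(angle_rad) / c * fs))
        # np.roll is a crude stand-in for a fractional-delay filter.
        out += np.roll(sig, -delay)
    return out / len(mic_signals)
```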
[0004]
As another technique for emphasizing the target sound, Japanese Patent Laid-Open No. 2008-236077 (Patent Document 2) discloses a two-input subtraction process. In the two-input
subtraction process, the target sound is emphasized by removing the noise signal from the target
sound signal.
[0005]
Similarly, Japanese Patent Application Laid-Open No. 2006-197552 (Patent Document 3)
discloses a voice processing technique by two-input subtraction processing.
[0006]
JP 2006-332736 A, JP 2008-236077 A, JP 2006-197552 A
[0007]
In recent years, speech recognition in real time has been desired.
Therefore, it is important to emphasize the target sound quickly and accurately.
The two-input subtraction process converts an audio signal in the time domain into a signal in
the frequency domain. The transformation is realized by using, for example, Discrete Fourier
Transform (DFT) or Fast Fourier Transform (FFT). The two-input subtraction processing can
reduce noise with high accuracy, but the processing time is problematic because DFT and FFT,
which have high computational costs, are used. Therefore, there is a demand for a voice
processing technology capable of extracting a target sound while achieving both accuracy and
processing time.
[0008]
The present disclosure has been made to solve the problems described above, and an object in one aspect is to provide a microphone system capable of shortening the processing time while maintaining the accuracy of emphasizing the target sound. An object in another aspect is to provide a speech recognition apparatus capable of shortening the processing time while maintaining the accuracy of emphasizing the target sound. Another object is to provide an audio processing method capable of shortening the processing time while maintaining the accuracy of emphasizing the target sound. Yet another object is to provide an audio processing program capable of shortening the processing time while maintaining the accuracy of emphasizing the target sound.
[0009]
According to an aspect, a microphone system includes: a plurality of microphones; a first generation unit that synthesizes the microphone signals output from the plurality of microphones for each predetermined combination of the plurality of microphones and generates a first audio signal group having directivity in mutually different directions; a detection unit that identifies, from the first audio signal group, a first audio signal having a relatively high audio intensity and detects the direction corresponding to the first audio signal as a sound source direction; a second generation unit that synthesizes the microphone signal pair used to generate the identified first audio signal and generates a second audio signal having directivity in the sound source direction; and a setting unit that sets, at the time of generating the first audio signal group and at the time of generating the second audio signal, parameters for adjusting the sound quality of each of the first audio signal group and the second audio signal. The setting unit sets the parameters such that the sound quality of the first audio signal group is lower than the sound quality of the second audio signal.
[0010]
Preferably, the setting of the parameter includes setting of a frequency band. The setting unit
sets the parameter such that a frequency band of the first audio signal group is narrower than a
frequency band of the second audio signal.
[0011]
Preferably, setting of the parameters includes setting of frequency resolution. The setting unit
sets the parameter such that the frequency resolution of the first audio signal group is lower than
the frequency resolution of the second audio signal.
[0012]
Preferably, setting of the parameters includes setting of an angular range having directivity. The
setting unit sets the parameter such that the angular range with respect to the first audio signal
group is narrower than the angular range with respect to the second audio signal.
[0013]
Preferably, the microphone system further includes a delay unit that delays each of the microphone signals output from the plurality of microphones. For each of the microphone signal pairs output for each of the predetermined combinations, the first generation unit synthesizes the microphone signal pair in a state not delayed by the delay unit to generate a non-delayed synthesized signal as part of the first audio signal group, and also synthesizes the microphone signal pair in a state in which one signal of the pair is delayed by the delay unit to generate a delayed synthesized signal as part of the first audio signal group.
[0014]
Preferably, the detection unit identifies a non-delayed synthesized signal having a relatively high audio intensity from among the non-delayed synthesized signals of the first audio signal group, then identifies an audio signal having a relatively high audio intensity from among the identified non-delayed synthesized signal and the delayed synthesized signals generated from the set of microphones used to generate that non-delayed synthesized signal, and detects the direction corresponding to that audio signal as the sound source direction.
[0015]
Preferably, when there is no first audio signal exceeding a predetermined audio intensity in the first audio signal group, the second generation unit, instead of generating the second audio signal, synthesizes a microphone signal pair output from a set of microphones provided to collect sound from the front direction of the microphone system, and generates a third audio signal having directivity in the front direction.
[0016]
Preferably, the second generation unit generates the third audio signal after expanding the sound
collection range compared to when the second audio signal is generated.
[0017]
Preferably, each of the plurality of microphones is arranged concentrically.
Preferably, the detection unit does not perform re-detection of the sound source direction until a
predetermined time has elapsed since the generation of the second audio signal.
[0018]
According to another aspect, a speech recognition apparatus is provided.
The speech recognition apparatus includes the microphone system, and performs speech
recognition on the second speech signal output from the microphone system.
[0019]
According to still another aspect, an audio processing method includes: synthesizing the microphone signals output from a plurality of microphones for each predetermined combination of the plurality of microphones to generate a first audio signal group having directivity in mutually different directions; identifying, from the first audio signal group, a first audio signal having a relatively high audio intensity and detecting the direction corresponding to the first audio signal as a sound source direction; synthesizing the microphone signal pair used to generate the identified first audio signal to generate a second audio signal having directivity in the sound source direction; and setting, at the time of generating the first audio signal group and at the time of generating the second audio signal, parameters for adjusting the sound quality of each of the first audio signal group and the second audio signal. The setting includes setting the parameters such that the sound quality of the first audio signal group is lower than the sound quality of the second audio signal.
[0020]
According to still another aspect, an audio processing program causes a computer to execute: a step of synthesizing the microphone signals output from a plurality of microphones for each predetermined combination of the plurality of microphones to generate a first audio signal group having directivity in mutually different directions; a step of identifying, from the first audio signal group, a first audio signal having a relatively high audio intensity and detecting the direction corresponding to the first audio signal as a sound source direction; a step of synthesizing the microphone signal pair used to generate the identified first audio signal to generate a second audio signal having directivity in the sound source direction; and a step of setting, at the time of generating the first audio signal group and at the time of generating the second audio signal, parameters for adjusting the sound quality of each of the first audio signal group and the second audio signal. The setting includes setting the parameters such that the sound quality of the first audio signal group is lower than the sound quality of the second audio signal.
[0021]
In one aspect, processing time can be shortened while maintaining the accuracy of emphasizing
the target sound.
[0022]
The above and other objects, features, aspects and advantages of the present invention will
become apparent from the following detailed description of the present invention taken in
conjunction with the accompanying drawings.
[0023]
FIG. 1 is a diagram showing an example of the configuration of a microphone system according to a first embodiment. FIG. 2 is a conceptual diagram showing the two-input subtraction process. FIG. 3 is a conceptual diagram showing differential processing of audio signals in the frequency domain. FIG. 4 is a diagram showing an example in which directivity characteristics are formed in mutually different directions by the microphone system according to the first embodiment. FIG. 5 is a diagram showing the directivity characteristics of FIG. 4 more concretely. FIG. 6 is a flowchart showing a part of the process performed by the microphone system according to the first embodiment. FIG. 7 is a block diagram showing the main hardware configuration of the microphone system according to the first embodiment. FIG. 8 is a diagram showing an example of the configuration of a microphone system according to a second embodiment. FIG. 9 is a diagram showing an example in which directivity characteristics are formed in mutually different directions by the microphone system according to the second embodiment. FIG. 10 is a flowchart showing a part of the process performed by the microphone system according to the second embodiment. FIG. 11 is a diagram showing the directivity characteristics formed by a microphone system according to a third embodiment. FIG. 12 is a flowchart showing a part of the process performed by the microphone system according to the third embodiment. FIG. 13 is a diagram showing an arrangement example of the microphones in a fourth embodiment. FIG. 14 is a flowchart showing a part of the process performed by a microphone system according to a fifth embodiment. FIG. 15 is a flowchart showing a part of the process performed by a microphone system according to a sixth embodiment.
[0024]
Hereinafter, the present embodiment will be described with reference to the drawings. In the
following description, the same parts and components are denoted by the same reference
numerals. Their names and functions are also the same. Therefore, detailed description of these
will not be repeated. The embodiments described below may be selectively combined as
appropriate.
[0025]
First Embodiment [Voice Processing by Microphone System 100A] Voice processing by the microphone system 100A according to the first embodiment will be described with reference to FIGS. 1 to 5. FIG. 1 is a diagram showing an example of the configuration of the microphone system 100A. FIG. 2 is a conceptual diagram showing the two-input subtraction process. FIG. 3 is a conceptual diagram showing differential processing of audio signals in the frequency domain. FIG. 4 is a diagram showing an example in which directivity characteristics are formed in mutually different directions by the microphone system 100A. FIG. 5 is a diagram showing the directivity characteristics of FIG. 4 more concretely.
[0026]
The microphone system 100A is mounted on, for example, a smartphone, a tablet, a ticket
vending machine, an information display, a personal computer, a digital camera, an electronic
dictionary, a robot, and other voice recognition devices.
[0027]
As shown in FIG. 1, the microphone system 100A includes microphones 11 to 14 and a CPU
(Central Processing Unit) 102 as a hardware configuration.
[0028]
The microphones 11 to 14 are configured as a microphone array. The microphones 11 to 14 convert voices received from the surroundings into audio signals (hereinafter also referred to as "microphone signals"). The microphone signals are output to the CPU 102. In the example of FIG. 1, the microphone 11 outputs a microphone signal f 1, the microphone 12 outputs a microphone signal f 2, the microphone 13 outputs a microphone signal f 3, and the microphone 14 outputs a microphone signal f 4. The microphone signals f 1 to f 4 are shown as time-series signals.
[0029]
The CPU 102 includes, as functional components, a generation unit 20, a detection unit 30, a
selection unit 40, a generation unit 50, and a setting unit 60.
[0030]
The generation unit 20 synthesizes the microphone signals output from the microphones 11 to 14 for each predetermined combination of the microphones 11 to 14, and generates audio signals F 1 to F 3 having directivity in directions different from one another. The combination of the microphones 11 to 14 is arbitrary. In the example of FIG. 1, the microphones 11 and 12 are combined to form a pair, the microphones 12 and 13 are combined to form a pair, and the microphones 13 and 14 are combined to form a pair.
[0031]
The generation unit 20 includes combining units 21A to 21C. The combining units 21A to 21C are provided in a number equal to the number of combinations of the microphones 11 to 14.
[0032]
The combining unit 21A combines the microphone signal f 1 from the microphone 11 and the
microphone signal f 2 from the microphone 12 to generate an audio signal F 1. The synthesis
unit 21A synthesizes the microphone signals f 1 and f 2 by, for example, 2-input subtraction
processing. As a result, an audio signal F 1 having directivity in a specific direction is generated.
[0033]
FIG. 2 shows an example of a functional configuration for realizing the 2-input subtraction
process. As shown in FIG. 2, the combining unit 21A includes an adding unit 24, an FFT unit 25,
a subtracting unit 26, an FFT unit 27, and a subtracting unit 28.
[0034]
The adder 24 adds the microphone signal f 1 from the microphone 11 and the microphone signal
f 2 from the microphone 12. Voices emitted from positions equidistant from the microphones
reach the microphones at the same timing. Therefore, the addition unit 24 can generate the
audio signal f a having directivity in the direction of the perpendicular bisector of the
microphones 11 and 12 by adding the microphone signals f 1 and f 2. The audio signal f a is
output to the FFT unit 25.
[0035]
The FFT unit 25 performs FFT processing on the audio signal f a. Thereby, the voice signal f a in
the time domain is converted into the voice signal F a in the frequency domain. The audio signal
F a is output to the subtraction unit 28.
[0036]
The subtraction unit 26 subtracts the other from either one of the microphone signal f 1 from
the microphone 11 and the microphone signal f 2 from the microphone 12. As a result, the
component in the direction of the perpendicular bisector of the microphones 11 and 12 is
attenuated, and an audio signal f s having directivity in the arrangement direction of the
microphones 11 and 12 is generated. The audio signal f s is output to the FFT unit 27.
[0037]
The FFT unit 27 performs FFT processing on the audio signal f s. Thereby, the voice signal f s in
the time domain is converted to the voice signal F s in the frequency domain. The audio signal F s
is output to the subtraction unit 28.
[0038]
The subtraction unit 28 subtracts the audio signal F s from the audio signal F a. Since the audio signal F s has directivity in the arrangement direction of the microphones 11 and 12, subtracting it from the audio signal F a, which has directivity in the direction of the perpendicular bisector of the microphones 11 and 12, further emphasizes speech in the perpendicular bisector direction.
[0039]
More specifically, as shown in FIG. 3, the audio signals F a and F s are represented as amplitude spectra for each frequency band. The subtraction unit 28 performs the subtraction for each pair of amplitude spectra belonging to the same frequency band of the audio signals F a and F s. Thus, the audio signal F 1 is generated. The audio signal F 1 is output to the detection unit 30.
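A minimal sketch of the processing of FIG. 2 and FIG. 3 follows, assuming one frame of samples per microphone and amplitude-spectrum subtraction clamped at zero (the clamp is an assumption, not stated in the text):

```python
import numpy as np

def two_input_subtraction(f1, f2, n_fft=1024):
    """One frame of the 2-input subtraction process of FIG. 2."""
    f_a = f1 + f2   # adder 24: directivity toward the perpendicular bisector
    f_s = f1 - f2   # subtractor 26: directivity along the array direction
    F_a = np.abs(np.fft.rfft(f_a, n_fft))   # FFT unit 25
    F_s = np.abs(np.fft.rfft(f_s, n_fft))   # FFT unit 27
    # Subtractor 28: per-band amplitude subtraction (FIG. 3), clamped
    # at zero so the resulting spectrum stays non-negative.
    return np.maximum(F_a - F_s, 0.0)
```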
[0040]
Similar to the combining unit 21A, the combining unit 21B combines the microphone signal f 2
from the microphone 12 and the microphone signal f 3 from the microphone 13 by 2-input
subtraction processing. As a result, an audio signal F 2 having directivity in the direction of the
perpendicular bisector of the microphones 12 and 13 is generated. The audio signal F 2 is output
to the detection unit 30.
[0041]
Similar to the combining unit 21A, the combining unit 21C combines the microphone signal f 3
from the microphone 13 and the microphone signal f 4 from the microphone 14 by 2-input
subtraction processing. Thereby, an audio signal F 3 having directivity in the direction of the
perpendicular bisector of the microphones 13 and 14 is generated. The audio signal F 3 is output
to the detection unit 30.
[0042]
The synthesizing units 21A to 21C generate audio signals F 1 to F 3 having directivity in
directions different from one another. The example (A) of FIG. 4 shows a directivity characteristic 301 corresponding to the audio signal F 1, a directivity characteristic 302 corresponding to the audio signal F 2, and a directivity characteristic 303 corresponding to the audio signal F 3. A directivity characteristic refers to the sensitivity of the microphone to sound from each direction when the distance between the microphone and the sound source and the magnitude of the sound emitted from the sound source are held constant. The directivity characteristic
301 indicates that the microphones 11 and 12 have directivity in the direction of the
perpendicular bisector. The directivity characteristic 302 indicates that the microphones 12 and
13 have directivity in the direction of the perpendicular bisector. The directivity characteristic
303 indicates that the microphones 13 and 14 have directivity in the direction of the
perpendicular bisector.
[0043]
The example (B) of FIG. 4 shows directivity characteristics 311 to 313 formed by the delay-and-sum process. As shown in the examples (A) and (B) of FIG. 4, the two-input subtraction process can form narrower directivity characteristics than the delay-and-sum process. Therefore, the microphone system 100A preferably generates the audio signals F 1 to F 3 by the two-input subtraction process.
[0044]
FIG. 5 shows the directivity characteristics 301 to 303 of FIG. 4 in more detail. The perpendicular bisector direction of the microphones 12 and 13 is taken as the reference (that is, 0 degrees); angles in the counterclockwise direction from the reference are negative, and angles in the clockwise direction are positive. As shown in the graph (A), the directivity characteristic 301 indicates that the sensitivity to sound received from the -30 degree direction is higher than at other angles. As shown in the graph (B), the directivity characteristic 302 indicates that the sensitivity to sound received from the 0 degree direction is higher than at other angles. As shown in the graph (C), the directivity characteristic 303 indicates that the sensitivity to sound received from the +30 degree direction is higher than at other angles.
[0045]
The detection unit 30 identifies, from among the audio signals F 1 to F 3 having directivity in mutually different directions, the audio signal having a relatively high audio intensity (hereinafter also referred to as the "target audio signal"), and detects the direction corresponding to the target audio signal as the sound source direction. The audio intensity is an indicator representing the strength of the voice; as an example, it is the amplitude or the amplitude spectrum. In one aspect, the detection unit 30 detects the audio signal having the largest amplitude spectrum among the audio signals F 1 to F 3 as the target audio signal. In another aspect, the detection unit 30 calculates the average value of the amplitude spectrum for each of the audio signals F 1 to F 3, and detects the audio signal with the largest calculated average value as the target audio signal.
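The second variant of the detection unit 30 can be sketched as follows; the association of each signal with a direction (for example -30, 0, +30 degrees as in FIG. 5) is passed in explicitly here as an assumption:

```python
import numpy as np

def detect_source_direction(spectra, directions_deg):
    """spectra: amplitude spectra of F1..F3; directions_deg: directivity
    direction of each signal. Returns (target index, source direction)."""
    means = [np.mean(s) for s in spectra]  # average amplitude spectrum
    idx = int(np.argmax(means))            # relatively highest intensity
    return idx, directions_deg[idx]
```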
[0046]
The detection unit 30 outputs the detected sound source direction to the selection unit 40 and
the setting unit 60, and outputs the sound source direction as one of the audio processing
results.
[0047]
The selection unit 40 selects, from the microphone signals f 1 to f 4, the microphone signal pair used to generate the target audio signal. Hereinafter, the microphone signal pair selected by the selection unit 40 is also referred to as the microphone signal pair f A and f B. For example, when the audio signal F 2 is detected as the target audio signal, the microphone signals f 2 and f 3, which are the sources of the audio signal F 2, are selected as the microphone signal pair f A and f B.
[0048]
The generation unit 50 synthesizes the microphone signal pair f A and f B used to generate the target audio signal, and generates an audio signal f having directivity in the sound source direction as an output result. The method by which the generation unit 50 combines the microphone signal pair f A and f B is the same as that of the generation unit 20. That is, after converting the time-domain microphone signal pair f A and f B into the frequency domain, the generation unit 50 combines them by the addition and subtraction processing shown in FIG. 2. The generation unit 50 then generates the audio signal f by performing an inverse fast Fourier transform (inverse FFT) on the synthesized signal, and outputs the audio signal f as one of the audio processing results.
[0049]
The setting unit 60 sets parameters for adjusting the sound quality of the audio signals F 1 to F 3 and the audio signal f (hereinafter also referred to as "sound quality parameters") at the time of generating the audio signals F 1 to F 3 for detecting the sound source direction and at the time of generating the audio signal f for output.
[0050]
The detection accuracy of the sound source direction does not depend strongly on the sound quality of the audio signals F 1 to F 3. On the other hand, since the audio signal f output as a result is also used for speech recognition and the like, its sound quality is important. Focusing on this point, the setting unit 60 sets the sound quality parameters so that the sound quality of the audio signals F 1 to F 3 is lower than the sound quality of the audio signal f.
[0051]
Thereby, the process of generating the audio signals F 1 to F 3 is simplified compared with the process of generating the audio signal f. Since the generation unit 20 performs the synthesis process for every combination of the microphones 11 to 14, simplifying its synthesis process significantly reduces the processing time.
[0052]
As an example, the setting of the sound quality parameters by the setting unit 60 includes setting the frequency band of the audio signal. The setting unit 60 sets the sound quality parameters so that the frequency band of the audio signals F 1 to F 3 (see FIG. 1) for detecting the sound source direction is narrower than the frequency band of the audio signal f (see FIG. 1) that is finally output. The setting of the frequency band changes the accuracy of the FFT processing in the two-input subtraction process: setting the frequency band narrow lowers the accuracy of the FFT processing, and setting it wide raises the accuracy.
[0053]
When the audio signal f is used for speech recognition, or is converted to clear speech so that a person can hear it, a frequency band of 8 kHz or more is necessary. In consideration of the sampling theorem, the sampling frequency at this time needs to be set to 16 kHz (= 8 kHz × 2) or more. Therefore, in the two-input subtraction process at the time of generating the audio signal f for output, the setting unit 60 sets the sampling frequency to 16 kHz or more as a sound quality parameter.
[0054]
On the other hand, in order to detect the direction of the person emitting the voice, a frequency band of about 500 Hz suffices for the audio signal. In consideration of the sampling theorem, the sampling frequency at this time needs to be set to 1 kHz (= 500 Hz × 2) or more. Therefore, in the two-input subtraction process at the time of generating the audio signals F 1 to F 3 for detecting the sound source direction, the setting unit 60 sets the sampling frequency to 1 kHz or more as a sound quality parameter.
[0055]
When the sampling frequency is set to about 16 kHz for generating the audio signal f for output and to about 1 kHz for generating the audio signals F 1 to F 3 for detecting the sound source direction, the sampling rate at the time of generating the audio signals F 1 to F 3 is 1/16 of that at the time of generating the audio signal f. As a result, the time taken to generate the audio signals F 1 to F 3 for detecting the sound source direction is significantly reduced.
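One plausible realization is sketched below, assuming a single 16 kHz capture from which the 1 kHz direction-detection signals are derived by decimation (the patent specifies only the sampling frequencies, not this mechanism):

```python
from scipy.signal import decimate

def to_detection_rate(x_16k):
    """Reduce a 16 kHz capture to 1 kHz for sound source direction
    detection: anti-alias low-pass filtering plus 16x downsampling,
    performed in two 4x stages as recommended for large factors."""
    return decimate(decimate(x_16k, 4), 4)
```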
[0056]
In addition, the setting of the sound quality parameters includes setting the frequency resolution. The setting unit 60 sets the sound quality parameters such that the frequency resolution of the audio signals F 1 to F 3 for detecting the sound source direction is lower than the frequency resolution of the audio signal f that is finally output. The setting of the frequency resolution changes the accuracy of the FFT processing in the two-input subtraction process: setting the frequency resolution low lowers the accuracy of the FFT processing, and setting it high raises the accuracy.
[0057]
According to the applicants' actual measurements, a frequency resolution of about 15 Hz is required for voice output. If the frequency resolution is coarser than 15 Hz, misrecognition in speech recognition increases. On the other hand, in the detection process of the sound source direction, the frequency resolution need not be set high, because it is only necessary to detect the audio intensity accurately.
[0058]
As an example, in order to obtain a frequency resolution of 15 Hz at a sampling frequency of 16 kHz, about 1024 data points (≈ 16 k ÷ 15) are required for audio output. On the other hand, in order to obtain a frequency resolution of 32 Hz at a sampling frequency of 1 kHz for detecting the sound source direction, 32 data points (≈ 1 k ÷ 32) are required. Thus, at the time of generating the audio signals F 1 to F 3, the number of data points is reduced to 1/32 (= 32/1024) compared with the generation of the audio signal f. As a result, the time taken to generate the audio signals F 1 to F 3 is significantly reduced.
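The data-count arithmetic of this paragraph, written out:

```python
# FFT length = sampling frequency / frequency resolution
n_output    = 16_000 / 15   # ≈ 1067, about 1024 in the text (output path)
n_detection = 1_000 / 32    # ≈ 31, taken as 32 (direction detection path)
print(32 / 1024)            # = 1/32: data reduction when generating F1..F3
```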
[0059]
In addition, the setting of the sound quality parameters includes setting the angular range of the directivity (hereinafter also referred to as the "sound collection range"). The setting unit 60 sets the parameters so that the sound collection range of the audio signals F 1 to F 3 for detecting the sound source direction is narrower than the sound collection range of the audio signal f for output. By narrowing the sound collection range when detecting the sound source direction, the setting unit 60 can detect the sound source direction accurately. On the other hand, by widening the sound collection range when generating the audio signal f for output, the setting unit 60 can pick up the sound from the sound source without omission.
[0060]
[Control Structure of Microphone System 100A] The control structure of the microphone system
100A will be described with reference to FIG. FIG. 6 is a flowchart showing a part of the process
performed by the microphone system 100A. The processing of FIG. 6 is realized by the CPU 102
(see FIG. 1) executing a program. In other aspects, some or all of the processing may be
performed by circuit elements or other hardware.
[0061]
In step S10, the CPU 102 collects voice data at preset sampling intervals from voice signals
sequentially output by the microphone, and determines whether the number of voice data
exceeds a predetermined number. If the CPU 102 determines that the number of voice data
exceeds a predetermined number (YES in step S10), the control is switched to step S12. If not
(NO in step S10), CPU 102 executes the process of step S10 again.
[0062]
In step S12, the CPU 102 sets a sound quality parameter for sound source direction detection as
the setting unit 60 (see FIG. 1). The sound quality parameter for sound source direction detection
is stored in advance as a low sound quality parameter 121 in the storage device 120 (see FIG. 7)
of the microphone system 100A. As an example, the low sound quality parameter 121 defines
the frequency band and frequency resolution of the generated audio signal. The low sound
quality parameter 121 may be preset at the time of design or may be preset by the user.
[0063]
In step S14, the CPU 102, as the generation unit 20 (see FIG. 1), synthesizes the audio signals output from the respective microphones for each predetermined combination of the plurality of microphones, and generates the audio signals F 1 to F 3 (see FIG. 1) having directivity in different directions. Since the audio signals F 1 to F 3 are generated by the two-input subtraction process in which the low sound quality parameter 121 is reflected, their sound quality is low.
[0064]
In step S16, the CPU 102 specifies, as the detection unit 30 (see FIG. 1), an audio signal having a
relatively high audio intensity among the audio signals F 1 to F 3 as the target audio signal, and
corresponds to the target audio signal. The direction is detected as the sound source direction.
[0065]
In step S20, the CPU 102 determines whether the audio intensity of the target audio signal
exceeds a predetermined audio intensity.
As an example, the voice strength is represented by the amplitude or amplitude spectrum of the
target voice signal. If the CPU 102 determines that the audio intensity of the target audio signal
exceeds the predetermined audio intensity (YES in step S20), the control is switched to step S22.
If not (NO in step S20), CPU 102 returns the control to step S10. In this case, the CPU 102
determines that the speaker does not emit a voice.
[0066]
In step S22, the CPU 102, as the selection unit 40 (see FIG. 1), selects, from among the microphone signals f 1 to f 4 (see FIG. 1) output from the microphones, the microphone signal pair that is the source of the target audio signal identified in step S16.
[0067]
In step S24, the CPU 102 sets, as the setting unit 60, a sound quality parameter for generating an
audio signal for output.
The sound quality parameters are stored in advance as high sound quality parameters 122 in the
storage device 120 (see FIG. 7) of the microphone system 100A. As one example, the high sound
quality parameter 122 defines the frequency band and frequency resolution of the generated
audio signal. The high sound quality parameter 122 may be preset at the time of design or may
be preset by the user.
[0068]
In step S26, the CPU 102, as the generation unit 50 (see FIG. 1), synthesizes the microphone signal pair output from the set of microphones selected in step S22, and generates the audio signal f (see FIG. 1) having directivity in the sound source direction. Since the audio signal f is generated by the two-input subtraction process in which the high sound quality parameter 122 is reflected, its sound quality is high.
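The flow of steps S10 to S26 can be summarized as follows; every helper name is a hypothetical stand-in for the units described above, not an interface defined by the patent:

```python
def microphone_loop(system):
    """Sketch of FIG. 6 (steps S10 to S26)."""
    while True:
        if not system.enough_samples():              # S10
            continue
        system.set_params(system.low_quality)        # S12: parameter 121
        signals = system.generate_directional()      # S14: F1..F3
        target, direction = system.detect(signals)   # S16
        if target.intensity <= system.threshold:     # S20: no speaker
            continue
        pair = system.select_pair(target)            # S22
        system.set_params(system.high_quality)       # S24: parameter 122
        return system.synthesize(pair, direction)    # S26: output f
```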
[0069]
[Hardware Configuration of Microphone System 100A] An example of the hardware configuration of the microphone system 100A according to the first embodiment will be described with reference to FIG. 7. FIG. 7 is a block diagram showing the main hardware configuration of the microphone system 100A. As shown in FIG. 7, the microphone system 100A includes the microphones 11 to 14, a ROM (Read Only Memory) 101, the CPU 102, a RAM (Random Access Memory) 103, a network interface 104, and a storage device 120.
[0070]
The microphones 11 to 14 receive surrounding sound and convert the sound into a microphone
signal. The number of microphones provided in the microphone system 100A is arbitrary as long
as it is three or more.
[0071]
The ROM 101 stores an operating system, a control program executed by the microphone system
100A, and the like. The CPU 102 controls the operation of the microphone system 100A by
executing various programs such as an operating system and a control program of the
microphone system 100A. The RAM 103 functions as a working memory, and temporarily stores
various data necessary for program execution.
[0072]
The network interface 104 transmits and receives data to and from other communication devices
via an antenna (not shown). Other communication devices include, for example, servers, devices
having other communication functions, and the like. Microphone system 100A may be
configured to be able to download voice processing program 123 for realizing various types of
processing according to the present embodiment via an antenna.
[0073]
The storage device 120 holds a low sound quality parameter 121, a high sound quality
parameter 122, and an audio processing program 123 for realizing various processes according
to the present embodiment. The low sound quality parameter 121 is referred to when generating
an audio signal for sound source direction detection. The high sound quality parameter 122 is
referred to when generating an audio signal for output.
[0074]
The voice processing program 123 may be provided as part of another arbitrary program rather than as a stand-alone program. In this case, the process according to the present embodiment is realized in cooperation with that arbitrary program. Even a program that does not include some of these modules does not deviate from the spirit of the microphone system 100A according to the present embodiment. Furthermore, some or all of the functions provided by the voice processing program 123 according to the present embodiment may be realized by dedicated hardware. Furthermore, the microphone system 100A and a server may cooperate to realize the process according to the present embodiment. Furthermore, the microphone system 100A may be configured in the form of a so-called cloud service in which at least one server implements the process according to the present embodiment.
[0075]
[Summary] As described above, the microphone system 100A according to the present embodiment generates the audio signals F 1 to F 3 having directivity in mutually different directions, and detects the direction corresponding to the audio signal with the highest audio intensity among the audio signals F 1 to F 3 as the sound source direction. The microphone system 100A then generates the audio signal f for output using the microphone pair whose directivity is directed toward the identified sound source direction.
[0076]
The microphone system 100A sets the low sound quality parameter 121 when generating the audio signals F 1 to F 3 for detecting the sound source direction, and sets the high sound quality parameter 122 when generating the audio signal f for output. Thereby, the process of generating the audio signals F 1 to F 3 is simplified compared with the process of generating the audio signal f. As a result, the processing time required to generate the audio signals F 1 to F 3 is shortened, while the sound quality of the output audio signal f remains higher than that of the audio signals F 1 to F 3 for sound source direction detection. Consequently, when the audio signal f is used for speech recognition, the accuracy of speech recognition is improved; when the audio signal f is converted back to audible sound, clear speech can be produced.
[0077]
Second Embodiment [Overview] The microphone system 100A according to the first embodiment combines each microphone signal pair output for each predetermined combination of the microphones 11 to 14 (see FIG. 1) as it is. In contrast, the microphone system 100B according to the second embodiment may combine a microphone signal pair as it is, or may delay one signal of the pair and then combine the pair. Thereby, a plurality of audio signals having directivity in different directions can be generated from one microphone signal pair. As a result, the microphone system 100B can increase the number of directivity directions and detect the sound source direction more accurately.
[0078]
[Audio Processing by Microphone System 100B] The audio processing by the microphone system
100B according to the second embodiment will be described with reference to FIGS. 8 and 9. FIG.
8 is a diagram showing an example of the configuration of the microphone system 100B. FIG. 9
is a diagram showing an example in which directivity characteristics are formed in directions
different from each other by the microphone system 100B.
[0079]
The microphone system 100B includes the microphones 11 to 14 and the CPU 102 as a hardware configuration. The CPU 102 includes, as functional components, delay units 15A to 15D, a generation unit 20, a detection unit 30, a selection unit 40, a generation unit 50, and a setting unit 60. The generation unit 20 includes combining units 22A to 22I. The functional configuration other than the delay units 15A to 15D and the combining units 22A to 22I is as described with reference to FIG. 1, and therefore its description will not be repeated.
[0080]
The delay unit 15A delays the microphone signal f 1 output from the microphone 11, and
outputs the delayed microphone signal f ′ 1 to the combining unit 22A and the selection unit
40. The delay unit 15B delays the microphone signal f 2 output from the microphone 12 and
outputs the delayed microphone signal f ′ 2 to the combining units 22C and 22D and the
selection unit 40. The delay unit 15C delays the microphone signal f 3 output from the
microphone 13 and outputs the delayed microphone signal f ′ 3 to the combining units 22F and
22G and the selection unit 40. The delay unit 15D delays the microphone signal f 4 output from
the microphone 14 and outputs the delayed microphone signal f ′ 4 to the combining unit 22I
and the selection unit 40.
[0081]
For each microphone signal pair output for each predetermined combination of the microphones 11 to 14, the generation unit 20 synthesizes the pair in a state in which the delay units 15A to 15D apply no delay. Hereinafter, an audio signal obtained by combining a microphone signal pair in a state not delayed by the delay units 15A to 15D is also referred to as a "non-delayed combined signal". The method of synthesizing the non-delayed combined signal is the same as the method shown in FIG. 2.
[0082]
In addition, for each of the microphone signal pairs output for each predetermined combination of the microphones 11 to 14, the generation unit 20 combines the pair in a state in which one signal of the pair is delayed by the delay units 15A to 15D. Hereinafter, an audio signal obtained by combining a microphone signal pair in such a delayed state is also referred to as a "delayed combined signal". The method of synthesizing the delayed combined signal is the same as the method shown in FIG. 2.
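As an illustration, one microphone pair can yield three differently steered combined signals as sketched below; the one-sample delay is an assumed value (the actual delay time is given by the low sound quality parameter 121 described later):

```python
import numpy as np

def steered_signals(f1, f2, delay=1):
    """Return left-, center-, and right-steered combined signals for one
    microphone pair by delaying neither signal, the first, or the second.
    `combine` stands in for the two-input subtraction of FIG. 2."""
    def combine(a, b):
        return np.maximum(np.abs(np.fft.rfft(a + b)) -
                          np.abs(np.fft.rfft(a - b)), 0.0)
    d1 = np.roll(f1, delay)  # delayed signal f'1 (delay unit 15A)
    d2 = np.roll(f2, delay)  # delayed signal f'2 (delay unit 15B)
    return combine(d1, f2), combine(f1, f2), combine(f1, d2)
```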
[0083]
More specifically, the combining unit 22A combines the delayed microphone signal f ′ 1 and
the undelayed microphone signal f 2 to generate a delayed combined signal F 11. As a result, a
directivity characteristic 301A having directivity in the direction inclined from the vertical
bisector direction of the microphones 11 and 12 toward the microphone 11 is formed.
[0084]
The combining unit 22B combines the undelayed microphone signal f 1 and the undelayed
microphone signal f 2 to generate a non-delayed combined signal F 12. Thereby, the directivity
characteristic 301 having directivity in the direction of the perpendicular bisector of the
microphones 11 and 12 is formed.
[0085]
The combining unit 22C combines the undelayed microphone signal f 1 and the delayed
microphone signal f ′ 2 to generate a delayed combined signal F 13. As a result, a directivity
characteristic 301B having directivity in the direction inclined to the microphone 12 side from
the direction of the perpendicular bisector of the microphones 11 and 12 is formed.
[0086]
The combining unit 22D combines the delayed microphone signal f ′ 2 and the undelayed
microphone signal f 3 to generate a delayed combined signal F 14. As a result, a directivity
characteristic 302A having directivity in the direction inclined toward the microphone 12 side
from the direction of the perpendicular bisector of the microphones 12 and 13 is formed.
[0087]
The combining unit 22E combines the undelayed microphone signal f 2 and the undelayed microphone signal f 3 to generate a non-delayed combined signal F 15. Thereby, the directivity characteristic 302 having directivity in the direction of the perpendicular bisector of the microphones 12 and 13 is formed.
[0088]
The combining unit 22F combines the undelayed microphone signal f 2 and the delayed
microphone signal f ′ 3 to generate a delayed combined signal F 16. As a result, a directivity
characteristic 302B having directivity in the direction inclined toward the microphone 13 side
from the direction of the perpendicular bisector of the microphones 12 and 13 is formed.
[0089]
The combining unit 22G combines the delayed microphone signal f ′ 3 and the undelayed
microphone signal f 4 to generate a delayed combined signal F 17. As a result, a directivity
characteristic 303A having directivity in the direction inclined to the microphone 13 side from
the direction of the perpendicular bisector of the microphones 13 and 14 is formed.
[0090]
The combining unit 22H combines the undelayed microphone signal f 3 and the undelayed microphone signal f 4 to generate a non-delayed combined signal F 18. Thereby, the directivity characteristic 303 having directivity in the direction of the perpendicular bisector of the microphones 13 and 14 is formed.
[0091]
The combining unit 22I combines the undelayed microphone signal f 3 and the delayed
microphone signal f ′ 4 to generate a delayed combined signal F 19. As a result, a directivity
characteristic 303B having directivity in the direction inclined toward the microphone 14 side
from the direction of the perpendicular bisector of the microphones 13 and 14 is formed.
[0092]
The detection unit 30 first identifies a non-delayed combined signal having a relatively high audio intensity from among the non-delayed combined signals. Thereafter, the detection unit 30 identifies, as the target audio signal, the signal having a relatively high audio intensity from among the identified non-delayed combined signal and the delayed combined signals generated from the set of microphones used to generate that non-delayed combined signal. The detection unit 30 detects the direction corresponding to the target audio signal as the sound source direction.
[0093]
That is, the detection unit 30 identifies the audio signal having a relatively high audio intensity in two stages. For example, in the first stage, suppose that the non-delayed combined signal F 15 is identified from among the non-delayed combined signals F 12, F 15, and F 18 as the signal having a relatively high audio intensity. The non-delayed combined signal F 15 is generated from the microphone signals f 2 and f 3. Therefore, in the second stage, the detection unit 30 identifies the signal having a relatively high audio intensity from among the non-delayed combined signal F 15 and the delayed combined signals F 14 and F 16, which are also obtained by combining the microphone signals f 2 and f 3. Suppose, for example, that the delayed combined signal F 16 is identified. The detection unit 30 then identifies the directivity direction of the directivity characteristic 302B corresponding to the delayed combined signal F 16 as the sound source direction.
[0094]
As described above, since the audio signal having a relatively high audio intensity is identified in two stages, not all of the delayed combined signals need to be generated, and the processing time is therefore shortened.
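The two-stage identification can be sketched as follows; `intensity` is a placeholder for the amplitude-spectrum measure used by the detection unit 30:

```python
def two_stage_detect(non_delayed, delayed_by_pair, intensity):
    """non_delayed: {pair: non-delayed combined signal};
    delayed_by_pair: {pair: list of delayed combined signals}.
    Only the winning pair's delayed signals are examined, which is
    why not all delayed combined signals need to be generated."""
    best = max(non_delayed, key=lambda p: intensity(non_delayed[p]))
    candidates = [non_delayed[best]] + delayed_by_pair[best]
    return best, max(candidates, key=intensity)
```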
[0095]
Note that the detection unit 30 does not necessarily have to identify the audio signal having a relatively high audio intensity in two stages. For example, the detection unit 30 may identify the audio signal having a relatively high audio intensity from among all of the non-delayed and delayed combined signals.
[0096]
[Control Structure of Microphone System 100B] The control structure of the microphone system
100B will be described with reference to FIG. FIG. 10 is a flowchart showing a part of the process
performed by the microphone system 100B. The process of FIG. 10 is realized by the CPU 102
executing a program. In other aspects, some or all of the processing may be performed by circuit
elements or other hardware.
[0097]
Since the processing of steps S10 to S26 has been described with reference to FIG. 6, the description thereof will not be repeated.
[0098]
In step S30, the CPU 102 determines whether two audio signals whose directivity directions are adjacent to each other, among the audio signals having directivity in different directions (that is, the audio signals F 1 to F 3 (see FIG. 1)), exceed a predetermined signal strength. The predetermined signal strength in step S30 is lower than the predetermined signal strength in step S20. If the CPU 102 determines that two audio signals whose directivity directions are adjacent to each other exceed the predetermined signal strength (YES in step S30), the control is switched to step S32. If not (NO in step S30), the CPU 102 returns the control to step S10.
[0099]
In step S32, the CPU 102, as the setting unit 60 (see FIG. 8), sets the low sound quality
parameter 121 (see FIG. 7) for detecting the sound source direction. As an example, the low
sound quality parameter 121 includes the delay time of the microphone signal output from the
microphone, and the like.
[0100]
In step S34, the CPU 102 delays the microphone signals output from the respective microphones
according to the delay time indicated by the low sound quality parameter 121 as the delay units
15A to 15D (see FIG. 8). As the generation unit 20, the CPU 102 combines the microphone signal
pair in a non-delayed state to generate a non-delayed combined signal, and after delaying,
combines the microphone signal pair to generate a delayed combined signal. As a result, audio
signals having directivity in various directions are generated.
[0101]
In step S36, the CPU 102, as the detection unit 30 (see FIG. 8), identifies a non-delayed combined signal having a relatively high audio intensity from among the non-delayed combined signals. After that, the CPU 102 identifies the target audio signal having a relatively high audio intensity from among the identified non-delayed combined signal and the delayed combined signals generated from the microphone signal pair used to generate that non-delayed combined signal. The CPU 102 detects the direction corresponding to the identified target audio signal as the sound source direction.
[0102]
In step S40, the CPU 102 determines whether the audio intensity of the target audio signal
exceeds a predetermined audio intensity. If the CPU 102 determines that the audio intensity of
the target audio signal exceeds the predetermined audio intensity (YES in step S40), the control is
switched to step S42. If not (NO in step S40), CPU 102 returns the control to step S10.
[0103]
In step S42, the CPU 102 selects, as the selection unit 40 (see FIG. 8), the microphone signal pair
used to generate the target audio signal from the microphone signals output from the
microphones.
[0104]
In step S44, the CPU 102 sets, as the setting unit 60, the high-quality sound parameter 122 (see
FIG. 7) for generating an audio signal for output.
[0105]
In step S46, the CPU 102, as the generation unit 50 (see FIG. 8), synthesizes the microphone
signal pair selected in step S42 based on the high sound quality parameter 122.
Thereby, an audio signal having directivity in the sound source direction is generated.
[0106]
[Summary] As described above, the microphone system 100B according to the present embodiment delays one signal of each microphone signal pair output for each predetermined combination of the microphones 11 to 14 and then synthesizes the pair. Thereby, a plurality of audio signals having directivity in different directions are generated from one microphone signal pair. As a result, the microphone system 100B can increase the number of directivity directions and detect the sound source direction more accurately.
[0107]
Third Embodiment A microphone system 100C according to a third embodiment will be
described with reference to FIG. FIG. 11 is a diagram showing the directivity characteristic
formed by the microphone system 100C.
[0108]
In the first embodiment, the directional angle range (that is, the sound collection range) is
constant. On the other hand, in the third embodiment, the sound source direction is specified
after changing the sound collection range.
[0109]
The sound collection range is adjusted, for example, by changing the subtraction amount of the audio signal F s in the subtraction unit 28 of FIG. 2. That is, if the subtraction amount of the audio signal F s increases, the sound collection range narrows, and if the subtraction amount of the audio signal F s decreases, the sound collection range widens.
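In sketch form, the subtraction amount can be exposed as a scale factor on F s; the factor `beta` is a hypothetical knob (larger beta, more subtraction, narrower sound collection range):

```python
import numpy as np

def combined_spectrum(F_a, F_s, beta=1.0):
    """F_a, F_s: amplitude spectra from FIG. 2. `beta` scales the
    subtraction amount of F_s: beta > 1 narrows the sound collection
    range, beta < 1 widens it."""
    return np.maximum(F_a - beta * F_s, 0.0)
```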
[0110]
As shown in FIG. 11, the microphones 11 and 12 form a directivity characteristic 321 with a
narrow sound collection range and a directivity characteristic 331 with a wide sound collection
range. The microphones 12 and 13 form a directivity characteristic 322 with a narrow sound
collection range and a directivity characteristic 332 with a wide sound collection range. The
microphones 13 and 14 form a directivity characteristic 323 with a narrow sound collection
range and a directivity characteristic 333 with a wide sound collection range.
[0111]
When a sound source lies in the middle of the directivity characteristics 321 and 322 or in the middle of the directivity characteristics 322 and 323 (hereinafter also referred to as an "intermediate position"), the sound may not be accurately detected if only the directivity characteristics 321 to 323 with a narrow sound collection range are formed. However, by forming the directivity characteristics 331 to 333 with a wide sound collection range, the voice can be detected even when the sound source is at an intermediate position.
[0112]
Focusing on this point, the microphone system 100C first identifies the sound source direction after setting the sound collection range wide. Next, the microphone system 100C sets a plurality of narrow sound collection ranges so as to cover the sound collection range corresponding to that sound source direction. The microphone system 100C then identifies an audio signal having a relatively high audio intensity from among the audio signals corresponding to the narrow sound collection ranges, and detects the direction corresponding to that audio signal as the sound source direction. Thereby, the sound source direction is detected in detail.
[0113]
As described above, the microphone system 100C first sets the sound collection range wide to identify an approximate sound source direction, and then narrows the sound collection range to identify the sound source direction in detail. Thus, the microphone system 100C can prevent failure to detect the sound source.
[0114]
The microphone system 100C may not only change the sound collection range, but may detect
the sound source direction while changing the directivity direction as in the microphone system
100B according to the second embodiment.
[0115]
[Control Structure of Microphone System 100C] The control structure of the microphone system
100C will be described with reference to FIG.
FIG. 12 is a flowchart showing a part of the process performed by the microphone system 100C.
The process of FIG. 12 is realized by the CPU 102 executing a program. In other aspects, some or
all of the processing may be performed by circuit elements or other hardware.
[0116]
Since each step other than step S32A has been described with reference to FIG. 10, the description thereof will not be repeated.
[0117]
In step S32A, the CPU 102 sets, as the setting unit 60 (see FIG. 1), the low sound quality
parameter 121 (see FIG. 7) for detecting the sound source direction.
As one example, the low sound quality parameter 121 includes the angular range in which to direct directivity (that is, the sound collection range). The sound collection range set in step S32A is narrower than the sound collection range set in step S12.
[0118]
[Summary] As described above, the microphone system 100C according to the present embodiment detects an approximate sound source direction after widening the sound collection range, and then narrows the sound collection range to identify the sound source direction in detail. Thus, the microphone system 100C can prevent failure to detect the sound source and can accurately detect the sound source direction.
[0119]
Fourth Embodiment A microphone system 100D according to a fourth embodiment will be described with reference to FIG. 13. FIG. 13 is a view showing an arrangement example of the microphones in the fourth embodiment.
[0120]
In the first embodiment, the positional relationship between the microphones 11 to 14 is not
particularly limited. In the fourth embodiment, as shown in FIG. 13, the microphones 11 to 14
are arranged concentrically. That is, the microphones 11 to 14 are arranged equidistant from the
center 410.
[0121]
More specifically, directivity is formed in the perpendicular bisector direction 401 by the microphones 11 and 12, in the perpendicular bisector direction 402 by the microphones 12 and 13, and in the perpendicular bisector direction 403 by the microphones 13 and 14.
[0122]
By arranging the microphones 11 to 14 concentrically, the microphone system 100D can form directivities in mutually different directions. In addition, since the formed directivities do not overlap one another, the microphone system 100D can detect voice without gaps.
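As a small geometry check (not part of the patent text), the perpendicular bisector of each adjacent microphone pair on a circle points outward in a different direction, which is why the pairwise directivities do not overlap. The placement angles below are arbitrary assumptions.

```python
# Perpendicular bisector directions for microphones on a circle around the
# center 410 (taken as the origin). The placement angles are assumed values.

import math

def bisector_direction(p1, p2):
    """Unit vector along the perpendicular bisector of segment p1-p2,
    pointing away from the origin (the array center)."""
    mx, my = (p1[0] + p2[0]) / 2.0, (p1[1] + p2[1]) / 2.0
    norm = math.hypot(mx, my)
    return (mx / norm, my / norm)

angles_deg = [150.0, 110.0, 70.0, 30.0]   # assumed positions of mics 11 to 14
mics = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
        for a in angles_deg]

# Pairs (11,12), (12,13), (13,14) yield the distinct directions 401 to 403.
for a, b in [(0, 1), (1, 2), (2, 3)]:
    print(bisector_direction(mics[a], mics[b]))
```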
[0123]
The microphones 11 to 14 do not necessarily have to be arranged concentrically. For example, at least one of the microphones 11 to 14 may be placed off the straight line on which the remaining microphones are arranged. Alternatively, all of the microphones 11 to 14 may be arranged on the same straight line.
[0124]
Fifth Embodiment [Summary] The first embodiment did not describe the processing performed when the voice from the speaker (that is, the target sound) is not detected. In contrast, when the target sound is not detected, the microphone system 100E according to the fifth embodiment directs the directivity direction toward the front of the microphone system 100E to detect the target sound. This is because the speaker is likely to be standing in front of the microphone system 100E. The voice of the speaker can thus be detected more reliably.
[0125]
Further, when the target sound is not detected, the microphone system 100E expands the sound
collection range to detect the target sound. By widening the sound collection range, the voice of
the speaker can be detected more reliably.
[0126]
Thereby, the microphone system 100E can collect the speech of the speaker without omission, and can detect the speech from its very beginning (for example, the first several milliseconds). As a result, the accuracy of speech recognition is improved.
[0127]
[Control Structure of Microphone System 100E] The control structure of the microphone system 100E will be described with reference to FIG. 14. FIG. 14 is a flowchart showing a part of the process performed by the microphone system 100E. The process of FIG. 14 is realized by the CPU 102 executing a program. In other aspects, some or all of the processing may be performed by circuit elements or other hardware.
[0128]
Since the processes of steps S10 to S26 have already been described with reference to FIG. 6, the description of those steps will not be repeated.
[0129]
In step S52, the CPU 102 selects the set of microphones provided to collect sound from the front direction of the microphone system 100E.
The set of microphones is preset, for example, at the time of design of the microphone system 100E. In the example of FIG. 4, the microphones 12 and 13, which are provided to collect sound from the front direction, are selected.
[0130]
In step S54, the CPU 102, acting as the setting unit 60 (see FIG. 1), sets the high sound quality parameter 122 (see FIG. 7) used for generating the audio signal for output. The high sound quality parameter 122 includes a sound collection range. The sound collection range of the high sound quality parameter 122 set in step S54 is wider than the sound collection range of the high sound quality parameter 122 set in step S24.
[0131]
In step S56, the CPU 102, acting as the generation unit 50 (see FIG. 1), synthesizes the microphone signal pair output from the set of microphones selected in step S52, based on the high sound quality parameter 122. An audio signal having directivity in the front direction of the microphone system 100E is thereby generated.
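The fallback path of steps S52 to S56 can be summarized in a short sketch. The threshold, the helper names, and the concrete widened range are assumptions for illustration; the patent only specifies selecting the front-facing pair and widening the sound collection range relative to step S24.

```python
# Hedged sketch of steps S52-S56. `synthesize` and `front_pair` are assumed
# helpers, not the patent's actual interfaces.

def generate_front_signal(target_strength, front_pair, synthesize,
                          threshold=1e-3, s24_range_deg=30.0):
    if target_strength > threshold:
        return None   # target sound detected: the normal path (S10-S26) runs

    mic_a, mic_b = front_pair             # step S52: design-time front pair
    wide_range_deg = 2.0 * s24_range_deg  # step S54: widened collection range
    # Step S56: synthesize the pair into a front-facing output audio signal.
    return synthesize(mic_a, mic_b, collection_range_deg=wide_range_deg)
```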
[0132]
[Summary] When the signal strength of the target audio signal does not exceed a predetermined audio strength, the microphone system 100E synthesizes the microphone signal pair output from the set of microphones provided to collect sound from the front direction. An audio signal having directivity in the front direction of the microphone system 100E is thereby generated. At this time, the audio signal is generated after the sound collection range has been widened compared with the generation of a normal audio signal.
[0133]
Since the speaker is likely to be standing in front of the microphone system 100E, the voice emitted by the speaker can be detected more reliably. In addition, by widening the sound collection range, the voice of the speaker can be detected even more reliably.
[0134]
Sixth Embodiment [Overview] The microphone system 100A according to the first embodiment generates the next audio signal immediately after generating the audio signal to be finally output. In contrast, the microphone system 100F according to the sixth embodiment does not re-detect the sound source direction until a predetermined time has elapsed after generating the audio signal for output.
[0135]
This is advantageous in the following case. When the microphone system 100F is mounted on a voice recognition device, the voice recognition device performs voice recognition on the voice signal output from the microphone system 100F, and changes its display or dialogue contents according to the voice recognition result. The speaker then takes an action such as making the next voice input. Thus, there is no need to detect the sound source direction or the voice until the speaker makes the next voice input. The microphone system 100F therefore does not re-detect the sound source direction while the voice recognition device is processing the voice and until the speaker's next utterance. As a result, the microphone system 100F suppresses wasteful processing.
[0136]
[Control Structure of Microphone System 100F] The control structure of the microphone system 100F will be described with reference to FIG. 15. FIG. 15 is a flowchart showing a part of the process performed by the microphone system 100F. The process of FIG. 15 is realized by the CPU 102 executing a program. In other aspects, some or all of the processing may be performed by circuit elements or other hardware.
[0137]
Since each step other than steps S60 and S62 has already been described with reference to FIG. 6, the description of those steps will not be repeated.
[0138]
In step S60, the CPU 102 determines whether the voice recognition device is processing the voice signal output from the microphone system 100F.
Whether or not the voice signal is being processed is determined, for example, from a signal output by the voice recognition device indicating that voice recognition is in progress. When the CPU 102 determines that the voice signal output from the microphone system 100F is being processed by the voice recognition device (YES in step S60), the CPU 102 executes the process of step S60 again. If not (NO in step S60), the CPU 102 switches control to step S62.
[0139]
In step S62, the CPU 102 determines whether a predetermined time has elapsed since the generation of the audio signal for output. The elapsed time is measured, for example, by a timer (not shown) incorporated in the microphone system 100F or by a clock function incorporated in the CPU 102. If the CPU 102 determines that the predetermined time has elapsed since the generation of the audio signal for output (YES in step S62), control returns to step S10. If not (NO in step S62), the CPU 102 executes the process of step S62 again.
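Steps S60 and S62 amount to a simple wait loop. The sketch below assumes a recognizer_busy() query corresponding to the "voice recognition in progress" signal and a monotonic clock; both are illustrative, not the patent's actual interfaces.

```python
# Minimal sketch of the wait loop in steps S60 and S62.

import time

def wait_before_redetection(recognizer_busy, generated_at, hold_seconds):
    # Step S60: loop while the voice recognition device reports that it is
    # still processing the output voice signal.
    while recognizer_busy():
        time.sleep(0.01)

    # Step S62: loop until the predetermined time has elapsed since the
    # output audio signal was generated (generated_at from time.monotonic()).
    while time.monotonic() - generated_at < hold_seconds:
        time.sleep(0.01)
    # Control then returns to step S10 and the sound source is re-detected.
```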
[0140]
(Modification) When the microphone system 100F is provided with a camera or a human sensor, the microphone system 100F may perform human detection in the detected sound source direction and maintain the detected sound source direction while a human is detected in that direction.
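A sketch of this modification, assuming a hypothetical human_detected_in(direction) sensor query and a redetect() fallback, neither of which is named in the patent:

```python
# Illustrative only: keep the current sound source direction while a camera
# or human sensor reports a person in that direction.

def update_direction(current_dir, human_detected_in, redetect):
    if current_dir is not None and human_detected_in(current_dir):
        return current_dir      # a person is still there: keep the direction
    return redetect()           # otherwise fall back to normal re-detection
```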
[0141]
Further, since the sound to be detected is limited to the voice of the speaker, the frequency band needed for the sound source direction detection process may be limited to about 150 Hz to 500 Hz. The microphone system 100F may extract the 150 Hz to 500 Hz component from the audio signal and determine that the speaker is talking while the strength of that component remains within a certain range. The microphone system 100F does not re-detect the sound source direction while the speaker is talking. Since only the sound component of this specific frequency band is used, the processing time is shortened, and noises other than the target sound (e.g., the sound of objects colliding) are suppressed.
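The band-limited check could be sketched as follows. The frame length, sample rate, windowing, and the stability criterion are assumptions; the patent only names the 150 Hz to 500 Hz band.

```python
# Hedged sketch of the 150-500 Hz band check in paragraph [0141].

import numpy as np

def band_strength(frame, sample_rate, lo=150.0, hi=500.0):
    """Energy of the 150-500 Hz component of one audio frame."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    band = (freqs >= lo) & (freqs <= hi)
    return float(np.sum(np.abs(spectrum[band]) ** 2))

def still_talking(prev_strength, strength, tolerance=0.5):
    """Assumed criterion: the speaker is treated as still talking while the
    band strength stays within a certain range of the previous frame's."""
    if prev_strength <= 0.0:
        return False
    return abs(strength - prev_strength) / prev_strength <= tolerance
```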
[0142]
[Summary] As described above, the microphone system 100F according to the present embodiment does not re-detect the sound source direction until a predetermined time has elapsed after generating the audio signal for output. As a result, the number of times the sound source direction detection process is performed can be reduced, and the time required for audio processing can be shortened.
[0143]
It should be understood that the embodiments disclosed herein are illustrative and non-restrictive in every respect. The scope of the present invention is indicated not by the above description but by the claims, and is intended to include all modifications within the meaning and scope equivalent to the claims.
[0144]
11 to 14 microphone, 15A to 15D delay unit, 20, 50 generation unit, 21A to 21C, 22A to 22I combination unit, 24 addition unit, 25, 27 FFT unit, 26, 28 subtraction unit, 30 detection unit, 40 selection unit, 60 setting unit, 100A to 100F microphone system, 101 ROM, 102 CPU, 103 RAM, 104 network interface, 120 storage device, 121 low sound quality parameter, 122 high sound quality parameter, 123 speech processing program, 301, 301A, 301B, 302, 302A, 302B, 303, 303A, 303B, 311, 313, 321, 322, 323, 331, 332, 333 directivity characteristic, 401 to 403 perpendicular bisector direction, 410 center.