Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2013135433
Abstract: The present invention provides a voice processing apparatus that makes it easy to hear sound from a specific direction regardless of individual differences among microphones or their installation environment. A voice processing apparatus (1) converts the sound collected by two voice input units (2-1, 2-2) into first and second frequency signals on a frame-by-frame basis. A phase difference calculation unit 12 calculates the phase difference between the frequency signals for each of a plurality of frequency bands. A detection unit 13 detects frequency bands in which the rate, over a predetermined number of frames, at which the phase difference falls within a first range of phase differences attainable for a predetermined sound source direction does not satisfy the condition corresponding to sound from that direction. A range setting unit 14 sets, for each detected frequency band, a second range expanded from the first range. A signal correction unit 16 makes the amplitudes of the first and second frequency signals when the phase difference is included in the second range larger than the amplitudes of the frequency signals when the phase difference deviates from the second range. [Selected figure] Figure 2
Voice processing apparatus, voice processing method, and computer program for voice
processing
[0001]
The present invention relates to, for example, an audio processing device, an audio processing
method, and a computer program for audio processing that make it easier to hear audio from a
specific direction among audio collected using a plurality of microphones.
[0002]
03-05-2019
1
BACKGROUND In recent years, voice processing devices that collect voice with a plurality of microphones have been developed, such as telephone conference systems and telephones equipped with a hands-free function.
In such voice processing devices, technologies for suppressing sound arriving from other than a specific direction have been studied in order to make the voice from that direction easier to hear among the collected voices (see, for example, Patent Documents 1 to 5).
[0003]
For example, the directional sound collector disclosed in Patent Document 1 converts sounds from sound sources present in a plurality of directions into signals on the frequency axis, calculates a suppression function for suppressing those signals, and corrects the signals on the frequency axis by multiplying the suppression function by the amplitude component of the original signal on the frequency axis. This directional sound collector calculates the phase component of each signal on the frequency axis at each frequency, calculates the difference between the phase components, and identifies, based on that difference, a probability value indicating the likelihood that the sound source exists in a predetermined direction. Then, the directional sound collector calculates, based on the probability value, a suppression function that suppresses the sound from sound sources other than the sound source in the predetermined direction.
[0004]
Moreover, the noise suppression apparatus disclosed in Patent Document 2 separates the sound sources of the sound received by two or more microphones, and estimates the sound source direction of the target sound among the separated sound sources. Then, the noise suppression device detects the phase difference between the microphones using the sound source direction of the target sound, updates the central value of the phase difference using the detected phase difference, and suppresses noise in the sound received by the microphones using a noise suppression filter generated from the updated central value.
[0005]
The audio signal processing method disclosed in Patent Document 3 determines the voice section and the noise section of a first input sound signal, and determines whether the power of the first input sound signal in the noise section is larger than a first threshold. If the power of the first input sound signal is less than or equal to the first threshold, the method suppresses noise in the voice section and the noise section of the first input sound signal based on the magnitude of the power in the noise section. On the other hand, if the power of the first input sound signal is greater than the first threshold, the method suppresses the first input sound signal according to the phase difference between the first and second input sound signals.
[0006]
Furthermore, the sound collection device disclosed in Patent Document 4 divides the two-channel audio signal from the microphones into a plurality of frequency bands for each frame, calculates the level or phase for each channel and frequency band, and applies a weighted average of the levels or phases of past frames to the current frame. Then, the sound collection device determines, based on the weighted-average level or phase difference between channels, to which sound source each frequency band component belongs, and synthesizes across frequency bands the frequency band component signals determined to come from the same sound source.
[0007]
Furthermore, the noise suppression device disclosed in Patent Document 5 calculates a cross spectrum from the acoustic signals acquired by two microphones, measures the time variation of the phase component of the cross spectrum, and treats frequency components with little variation as speech components and components with large variation as noise components. Then, the noise suppression device calculates a correction coefficient that suppresses the amplitude of the noise components.
[0008]
JP 2007-318528 A, JP 2010-176105 A, JP 2011-99967 A, JP 2003-78988 A, JP 2011-33717 A
[0009]
However, depending on individual differences among the microphones used to collect sound or on the installation environment of the microphones, the actual phase difference of the sound from a sound source located in a specific direction, as collected by each microphone, may not match the theoretical value of the phase difference.
As a result, the direction of the sound source may not be estimated correctly. Therefore, with any of the prior art, the voice to be emphasized may be erroneously suppressed, or the voice to be suppressed may not be suppressed.
[0010]
Therefore, the present specification aims to provide a voice processing device that makes it easy to hear sound from a specific direction regardless of individual differences among microphones or the installation environment.
[0011]
According to one embodiment, an audio processing device is provided.
The audio processing device includes: a time-frequency conversion unit that converts a first audio signal representing the sound collected by a first audio input unit and a second audio signal representing the sound collected by a second audio input unit into a first frequency signal and a second frequency signal in the frequency domain for each frame having a predetermined time length; a phase difference calculation unit that calculates, for each frame, the phase difference between the first frequency signal and the second frequency signal for each of a plurality of frequency bands; a detection unit that determines, for each frame and for each of the plurality of frequency bands, whether the phase difference between the first frequency signal and the second frequency signal is included in a first range of phase differences attainable for a predetermined sound source direction, obtains the rate at which the phase difference is included in the first range over a predetermined number of frames, and detects, among the plurality of frequency bands, a frequency band in which that rate does not satisfy the condition corresponding to sound from the direction of the sound source; a range setting unit that sets, for the frequency band detected by the detection unit, a second range expanded from the first range for the direction of the sound source; a signal correction unit that obtains corrected first and second frequency signals by making the amplitude of at least one of the first and second frequency signals when the phase difference is included in the second range larger than the amplitude of that frequency signal when the phase difference deviates from the second range; and a frequency-time conversion unit that converts the corrected first and second frequency signals into first and second audio signals in the time domain.
[0012]
The objects and advantages of the invention will be realized and attained by the elements and
combinations particularly pointed out in the claims. It is to be understood that both the foregoing
general description and the following detailed description are exemplary and explanatory and are
not restrictive of the invention, as claimed.
[0013]
The voice processing device disclosed in this specification makes it easy to hear sounds from a specific direction regardless of individual differences among microphones or the installation environment.
[0014]
FIG. 1 is a schematic block diagram of a voice input system having a voice processing device according to one embodiment.
FIG. 2 is a schematic block diagram of the voice processing device according to a first embodiment. FIG. 3 is a diagram showing an example of the phase difference between the first frequency signal and the second frequency signal for sound from a sound source located in a specific direction. FIG. 4 is a diagram showing an example of the relationship between two microphones and each sub-direction range. FIG. 5 is a diagram showing an example of the range of phase differences that can be taken for each sub-direction range. FIG. 6 is a diagram showing an example of the time change of the achievement rate. FIG. 7 is a table showing an example of the maximum value, average value, and variance of the achievement rate for each frequency band. FIG. 8 is an operation flowchart of relaxation frequency band setting processing. FIGS. 9(a) to 9(c) are diagrams each showing an example of the relationship between a reference range and the non-suppression range corrected for a relaxation frequency band. FIG. 10 is an operation flowchart of audio processing. FIG. 11 is an operation flowchart of relaxation frequency band setting processing according to a second embodiment. FIG. 12 is a schematic block diagram of a voice processing device according to a third embodiment.
[0015]
Voice processing devices according to various embodiments will now be described with reference to the figures. The voice processing device obtains phase differences between the voice signals collected by a plurality of voice input units for each of a plurality of frequency bands, estimates the direction of a specific sound source from the phase difference of each frequency band, and attenuates audio signal components arriving from directions other than that of the sound source. At this time, the voice processing device obtains, for each frequency band, the rate at which the phase difference is included in the range of phase differences corresponding to the direction of the sound source to be collected over the most recent fixed period. In frequency bands where that rate is low, the voice processing device estimates that the phase difference fluctuates due to individual differences among the microphones or the installation environment of the microphones, and extends the range of phase differences over which the voice signal is not attenuated.
[0016]
FIG. 1 is a schematic block diagram of a voice input system having a voice processing device
according to one embodiment. The voice input system 1 is, for example, a teleconference system, and includes voice input units 2-1 and 2-2, an analog/digital conversion unit 3, a storage unit 4, a storage medium access device 5, a voice processing device 6, a control unit 7, a communication unit 8, and an output unit 9.
[0017]
The voice input units 2-1 and 2-2 each have, for example, a microphone, collect the sound around them, and output analog audio signals corresponding to the volume of that sound to the analog/digital conversion unit 3. The voice input unit 2-1 and the voice input unit 2-2 are spaced apart by a predetermined interval (for example, several centimeters to a few tens of centimeters) so that the time at which sound arrives differs between the voice input units according to the position of the sound source. Therefore, the phase difference between the audio signals obtained by the two audio input units 2-1 and 2-2 also changes according to the direction of the sound source, and the voice processing device 6 can estimate the direction of the sound source by examining the phase difference.
[0018]
The analog/digital conversion unit 3 includes, for example, an amplifier and an analog/digital converter. The analog/digital conversion unit 3 amplifies the analog audio signals received from the audio input units 2-1 and 2-2 with the amplifier. Then, the analog/digital conversion unit 3 generates digitized audio signals by sampling the amplified analog audio signals with the analog/digital converter at a predetermined sampling period. Hereinafter, for convenience, the audio signal obtained by digitizing the analog audio signal generated by the voice input unit 2-1 is referred to as the first audio signal, and the audio signal obtained by digitizing the analog audio signal generated by the voice input unit 2-2 is referred to as the second audio signal. The analog/digital conversion unit 3 outputs the first and second audio signals to the audio processing device 6.
[0019]
The storage unit 4 includes, for example, a readable / writable semiconductor memory and a
read-only semiconductor memory. The storage unit 4 stores various computer programs and
various data used in the voice input system 1. Furthermore, the storage unit 4 may store the first
and second audio signals corrected by the audio processing device 6.
[0020]
The storage medium access device 5 is a device for accessing a storage medium 10 such as, for example, a magnetic disk, a semiconductor memory card, or an optical storage medium. The storage medium access device 5 reads, for example, a computer program to be executed on the control unit 7 that is stored in the storage medium 10, and passes it to the control unit 7. Further, as described later, when the control unit 7 executes a computer program for realizing the functions of the voice processing device 6, the storage medium access device 5 may read the computer program for voice processing from the storage medium 10 and pass it to the control unit 7.
[0021]
The audio processing device 6 corrects the first and second audio signals by attenuating the sound or noise from sound sources other than the sound source located in the specific direction, which is included in the first and second audio signals, thereby making the sound from that particular direction easier to hear. Then, the audio processing device 6 outputs the corrected first and second audio signals.
[0022]
The voice processing device 6 may be integrally formed with the control unit 7. In this case, the
audio processing performed by the audio processing device 6 is performed by, for example, a
functional module realized by a computer program executed on a processor of the control unit 7.
Then, various data generated by the voice processing device or used by the voice processing
device are stored in the storage unit 4. The details of the voice processing device 6 will be
described later.
[0023]
The control unit 7 includes one or more processors, a memory circuit, and peripheral circuits, and controls the entire voice input system 1. For example, when a conference call is started by a user operation via an operation unit (not shown) such as a keypad of the voice input system 1, the control unit 7 executes call control processing such as calling, answering, and disconnecting between the voice input system 1 and a switch or a Session Initiation Protocol (SIP) server. Then, the control unit 7 encodes the first and second audio signals corrected by the audio processing device 6, and outputs the encoded first and second audio signals through the communication unit 8. For the encoding, the control unit 7 can use, for example, the voice coding technology defined in Recommendation G.711, G.722.1, or G.729A of the International Telecommunication Union Telecommunication Standardization Sector (ITU-T). Further, the control unit 7 may decode an encoded audio signal received from another device via the communication unit 8 and output the decoded audio signal to a speaker (not shown) via the output unit 9.
[0024]
The communication unit 8 outputs the first and second audio signals corrected by the audio processing device 6 to another device connected to the audio input system 1 via a communication network. For that purpose, the communication unit 8 has an interface circuit for connecting the voice input system 1 to the communication network. The communication unit 8 converts the audio signal encoded by the control unit 7 into a transmission signal conforming to a predetermined communication standard, and outputs the transmission signal to the communication network. The communication unit 8 may also receive a signal conforming to the predetermined communication standard from the communication network, extract an encoded audio signal from the received signal, and pass the encoded audio signal to the control unit 7. The predetermined communication standard can be, for example, the Internet Protocol (IP), in which case the transmission and reception signals are IP-packetized signals.
[0025]
The output unit 9 outputs the audio signal received from the control unit 7 to a speaker (not
shown). For that purpose, the output unit 9 includes, for example, a digital / analog converter for
converting an audio signal received from the control unit 7 into an analog signal.
[0026]
Hereinafter, the details of the voice processing device 6 will be described. FIG. 2 is a schematic block diagram of the voice processing device 6. The voice processing device 6 includes a time-frequency conversion unit 11, a phase difference calculation unit 12, a detection unit 13, a suppression range setting unit 14, a suppression function calculation unit 15, a signal correction unit 16, and a frequency-time conversion unit 17. These units may be implemented as separate circuits in the voice processing device 6, or as one integrated circuit that implements the functions of the respective units. Alternatively, these units may be implemented as functional modules realized by a computer program executed on a processor included in the control unit 7, for example.
[0027]
The time-frequency conversion unit 11 converts the first and second audio signals into first and second frequency signals in the frequency domain in units of frames, each having a predetermined time length (for example, several tens of msec). For that purpose, the time-frequency conversion unit 11 converts the first and second audio signals into the first and second frequency signals by applying a time-frequency transform such as a fast Fourier transform (FFT) or a modified discrete cosine transform (MDCT). Alternatively, the time-frequency conversion unit 11 may use a Quadrature Mirror Filter (QMF) filter bank or a wavelet transform as the time-frequency transform. The time-frequency conversion unit 11 outputs the first and second frequency signals to the phase difference calculation unit 12 and the signal correction unit 16 for each frame.
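As an illustrative sketch (not part of the patent), the framing and transform step above might look as follows in Python; the frame length, hop size, and window are hypothetical choices, since the text specifies only frames of a few tens of msec:

```python
import numpy as np

def to_frequency_signals(x1, x2, frame_len=512, hop=256):
    """Split two time-domain signals into overlapping frames and apply an
    FFT to each frame, yielding per-frame first/second frequency signals.
    frame_len, hop, and the Hann window are illustrative assumptions."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x1) - frame_len) // hop
    S1 = np.empty((n_frames, frame_len // 2 + 1), dtype=complex)
    S2 = np.empty_like(S1)
    for t in range(n_frames):
        seg = slice(t * hop, t * hop + frame_len)
        S1[t] = np.fft.rfft(x1[seg] * window)  # first frequency signal, frame t
        S2[t] = np.fft.rfft(x2[seg] * window)  # second frequency signal, frame t
    return S1, S2
```

At a 16 kHz sampling rate, a 512-sample frame corresponds to 32 msec, consistent with the "several tens of msec" in the text.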
[0028]
The phase difference calculation unit 12 obtains the difference between the phase of the first frequency signal and the phase of the second frequency signal for each of a plurality of frequency bands each time the first and second frequency signals are received. For example, the phase difference calculation unit 12 obtains the phase difference Δθf for each frequency band according to the following equation, where S1f represents the component of the first frequency signal in frequency band f, S2f represents the component of the second frequency signal in frequency band f, and fs represents the sampling frequency. The phase difference calculation unit 12 passes the phase difference Δθf of each frequency band to the detection unit 13 and the signal correction unit 16.
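The equation itself is not reproduced in this machine translation. As a hedged sketch, the per-band phase difference can be computed as the angle of the cross spectrum, a common formulation consistent with the variables named above; whether it matches the patent's exact equation is an assumption:

```python
import numpy as np

def phase_differences(S1_frame, S2_frame):
    """Per-band phase difference between the first and second frequency
    signals of one frame, via the cross-spectrum angle, wrapped to
    (-pi, pi]. An illustrative stand-in for the patent's equation."""
    return np.angle(S1_frame * np.conj(S2_frame))
```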
[0029]
For each frame, the detection unit 13 determines, for each of the plurality of frequency bands, whether the phase difference Δθf is included in the range of phase differences that can be taken for the direction of the sound source to be collected. Then, the detection unit 13 obtains the rate at which the phase difference Δθf is included in that range over the most recent predetermined number of frames, and detects frequency bands in which that rate does not satisfy the condition corresponding to sound from the direction of the sound source as relaxation frequency bands. A relaxation frequency band is a frequency band in which the first and second frequency signals are not attenuated over a wider range than the range of phase differences attainable for the direction of the sound source to be collected.
[0030]
FIG. 3 is a diagram showing an example of the phase difference between the first frequency signal and the second frequency signal for sound from a sound source located in a specific direction. In FIG. 3, the horizontal axis represents frequency and the vertical axis represents phase difference. The graph 300 represents the phase difference for each frequency band measured for a given frame. The dotted line 310 represents the theoretical value of the phase difference for a specific sound source direction, and the range 320 represents the range of values that the phase difference can take when the direction of the sound source is estimated within a certain direction width centered on that specific sound source direction. The enlarged view 330 magnifies the portion of the graph 300 below about 500 Hz. As shown in FIG. 3, the phase difference falls almost entirely outside the range 320 for frequency bands below about 300 Hz. This is due to individual differences among the microphones of the voice input units 2-1 and 2-2, or to reflection, reverberation, and the like in the installation environment of the microphones. In such a frequency band, the phase difference may deviate from the range 320 over a plurality of frames.
[0031]
Therefore, for each of a plurality of sub-direction ranges obtained by dividing the direction range in which the sound source may be present, the detection unit 13 determines whether the phase difference Δθf falls within the range of phase differences that can be taken for that sub-direction range. In the following, for convenience, the range of phase differences that can be taken for each sub-direction range is referred to as the phase difference range for that sub-direction range.
[0032]
FIG. 4 is a diagram showing an example of the relationship between the voice input units 2-1 and 2-2 and the sub-direction ranges. As shown in FIG. 4, the angle measured at the midpoint O of the line connecting the voice input units 2-1 and 2-2, relative to the normal direction nd of that line, is taken as 0, with the counterclockwise direction positive and the clockwise direction negative. The direction range in which the sound source may be present is −π/2 to π/2. Each sub-direction range 401-1 to 401-n is set, for example, to one of the ranges obtained by equally dividing into n the direction range in which a sound source may exist, with the midpoint O as the origin, where n is an integer of 2 or more. For example, when n = 3, the sub-direction ranges 401-1 to 401-3 are −π/2 to −π/6, −π/6 to π/6, and π/6 to π/2, respectively.
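The equal division described above can be sketched as follows (an illustrative helper, not from the patent):

```python
import numpy as np

def sub_direction_ranges(n):
    """Divide the direction range [-pi/2, pi/2] into n equal sub-direction
    ranges, returned as (lower, upper) angle pairs in radians."""
    edges = np.linspace(-np.pi / 2, np.pi / 2, n + 1)
    return list(zip(edges[:-1], edges[1:]))
```

For n = 3 this reproduces the ranges given in the text: (−π/2, −π/6), (−π/6, π/6), and (π/6, π/2).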
[0033]
The detection unit 13 sets each sub-direction range in turn as the sub-direction range of interest. Then, for each frequency band, the detection unit 13 determines for each frame whether the phase difference is included in the phase difference range of the sub-direction range of interest. The farther apart the voice input unit 2-1 and the voice input unit 2-2 are, the larger the difference between the time at which the sound from a specific sound source reaches the voice input unit 2-1 and the time at which it reaches the voice input unit 2-2, and as a result the larger the phase difference. Therefore, the phase difference at the center of the phase difference range is set according to the distance between the audio input unit 2-1 and the audio input unit 2-2. The wider the sub-direction range, the wider the phase difference range for that sub-direction range. Furthermore, the higher the frequency of the sound, the shorter its wavelength, so the higher the frequency, the larger the phase difference between the first frequency signal and the second frequency signal; the phase difference range therefore becomes wider as the frequency becomes higher.
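The dependencies described above (phase difference growing with microphone spacing, sub-direction width, and frequency) can be sketched with the standard free-field model Δθ = 2πf·d·sin(θ)/c. The patent does not state this formula explicitly here, so the model, the spacing d, and the speed of sound c are assumptions for illustration:

```python
import numpy as np

def phase_difference_range(f_hz, theta_lo, theta_hi, d=0.05, c=340.0):
    """Possible phase-difference range (radians) for a sub-direction range
    [theta_lo, theta_hi] at frequency f_hz, assuming two microphones
    spaced d meters apart and sound speed c m/s. Consistent with the text:
    the range widens with d, with the sub-range width, and with f_hz."""
    lo = 2 * np.pi * f_hz * d * np.sin(theta_lo) / c
    hi = 2 * np.pi * f_hz * d * np.sin(theta_hi) / c
    return lo, hi
```

For the central sub-direction range (−π/6, π/6), the resulting range is symmetric about zero and grows linearly with frequency, matching the qualitative behavior described above.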
[0034]
FIG. 5 is a diagram showing an example of the phase difference range for each sub-direction range. In this example, three sub-direction ranges are assumed. The phase difference range 501 corresponds to the sub-direction range including the normal direction nd to the line connecting the voice input unit 2-1 and the voice input unit 2-2. The phase difference range 502 corresponds to the sub-direction range closer to the voice input unit 2-1 side than the normal direction nd, while the phase difference range 503 corresponds to the sub-direction range closer to the voice input unit 2-2 side than the normal direction nd.
[0035]
The detection unit 13 obtains, for the latest frame t, a determination value d(t) indicating whether the phase difference is included in the phase difference range for the sub-direction range of interest. That is, when the phase difference is included in the phase difference range of the sub-direction range of interest, the detection unit 13 sets the determination value d(t) for that sub-direction range in frame t to 1. On the other hand, if the phase difference is out of the phase difference range, the detection unit 13 sets the determination value d(t) to 0.
Then, the detection unit 13 calculates, for each frequency band, the rate at which the phase difference for the sub-direction range of interest is included in the phase difference range over the most recent predetermined number of frames, according to the following equation. This rate is hereinafter referred to as the achievement rate for convenience. Here, ARPf<n>(t−1) and ARPf<n>(t) represent the achievement rates for frequency band f in the n-th sub-direction range for frame (t−1) and frame t, respectively. In addition, α is a forgetting factor, set to a value obtained by subtracting from 1 the reciprocal of the number of frames used to calculate the achievement rate, for example a value within the range of 0.9 to 0.99. As is apparent from equation (2), the range of possible values of the achievement rate ARPf<n>(t) is 0 to 1. At the start of operation of the speech processing device 6, the value of the achievement rate calculated by equation (2) is unstable. Therefore, the detection unit 13 sets the forgetting factor α in equation (2) to 0 for the first frame after the voice processing device 6 starts operating (that is, t = 1), to 0.5 when t is 10 or less, and to a value of 0.9 to 0.99 once t exceeds 10.
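Equation (2) itself is not reproduced in this translation; the exponential average below is a sketch consistent with the description (forgetting factor α, result confined to 0 to 1, and the start-up schedule for α). The steady-state value 0.95 is an arbitrary pick from the stated 0.9 to 0.99 range:

```python
def update_achievement_rate(prev_arp, d, t):
    """One step of the achievement-rate recursion for one frequency band
    and sub-direction range. d is the per-frame determination value (1 if
    the phase difference fell inside the phase difference range, else 0);
    t is the frame index starting at 1."""
    if t == 1:
        alpha = 0.0   # first frame after start-up, per the text
    elif t <= 10:
        alpha = 0.5   # warm-up frames, per the text
    else:
        alpha = 0.95  # steady state; text allows 0.9 to 0.99
    return alpha * prev_arp + (1.0 - alpha) * d
```

Since d is 0 or 1 and α is in [0, 1), the updated rate always stays within 0 to 1, as the text states for equation (2).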
[0036]
The detection unit 13 also has, for example, a volatile memory circuit, and stores the achievement rates ARPf<n>(t) for the most recent predetermined number of frames in the memory circuit. This number of frames can be, for example, the number of frames used to calculate the achievement rate.
[0037]
FIG. 6 is a diagram showing an example of the time change of the achievement rate. In FIG. 6, the horizontal axis represents time and the vertical axis represents the achievement rate. The graphs 601 to 608 represent the time change of the achievement rate at frequencies of 100 Hz, 200 Hz, 300 Hz, 600 Hz, 800 Hz, 1200 Hz, 1400 Hz, and 2000 Hz, respectively. As shown in FIG. 6, the frequency bands of 300 Hz or less are affected by the individual differences or installation environment of the microphones of the voice input units 2-1 and 2-2, so the actual value of the phase difference at those frequencies differs from the theoretical value. Therefore, in the frequency bands of 300 Hz or less, the achievement rate stays at or below a certain constant value A, which is very low, regardless of the passage of time. On the other hand, in the frequency bands higher than 300 Hz, the achievement rate is higher than the constant value A most of the time.
[0038]
Therefore, once enough time for the achievement rate to stabilize (for example, 1 sec to 2 sec) has passed after the voice processing device 6 starts operating, the detection unit 13 determines, for each sub-direction range and each frequency band, the maximum value MAXARPf<n> among the achievement rates ARPf<n>(t) stored in the memory circuit. For example, among the M achievement rates ARPfj<ni>(t) to ARPfj<ni>(t−(M+1)) calculated for the sub-direction range ni and the frequency band fj and stored in the memory circuit, if the achievement rate ARPfj<ni>(m) at time m is the maximum, then MAXARPfj<ni> = ARPfj<ni>(m).
[0039]
Furthermore, the detection unit 13 calculates, for each frequency band, the average value AVMAXARPf and the variance VMAXARPf of MAXARPf<n> over all sub-direction ranges. In general, if there is a sound source to be collected in a specific direction, MAXARPf<n> is high for the sub-direction range including that direction, so the average value AVMAXARPf also increases. Since the values of MAXARPf<n> then differ between sub-direction ranges, the variance VMAXARPf also becomes relatively large. However, in a frequency band where the phase difference between the first frequency signal and the second frequency signal varies due to individual differences among microphones or the installation environment of the microphones, MAXARPf<n> is low for all sub-direction ranges, so the average value AVMAXARPf decreases. Further, in such a frequency band the spread of MAXARPf<n> across sub-direction ranges also decreases, so the variance VMAXARPf becomes relatively small.
[0040]
Therefore, the detection unit 13 determines, for each frequency band, whether the average
value AVMAXARPf is equal to or less than a predetermined threshold Th1 and the variance
VMAXARPf is equal to or less than a variance threshold Th2. For a frequency band in which
AVMAXARPf is equal to or less than Th1 and VMAXARPf is equal to or less than Th2, the
detection unit 13 determines that the non-suppression range, i.e., the range of phase
differences for which the first and second frequency signals are not suppressed, should be
made wider than the reference range. The reference range corresponds to the range of phase
differences that can occur for sound arriving from the direction of the sound source to be
collected; therefore, when the direction of the sound source is searched for each sub-direction
range, the phase difference range for that sub-direction range matches the reference range. On
the other hand, for a frequency band in which AVMAXARPf is higher than Th1 or VMAXARPf is
larger than Th2, the detection unit 13 sets the non-suppression range equal to the reference
range. The detection unit 13 then notifies the suppression range setting unit 14 of each
relaxation frequency band, i.e., each frequency band for which the non-suppression range is to
be made wider than the reference range.
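The per-band test described above can be sketched as follows. The variable names follow paragraphs [0038] to [0040], but the data layout (achievement rates stored per sub-direction range, frequency band, and frame) is an assumption made for illustration:

```python
def find_relaxation_bands(arp, th1, th2):
    """arp[n][f] is the list of achievement rates ARPf<n>(t) stored for
    sub-direction range n and frequency band f over the last M frames.
    Returns the indices of the relaxation frequency bands."""
    n_bands = len(arp[0])
    relaxation = []
    for f in range(n_bands):
        # MAXARPf<n>: maximum achievement rate per sub-direction range
        max_arp = [max(rates[f]) for rates in arp]
        av = sum(max_arp) / len(max_arp)                          # AVMAXARPf
        var = sum((m - av) ** 2 for m in max_arp) / len(max_arp)  # VMAXARPf
        # low average AND low variance -> widen the non-suppression range
        if av <= th1 and var <= th2:
            relaxation.append(f)
    return relaxation
```

A band whose maximum achievement rate is both low on average and uniform across all sub-direction ranges is flagged; a band with one clearly dominant sub-direction range is not.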
[0041]
The threshold Th1 is determined based on, for example, the distribution of the maximum values
of the achievement rates over all frequency bands. For example, the threshold Th1 is set to 1
minus the maximum value of the achievement rates over all frequency bands, or to that value
multiplied by a coefficient of 0.8 to 1.0. The variance threshold Th2 is set to, for example, the
variance value at which the frequency becomes a local minimum below the mode or median of
the variances in the histogram of the distribution of the maximum achievement rate MAXARPf
obtained for each frequency band and each frame.
[0042]
FIG. 7 shows a table 700 giving an example of the maximum value MAXARPf<n>, the average
value AVMAXARPf, and the variance VMAXARPf of the achievement rate for each frequency
band. In FIG. 7, the top row 701 of the table 700 represents the frequency band. In this
example, the frequency range corresponding to the human audible range is divided into 128
frequency bands. Six sub-direction ranges are set, and the leftmost column 702 of the table
700 shows the indices '1' to '6' of the respective sub-direction ranges. The lower two rows of
the table 700 show, respectively, the average value AVMAXARPf and the variance VMAXARPf of
MAXARPf<n> for each frequency band.
[0043]
Referring to FIG. 7, for the frequency bands '1' and '2', for example, the average value
AVMAXARPf is equal to or less than the threshold Th1, and the variance VMAXARPf is equal to
or less than the variance threshold Th2. Therefore, for the frequency bands '1' and '2', it is
determined that the non-suppression range should be made wider than the reference range.
[0044]
FIG. 8 is an operation flowchart of the relaxation frequency band setting process performed by
the detection unit 13. The detection unit 13 calculates, for each frequency band and for each of
the plurality of sub-direction ranges, an evaluation value indicating whether or not the phase
difference Δθf is included in the phase difference range for that sub-direction range (step
S101). The detection unit 13 then updates the achievement rate ARPf<n>(t) based on the
evaluation value for each frequency band for each of the plurality of sub-direction ranges (step
S102).
[0045]
The detection unit 13 calculates, for each frequency band in each sub-direction range, the
maximum value MAXARPf<n> of the achievement rate ARPf<n>(t) over the most recent
predetermined number of frames (step S103). Furthermore, the detection unit 13 calculates,
for each frequency band, the average value AVMAXARPf and the variance VMAXARPf of
MAXARPf<n> over all sub-direction ranges. The detection unit 13 then sets, as relaxation
frequency bands, the frequency bands in which AVMAXARPf is equal to or less than the
threshold Th1 and VMAXARPf is equal to or less than the variance threshold Th2 (step S104).
After step S104, the detection unit 13 ends the relaxation frequency band setting process.
[0046]
Furthermore, in order to estimate the target direction range in which the sound source to be
collected is present, the detection unit 13 identifies, for each frequency band, the sub-direction
range in which MAXARPf<n> is maximum. The detection unit 13 then estimates that the
sub-direction range identified as the maximum for the largest number of frequency bands is
the target direction range. Alternatively, the detection unit 13 may estimate the target direction
range by any of various other techniques for estimating the direction of a sound source. For
example, the detection unit 13 may estimate the target direction range based on the cost
function disclosed in Japanese Patent Laid-Open No. 2010-176105. The detection unit 13 then
notifies the suppression range setting unit 14 of the target direction range.
[0047]
The suppression range setting unit 14 is an example of a range setting unit. For each frequency
band, it sets the suppression range, which is the range of phase differences for which the first
and second frequency signals are attenuated, and the non-suppression range, which is the
range of phase differences for which the first and second frequency signals are not attenuated.
At this time, for each relaxation frequency band notified from the detection unit 13, the
suppression range setting unit 14 makes the non-suppression range for the target direction
range wider than the reference range. The suppression range and the non-suppression range
are mutually exclusive: the suppression range is the range of phase differences not included in
the non-suppression range. Note that an intermediate region in which the amount of
suppression changes gradually may be provided between the suppression range and the
non-suppression range in order to avoid an abrupt change in the amount of suppression.
Hereinafter, the method of setting the non-suppression range will be described.
[0048]
The suppression range setting unit 14 includes, for example, a non-volatile semiconductor
memory circuit. For each frequency band, the memory circuit stores the width δf of the phase
difference, which corresponds to the fluctuation width of the phase difference for one
sub-direction range, and the central value Cf<n> of the phase difference for each sub-direction
range n (n = 1, 2, 3, ..., N). Referring to the memory circuit, the suppression range setting unit
14 identifies the central value Cf<n> of the phase difference of each frequency band
corresponding to the target direction range notified from the detection unit 13, and takes the
region of width δf centered on Cf<n> as the reference range.
[0049]
Next, when the relaxation frequency band is notified from the detection unit 13, the suppression
range setting unit 14 makes the non-suppression range wider than the reference range for the
relaxation frequency band.
[0050]
FIGS. 9A to 9C are diagrams each showing an example of the relationship between the
reference range and the non-suppression range corrected for a relaxation frequency band.
In FIGS. 9A to 9C, the horizontal axis represents frequency, and the vertical axis represents
phase difference. In the example of FIG. 9A, the frequency bands equal to or lower than the
frequency f1 are notified as relaxation frequency bands. In this example, for all frequency
bands equal to or lower than f1, the non-suppression range 901 is widened to cover phase
differences from −π to π. For frequency bands higher than f1, the non-suppression range 901
is narrowed linearly so that its width matches the width of the reference range 900 at a
frequency f2 that is higher than f1 by a predetermined offset. The predetermined offset is set
to, for example, 50 Hz to 100 Hz, or to the frequency f1 multiplied by 0.1 to 0.2.
[0051]
In the example of FIG. 9B as well, the frequency bands equal to or lower than the frequency f1
are notified as relaxation frequency bands. In this case, at the frequency f1 the
non-suppression range 911 is expanded beyond the upper and lower limits of the phase
difference of the reference range 910 by a preset phase difference width d. Furthermore, the
width by which the non-suppression range is extended is set so as to decrease linearly and
monotonically with increasing frequency, from the minimum frequency to the maximum
frequency of the first and second frequency signals.
[0052]
In the example of FIG. 9C as well, the frequency bands equal to or lower than the frequency f1
are notified as relaxation frequency bands. In this case, at the frequency f1 the
non-suppression range 921 is expanded beyond the upper and lower limits of the phase
difference of the reference range 920 by a preset phase difference width d. Furthermore, the
width by which the non-suppression range is extended is proportional to the reciprocal of the
frequency, and thus decreases monotonically as the frequency increases, from the minimum
frequency to the maximum frequency of the first and second frequency signals. For example,
the expansion width d is set to (a / f + b), where a and b are positive constants.
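The FIG. 9C scheme can be sketched as follows. The reference range is taken as the interval of width δf centered on Cf<n> (paragraph [0048]); the function signature and parameter names are assumptions made for illustration:

```python
def non_suppression_bounds(f, c_ref, delta_f, f1, a, b):
    """Return (lower, upper) limits of the non-suppression range at
    frequency f. For relaxation bands (f <= f1) both limits of the
    reference range are pushed outward by d = a/f + b, which decays
    with frequency as 1/f."""
    lower = c_ref - delta_f / 2.0
    upper = c_ref + delta_f / 2.0
    if f <= f1:
        d = a / f + b            # expansion width, largest at low frequency
        lower -= d
        upper += d
    return lower, upper
```

Above f1 the reference range is returned unchanged; below f1 the expansion shrinks smoothly toward b as frequency rises.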
[0053]
03-05-2019
18
The width d by which the non-suppression range is expanded may be determined based on the
absolute value of the amount by which the actually measured phase difference deviates from
the phase difference range for the target direction range. In this case, for each sub-direction
range, when the measured phase difference DPPf is larger than the phase difference range for
that sub-direction range, the detection unit 13 determines the difference DDUf<n> (= DPPf −
UPTf<n>) between DPPf and the upper limit value UPTf<n> of the phase difference range. The
detection unit 13 then obtains the maximum value MaxDDUf<n> of DDUf<n> for each
sub-direction range. Similarly, for each sub-direction range, when DPPf is smaller than the
phase difference range for that sub-direction range, the detection unit 13 determines the
difference DDLf<n> (= DPPf − LWTf<n>) between DPPf and the lower limit value LWTf<n> of
the phase difference range, and obtains the minimum value MinDDLf<n> of DDLf<n> for each
sub-direction range. The detection unit 13 then notifies the suppression range setting unit 14
of MinDDLf<n> and MaxDDUf<n> of each relaxation frequency band for the target direction
range. The suppression range setting unit 14 sets, as the width d by which the non-suppression
range is extended, the larger of the absolute values |MinDDLf<n>| and |MaxDDUf<n>| for the
relaxation frequency band.
[0054]
If |MinDDLf<n>| in the relaxation frequency band is 0, the suppression range setting unit 14
may extend only the upper limit of the phase difference of the non-suppression range according
to any of the above methods. Similarly, when |MaxDDUf<n>| in the relaxation frequency band is
0, the suppression range setting unit 14 may extend only the lower limit of the phase difference
of the non-suppression range according to any of the above methods.
[0055]
Furthermore, the suppression range setting unit 14 may determine the width d by which the
non-suppression range is expanded as a function of frequency. In this case, sets of coefficients,
each defining one of a plurality of functions that define the width d, are stored in advance in
the memory circuit included in the suppression range setting unit 14. The suppression range
setting unit 14 then selects a set of coefficients of a function for which |MinDDLf<n>| and
|MaxDDUf<n>| for each of the notified relaxation frequency bands are smaller than the width d.
The suppression range setting unit 14 may then extend the non-suppression range beyond the
reference range according to the selected function.
03-05-2019
19
[0056]
For example, assume that the function d = g(f) of the frequency f is given by g(f) = a × f + b,
where a and b are constants. Assume also that the memory circuit of the suppression range
setting unit 14 stores three pairs of (a, b): (i) (−0.008, 1.0), (ii) (−0.015, 2.0), and (iii) (−0.02,
2.5). Assume further that the relaxation frequency bands f are 2, 3, 4, 5, and 6, and that
MinDDLf<n> and MaxDDUf<n> for each relaxation frequency band take the following values:
f = 2: MinDDL2<n> = −1.2, MaxDDU2<n> = 1.0; f = 3: MinDDL3<n> = −0.2, MaxDDU3<n> = 0.3;
f = 4: MinDDL4<n> = −0.9, MaxDDU4<n> = 1.1; f = 5: MinDDL5<n> = −1.2, MaxDDU5<n> = 1.8;
f = 6: MinDDL6<n> = −1.1, MaxDDU6<n> = 1.5. In this case, for the constant sets (ii) and (iii),
the absolute values of MinDDLf<n> and MaxDDUf<n> for all relaxation frequency bands are
equal to or less than the width d by which the non-suppression range is expanded. Therefore,
the suppression range setting unit 14 selects, of the constant sets (ii) and (iii), the one giving
the smaller width d for each relaxation frequency band, that is, the constant set (ii), and
thereby determines the expansion width d of the non-suppression range for each frequency
band.
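The selection in this numerical example can be reproduced with a short sketch. The candidate pairs and deviation values come from the text; the tie-break between valid candidates by total width is an assumption, chosen to match the stated preference for the smaller width d:

```python
def select_coefficients(candidates, deviations):
    """candidates: (a, b) pairs for d(f) = a*f + b.
    deviations: maps each relaxation band f to max(|MinDDLf<n>|, |MaxDDUf<n>|).
    Keep candidates whose width d(f) covers every deviation, then pick the
    one giving the smaller widths (total width used as the tie-break)."""
    valid = [(a, b) for (a, b) in candidates
             if all(a * f + b >= dev for f, dev in deviations.items())]
    return min(valid, key=lambda ab: sum(ab[0] * f + ab[1] for f in deviations))
```

With the values of paragraph [0056], set (i) fails at f = 2 (d = 0.984 < 1.2), and of the remaining sets (ii) and (iii), set (ii) gives the smaller widths, matching the text.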
[0057]
In each of the above examples, frequency bands lower than a predetermined frequency are
taken as the relaxation frequency bands. This is because, in general, long-wavelength sound is
more susceptible to reflection and the like, so the measured phase difference is more likely to
disagree with the phase difference corresponding to the sound source direction. However, the
suppression range setting unit 14 may widen the phase difference width of the
non-suppression range in a relaxation frequency band beyond the phase difference width of the
reference range according to a rule different from the above examples. For example, the
suppression range setting unit 14 may simply extend the phase difference width of the
reference range by a predetermined phase difference width d for each notified relaxation
frequency band. The width d may also be set to the larger of |MaxDDUf<n>| and |MinDDLf<n>|
described above.
[0058]
The suppression range setting unit 14 notifies the suppression function calculation unit 15 of the
non-suppression range.
[0059]
The suppression function calculation unit 15 calculates a suppression function for suppressing
audio signal components arriving from directions different from the direction in which the
sound source to be collected is located.
The suppression function is therefore set, for example, as a gain value G(f, Δθf) indicating the
degree to which the signal is attenuated according to the phase difference Δθf between the
first frequency signal and the second frequency signal for each frequency band. The
suppression function calculation unit 15 sets, for example, the gain value G(f, Δθf) in the
frequency band f as follows: G(f, Δθf) = 0 (Δθf is in the non-suppression range); G(f, Δθf) =
10 (Δθf is out of the non-suppression range).
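The binary gain of paragraph [0059] can be sketched directly; the interval representation of the non-suppression range is an assumption for illustration:

```python
def gain(delta_theta, lower, upper):
    """Binary suppression gain: G(f, dtheta) is 0 when the phase
    difference lies inside the non-suppression range [lower, upper]
    and 10 otherwise."""
    return 0.0 if lower <= delta_theta <= upper else 10.0
```

A larger gain value here means stronger attenuation, as paragraph [0063] makes explicit.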
[0060]
Alternatively, the suppression function calculation unit 15 may obtain the suppression function
by another method. For example, the suppression function calculation unit 15 may calculate,
for each frequency band, the probability that the sound source to be collected is present in the
specific direction according to the method disclosed in Japanese Patent Laid-Open No.
2007-318528, and calculate the suppression function based on that probability. In this case as
well, the suppression function calculation unit 15 makes the gain value G(f, Δθf) when the
phase difference Δθf is included in the non-suppression range smaller than the gain value
G(f, Δθf) when the phase difference Δθf is out of the non-suppression range.
[0061]
Further, the suppression function calculation unit 15 may monotonically increase the gain value
G(f, Δθf) for a phase difference out of the non-suppression range as the absolute value of the
difference between the phase difference and the upper or lower limit of the non-suppression
range increases.
[0062]
The suppression function calculation unit 15 passes the gain value G (f, Δθf) of each frequency
band to the signal correction unit 16.
[0063]
The signal correction unit 16 corrects the first and second frequency signals according to the
following equation, based on the phase difference Δθf between the first and second frequency
signals received from the phase difference calculation unit 12 and the gain value G(f, Δθf)
received from the suppression function calculation unit 15. Here, X(f) represents the first or
second frequency signal, Y(f) represents the corrected first or second frequency signal, and f
represents a frequency band. As is apparent from equation (3), Y(f) decreases as the gain value
G(f, Δθf) increases. Therefore, the first and second frequency signals are attenuated by the
signal correction unit 16 when the phase difference Δθf is out of the non-suppression range.
The signal correction unit 16 is not limited to equation (3), and may correct the first and
second frequency signals according to another function that attenuates the first and second
frequency signals whose phase difference is out of the non-suppression range. The signal
correction unit 16 passes the corrected first and second frequency signals to the
frequency-time conversion unit 17.
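Equation (3) itself is not reproduced in this text. A common form consistent with the statement that Y(f) decreases as G(f, Δθf) increases treats G as an attenuation in decibels; this is an assumption made for illustration, not the patent's own equation:

```python
def correct(x_f, g):
    """Assumed form of equation (3): Y(f) = X(f) * 10**(-G/20),
    where g is the suppression gain in dB. g = 0 leaves the
    frequency component unchanged; larger g attenuates it."""
    return x_f * 10.0 ** (-g / 20.0)
```

With the binary gain of paragraph [0059], components in the non-suppression range pass unchanged (g = 0) and components outside it are attenuated by 10 dB.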
[0064]
The frequency-time conversion unit 17 obtains the corrected first and second audio signals by
converting the corrected first and second frequency signals into time-domain signals using the
inverse of the time-frequency transform used by the time-frequency conversion unit 11. As a
result, sound arriving from directions other than that of the sound source to be collected is
attenuated, so the corrected first and second audio signals make it easier to hear the sound
from the sound source to be collected.
[0065]
FIG. 10 is an operation flowchart of the audio processing performed by the audio processing
device 6. The audio processing device 6 acquires the first and second audio signals (step S201)
and passes them to the time-frequency conversion unit 11. The time-frequency conversion unit
11 converts the first and second audio signals into first and second frequency signals in the
frequency domain (step S202). The time-frequency conversion unit 11 then passes the first and
second frequency signals to the phase difference calculation unit 12 and the signal correction
unit 16.
[0066]
The phase difference calculation unit 12 calculates the phase difference Δθ f between the first
frequency signal and the second frequency signal for each of the plurality of frequency bands
(step S203). Then, the phase difference calculation unit 12 passes the phase difference Δθ f of
each frequency band to the detection unit 13 and the signal correction unit 16.
[0067]
The detection unit 13 sets a relaxation frequency band based on the phase difference Δθ f of
each frequency band (step S204). The detection unit 13 also estimates the sound source
direction (step S205). Then, the detection unit 13 notifies the suppression range setting unit 14
of the relaxation frequency band and the estimated sound source direction. The suppression
range setting unit 14 sets the non-suppression range for each frequency band such that the nonsuppression range of the relaxation frequency band is wider than the reference range (step
S206). Then, the suppression range setting unit 14 notifies the suppression function calculation
unit 15 of the non-suppression range. The suppression function calculation unit 15 determines a
suppression function that attenuates the first and second frequency signals having a phase
difference outside the non-suppression range for each frequency band (step S207). Then, the
suppression function calculation unit 15 passes the suppression function to the signal correction
unit 16.
[0068]
The signal correction unit 16 corrects the frequency signals by multiplying the first and second
frequency signals by the suppression function (step S208). At this time, the signal correction
unit 16 attenuates the first and second frequency signals when the phase difference Δθf is out
of the non-suppression range. The signal correction unit 16 then outputs the corrected first
and second frequency signals to the frequency-time conversion unit 17.
[0069]
The frequency-time conversion unit 17 converts the corrected first and second frequency signals
into corrected first and second audio signals in the time domain (step S209). Then, the audio
processing device 6 outputs the corrected first and second audio signals, and then ends the audio
processing.
[0070]
As described above, this audio processing apparatus extends the non-suppression range in
frequency bands in which, owing to individual differences among the audio input units or their
installation environment, a phase difference different from the phase difference corresponding
to the direction of the sound source to be collected is measured. The audio processing
apparatus can thereby prevent distortion of the sound from the sound source to be collected
and make that sound easier to hear.
[0071]
Next, a voice processing apparatus according to a second embodiment will be described. The
speech processing apparatus according to the second embodiment sets the relaxation frequency
band in a state where the direction of the sound source to be collected is known in advance.
[0072]
The voice processing apparatus according to the second embodiment is implemented, for
example, in a voice input system in which the direction of the sound source is specified in
advance, such as an on-vehicle hands-free phone. Alternatively, the voice processing apparatus
according to the second embodiment determines the relaxation frequency band for each
sub-direction range at the time of calibration, and, when performing voice processing,
determines the suppression range based on the relaxation frequency band determined at the
time of calibration.
[0073]
The speech processing apparatus according to the second embodiment differs from the speech
processing apparatus according to the first embodiment in the processing performed by the
detection unit 13. Therefore, the detection unit 13 will be described below. For the other
components of the speech processing device according to the second embodiment, refer to the
description of the corresponding components of the speech processing device according to the
first embodiment.
[0074]
In the present embodiment, the detection unit 13 receives, for example, the direction of the
sound source to be collected from the control unit 7 of the voice input system 1 in which the
voice processing device 6 is mounted. Then, the detection unit 13 specifies a sub-direction range
in which the direction of the sound source desired to be collected is included among the plurality
of sub-direction ranges as the sub-direction range of interest.
[0075]
FIG. 11 is an operation flowchart of the relaxation frequency band setting process performed
by the detection unit 13 of the speech processing apparatus according to the second
embodiment. The detection unit 13 calculates, for each frequency band, an evaluation value
indicating whether or not the phase difference Δθf is included in the phase difference range,
only for the sub-direction range of interest (step S301). The detection unit 13 then updates the
achievement rate ARPf<n0>(t) based on the evaluation value for each frequency band, only for
the sub-direction range of interest (step S302). Here, n0 is the index of the sub-direction range
of interest. The detection unit 13 then obtains, for each frequency band, the maximum value
MAXARPf<n0> of the achievement rate in the latest predetermined number of frames (step
S303).
[0076]
The detection unit 13 compares, for each frequency band, the maximum value MAXARPf<n0>
of the achievement rate with a predetermined threshold Th3, and sets each frequency band for
which MAXARPf<n0> is equal to or less than Th3 as a relaxation frequency band (step S304).
The threshold Th3 is set to, for example, the lower limit that the achievement rate can take
when sound from the sound source direction of interest continues for a period corresponding
to the number of frames used to calculate the achievement rate. The detection unit 13 notifies
the suppression range setting unit 14 of the relaxation frequency bands for the sub-direction
range of interest. The suppression range setting unit 14 sets the non-suppression range for the
sub-direction range of interest, and the suppression function calculation unit 15 determines
the suppression function based on the non-suppression range.
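Step S304 reduces to a simple per-band test once the direction of interest is fixed. The data layout (achievement rates stored per band over the latest frames) is an assumption for illustration:

```python
def relaxation_bands_of_interest(arp_n0, th3):
    """arp_n0[f] holds the achievement rates ARPf<n0>(t) of the
    sub-direction range of interest for band f over the latest frames.
    Bands whose maximum achievement rate never exceeds Th3 become
    relaxation frequency bands."""
    return [f for f, rates in enumerate(arp_n0) if max(rates) <= th3]
```

Compared with the first embodiment, only one sub-direction range is evaluated, which is the source of the computational saving noted in paragraph [0078].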
[0077]
When calibration processing is executed for the voice input system in which this voice
processing device is implemented, relaxation frequency bands may be determined sequentially
for each sub-direction range in the calibration processing. In this case, the signal correction unit
16 may store the suppression function determined based on the relaxation frequency band for
each sub direction range in the non-volatile memory circuit included in the signal correction unit
16. Then, when the audio processing is performed, the processing of step S204 of the audio
processing illustrated in FIG. 10 may be omitted. Furthermore, in the voice input system in which
the voice processing device is implemented, when the direction of the sound source to be
collected is limited to one sub-direction range, the process of step S205 may be omitted.
[0078]
According to this embodiment, since the direction of the sound source is known in advance when
determining the relaxation frequency band, the speech processing device may obtain the
achievement rate only for the direction of the sound source. Therefore, this voice processing
device can reduce the amount of calculation for determining the relaxation frequency band.
[0079]
According to a modification, when identifying the relaxation frequency bands, the speech
processing device may compare the achievement rate itself with the threshold Th3, instead of
comparing the maximum value of the achievement rate in the sub-direction range of interest
with Th3. In this embodiment, since the position of the sound source is assumed not to
fluctuate much over time, the temporal change of the achievement rate is also small.
[0080]
Next, a speech processing apparatus according to a third embodiment will be described. The
speech processing apparatus according to the third embodiment determines the relaxation
frequency band based on the speech signal only when the ratio of the noise component to the
whole of the inputted speech signal is low.
[0081]
FIG. 12 is a schematic block diagram of the speech processing apparatus according to the third
embodiment. The speech processing apparatus 61 according to the third embodiment includes
a time-frequency conversion unit 11, a phase difference calculation unit 12, a detection unit 13,
a suppression range setting unit 14, a suppression function calculation unit 15, a signal
correction unit 16, a frequency-time conversion unit 17, a noise level determination unit 18,
and a determination unit 19. In FIG. 12, the components of the voice processing device 61 are
given the same reference numerals as the corresponding components of the voice processing
device 6 shown in FIG.
[0082]
The speech processing apparatus according to the third embodiment differs from the speech
processing apparatus according to the first embodiment in that a noise level determination unit
18 and a determination unit 19 are included. Therefore, the noise level determination unit 18
and the determination unit 19 will be described below. For other components of the speech
processing device according to the third embodiment, refer to the description of the
corresponding components of the speech processing device according to the first embodiment.
[0083]
The noise level determination unit 18 estimates a stationary noise model based on the audio
signals collected by the audio input units 2-1 and 2-2, and thereby determines the level of
noise included in the first and second audio signals. In general, the distance from each audio
input unit to the noise source is longer than the distance from each audio input unit to the
sound source to be collected. Therefore, the power of the noise component is smaller than the
power of the sound emitted from the sound source to be collected. The noise level
determination unit 18 therefore calculates the estimated noise spectrum of the stationary noise
model by averaging the power for each frequency band over frames in which the power
spectrum of either the first or the second audio signal input to the audio processing device 61
is small. Specifically, every time the noise level determination unit 18 receives the first and
second frequency signals of a frame from the time-frequency conversion unit 11, it calculates
the average value p of the power spectrum of one of the first and second frequency signals
according to the following equation. Here, M is the number of frequency bands, flow represents
the lowest frequency band, fhigh represents the highest frequency band, and S(f) is the first or
second frequency signal. Although the power spectrum may be calculated from either the first
or the second frequency signal, it is assumed here that it is calculated for the first frequency
signal.
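The averaging equation is not reproduced in this text. A sketch consistent with the description, assuming p is the band-averaged power expressed in dB so that it can be compared with the threshold Thr of 10 dB to 20 dB:

```python
import math

def mean_power_db(spectrum):
    """Assumed form: p = 10 * log10((1/M) * sum(|S(f)|^2)) over the
    bands from flow to fhigh, where spectrum holds the complex
    frequency-signal values S(f) of one frame."""
    m = len(spectrum)
    p_lin = sum(abs(s) ** 2 for s in spectrum) / m
    return 10.0 * math.log10(p_lin)
```

Frames whose p falls below Thr are treated as noise-only and feed the stationary noise estimate; the dB scaling is an assumption, since the patent only states the threshold range.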
[0084]
Next, the noise level determination unit 18 compares the average value p of the power
spectrum of the latest frame with a threshold Thr corresponding to the upper limit of the
power of the noise component. The threshold Thr is set to, for example, a value in the range of
10 dB to 20 dB. When the average value p is less than the threshold Thr, the noise level
determination unit 18 calculates the estimated noise spectrum Nm(f) for the latest frame by
averaging the power spectrum in the time direction for each frequency band according to the
following equation. Here, Nm-1(f) is the estimated noise spectrum for the frame immediately
preceding the latest frame, and is read from a buffer of the noise level determination unit 18.
The coefficient β is a forgetting coefficient, and is set to, for example, a value of 0.9 to 0.99.
On the other hand, when the average value p is equal to or greater than the threshold Thr, the
latest frame is estimated to contain components other than noise, so the noise level
determination unit 18 does not update the estimated noise spectrum; that is, it sets
Nm(f) = Nm-1(f).
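The update equation itself is not shown in this text; a standard forgetting-factor form consistent with the description is assumed here:

```python
def update_noise_spectrum(prev_n, spectrum, p, thr, beta=0.95):
    """Assumed form of the paragraph [0084] update:
    Nm(f) = beta * Nm-1(f) + (1 - beta) * |S(f)|^2, applied only when
    the frame's average power p stays below Thr; otherwise the previous
    estimate is kept unchanged (Nm(f) = Nm-1(f))."""
    if p >= thr:
        return list(prev_n)      # frame likely contains non-noise sound
    return [beta * n + (1.0 - beta) * abs(s) ** 2
            for n, s in zip(prev_n, spectrum)]
```

With β close to 1, the estimate tracks slow changes in the noise floor while ignoring isolated loud frames.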
[0085]
Alternatively, instead of calculating the average value p of the power spectrum, the noise level
determination unit 18 may obtain the maximum value among the power spectra of all frequency
bands and compare the maximum value with the threshold value Thr. Also, especially when the
noise is white noise, there is no correlation of the power spectrum between the frames.
03-05-2019
28
Therefore, the noise level determination unit 18 may update the noise level only when the cross-correlation value of the power spectrum across all frequency bands between the latest frame and the immediately preceding frame is equal to or less than a predetermined threshold, for example 0.1.
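This white-noise check can be sketched as a normalized cross-correlation between consecutive power spectra. The function name and the small ε guard against division by zero are my own; the 0.1 threshold follows the text:

```python
import numpy as np

def frames_uncorrelated(power_cur, power_prev, thr=0.1):
    """Return True when the power spectra of the latest and the
    immediately preceding frame are essentially uncorrelated across
    all frequency bands, as expected for white noise."""
    a = power_cur - power_cur.mean()
    b = power_prev - power_prev.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12
    corr = float((a * b).sum() / denom)
    return corr <= thr
```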
[0086]
The noise level determination unit 18 outputs the estimated noise spectrum to the determination
unit 19. Also, the noise level determination unit 18 stores the estimated noise spectrum for the
latest frame in the buffer of the noise level determination unit 18.
[0087]
Each time the determination unit 19 receives the first and second frequency signals of each
frame, the determination unit 19 determines whether the first and second frequency signals of
the frame include the sound from the sound source desired to be collected. Therefore, the
determination unit 19 determines the ratio (p / np) between the average value p of the power
spectrum of the first and second frequency signals for which the estimated noise spectrum is
being calculated and the average value np of the estimated noise spectrum. Ask for). When the
ratio (p / np) is higher than a predetermined threshold, the determination unit 19 determines
that the first and second frequency signals of the frame include the sound from the sound source
desired to be collected. . Then, the determination unit 19 passes the first and second frequency
signals to the phase difference calculation unit 12 and the signal correction unit 16. Then, using
the first and second frequency signals of the frame, the voice processing device 61 determines
the relaxation frequency band and the non-suppression range as in the first embodiment, and
suppresses according to the non-suppression range. The first and second frequency signals are
corrected according to the function. On the other hand, when the ratio (p / np) is equal to or less
than the predetermined threshold value, the determination unit 19 determines that the first and
second frequency components of the frame are included because the first and second frequency
components include many noise components. The frequency signal is not used to determine the
relaxation frequency band and the non-suppression range. Then, the voice processing device 61
corrects the first and second frequency signals based on the suppression function obtained for
the frame before that frame. Alternatively, the audio processing device 61 may not correct the
first and second frequency signals for a frame whose ratio (p / np) is less than or equal to a
predetermined threshold. The predetermined threshold is set to, for example, 2 to 5.
[0088]
According to this embodiment, the speech processing device determines the non-suppression range and the suppression function based on the speech signal of frames with relatively small noise components, so a more appropriate non-suppression range and suppression function can be determined.
[0089]
Next, a voice processing device according to a fourth embodiment will be described.
The speech processing apparatus according to the fourth embodiment determines the threshold Th1 for the average value AVMAXARPf of the maximum values of the achievement rates, i.e., the rates at which the phase difference Δθf is included in the phase difference range in the latest predetermined number of frames, based on the distribution of the maximum values of the achievement rates across the frequency bands.
[0090]
The speech processing apparatus according to the fourth embodiment differs from the speech
processing apparatus according to the first embodiment in the processing performed by the
detection unit 13. Therefore, the detection unit 13 will be described below. For other
components of the speech processing device according to the fourth embodiment, refer to the
descriptions of corresponding components of the speech processing device according to the first
embodiment.
[0091]
When the microphones of the first and second voice input units are ideal and are installed in an ideal environment in which reverberation and the like can be ignored, the value of the phase difference between the first audio signal and the second audio signal for the sound from a sound source located in a specific direction is approximately the theoretical value. Therefore, for most frames, the calculated phase difference Δθf is included in the phase difference range for the specific sub-direction range containing that direction, and is not included in the phase difference ranges for the other sub-direction ranges. As a result, the achievement rate for the specific sub-direction range becomes a value close to 1, and the achievement rates for the other sub-direction ranges become values close to 0. Therefore, with such an ideal microphone and an ideal installation environment, the maximum value and the minimum value of the achievement rates over all frequency bands satisfy the following relationship: minimum value of achievement rate ≥ (1.0 − maximum value of achievement rate)
[0092]
However, when the value of the phase difference between the first audio signal and the second audio signal deviates from the theoretical value because of individual differences between the microphones of the audio input units 2-1 and 2-2 or because of the installation environment around the microphones, the achievement rate may be low for every sub-direction range. As a result, the minimum value of the achievement rate becomes smaller than (1.0 − the maximum value of the achievement rate). Therefore, the detection unit 13 obtains the maximum value among the achievement rates of all frequency bands. The detection unit 13 then sets, as the threshold Th1 for the average value of the maximum achievement rates, either (1.0 − the maximum value of the achievement rate) itself, or (1.0 − the maximum value of the achievement rate) multiplied by a coefficient in the range of 0.8 to 1.0.
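Under the reading above — Th1 is (1.0 − overall maximum achievement rate), optionally scaled by a coefficient in the 0.8 to 1.0 range — the computation can be sketched as follows (names are hypothetical):

```python
import numpy as np

def threshold_th1(max_rates, coeff=0.9):
    """Derive the threshold Th1 for the average of the per-band maximum
    achievement rates from the overall maximum achievement rate.

    max_rates : maximum achievement rate MAXARPf of each frequency band
    coeff     : scaling coefficient, assumed to be in the range 0.8 to 1.0
    """
    overall_max = float(np.max(max_rates))
    # Th1 = coeff * (1.0 - overall maximum of the achievement rate)
    return coeff * (1.0 - overall_max)
```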
[0093]
According to this embodiment, the speech processing device determines the threshold Th1 for the average value AVMAXARPf of the maximum achievement rates, which is used to identify the relaxation frequency band, based on the distribution of the achievement rates. Therefore, the speech processing apparatus can determine the threshold Th1 appropriately.
[0094]
Next, a voice processing apparatus according to a fifth embodiment will be described. The speech processing apparatus according to the fifth embodiment determines the variance threshold Th2 for the variance VMAXARPf of the maximum value of the achievement rate, i.e., the rate at which the phase difference Δθf is included in the phase difference range of each sub-direction range, based on the distribution of the variances of the maximum achievement rates over all frequency bands.
[0095]
The speech processing apparatus according to the fifth embodiment differs from the speech
processing apparatus according to the first embodiment in the processing performed by the
detection unit 13. Therefore, the detection unit 13 will be described below. For the other
components of the speech processing device according to the fifth embodiment, refer to the
descriptions of the corresponding components of the speech processing device according to the
first embodiment.
[0096]
As described above, the value of the phase difference between the first audio signal and the second audio signal may deviate from the theoretical value because of individual differences between the microphones of the audio input units 2-1 and 2-2 or because of the installation environment around the microphones. For such cases, the inventor obtained the finding that, in the distribution of the variance of the maximum achievement rate over the frequency bands, a local minimum of the frequency (count) tends to exist below the mode or the median of the variance. Furthermore, it was found that, in frequency bands whose variance is smaller than the variance corresponding to that local minimum, the phase difference calculated by the phase difference calculation unit fluctuates over time, and the achievement rate tends to be low for every sub-direction range. Therefore, the detection unit 13 obtains, for each frame, the variance of the maximum achievement rate MAXARPf of each frequency band and creates a histogram of the variances. The detection unit 13 then identifies the variance value at which the frequency reaches a local minimum at or below the mode or median of the variances, and sets that variance value as the variance threshold Th2 for the frame. Note that the detection unit 13 may obtain the distribution of the variance of the maximum achievement rate MAXARPf of each frequency band over not just one frame but a plurality of the most recent frames.
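The histogram-based selection of Th2 can be sketched roughly as follows; the binning, the local-minimum search, and the fallback to the mode bin are all assumptions, since the embodiment does not fix them:

```python
import numpy as np

def threshold_th2(variances, bins=20):
    """Pick the variance threshold Th2 as the variance value at which the
    histogram frequency reaches a local minimum at or below the mode.

    variances : per-band variance of the maximum achievement rate MAXARPf
    """
    hist, edges = np.histogram(variances, bins=bins)
    mode_bin = int(np.argmax(hist))  # bin of the mode of the variances
    # Search below the mode for a local minimum of the frequency (count).
    best = mode_bin
    for i in range(1, mode_bin):
        if hist[i] < hist[i - 1] and hist[i] <= hist[i + 1]:
            best = i
            break
    # Use the bin centre as the variance value for Th2.
    return 0.5 * (edges[best] + edges[best + 1])
```

With a bimodal variance distribution (a small low-variance cluster of unreliable bands plus a main cluster), the returned value falls between the two clusters, separating the bands whose phase difference fluctuates from the rest.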
[0097]
Further, in this embodiment, the detection unit 13 may also determine the threshold value Th1
for the average value of the maximum values of the achievement rates based on the distribution
of the maximum values of the achievement rates, as in the fourth embodiment.
[0098]
According to this embodiment, the speech processing device determines the variance threshold Th2 for the variance VMAXARPf of the maximum achievement rate, which is used to specify the relaxation frequency band, based on the distribution of the variances of the maximum achievement rates.
Therefore, the speech processing apparatus can determine the variance threshold Th2 appropriately.
[0099]
Note that, as a modification of each of the above embodiments, the audio processing device may output only one of the first and second audio signals as a monaural audio signal. In this case, the signal correction unit of the voice processing device may correct only one of the first and second frequency signals based on the suppression function.
[0100]
Further, according to another modification, instead of, or in addition to, attenuating the first and second frequency signals whose phase difference is outside the non-suppression range, the signal correction unit may emphasize the first and second frequency signals whose phase difference is within the non-suppression range.
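As a rough illustration of this modification, a per-band gain could emphasize components whose phase difference lies inside the non-suppression range while optionally attenuating the rest. The gain values and names below are illustrative, not taken from the embodiments:

```python
import numpy as np

def correct_spectrum(freq_sig, phase_diff, lo, hi, att=0.1, emph=2.0):
    """Apply a per-band gain: emphasize bands whose phase difference
    falls inside the non-suppression range [lo, hi], attenuate the rest.

    freq_sig   : complex frequency signal of one channel
    phase_diff : per-band phase difference between the two channels
    """
    inside = (phase_diff >= lo) & (phase_diff <= hi)
    gain = np.where(inside, emph, att)
    return freq_sig * gain
```

Setting att=1.0 would give the pure-emphasis variant described in the paragraph above, leaving out-of-range components untouched.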
[0101]
Furthermore, a computer program that causes a computer to realize the functions of the processing units of the voice processing apparatus according to each of the above embodiments may be provided in a form recorded on a computer-readable medium such as a magnetic recording medium or an optical recording medium.
[0102]
All examples and specific terms cited herein are intended for instructional purposes, to help the reader understand the concepts contributed by the inventor to the present invention and to the promotion of the art. The present invention is not to be limited to the construction of any of the examples herein, nor to the specifically listed examples and conditions relating to the superiority and inferiority of the present invention.
Although embodiments of the present invention have been described in detail, it should be understood that various changes, substitutions, and modifications can be made thereto without departing from the spirit and scope of the present invention.
[0103]
The following appendices are further disclosed regarding the embodiments and their modifications described above.
(Supplementary Note 1) A voice processing apparatus comprising: a time-frequency conversion unit that converts a first audio signal representing the sound collected by a first audio input unit and a second audio signal representing the sound collected by a second audio input unit into, respectively, a first frequency signal and a second frequency signal in the frequency domain, for each frame having a predetermined time length; a phase difference calculation unit that calculates, for each frame, the phase difference between the first frequency signal and the second frequency signal for each of a plurality of frequency bands; a detection unit that, for each of the plurality of frequency bands, determines whether the phase difference is included within a first range of phase differences that can be taken with respect to the direction of a sound source, thereby obtains the rate at which the phase difference is included in the first range over a predetermined number of the frames, and detects, among the plurality of frequency bands, a frequency band in which that rate does not satisfy the condition corresponding to the sound from the direction of the sound source; a range setting unit that sets, for the frequency band detected by the detection unit, a second range obtained by extending the first range for the direction of the sound source; a signal correction unit that obtains corrected first and second frequency signals by making the amplitude of at least one of the first and second frequency signals when the phase difference is included in the second range larger than the amplitude of that frequency signal when the phase difference deviates from the second range; and a frequency-time conversion unit that converts the corrected first and second frequency signals into corrected first and second audio signals in the time domain.
(Supplementary Note 2) The voice processing device according to appendix 1, wherein the detection unit determines, among the plurality of frequency bands, a frequency band in which the rate is equal to or less than a first threshold to be a frequency band in which the rate does not satisfy the condition. (Supplementary Note 3) The speech processing apparatus according to appendix 1, wherein the detection unit obtains, for each of the plurality of frequency bands, the maximum value of the rate over the predetermined number of frames for each of the directions of a plurality of sound sources, and determines, among the plurality of frequency bands, a frequency band in which the average of the maximum values over the directions of the plurality of sound sources is equal to or less than a second threshold and the variance of the maximum values over the directions of the plurality of sound sources is equal to or less than a third threshold to be a frequency band in which the rate does not satisfy the condition.
(Supplementary Note 4) The speech processing device according to appendix 3, wherein the detection unit sets the second threshold to the lower limit value that the average value can take when sound from one of the directions of the plurality of sound sources continues for the predetermined number of frames.
(Supplementary Note 5) The speech processing device according to appendix 3, wherein the detection unit sets the third threshold to the lower limit value that the variance can take when sound from one of the directions of the plurality of sound sources continues for the predetermined number of frames. (Supplementary Note 6) The voice processing device according to any one of appendices 1 to 5, wherein, for the frequency band detected by the detection unit, the range setting unit sets the second range by extending the first range by an amount equal to or greater than the maximum amount by which the phase difference deviates from the first range in the predetermined number of frames of that frequency band. (Supplementary Note 7) The voice processing device according to any one of appendices 1 to 6, wherein the signal correction unit obtains the corrected first and second frequency signals by attenuating the amplitude of at least one of the first and second frequency signals when the phase difference deviates from the second range. (Supplementary Note 8) The voice processing device according to any one of appendices 1 to 6, wherein the signal correction unit obtains the corrected first and second frequency signals by amplifying the amplitude of at least one of the first and second frequency signals when the phase difference is included in the second range. (Supplementary Note 9) A voice processing method comprising: converting a first audio signal representing the sound collected by a first audio input unit and a second audio signal representing the sound collected by a second audio input unit into, respectively, a first frequency signal and a second frequency signal in the frequency domain, for each frame having a predetermined time length; calculating, for each frame, the phase difference between the first frequency signal and the second frequency signal for each of a plurality of frequency bands; determining, for each of the plurality of frequency bands, whether the phase difference is included within a first range of phase differences that can be taken with respect to the direction of a sound source, thereby obtaining the rate at which the phase difference is included in the first range over a predetermined number of frames, and detecting, among the plurality of frequency bands, a frequency band in which that rate does not satisfy the condition corresponding to the sound from the direction of the sound source; setting, for the detected frequency band, a second range expanded from the first range for the direction of the sound source; obtaining corrected first and second frequency signals by making the amplitude of at least one of the first and second frequency signals when the phase difference is included in the second range larger than the amplitude of that frequency signal when the phase difference deviates from the second range; and converting the corrected first and second frequency signals into corrected first and second audio signals in the time domain.
(Supplementary Note 10) A computer program for voice processing that causes a computer to execute: converting a first audio signal representing the sound collected by a first audio input unit and a second audio signal representing the sound collected by a second audio input unit into, respectively, a first frequency signal and a second frequency signal in the frequency domain, for each frame having a predetermined time length; calculating, for each frame, the phase difference between the first frequency signal and the second frequency signal for each of a plurality of frequency bands; determining, for each of the plurality of frequency bands, whether the phase difference is included within a first range of phase differences that can be taken with respect to the direction of a sound source, thereby obtaining the rate at which the phase difference is included in the first range over a predetermined number of frames, and detecting, among the plurality of frequency bands, a frequency band in which that rate does not satisfy the condition corresponding to the sound from the direction of the sound source; setting, for the detected frequency band, a second range expanded from the first range for the direction of the sound source; obtaining corrected first and second frequency signals by making the amplitude of at least one of the first and second frequency signals when the phase difference is included in the second range larger than the amplitude of that frequency signal when the phase difference deviates from the second range; and converting the corrected first and second frequency signals into corrected first and second audio signals in the time domain.
[0104]
1 voice input system; 2-1, 2-2 voice input unit; 3 analog/digital conversion unit; 4 storage unit; 5 storage medium access device; 6, 61 voice processing unit; 7 control unit; 8 communication unit; 9 output unit; 10 storage medium; 11 time-frequency conversion unit; 12 phase difference calculation unit; 13 detection unit; 14 suppression range setting unit; 15 suppression function calculation unit; 16 signal correction unit; 17 frequency-time conversion unit; 18 noise level calculation unit; 19 determination unit