JP2010026361

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2010026361
PROBLEM TO BE SOLVED: To accurately collect the voice of only a specific speaker, such as a
salesperson in face-to-face sales. A voice collection system extracts and collects a target voice of
interest from a plurality of voices having different directions of arrival. The system comprises a
microphone array 11 having at least first and second microphones 11a and 11b separated by a
predetermined distance. The signals of the sound received by the first and second microphones are
each subjected to a discrete Fourier transform to obtain a plurality of CSP coefficients related to
the direction of arrival of speech, and a plurality of speech signals are detected from the
plurality of CSP coefficients. Then, from the CSP coefficients obtained, a plurality of voice
direction indexes, defined in accordance with the angle between the line segment connecting the
first and second microphones and the arrival direction, are detected, and the target voice signal
is extracted from the detected voice signals based on the detected voice direction indexes.
[Selected figure] Figure 1
Voice collecting method, system and program
[0001]
The present invention relates to a voice collection method, system and program for collecting a
specific voice. In particular, the present invention relates to a voice collection method, system
and program for collecting only the voices of sales personnel in face-to-face sales.
[0002]
04-05-2019
1
In recent years, the trust of consumers and customers may be lost due to unlawful or antisocial
acts by companies, and a great deal of corporate effort is required to restore credibility once it
is lost. Moreover, such incidents can have a major impact on business survival. For this reason,
establishing a so-called compliance system has become an urgent issue for companies. For example,
in the financial services industry, sales activities are monitored as part of efforts to enhance
compliance. In the case of sales activities conducted by telephone, a mechanism is adopted in which
the sales staff's responses to calls (the content of the calls) are stored on a server or the like
and checked at random. There are also attempts to automatically detect inappropriate responses by
salespersons by combining speech recognition technology and natural language processing technology.
[0003]
On the other hand, in so-called face-to-face sales, where products are sold at a counter, there is
no mechanism for collecting records of the sales staff's interactions with customers as there is
for telephone sales, so the development of monitoring systems lags behind that for telephone sales.
At present, sales representatives report their sales activities in documents (reports) or the like,
but creating such reports is not only time-consuming; the activities may also not be reported
properly.
[0004]
As a measure for face-to-face sales, a method has been studied in the prior art in which a
salesperson wears a close-talk microphone to record conversations with customers. Although the
purpose is to record only the salesperson's voice, in practice the customer's voice is also
recorded, and many customers object to having the conversation recorded, so this is not necessarily
an appropriate method. It is therefore conceivable to install a (single) directional microphone in
a place not visible to the customer and collect only the salesperson's voice, but a standard
microphone has low directivity and the customer's voice is recorded as well. A shotgun microphone
or the like with superdirectivity could improve the directivity, but shotgun microphones are
generally expensive, and considering their large size, they are not suitable for face-to-face
sales.
[0005]
As an attempt to apply audio signal processing technology, there is known in the prior art a
microphone device in which two nondirectional microphones are arranged in a straight line toward
the speaker and which exhibits strong directivity by means for switching the output signal
depending on the sound pressure level at one of the microphones (see Patent Document 1). There is
also known a technique of detecting a speech section and extracting a speech signal using a
microphone array having a plurality of microphone elements (see Patent Document 2).
JP-A-9-149490 JP-A-2007-86554
[0006]
However, conventional techniques that use software to form directivity, such as the method of
switching the output according to the result of sound pressure level determination described in
Patent Document 1 or the technique using the difference between voice and noise components
described in Patent Document 2, also pick up the customer's voice at the time of recording,
depending on the microphone arrangement, and it is difficult to collect only the salesperson's
voice, excluding the customer's voice, in face-to-face sales.
[0007]
The present invention provides a microphone array arrangement for separating the voices of a
salesperson and a customer in face-to-face sales, a speech enhancement method for improving speech
recognition performance on the separated speech, and speaker direction indexing of interactive
speech using these; it thereby provides a voice collection method, system, and program for
accurately collecting only the voices of sales staff in face-to-face sales. Furthermore, the
present invention provides a method, system, and program for reliably keeping only the voice of
the salesperson, for whom recording is required, without keeping a speech recording of the
customer, whose consent has not been obtained.
[0008]
The present invention, in view of the above problems, includes the following solutions.
[0009]
(Use of difference in arrival time of sound) The present invention uses a microphone array having
two microphone elements arranged at a predetermined distance, and uses the difference in the time
taken for sound to reach these microphone elements from a specific sound source, that is, the time
delay. Furthermore, in the present invention, the line connecting the two microphone elements of
the microphone array is disposed substantially parallel to the line connecting the customer and the
salesperson. For example, viewed from above, the microphone array is disposed on a straight line
connecting the customer and the salesperson. With such an arrangement, the difference in the
arrival time at the two microphone elements of the voice emitted by the customer or the salesperson
approaches its maximum. Therefore, in a situation where a plurality of face-to-face sales booths
are lined up, this arrangement can effectively cut voices and the like from adjacent booths, for
which the arrival time difference at the microphone elements is not the largest. Changes in the
posture or position of the salesperson or customer can be tolerated within the range of
arrangements that can exploit the arrival time difference. Furthermore, a microphone array
generally has the problem that voices from directions arriving in the same phase (with the same
time delay) cannot be distinguished (the mirror-image problem); this arrangement makes it possible
to avoid that problem.
[0010]
(Use of CSP coefficients) The present invention can distinguish the utterances of the customer and
the salesperson by detecting the target speaker's utterance sections based on CSP (Cross-power
Spectrum Phase, i.e., whitened cross-correlation) coefficients, and can recognize the speech of
each separately. At the same time, by combining the speaker direction index obtained by the CSP
method with the time stamps of the speech recognition result, recording of the target speaker's
speech can be simplified and the portions to be recorded can be selectively designated. In other
words, the present invention is characterized by having an interface for designating a speaker and
a recording portion from the direction index and the speech recognition result.
[0011]
(Voice enhancement processing) Further, the present invention achieves high speech recognition
performance by performing gain adjustment, that is, voice enhancement, based on the CSP
coefficients. In the present invention, gain adjustment processing based on CSP coefficients is
linked to a processing procedure that combines spectral subtraction (SS) processing and flooring
processing, which are typical noise removal methods. Specifically, the gain adjustment is performed
between the SS processing and the flooring processing. By this series of processing, voice
enhancement is performed simultaneously with voice separation, and practical speech recognition
performance is realized at low cost as software processing.
[0012]
A computer apparatus with audio signal processing functions, a digital signal processing apparatus,
a digital recording apparatus, or the like can be used to implement the voice collection method
according to the present invention. Such a computer apparatus can be used to carry out the steps of
the voice collection method according to the present invention, such as recording the voice signals
of the salesperson and the customer and calculating CSP coefficients for the recorded voice
signals, but the implementation is not limited to these.
[0013]
The present invention may be combined with existing technologies, such as voice collection
technology that collects only voiced sections, and audio signal processing technology that adjusts
frequency characteristics or gain to improve speech intelligibility and audibility. Such combined
techniques are also within the scope of the present invention. Similarly, a voice collecting device
including the technique of the present invention, a voice collecting function including the
technique of the present invention and incorporated into a portable computer device or the like,
and a voice collecting system in which a plurality of devices including the technique of the
present invention cooperate are all included in the technical scope of the present invention.
Furthermore, the technique of the present invention may be provided in the form of a program that
includes the steps for voice collection and can be stored in an FPGA (field programmable gate
array), an ASIC (application specific integrated circuit), equivalent hardware logic elements, a
programmable integrated circuit, or a combination of these, that is, as a program product.
Specifically, the salesperson voice collecting apparatus according to the present invention can be
provided in the form of a custom LSI (large scale integrated circuit) provided with data
input/output, a data bus, a memory bus, a system bus, etc., and the form of a program product
stored in such a circuit is also within the scope of the present invention.
[0014]
According to the present invention, using a microphone array provided with at least first and
second microphones separated by a predetermined distance, the signals of the sound received by the
first and second microphones are each subjected to a discrete Fourier transform to obtain a
plurality of CSP coefficients related to the direction of arrival of speech; after a plurality of
speech signals are detected from the CSP coefficients, voice direction indexes, defined in
accordance with the angle between the line connecting the two microphones and the arrival
direction, are detected, and a target voice signal is extracted from the plurality of detected
voice signals based on the detected voice direction indexes. It is therefore possible to reliably
extract and collect only the target voice. Furthermore, the present invention has the effect of
reliably keeping only the voice of the salesperson, for whom recording is required, without keeping
a speech recording of the customer, whose consent has not been obtained. Simultaneously with speech
separation, speech recognition performance is enhanced by performing speech enhancement processing
consisting of the series of SS processing, gain adjustment processing by CSP coefficients, and
flooring processing.
[0015]
Hereinafter, embodiments of the present invention will be described with reference to the
drawings.
[0016]
[Voice Collection System] FIG. 1 is a diagram schematically illustrating an example of a voice
collection system according to an embodiment of the present invention.
In FIG. 1, the voice collection system 10 comprises a microphone array 11, a target voice
extraction device 12, and a customer interaction recording server 13. The microphone array 11
comprises two microphones 11a and 11b and may be, for example, a commercially available integrated
unit, a set of stereo microphones, or the like. The details of the target voice extraction device
12 will be described later.
[0017]
In the example of FIG. 1, the customer 21, the salesperson 22, the table 14, and so on are viewed
from above. The microphone array 11 is disposed substantially on a straight line connecting the
customer 21 and the salesperson 22 as viewed from above. That is, the microphone array 11 is
arranged such that the line connecting the microphones 11a and 11b and the line connecting the
customer 21 and the salesperson 22 are substantially parallel. Thereby, the difference in the
arrival time at the two microphone elements of the voice emitted by the customer or the salesperson
can be maximized. By arranging the array in this manner, in a situation where a plurality of
face-to-face sales booths are arranged side by side, voices and the like from adjacent booths, for
which the arrival time difference at the microphone elements is not the largest, can be effectively
cut.
[0018]
Further, in the illustrated example, target speaker utterance section detection is performed based
on the CSP coefficient to distinguish the utterance of the customer and the salesperson.
Specifically, CSP coefficients are calculated for voice signals received by two microphones, and a
section in which the CSP coefficients are large is regarded as a target speaker's utterance section
to extract a target voice signal.
[0019]
Further, the extracted speech signal is subjected to speech enhancement processing by performing
gain adjustment with the CSP coefficients between the SS processing and the flooring processing.
This speech enhancement processing is intended to enhance speech recognition performance, and the
combination of target speaker speech extraction by CSP coefficients with this speech enhancement
processing is referred to as AFE (ASR Front-end for speech Enhancement, where ASR stands for
Automatic Speech Recognition). In this embodiment, speech recognition is performed separately on
each speech signal after separation and enhancement by the AFE, and, as will be described later,
the speaker direction index obtained by the CSP method and the time stamps of the speech
recognition result are used to simplify the recording of the target speaker's voice signal and to
selectively designate the portions to be recorded.
[0020]
As shown in FIG. 1, the microphone array 11 may be disposed so that the microphones 11a and 11b
are positioned substantially on a straight line connecting the customer 21 and the salesperson 22
when viewed from above. The microphone array 11 may be located substantially at the center of the
table 14 and may be embedded at substantially the center of the table 14.
[0021]
FIG. 2 is a diagram showing the voice arrival direction at the microphones. In FIG. 2, assuming
that the microphones 11a and 11b are separated by a distance d, the angle θ between the straight
line connecting the microphones 11a and 11b and the voice arrival direction is represented by
Equation 1:

θ = cos⁻¹(cτ / d)  (Equation 1)

[0022]
Here, c is the speed of sound, and τ is the time difference (arrival time difference) with which
the voice arrives at the microphones 11a and 11b. Preferably, the straight line connecting the
microphones 11a and 11b is taken as a direction vector from the microphone 11a to the microphone
11b; in the above equation, θ = 0° and θ = 180° can then be distinguished as the states in which
the arrival direction of the voice is parallel and antiparallel to the direction vector,
respectively.
[0023]
[Voice Enhancement Processing (AFE)] Next, in the voice collection system according to the present
invention, CSP coefficients can be calculated and voice enhancement processing can be performed
using them. Specifically, in the voice enhancement processing, gain adjustment using the CSP
coefficients can be performed between the SS processing and the flooring processing, thereby
improving both the performance of identifying the salesperson's voice and the speech recognition
performance. Specific components of the speech processing means and their relationships are
illustrated below.
[0024]
FIG. 3 is a diagram showing the configuration of the target voice extraction device 12 according
to an embodiment of the present invention. The target voice extraction device 12 receives as input
the voice signals received by the microphones 11a and 11b of the microphone array 11 and
appropriately includes discrete Fourier transform processing units 105 and 106, a CSP coefficient
calculation unit 110, a group delay array processing unit 120, a noise estimation unit 130, an SS
processing unit 140, a gain adjustment processing unit 150, a flooring processing unit 160, and
the like. In the discrete Fourier transform processing units 105 and 106, the signals from the two
microphones 11a and 11b are appropriately amplified, divided into frames of a predetermined time
width, band-limited as appropriate, and so on, using known techniques of digital audio signal
processing, and complex discrete spectra are output from the input signals.
[0025]
The CSP coefficient calculation unit 110 shown in FIG. 3 calculates CSP coefficients from the
complex discrete spectra. Here, the CSP coefficient is a cross-correlation coefficient between the
two channel signals calculated in the frequency domain, and is calculated by the following
Equation 2:

φ(i, T) = IDFT[ DFT[s1(t)] · DFT[s2(t)]* / ( |DFT[s1(t)]| · |DFT[s2(t)]| ) ]  (Equation 2)

[0026]
Here, φ(i, T) is the CSP coefficient obtained from the voice signals received by the first and
second microphones 11a and 11b, i is the voice arrival direction (speaker direction index), T is
the frame number, and s1(t) and s2(t) are the signals of the first and second microphones 11a and
11b, respectively, received at time t. DFT denotes the discrete Fourier transform, IDFT the
inverse discrete Fourier transform, and * the complex conjugate.
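Equation 2 can be sketched with NumPy as follows; the function name is illustrative, and the small
epsilon guarding the division is an added assumption, not part of the equation:

```python
import numpy as np

def csp_coefficients(s1, s2):
    """Equation 2: IDFT of the whitened (unit-magnitude) cross-spectrum of s1 and s2."""
    S1 = np.fft.rfft(s1)
    S2 = np.fft.rfft(s2)
    cross = S1 * np.conj(S2)
    cross /= np.abs(cross) + 1e-12      # whitening: keep only the phase
    # one coefficient per candidate lag i (negative lags wrap to the end of the array)
    return np.fft.irfft(cross, n=len(s1))

# A white test signal and a copy advanced by 4 samples produce a sharp peak at lag i = 4.
rng = np.random.default_rng(0)
s1 = rng.standard_normal(1024)
s2 = np.roll(s1, -4)
phi = csp_coefficients(s1, s2)
```

The lag at which φ peaks gives the arrival time difference in samples, which is the quantity the
speaker direction index of FIG. 4 is built on.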
[0027]
Next, in the group delay array processing unit 120, the signals arriving from the θ direction are
received by at least two microphones and are emphasized by being brought into phase and added.
Signals arriving from directions other than θ are not in phase and are therefore not emphasized.
Directivity can thus be formed such that the sensitivity is high in the θ direction and low in
other directions.
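The in-phase addition can be sketched as a two-element delay-and-sum with an integer-sample
steering delay. This is a simplification (practical implementations steer with fractional delays,
often per frequency bin); the names and the circular shift are illustrative:

```python
import numpy as np

def delay_and_sum(x1, x2, steer_samples):
    """Shift channel 2 by the steering delay so the target direction is in phase, then average."""
    return 0.5 * (x1 + np.roll(x2, steer_samples))

# A source whose wavefront reaches mic 2 three samples after mic 1 is recovered
# exactly when the array is steered by -3 samples; other directions do not align.
rng = np.random.default_rng(1)
x1 = rng.standard_normal(512)
x2 = np.roll(x1, 3)               # same signal, delayed 3 samples at mic 2
y = delay_and_sum(x1, x2, -3)
```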
[0028]
Instead of the group delay array processing unit 120, it is also possible to form a null in the
direction of noise or reverberation by adaptive array processing, and other array processing may
be substituted. It is also possible to omit (that is, bypass) these array processes and use one of
the audio signals received by the two microphones as it is.
[0029]
Next, speech enhancement processing is performed using the CSP coefficients calculated as
described above. Specifically, in the speech enhancement processing, gain adjustment using the CSP
coefficients is performed between the SS processing and the flooring processing. Typically, the SS
processing is the subtraction processing represented by the following equation:

Yω(T) = Xω(T) − α·Uω  (Equation 3)

Here, Xω(T) is the power spectrum before the SS processing, Yω(T) is the power spectrum after the
SS processing, that is, after subtraction, and Uω is the power spectrum of the noise. Uω is
estimated in the noise section, that is, in the section where the target speaker is not speaking;
it may be estimated in advance and used in a fixed manner, estimated (updated) sequentially along
with the input speech signal, or estimated (updated) at fixed time intervals.
[0030]
That is, for example, a signal obtained by integrating the two input signals received by the
microphones 11a and 11b by array processing, or Xω(T), which is one of the two input signals, is
input to the noise estimation unit 130, and the noise power spectrum Uω is estimated
appropriately. α is a subtraction constant, and can take an arbitrary value, such as a value close
to 1 (e.g., 0.90).
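The subtraction of Equation 3 operates per frequency bin on power spectra; a minimal sketch with
illustrative array values:

```python
import numpy as np

def spectral_subtraction(X, U, alpha=0.90):
    """Equation 3: Y_w(T) = X_w(T) - alpha * U_w, per frequency bin."""
    return X - alpha * U

X = np.array([1.00, 0.50, 0.20])   # observed power spectrum X_w(T)
U = np.array([0.10, 0.10, 0.10])   # estimated noise power spectrum U_w
Y = spectral_subtraction(X, U)     # [0.91, 0.41, 0.11]
```

Bins driven small or negative by the subtraction are not clipped here; the flooring step handles
them, as described below.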
[0031]
Then, gain adjustment may be appropriately performed as in the following equation. That is, the
gain adjustment is performed by multiplying the subtracted spectrum Yω(T) after the above SS
processing by the CSP coefficient:

Dω(T) = φ(i, T) · Yω(T)  (Equation 4)

Here, Dω(T) is the power spectrum after gain adjustment. When the target speaker is not speaking,
the CSP coefficient is small, so this processing suppresses the power spectrum of voice signals
from directions other than the arrival direction. As this equation shows, as long as such "gain
adjustment" can be performed, the technical idea of the present invention is not limited to the
use of CSP coefficients.
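The gain adjustment of Equation 4 scales the whole subtracted spectrum of a frame by that frame's
CSP coefficient; a minimal sketch (the CSP values 0.90 and 0.05 are illustrative):

```python
import numpy as np

def csp_gain_adjust(Y, phi_iT):
    """Equation 4: D_w(T) = phi(i, T) * Y_w(T)."""
    return phi_iT * Y

Y = np.array([0.91, 0.41, 0.11])       # spectrum after SS processing
D_speaking = csp_gain_adjust(Y, 0.90)  # target speaker active: passes almost unchanged
D_silent = csp_gain_adjust(Y, 0.05)    # target speaker silent: strongly suppressed
```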
[0032]
Furthermore, the flooring processing is performed as follows. The flooring processing refers to
replacing small values contained in the actual data with appropriate numerical values rather than
using them as they are:

Zω(T) = Dω(T) if Dω(T) > β·Uω, otherwise β·Uω  (Equation 5)

Here, Zω(T) is the power spectrum after the flooring processing and Uω is the power spectrum of
the noise; Uω may be the same as that used in Equation 3, obtained as appropriate from the output
of the noise estimation unit 130 or the like, or a different estimate obtained by another method
may be used. As Equation 5 shows, Uω may be used only for the conditional judgment. The flooring
factor β is a constant, and can take any value convenient in the art, such as a value close to 0
(e.g., 0.10).
[0033]
Usually, flooring processing directly follows SS processing; it is one of the points of the
present invention that the gain adjustment by the CSP coefficient is introduced between the two
processes. The output Zω(T) obtained as described above can be used as the salesperson's voice
signal for storage in a server device or the like, or as input to speech recognition means.
Although FIG. 3 shows an example in which one of the audio signals observable with the two
microphones 11a and 11b is used as the output, the present invention is not limited to this; as
described later, the voice collection method according to the present invention can obtain an
output for recording or speech recognition for each of the received voice signals of two voices
arriving at the microphone array 11 from different directions, and that output can be used for
speech recognition.
[0034]
[Speaker Direction Index] FIG. 4 is a diagram showing an example of the speaker direction index
relative to the position of the microphones. Taking the direction vector connecting the
microphones 11a and 11b of the microphone array 11, the direction from which a speaker's speech
arrives can be distinguished as a range of azimuth angles with respect to that direction vector,
centered on the microphone array 11. For example, a voice arriving along the direction from the
microphone 11a to the microphone 11b is substantially parallel to the direction vector, and the
cosine of its azimuth angle is close to +1 (the area where the speaker direction index is +7 in
FIG. 4). Conversely, a voice arriving along the direction from the microphone 11b to the
microphone 11a is close to antiparallel to the direction vector, and the cosine of its azimuth
angle is close to −1 (the area where the speaker direction index is −7 in FIG. 4). As Equation 1
shows, given the microphone spacing d and the sound speed c, the arrival time difference τ depends
on the angle θ, so the speaker direction index shown in FIG. 4 contains information on the arrival
time difference τ.
[0035]
For speech arriving at the microphones 11a and 11b from a direction perpendicular to the
microphone array 11 there is no difference in arrival time, and the speaker direction index for
this direction is represented here as zero. That is, with the angle θ represented by Equation 1,
if the arrival sample shift is x and the sampling frequency is f, then τ = x / f. With a sampling
frequency of 22050 Hz, a microphone spacing of d = 12.5 cm, and a sound velocity of 340 m/s,
x = 0, that is, speaker direction index = 0, corresponds to the angle θ = 90°.
[0036]
Further, in FIG. 4, the speaker direction index +1 (or −1) represents the range in which the voice
reaching the microphones 11a and 11b is shifted by one sample (that is, x = 1); in this case the
angle θ is 82.9°.
[0037]
Similarly, the speaker direction indexes +2 to +7 (or −2 to −7) represent the ranges in which the
voices reaching the microphones 11a and 11b are shifted by 2 to 7 samples, respectively.
Then, in the AFE, the target sound is extracted using the CSP coefficients, taking into account
the arrival time difference of the sound input to the microphones 11a and 11b. Here, at x = +7 the
angle is θ = 30.3°, and at x = −7 the angle is θ = 149.7°. Therefore, along the straight line
connecting the microphones 11a and 11b, a range of about 30° can be accepted as the same sound
arrival direction. As described above, the present invention is characterized in that changes in
the posture or position of a salesperson or a customer can be tolerated within the range of
arrangements that can exploit the arrival time difference.
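The angles quoted above follow from τ = x / f and Equation 1. The sketch below reproduces them
with the constants given in the text (f = 22050 Hz, d = 12.5 cm, c = 340 m/s); the function name
and the clamping guard are illustrative:

```python
import math

C = 340.0     # sound speed [m/s]
D = 0.125     # microphone spacing [m]
FS = 22050.0  # sampling frequency [Hz]

def index_to_angle(x):
    """Angle theta for a speaker direction index of x samples: theta = arccos(c*(x/f)/d)."""
    ratio = C * (x / FS) / D
    # clamp against floating-point overshoot at the ends of the valid range
    return math.degrees(math.acos(max(-1.0, min(1.0, ratio))))
```

Index 0 gives 90°, +1 gives 82.9°, +7 gives 30.3°, and −7 gives 149.7°, matching the values in the
text.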
[0038]
Now, assume that the target speaker is in the direction of speaker direction index = 0 (for
example, on the right side in the figure). When the target speaker at speaker direction index = 0
speaks, there is no time delay between the audio signals received by the microphones 11a and 11b
as described above, and the correlation between the two audio signals is high. Therefore, the CSP
coefficient φ(0, T) becomes large.
[0039]
On the other hand, when a voice arrives from the direction of speaker direction index = +4 (for
example, on the right side in the figure), the voice reaches the microphone 11b with a delay of
four samples relative to the microphone 11a. Therefore, φ(0, T) becomes small (and in this case
φ(4, T) becomes large).
[0040]
Therefore, if it is desired to extract only the voice arriving from the direction of speaker
direction index = 0, it suffices to track the value of φ(0, T) and extract the sections in which
φ(0, T) is large. However, the AFE also receives voices arriving with the same time difference
from the mirror-image direction with respect to the axis connecting the microphones 11a and 11b.
[0041]
For example, focusing on speaker direction index = +4, a voice arriving from speaker direction
index = +4 on the right side of the figure cannot be distinguished from a voice arriving from
speaker direction index = +4 on the left side of the figure. Therefore, it is necessary to arrange
the microphones 11a and 11b so as not to suffer from this mirror-image problem.
[0042]
When the speakers (here, the customer 21 and the salesperson 22) sit facing each other across the
table 14, they may be laterally offset (that is, spread over a wide range in the lateral
direction), and their seating position and posture often change even during the dialogue. For this
reason, it is necessary to be able to pick up voice over a certain range in the direction of the
target speaker.
[0043]
Superdirective microphones are highly effective for recording only the voice signal of the target
speaker, but they are generally expensive, it is difficult to cope with variations in speaker
position, and the sound pickup performance changes drastically depending on the seating position.
In addition, a superdirective microphone is large and also has sharp directivity in the direction
opposite to the target direction, which makes the layout relationship between the booths and the
microphones extremely difficult.
[0044]
On the other hand, when a unidirectional microphone is used, the directivity is not very precise,
so environmental sound from the surroundings and conversation in adjacent booths are also
recorded. Unidirectional microphones are also relatively expensive.
[0045]
FIG. 5 is a diagram showing a classification of microphones according to directivity. The
nondirectional microphone shown in FIG. 5(a) has the same sensitivity in all directions over 360
degrees, and the bidirectional microphone shown in FIG. 5(b) is sensitive to the front and the
opposite side. The unidirectional microphone shown in FIG. 5(c) has high sensitivity to voice only
in the front direction. The sharp directional microphone shown in FIG. 5(d) and the superdirective
microphone shown in FIG. 5(e) each have sharper directional characteristics than the
unidirectional microphone.
[0046]
When the AFE is used, relatively broad lobes are formed in the axial directions (+7, −7) of the
microphone array 11, as shown in FIG. 4. When the customer 21 is positioned at speaker direction
index = −7, the lobes in the axial directions (+7, −7) are wide, so slight deviations in the
postures and positions of the customer 21 and the salesperson 22 can be tolerated, and voices
arriving from outside these ranges can be effectively cut.
[0047]
Moreover, when the AFE is used, the directivity or nondirectivity of the microphones becomes
irrelevant; since microphones of any directivity can be used, the cost of the microphones can also
be kept low.
[0048]
[Arrangement of the Microphone Array] FIG. 6 shows an example of the arrangement of the microphone
array according to an embodiment of the present invention.
As described above, since the mirror-image problem exists when the AFE is used, the position of
the microphones must be considered. For example, when the microphone array 11 is disposed at the
position indicated by symbol A in FIG. 6, the voice of the adjacent booth 16 may be extracted in
the same manner.
[0049]
For this reason, in the present embodiment, the microphone array 11 is installed at the position
indicated by symbol B in FIG. 6 (for example, on the table 15) to avoid the above problem.
With this installation of the microphone array 11, it becomes difficult to detect the direction of
the speaker accurately in fine units, but there is no problem in terms of collecting only the
voice of the salesperson 22. Of course, in an environment with no incoming voice from an adjacent
booth, an embodiment may be envisaged in which a microphone is placed at the position indicated by
symbol A in FIG. 6 and only the speech enhancement part of the AFE of the present invention is
applied.
[0050]
[Objective Speech Extraction Device] FIG. 7 is a block diagram showing the objective speech extraction device 12 of FIG. 1 in detail. In FIG. 7, it is assumed that the salesperson 22 and the customer 21 are in a one-on-one dialogue. The target speech extraction device 12 includes a speech section index detection processing unit 31, a first speech recognition unit 32, a second speech recognition unit 33, an integrated selection unit 34, and a recording range extraction unit 35. The audio signals received by the microphones 11a and 11b are input to the speech section index detection processing unit 31.
[0051]
In FIG. 7, the microphone 11a is located on the salesperson 22 side and the microphone 11b is located on the customer 21 side; the voice signal S1(t) received by the microphone 11a (L-ch) and the voice signal S2(t) received by the microphone 11b (R-ch) are input. Here, the input from each microphone is sampled at a predetermined sampling frequency by an A/D conversion unit (not shown) and given to the speech section index detection processing unit 31 as a digital signal. Details of the operation of the speech section index detection processing unit 31 will be described later with reference to FIG. 8.
[0052]
Next, the target voice extraction device 12 according to the present invention uses the speech recognition units 32 and 33 to perform speech recognition on each of the separated voice signals output from the speech section index detection processing unit 31, that is, the voice signal of the salesperson and the voice signal of the customer, and obtains a recognition result and a time stamp for each. Here, the time stamp is time information output from the speech recognition units 32 and 33, and serves as time-series information when the recognition results are integrated in a subsequent stage.
[0053]
Next, the target speech extraction device 12 according to the present invention integrates the speech recognition results using the integrated selection unit 34. Specifically, data in which the speaker distinction, the speech recognition result, the time stamp, and the like are associated with one another is generated.
[0054]
Next, the target voice extraction device 12 according to the present invention causes the recording range extraction unit 35 to cut out the voice signal included in a predetermined or designated time range based on information such as the speaker direction index, the voice recognition result, and the time stamp, and to store it appropriately in a server device or the like. In the present invention, since voice recognition is performed individually for the salesperson and for the customer, the contents of the dialogue between the two can be confirmed when designating the part to record. Moreover, recording of unnecessary parts can be avoided, so resources such as server devices can be used efficiently.
[0055]
[Process of Speech Section Index Detection Processing Unit 31] FIG. 8 is a flowchart for explaining the processing in the speech section index detection processing unit 31. The speech section index detection processing unit 31 acquires an audio signal (step S1) and determines whether the audio signal is an input from the microphone 11a (step S2). If it is an input from the microphone 11a (first microphone), windowing processing with, for example, a Hanning window or a Hamming window is performed on the salesperson digital voice input signal to obtain a salesperson windowed signal (step S3). Subsequently, the salesperson windowed signal is converted into the frequency domain by discrete Fourier transform processing to become a salesperson frequency-domain signal (step S4), and the processing shifts to the part indicated by the dashed line in the figure. Similarly, if it is determined in step S2 that the input is from the microphone 11b (second microphone), windowing processing (step S5) and discrete Fourier transform processing (step S6) are performed on the customer digital voice input signal to obtain a customer frequency-domain signal.
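The windowing and discrete Fourier transform steps (S3 to S6) can be sketched as follows. This is an illustrative sketch in Python with NumPy; the function name, the fixed 512-sample frame, and the use of `np.fft.rfft` are assumptions for illustration, not the patent's implementation.

```python
import numpy as np

def frame_to_freq(frame, window="hamming"):
    """Window one frame of samples and return its discrete Fourier transform."""
    n = len(frame)
    w = np.hanning(n) if window == "hanning" else np.hamming(n)
    windowed = frame * w              # windowing processing (step S3 / S5)
    return np.fft.rfft(windowed)     # discrete Fourier transform (step S4 / S6)

# One 512-sample frame per channel (random data stands in for real audio):
rng = np.random.default_rng(0)
frame_l = rng.standard_normal(512)   # salesperson side (L-ch, microphone 11a)
frame_r = rng.standard_normal(512)   # customer side (R-ch, microphone 11b)
S1 = frame_to_freq(frame_l)          # salesperson frequency-domain signal
S2 = frame_to_freq(frame_r)          # customer frequency-domain signal
```

The frequency-domain signals S1 and S2 are what the subsequent CSP and delay-and-sum steps operate on.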
[0056]
As described above, the speech section index detection processing unit 31 detects the speaker direction index and calculates the CSP coefficient from the salesperson frequency-domain signal and the customer frequency-domain signal, that is, based on Equation 1 (step S7).
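The CSP coefficient is commonly computed as the whitened (phase-only) cross-correlation of the two channels; the sketch below assumes that standard formulation, which may differ in detail from the patent's Equation 1. The lag of the peak corresponds to the arrival-time difference between the microphones and hence maps to the speaker direction index.

```python
import numpy as np

def csp_coefficients(S1, S2):
    """Lag-domain CSP coefficients of two frequency-domain frames.

    The cross spectrum is normalized to unit magnitude (whitening),
    so only the phase, i.e. the arrival-time difference, survives.
    """
    cross = S1 * np.conj(S2)
    cross /= np.abs(cross) + 1e-12    # keep phase only
    return np.fft.irfft(cross)

# Example: channel 2 is channel 1 advanced by 5 samples (a circular
# shift keeps the example exact), so the CSP peak appears at lag 5.
rng = np.random.default_rng(0)
s1 = rng.standard_normal(2048)
s2 = np.roll(s1, -5)
csp = csp_coefficients(np.fft.rfft(s1), np.fft.rfft(s2))
lag = int(np.argmax(csp))            # inter-microphone delay in samples
```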
[0057]
Subsequently, salesperson-side delay-and-sum array processing is performed on the salesperson frequency-domain signal and the customer frequency-domain signal (step S8) to emphasize the salesperson's voice as a salesperson emphasis signal. Similarly, customer-side delay-and-sum array processing is performed on the salesperson frequency-domain signal and the customer frequency-domain signal (step S9) to emphasize the customer's voice as a customer emphasis signal.
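The delay-and-sum array processing of steps S8 and S9 can be illustrated in the frequency domain as below. The two-channel form, the function name, and steering by a single delay tau are assumptions for illustration; tau would be chosen from the CSP peak for the salesperson side or the customer side respectively.

```python
import numpy as np

def delay_and_sum(S1, S2, tau, fs, n_fft):
    """Frequency-domain delay-and-sum for a two-microphone array.

    Channel 2 is advanced by tau seconds so that signals arriving
    from the steered direction add coherently; signals from other
    directions add incoherently and are relatively attenuated.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return 0.5 * (S1 + S2 * np.exp(2j * np.pi * freqs * tau))

# If microphone 2 hears the target tau seconds after microphone 1,
# steering by tau reconstructs the channel-1 spectrum exactly:
fs, n_fft = 22050, 512
freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
rng = np.random.default_rng(1)
S1 = rng.standard_normal(n_fft // 2 + 1) + 1j * rng.standard_normal(n_fft // 2 + 1)
tau = 3.0 / fs                               # a 3-sample delay
S2 = S1 * np.exp(-2j * np.pi * freqs * tau)  # delayed copy of channel 1
enhanced = delay_and_sum(S1, S2, tau, fs, n_fft)
```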
[0058]
Next, noise is removed from the salesperson emphasis signal by spectral subtraction processing (step S10), gain adjustment processing using the CSP coefficient is performed (step S11), and Flooring is applied as appropriate (step S12) to obtain the salesperson-side voice signal.
[0059]
Similarly, noise is removed from the customer emphasis signal by spectral subtraction processing (step S13), gain adjustment processing using the CSP coefficient is performed (step S14), and Flooring processing (step S15) is applied as appropriate to obtain the customer-side voice signal.
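Steps S10 to S12 (and likewise S13 to S15) form a fixed per-frame chain: spectral subtraction, gain adjustment by the CSP coefficient, then Flooring. A minimal sketch, assuming power-spectrum inputs, a CSP gain clipped to [0, 1], and a simple max()-based flooring rule; the patent's Equation 5 may differ in detail.

```python
import numpy as np

def enhance_frame(Y_pow, noise_pow, csp_gain, alpha=1.0, beta=0.1):
    """One frame of the enhancement chain: SS -> CSP gain -> Flooring."""
    # Step S10/S13: spectral subtraction of the estimated noise power.
    subtracted = Y_pow - alpha * noise_pow
    # Step S11/S14: scale by the CSP coefficient for the target
    # direction, so frames where the target is silent are attenuated.
    gained = np.clip(csp_gain, 0.0, 1.0) * subtracted
    # Step S12/S15: Flooring keeps a small positive noise floor so
    # the spectrum never becomes negative.
    return np.maximum(gained, beta * noise_pow)

out = enhance_frame(np.array([2.0, 0.5]), np.array([1.0, 1.0]), csp_gain=0.8)
```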
[0060]
Further, the speech section index detection processing unit 31 performs speech section detection processing based on the CSP coefficient shown in the above-mentioned Equation 1, and temporarily stores the salesperson-side voice signal and the customer-side voice signal obtained as described above as independent channels (the speech section detection processing uses an algorithm according to the above-mentioned target sound extraction method). Here, as described above, the speaker direction index is also detected along with the separation of the target sound, and each separated voice signal is associated with its speaker direction index.
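The speech section detection based on the CSP coefficient can be pictured as thresholding the per-frame CSP peak at the target speaker's direction index; contiguous frames above the threshold form one speech section. The threshold value and the frame-interval output format are assumptions for illustration.

```python
import numpy as np

def detect_speech_sections(csp_peaks, threshold=0.2):
    """Return [start, end) frame intervals where the CSP peak at the
    target speaker's direction index exceeds the threshold."""
    active = np.asarray(csp_peaks) >= threshold
    sections, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                      # a speech section begins
        elif not is_active and start is not None:
            sections.append((start, i))    # the section ends
            start = None
    if start is not None:
        sections.append((start, len(active)))
    return sections

sections = detect_speech_sections([0.1, 0.5, 0.6, 0.1, 0.3])
```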
[0061]
The speech section index detection processing unit 31 gives the salesperson-side voice signal and its speaker direction index to the first speech recognition unit 32 and also to the recording range extraction unit 35. It likewise gives the customer-side voice signal and its speaker direction index to the second speech recognition unit 33 and also to the recording range extraction unit 35.
[0062]
The first speech recognition unit 32 performs speech recognition on the salesperson-side voice signal to obtain a recognition result and a time stamp (a salesperson speech recognition result and a salesperson time stamp). The second speech recognition unit 33 performs speech recognition on the customer-side voice signal to obtain a recognition result and a time stamp (a customer speech recognition result and a customer time stamp). Here, the time stamp is time information output by the first speech recognition unit 32 and the second speech recognition unit 33, and is used as time-series information when the recognition results are integrated.
[0063]
The salesperson speech recognition result and salesperson time stamp, and the customer speech recognition result and customer time stamp, are given to the integrated selection unit 34, where the speech recognition results are integrated to generate the dialogue table shown in Table 1 (this dialogue table may be presented to the user in, for example, HTML format).
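The integration performed by the integrated selection unit 34 can be pictured as a merge of the two time-stamped recognition streams into one table ordered by time. The (start_time, text) tuple format is an assumed simplification of the recognizer output, and the sample utterances are hypothetical; Table 1 itself is not reproduced here.

```python
def build_dialogue_table(clerk_results, customer_results):
    """Merge two per-speaker recognition streams into one dialogue
    table sorted by time stamp."""
    rows = [(t, "salesperson", text) for t, text in clerk_results]
    rows += [(t, "customer", text) for t, text in customer_results]
    return sorted(rows, key=lambda row: row[0])

# Hypothetical recognition results (times in seconds):
table = build_dialogue_table(
    [(0.0, "May I help you?"), (7.2, "This fund is low risk.")],
    [(3.1, "Tell me about investment trusts.")],
)
```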
[0064]
[0065]
When a portion of the desired voice signal is selected from this dialogue table as a unit to record, the integrated selection unit 34 generates a target speaker recording range (that is, a range delimited by time stamps) and sends it to the recording range extraction unit 35. The recording range extraction unit 35 extracts the voice signal of the corresponding section (range) based on the speaker direction index and the target speaker recording range, and stores the voice signal in the customer interaction recording server 13 as the salesperson voice.
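The extraction by the recording range extraction unit 35 amounts to converting a time-stamped recording range into sample indices of the separated voice signal; a plain slice, as sketched below under that assumed simplification.

```python
import numpy as np

def extract_range(signal, fs, start_sec, end_sec):
    """Cut the samples of a separated voice signal that fall inside
    a [start_sec, end_sec) recording range."""
    i0 = max(0, int(round(start_sec * fs)))
    i1 = min(len(signal), int(round(end_sec * fs)))
    return signal[i0:i1]

# Toy signal at fs = 2 Hz: cutting 1.0 s to 3.0 s keeps samples 2..5.
clip = extract_range(np.arange(10), fs=2, start_sec=1.0, end_sec=3.0)
```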
[0066]
In the present embodiment, as described above, the recording section is determined using the speaker direction index, the speech recognition result, and the time stamp, and speech recognition is performed individually for each speaker, so the user can designate the part to record while confirming the contents of the dialogue between the two.
[0067]
Further, in the present embodiment, recording of unnecessary portions can be avoided, and as a
result, the disk capacity of the customer interaction recording server 13 can be reduced, which is
efficient.
[0068]
Here, the types of microphone and the AFE were compared in an evaluation test from the viewpoint of how much the customer's voice signal is reduced.
For the evaluation experiment, voice signals collected in a simulated face-to-face fashion were
used.
In the evaluation test, one speaker each played the salesperson and the customer, holding a conversation about an investment trust on the two sides of a table 100 cm long in the direction between the salesperson and the customer.
[0069]
The dialogue consists of a salesperson turn, a customer turn, and a salesperson turn as one set, and was recorded in three cases: a predetermined standard position, a position slightly shifted left and right from the standard position, and a position extremely close to the table, with three sets for each case. As microphones, two non-directional Sony (registered trademark) microphones (Sony ECM-55B) were used to construct the microphone array, which was placed midway between the salesperson and the customer.
[0070]
For comparison, unidirectional microphones (AKG 400) were installed facing each speaker to collect the voices of both speakers. The distance between the microphones was 12.5 cm for both the directional and the non-directional case. In this evaluation test, AFE was applied to the audio signals received by the non-directional microphones.
[0071]
Here, since the aim is to extract only the salesperson's voice signal and not to leave the customer's voice signal in the record, the customer's voice signal was regarded as noise and evaluated by the noise reduction rate (NRR). The voice sound pressure level of the customer collected by the non-directional microphone close to the salesperson was taken as the reference, and the effect was compared by how much the customer's voice signal was reduced from this reference.
[0072]
However, in order to absorb differences in recording level caused by differences in the recording devices, normalization was performed on the computer so that the power of the salesperson's audio signal would be the same in each case. The definition of NRR used in this evaluation experiment is as follows.
[0073]
Noise reduction rate (NRR) [dB] = customer voice sound pressure level with the non-directional microphone (reference microphone) [dB] - customer voice sound pressure level with the directional microphone (or after AFE) [dB]
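The NRR definition above can be computed as sketched below. The RMS-based relative level measure and the assumption that the signals have already been power-normalized as described in paragraph [0072] are simplifications for illustration.

```python
import numpy as np

def level_db(x, eps=1e-12):
    """Relative RMS level of a signal in dB (not a calibrated SPL)."""
    return 20.0 * np.log10(np.sqrt(np.mean(np.square(x))) + eps)

def noise_reduction_rate(customer_at_reference_mic, customer_after_processing):
    """NRR [dB]: reference-microphone customer level minus the
    customer level after the directional microphone or AFE."""
    return level_db(customer_at_reference_mic) - level_db(customer_after_processing)

# A tenfold amplitude reduction of the customer's voice gives 20 dB NRR:
nrr = noise_reduction_rate(np.ones(1000), 0.1 * np.ones(1000))
```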
[0074]
Normally, NRR is calculated based on the SNR of the input and output, but in this evaluation experiment, since the power of the audio signal is normalized, it is formulated as the difference of the noise alone, as described above. Table 2 shows the experimental results.
[0075]
[0076]
From the experimental results, it can be seen that the omnidirectional microphone picks up all voices regardless of their arrival direction and therefore shows a high sound pressure level for the customer's voice as well. It can also be seen that although the unidirectional microphone has directivity in the front direction, its directional characteristic is dull, so the customer's voice cannot be blocked very well; this makes it completely useless for the purpose of recording only the salesperson's voice on the server.
[0077]
On the other hand, with the voice collection system according to the present embodiment (using the non-directional microphones), the customer's voice is significantly reduced and effectively suppressed. The system still shows a sound pressure level of 19.6 dB, but this is because the AFE adds a small amount of noise by performing the Flooring processing shown in Equation 5 for speech recognition; this residual signal carries no information that could identify the phonology (what is being said). With the voice collection system according to the present embodiment, the salesperson's voice is detected completely.
[0078]
In the above-described embodiment, voices are collected by the microphones and only the salesperson's voice is stored in the customer interaction recording server by the microphone-array target voice extraction device; however, the customer's voice may also be stored in the server as needed. Further, if necessary, three or more microphones may be arranged according to the speaker direction index shown in FIG. 4 to extract the voice of only the desired speaker.
[0079]
Further, although the cross-correlation coefficient is used in the above embodiment, another method of obtaining a correlation coefficient may be used. Also, when a program realizing the operation of the above-described voice collection system is run on a computer, the voice of only the desired speaker can be extracted in the same manner.
[0080]
[Example of Speech Enhancement Performance According to the Order of Speech Processing Steps] In voice collection according to the present invention, as shown by the speech processing steps and their order described above, the speech enhancement processing for collecting the target speech is performed in the order SS processing → gain adjustment by CSP → Flooring processing. This order is an important point of speech enhancement for the voice collection method according to the present invention, and the difference in speech enhancement performance caused by a difference in processing order is exemplified below.
[0081]
The voice used to test the difference in speech enhancement performance was collected through the microphone array 11 and processed under the conditions of a sampling frequency of 22 kHz, a frame size of 23 ms, a frame shift of 15 ms, and an FFT size of 512 points, and was used for speech enhancement to obtain a target voice emphasis signal. Speech recognition processing was then performed as appropriate on the obtained target voice emphasis signal.
[0082]
First, an example in which the speech recognition rate is improved by the speech enhancement according to the present invention will be described. For recordings of 50 types of voice commands by 4 speakers, Table 3 compares the command recognition rate when speech recognition is performed with only SS processing according to the prior art against the command recognition rate when speech enhancement in the predetermined order according to the present invention, that is, SS processing → gain adjustment by CSP → Flooring processing, is performed. The command recognition rate can be treated as a speech recognition rate; therefore, as Table 3 shows, the speech recognition rate can be enhanced by the speech enhancement according to the present invention.
[0083]
Next, an example is shown in which the order of the speech enhancement steps according to the present invention influences the speech recognition rate. Table 4 compares the command recognition rates when the processing order of the speech enhancement is changed, adding the results to Table 3. The speakers and voice collection conditions are the same as in the example shown in Table 3. As "processing order replacement 1", speech enhancement was performed in the order SS processing → Flooring processing → gain adjustment by CSP; as "processing order replacement 2", speech enhancement was performed in the order gain adjustment by CSP → SS processing → Flooring processing. Comparing the speech recognition rates shown in Table 4 as command recognition rates, remarkably high performance was obtained when processing in the order SS processing → gain adjustment by CSP → Flooring processing, which is the speech enhancement procedure according to the present invention. It therefore turns out that processing in this order is important.
[0084]
FIG. 9 shows examples of the voice signal of a noise section at various stages of the speech enhancement processing according to the present invention. The reason why the speech enhancement processing order according to the present invention shows outstandingly high performance can be explained with the schematic diagrams shown in (a), (b), (c), and (d) of FIG. 9. Each example (200) of the noise section (a section in which the target speaker is not speaking) is expressed as the frequency characteristic of the amplitude. FIG. 9 (a) is a schematic view showing the power spectrum Xω(T) before the spectral subtraction (SS) processing is performed. FIG. 9 (b) is a schematic view showing the power spectrum Yω(T) after subtraction by the SS processing, in which the noise is reduced. FIG. 9 (c) is a schematic diagram showing the power spectrum Dω(T) after gain adjustment by the CSP coefficient, in which the noise is further reduced. FIG. 9 (d) is a schematic view showing the recognition power spectrum Zω(T) after the Flooring processing, in which the spectrum of the irregular noise has become smooth.
[0085]
The effects of the CSP gain adjustment and the Flooring appear in the noise section (the section in which the target speaker is not speaking). The spectrum of the noise section is flattened by the SS processing, the protruding peaks are further collapsed by multiplication by the CSP coefficient, and the valleys are filled and smoothed by applying the Flooring, resulting in a gentle spectral envelope. As a result, the noise is not mistaken for the voice of the target speaker. The speech recognition method according to the prior art has the problem that surrounding noise is mistaken for the target speaker's voice and causes erroneous recognition even though the target speaker is not speaking; it is considered that such errors are reduced when processing is performed in the order SS processing → gain adjustment (by the CSP coefficient) → Flooring processing.
[0086]
[Example of Operation State of Portable Salesperson Voice Collection Device] FIG. 10 illustrates an operation state of the portable salesperson voice collection device 60 according to an embodiment of the present invention. The portable salesperson voice collection device 60 comprises microphones 60a and 60b, which constitute the microphone array in the apparatus for carrying out the voice collection method according to the present invention described above with reference to the preceding figures. Furthermore, the portable salesperson voice collection device 60 includes digital signal processing means capable of performing the steps of the voice collection method according to the present invention, and includes storage means, voice reproduction means, and the like as appropriate.
[0087]
Typically, the portable salesperson voice collection device 60 is fixed to the chest of the salesperson 22, and is arranged so that, when the salesperson 22 faces the customer 21, voice arrival direction 1 (70) from the mouth of the salesperson 22 to the device 60 and voice arrival direction 2 (71) from the mouth of the customer 21 to the device 60 form different angles with respect to the direction vector connecting the microphone 60a and the microphone 60b. For example, when the direction vector points from the head of the salesperson 22 toward the feet, substantially parallel to the body axis (so that the two microphones 60a and 60b appear to the customer 21 to be arranged vertically), voice arrival direction 1 (70) is substantially parallel to the direction vector and voice arrival direction 2 (71) is substantially perpendicular to it. The arrangement is not limited to this; the size, shape, and the like of the portable salesperson voice collection device 60 may be designed as appropriate, as long as the direction vector connecting the microphone 60a and the microphone 60b forms different angles with respect to voice arrival direction 1 (70) and voice arrival direction 2 (71).
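The different arrival angles translate into different inter-microphone arrival-time differences, which is what the CSP processing detects. A sketch under the standard far-field plane-wave model; the 12.5 cm spacing is taken from the evaluation setup above, and the speed of sound of 343 m/s is an assumption.

```python
import numpy as np

def arrival_time_difference(mic_spacing_m, angle_deg, c=343.0):
    """Inter-microphone arrival-time difference for a plane wave.

    angle_deg is measured from the line connecting the two
    microphones: 0 degrees means the voice arrives along the
    microphone axis (maximum delay), 90 degrees means broadside
    (zero delay).
    """
    return mic_spacing_m * np.cos(np.radians(angle_deg)) / c

# Voice arriving along the body axis (direction 1) versus roughly
# perpendicular to it from the customer (direction 2):
tau_clerk = arrival_time_difference(0.125, 0.0)      # about 0.36 ms
tau_customer = arrival_time_difference(0.125, 90.0)  # about 0 ms
```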
[0088]
With the portable salesperson voice collection device 60 arranged in this way, the microphone 60a and the microphone 60b are used as the microphone array in the voice collection method according to the present invention, and the above-described target voice extraction method is carried out to extract the speech arriving at the microphone array with a time difference, so that the voice of the salesperson 22 can be collected selectively. In the present invention, a portable salesperson voice collection device 60 with a form similar to a commercially available voice recorder or the like can be used as a means for selectively collecting the salesperson's voice.
[0089]
[Hardware Configuration of Salesperson Voice Collection Device] FIG. 11 is a diagram showing the hardware configuration of the salesperson voice collection device according to an embodiment of the present invention. In FIG. 11, the salesperson voice collection device is assumed to be an information processing apparatus 1000, and its hardware configuration is illustrated. Although the general configuration of an information processing apparatus typified by a computer is described below, it goes without saying that the minimum necessary configuration can be selected according to the environment.
[0090]
The information processing apparatus 1000 includes a central processing unit (CPU) 1010, a bus line 1005, a communication I/F 1040, a main memory 1050, a basic input output system (BIOS) 1060, a parallel port 1080, a USB port 1090, a graphic controller 1020, a VRAM 1024, an audio processor 1030, an I/O controller 1070, and input means such as a keyboard and mouse adapter 1100. Storage means such as a flexible disk (FD) drive 1072, a hard disk 1074, an optical disk drive 1076, and a semiconductor memory 1078 can be connected to the I/O controller 1070.
[0091]
Microphones 1036 and 1037, an amplifier circuit 1032, and a speaker 1034 are connected to
the audio processor 1030. A display device 1022 is connected to the graphic controller 1020.
[0092]
The BIOS 1060 stores a boot program executed by the CPU 1010 when the information processing apparatus 1000 is started, a program depending on the hardware of the information processing apparatus 1000, and the like. The FD (flexible disk) drive 1072 reads a program or data from the flexible disk 1071 and provides it to the main memory 1050 or the hard disk 1074 via the I/O controller 1070. Although the figure shows an example in which the hard disk 1074 is included inside the information processing apparatus 1000, a hard disk may also be connected or added outside the information processing apparatus 1000 via an interface (not shown) for connecting an external device to the bus line 1005 or the I/O controller 1070.
[0093]
As the optical disk drive 1076, for example, a DVD-ROM drive, a CD-ROM drive, a DVD-RAM drive, or a CD-RAM drive can be used, in which case an optical disk 1077 corresponding to the drive must be used. The optical disk drive 1076 can read a program or data from the optical disk 1077 and provide it to the main memory 1050 or the hard disk 1074 via the I/O controller 1070.
[0094]
The computer program provided to the information processing apparatus 1000 is stored in a recording medium such as the flexible disk 1071, the optical disk 1077, or a memory card and provided by the user. The computer program is read from the recording medium via the I/O controller 1070 or downloaded via the communication I/F 1040, and is installed in the information processing apparatus 1000 and executed. The operations that the computer program causes the information processing apparatus to perform are the same as the operations of the apparatus described above, and are therefore omitted.
[0095]
The computer program described above may be stored in an external storage medium. As the storage medium, in addition to the flexible disk 1071, the optical disk 1077, or a memory card, a magneto-optical recording medium such as an MD, or a tape medium can be used. Alternatively, a storage device such as a hard disk or an optical disk library provided in a server system connected to a dedicated communication line or the Internet may be used as the recording medium, and the computer program may be provided to the information processing apparatus 1000 via the communication line.
[0096]
The above example mainly describes the information processing apparatus 1000; however, the same functions as those of the information processing apparatus described above can also be realized by installing a program having those functions on a computer and operating the computer as the information processing apparatus.
[0097]
The apparatus can be realized as hardware, software, or a combination of hardware and software. A typical example of implementation by a combination of hardware and software is implementation on a computer system having a predetermined program. In such a case, the predetermined program is loaded into the computer system and executed, thereby causing the computer system to execute the processing according to the present invention. This program is composed of instructions that can be expressed in an arbitrary language, code, or notation. Such instructions enable the system to execute a specific function directly, or after either or both of (1) conversion to another language, code, or notation and (2) copying to another medium have been performed. Of course, the present invention covers not only such a program itself but also a program product including a medium on which the program is recorded. The program for executing the functions of the present invention can be stored in any computer-readable medium such as a flexible disk, an MO, a CD-ROM, a DVD, a hard disk drive, a ROM, an MRAM, or a RAM. Such a program can be downloaded from another computer system connected by a communication line, or copied from another medium, for storage on the computer-readable medium. In addition, such a program may be compressed, or divided into a plurality of parts, and stored on a single recording medium or a plurality of recording media.
[0098]
[Brief Description of the Drawings] FIG. 1 is a block diagram schematically illustrating an example of a voice collection system according to an embodiment of the present invention. FIG. 2 is a diagram showing the voice arrival direction with respect to a microphone. FIG. 3 is a diagram showing the structure of the target voice extraction device 12 according to an embodiment of the present invention. FIG. 4 is a diagram showing an example of the speaker direction index with respect to the position of a microphone. FIG. 5 is a diagram showing the classification of microphones by directivity. FIG. 6 is a diagram showing an example of the place where the microphone array is arranged according to an embodiment of the present invention. FIG. 7 is a block diagram showing the target voice extraction device 12 shown in FIG. 1 in detail. FIG. 8 is a flowchart for explaining the processing in the speech section index detection processing unit 31. FIG. 9 is a diagram showing examples of the voice signal of a noise section at various stages of the speech enhancement processing according to the present invention. FIG. 10 is a diagram illustrating the operation state of the portable salesperson voice collection device 60 according to an embodiment of the present invention. FIG. 11 is a diagram showing the hardware configuration of the salesperson voice collection device according to an embodiment of the present invention.
Explanation of Reference Numerals
[0099]
10: voice collection system; 11: microphone array; 12: target voice extraction device; 13: customer interaction recording server; 31: speech section index detection processing unit; 32, 33: speech recognition units; 34: integrated selection unit; 35: recording range extraction unit; 60: portable salesperson voice collection device; 105, 106: discrete Fourier transform processing units; 110: CSP coefficient calculation unit; 120: delay-and-sum array processing unit; 130: noise estimation unit; 140: SS processing unit; 150: gain adjustment processing unit; 160: Flooring processing unit