Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2018189985
Abstract: To provide an electronic device that generates speaker identification information for each time based on the temporal existence period of the voice uttered by each sound source, and that displays on the screen an object corresponding to the speaking time of each identified speaker. An electronic device according to an embodiment includes: frequency decomposition means that receives digitized time-series amplitude data of two acoustic signals from a first microphone and a second microphone arranged at a predetermined distance, and generates, in time series, power values and phase values for each frequency of the amplitude data; section detection means that detects a speech section based on the power values and phase values; speech direction estimation means that detects the speech direction; speaker clustering means that generates speaker identification information for each time based on the temporal existence period of the speech emitted by each sound source; and user interface display processing means that visibly displays, based on the speaker identification information, an object corresponding to the speaking time of each speaker on the display screen. [Selected figure] Figure 3
Electronic device and control method of electronic device
[0001]
Embodiments of the present invention relate to techniques for estimating the direction of a
speaker and displaying on the screen an object corresponding to the speaking time of each
speaker identified.
[0002]
There has been developed an electronic device for estimating the direction of a speaker based on
the phase difference for each frequency component of speech input to a plurality of
microphones.
[0003]
JP 2006-254226 A
[0004]
If voice is collected while the user holds the electronic device, the accuracy of estimating the
direction of the speaker may be reduced.
[0005]
An object of the present invention is to provide an electronic device, and a control method of the electronic device, for estimating the direction of a speaker and displaying on a screen an object corresponding to the speaking time of each identified speaker.
[0006]
The electronic device according to the embodiment includes a first microphone and a second microphone arranged at a predetermined distance, a frequency resolution unit, a section detection unit, a speech direction estimation unit, a speaker clustering unit, and a user interface display processing unit.
The frequency resolution means receives digitized time-series amplitude data of two acoustic signals from the first microphone and the second microphone, and generates, in time series, power values and phase values for each frequency of the amplitude data.
The section detecting means detects an audio section based on the result of the power value and
the phase value in the frequency resolving means.
The speech direction estimation means detects the speech direction of the voice section based on
the detection result of the section detection means.
The speaker clustering means generates speaker identification information for each time based
on the temporal existence period of the speech emitted by each sound source output from the
speech direction estimation means.
The user interface display processing means visibly displays an object corresponding to the
speaking time of each speaker on the display screen based on the speaker identification
information from the speaker clustering means.
[0007]
A perspective view showing an example of the external appearance of the electronic device of the embodiment. A block diagram showing the configuration of the electronic device of the embodiment. A functional block diagram of the recording application. A figure showing sound source directions and the arrival time differences observed in the acoustic signals. A figure showing the relationship between a frame and the frame shift amount. A figure showing the procedure of the FFT process and the short-time Fourier transform data. A functional block diagram of the speech direction estimation unit. A functional block diagram showing the internal configuration of the two-dimensional data conversion unit and the figure detection unit. A figure showing the procedure of the phase difference calculation. A figure showing the procedure of the coordinate value calculation. A functional block diagram showing the internal configuration of the sound source information generation unit. A figure for explaining direction estimation. A figure showing the relationship between θ and ΔT. A view showing an example of a screen displayed by the user interface display processing unit. A flowchart showing the procedure for initializing the data related to speaker identification.
[0008]
Embodiments will be described below with reference to the drawings.
[0009]
First, the configuration of the electronic device of the present embodiment will be described with reference to FIGS. 1 and 2.
This electronic device can be realized as a portable terminal, for example, a tablet personal computer, a laptop or notebook personal computer, or a PDA. Hereinafter, it is assumed that the electronic device is realized as a tablet personal computer 10 (hereinafter referred to as the computer 10).
[0010]
FIG. 1 is a view showing the appearance of the computer 10. The computer 10 comprises a
computer main body 11 and a touch screen display 17. The computer main body 11 has a thin
box-shaped housing. The touch screen display 17 is disposed on the surface of the computer
main body 11. The touch screen display 17 includes a flat panel display (for example, a liquid
crystal display (LCD)) and a touch panel. The touch panel is provided to cover the screen of the
LCD. The touch panel is configured to detect the position on the touch screen display 17 touched
by the user's finger or pen.
[0011]
FIG. 2 is a block diagram showing the system configuration of the computer 10. As shown in FIG. 2, the computer 10 includes a touch screen display 17, a CPU 101, a system controller 102, a main memory 103, a graphics controller 104, a BIOS-ROM 105, a non-volatile memory 106, an embedded controller (EC) 108, microphones 109A and 109B, an acceleration sensor 110, and the like.
[0012]
The CPU 101 is a processor that controls the operation of various modules in the computer 10.
The CPU 101 executes various software loaded from the non-volatile memory 106, which is a
storage device, to the main memory 103, which is a volatile memory. The software includes an
operating system (OS) 200 and various application programs. The various application programs
include a recording application 300.
[0013]
The CPU 101 also executes a basic input / output system (BIOS) stored in the BIOS-ROM 105.
The BIOS is a program for hardware control.
[0014]
The system controller 102 is a device that connects between the local bus of the CPU 101 and
various components. The system controller 102 also incorporates a memory controller that
controls access to the main memory 103. The system controller 102 also has a function of
executing communication with the graphics controller 104 via a PCI Express standard serial bus
or the like.
[0015]
The graphics controller 104 is a display controller that controls the LCD 17A used as a display
monitor of the computer 10. The display signal generated by the graphics controller 104 is sent
to the LCD 17A. The LCD 17A displays a screen image based on the display signal. A touch panel
17B is disposed on the LCD 17A. The touch panel 17B is a capacitive pointing device for
performing input on the screen of the LCD 17A. The touch position on the screen where the
finger is touched, the movement of the touch position, and the like are detected by the touch
panel 17B.
[0016]
The EC 108 is a one-chip microcomputer including an embedded controller for power
management. The EC 108 has a function of powering on or off the computer 10 in accordance
with the operation of the power button by the user.
[0017]
The acceleration sensor 110 detects an acceleration applied to the electronic device 10 in the x,
y, z directions. It is possible to detect the orientation of the electronic device 10 by detecting the
acceleration in the x, y, z axis directions.
[0018]
FIG. 3 is a functional block diagram of the recording application 300. As shown in FIG. 3, the recording application 300 includes a frequency decomposition unit 301, a voice section detection unit 302, an utterance direction estimation unit 303, a speaker clustering unit 304, a user interface display processing unit 305, a recording processing unit 306, a control unit 307, and the like.
[0019]
The recording processing unit 306 performs a recording process by performing compression
processing and the like on the audio data input from the microphone 109A and the microphone
109B and storing the audio data in the storage device 106.
[0020]
The control unit 307 can control the operation of each unit of the recording application 300.
[0021]
[Basic concept of sound source estimation based on phase difference for each frequency component] The microphones 109A and 109B are two microphones disposed at a predetermined distance from each other in a medium such as air, and serve as means for converting the medium vibrations (sound waves) at two different points into electric signals (acoustic signals).
Hereinafter, when the microphone 109A and the microphone 109B are handled collectively, they are referred to as a microphone pair.
[0022]
The acoustic signal input unit digitizes the two acoustic signals 403 and 404 from the microphone 109A and the microphone 109B by periodically A/D converting them at a predetermined sampling frequency Fr, and generates the resulting amplitude data in time series.
[0023]
Assuming that the sound source is positioned sufficiently far compared to the distance between the microphones, as shown in FIG. 4A, the wave front 401 of the sound wave emitted from the sound source 400 and reaching the microphone pair becomes almost a plane.
When this plane wave is observed at two different points using the microphone 109A and the microphone 109B, an arrival time difference ΔT corresponding to the direction R of the sound source 400 with respect to the line segment 402 connecting the microphone 109A and the microphone 109B (this line is called the baseline) should be observed in the acoustic signals converted by the microphone pair.
When the sound source is sufficiently far, this arrival time difference ΔT is 0 when the sound source 400 exists on a plane perpendicular to the baseline 402, and this direction is defined as the front direction of the microphone pair.
[0024]
[Frequency Decomposition Unit] As a general method for decomposing amplitude data into frequency components, there is the fast Fourier transform (FFT). The Cooley-Tukey FFT algorithm is known as a representative algorithm.
[0025]
As shown in FIG. 5, the frequency decomposition unit 301 extracts N consecutive samples of amplitude data as a frame (T-th frame 411) from the amplitude data 410 supplied by the acoustic signal input unit and performs a fast Fourier transform on it, repeating this while shifting the extraction position by the frame shift amount 413 (T+1st frame 412).
[0026]
The amplitude data constituting the frame is subjected to windowing 601 as shown in FIG. 6A,
and then fast Fourier transform 602 is performed.
As a result, short-time Fourier transform data of the input frame is generated in the real part buffer R[N] and the imaginary part buffer I[N] (603). An example of the window function (Hamming window or Hanning window) 605 is shown in FIG. 6(B).
[0027]
The short-time Fourier transform data generated here is data obtained by decomposing the amplitude data of the frame into N/2 frequency components. For the k-th frequency component fk, the values of the real part R[k] and the imaginary part I[k] in the buffer 603 represent a point Pk on the complex coordinate system 604, as shown in FIG. 6(C). The square of the distance of Pk from the origin O is the power Po(fk) of that frequency component, and the signed rotation angle θ {θ: −π < θ ≤ π [radian]} of Pk from the real axis is its phase Ph(fk).
[0028]
When the sampling frequency is Fr [Hz] and the frame length is N [samples], k takes integer values from 0 to (N/2)−1, where k = 0 represents 0 [Hz] (direct current) and k = (N/2)−1 represents Fr/2 [Hz] (the highest frequency component). This interval is equally divided by the frequency resolution Δf = (Fr/2)/((N/2)−1) [Hz], and the frequency at each k is given by fk = k·Δf.
[0029]
Note that, as described above, the frequency decomposition unit 301 performs this process continuously at a predetermined interval (frame shift amount Fs) to generate, in time series, frequency decomposition data sets consisting of a power value and a phase value for each frequency of the input amplitude data.
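As a rough illustration of the frame-wise decomposition described above, the sketch below (Python with NumPy; the frame length n and frame shift values are placeholder parameters, not taken from the patent, and NumPy's rfft yields N/2+1 bins rather than the N/2 components described) computes a power value and a phase value for each frequency bin of each frame.

```python
import numpy as np

def frequency_decompose(amplitude, n=512, frame_shift=160):
    """Sketch of the frequency decomposition unit: windowed FFT per frame,
    returning power and phase for each frequency bin, in time series."""
    window = np.hanning(n)                      # Hanning windowing (FIG. 6(B))
    datasets = []
    for start in range(0, len(amplitude) - n + 1, frame_shift):
        frame = amplitude[start:start + n] * window
        spectrum = np.fft.rfft(frame)           # real and imaginary parts R[k], I[k]
        power = np.abs(spectrum) ** 2           # Po(fk): squared distance of Pk from the origin
        phase = np.angle(spectrum)              # Ph(fk): signed angle from the real axis
        datasets.append((power, phase))
    return datasets
```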
[0030]
[Voice Section Detection Unit] The voice section detection unit 302 detects a voice section based
on the result of the frequency decomposition unit 301.
[0031]
[Utterance Direction Estimation Unit] The speech direction estimation unit 303 detects the speech direction of the speech segment based on the detection result of the speech segment detection unit 302.
FIG. 7 is a functional block diagram of the speech direction estimation unit 303. As shown in FIG. 7, the speech direction estimation unit 303 includes a two-dimensional data conversion unit 701, a figure detection unit 702, a sound source information generation unit 703, and an output unit 704.
[0032]
(Two-Dimensional Data Conversion Unit and Figure Detection Unit) As shown in FIG. 8, the two-dimensional data conversion unit 701 includes a phase difference calculation unit 801 and a coordinate value determination unit 802.
The figure detection unit 702 includes a voting unit 811 and a straight line detection unit 812.
[0033]
[Phase Difference Calculation Unit] The phase difference calculation unit 801 compares the two frequency-resolved data sets a and b at the same time obtained by the frequency decomposition unit 301, and generates phase difference data between a and b by calculating the difference between the phase values of the corresponding frequency components. For example, as shown in FIG. 9, the phase difference ΔPh(fk) of a certain frequency component fk is calculated as the difference between the phase value Ph1(fk) at the microphone 109A and the phase value Ph2(fk) at the microphone 109B, taken modulo 2π so as to fall within {ΔPh(fk): −π < ΔPh(fk) ≤ π}.
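A minimal sketch of this phase-difference calculation, assuming the two phase arrays come from the frequency decomposition of the signals at microphones 109A and 109B; the modulo arithmetic wraps the result into (−π, π] as stated above.

```python
import numpy as np

def phase_difference(phase_a, phase_b):
    """ΔPh(fk) = Ph1(fk) − Ph2(fk), wrapped into the interval (−π, π]."""
    diff = phase_a - phase_b
    # reduce modulo 2π; the double negation keeps +π (rather than −π) in range
    return -((-diff + np.pi) % (2 * np.pi) - np.pi)
```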
[0034]
[Coordinate Value Determination Unit] The coordinate value determination unit 802 is a means for determining, based on the phase difference data obtained by the phase difference calculation unit 801, coordinate values that allow the phase difference of each frequency component to be treated as a point on a two-dimensional XY coordinate system. The X coordinate value x(fk) and the Y coordinate value y(fk) corresponding to the phase difference ΔPh(fk) of a certain frequency component fk are determined by the equation shown in FIG. 10: the X coordinate value is the phase difference ΔPh(fk), and the Y coordinate value is the frequency component number k.
[0035]
[Voting Unit] The voting unit 811 is a means for applying a linear Hough transform to each frequency component given (x, y) coordinates by the coordinate value determination unit 802, and for voting its locus into the Hough voting space by a predetermined method.
[0036]
[Straight Line Detection Unit] The straight line detection unit 812 is a means that analyzes the vote distribution on the Hough voting space generated by the voting unit 811 to detect strong straight lines.
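The following sketch illustrates the kind of linear Hough voting and peak picking that the voting unit 811 and the straight line detection unit 812 perform. It is a simplified stand-in rather than the patent's own voting scheme: the points would be the (x, y) = (ΔPh(fk), k) pairs of FIG. 10, and the bin counts and threshold are arbitrary placeholders.

```python
import numpy as np

def hough_vote(points, theta_bins=180, rho_bins=200):
    """Each (x, y) point votes for the (theta, rho) cells of the lines through it,
    using rho = x*cos(theta) + y*sin(theta)."""
    thetas = np.linspace(-np.pi / 2, np.pi / 2, theta_bins)
    rho_max = max(np.hypot(x, y) for x, y in points) + 1e-9
    accumulator = np.zeros((theta_bins, rho_bins))
    for x, y in points:
        rho = x * np.cos(thetas) + y * np.sin(thetas)          # locus of this point
        idx = np.round((rho + rho_max) / (2 * rho_max) * (rho_bins - 1)).astype(int)
        accumulator[np.arange(theta_bins), idx] += 1
    return thetas, accumulator

def detect_lines(thetas, accumulator, threshold):
    """Pick accumulator cells whose vote count reaches the threshold (dominant lines)."""
    peaks = np.argwhere(accumulator >= threshold)
    return [(thetas[t], r) for t, r in peaks]
```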
[0037]
[Sound Source Information Generating Unit] As shown in FIG. 11, the sound source information generating unit 703 includes a direction estimating unit 1111, a sound source component estimating unit 1112, a sound source sound re-synthesizing unit 1113, a time series tracking unit 1114, a duration evaluation unit 1115, an in-phase unit 1116, an adaptive array processing unit 1117, and a voice recognition unit 1118.
[0038]
[Direction Estimation Unit] The direction estimation unit 1111 receives the straight line detection
result by the straight line detection unit 812 described above, that is, the θ value for each
straight line group, and calculates the existing range of the sound source corresponding to each
straight line group.
At this time, the number of detected straight line groups is the number of sound sources (all
candidates).
If the distance to the sound source is sufficiently large compared to the baseline of the microphone pair, the existence range of the sound source is a conical surface forming a certain angle with the baseline of the microphone pair.
This will be described with reference to FIG. 12.
[0039]
The arrival time difference ΔT between the microphone 109A and the microphone 109B can change in the range of ±ΔTmax. As shown in FIG. 12A, when the sound is incident from the front, ΔT is 0 and the azimuth angle φ of the sound source is 0° with reference to the front. As shown in FIG. 12B, when the sound is incident directly from the right, that is, from the direction of the microphone 109B, ΔT is equal to +ΔTmax and the azimuth angle φ of the sound source is +90°, taking clockwise as positive with reference to the front. Similarly, as shown in FIG. 12C, when the sound is incident directly from the left, that is, from the direction of the microphone 109A, ΔT is equal to −ΔTmax and the azimuth angle φ is −90°. Thus, ΔT is defined as positive when the sound is incident from the right and negative when the sound is incident from the left.
[0040]
Based on the above, the general situation shown in FIG. 12(D) is considered. Let A be the position of the microphone 109A and B be the position of the microphone 109B, and assume that the voice arrives from the direction of the line segment PA; then ΔPAB is a right triangle with a right angle at the vertex P. With the midpoint O between the microphones and the line segment OC taken as the front direction of the microphone pair, the azimuth angle φ is defined as the angle from the OC direction (azimuth 0°), with the same sign convention as ΔT. Since ΔQOB is similar to ΔPAB, the absolute value of the azimuth angle φ is equal to ∠OBQ, that is, ∠ABP, and its sign corresponds to the sign of ΔT. ∠ABP can be calculated as sin<−1> of the ratio of PA to AB. Here, if the length of the line segment PA is represented by the corresponding ΔT, the length of the line segment AB corresponds to ΔTmax. Therefore, including the sign, the azimuth can be calculated as φ = sin<−1>(ΔT / ΔTmax). The existence range of the sound source is then estimated as a conical surface 1200 opened at (90−φ)° with the point O as its vertex and the baseline AB as its axis. The sound source is somewhere on this conical surface 1200.
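A minimal sketch of the azimuth computation φ = sin<−1>(ΔT/ΔTmax) derived above; the clamping only guards against slight numerical overshoot of the ratio.

```python
import numpy as np

def azimuth_from_delay(delta_t, delta_t_max):
    """Azimuth φ in degrees from the signed arrival time difference ΔT:
    +90° when sound arrives from the 109B side, −90° from the 109A side."""
    ratio = np.clip(delta_t / delta_t_max, -1.0, 1.0)
    return np.degrees(np.arcsin(ratio))
```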
[0041]
As shown in FIG. 13, ΔTmax is the value obtained by dividing the distance L [m] between the microphones by the sound velocity Vs [m/sec]. It is known that the sound velocity Vs can be approximated as a function of the air temperature t [°C]. Now, suppose that the straight line 1300 is detected by the straight line detection unit 812 with a Hough inclination θ. Since the straight line 1300 is inclined to the right, θ is a negative value. At y = k (frequency fk), the phase difference ΔPh indicated by the straight line 1300 is obtained as k·tan(−θ), a function of k and θ. Then ΔT [sec] is the time obtained by multiplying one period (1/fk) [sec] of the frequency fk by the ratio of the phase difference ΔPh(θ, k) to 2π. Since θ is a signed quantity, ΔT is also a signed quantity. That is, when the sound is incident from the right in FIG. 12D (the phase difference ΔPh is positive), θ is negative, and when the sound is incident from the left in FIG. 12D (the phase difference ΔPh is negative), θ is positive; therefore, the sign of θ is inverted. In the actual calculation, the computation may be performed at k = 1 (the frequency immediately above the DC component k = 0).
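Putting the relations of this paragraph together in a small sketch: ΔTmax = L/Vs, ΔPh = k·tan(−θ), and ΔT = (1/fk)·ΔPh/2π. The temperature-dependent approximation Vs ≈ 331.5 + 0.6·t m/s is a standard formula assumed here, not stated in the text; per the note above, the function would typically be called with k = 1 and fk = Δf.

```python
import numpy as np

def delay_and_azimuth(theta, k, fk, mic_distance, temp_c=20.0):
    """Sketch: derive the signed arrival time difference ΔT from the Hough
    inclination θ of a detected line, then the azimuth φ."""
    vs = 331.5 + 0.6 * temp_c                # approximate sound velocity [m/s] (assumed formula)
    dt_max = mic_distance / vs               # ΔTmax = L / Vs (FIG. 13)
    dph = k * np.tan(-theta)                 # phase difference indicated by the line at y = k
    dt = (1.0 / fk) * dph / (2 * np.pi)      # ΔT: fraction of one period of fk
    phi = np.degrees(np.arcsin(np.clip(dt / dt_max, -1.0, 1.0)))
    return dt, phi
```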
[0042]
[Sound source component estimation unit] The sound source component estimation unit 1112 evaluates the distance between the (x, y) coordinate value of each frequency component given by the coordinate value determination unit 802 and each straight line detected by the straight line detection unit 812, detects points (i.e., frequency components) located in the vicinity of a straight line as frequency components of that straight line (i.e., sound source), and estimates the frequency components of each sound source based on the detection result.
[0043]
[Sound source sound re-synthesis unit] The sound source sound re-synthesis unit 1113 performs inverse FFT processing on the frequency components of the same acquisition time that make up each sound source sound, thereby re-synthesizing the sound source sound (its amplitude data).
As illustrated in FIG. 5, one frame overlaps the next frame with a time difference equal to the frame shift amount. In the period where a plurality of frames overlap, the amplitude data of all the overlapping frames can therefore be averaged to form the final amplitude data. Such processing makes it possible to separate and extract the source sound as amplitude data.
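A rough sketch of this re-synthesis, assuming each frame is represented by its complex spectrum with the bins not belonging to the sound source zeroed out; window compensation and other refinements are omitted.

```python
import numpy as np

def resynthesize(frame_spectra, n=512, frame_shift=160):
    """Inverse-FFT each frame's spectrum and average the samples where frames overlap."""
    total = frame_shift * (len(frame_spectra) - 1) + n
    acc = np.zeros(total)
    count = np.zeros(total)
    for i, spectrum in enumerate(frame_spectra):
        frame = np.fft.irfft(spectrum, n)    # back to amplitude data for this frame
        start = i * frame_shift
        acc[start:start + n] += frame
        count[start:start + n] += 1
    return acc / np.maximum(count, 1)        # average over all overlapping frames
```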
[0044]
[Time Series Tracking Unit] The straight line detection unit 812 obtains a straight line group for every Hough voting performed by the voting unit 811. Hough voting is performed collectively on m consecutive (m ≥ 1) FFT results, so straight line groups are obtained in time series with a period of m frames (this will be referred to as the "figure detection period"). Further, since θ of a straight line group corresponds one-to-one to the sound source direction φ calculated by the direction estimation unit 1111, for a stable sound source, whether stationary or moving, the trajectory of θ (or φ) on the time axis should be continuous. On the other hand, depending on how the threshold is set, the straight line groups detected by the straight line detection unit 812 may include straight line groups corresponding to background noise (referred to as "noise straight line groups"). However, the trajectory of θ (or φ) of such a noise straight line group on the time axis can be expected to be discontinuous or short.
[0045]
The time series tracking unit 1114 is a means for obtaining the trajectory of φ on the time axis by grouping the values of φ thus obtained for each figure detection period into groups that are continuous on the time axis.
[0046]
[Duration Evaluation Unit] The duration evaluation unit 1115 calculates the duration of each trajectory from the start time and the end time of the trajectory data whose tracking has been completed, output by the time series tracking unit 1114; trajectories whose duration is equal to or longer than a predetermined threshold are recognized as trajectory data based on sound source sound, and the others are recognized as trajectory data based on noise.
Trajectory data based on sound source sound is called sound source stream information. The sound source stream information includes the start time Ts and the end time Te of the sound source sound and time-series trajectory data of θ, ρ, and φ representing the sound source direction. Although the number of straight line groups obtained by the figure detection unit 702 gives the number of sound sources, noise sources are also included in it; the number of sound source stream information items obtained by the duration evaluation unit 1115 gives the number of reliable sound sources excluding those based on noise.
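A minimal sketch of the duration test, assuming each trajectory is represented by a start time 'ts', an end time 'te', and its direction data (the dictionary keys are placeholders):

```python
def split_streams(trajectories, min_duration):
    """Keep trajectories whose duration reaches the threshold as sound source streams;
    treat the rest as noise."""
    streams, noise = [], []
    for traj in trajectories:                # traj: dict with 'ts', 'te', 'phi', ...
        duration = traj['te'] - traj['ts']
        (streams if duration >= min_duration else noise).append(traj)
    return streams, noise
```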
[0047]
[In-phase unit] The in-phase unit 1116 obtains the time transition of the sound source direction φ of a stream by referring to the sound source stream information from the time series tracking unit 1114, calculates the intermediate value φmid = (φmax + φmin) / 2 from the maximum value φmax and the minimum value φmin of φ, and determines the width φw = φmax − φmid. Then, the time series data of the two frequency-resolved data sets a and b from which the sound source stream information originates is extracted from a predetermined time before the start time Ts of the stream to a predetermined time after the end time Te, and is brought into phase by correcting the phase difference so as to cancel the arrival time difference calculated back from the intermediate value φmid.
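As a rough sketch of the in-phase correction, assuming the per-frequency phases of data set b are rotated so as to cancel the arrival time difference implied by φmid (the inverse of φ = sin<−1>(ΔT/ΔTmax)); the sign convention and which channel is advanced or delayed are assumptions, since the text does not spell them out.

```python
import numpy as np

def align_phase(phase_b, freqs, phi_max, phi_min, dt_max):
    """Rotate the phases of data set b so that a source at φmid appears at the front (0°)."""
    phi_mid = 0.5 * (phi_max + phi_min)               # centre direction of the stream
    dt = dt_max * np.sin(np.radians(phi_mid))         # arrival time difference to cancel
    corrected = phase_b + 2 * np.pi * freqs * dt      # rotate each frequency bin fk by 2π·fk·ΔT
    return -((-corrected + np.pi) % (2 * np.pi) - np.pi)   # wrap back into (−π, π]
```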
[0048]
Alternatively, the time-series data of the two frequency-resolved data sets a and b can always be kept in phase by using the sound source direction φ estimated at each time by the direction estimation unit 1111 as φmid. Whether to refer to the sound source stream information or to φ at each time is determined by the operation mode, which can be set and changed as a parameter.
[0049]
[Adaptive array processing unit] The adaptive array processing unit 1117 separates and extracts, with high accuracy, the time series data of the frequency components of the sound source sound of a stream by applying adaptive array processing to the extracted and phase-aligned time series data of the two frequency-resolved data sets a and b, with the center of directivity aimed at the front (0°) and a tracking range of ±φw plus a predetermined margin. Although the method differs, this functions in the same way as the sound source component estimation unit 1112 in that time series data of frequency components are separated and extracted. Therefore, the sound source sound re-synthesis unit 1113 can also re-synthesize the amplitude data of the sound source sound from the time series data of the frequency components obtained by the adaptive array processing unit 1117.
[0050]
As the adaptive array processing, it is possible to apply a method that clearly separates and extracts speech within a set directivity range, such as the "Griffiths-Jim type generalized sidelobe canceller," which is known as a way of constructing a beamformer from main and sub beamformers, as described in reference 3 (Amada et al., "Microphone array technology for speech recognition," Toshiba Review, Vol. 59, No. 9, 2004).
[0051]
Usually, adaptive array processing sets the tracking range in advance and waits only for voice from that direction, so waiting for voice from all directions would require preparing a large number of adaptive arrays with different tracking ranges.
In this embodiment, on the other hand, since the number of sound sources and their directions are obtained first, only as many adaptive arrays as there are sound sources need to be operated, and the tracking range can be set narrowly according to the direction of each sound source, so speech can be separated and extracted efficiently and with good quality.
[0052]
At this time, by bringing the time series data of the two frequency-resolved data sets a and b into phase in advance, sound from any direction can be processed simply by setting the tracking range of the adaptive array processing to the front.
[0053]
[Speech recognition unit] The speech recognition unit 1118 analyzes and collates the time-series data of the frequency components of a sound source extracted by the sound source component estimation unit 1112 or the adaptive array processing unit 1117 to obtain the symbolic content of the stream, that is, to extract symbols (strings) representing its linguistic meaning, the type of sound source, and the distinction between speakers.
[0054]
In addition, the functional blocks from the direction estimation unit 1111 to the voice recognition unit 1118 can exchange information as needed via connections not shown in FIG. 11.
[0055]
The output unit 704 is a means for outputting sound source information generated by the sound source information generation unit 703, the information including at least one of: the number of sound sources, obtained as the number of straight line groups by the figure detection unit 702; the spatial existence range of each sound source (the angle φ determining the conical surface), estimated by the direction estimation unit 1111; the component composition of the speech emitted from each sound source (time series data of power and phase for each frequency component), estimated by the sound source component estimation unit 1112; the separated speech (time series data of amplitude values) for each sound source, re-synthesized by the sound source sound re-synthesis unit 1113; the number of sound sources excluding noise sources, determined based on the time series tracking unit 1114 and the duration evaluation unit 1115; the temporal existence period of the voice emitted by each sound source, determined by the time series tracking unit 1114 and the duration evaluation unit 1115; the separated speech (time series data of amplitude values) for each sound source, determined by the in-phase unit 1116 and the adaptive array processing unit 1117; and the symbolic content of the speech of each sound source, obtained by the speech recognition unit 1118.
[0056]
[Speaker Clustering Unit] The speaker clustering unit 304 generates the speaker identification information 310 for each time based on the temporal existence period of the voice emitted by each sound source, which is output from the output unit 704.
The speaker identification information 310 includes information associating each identified speaker with the speech start time and the speech end time.
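The patent does not spell out the clustering algorithm itself; the following is a hypothetical sketch of one way the speaker clustering unit 304 could map sound source streams (direction plus existence period) to per-time speaker identification information: a stream whose direction lies close to a known speaker's direction reuses that speaker ID, otherwise a new ID is created. The angle tolerance and the stream fields are assumptions.

```python
def cluster_speakers(streams, angle_tolerance=15.0):
    """Hypothetical sketch: assign a speaker ID to each stream by its direction,
    producing (speaker, start, end) records as speaker identification information."""
    speakers = []                                        # representative direction per known speaker
    records = []
    for s in sorted(streams, key=lambda s: s['ts']):     # s: {'ts', 'te', 'phi'}
        for spk_id, direction in enumerate(speakers):
            if abs(s['phi'] - direction) <= angle_tolerance:
                break                                    # reuse an existing speaker ID
        else:
            spk_id = len(speakers)                       # otherwise register a new speaker
            speakers.append(s['phi'])
        records.append({'speaker': spk_id, 'start': s['ts'], 'end': s['te']})
    return records
```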
[0057]
[User Interface Display Processing Unit] The user interface display processing unit 305 presents to the user the various setting contents necessary for the above-mentioned acoustic signal processing, accepts setting input from the user, and saves the settings to and reads them from the external storage device. It is also a means for visualizing various processing results and intermediate results and presenting them to the user, such as (1) display of the frequency components for each microphone, (2) display of the phase difference (or time difference) plot (that is, display of the two-dimensional data), (3) display of the various vote distributions, (4) display of the maximum positions, (5) display of the straight line groups on the plot, (6) display of the frequency components belonging to a straight line group, and (7) display of the trajectory data, and for allowing the user to select desired data for more detailed visualization.
By doing this, the user can confirm the operation of the acoustic signal processing apparatus according to the present embodiment, make adjustments so that it performs the desired operation, and thereafter use the apparatus in the adjusted state.
[0058]
The user interface display processing unit 305 displays, for example, a screen shown in FIG. 14
on the LCD 17A based on the speaker identification information 310.
[0059]
At the top of the LCD 17A, an object 1401 indicating the speaker A, an object 1402 indicating
the speaker B, and an object 1403 indicating the speaker C are shown.
At the bottom of the LCD 17A, objects 1413A, 1411A, 1413B, 1412, and 1411B corresponding to the speaking times of the speakers are displayed.
The objects 1411A and 1411B correspond to the speaking time of the speaker A, and are
displayed in a color corresponding to the object 1401.
The object 1412 corresponds to the speaking time of the speaker B, and is displayed in a color
corresponding to the object 1402. The objects 1413A and 1413B correspond to the speaking
time of the speaker C, and are displayed in a color corresponding to the object 1403. When there is an utterance, the objects 1413A, 1411A, 1413B, 1412, and 1411B are displayed so as to flow from right to left with time.
[0060]
Speaker identification based on the phase difference between the microphones loses accuracy when the terminal is moved during recording. The present device therefore uses the acceleration in the x, y, and z axis directions obtained from the acceleration sensor 110 and the inclination of the terminal for speaker identification, to suppress the loss of convenience caused by degraded accuracy.
[0061]
The control unit 307 requests the speech direction estimation unit 303 to initialize data related
to the process of estimating the direction of the speaker according to the acceleration detected
by the acceleration sensor.
[0062]
FIG. 15 is a flow chart showing a procedure for initializing data relating to speaker identification.
[0063]
The control unit 307 determines whether the difference between the current tilt of the device 10
obtained from the acceleration sensor 110 and the tilt of the device 10 at the start of the speaker
identification exceeds a threshold (step B11).
When it is determined that the threshold value is exceeded (Yes in step B11), the control unit
307 requests the speech direction estimation unit 303 to initialize data relating to speaker
identification (step B12).
The speech direction estimation unit 303 initializes data relating to speaker identification (step
B13). Then, the speech direction estimation unit 303 performs speaker identification processing
based on the data newly generated by each unit in the speech direction estimation unit 303.
[0064]
If it is determined that the threshold is not exceeded (No in step B11), the control unit 307 determines whether the acceleration values in the x, y, and z directions of the device 10 obtained from the acceleration sensor 110 have become periodic (step B14). When it is determined that the acceleration values are periodic (Yes in step B14), the control unit 307 requests the recording processing unit 306 to stop the recording process (step B15). The control unit 307 also requests the frequency decomposition unit 301, the speech segment detection unit 302, the speech direction estimation unit 303, and the speaker clustering unit 304 to stop processing. The recording processing unit 306 stops the recording process (step B16). The frequency decomposition unit 301, the speech segment detection unit 302, the speech direction estimation unit 303, and the speaker clustering unit 304 stop processing.
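The flow of steps B11 to B16 can be summarized in the following sketch. The controller methods, the tilt representation, and the periodicity test are hypothetical stand-ins for the requests the control unit 307 issues and for details the patent leaves to the implementation.

```python
def on_acceleration_update(controller, tilt_now, tilt_at_start,
                           accel_history, tilt_threshold, is_periodic):
    """Sketch of the control logic of FIG. 15 (steps B11 to B16)."""
    if abs(tilt_now - tilt_at_start) > tilt_threshold:     # step B11: tilt changed too much
        controller.request_initialize_speaker_data()       # steps B12-B13: reset identification data
    elif is_periodic(accel_history):                        # step B14: periodic acceleration (e.g. walking)
        controller.request_stop_recording()                 # steps B15-B16: stop recording
        controller.request_stop_analysis()                  # stop frequency decomposition, clustering, etc.
```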
[0065]
According to the present embodiment, by requesting the speech direction estimation unit 303 to initialize the data related to the process of estimating the direction of the speaker according to the acceleration detected by the acceleration sensor 110, it is possible to suppress the decrease in the accuracy of estimating the direction of the speaker even if voice is collected while the user holds the device.
[0066]
Note that since the various processes of the present embodiment can be realized by a computer program, similar effects can easily be achieved simply by installing the computer program in a computer through a computer-readable storage medium storing it and executing it.
[0067]
While certain embodiments of the present invention have been described, these embodiments
have been presented by way of example only, and are not intended to limit the scope of the
invention.
These novel embodiments can be implemented in various other forms, and various omissions,
substitutions, and modifications can be made without departing from the scope of the invention.
These embodiments and modifications thereof are included in the scope and the gist of the
invention, and are included in the invention described in the claims and the equivalent scope
thereof.
[0068]
DESCRIPTION OF SYMBOLS 10: tablet personal computer (electronic device), 101: CPU, 103: main memory, 106: storage device (non-volatile memory), 108: embedded controller, 109A: microphone, 109B: microphone, 110: acceleration sensor, 200: operating system, 300: recording application, 301: frequency decomposition unit, 302: voice section detection unit, 303: speech direction estimation unit, 304: speaker clustering unit, 305: user interface display processing unit, 306: recording processing unit, 307: control unit