DESCRIPTION JP2009199158
PROBLEM TO BE SOLVED: To provide an acoustic pointing device capable of performing a pointing operation without placing any accessory on a desk.
SOLUTION: The device includes: a microphone array 101 having two or more microphone elements; an A/D conversion unit 102 for converting analog sound pressure data into digital sound pressure data; a buffering unit 201 for storing the digital sound pressure data; a direction estimation unit 203 for estimating the sound source direction of a sudden sound from the digital sound pressure data based on the correlation of sound between microphone elements; a noise estimation unit 204 for estimating the noise level of the digital sound pressure data; an SNR estimation unit 205 that estimates the ratio of signal components based on the noise level and the digital sound pressure data; a power calculation unit 209 that calculates and outputs an output signal from the ratio of signal components; an integration unit 211 that integrates the sound source direction and the output signal to specify the sound source position; and a control unit 212 for converting the specified sound source position into a point on the screen of the display device based on the data of the screen conversion DB 213. [Selected figure] Figure 2
Acoustic pointing device, sound source position pointing method, and computer system
[0001]
The present invention relates to a pointing device with which a user designates one point on the screen of a display device of a computer, and more particularly to pointing device technology using acoustic information.
[0002]
In general, a pointing device using a mouse is often used to operate a computer.
This is because the mouse and the cursor on the screen of the computer's display device move in conjunction: a desired point on the screen can be selected by moving the cursor over it and clicking.
[0003]
In addition, pointing devices using a touch panel are already in widespread use in consumer products. In a touch panel, an element that detects the pressure with which the user presses the screen is mounted at each point on the display, and it is determined point by point whether that point is pressed.
[0004]
As a pointing device using acoustic information, there is a device using a special pen that emits
an ultrasonic wave when the screen is pressed (see, for example, Patent Document 1).
[0005]
There is also a device that emits light together with the ultrasonic wave and detects the pointed position from the difference between the times at which the ultrasonic wave and the light reach the receiving element (see, for example, Patent Document 2).
[0006]
Further, there is a device that provides a vibration detection element on a display, detects the direction of the vibration generated when a fingertip touches the display, and detects the pointed position based on it (see, for example, Patent Document 3).
[0007]
Patent Document 1: JP 2002-351605 A
Patent Document 2: JP 2002-132436 A
Patent Document 3: JP 2002-351614 A
[0008]
However, with a mouse-based pointing device for operating the computer, the mouse must be placed on a desk, which is not always convenient. The touch panel, although it needs no accessory device, requires a special display in which every point is provided with a pressure detection element, and the user must come close to the display in order to point.
[0009]
In the techniques described in Patent Documents 1 and 2, the user needs a special pen or coordinate input device. Further, the technique described in Patent Document 3 requires touching the display surface so as to generate the vibration to be detected.
[0010]
In view of the above problems, it is an object of the present invention to provide an acoustic pointing device capable of performing a pointing operation using sound information when operating a computer, even from a distant place and without placing any accessory on the desk, as well as a pointing method and a computer system using the acoustic pointing device.
[0011]
In order to solve the above problems, an acoustic pointing device according to the present invention detects the sound source position of a sound to be detected and converts that position into one point on the screen of a display device. It comprises: a microphone array holding the microphone elements; an A/D conversion unit for converting the analog sound pressure data obtained by the microphone array into digital sound pressure data; a direction estimation unit for estimating the sound source direction of the sound to be detected from the digital sound pressure data, based on the correlation of sound between the microphone elements; an output signal calculation unit that estimates the noise level in the digital sound pressure data, calculates the signal component based on the noise level and the digital sound pressure data, and outputs the result as an output signal; an integration unit for integrating the sound source direction and the output signal to specify the sound source position; and a control unit for converting the specified sound source position into one point on the screen of the display device.
[0012]
Furthermore, in the acoustic pointing device according to the present invention, the microphone array includes a plurality of sub microphone arrays, and the device further comprises: a triangulation unit that integrates, by triangulation, the sound source directions estimated by the direction estimation unit for each sub microphone array, and calculates the sound source direction and the distance to the sound source position; and a localization determination unit that determines whether the sound source direction and the distance are within an area defined in advance. The integration unit integrates the output signal with the sound source direction and distance within the area to specify the sound source position, and the control unit converts the specified sound source position into one point on the screen of the display device.
[0013]
Furthermore, in the acoustic pointing device according to the present invention, the microphone array includes a plurality of sub microphone arrays, and the device further comprises: a conversion unit that converts the digital sound pressure data into a signal in the time-frequency domain; a triangulation unit that integrates, by triangulation, the sound source directions estimated by the direction estimation unit using that signal, and calculates the sound source direction and the distance to the sound source position; and a localization determination unit that determines whether the sound source direction and the distance are within an area defined in advance. The integration unit integrates the output signal with the sound source direction and distance within the area to specify the sound source position, and the control unit converts the specified sound source position into one point on the screen of the display device.
[0014]
Furthermore, in the acoustic pointing device according to the present invention, the microphone array includes a plurality of sub microphone arrays, and the device further comprises: a conversion unit that converts the digital sound pressure data into a signal in the time-frequency domain; a triangulation unit that integrates, by triangulation, the sound source directions estimated by the direction estimation unit using that signal, and calculates the sound source direction and the distance to the sound source position; a localization determination unit that determines whether the sound source direction and the distance are within an area defined in advance; an output signal determination unit that determines whether the output signal output from the output signal calculation unit is equal to or greater than a predetermined threshold; a sound source frequency database in which frequency characteristics of the sound to be detected are stored in advance; and a screen conversion database holding a conversion table capable of specifying the one point on the screen from the sound source position. The integration unit weights the output signal that is equal to or greater than the threshold with the frequency characteristics and integrates it with the sound source direction and distance within the area to specify the sound source position, and the control unit converts the specified sound source position into one point on the screen using the information in the screen conversion database.
[0015]
Furthermore, the present invention provides a sound source position pointing method used in the acoustic pointing device, and a computer system provided with the acoustic pointing device.
[0016]
According to the present invention, it is possible to provide an acoustic pointing device capable of performing a pointing operation using sound information, even from a remote place, without placing any accessory device on the desk when operating the computer.
[0017]
Furthermore, it is possible to provide a sound source position pointing method used in the acoustic pointing device.
[0018]
Furthermore, a computer system using the acoustic pointing device can be provided.
[0019]
Hereinafter, embodiments of the present invention will be described in detail with reference to
the attached drawings.
[0020]
FIG. 1 is a schematic block diagram of an acoustic pointing device showing an example of an embodiment according to the present invention. The acoustic pointing device is, for example, a pointing device used in place of the mouse of a personal computer (hereinafter "PC"), with which a user can designate a specific position shown on the display by tapping a desk. Hereinafter, a sound to be detected as a sound source by the acoustic pointing device, such as the sound of striking a desk, is referred to as a "sudden sound". The acoustic pointing device shown in FIG. 1 comprises: a microphone array 101 composed of at least two microphone elements (hereinafter also "microphones"); an A/D (Analogue to Digital) conversion unit 102 for converting the multichannel analog sound pressure data of the sudden sound obtained by each microphone element of the microphone array 101 into digital sound pressure data; a buffering unit 201 for storing a specific amount of the digital sound pressure data; an STFT (Short Term Fourier Transform) unit 202 for converting the digital sound pressure data into time-frequency domain signals; a direction estimation unit 203 that divides the microphone array into a plurality of sub microphone arrays (hereinafter also "sub arrays") and estimates the direction (azimuth and elevation) of the sudden sound from the correlation of sound between the microphone elements of the same sub microphone array; a triangulation unit 206 that integrates the sound source directions determined for each sub microphone array and measures the azimuth angle, elevation angle, and distance of the sound source; a localization determination unit 207 that determines whether the sound source position determined by the triangulation unit 206 is within a predetermined range; a noise estimation unit 204 that estimates the background noise power from the digital sound pressure data; an SNR (Signal to Noise Ratio) estimation unit 205 that estimates the SNR from the digital sound pressure data and the noise power; an SNR determination unit 208 that outputs the SNR where the estimate output from the SNR estimation unit 205 is equal to or greater than a predetermined threshold; a power calculation unit 209 that calculates the signal power from the digital sound pressure data and the SNR; a power determination unit 210 that outputs the signal power where it is equal to or higher than a predetermined threshold; an integration unit 211 that outputs the time-frequency components so identified, within the area defined in advance by the localization determination unit, as sound source position coordinates; and a control unit 212 that converts the sound source position coordinates into a specific point on the display screen.
[0021]
Furthermore, a sound source frequency database (hereinafter "DB") 214, which stores the frequency characteristics of the target sound in advance, and a screen conversion DB 213, which associates sound source coordinates with specific points on the display screen, are provided.
[0022]
When the digital sound pressure data is used only as a time domain signal, the sound source position can be specified while omitting the STFT unit 202, the power determination unit 210, the SNR determination unit 208, and the sound source frequency DB 214. FIG. 2 shows a schematic block diagram of the acoustic pointing device using time domain signals only; this is the minimum configuration for identifying the sound source position. Here, "output signal calculation unit" refers collectively to the noise estimation unit 204, the SNR estimation unit 205, and the power calculation unit 209. To specify the sound source position more accurately, the triangulation unit 206 and the localization determination unit 207 must also be included.
[0023]
FIG. 3 is a hardware configuration diagram of the acoustic pointing device and of a computer system including it. FIG. 3(a) shows the hardware configuration of the acoustic pointing device: the microphone array 101 described above; the A/D conversion unit 102 for converting the analog sound pressure data into digital sound pressure data; a central processing unit 103 that performs the processing of the acoustic pointing device; a volatile memory 104; a storage medium 105 that stores the programs of the acoustic pointing device and the physical coordinates of each microphone element of the microphone array; and a display device 106 for displaying the sound source position as a point on the screen. In the acoustic pointing device shown in FIG. 1, each component except the microphone array 101 and the A/D conversion unit 102 is realized by the central processing unit 103 executing the program while using the volatile memory 104.
[0024]
FIG. 3(b) is a hardware configuration diagram of a computer system provided with the acoustic pointing device. The computer system includes the acoustic pointing device 10, a central processing unit 20 that processes a program using the sound source position information from the acoustic pointing device 10, a storage device 30 used by the program and its arithmetic processing, and a display device for displaying the sound source position on a screen.
[0025]
Next, each component shown in FIG. 1 will be described in more detail.
[0026]
The multichannel digital sound pressure data converted by the A/D conversion unit 102 is stored in the buffering unit 201, a specific amount per channel. In time-frequency domain processing, processing is usually not performed each time a single sample is obtained, but collectively after multiple samples have been obtained. Accordingly, no processing is performed until the specific amount of digital sound pressure data has been stored.
[0027]
The buffering unit 201 has the function of storing this specific amount of digital sound pressure data. The digital sound pressure data obtained by each microphone element is processed separately per element, indexed by i starting from zero. With n an integer, the digital sound pressure data of the i-th microphone element sampled at the n-th time after the start of digital conversion is denoted x_i(n).
[0028]
An STFT (Short Term Fourier Transform) unit 202 converts the digital sound pressure data of each microphone element into a time-frequency domain signal according to (Expression 1) below.
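The expression itself is an image in the source PDF and did not survive text extraction. A standard short-term Fourier transform consistent with the symbols defined below (frame size N, frame shift S, window w(n)), given here as an editor's reconstruction, would be:

$$X_i(f,\tau)=\sum_{n=0}^{N-1} w(n)\,x_i(n+\tau S)\,e^{-2\pi j f n/N}\qquad\text{(Expression 1, reconstructed)}$$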
[0029]
Here, j is defined by (Expression 2).
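Expression 2 is likewise an image lost in extraction; given its role here, it can only define the imaginary unit, $j=\sqrt{-1}$ (Expression 2, reconstructed).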
[0030]
Here, X_i(f, τ) is the f-th frequency component of the i-th element at frame τ. f runs from 0 to N/2. N is the data length of the digital sound pressure data converted into a time-frequency domain signal, usually called the frame size. S, usually called the frame shift, is the amount by which the digital sound pressure data is shifted at each conversion into the time-frequency domain. The buffering unit 201 continues to store digital sound pressure data until S new samples have been acquired for each microphone element; once the S samples are acquired, the STFT unit 202 converts the data into a time-frequency domain signal.
[0031]
τ is called the frame index and corresponds to the number of conversions into the time-frequency domain performed so far; τ starts from 0. w(n) is called a window function; normally, functions such as the Blackman, Hanning, or Hamming window are used. Using a window function enables highly accurate time-frequency resolution.
[0032]
The digital sound pressure data converted into the time-frequency domain signal is sent to the
direction estimation unit 203.
[0033]
The direction estimation unit 203 first divides the microphone elements forming the microphone
array into a plurality of sub microphone arrays.
Then, for each sub microphone array, the sound source direction is estimated in that array's own coordinate system. When the microphone array is divided into, say, R sub microphone arrays, each of the M microphone elements forming it is allocated to at least one of the R sub microphone arrays. An element may be allocated to two or more sub microphone arrays, in which case several sub microphone arrays share the same microphone element.
[0034]
FIG. 4 is a diagram showing the sub microphone arrays. FIG. 4(a) shows a linear arrangement of a sub microphone array. In the linear arrangement, the direction orthogonal to the axis along which the microphone elements are lined up is defined as 0 degrees, and only the angle θ, measured counterclockwise from that direction to the straight line connecting the sound source and the sub microphone array, can be estimated. d denotes the microphone spacing. FIG. 4(b) shows the M microphone elements described above allocated to R sub microphone arrays, with three microphone elements allocated to one sub microphone array.
[0035]
When the two microphone elements of a sub microphone array are arranged parallel to the desk surface, the angle θ is estimated as the horizontal azimuth angle. When the two microphone elements are arranged perpendicular to the desk surface, the angle θ is estimated as the vertical elevation angle. In this way, both azimuth and elevation are estimated.
[0036]
A sub microphone array has at least two microphone elements; in the case of two elements, θ is estimated by (Expression 3).
[0037]
Here, ρ is the phase difference, at frame τ and frequency index f, between the input signals of the two microphone elements. F is the frequency (Hz) of frequency index f, with F = (f + 0.5)/N × Fs/2. Fs is the sampling rate of the A/D conversion unit 102. d is the physical spacing (m) of the two microphone elements, and c is the speed of sound (m/s). The speed of sound varies with the temperature and density of the medium, but it is usually fixed at a single value such as 340 m/s.
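Expression 3 is an image lost in extraction. From the definitions above and the far-field phase relation $\rho = 2\pi F d \sin\theta / c$ for a two-element pair, the natural reconstruction is:

$$\theta=\arcsin\!\left(\frac{c\,\rho}{2\pi F d}\right)\qquad\text{(Expression 3, reconstructed)}$$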
[0038]
Since the direction estimation unit 203 performs the same processing for each time-frequency component, the time-frequency suffix (f, τ) is omitted hereinafter. When a sub microphone array uses three or more microphone elements arranged on a straight line, the direction can be calculated with high accuracy by the SPIRE algorithm for linear arrangements. For details of the SPIRE algorithm, see M. Togami, T. Sumiyoshi, and A. Amano, "Stepwise phase difference restoration method for sound source localization using multiple microphone pairs", ICASSP 2007, vol. I, pp. 117-120, 2007.
[0039]
Since the SPIRE algorithm uses a plurality of microphone pairs with different spacings between adjacent microphone elements (hereinafter "microphone spacing"), the microphone elements constituting a sub microphone array should preferably be arranged so that the spacings differ. The microphone pairs are sorted in ascending order of microphone spacing: let p identify a microphone pair, with p = 1 the pair with the shortest spacing and p = P the pair with the longest. The following processing is performed sequentially from p = 1 to p = P. First, an integer np satisfying (Expression 4) is found.
[0040]
Since the range enclosed by the inequality spans 2π, exactly one solution is always found. Then (Expression 5) is executed.
[0041]
Further, before performing the above processing for p = 1, the following (Expression 6) is set as
an initial value.
[0042]
Further, dp is the spacing between the microphone elements of the p-th microphone pair. After the above processing has been performed up to p = P, the sound source direction is estimated by (Expression 7).
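Expressions 4 to 7 are images lost in extraction. A reconstruction consistent with the stepwise description above and with the cited ICASSP 2007 paper, offered as an editor's sketch rather than the patent's exact notation, is as follows. The unwrapping integer $n_p$ is chosen so that

$$-\pi\le \rho_p+2\pi n_p-\frac{2\pi F d_p}{c}\sin\hat{\theta}_{p-1}<\pi\qquad\text{(Expression 4)}$$

the restored phase and the direction estimate are then updated as

$$\hat{\rho}_p=\rho_p+2\pi n_p,\qquad \hat{\theta}_p=\arcsin\!\left(\frac{c\,\hat{\rho}_p}{2\pi F d_p}\right)\qquad\text{(Expression 5)}$$

with the initial value $\hat{\theta}_0=0$ (Expression 6), and after the pair p = P has been processed the sound source direction is taken as

$$\theta=\hat{\theta}_P\qquad\text{(Expression 7)}$$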
[0043]
The accuracy of sound source direction estimation is known to increase as the microphone spacing increases. However, if the microphone spacing is longer than half the wavelength of the signal whose direction is being estimated, a single direction can no longer be identified from the inter-microphone phase difference: two or more directions with the same phase difference exist (spatial aliasing). The SPIRE algorithm has a mechanism for selecting, among the two or more candidate directions produced by a long microphone spacing, the one closest to the direction determined by a short microphone spacing. It therefore has the advantage that the sound source direction can be estimated with high accuracy even at long microphone spacings where spatial aliasing occurs. When the microphone pairs are in a non-linear arrangement, the SPIRE algorithm for non-linear arrangements makes it possible to calculate the azimuth angle and, depending on the arrangement, the elevation angle as well.
[0044]
If the digital sound pressure data is not converted into a time-frequency domain signal, i.e., is handled only in the time domain, the SPIRE algorithm cannot be used. In the time-domain-only case, the GCC-PHAT (Generalized Cross Correlation PHAse Transform) method is used to estimate the direction.
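As a concrete illustration, here is a minimal GCC-PHAT sketch in Python/NumPy. It is not code from the patent; the function name, the two-channel interface, and the default sound speed are assumptions made for this example.

```python
import numpy as np

def gcc_phat_direction(x0, x1, fs, d, c=340.0):
    """Estimate the arrival angle (rad) of a source from two time-domain
    channels: whiten the cross-spectrum (PHAT), locate the delay peak,
    and convert the delay to an angle using mic spacing d and speed c."""
    n = len(x0) + len(x1)
    X0 = np.fft.rfft(x0, n=n)
    X1 = np.fft.rfft(x1, n=n)
    cross = X0 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(fs * d / c)               # only physically possible delays
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    delay = (np.argmax(np.abs(cc)) - max_shift) / fs
    return np.arcsin(np.clip(delay * c / d, -1.0, 1.0))
```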
[0045]
The noise estimation unit 204 estimates the background noise level from the output signal of the STFT unit 202, using MCRA (Minima Controlled Recursive Averaging) or the like. The MCRA noise estimation process is based on the minimum statistics method, in which the minimum power over the last several frames is used as the noise power estimate for each frequency. In general, voices and desk-striking sounds take large power suddenly at each frequency and rarely hold large power for a long time. Therefore, the component taking the minimum power over several frames can be approximated as a noise-only component, and the noise power can be estimated with high accuracy even during a speech utterance. The noise power estimated for each microphone element and frequency is denoted N_i(f, τ), where i is the index of the microphone element; the noise power is estimated per element and, being updated every frame, depends on τ. The noise estimation unit 204 outputs the estimated noise power N_i(f, τ) for each microphone element and frequency.
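The minimum-statistics core of this estimator is easy to sketch. The following is an illustrative fragment, not the patent's implementation; full MCRA adds recursive smoothing and bias correction on top of this.

```python
import numpy as np

def minimum_statistics_noise(power_spec, window=50):
    """power_spec: (frames, freqs) array of |X_i(f, tau)|^2 for one mic.
    Returns a per-frame noise estimate: the minimum power over the
    trailing `window` frames, independently for each frequency bin."""
    frames, _ = power_spec.shape
    noise = np.empty_like(power_spec)
    for t in range(frames):
        lo = max(0, t - window + 1)
        noise[t] = power_spec[lo:t + 1].min(axis=0)   # tracked noise floor
    return noise
```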
[0046]
In the time-domain-only case, noise is characterized by low power but long duration compared with a sudden sound, so the noise power can likewise be estimated.
[0047]
The SNR estimation unit 205 estimates the signal-to-noise ratio (SNR) from the estimated noise power and the input signal X_i(f, τ) of the microphone array according to (Expression 8) below.
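Expression 8 is an image lost in extraction. A reconstruction consistent with the power calculation of Expression 10 below (so that the recovered signal power equals the input power minus the noise power) is:

$$\mathrm{SNR}_i(f,\tau)=\frac{|X_i(f,\tau)|^2-N_i(f,\tau)}{N_i(f,\tau)}\qquad\text{(Expression 8, reconstructed)}$$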
[0048]
SNR_i(f, τ) is the SNR at frame τ and frequency index f for microphone index i. The SNR estimation unit 205 outputs the estimated SNR. The SNR estimation unit 205 may also smooth the input power in the time direction; smoothing enables stable SNR estimation that is robust to noise.
[0049]
The triangulation unit 206 integrates the sound source directions obtained for each sub microphone array and measures the azimuth angle, the elevation angle, and the distance to the sound source position. The sound source direction determined, in its own coordinate system, by the i-th sub microphone array is represented by (Equation 9).
[0050]
For example, as shown in FIG. 4, the direction orthogonal to the array axis is defined as 0 degrees, and the sound source direction is measured counterclockwise from that direction. In general, the sound source direction consists of two elements, azimuth angle and elevation angle, but when only one of them can be estimated, as when the sub microphone array is arranged in a straight line, the direction may consist of only that one element. The sound source direction obtained in the coordinate system of the i-th sub microphone array is then converted into the absolute coordinate system; let the converted direction be Pi. From the result of the i-th sub microphone array, the sound source can be estimated to lie along the direction Pi. It is therefore appropriate to estimate the intersection point of the sound source directions Pi determined by all the sub microphone arrays as the sound source position, and the triangulation unit 206 outputs that intersection as the sound source position.
[0051]
In general, a single intersection point of the directions Pi may not exist. In such a case, the intersection of each pair of directions is obtained for all pairs of sub microphone arrays, and the average of these intersection points is output as the sound source position. Averaging increases robustness to variation in the intersection positions.
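As an illustration of this pairwise-intersection-and-average step, here is a minimal 2-D sketch. It assumes each sub array contributes an origin point and a unit direction vector in a common absolute coordinate system; all names are illustrative:

```python
import numpy as np
from itertools import combinations

def intersect(o1, d1, o2, d2):
    """Intersection of the lines o1 + t*d1 and o2 + s*d2 in 2-D,
    or None if the directions are (nearly) parallel."""
    A = np.column_stack((d1, -d2))
    if abs(np.linalg.det(A)) < 1e-9:
        return None
    t, _ = np.linalg.solve(A, o2 - o1)
    return o1 + t * d1

def triangulate(origins, directions):
    """Average the pairwise intersections of all sub-array direction lines."""
    pts = [intersect(origins[i], directions[i], origins[j], directions[j])
           for i, j in combinations(range(len(origins)), 2)]
    pts = [p for p in pts if p is not None]
    return np.mean(pts, axis=0) if pts else None
```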
[0052]
In some cases, two sound source directions have no intersection. In that case, either the solution from the pair of sub microphone arrays without an intersection is not used for sound source position estimation in the corresponding time-frequency component, or no sound source position estimation is performed for that component at all. When there is no intersection point, the phase difference information is considered to contain noise because a sound source other than the one being observed is present; not using such time-frequency components therefore makes the sound source position estimation more accurate.
[0053]
Further, when a sub microphone array is in a linear arrangement, it is not possible to estimate both the azimuth angle and the elevation angle; only the angle between the array axis of the sub microphone array and the sound source can be estimated. In this case, the sound source is present on a surface such that the angle formed with the array axis equals the estimated value. The intersection of such surfaces obtained by the individual sub microphone arrays is output as the sound source position or sound source direction. When all the sub microphone arrays are in a linear arrangement, the average of the intersection points of the surfaces determined for all combinations of sub microphone arrays is output as the sound source position. Averaging increases robustness to variation in the intersection positions.
[0054]
Also, when some sub-microphone arrays are in a linear arrangement and the other submicrophone arrays are in a non-linear arrangement, one sub-microphone array in a linear
arrangement and one sub-microphone array in a non-linear arrangement are combined Thus, one
estimate of the sound source position can be obtained. When using the linear arrangement and
the non-linear arrangement in combination, the average value of the intersections determined by
the combination of all the sub-microphone arrays with the minimum number of sub-microphone
arrays where one intersection point is determined as one unit is the final Output as an estimated
value of the typical sound source position.
[0055]
The localization determination unit 207 determines whether the sound source position obtained by the triangulation unit 206 is on the desk, i.e., whether it is within a predetermined tapping area. When two conditions are satisfied simultaneously, namely that the absolute value of the height of the sound source above the desk, calculated from the sound source position information determined by the triangulation unit 206, is below a predetermined threshold, and that the planar coordinates of the sound source on the desk, calculated from the same information, are within the tapping area, the localization determination unit 207 outputs the sound source direction and the distance to the sound source as the sound source position information. The sound source direction and distance may also be output as an azimuth angle and an elevation angle. The localization determination unit outputs a positive determination result when the two conditions are satisfied simultaneously and a negative result otherwise; the determination result may be integrated with the sound source direction and distance output from the triangulation unit. The definition of the tapping area is described later.
[0056]
The SNR determination unit 208 outputs the time-frequency components for which the SNR estimate output by the SNR estimation unit 205 is equal to or greater than a predetermined threshold. The power calculation unit 209 calculates the signal power Ps from the per-time-frequency SNR output by the SNR estimation unit 205 according to (Expression 10) below.
[0057]
Here, Px is the power of the input signal.
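Expression 10 is an image lost in extraction. Given the SNR definition reconstructed as Expression 8 above, the consistent form (yielding Ps = Px − N) is:

$$P_s=P_x\cdot\frac{\mathrm{SNR}}{1+\mathrm{SNR}}\qquad\text{(Expression 10, reconstructed)}$$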
[0058]
The power determination unit 210 outputs the time-frequency components for which the signal power output by the power calculation unit 209 is equal to or higher than a predetermined threshold. For the time-frequency components identified simultaneously by the power determination unit 210 and the SNR determination unit 208, the integration unit 211 weights the power output by the power calculation unit 209 with the per-frequency weight held in the sound source frequency DB 214. That is, when the frequency characteristics of a target sound, such as the sound of striking a desk, can be measured in advance, those characteristics are stored in the sound source frequency DB 214, and weighting with them enables more accurate position estimation.
[0059]
The weight is set to zero for time-frequency components not identified simultaneously by the power determination unit 210 and the SNR determination unit 208. The weight is also set to zero for time-frequency components determined by the localization determination unit 207 not to be within the tapping area.
[0060]
In the present embodiment, the output signal determination unit refers to the SNR determination
unit 208 and the power determination unit 210.
[0061]
The tapping area is divided into a grid of cells several centimeters on a side. For each time-frequency component whose estimated sound source position falls in the i-th cell, the weighted power of that component is added to the cell's power Pi. This grid power addition is performed for every time-frequency component, and the cell with the largest accumulated power is output as the final sound source position. The size and number of cells are predefined.
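A minimal sketch of this accumulate-and-argmax step follows; shapes and names are illustrative, and the gating and weighting correspond to paragraphs [0058] and [0059]:

```python
import numpy as np

def pick_grid_cell(cell_idx, power, weight, mask, n_cells, threshold):
    """cell_idx: grid cell assigned to each time-frequency component;
    power: its signal power Ps; weight: per-frequency weight from the
    sound source frequency DB; mask: 1 where the SNR, power, and
    localization determinations all passed, else 0. Returns the cell
    with the largest accumulated power, or None if that maximum stays
    below the threshold (no tapping sound)."""
    acc = np.zeros(n_cells)
    np.add.at(acc, cell_idx, power * weight * mask)   # grid power addition
    best = int(np.argmax(acc))
    return best if acc[best] >= threshold else None
```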
[0062]
The time length over which grid power is accumulated may be defined in advance, or the accumulation may be performed only for time segments determined to be active using VAD (Voice Activity Detection). Shortening the accumulation time shortens the reaction time from when the tapping sound is emitted until the sound source position is determined, but has the disadvantage of being vulnerable to noise.
[0063]
Conversely, lengthening the accumulation time lengthens the reaction time until the sound source position is determined, but has the advantage of being robust against noise. The accumulation time must be chosen with this trade-off in mind; since a tapping sound generally stops ringing within a short time of about 100 ms, it is desirable to set the accumulation time to about that length. If the maximum cell power is less than the predetermined threshold, the result is discarded on the assumption that no tapping sound is present. If it is larger than the threshold, the sound source position is output and the processing of the integration unit 211 ends.
[0064]
The control unit 212 converts the coordinates of the tapping sound source position output from the integration unit 211 into a specific point on the screen based on the information in the screen conversion DB 213.
[0065]
The screen conversion DB 213 holds a table that converts a sound source position coordinate into a specific point on the screen. The conversion may be any conversion that can specify one point on the screen from the sound source position of the tapping sound, such as a linear conversion using a 2-by-2 matrix. For example, the height information of the sound source obtained during position estimation may be ignored, the position of the sound source on the horizontal plane associated with one point on the screen, and the PC controlled as if that point were clicked or dragged. Alternatively, the height information may be used: for example, a sound generated above a certain height is interpreted as a double-click on the point, and a sound generated below that height as a single click. Changing the interpretation according to the height information in this way enables more diverse user operations.
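As a toy illustration of such a conversion, the sketch below maps desk coordinates to screen pixels with a 2-by-2 matrix plus an offset; the matrix and offset values are placeholders invented for this example, not values from the patent:

```python
import numpy as np

# Illustrative map: a 0.4 m x 0.3 m tapping area -> a 1280 x 720 screen,
# with the vertical axis flipped so "far on the desk" is "up on screen".
A = np.array([[3200.0, 0.0],
              [0.0, -2400.0]])
b = np.array([0.0, 720.0])

def desk_to_screen(xy_desk):
    """Apply the screen conversion as a linear transform plus offset."""
    px = A @ np.asarray(xy_desk, dtype=float) + b
    return int(px[0]), int(px[1])

print(desk_to_screen((0.2, 0.15)))   # center of the area -> (640, 360)
```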
[0066]
FIG. 5 is a diagram showing an example of the setting of the user's tapping position on the desk. A certain plane on the desk 301 to be struck is designated in advance as the tapping area. If the estimated sound source position of the tapping sound is within this tapping area, the sound is accepted. The microphone arrays may be set on the display 302, like the sub microphone arrays 303 to 305, or may be set separately on the desk. Here, the sub microphone array 303 estimates the elevation angle, and the sub microphone arrays 304 and 305 estimate the azimuth angle. By installing the sub microphone arrays on the display, the center of the microphone array's coordinate axes can be aligned with the center of the display, making it possible to specify a point in the display's virtual space more intuitively.
[0067]
FIG. 6 is a diagram showing the processing flow of the apparatus for determining which button on the screen the user pressed, using the desk tapping position detection described above.
[0068]
After system startup, the termination determination 501 checks whether the program should be terminated, for example because the computer is shutting down or the user has pressed the termination button of the desk tapping position detection program.
[0069]
If the termination determination 501 decides to terminate, processing ends. Otherwise, processing proceeds to digital conversion 502, where the analog sound pressure data acquired by the microphone array is converted into digital sound pressure data by the A/D conversion unit and read into the computer. Digital conversion may be performed sample by sample, or multiple samples may be read into the computer at once, for example in accordance with the minimum processing length of the tapping sound. The acquired digital data is decomposed by the time-frequency transform 503 into per-time-frequency components using a short-term Fourier transform, which makes it possible to estimate the arrival direction of sound for each frequency component.
[0070]
In the environment where the tapping sound program is used, human voice often exists as noise in addition to the tapping sound. Human voice is a sparse signal in the time-frequency domain, and its components are known to be unevenly distributed in a subset of the frequency bands. Therefore, by estimating the sound source direction in the time-frequency domain, the frequency components in which the human voice is concentrated can easily be rejected, which improves the tapping sound detection accuracy.
[0071]
The detection result rejection determination 505 determines whether the detected tapping sound really is a tapping sound within the tapping area on the desk. If it is determined not to be, processing returns to the termination determination 501. If it is determined to be a tapping sound, the pressing position determination 506 identifies one point on the screen from the position information, using a predefined mapping from each point of the tapping area to a point on the screen. The button presence determination 507 then determines whether a button exists at that position; if not, processing returns to the termination determination 501. If a button is present, processing equivalent to clicking that button with a mouse or other pointing device is executed as the button action 508.
[0072]
FIG. 7 shows the specific processing flow of the localization determination unit, the power determination unit, the SNR determination unit, and the integration unit. The localization determination unit 207 determines, for each time-frequency component, whether the component is within the predefined tapping area, based on the sound source direction and distance (i.e., the azimuth angle and elevation angle) calculated by the triangulation unit from the plurality of sub microphone arrays (localization determination 601). The predefined tapping area may be a rectangular area on the desk, as in FIG. 5, or may have a spatial thickness; any space for which membership can be judged from the elevation and azimuth information is sufficient.
[0073]
The power determination unit 210 determines whether the loudness of the tapping sound is greater than the noise power estimated using a method such as the MCRA method mentioned above (noise power comparison 602). The MCRA method estimates the power of background noise from sound in which speech and background noise are mixed, and is based on the minimum statistics method, which regards the minimum power over the past several frames as the background noise power, on the hypothesis that speech takes large power only suddenly. However, the background noise power estimated by minimum statistics tends to be smaller than the actual value; the MCRA method corrects it, for example by smoothing in the time direction, to obtain a value close to the actual background noise power. A tapping sound is not speech, but in that it suddenly takes a large amount of power it exhibits statistical properties similar to speech, so background noise estimation methods such as the MCRA method can be applied.
[0074]
If the power of the tapping sound is greater than the noise power, the SNR of the tapping sound power to the background noise power is calculated. The SNR determination unit 208 determines whether the calculated SNR exceeds a threshold (SNR determination 603), and if so, determines that the time-frequency component is a tapping sound component.
[0075]
In the integration unit 211, the tapping area is divided in advance into a lattice. A time-frequency component determined to be a tapping sound component is assigned to the corresponding grid cell based on the estimated azimuth and elevation of that component. At assignment, the power of the tapping sound component multiplied by the frequency-dependent weight is added to the corresponding cell. This processing is performed only for a predefined frequency band and a predefined time length. Then the cell with the largest power is detected (grid detection 604), and its azimuth and elevation are output as those of the sound, specifying the sound source position. If the maximum cell power falls below the predefined threshold, it is determined that no tapping sound is present.
[0076]
The processing order of the localization determination unit 207, the power determination unit 210, and the SNR determination unit 208 is not limited to the order shown in FIG. 7; it suffices that all three have completed before the processing in the integration unit 211.
[0077]
FIG. 8 is a diagram showing the time waveform of a typical tapping sound. The tapping sound suddenly takes a large value (the direct sound), after which the echo and reverberation components of the tapping sound follow. These components can be regarded as sounds arriving from various directions, so their direction is harder to estimate than that of the direct sound, and it is undesirable to use them for estimating the direction of the tapping sound. Since the echo and reverberation components generally have less power than the direct sound, components whose power immediately after the sudden loud sound is smaller than that sound can be judged not to be tapping sound components. From this point of view, when assigning the per-time-frequency tapping sound components to the grid, processing may be added so that components whose power is smaller than in the previous frame are not assigned to the grid. Adding this makes tapping sound detection robust against echo and reverberation.
[0078]
FIG. 9 is a schematic diagram of the assignment of time-frequency components to the grid. The tapping sound detection device is assumed to be used as a substitute for a PC operating device such as a mouse, so in its operating environment many sound sources, such as human speech, are assumed to be present, and the detector must operate robustly in their presence. Speech is generally considered a sparse signal in the time-frequency domain: its power is unevenly distributed in a subset of components. Therefore, by removing those unevenly distributed components, the tapping sound detection device can operate robustly even in an environment where a speech source is present.
[0079]
The integration unit 211 determines whether the elevation angle and the azimuth angle are within the tapping area, and regards a component as a tapping sound only when it is. This determination makes it possible to reject the part of the time-frequency domain in which the speech components are concentrated.
[0080]
The integration unit 211 operates by outputting the cell with the largest power, but alternatively the direction of maximum power may first be determined in each sub microphone array, and those directions then integrated by triangulation to estimate the sound source direction of the tapping sound.
[0081]
FIG. 10 is a diagram showing an example of the per-direction frequency of occurrence in the sub microphone arrays. For example, as shown in FIG. 10, for each sub microphone array the power in each direction seen from that array is accumulated. In the method of allocating time-frequency components over a two-dimensional plane or three-dimensional space, the number allocated to each cell often becomes extremely small. In such a case, calculating a histogram separately for each sub microphone array, finding the direction giving the maximum of each histogram, and then integrating by triangulation allows more robust estimation.
[0082]
FIG. 11 shows an example in which the tapping area is given a thickness in the height direction. Making the tapping area thick in the height direction makes it possible to detect the sound of a finger snap in the air, in addition to being robust against some estimation errors in the elevation direction.
[0083]
FIG. 12 is a diagram showing an example arrangement of the sub microphone arrays. In this example, a plurality of sub microphone arrays 1101 to 1104 are arranged so as to surround the tapping area. Surrounding the tapping area in this way makes it possible to detect the tapping sound position with higher accuracy than the arrangement of the sub microphone arrays 303 to 305 shown in FIG. 5.
[0084]
FIG. 13 is a view showing an application example in which the sound source pointing device is applied as a tapping sound detection apparatus. The display 1204 is laid on the desk so that the display surface and the desk surface are parallel, and the plurality of sub microphone arrays 1201 to 1203 are arranged on the display. The tapping area is the entire display screen. With this setting, when the user strikes a point on the display surface, the struck point can be identified; that is, a tapping sound detection device as in FIG. 13 can be used instead of a touch panel. Moreover, while a touch panel can literally only detect whether it was touched, with the tapping sound detection device of the present invention the tapping area can be set with a thickness in the height direction, so that a finger snap in the space above the display can also be detected.
[0085]
FIG. 14 shows an application example in which the tapping sound detection device is applied to a baseball "strike determination device". In such a device, a ball is thrown from a throwing area 1301 toward a target 1305, and it is determined which of the squares 1 to 9 on the target 1305 was hit. The sound produced when the ball strikes is a sudden sound with large power and can therefore be detected by the tapping sound detection device of the present invention. By arranging a plurality of sub microphone arrays 1302 to 1304 as shown in FIG. 14, it becomes possible to determine, for each thrown ball, which of the squares 1 to 9 was hit, or whether the frame was hit. The metallic sound of the ball hitting the frame and the sound of the ball hitting a square have different frequency characteristics, so referring to the frequency characteristics of the components determined to be a tapping sound makes it possible to distinguish hits on the frame from hits on a square.
[0086]
FIG. 15 shows an application example in which the tapping sound detection device is applied to a soccer "goal position determination device". The configuration is the same as that of the strike determination device of FIG. 14. The tapping sound detection device, using the sub microphone arrays 1402 to 1404, determines which of the squares 1 to 9 on the target 1405 was hit by the ball kicked from the kick area 1401.
[0087]
FIG. 16 shows an application example in which the tapping sound detection device is applied to a "bounce position determination device" for table tennis, making it possible to know where the ball bounces. The configuration is the same as that of the strike determination device and the goal position determination device. The position where the ball bounces on the court 1501 is determined by the tapping sound detection device using the sub microphone arrays 1502 to 1507. The sound of a table tennis ball bouncing on the court 1501 is a sudden sound, so it can be detected by the tapping sound detection device. This makes it possible to provide viewers of live table tennis broadcasts and the like with information about the trajectory of the ball that was previously unobtainable.
[0088]
FIG. 17 shows an application example in which the tapping sound detection device is applied to a tennis practice wall to detect the position where the ball hits the wall. Since there has been no way of knowing where on the wall the ball hit, it was not possible to judge whether the hitting direction was good or bad. A tapping sound detection device using a plurality of sub microphone arrays 1602 to 1604 arranged on the wall 1601 makes it possible to detect the position where the ball hits. For example, by storing the hit positions and displaying them later on a computer display, results such as a large variation in hit position can be reviewed.
[0089]
FIG. 18 is a view showing another application of the sound source pointing device as a tapping sound detection apparatus: an example of use in which a sudden sound other than a desk strike is produced in the air, such as the user snapping a finger. By setting the tapping area with a thickness in the height direction, such a sudden sound produced in the air can also be detected.
[0090]
FIG. 1 is a schematic block diagram of an acoustic pointing device showing an example of an embodiment according to the present invention.
FIG. 2 is a schematic block diagram of the acoustic pointing device using time domain signals only.
FIG. 3 is a hardware configuration diagram of the acoustic pointing device and of a computer system provided with it.
FIG. 4 is a diagram showing the linear arrangement of the sub microphone arrays used in the acoustic pointing device.
FIG. 5 is a diagram showing an example setting of the user's tapping position on the desk.
FIG. 6 is a diagram showing the tapping position detection flow in the acoustic pointing device.
FIG. 7 is a diagram showing the processing flow of determination and integration in the acoustic pointing device.
FIG. 8 is a diagram showing the time waveform of a tapping sound.
FIG. 9 is a schematic diagram of the grid for each time-frequency component.
FIG. 10 is a diagram showing the power for each sound source direction.
FIG. 11 is a diagram showing an example in which the tapping area is set with a thickness in the height direction.
FIG. 12 is a diagram showing an arrangement of the sub microphone arrays.
FIG. 13 is a diagram showing an application example in which the sound source pointing device is applied to a tapping sound detection apparatus.
FIGS. 14 to 18 are diagrams showing other application examples in which the sound source pointing device is applied to a tapping sound detection apparatus.
Explanation of Reference Signs
[0091]
DESCRIPTION OF SYMBOLS: 101: Microphone array; 102: A/D conversion unit; 103: Central processing unit; 104: Volatile memory; 105: Storage medium; 106: Display device; 201: Buffering unit; 202: STFT unit; 203: Direction estimation unit; 204: Noise estimation unit; 205: SNR estimation unit; 206: Triangulation unit; 207: Localization determination unit; 208: SNR determination unit; 209: Power calculation unit; 210: Power determination unit; 211: Integration unit; 212: Control unit; 213: Screen conversion DB; 214: Sound source frequency DB; 301: Desk; 302: Display; 303, 304, 305: Sub microphone arrays.