Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2017147504
Abstract: Even in a situation where surrounding sounds cannot be recognized, the user is notified of the sound source direction, the degree of approach, and the sound source type, and only of what is necessary, without having to look around. A sound source display device (1) comprises sound collection means (10) for collecting sound, sound source extraction means (11) for extracting a sound source based on the collected sound signal, sound source type recognition means (engine) (14) for recognizing the type of the extracted sound source, and display means (18) for displaying the sound source type recognition result. The display means (18) displays or hides each sound source type recognition result in accordance with display-required or display-unnecessary information registered in advance. [Selected figure] Figure 1
Sound source display device and sound source display method
[0001]
The present invention relates to a sound source display device and a sound source display
method.
[0002]
For example, with a smartphone, a great deal of information can be obtained while moving.
Wearable information display devices such as HUDs (Head-Up Displays) and HMDs (Head-Mounted Displays) have also been commercialized, and the obtained information can be displayed on them. In information display devices such as smartphones and HMDs, it is known to display information on the direction of arrival of a sound by changing its color and size. However, when a user walks while wearing an "immersive" (non-transmissive) HMD, or walks while looking at a smartphone, the user cannot see the outside world, and if the user also wears earphones, the surrounding sounds cannot be recognized either. As a result, the user may collide with people or objects, suffer an accident such as a fall, or become a nuisance to others, which has become a social problem.
[0003]
Patent Document 1 (Japanese Patent Application Laid-Open No. 2013-183286) discloses, for the purpose of notifying the direction of a sound source, a portable terminal device that detects and reports the sound source direction from the time differences between sounds acquired by a plurality of microphones arranged around a housing. With this portable terminal device, the direction in which a sound is generated can be reported even when the surrounding sound cannot be recognized. However, this device requires frequency analysis to determine whether the sound is approaching, and although it detects and displays the direction of a moving sound, the user still does not know what the sound is. The user therefore ultimately needs to look toward the direction of arrival to identify the type of the target, so the conventional problem described above is not solved.
[0004]
Patent Document 2 (Japanese Patent Application Laid-Open No. 2012-029209) describes, for the purpose of automatically following a change in the position of a sound source that requires attention and selectively continuing the necessary voice processing for the sound source of interest, a sound processing system that monitors the user's gaze to detect an object and tracks the movement of that object in an image. In this system, the object is detected from the direction in which the user is looking and from a face-movement detector, so the conventional problems described above cannot be solved. Furthermore, neither patent document discloses selecting whether a display target is displayed or hidden according to the type of the sound source.
[0005]
The present invention has been made in view of the conventional problems described above, and its object is to present to the user the sound source direction, the degree of approach, and the sound source type, without the user having to look around, even in a situation where the surrounding sounds cannot be recognized, and moreover to display only what is necessary.
[0006]
The present invention is a sound source display device comprising sound collection means for collecting sound, sound source extraction means for extracting a sound source based on the collected sound signal, sound source type recognition means for recognizing the type of the extracted sound source, and display means for displaying the sound source type recognition result, wherein the display means displays or hides the sound source type recognition result in accordance with display-required or display-unnecessary information registered in advance.
[0007]
According to the present invention, even in a situation where the surrounding sounds cannot be recognized, the sound source direction, the degree of approach, and the sound source type can be displayed to the user without the user having to look around, and only the necessary items are displayed.
[0008]
FIG. 1 is a block diagram schematically showing a sound source display device according to an embodiment of the present invention. FIG. 2 shows arrangement examples of a plurality of microphones (a microphone array). FIG. 3 shows the relationship between a sound source and a user wearing or holding the sound source display device: FIG. 3A illustrates how microphone array processing is performed every Δθ, starting from a predetermined initial angle (directivity angle) θ = 0° within the scan angle range of the sound source direction, for a given sound signal section in order to detect a sound source, and FIG. 3B shows the detected sound source direction. FIG. 4 is a flowchart showing the processing sequence for sound source detection. FIGS. 5A and 5B show a display screen of the display means on which characters are displayed at a fixed position regardless of the direction information in the recognition result; FIG. 5A shows the case where the sound source is relatively far away and FIG. 5B the case where it is relatively close. FIGS. 6A and 6B show the same kind of display with a figure instead of characters; FIG. 6A shows the case where the sound source is relatively far away and FIG. 6B the case where it is relatively close. FIGS. 7A and 7B show a display screen on which the direction information of the recognition result is displayed; FIG. 7A shows the case where the sound source is relatively far away and FIG. 7B the case where it is relatively close.
[0009]
The present invention can reduce the user's degree of danger by displaying the target sound source to the user even in a situation where the surrounding sounds cannot be recognized, without the user having to confirm the target visually. That is, even if the user cannot hear the surrounding sounds, only the necessary sound source direction, degree of approach, and sound source type are displayed, without the user directly checking the target sound source by sight. Embodiments of the present invention will now be described with reference to the drawings.
[0010]
FIG. 1 is a block diagram schematically showing a sound source display device according to an embodiment of the present invention. The sound source display device 1 includes sound collection means 10 such as a plurality of microphones 10a; sound source extraction means 11 for detecting the direction of the collected sound signal and extracting the sound source signal from the detected direction; sound source signal section cut-out means 12 for cutting out a section of the extracted sound source signal; perspective (distance) judgment means 13 for judging the distance of the sound source 20; and sound source type recognition means (engine) 14 for recognizing the sound source type from the section cut out by the sound source signal section cut-out means 12. These are connected to a CPU (central processing unit) 15, which performs notification processing based on the sound source type recognition result and the perspective judgment result and instructs the sound source extraction means 11 on the extraction direction angle. The device further includes an interface (I/F) unit 16, storage means 17, display means 18, and voice presentation means 19. The I/F unit 16 is, for example, a display, a keyboard, a mouse, or the like, and the display means may optionally be a wearable display. In FIG. 1, the sound signals in the sound source extraction means 11, the sound source signal section cut-out means 12, the distance judgment means 13, the sound source type recognition means (engine) 14, and the voice presentation means 19 are processed by a DSP (Digital Signal Processor) or the like. The CPU 15 operates as control means that controls each of these means and receives their results.
[0011]
For example, the CPU 15 accumulates, in storage means 17 such as a semiconductor memory, sound signals of a plurality of channels that have a certain time length, were collected by the microphones 10a, and were A/D converted. The CPU 15 then performs microphone array processing on the sound signal every Δθ within a scan angle range, starting from a predetermined initial angle (for example, θ = 0°). In the microphone array processing, a plurality of omnidirectional microphones 10a is used, and beamforming is applied to the sound signal for each target direction through directivity adjustment, exploiting the time differences and amplitude differences that exist among the channels. That is, the omnidirectional microphones 10a are arranged at spatially different positions, and their output signals are processed to control a filter in the frequency domain and a filter in the spatial domain (that is, the directivity). In this way, parameters that emphasize the sound signal from the target direction are calculated.
[0012]
Here, the signal processing unit corresponding to each microphone 10a calculates parameters that emphasize the sound signal in the target direction, and the per-channel signals are added to extract the sound signal as the target sound. This microphone array processing thus performs sound (sound source signal) extraction by directivity processing in a given direction θ.
[0013]
The sound source signal section cut-out means 12 cuts out a sound source signal section from the sound source signal extracted by the sound source extraction means 11. When the signal is speech, this extraction process is also referred to as voice activity detection (VAD). Based on the sound source signal section cut out here, the type of sound source is recognized by the subsequent sound source type recognition means (engine) 14. Note that the section cut-out may be omitted, but performing it tends to reduce erroneous recognition by the sound source type recognition means (engine) 14. Well-known implementations of the sound source signal section cut-out means 12 use a GMM (Gaussian Mixture Model), average power, zero-crossing count, and so on, but any method based on acoustic feature quantities may be used.
[0014]
The sound source type recognition means (engine) 14 recognizes the type of the sounding source (for example, a car, a bicycle, an animal cry, etc.). As its recognition method, the sound source type recognition means (engine) 14 uses an HMM (Hidden Markov Model), a DNN (Deep Neural Network), or similar techniques employed in speech recognition.
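The interface of such an engine can be illustrated with a toy stand-in. The patent specifies an HMM or DNN; the nearest-centroid rule below is not that, only a placeholder with the same shape of input and output, and the likelihood-like score it returns is an assumption introduced so the later display-threshold stage has something to compare.

```python
def classify_source(features, centroids):
    """Toy stand-in for sound source type recognition means (engine) 14:
    pick the sound source type whose stored centroid is nearest to the
    observed acoustic feature vector. Returns (label, score), where score
    is a likelihood-like value in (0, 1] derived from the distance."""
    best_label, best_dist = None, float("inf")
    for label, centroid in centroids.items():
        d = sum((a - b) ** 2 for a, b in zip(features, centroid)) ** 0.5
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, 1.0 / (1.0 + best_dist)
```

In practice the centroids would be replaced by a trained HMM/DNN; the point here is only the (features in, type and likelihood out) contract.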
[0015]
Next, the perspective judgment means 13 extracts feature quantities of the sound extracted by the microphone array (the feature quantities are obtained by time-frequency conversion processing such as a Fourier transform or a wavelet transform and are, for example, statistics such as the average value and the variance) and determines, using only their temporal change, whether the sound source is approaching or receding. Examples of the sound feature quantities used here include frequency changes due to the Doppler effect and changes in signal power. As noted above, the perspective judgment is not strictly essential, but when it is present, a change from the past (approaching, unchanged, receding) can be recognized easily.
[0016]
The CPU 15 stores the results obtained by the above processing in the storage means 17 through its "recognition result and parameter storage" function. The storage means 17 holds a likelihood threshold (a degree representing plausibility) for the sound source type recognition means (engine) 14, the sound source direction scan angle range, the scan angle step Δθ, and a display presence/absence setting that specifies, for each sound source type, whether it should be displayed. The display presence/absence setting for each sound source type is made by the user, and because the setting can be changed, the display means 18 displays results according to the display-required or display-unnecessary information set by the user, so that only the information the user needs is displayed. When the voice presentation means 19 is used, the extracted sound is also presented in addition to the visual display on the display means 18.
[0017]
Also, regardless of the display/non-display setting for each sound source, a recognition result whose likelihood falls below the likelihood threshold of the sound source type recognition means (engine) 14 is a result of low reliability and is removed from the display targets. In the present embodiment, when the sound source type is determined, sound source types to be hidden can be designated in advance so that they are not displayed, or conversely sound source types to be displayed can be designated in advance so that only the designated types are displayed. The display and non-display targets are set or changed by the user, but presets for typical environments may also be provided. The user makes settings and changes via the I/F unit 16, for example via a display, a keyboard, a mouse, settings files, or a network.
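The gating described in paragraphs [0016] and [0017] can be sketched in a few lines. The dictionary field names and the default threshold are assumptions made for the sketch; the logic follows the text: a sub-threshold likelihood always suppresses display, and otherwise either a pre-registered "display" list or a pre-registered "non-display" list is applied.

```python
def filter_for_display(results, show_types=None, hide_types=(), likelihood_thresh=0.6):
    """Keep only recognition results that should reach the display means.
    - likelihood below likelihood_thresh: dropped regardless of settings.
    - show_types given: only those pre-registered types are displayed.
    - otherwise: pre-registered hide_types are removed."""
    kept = []
    for r in results:
        if r["likelihood"] < likelihood_thresh:
            continue  # low-reliability result: never displayed
        if show_types is not None:
            if r["type"] in show_types:
                kept.append(r)
        elif r["type"] not in hide_types:
            kept.append(r)
    return kept
```

For example, with a car, a voice, and a low-likelihood bicycle result, hiding "voice" leaves only the car, and whitelisting {car, bicycle} also leaves only the car because the bicycle result is below the threshold.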
[0018]
FIG. 2 shows arrangement examples of a plurality of microphones (a microphone array). The arrangement example of FIG. 2A places four microphones 10a at the four corners of the device (sound source display device 1), and that of FIG. 2B places them on a line. The microphone arrangement and the number of microphones vary with the directivity that is to be created for the user 22 (FIGS. 3A and 3B). For example, the microphones need not be limited to a row, a square, or a circle in the horizontal plane; a three-dimensional microphone array arrangement with varying heights may also be used.
[0019]
FIG. 3 shows the relationship between the sound source 20 and the user 22 wearing or holding the sound source display device 1. FIG. 3A illustrates how, for a given sound signal section, microphone array processing is performed every Δθ from the predetermined initial angle (directivity angle) θ = 0° within the scan angle range of the sound source direction to detect the sound source direction angle, and FIG. 3B shows the detected sound source direction. Note that, with reference to FIG. 3A, the microphone array processing every Δθ may be carried out by a plurality of per-direction microphone array processing units operating simultaneously, or the input signals of each microphone temporarily stored in the storage means 17 may be processed by two microphone array processing units whose settings are changed, with the results stored in the storage means 17 again. The sound source direction angle is calculated over an arbitrary angle range, within 360 degrees or 180 degrees from the initial angle, based on the sound source direction scan angle. That is, the determination is made by scanning at the preset sound source direction scan step (Δθ). Whether the scan angle range is 360 degrees or 180 degrees depends on the positions and number of the microphones: since a microphone array arranged on a straight line in a two-dimensional plane cannot distinguish front from back, its scan angle range (search angle) is 180 degrees.
[0020]
In FIG. 3A, if the angle θ at which the microphone array processing is performed is not the last directivity angle within the scan angle range of the sound source direction, the directivity scan step Δθ is added to the directivity angle θ and processing is performed for the next directivity angle θ + Δθ. If the directivity angle is the last one in the scan angle range, the recognition result is displayed on the screen, the directivity angle is initialized (θ = 0), and processing moves on to the next sound signal section. According to the present embodiment, since the sound source 20 is detected as described above, the presence of the sound source 20 can be notified to the user without the user directly checking the target sound source 20 visually, even in a situation where the surrounding sounds cannot be recognized. The degree of danger of the user holding or wearing the display means 18 can therefore be reduced.
[0021]
FIG. 4 is a flowchart showing the processing procedure for the sound source detection described above. When sound source detection starts, the predetermined initial angle (directivity angle) within the scan angle range of the sound source direction is first set (for example, θ = 0°) (S101). The CPU 15 inputs the sound waveform of each channel collected by the sound collection means 10 into the storage means 17; the system comprising the sound source extraction means 11, the sound source signal section cut-out means 12, the sound source type recognition means (engine) 14, and the distance judgment means 13 performs acoustic processing, and the processing results are written to the storage means 17 (S102). Next, sound extraction is performed by the sound source extraction means 11 using directivity processing in the pointing direction θ (S103). From the extracted sound (sound source signal), the sound source signal section cut-out means 12 cuts out the sound source signal section (S104). The sound source type recognition means (engine) 14 then recognizes the sound source 20 based on the cut-out sound source signal (S105).
[0022]
The perspective judgment means 13 compares the current and previous perspective detections of the sound source 20 based on the sound waveform of each channel acquired by the sound source display device 1. If the sound source has moved farther from the user (S106, receded), the recession display setting is made (S107). If, conversely, it is approaching (S106, approached), the approach display setting is made (S108). If there is no change from the previous time (S106, same), the same display setting as before is made (S109).
[0023]
Thereafter, as a result of the recognition, the parameters for emphasizing the sound signal in the target direction are stored in the storage means 17 (S110). It is then determined whether the angle θ at which the microphone array processing was performed is the last directivity angle within the scan angle range of the sound source direction (S111). If it is the last directivity angle (S111, Yes), display information processing is performed (S112), the processing result is shown on the display means 18, the directivity angle is reset to its initial value (θ = 0°), and the process returns from step S113 to step S102 and repeats. If it is not the last directivity angle (S111, No), Δθ is added to θ (S114) and the process returns to step S103 to repeat the above processing. This is the flow of the entire processing.
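The control structure of the flowchart (one sound-signal section, S103 to S114) can be condensed into a short loop. This is a sketch of the control flow only: the per-direction acoustic processing of steps S103 to S109 is passed in as a callable, and the step mapping in the comments is approximate.

```python
def scan_cycle(process_direction, scan_range_deg=360, step_deg=30):
    """Control-loop sketch of FIG. 4 for one sound signal section: start at
    the initial directivity angle theta = 0, run the per-direction acoustic
    processing (extraction, section cut-out, type recognition, perspective
    judgment -- supplied here as process_direction), step by delta-theta,
    and return the collected results for display once the last directivity
    angle has been processed."""
    results = []
    theta = 0
    while theta < scan_range_deg:          # S111: last directivity angle?
        result = process_direction(theta)  # roughly S103-S109
        if result is not None:
            results.append(result)         # S110: store the result
        theta += step_deg                  # S114: theta += delta-theta
    return results                         # feeds S112: display processing
```

As a usage sketch, a dummy processor that "detects" a source only at multiples of 60° yields those six angles over a full 360° scan in 30° steps.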
[0024]
Note that the above signal processing need not be performed only by the smartphone or HMD owned by the user 22; it may also be processed by an external device via a wireless communication unit. As a supplement to the detection method described above, the detection process for the sound source 20 is roughly divided into detection of the sound source direction angle, sound source type recognition and approach judgment, and display. 1) Detection of the sound source direction angle: the sound source direction angle is calculated over an arbitrary angle range, within 360 degrees or 180 degrees from the initial angle, based on the sound source direction scan angle. Scanning is performed at the preset scan step (Δθ). Whether the scan angle range is 360 degrees or 180 degrees depends on the positions and number of microphones in the array (a microphone array arranged on a straight line in a two-dimensional plane cannot distinguish front from back, so its search angle is 180 degrees).
[0025]
2) Sound source type recognition and approach (perspective) judgment: when a plurality of sound sources 20 exists, sound source type recognition and approach judgment are performed on each sound source 20 in the same way. If the number of microphones used is N, up to N - 1 sound sources 20 can be recognized. 3) Display method: the results obtained in 1) and 2) above are displayed on the screen of the display means 18. The following four display methods can be considered. i) The display position is fixed regardless of the direction information in the recognition result; the perspective is displayed as a relative relationship between figures and characters. FIGS. 5 and 6 show display screens of the display means 18 that display characters (FIG. 5) or a figure (FIG. 6) at a fixed position regardless of the direction information in the recognition result, with the perspective shown as a relative relationship: FIGS. 5A and 6A show the case where the sound source 20 is relatively far away, and FIGS. 5B and 6B the case where it is relatively close.
[0026]
ii) The direction information in the recognition result is displayed on the screen; the perspective is displayed as a relative relationship between figures and characters. iii) The direction and perspective information in the recognition result are both reflected on the screen; the size of characters and figures is kept constant (for example, distance information is represented by color).
[0027]
iv) The direction and perspective information in the recognition result are both reflected on the screen; the perspective information relative to the previous time is represented by the size of a figure or character. FIG. 7 shows a display screen of the display means 18 on which the direction information of the recognition result is displayed: FIG. 7A shows the case where the sound source 20 is relatively far away, and FIG. 7B the case where it is relatively close.
[0028]
In addition, when the same judgment is made several times in succession, the display means may present a warning in characters, change the size or color of the characters, blink them, or warn the user 22 by vibrating the device itself; when a plurality of sound sources 20 is recognized, they may be displayed overlapping one another. The perspective displays i) and iv) above correspond to the case where perspective processing is performed; when it is not performed, the size of the figure or character remains constant.
[0029]
In the above embodiment, as described, providing the voice presentation means 19 in the sound source display device makes it possible to present a recognition sound in addition to the recognition result shown to the user. As a result, the user is alerted not only by the screen but also by sound, which may be easier to notice than the screen alone. When a sound is presented while another sound reproduction function is running, the reproduced sound is muted or its volume is reduced, and the recognition sound is mixed in and presented to the user.
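The mute-or-duck-and-mix behavior of paragraph [0029] can be sketched per sample block. The attenuation factor and the simple additive mix are assumptions for illustration; the patent only specifies that the reproduced sound is muted or reduced and the recognition sound mixed in.

```python
def mix_with_duck(playback, alert, duck_gain=0.2):
    """Attenuate the current playback by duck_gain (duck_gain=0 mutes it)
    and mix in the recognition sound (alert). Inputs are per-sample lists;
    the shorter signal is zero-padded to the longer one's length."""
    n = max(len(playback), len(alert))
    out = []
    for i in range(n):
        p = playback[i] * duck_gain if i < len(playback) else 0.0
        a = alert[i] if i < len(alert) else 0.0
        out.append(p + a)
    return out
```

For example, ducking a unit-amplitude playback to 0.25 while a half-amplitude alert sample is mixed in yields 0.75 on the overlapping sample and 0.25 afterwards.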
[0030]
As described above, according to the present embodiment, not only the sound source direction and the degree of approach (perspective information) but also the sound source type is displayed, so the user can judge what the sound is. In addition, since the targets to be displayed are registered for sound recognition and recognition results that need not be notified are not displayed, only the information the user requires can be presented. In the present invention, "display" is not limited to what is simply shown on the display means; it also includes notifying the user by sound, vibration, or the like, and the display means therefore also includes devices that enable such notification.
[0031]
DESCRIPTION OF SYMBOLS 1: sound source display device, 10: sound collection means, 11: sound source extraction means, 12: sound source signal section cut-out means, 13: distance judgment means, 14: sound source type recognition means (engine), 15: CPU, 16: I/F unit, 17: storage means, 18: display means, 19: voice presentation means, 20: sound source, 22: user.
[0032]
JP 2013-183286 A; JP 2012-029209 A
03-05-2019