JP2001296343

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2001296343
[0001]
BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a
sound source direction setting apparatus, an image pickup apparatus having the same, and a
transmission system such as a video conference apparatus and a video telephone system using
the image pickup apparatus.
[0002]
2. Description of the Related Art Conventionally, in a video conference system or the like, the
voice of a speaker is collected by a plurality of microphones provided in a microphone set, and the
direction of the speaker relative to the microphone set is detected using these microphones.
Such techniques are described, for example, in JP-A-4-049756, JP-A-4-249999, JP-A-6-351015,
JP-A-7-140527, and JP-A-11-041577.
[0003]
The direction of the speaker can be detected with the microphones because the time taken for the
voice of the speaker to reach each microphone differs slightly. A cross-correlation coefficient is
calculated, as described below, as a function of this time difference. By searching for the time
difference that maximizes the cross-correlation coefficient and converting that time information
into angle information, the angle of the voice source can be detected.
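The search for the time difference that maximizes the cross-correlation coefficient can be sketched
as follows. This is a minimal illustration, not the method of any cited publication: the function
name best_lag, the normalization over the overlapping samples, and the sign convention (a positive
lag means the second signal arrives later) are assumptions made for the example.

```python
import math


def best_lag(sig_a, sig_b, max_lag):
    """Return the lag (in samples) of sig_b relative to sig_a that maximizes
    the normalized cross-correlation, searching lags in [-max_lag, max_lag].

    A positive result means sig_b is a delayed copy of sig_a.
    """
    def corr(lag):
        # Pair each sample of sig_a with the sample of sig_b shifted by `lag`,
        # keeping only the indices where both signals are defined.
        pairs = [(sig_a[i], sig_b[i + lag])
                 for i in range(len(sig_a))
                 if 0 <= i + lag < len(sig_b)]
        num = sum(a * b for a, b in pairs)
        den = math.sqrt(sum(a * a for a, _ in pairs) *
                        sum(b * b for _, b in pairs))
        return num / den if den else 0.0

    return max(range(-max_lag, max_lag + 1), key=corr)


# Hypothetical test signal: the second microphone hears the same burst
# three samples later than the first.
mic_a = [0, 0, 0, 1, 3, -2, 4, -1, 2, 0, 0, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0] + mic_a[:-3]
lag = best_lag(mic_a, mic_b, max_lag=5)
```

Dividing the lag by the sampling frequency gives the time difference in seconds.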
[0004]
10-05-2019
1
FIG. 4 is a block diagram of a conventional video conference apparatus.
FIG. 4 shows a video conference apparatus 100 in which an image input unit 200 having a camera
lens 103 for imaging a speaker and a microphone set 170 having microphones 110a and 110b for
collecting the voice of the speaker are connected by a rotation means 101.
[0005]
As described below, the video conference apparatus 100 picks up the voice of the speaker with
each of the microphones 110a and 110b, and detects the azimuth of the speaker from the voice.
Then, based on the detection result, the camera lens 103 is controlled so as to point toward the
speaker, the image of the speaker is input through the camera lens 103, and the image is sent,
together with the collected voice, to the video conference apparatus of the other party.
[0006]
FIG. 5 is an explanatory view of the principle of detecting the azimuth of the speaker with the
microphones 110a and 110b. FIG. 5 shows the two microphones 110a and 110b and the speaker;
the time required for the voice of the speaker to reach each of the microphones 110a and 110b
differs.
[0007]
This time difference is converted into a rotation control value for the camera lens 103 as
follows. Let L be the distance between the microphones 110a and 110b, and let θ be the angle, in
the scanning plane of the camera lens 103 that contains the microphones 110a and 110b, between
the straight line connecting the speaker to the microphones 110a and 110b and the pointing
direction of the camera lens 103. With V the velocity of sound and Fs the sampling frequency, the
angle can be expressed as
θ = sin^-1 (V [m/s] / (Fs [Hz] × L [m])).
[0008]
However, in the scanning plane of the camera lens 103 that contains the microphones, the angle θ
between the direction of the speaker and the pointing direction of the microphones and the camera
lens 103 follows the sin^-1 function. The angular accuracy therefore differs between the case in
which the speaker is at approximately the same distance from each microphone, so that the angle θ
and the time difference between the voice reaching each microphone are small, and the case in
which the angle θ and the time difference between the voice reaching each microphone are large.
Specifically, the larger the angle θ, the lower the detection accuracy, and an improvement is
therefore desired.
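The loss of accuracy at large θ can be illustrated numerically. The sketch below assumes the form of
the relation given later in paragraph [0043], with the time difference expressed as a whole number
of samples; the sound speed of 340 m/s and the microphone spacing of 0.3 m are hypothetical values,
and 16 kHz is the example sampling frequency mentioned in the embodiment.

```python
import math


def angle_steps_deg(v=340.0, fs=16000.0, mic_distance=0.3):
    """Return the change in detected angle (degrees) for each one-sample
    increase in the arrival-time difference, from a lag of 0 upward."""
    # Largest whole-sample lag for which the sin^-1 argument stays within 1.
    n_max = int(fs * mic_distance / v)
    angles = [math.degrees(math.asin(k * v / (fs * mic_distance)))
              for k in range(n_max + 1)]
    return [b - a for a, b in zip(angles, angles[1:])]


steps = angle_steps_deg()
# steps[0] is the angular resolution near the front of the microphone pair;
# steps[-1] is the much coarser resolution near the end-fire direction.
```

Because sin^-1 steepens as its argument approaches 1, each additional sample of delay corresponds
to a larger jump in angle, which is why a pair of microphones resolves directions best when the
source is nearly equidistant from both.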
[0009]
In addition, the sound emitted by the speaker may be collected by each microphone not only
directly but also after being reflected by a wall, a floor, or another part of the acoustic space.
Furthermore, what each microphone collects includes background noise and the like in addition to
the voice of the speaker. Therefore, the cross-correlation coefficients between the microphones
can vary under the influence of background noise and the like, and as a result the direction of
the speaker may be detected incorrectly.
[0010]
Therefore, in consideration of the above circumstances, it is an object of the present invention
to provide a sound source direction setting device capable of controlling the pointing direction
of an imaging device including a camera lens or the like so that it is correctly directed to a
sound source such as a speaker.
[0011]
Another object of the present invention is to provide a sound source direction setting device
capable of responding promptly to movement or switching of a sound source and performing control
so that the device is correctly directed to the sound source at its new position.
[0012]
Still another object of the present invention is to provide a speaker orientation setting apparatus
that is less susceptible to the influence of reflection characteristics and the like.
[0013]
SUMMARY OF THE INVENTION In order to solve the above-mentioned problems, the present
invention is characterized by comprising: a first microphone set that is equipped with at least
two first microphones and is supported so as to be rotatable about a rotation axis orthogonal to
a scanning plane on which the microphones are located; driving means for rotating the first
microphone set about the rotation axis so as to move the first microphones on the scanning plane;
and control means for calculating the difference in the time required for sound from the sound
source to reach each of the microphones and controlling the driving means so that the time
difference for the first microphone set is reduced and converges to a set value.
[0014]
Preferably, a second microphone set including at least two second microphones disposed parallel
to the scanning plane is further provided, and the control means calculates the difference in the
time required for the sound from the sound source to reach each of the first and second
microphones and controls the driving means so that the time difference for the first microphone
set is reduced and converges to the set value.
[0015]
In this case, the control means includes calculation means for calculating the cross-correlation
coefficient of the sound collected by each of the first and second microphones of the first and
second microphone sets, time difference calculation means for calculating the time difference
based on the cross-correlation coefficient, and means for converting the calculated time
difference into angle information, and at least the rotational direction of the drive means is
set by the angle information.
Furthermore, the calculation means divides the sound collected by each of the first and second
microphones of the first and second microphone sets into several frequency bands and, for each
frequency band, calculates the cross-correlation coefficient of the frequency components of the
sound.
In addition, using the information collected by each of the second microphones in the second
microphone set, the control means can detect a change in the time difference as movement or
switching of the sound source, and can correct or change the rotation direction and the angle
information of the first microphone set.
[0016]
Furthermore, the imaging apparatus according to the present invention is an imaging apparatus
equipped with the microphone set of the sound source direction setting apparatus described above,
in which an imaging lens is located at or near the rotation axis of the first microphone set and
is directed toward the sound source when there is no time difference in the sound collected by
each of the first microphones of the first microphone set.
[0017]
Further, the transmission system of the present invention is characterized in that it is equipped
with transmission means for transmitting an image of a sound source taken by the above-mentioned
image pickup device, together with the sound simultaneously recorded by the microphone, to a
required monitor and speaker.
[0018]
Furthermore, the transmission system of the present invention is characterized by constituting a
teleconferencing device in which a microphone, a monitor, and a speaker are provided at each of
the conference seats.
[0019]
Further, the transmission system of the present invention is characterized by constituting a video
telephone system using a communication line provided with a microphone, a monitor and a
speaker for each of the callers.
[0020]
BEST MODE FOR CARRYING OUT THE INVENTION Embodiments of the present invention will be
described below with reference to the drawings.
[0021]
FIG. 1 (a) is a plan view of a video conference apparatus provided with a sound source direction
setting apparatus according to an embodiment of the present invention.
FIG. 1 (b) is a top view of FIG. 1 (a).
FIG. 1 (c) is an internal configuration diagram of the video conference apparatus shown in FIGS. 1
(a) and 1 (b).
[0022]
FIGS. 1 (a) and 1 (b) show a television conference apparatus 100 formed by connecting, by the
rotation means 101, the microphone set 160 of the present invention, which is a first microphone
set having a camera lens 103 for imaging a speaker who is a sound source and microphones 120a and
120b for collecting the voice of the speaker and the like, and the microphone set 170, which is a
second microphone set having the microphones 110a and 110b for collecting the voice of the
speaker.
[0023]
Each of the microphones 110a, 110b, 120a and 120b is, for example, a microphone capable of
collecting sound in a frequency range of about 50 Hz to 7 kHz.
[0024]
Also, FIG. 1 (c) shows a speaker orientation detection means 130, which is a control means that
detects the speaker orientation based on the voice collected by the microphone set 170; a speaker
orientation detection means 150, which is a control means that detects the speaker orientation
based on the voice collected by the microphone set 160; and a drive means 140 that drives the
rotation means 101 by feeding the speaker orientation information detected by the speaker
orientation detection means 130 and 150 back to the teleconference device 100 side.
Here, the drive means 140 is configured, for example, to receive a signal from either of the
speaker orientation detection means 130 and 150.
[0025]
FIG. 2 is a block diagram of the microphone set 170 and the speaker orientation detection means
130.
FIG. 2 shows A/D conversion means 210a and 210b, which sample the sound collected by each of the
microphones 110a and 110b at a frequency of, for example, 16 kHz and convert it into digital
signals, and voice detection means 250, which includes a timer and detects whether or not the
sound input from the microphones 110a and 110b is the voice of the speaker.
[0026]
FIG. 2 also shows band pass filters 220a, 220b, 220a', 220b', ..., which pass only digital signals
of a predetermined frequency band; calculation means 230, 230', ..., which calculate the
cross-correlation coefficient of the passed digital signals; integrating means 240, 240', ...,
which integrate the calculated cross-correlation coefficients; and detecting means 260, 260', ...,
which detect the time difference between the microphones 110a and 110b that maximizes each
integrated cross-correlation coefficient.
[0027]
For example, seven sets of these means 220a to 260 are provided. The band pass filters 220a and
220b are set to pass only digital signals of, for example, 50 Hz to 1 kHz, the band pass filters
220a' and 220b' only those of, for example, 1 kHz to 2 kHz, and the remaining band pass filters
only those of their assigned frequency bands, for example, 2 kHz to 3 kHz, ..., 6 kHz to 7 kHz.
[0028]
Furthermore, FIG. 2 shows the time difference calculating means 270, which calculates the overall
time difference between the microphones 110a and 110b by applying a coefficient, determined in
advance for each band, to each of the detected time differences between the microphones 110a and
110b, and conversion means 280, which converts this time difference into angle information.
The speaker orientation detection means 150 is configured in the same manner as the speaker
orientation detection means 130.
[0029]
Next, the operation of the configuration shown in FIGS. 1 (a) to 1 (c) and FIG. 2 will be described.
First, the voices of the speaker are collected by the microphones 110a to 120b, and are output to
the speaker direction detection means 130 and 150, respectively.
In the speaker orientation detection means 130, 150, the voice is converted into a digital signal
by the A / D conversion means 210a, 210b.
This digital signal is output in parallel to the voice detection means 250 and the band pass filters
220a, 220b, 220a ', 220b' and so on.
[0030]
Here, as described above, the band pass filters 220a, 220b, 220a', 220b', etc. each pass one of
the frequency bands of 50 Hz to 1 kHz, 1 kHz to 2 kHz, 2 kHz to 3 kHz, ..., 6 kHz to 7 kHz.
Consequently, only the digital signal of the frequency band set for each band pass filter passes
through that filter.
[0031]
The digital signals that have passed through the band pass filters 220a, 220b, 220a ', 220b', etc.
are output to the calculation means 230, 230 ', etc., respectively.
The calculating means 230, 230 ', etc. calculate the cross-correlation coefficient of the input
digital signal.
The calculated cross-correlation coefficients are output to the integrating means 240, 240', etc.,
respectively, and are integrated there.
[0032]
Meanwhile, the voice detection means 250 determines whether the digital signal relates to voice,
and the determination result is output to the integrating means 240, 240', etc.
Based on the determination result of the voice detection means 250, each of the integrating means
240, 240', etc. outputs the integrated cross-correlation coefficient to the detecting means 260 if
the digital signal relates to voice; otherwise, it clears the integrated cross-correlation
coefficient.
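The gating of the integrated coefficients by the voice decision can be sketched as follows. The
class name GatedIntegrator and its methods are hypothetical, and the text does not say whether the
accumulator is reset after a successful output; this sketch assumes that it is, so that each
utterance is evaluated independently.

```python
class GatedIntegrator:
    """Accumulates cross-correlation coefficients, one entry per candidate lag,
    and releases them only when the voice detector reports voice."""

    def __init__(self, num_lags):
        self.acc = [0.0] * num_lags

    def add(self, coefficients):
        # Integrate the newest set of coefficients into the running sums.
        self.acc = [s + c for s, c in zip(self.acc, coefficients)]

    def flush(self, is_voice):
        # Output the sums if the signal was voice; otherwise discard them.
        # Resetting after a successful output is an assumption of this sketch.
        result = list(self.acc) if is_voice else None
        self.acc = [0.0] * len(self.acc)
        return result


integrator = GatedIntegrator(num_lags=3)
integrator.add([0.1, 0.9, 0.2])
integrator.add([0.2, 0.8, 0.1])
sums = integrator.flush(is_voice=True)
peak_index = sums.index(max(sums))  # index of the lag with the largest sum
```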
[0033]
Here, FIG. 3 is a flowchart showing the operation of the voice detection means 250, and the voice
detection means 250 distinguishes voice from background noise and the like according to the
procedure described below.
That is, the voice detection means 250 always measures the level of the digital signal with the
timer set to 0 (step S1). Then, a level ratio A between the level of the digital signal sampled at an
arbitrary time T and the level of the digital signal sampled at the time T-1 is obtained (step S2).
[0034]
Then, the level ratio A is compared with a predetermined threshold value (step S3). If the level
ratio A is larger than the threshold value, the process proceeds to step S4; if not, the process
proceeds to step S8. Here, the predetermined threshold value compared with the level ratio A is
used to determine whether the sound collected by any of the microphones is within the frequency
band of voice, and is, for example, about 100 Hz.
[0035]
Subsequently, in step S4, the timer is started, and the process proceeds to step S5, where the
time measured by the timer is compared with a predetermined threshold value. Here, the threshold
value compared with the time measured by the timer is used, for example, to discriminate between
the voice of the speaker and a sound such as that made when a conference participant drops a
document, and is, for example, about 0.5 seconds.
[0036]
If it is determined in step S5 that the measurement time of the timer is larger than the
predetermined threshold value, the process proceeds to step S6. If not, the process proceeds to
step S8. In step S6, the sound collected by any one of the microphones is determined to be voice,
while in step S8, the sound collected by any of the microphones is determined not to be voice.
Then, the process proceeds to step S7, and the timer is reset to zero. In practice, the voice
detection means 250 repeatedly performs the steps shown in FIG. 3 all the time.
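One possible reading of the flow of FIG. 3 is sketched below. The text does not state how the timer
is stopped, so the sketch assumes that it runs for as long as the level stays above the level seen
just before the onset multiplied by the ratio threshold. The function name, the frame duration, and
the ratio threshold of 2.0 are hypothetical; the 0.5 second duration threshold is the example value
from the text.

```python
def is_voice(levels, frame_s, ratio_threshold=2.0, min_duration_s=0.5):
    """Classify the first onset found in `levels` (one level per frame).

    Returns True if the sound lasts longer than min_duration_s (step S6),
    False otherwise (step S8).
    """
    for t in range(1, len(levels)):
        previous = levels[t - 1]
        if previous <= 0:
            continue
        ratio = levels[t] / previous          # step S2: level ratio A
        if ratio <= ratio_threshold:          # step S3: no onset at this frame
            continue
        # Step S4: start the timer at the onset and count the frames for
        # which the level stays elevated (assumed stopping rule).
        frames = 0
        k = t
        while k < len(levels) and levels[k] > previous * ratio_threshold:
            frames += 1
            k += 1
        # Step S5: compare the measured time with the duration threshold.
        return frames * frame_s > min_duration_s
    return False


# Hypothetical 10 ms frames: a 0.6 s utterance and a 0.1 s thud.
speech = [1.0] * 5 + [10.0] * 60 + [1.0] * 5
thud = [1.0] * 5 + [10.0] * 10 + [1.0] * 5
speech_detected = is_voice(speech, frame_s=0.01)
thud_detected = is_voice(thud, frame_s=0.01)
```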
[0037]
Further, in FIG. 2, the detection means 260 detect the time differences D1 to D7 in the arrival
time of the voice between the microphones 110a and 110b, and between the microphones 120a and
120b, that maximize each of the integrated cross-correlation coefficients, and output them to the
time difference calculating means 270. The time difference calculation means 270 then calculates
the total time difference d between the microphones 110a and 110b by applying the coefficients A1
to A7, which have been determined in advance for the respective bands, to the detected time
differences D1 to D7 between the microphones 110a and 110b.
[0038]
The time difference d can be expressed as d = [D1, D2, ..., D7] [A1, A2, ..., A7]^T, where ΣAi = 1 (i = 1, ..., 7).
[0039]
Here, it is known that when sound is reflected by a wall or a floor, the higher the frequency, the
more diffusely it is reflected, whereas the lower the frequency, the closer the sum of the
incident angle and the outgoing angle is to 90 degrees.
Therefore, the lower the frequency of the voice, the more the voice reflected by the wall or floor
may interfere with the voice collected directly by each microphone and affect the identification
of the speaker direction.
[0040]
Therefore, for example, let D1 be the time difference detected based on the digital signals that
have passed through the band pass filter 220a and the like, which pass the frequency band of
50 Hz to 1 kHz, let D2 be the time difference detected based on the digital signals that have
passed through the band pass filter 220a' and the like, which pass the frequency band of 1 kHz to
2 kHz, and so on for D3, ..., up to D7, the time difference detected based on the digital signals
that have passed through the band pass filters that pass the frequency band of 6 kHz to 7 kHz.
The coefficients A1 to A7 are then determined such that A1 < A2 < ... < A7 and ΣAi = 1
(i = 1, ..., 7).
[0041]
Then, as described above, the inner product of these coefficients A1 to A7 and the time
differences D1 to D7 is calculated, and the time difference d is obtained.
Because a smaller coefficient is applied to the time difference of a lower frequency band and a
larger coefficient to that of a higher frequency band, the influence of reflection from walls,
floors, and the like is reduced.
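The weighted combination of the per-band time differences can be sketched as follows. The text
only requires A1 < A2 < ... < A7 and ΣAi = 1; the particular weights i/28 used here are a
hypothetical choice that satisfies both conditions, and the function name is likewise an
assumption.

```python
def combine_time_differences(per_band_lags, weights):
    """Return the inner product of the per-band time differences D1..Dn and
    the weights A1..An, after checking the two conditions on the weights."""
    if len(per_band_lags) != len(weights):
        raise ValueError("one weight is needed per frequency band")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if any(b <= a for a, b in zip(weights, weights[1:])):
        raise ValueError("weights must increase with frequency")
    return sum(d * a for d, a in zip(per_band_lags, weights))


# Hypothetical weights: 1/28, 2/28, ..., 7/28 (increasing, summing to 1).
weights = [i / 28 for i in range(1, 8)]

# The lowest band is disturbed by a reflection (10 samples instead of 3),
# but its small weight limits the effect on the combined value.
d = combine_time_differences([10, 3, 3, 3, 3, 3, 3], weights)
```

With equal weights the same disturbance would shift the result by a full sample; with the
increasing weights it shifts it by a quarter of a sample.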
[0042]
The calculated time difference d is output to the conversion means 280. The conversion means
280 converts time information into angle information using the following equation.
[0043]
θd = sin^-1 ((d × V [m/s]) / (Fs [Hz] × L [m]))
where V is the sound speed, Fs is the sampling frequency, and L is the distance between the
microphones 110a and 110b, etc.
The angle information obtained by this conversion is output to the driving means 140. The drive
means 140 selects one of the output signals of the speaker orientation detection means 130 and
150 and drives the rotation means 101 based on the selected signal, as described later.
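The conversion performed by the conversion means 280 can be sketched as below, taking d as a time
difference in samples so that d / Fs is a time in seconds. The default sound speed of 340 m/s and
microphone spacing of 0.3 m are hypothetical, 16 kHz is the example sampling frequency from the
text, and the clamping of the sin^-1 argument is an assumption added so that measurement noise
cannot push it outside the valid range.

```python
import math


def lag_to_angle_deg(d, v=340.0, fs=16000.0, mic_distance=0.3):
    """Convert a time difference d (in samples) into an angle in degrees
    using theta_d = asin((d * V) / (Fs * L))."""
    x = (d * v) / (fs * mic_distance)
    # Clamp to [-1, 1]: a noisy lag can slightly exceed the physical maximum.
    x = max(-1.0, min(1.0, x))
    return math.degrees(math.asin(x))


angle_front = lag_to_angle_deg(0)    # source equidistant from both microphones
angle_side = lag_to_angle_deg(5)     # source off to one side
```

The sign of the result follows the sign of d, so it also indicates the direction in which the
rotation means should turn.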
[0044]
Specifically, first, based on the angle information signal output from the speaker orientation
detection means 130, the microphone set 160 is rotated by the rotation means 101 so that the
speaker is equidistant from the microphones 120a and 120b. Subsequently, based on the angle
information signal output from the speaker orientation detection means 150, fine adjustment is
performed so that the speaker is equidistant from the microphones 120a and 120b.
[0045]
That is, first, for example, when the above-mentioned angle θ calculated based on the sound
collected by each of the microphones 110a and 110b is the angle θd1, the rotation means 101
is driven such that the angle θd1 becomes zero. At this point, the speaker is not yet exactly
equidistant from each of the microphones 120a and 120b, because the use of the above equation
introduces an error.
[0046]
Therefore, if the above-mentioned angle θ calculated based on the sound collected by each of
the microphones 120a and 120b is the angle θd2, the rotation means 101 is driven so that the
angle θd2 becomes 0. At this time, since the angle θd2 is considerably smaller than the angle
θd1, the microphone set 160 can be directed to the speaker with high accuracy.
[0047]
Then, for example, when the speaker changes, the angle θd1 changes, so the rotation means 101 is
similarly driven so that the angle θd1 becomes zero, and then driven so that the angle θd2
becomes zero.
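The coarse-then-fine procedure can be illustrated with a small simulation. This is one
interpretation only: it assumes that the microphone set 170 stays fixed and supplies θd1, that the
microphone set 160 rotates and supplies θd2 relative to its current heading, and that the only
error source is the rounding of the time difference to whole samples. All names and default values
(340 m/s, 0.3 m spacing, five fine steps) are hypothetical; 16 kHz is the example sampling
frequency from the text.

```python
import math


def measured_angle_deg(true_offset_deg, v=340.0, fs=16000.0, mic_distance=0.3):
    """Angle reported by a microphone pair for a source true_offset_deg away
    from its front direction, with the lag rounded to whole samples."""
    lag = round(math.sin(math.radians(true_offset_deg)) * fs * mic_distance / v)
    x = max(-1.0, min(1.0, lag * v / (fs * mic_distance)))
    return math.degrees(math.asin(x))


def aim(speaker_deg, fixed_set_deg=0.0, max_fine_steps=5):
    """Return the final heading of the rotating microphone set."""
    # Coarse step: rotate by theta_d1 as measured by the fixed set.
    heading = fixed_set_deg + measured_angle_deg(speaker_deg - fixed_set_deg)
    # Fine steps: rotate by theta_d2 as measured by the rotating set itself,
    # until it reports no time difference.
    for _ in range(max_fine_steps):
        theta_d2 = measured_angle_deg(speaker_deg - heading)
        if theta_d2 == 0.0:
            break
        heading += theta_d2
    return heading


coarse_error = abs(measured_angle_deg(80.0) - 80.0)
final_error = abs(aim(80.0) - 80.0)
```

In this model the fine stage works in the region around zero time difference, where one sample of
delay corresponds to the smallest change in angle, so the residual error is bounded by half of
that smallest step.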
[0048]
As described above, in the present embodiment, an example has been described in which the
microphones 110a and 110b are provided in the microphone set 170 in addition to the microphone
set 160. However, the microphones 120a and 120b may be provided only in the microphone set 160.
In that case, the difference in the time taken for the sound from the sound source to reach each
of the microphones is measured, the microphone set 160 is rotated around the rotation axis of the
rotation means 101 so that this time difference disappears, and the direction of the sound source
may be detected from the rotation angle.
[0049]
However, since the microphone set 170 is usually oriented toward the center of the plurality of
conference participants, a configuration in which the microphones 110a and 110b are also provided
in the microphone set 170 can rotate the microphone set 160 toward the new speaker more quickly
when the speaker changes.
[0050]
That is, for example, when the microphone set 160 must be rotated by 90 degrees because the
speaker has changed, the microphone set 160 is rotated while the direction of the speaker is
calculated by the microphones 120a and 120b of the microphone set 160. More specifically, when
the direction of the speaker is determined by the microphone set 170, the angle between the
microphones 110a and 110b and the speaker is smaller, so the direction can be detected with less
error.
[0051]
In the present embodiment, a television conference apparatus using a speaker orientation
detection apparatus has been described. A video conference system can be configured by connecting
such television conference apparatuses to each other by a communication line such as an
Integrated Services Digital Network (ISDN) line, for example, and providing a speaker and a
monitor for outputting the audio information and image information transmitted from the other
video conference apparatus.
[0052]
Furthermore, the speaker orientation detection apparatus according to the present embodiment
can be used as an image pickup apparatus for picking up an image of a sound source including a
speaker, and also as a video telephone apparatus using the image pickup apparatus.
[0053]
As described above, the present invention calculates the difference in the time required for the
sound from the sound source to reach each of the at least two first microphones provided in the
first microphone set, and rotates the first microphone set so as to reduce the time difference
and make it converge to the set value, so that the first microphone set can be correctly directed
to the sound source.
[0054]
Further, according to the present invention, a change in the time difference is regarded as
movement or switching of the sound source based on the information collected by each of the
second microphones, and the rotation direction of the first microphone set is corrected or
changed, so that the first microphone set responds promptly to the movement or switching and is
directed to the sound source at its new position.
[0055]
Furthermore, the present invention calculates the time difference based on the cross-correlation
coefficient of the sound collected by each of the first and second microphones.
Then, for example, the time information is converted into angle information, and at least the
rotational direction of the first microphone set is set by the angle information, so that it is not
easily affected by the reflection characteristic and the like.