JP2012147420

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2012147420
The present invention provides a video conference camera microphone device that displays an
image of a speaker's voice level above (overhead) the face of a conference attendee. A
camera / microphone unit (50) of the present invention comprises: face detection means (15) that
processes an image taken by a camera (3) to detect the face of a meeting attendee; voice arrival
direction detection means (16) that detects the voice arrival direction using a plurality of
microphones (5); voice collection direction change means (17) that changes the direction in
which voice is collected based on deviation time information (22); voice level calculation means
(18) that calculates the level of the voice collected by the voice collection direction change
means (17); and audio level display / synthesizing means (19) that generates an image signal (25)
for displaying the audio level above the head of the speaking attendee in the meeting room
image, based on face detection information (20) detected by the face detection means (15), voice
arrival direction information (21) detected by the voice arrival direction detection means (16),
and voice level information (24) calculated by the voice level calculation means (18).
[Selected figure] Figure 2
Image processing apparatus and image processing system
[0001]
The present invention relates to an image processing apparatus and an image processing system,
and more particularly to an image processing technique for displaying an audio level in
association with a meeting attendee.
[0002]
03-05-2019
1
Description of the Related Art
Conventionally, for still-image teleconferencing devices, a technique has been known in which
the audio level collected by a microphone arranged for each conferee is displayed as an image
in association with that conferee.
However, because the conventional conference apparatus requires a microphone for each
conference attendee, it is difficult to respond promptly when the number of people attending
the conference changes. Further, Patent Document 1 discloses a configuration in which, for the
purpose of clearly indicating who is speaking, the audio level collected by a microphone
disposed for each conference attendee is displayed in the image in association with that
attendee.
[0003]
However, although the prior art disclosed in Patent Document 1 resembles the present invention
in that the audio level is displayed in the image in association with the meeting attendees, a
microphone must be arranged for each attendee, so the difficulty of coping with a change in the
number of attendees remains unsolved. The present invention has been made in view of the above
problems. Its object is to provide a video conference camera microphone apparatus that
eliminates the need for one microphone per meeting attendee and for a display device that
indicates who is speaking. To this end, the apparatus uses a microphone array composed of a
plurality of microphones to detect the arrival direction of voice, detects the faces of the
meeting attendees by image processing, and displays an image of the speaker's voice level above
(overhead) the face of the speaking attendee.
[0004]
In order to solve the above problems, the present invention is an image processing apparatus
including a photographing means and microphones for picking up voice, characterized by
comprising: person detection means for detecting the position of a person based on an image
photographed by the photographing means; voice arrival direction detection means for detecting
the arrival direction of voice based on shift time information of voice data collected by a
plurality of microphones; voice collecting direction changing means for changing the direction
in which voice is collected by correcting the shift times of the data collected by the
plurality of microphones and adding the data; voice level calculating means for calculating the
level of the voice collected by the voice collecting direction changing means; and voice level
display synthesizing means for generating a signal for displaying the voice level on the image,
based on person detection information detected by the person detection means, voice arrival
direction information detected by the voice arrival direction detection means, and voice level
information calculated by the voice level calculating means. The present invention recognizes
the faces of the conferees and picks up the voice of each conferee to detect who is speaking.
A mark corresponding to the speaker's voice level is then displayed above the image of the
speaker. To realize this, the video conference camera microphone device of the present
invention is provided with face detection means, voice arrival direction detection means, voice
collection direction change means, voice level calculation means, and voice level display
synthesis means, and generates an image signal. This makes it possible to eliminate the need
for one microphone per meeting attendee.
[0005]
According to a second aspect of the present invention, the voice level display / synthesis means
changes, in real time, the size of a circle displayed in the vicinity of the speaker's image, in
accordance with the voice level and with the speaker information specified by the person
detection means and the voice arrival direction detection means. In this audio level display
method, the size of the circle above the speaker's image changes according to the audio level.
For example, the circle is enlarged when the sound level is high and reduced when the sound
level is low. These displays are performed in real time. This makes it possible to recognize
immediately who is speaking and at what voice level.
[0006]
A third aspect of the present invention is characterized in that a sound is judged to be voice
when its signal level remains at or above a predetermined threshold for a predetermined time or
longer. In a meeting, brief remarks by attendees other than the speaker are also picked up. If
all of these sounds were detected, the image display could change rapidly. Requiring the signal
level to stay above the threshold for the predetermined time prevents such drastic changes in
the image display.
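The threshold-plus-duration rule of the third aspect can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the per-frame level input, and the use of a frame count to express the "predetermined time" are all assumptions made for the example.

```python
def detect_voice(levels, threshold, min_frames):
    """Return one boolean per frame.

    A frame is flagged as voice only once the level has stayed at or above
    `threshold` for at least `min_frames` consecutive frames (hypothetical
    parameters standing in for the "predetermined threshold" and
    "predetermined time").
    """
    flags = []
    run = 0  # number of consecutive frames at or above the threshold
    for level in levels:
        run = run + 1 if level >= threshold else 0
        flags.append(run >= min_frames)
    return flags


# A short burst (one frame) is ignored; a sustained run is detected.
print(detect_voice([0, 5, 0, 5, 5, 5, 0], threshold=3, min_frames=3))
```

With these inputs only the sixth frame is flagged, so a one-frame interjection does not trigger a change in the display.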
[0007]
A fourth aspect of the present invention is an image processing system comprising an image
display device for displaying an image including the sound level, and an image processing
apparatus including a photographing means and microphones for picking up sound, characterized
in that the image processing apparatus has: person detection means for detecting the position
of a person based on the image photographed by the photographing means; voice arrival direction
detection means for detecting the arrival direction of voice based on deviation time
information of voice data collected by a plurality of microphones; voice collection direction
changing means for changing the direction in which voice is collected by correcting the
deviation times of the data collected by the plurality of microphones and adding the data;
voice level calculation means for calculating the level of the voice collected by the voice
collection direction changing means; audio level display combining means for generating a
signal for displaying the audio level on the image, based on person detection information
detected by the person detection means, voice arrival direction information detected by the
voice arrival direction detection means, and audio level information calculated by the voice
level calculation means; and transmitting means for transmitting the signal to another image
processing apparatus. At least two video conference camera microphone devices according to the
present invention are prepared, each conference room is equipped with the video conference
camera microphone device, an image display device, a speaker, and a video conference device,
and the conference rooms are connected by a network such as a LAN to construct a video
conference system. This makes it possible to conduct a meeting with a remote place smoothly.
[0008]
According to the present invention, a microphone array comprising a plurality of microphones
is used to detect the arrival direction of the speaker's voice, the faces of the meeting
attendees are detected by image processing, and the speaker's voice level is displayed as an
image above (overhead) the face of the speaking attendee. This makes it possible to eliminate
the need for one microphone per meeting attendee.
[0009]
FIG. 1 is a figure explaining the appearance of the image processing apparatus according to the
embodiment of the present invention.
FIG. 2 is a block diagram explaining the internal configuration of the image processing
apparatus according to the embodiment of the present invention.
FIG. 3 is a flow chart explaining the operation of the image processing apparatus according to
the embodiment of the present invention.
FIG. 4 is a figure explaining the principle of operation of the voice arrival direction
detection means.
FIG. 5 is a figure explaining the principle of operation of the voice collection direction
changing means.
FIG. 6 is a figure explaining the face detection means, which is one example of the person
detection means.
FIG. 7 is a figure explaining the upper body detection means, which is another example of the
person detection means.
FIG. 8 is a figure explaining how the voice level of the speaker is displayed as an image above
the head of the speaker by the size of a circle.
FIG. 9 is a figure explaining how the voice level of the speaker is displayed in the center of
the upper body area of the speaker by the length of a bar graph.
FIG. 10 is a figure explaining how the voice level of the speaker is displayed by the thickness
of the rectangular frame of the speaker image area.
FIG. 11 is a figure explaining how the voice level of the speaker is displayed by the thickness
of the outline of the speaker image area.
FIG. 12 is a figure explaining an image processing system that uses the image processing
apparatus of the present invention in a meeting room.
FIG. 13 is a figure explaining the operation when the image processing system of the present
invention is installed in two meeting rooms.
[0010]
Hereinafter, the present invention will be described in detail using the embodiments shown in
the drawings. Unless specifically stated otherwise, the constituent elements, types,
combinations, shapes, relative arrangements, and the like described in the embodiments are
merely illustrative examples and are not intended to limit the scope of the present invention.
[0011]
FIG. 1 is a view for explaining the appearance of an image processing apparatus according to an
embodiment of the present invention. The image processing apparatus 50 according to the
present invention comprises a main body 4, a photographing device 3 provided on the front of
the main body 4 for photographing the meeting attendees and the like, a plurality of
microphones 5 for picking up the voices of speaking attendees, an upright column 6, and a
pedestal 7 that fixes the column 6. The internal configuration of the main body 4 will be
described later. The main body 4 may be configured to be removable from the column 6. The
photographing device 3 photographs the conference being held at the own site, and the
photographed image is transmitted to the other sites so that a remote conference can be
realized. The image photographed by the photographing device 3 includes the persons
(conference attendees) who are holding the conference at the own site.
[0012]
FIG. 2 is a block diagram for explaining the internal configuration of the image processing
apparatus according to the embodiment of the present invention. The image processing apparatus
50 according to the present invention includes a photographing device 3 and a plurality of
microphones 5 (microphones a to d, forming a microphone array) for picking up the voices of
meeting attendees. It comprises: person detection means 15 for detecting the position of a
person (meeting attendee) included in the image by processing the image captured by the
photographing device 3; voice arrival direction detection means 16 for detecting the arrival
direction of voice and outputting voice arrival direction information 21 and deviation time
information 22; voice collection direction changing means 17 for changing the direction in
which voice is collected based on the deviation time information 22; sound level calculating
means 18 for calculating the level of the sound collected by the voice collection direction
changing means 17; and audio level display / synthesizing means 19 for generating an image
signal 25 that displays the audio level in the vicinity of the speaking attendee in the image
captured by the photographing device 3, based on the person detection information 20 detected
by the person detection means 15, the voice arrival direction information 21 detected by the
voice arrival direction detection means 16, and the voice level information 24 calculated by
the sound level calculating means 18. An audio signal 23 is output from the voice collection
direction changing means 17.
[0013]
The image signal output from the photographing device 3 is input to the person detection means
15, which detects a person from the image and outputs the position information of the person as
person detection information 20. Person detection is prior art, but will be described later.
The voice output signals of the microphone array consisting of the four microphones a to d are
input to the voice arrival direction detection means 16, which detects the arrival direction of
the sound, that is, the direction of the speaker as viewed from the microphone array and the
photographing device. Depending on the direction from which the sound arrives at the
microphone array, a time lag occurs among the audio signal outputs of the four microphones (a
to d) 5. The arrival direction of the sound is detected from this time lag (shift time
information 22), and the shift time information 22 and the voice arrival direction information
21 are output. The voice output signals of the microphone array are also input to the voice
collection direction changing means 17, which uses the deviation time information 22 to pick up
voice from the direction of the speaker. The operating principles of the voice arrival
direction detecting means 16 and the voice collection direction changing means 17 are prior
art, but will be described later. The voice signal 23 of the speaker output from the voice
collection direction changing means 17 is input to the voice level calculating means 18 and, at
the same time, is output as the voice signal 23 of the image processing apparatus 50. The
sound level calculating means 18 calculates the effective value of the sound signal at
predetermined time intervals and outputs the sound level information 24.
[0014]
For example, assuming that the sampling frequency of the audio signal is 8 kHz, the square root
of the mean of the squared sample values (= effective value) is calculated for each block of
128 samples of audio data (1/8000 seconds × 128 samples = 16 msec), and the result is output as
voice level information. The person detection information 20, the voice arrival direction
information 21, and the voice level information 24 are input to the voice level display
combining means 19, which outputs an image signal of an image in which the voice level is
displayed as a circle 2 in the vicinity of the speaker 1 in the conference room image, as shown
in the figure. That is, the present invention detects who is speaking based on the position
information of the person and the voice arrival direction information. Then, in accordance
with the voice level of the speaker, a mark or numerical value corresponding to the voice level
is displayed in the vicinity of the image of the speaker. To realize this, in the present
embodiment the image processing apparatus 50 is provided with the person detection means 15,
the voice arrival direction detection means 16, the voice collection direction changing means
17, the voice level calculating means 18, and the voice level display combining means 19, and
generates an image signal 25. This makes it possible to eliminate the need for one microphone
per meeting attendee.
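The block-wise level calculation in the 8 kHz example can be sketched as below. It is only an illustration: it reads "effective value" as the root-mean-square of each 128-sample block, which is an assumption about the translated wording, and the function name and list-based input are hypothetical.

```python
import math


def frame_levels(samples, frame_len=128):
    """Return the RMS (effective value) of each complete block of `frame_len` samples.

    At a sampling frequency of 8 kHz, 128 samples correspond to 16 ms,
    matching the example in the text. A trailing partial block is ignored.
    """
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        levels.append(math.sqrt(mean_square))
    return levels


# Two blocks of a constant-amplitude signal give one level value per block.
print(frame_levels([3.0] * 256))
```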
[0015]
FIG. 3 is a flow chart for explaining the operation of the image processing apparatus according
to the embodiment of the present invention. The processing (S7) for detecting a person from
the image signal output from the photographing device 3 and the processing (S1) for detecting
voice from the audio signal output from the microphones 5 are performed in parallel. Voice is
judged to be detected when the signal level is equal to or higher than a predetermined
threshold and remains so for a predetermined time or longer. As a result, the level of brief
remarks and the like is not displayed, which prevents the image display from changing rapidly.
Next, when voice is detected, its arrival direction is detected by the voice arrival direction
detection means 16 (S2). If it differs from the current arrival direction, the sound
collection direction is changed by the voice collection direction changing means 17 (S3).
Next, the level of the sound being picked up is calculated by the sound level calculating means
18 (S4). Thereafter, using the person detection information 20, the voice arrival direction
information 21, and the voice level information 24, the voice level display combining means 19
synthesizes the voice level display into the image (S5). The above process is repeated until
the meeting is over. The end of the meeting may be determined by the input of an end control
signal from the connected conference apparatus 10 (see FIG. 13), or by the powering off of the
image processing apparatus 50.
[0016]
FIG. 4 is a diagram for explaining the operation principle of the voice arrival direction
detection means. For example, when the speaker is directly in front of the microphone array,
the sound reaches the four microphones (a to d) at the same time, and there is no time lag
among the audio signal outputs of the four microphones. When the sound 26 arrives from an
oblique direction, the arrival time of the sound differs for each microphone, so a time lag
occurs among the audio signal outputs of the four microphones. As an example, as shown in FIG.
4A, when the incoming sound 26 arrives, the arrival times at the microphone b, the microphone
c, and the microphone d are delayed by t1, t2, and t3, respectively, relative to the microphone
a. From this time lag, the direction of the incoming sound 26 (the direction of the speaker)
can be detected (see FIG. 4B).
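One way to turn such a time lag into a direction can be sketched as follows. The text does not give a formula, so everything here is an assumption made for illustration: the delay between two microphones is estimated by cross-correlation, and the angle is obtained from the standard far-field relation sin(θ) = c·τ / d for two microphones a distance d apart, with c the speed of sound. The function names, the 343 m/s default, and the spacing value are hypothetical.

```python
import math


def estimate_delay(ref, sig, max_lag):
    """Return the lag in samples (positive = sig arrives later than ref)
    that maximizes the cross-correlation of the two equal-length signals."""
    best_lag, best_score = 0, float("-inf")
    n = len(ref)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += ref[i] * sig[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle_deg(delay_samples, sample_rate_hz, mic_spacing_m,
                      speed_of_sound=343.0):
    """Far-field arrival angle in degrees from the array's front direction."""
    s = speed_of_sound * delay_samples / (sample_rate_hz * mic_spacing_m)
    s = max(-1.0, min(1.0, s))  # clamp against rounding and noisy delays
    return math.degrees(math.asin(s))


# A pulse reaching microphone b two samples after microphone a.
mic_a = [0.0] * 32
mic_b = [0.0] * 32
mic_a[10] = 1.0
mic_b[12] = 1.0
lag = estimate_delay(mic_a, mic_b, max_lag=5)
print(lag, arrival_angle_deg(lag, sample_rate_hz=16000, mic_spacing_m=0.1))
```

A zero lag maps to the front direction (0 degrees), which agrees with the case in the text where the speaker is directly in front of the array.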
[0017]
FIG. 5 is a diagram for explaining the operation principle of the voice collection direction
changing means. The voice collection direction changing means 17 adds a time delay to each
microphone output so as to cancel the arrival time delays (t1, t2, t3) detected for the
microphones. That is, as shown in FIG. 5A, a delay unit 27 having a time delay t3 is provided
for the microphone a, a delay unit 28 having a time delay t2 for the microphone b, and a delay
unit 29 having a time delay t1 for the microphone c, so that the timing of the voice signals of
the incoming sound matches (see FIG. 5B). When these signals are added, the voice signals from
the direction of the incoming sound reinforce each other, and the voice signals coming from
other directions are canceled. In this manner, the voice collection direction is changed, and
the voice of the speaker is collected and output.
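The delay-and-add operation described for FIG. 5 can be sketched as below. This is a simplified illustration with whole-sample delays; the function name and data layout are assumptions. Each channel is delayed by the difference between the latest arrival and its own arrival. For four equally spaced microphones with a far-field source (t1 = τ, t2 = 2τ, t3 = 3τ), this gives delays of t3, t2, t1, and 0 for the microphones a, b, c, and d, which agrees with the delay units 27 to 29 in the text; the equal spacing is itself an assumption.

```python
def delay_and_sum(channels, arrival_delays):
    """Align and average the microphone channels.

    channels: list of equal-length sample lists, one per microphone.
    arrival_delays: arrival delay of each channel in samples, relative to
        the first microphone (for example [0, t1, t2, t3]).
    """
    latest = max(arrival_delays)
    n = len(channels[0])
    out = [0.0] * n
    for samples, delay in zip(channels, arrival_delays):
        shift = latest - delay  # extra delay that aligns this channel
        for i in range(shift, n):
            out[i] += samples[i - shift]
    return [value / len(channels) for value in out]


# A pulse reaching microphones a to d at samples 10, 11, 12 and 13.
channels = []
for arrival in (10, 11, 12, 13):
    ch = [0.0] * 20
    ch[arrival] = 1.0
    channels.append(ch)

steered = delay_and_sum(channels, [0, 1, 2, 3])    # matches the source direction
unsteered = delay_and_sum(channels, [0, 0, 0, 0])  # assumes the front direction
print(max(steered), max(unsteered))
```

When the delays match the source direction the four pulses coincide and add coherently; with mismatched delays they remain spread out and the peak is smaller.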
[0018]
FIG. 6 is a diagram for explaining face detection means as an example of the person detection
means. Detecting a face from an image can be realized by known techniques such as that shown in
the reference (Face image processing technology for digital cameras: OMRON KEC information No.
210, 2009. JUL, P. 16-P. 22). In particular, in the present invention it is not necessary to
identify whose registered face the detected face is. FIG. 6 shows an example of the result of
detecting faces from a conference room image. When the face of the speaker 30 is detected, it
is enclosed by a rectangle 31, and the position (coordinates) of the rectangle on the image is
output as face detection information. Thereby, the voice level can be displayed as a circle
above the face of the speaker 30 (above the head of the speaker). Although the sound level is
displayed as a circle above the face of the speaker 30 in FIG. 6, the position at which the
sound level is displayed and the method of displaying it are not limited thereto. That is, the
sound level may be displayed below the face of the speaker 30 or over the body of the speaker
30. The position at which the audio level is displayed may also be changed based on the
position of the speaker in the image captured by the photographing device. Further, the sound
level is not limited to a circle and may be shown by another figure or display method. FIG. 7
is a diagram for explaining area detection means that detects an area including the face and
the upper body, as another example of person detection. Detecting a person area from an image
can be realized by a known technique such as that of the reference document (person detection
apparatus: Glory Co., Ltd.: JP2009-140307A).
[0019]
FIG. 8 is a view for explaining how the voice level of the speaker is displayed as an image by
the size of a circle above the head of the speaker. According to the present invention, the
voice level of the speaker is superimposed on the image of the other party's side, or of the
conference room, shown in a conventional video conference. As an example, as shown in FIG. 8, a
circle 2 of a size corresponding to the voice level of the speaker 1 is displayed above the
head of the speaker 1. The size of the circle 2 is changed in real time according to the audio
level. FIG. 8 (a) shows the case where the sound level is high, and FIG. 8 (b) shows the case
where the sound level is low. This makes it possible to see who is speaking. Also, because the
loudness of the speaker's voice can be seen visually, speakers can check for themselves whether
they are speaking loudly or quietly. During a video conference, a speaker may be unsure whether
his or her voice is reaching the other party and may speak louder than necessary. Moreover,
even when the other party's voice is quiet and hard to hear, it may be difficult to ask the
other party to speak up. If speakers can see for themselves whether their voice is loud or
quiet, they can avoid speaking louder than necessary, and a speaker who notices that his or her
voice is quiet can speak up, so that the conference proceeds smoothly. That is, according to
the audio level image display method of the present invention, the size of the circle 2 above
the image of the speaker 1 is changed according to the audio level. For example, when the
audio level is high, the size of the circle 2 is increased as shown in FIG. 8A, and when the
audio level is low, the size of the circle 2 is reduced. These displays are performed in real
time. This makes it possible to recognize immediately who is speaking and at what voice level.
[0020]
The center coordinates (x, y) of the display position for a circular level display are
determined, for example, by the following equations.
x = (Xl + Xr) / 2
y = Yt + Rmax + Yoffset
where
Xl: x coordinate of the left end of the person area
Xr: x coordinate of the right end of the person area
Yt: y coordinate of the top end of the person area
Rmax: maximum radius of the circle (size of the circle at the maximum level)
Yoffset: gap between the person area and the circle
The radius r of the circle is determined on a logarithmic scale, so as to correspond to the
loudness perceived by human hearing, for example by the following equations.
r = Rmax * log (p) / log (Pmax) (in the case of p > 1)
r = 0 (in the case of p ≦ 1)
where
Rmax: maximum radius of the circle (size of the circle at the maximum level)
p: voice level (short-time power value)
Pmax: maximum level (short-time power at maximum amplitude)
The short-time power p of the signal X = (x1, x2, ... xN) is defined by an equation that
appears only as an image in the source text. [Equation image not reproduced]
If the sampling frequency is 16 kHz and N = 320, for example, the short-time power is
calculated over 20 ms of data. The maximum level Pmax is Pmax = 32767 * 32767 / √2 in the case
of 16-bit wide PCM data (amplitude values in the range of -32768 to 32767). However, when the
level display position is set outside (for example, above) the speaker's area, as in this
example, space must be secured for the image to be displayed. Depending on the composition, it
may be necessary to correct the position at which the level is displayed, for example when the
face of the speaker is near the upper end of the image and there is no space above it. In such
a case, the display may be performed by moving the center coordinates of the circle to below
the face area, to its left, or the like.
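The circle placement and radius equations of this paragraph can be written out as a small sketch. The function and parameter names are hypothetical; the formulas, including the sign convention for y and the value of Pmax for 16-bit PCM, are taken as stated in the text. The short-time power p is passed in as a parameter, because its defining equation is not reproduced here.

```python
import math

# Maximum level for 16-bit PCM, as stated in the text.
PMAX_16BIT = 32767 * 32767 / math.sqrt(2)


def circle_center(xl, xr, yt, r_max, y_offset):
    """Center (x, y) of the level circle for a person area.

    xl, xr: x coordinates of the left and right ends of the person area.
    yt: y coordinate of the top end of the person area.
    """
    return ((xl + xr) / 2, yt + r_max + y_offset)


def circle_radius(p, r_max, p_max=PMAX_16BIT):
    """Radius on a logarithmic scale: 0 for p <= 1, r_max when p equals p_max."""
    if p <= 1:
        return 0.0
    return r_max * math.log(p) / math.log(p_max)


print(circle_center(100, 200, 50, r_max=40, y_offset=10))
print(circle_radius(PMAX_16BIT, r_max=40))
print(circle_radius(1, r_max=40))
```

Because the ratio of logarithms does not depend on the logarithm base, the natural logarithm is used here without affecting the result.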
[0021]
Examples of level display in which the display-area problem described above is less likely to
occur are shown in the figures described below. FIG. 9 is a view for explaining how the voice
level of the speaker 1 is displayed as an image by the length of a bar graph 2 at the center of
the upper body area of the speaker. FIG. 10 is a diagram for explaining how the voice level of
the speaker 1 is displayed by the thickness of the rectangular frame 2 of the speaker image
area. FIG. 11 is a diagram for explaining how the voice level of the speaker 1 is displayed by
the thickness of the outline 2 of the speaker image area. In each case, the same effects as in
the example of FIG. 8 are obtained, namely that one can see at a glance who is speaking and
that the loudness of the speaker's voice can be seen visually. In addition, because the level
is displayed inside or immediately adjacent to an area that already exists in the image, the
problem of securing space for the level display is less likely to occur.
[0022]
FIG. 12 is a view for explaining an image processing system using the camera microphone unit of
the present invention in a conference room. An image processing system 60 according to the
present invention includes the image processing apparatus 50 described with reference to FIGS.
1 and 2, an image display apparatus 9 for displaying an image of a conference room, a speaker 8
for amplifying the voices of meeting attendees, and a conference apparatus 10 for transmitting
the image signal 11 and the audio signal 12 output from the image processing apparatus 50 to
another image processing apparatus via the network 32. The figure illustrates the image
processing apparatus 50 of FIG. 1 being used in a conference room in combination with the
conference apparatus 10. The meeting attendees 11 sit in the seating arrangement shown in the
figure. The image display device 9 may be a television monitor, or may project an image on a
screen or a wall using a projector. The image processing apparatus 50 is placed on the
conference desk 12 at a position where the camera 3 can capture all the meeting attendees.
[0023]
FIG. 13 is a diagram for explaining the operation when the image processing system of the
present invention is installed in two conference rooms. FIG. 13 shows a case where a video
conference is held between conference room A and conference room B. For example, the image
signal 11 and the audio signal 12 output from the image processing apparatus 50 of conference
room A are transmitted to conference room B on the other party's side via the conference
apparatus 10 and the network 32. The received image signal 14 is displayed by the image
display device 9 on the receiving side, and the received audio signal 13 is output as sound
from the speaker 8 on the receiving side. The conference apparatus 10 can also display the
conference room image of its own side on its own image display apparatus 9. That is, at least
two image processing devices 50 of the present invention are prepared, and the image processing
device 50, the image display device 9, the speaker 8, and the conference device 10 are provided
in each of the conference rooms A and B. The image processing system can be constructed by
connecting the conference rooms through the network 32, such as a LAN. This makes it possible
to conduct a meeting with a remote place smoothly.
[0024]
1 speaker, 2 voice level, 3 camera, 4 main unit, 5 microphone, 6 pillar, 7 pedestal, 8
speaker, 9 image display device, 10 video conference device, 11 conference attendee, 12
conference desk, 13 audio signal, 14 image signal, 15 face detection means, 16 voice arrival
direction detection means, 17 voice collection direction change means, 18 voice level
calculation means, 19 voice level display synthesis means, 20 face detection information, 21
voice arrival direction information, 22 deviation time information, 23 voice signal, 24 voice
level information, 25 image signal, 26 incoming sound, 27, 28, 29 delay units, 30 speaker, 31
face detection rectangle, 32 network, 50 camera microphone unit, 60 video conferencing system
[0025]
Japanese Patent Application Laid-Open No. 60-116294