Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2013016929
Abstract: The present invention provides an imaging device, an imaging method, and a program capable of collecting sound emitted by a person even when the person is not included in the imaging range. SOLUTION: When the conference terminal 1, which images the participants 53 to 55, is panned so that the imaging direction becomes A3 and no participant falls within the imaging range B1 of the camera, the participants 53 to 55 cannot be detected in the image P3. In this case, the directivity direction of the array microphone is set to C3, and the sound collection range is set to D3, the area obtained by excluding the area specified by the imaging direction A3 and the imaging range B1 from the full 360° around the conference terminal 1. As a result, the area where the participants 53 to 55 are present is reliably targeted for sound collection, sound collection from the area known to contain no participants is avoided, and the sounds emitted by the participants 53 to 55 can be collected reliably and clearly. [Selected figure] Figure 6
Imaging device, imaging method and program
[0001]
The present invention relates to an imaging apparatus, an imaging method, and a program in
which an imaging unit and a sound collecting unit are integrally configured.
[0002]
There is known an imaging device in which a camera that captures images and a microphone that collects audio (hereinafter simply "microphone") are integrated in a single housing. For example, a conference terminal used in a remote conference captures images and collects audio at its own site with such an imaging device, and sends and receives the image and audio data to and from a terminal device at another site over a network.
[0003]
Among such imaging devices, one is known that uses a unidirectional microphone for sound collection in order to pick up the voice of a speaker in a conference reliably and clearly (see, for example, Patent Document 1). The imaging device (camera-equipped microphone) described in Patent Document 1 is configured so that the angle of view of the camera is substantially equal to the unidirectional range of the microphone. When no face image can be recognized in the image captured by the camera, sound capture by the microphone is suppressed, so that unnecessary audio is not captured while no speaker is shown.
[0004]
Further, an imaging device is known that uses an array microphone for sound collection (see, for example, Patent Document 2). An array microphone is a plurality of omnidirectional microphones arranged in an array, and electrical control can give it directivity in an arbitrary direction. The imaging device (microphone-equipped video camera) described in Patent Document 2 links the directivity characteristics of such an array microphone with the pan angle and zoom angle of the camera. As a result, when the camera is pointed toward the speaker, the array microphone is directed toward the speaker as well, and when the camera zooms in on the speaker, the directivity is sharpened toward the speaker, so that the voice of the speaker shown by the camera can be picked up effectively.
[0005]
JP 2009-49734 A; JP 10-155107 A
[0006]
However, in the inventions described in Patent Documents 1 and 2, the speaker is captured by the camera, and the pointing direction of the microphone is determined based on the image and the direction of the camera. Therefore, when an object other than the speaker or another participant is shown by the camera, for example, when the speaker gives an explanation while the camera shows a whiteboard, the device of Patent Document 1 has the problem that sound capture by the microphone is blocked. Further, in Patent Document 2, the pointing direction of the array microphone, which is linked to the camera, is directed at the whiteboard to which the camera is pointed, so there is the problem that the voice of the speaker cannot be captured clearly.
[0007]
The present invention has been made to solve the above problems, and its object is to provide an imaging apparatus, an imaging method, and a program capable of collecting the voice emitted by a person even when the person is not included in the imaging range.
[0008]
According to a first aspect of the present invention, there is provided an imaging device comprising: an imaging unit configured to capture an image; a plurality of sound collecting units configured integrally with the imaging unit and configured to collect sound; a control unit that controls, based on the imaging range of the image captured by the imaging unit, the pointing direction and the sound collection range in which the plurality of sound collecting units collect sound; and a first determination unit that determines, based on the image captured by the imaging unit, whether a person is included in the imaging range. When the first determination unit determines that no person is included in the imaging range, the control unit controls the pointing direction and the sound collection range of the sound collecting units such that, of the area in which the sound collecting units can collect sound, at least a part of the area outside the imaging range of the imaging unit becomes the target area for sound collection.
[0009]
According to the first aspect, when no person is included in the imaging range of the imaging means, at least a part of the area outside the imaging range can be set as the target area for sound collection. The area where the person is present can thus be included in the sound collection range, and the voice emitted by the person can be reliably collected. In addition, since the area within the imaging range, which contains no person, is removed from the target area for sound collection, sound is not collected from that area even if a noise source is located there, so the person's voice can be collected more clearly.
[0010]
The imaging device according to the first aspect may further include second determination means for determining whether the imaging range of the imaging means has changed. In this case, when the second determination means determines that the imaging range has changed, the control means may control the pointing direction and the sound collection range of the sound collection means based on the content of the change of the imaging range.
[0011]
When the imaging range changes, the person is expected to be within the imaging range before the change. Therefore, if the control means controls the pointing direction and the sound collection range of the sound collection means based on the content of the change of the imaging range, the area where the person is present can reliably be included in the target area for sound collection, and the voice emitted by the person can be collected reliably and more clearly.
[0012]
In the first aspect, when the second determination means determines that the imaging range has changed, that the change is due to a change in the imaging direction of the imaging means, and that the imaging direction after the change cannot be identified from the imaging direction before the change, the control means may control the pointing direction and the sound collection range of the sound collection means such that, of the area in which the sound collection means can collect sound, the entire area outside the imaging range of the imaging means becomes the target area for sound collection.
[0013]
When the imaging range changes, the person is expected to be within the imaging range before the change; however, if the imaging direction after the change cannot be identified from that before the change, the pointing direction and the sound collection range before the change cannot be determined from the imaging direction after the change. In that case, the control means controls the pointing direction and the sound collection range of the sound collection means so that sound is collected from the entire area outside the imaging range of the imaging means within the area in which sound can be collected. The area where the person is present is thereby reliably included in the target area for sound collection, sound is not collected from the area known to contain no person, and the voice emitted by the person can be collected reliably and more clearly.
[0014]
In the first aspect, when the second determination unit determines that the imaging range has changed, that the change is due to a change in the imaging direction of the imaging unit, and that the imaging direction after the change can be identified from the imaging direction before the change, the control unit may control the pointing direction and the sound collection range of the sound collection means so that sound is collected from the target area for sound collection of the sound collection unit as it was before the change of the imaging range.
[0015]
When the imaging range changes and the imaging direction after the change can be identified from that before the change, the pointing direction and the sound collection range before the change can be determined from the imaging direction after the change. Therefore, by controlling the pointing direction and the sound collection range of the sound collection means so that sound is collected from the target area as it was before the change of the imaging range, the area where the person is present is reliably kept as the sound collection target, sound collection from the area containing no person is avoided, and the voice emitted by the person can be collected reliably and more clearly.
[0016]
In the first aspect, when the second determination means determines that the imaging range has changed and that the change is due to a change in the angle of view of the imaging means, the control means may control the pointing direction and the sound collection range of the sound collection means such that the area obtained by excluding the area overlapping the imaging range after the change of the angle of view from the target area for sound collection before the change becomes the target area for sound collection by the sound collection means.
[0017]
When the imaging range changes due to a change in the angle of view, the person is expected to be in the area obtained by excluding the imaging range after the change from the imaging range before the change. Therefore, by controlling the pointing direction and the sound collection range of the sound collection means so that sound is collected from the area obtained by excluding the area overlapping the post-change imaging range from the pre-change target area, the area where the person is present is reliably kept as the sound collection target, sound collection from the area containing no person is avoided, and the voice emitted by the person can be collected reliably and more clearly.
[0018]
In the first aspect, the first determination unit may recognize, from the image captured by the imaging unit, a portion having the features of a human face, and determine that a person is included in the imaging range when the size of the recognized portion is larger than a predetermined size.
[0019]
If a portion having the features of a human face included in the captured image is not detected as a person when its size is equal to or less than the predetermined size, then even if a person who is not an intended subject of the imaging device happens to enter the imaging range, that person does not become a condition for controlling the sound collection means. As a result, the control means can be prevented from setting a wrong pointing direction and sound collection range, and the sound emitted by the person whose voice is to be collected can be reliably picked up.
[0020]
According to a second aspect of the present invention, there is provided an imaging method to be executed by a computer of an imaging device in which an imaging unit for capturing an image and a plurality of sound collecting units for collecting sound are integrated. The method comprises: a control step of controlling, based on the imaging range of the image captured by the imaging means, the pointing direction and the sound collection range in which the plurality of sound collecting means collect sound; and a first determination step of determining, based on the image captured by the imaging means, whether a person is included in the imaging range. When it is determined in the first determination step that no person is included in the imaging range, the control step controls the pointing direction and the sound collection range of the sound collection means such that, of the area in which the sound collection means can collect sound, at least a part of the area outside the imaging range of the imaging means becomes the target area for sound collection.
[0021]
According to a third aspect of the present invention, there is provided a program for causing a computer of an imaging device, in which an imaging unit configured to capture an image and a plurality of sound collecting units configured to collect sound are integrated, to execute: a control step of controlling, based on the imaging range of the image captured by the imaging means, the pointing direction and the sound collection range in which the plurality of sound collecting means collect sound; and a first determination step of determining, based on the image captured by the imaging means, whether a person is included in the imaging range. When it is determined in the first determination step that no person is included in the imaging range, the control step controls the pointing direction and the sound collection range of the sound collection means such that, of the area in which the sound collection means can collect sound, at least a part of the area outside the imaging range of the imaging means becomes the target area for sound collection.
[0022]
The same effects as those of the first aspect can be obtained by having the computer of the imaging device execute the processing of the imaging method according to the second aspect, or by causing a computer to function as the imaging device by executing the program according to the third aspect.
[0023]
FIG. 1 is a perspective view of the conference terminal 1 and the PC 9. FIG. 2 is a block diagram showing the electrical configuration of the conference terminal 1. FIG. 3 is a flowchart of a program executed by the conference terminal 1. FIG. 4 is a diagram showing the pointing direction C1 and the sound collection range D1 set according to the imaging direction A1 of the conference terminal 1, the imaging range B1, and the like. FIG. 5 is a diagram showing the pointing direction C1 and the sound collection range D1 set according to the imaging direction A2 of the conference terminal 1, the imaging range B1, and the like. FIG. 6 is a diagram showing the pointing direction C3 and the sound collection range D3 set according to the imaging direction A3 of the conference terminal 1, the imaging range B1, and the like. FIG. 7 is a diagram showing the pointing direction C1 and the sound collection range D4 set according to the imaging direction A1 of the conference terminal 1, the imaging range B4, and the like. FIG. 8 is a diagram showing the pointing direction C1 and the sound collection range D1 set according to the imaging direction A1 of the conference terminal 1, the imaging range B5, and the like. FIG. 9 is a diagram showing the pointing direction C6 and the sound collection range D6 set according to the imaging direction A6 of the conference terminal 1, the imaging range B1, and the like. FIG. 10 is a diagram showing the sound collection range D7 set according to the imaging direction A7 of the conference terminal 1, the imaging range B1, and the like.
[0024]
Hereinafter, a conference terminal 1, which is an embodiment of the imaging device according to the present invention, will be described with reference to the drawings. The drawings are used to explain technical features that the present invention can adopt, and the described configuration of the apparatus, the flowcharts of the various processes, and the like are merely illustrative examples.
[0025]
First, the schematic configuration of the conference terminal 1 will be described with reference to FIG. 1. The conference terminal 1 shown in FIG. 1 includes an array microphone 25, a speaker 27, a camera 28, and an operation unit 29. The conference terminal 1 is an apparatus capable of capturing an image with the camera 28, collecting sound with the array microphone 25, and emitting sound from the speaker 27. The conference terminal 1 includes a rotary shaft 3 at the upper end of the housing 4, and a part of the housing 4 rotates about the rotary shaft 3 so that the lower end side opens and closes. By opening the lower end side of the housing 4, the user can set the housing 4 in a free-standing posture, that is, the posture for use (see FIG. 1). By closing the lower end side of the housing 4, the housing 4 can be set in a folded posture, that is, the posture when not in use (not shown).
[0026]
The conference terminal 1 collects (inputs) sound at the site where it is installed with the array microphone 25, and captures (inputs) an image with the camera 28. The array microphone 25 consists of two or more omnidirectional microphones arranged side by side. As described in detail later, the pointing direction and the sound collection range of the array microphone 25 can be set by electrical control. In the present embodiment, three microphones are used in the array microphone 25.
[0027]
As the camera 28, for example, a fixed-focus digital camera equipped with an image sensor such as a CMOS or CCD is used. The conference terminal 1 according to the present embodiment is used, for example, placed on a tabletop, and operations such as pan and tilt for adjusting the imaging direction of the camera 28 are performed by manually moving the housing 4 of the conference terminal 1. Zooming in the conference terminal 1 is performed electrically by so-called digital zoom. More specifically, since the camera 28 of the present embodiment is a fixed-focus digital camera, its angle of view is fixed, and zooming is realized by trimming and enlarging the captured image; this pseudo zoom is used here. Hereinafter, the range that the camera 28 can capture (the range within the captured image) is referred to as the imaging range. Because the imaging range is expressed as an angular range relative to the direction in which the camera 28 faces (the imaging direction), it is treated as synonymous with the angle of view of an optical zoom (the angular range that can be imaged, which changes as the zoom lens moves and the focal distance changes). Accordingly, the enlargement and reduction of the imaging range performed by the digital zoom are, for convenience, expressed as changes in the angle of view.
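For illustration, the relationship between the digital (crop) zoom magnification and the effective angle of view discussed here can be sketched as follows; this is a minimal example assuming an ideal pinhole model, and the function name and the 70° figure are illustrative assumptions, not values from the patent.

```python
import math

def effective_view_angle(full_view_deg: float, zoom: float) -> float:
    """Effective angle of view after an m-times digital (crop) zoom.

    Under a pinhole model, cropping the central 1/m of the sensor
    width scales the tangent of the half-angle by 1/m:
        tan(theta_eff / 2) = tan(theta_full / 2) / m
    """
    half = math.radians(full_view_deg) / 2.0
    return math.degrees(2.0 * math.atan(math.tan(half) / zoom))

# Example: a fixed 70-degree angle of view with 2x digital zoom
# gives roughly a 38.6-degree effective angle of view.
print(effective_view_angle(70.0, 2.0))
```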
[0028]
The operation unit 29 of the conference terminal 1 is provided with a power button, a volume control button, a microphone mute button, and the like. The conference terminal 1 is also equipped with a USB interface 21 (see FIG. 2) and can be electrically connected to an external device. In the present embodiment, the conference terminal 1 is connected to, for example, a personal computer (hereinafter "PC") 9. The PC 9 is a general-purpose computer that performs various types of information processing such as data communication and image display.
[0029]
Although the PC 9 shown in FIG. 1 is a notebook PC that includes a display device 6, an operation unit 7, and the like, a desktop PC that lacks an integrated display device or operation unit may of course be used. The PC 9 and the conference terminal 1 are electrically connected by a USB cable 2. The connection between the PC 9 and the conference terminal 1 is not limited to the USB cable 2; various connection methods can be used, such as wireless communication such as WiFi (registered trademark), optical communication such as infrared, or IEEE 1394.
[0030]
Data of the sound collected by the array microphone 25 and data of the image captured by the camera 28 are transmitted to the PC 9 through the USB cable 2. The conference terminal 1 also emits sound from the speaker 27 based on the sound data received from the PC 9.
[0031]
By using the PC 9 and the conference terminal 1, the user can hold a remote conference (video conference) using images. Specifically, the PC 9 transmits the audio and image data input from the conference terminal 1 to a communication device such as a PC located at another site via a network 8 (see FIG. 2) such as the Internet. At the same time, the PC 9 receives audio and image data of the other site from that communication device. The PC 9 causes the display device 6 to display the image of the other site based on the received image data, and causes the speaker 27 of the connected conference terminal 1 to emit the audio of the other site based on the received audio data. As a result, the audio and images of a plurality of sites are shared, and the conference proceeds smoothly even though the users are not all at the same site.
[0032]
The configurations of the PC 9 and the conference terminal 1 can be changed as appropriate. For example, the audio received from another site may be emitted by a speaker built into the PC 9 without using the speaker 27 of the conference terminal 1. Alternatively, a small camera may be connected to a PC provided with an array microphone, a speaker, and a display device, and that PC may be used as a conference terminal for video conferencing; of course, the PC may have a built-in camera. The conference terminal 1 may further include a function of transmitting audio and image data, with the PC 9 used only as a device for emitting the audio and displaying the images received from the conference terminal 1 and from other sites. The conference terminal 1 also does not necessarily have to be used for video conferencing; it suffices if it functions as a device that captures images and collects sound, with the PC 9 displaying images and emitting sound based on the image and audio data received from the conference terminal 1.
[0033]
Next, the electrical configuration of the conference terminal 1 will be described with reference to FIG. 2. The conference terminal 1 includes a CPU 11 that controls the conference terminal 1. A ROM 12, a RAM 13, a flash memory 14, and an input/output interface (I/F) 16 are connected to the CPU 11 via a bus 15.
[0034]
The ROM 12 stores a program for operating the conference terminal 1, initial values, and the like. The RAM 13 temporarily stores various information. The flash memory 14 is a non-volatile storage device. A USB interface (I/F) 21, an audio input processing unit 22, a directivity control unit 26, an audio output processing unit 23, a video input processing unit 24, and the operation unit 29 are connected to the input/output interface 16. The USB interface 21 connects the conference terminal 1 to the PC 9. The audio input processing unit 22 processes the audio signals from the array microphone 25, input through the directivity control unit 26, to generate audio data. The directivity control unit 26 performs processing to control the pointing direction and the sound collection range of the array microphone 25. The audio output processing unit 23 drives the speaker 27. The video input processing unit 24 processes the image signal from the camera 28 to generate image data.
[0035]
Here, the operating principle by which the directivity control unit 26 controls the pointing direction and the sound collection range of the sound collected by the array microphone 25 will be briefly described. Sound arriving at the individual microphones arranged in an array reaches them with a difference in arrival time that depends on the direction from which it comes. For example, when sound arrives from the direction orthogonal to the direction in which the microphones are arranged (for convenience, the "front direction"), it reaches all the microphones simultaneously. In this case, when the audio signals output from the individual microphones of the array microphone 25 are electrically added in the audio input processing unit 22, an output amplified by a factor corresponding to the number of microphones is obtained. On the other hand, when sound arrives from a direction oblique to the direction in which the microphones are arranged (for convenience, an "oblique direction", which also includes the sides), it reaches the microphones closer to the sound source earlier, causing a time lag (phase shift) among the sounds acquired by the individual microphones. Therefore, the gain obtained when the audio signals from the array microphone 25 are electrically added in the audio input processing unit 22 depends on the arrival angle of the sound at each microphone and on the arrangement interval (or arrangement position) of the microphones, and is smaller than when the sound arrives from the front. Since the arrangement intervals of the individual microphones are known in advance, if the directivity control unit 26 acquires the time differences of the sound captured by each microphone and the CPU 11 analyzes them, the direction of the sound source can be determined.
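As a rough illustration of this last step, the arrival-time difference between two microphones can be converted into a source direction. The sketch below assumes a simple two-microphone far-field model; the names and the example values are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at roughly room temperature

def source_direction_deg(delay_s: float, mic_spacing_m: float) -> float:
    """Arrival angle from the front direction (degrees) under a
    two-microphone far-field model: sin(theta) = c * dt / d."""
    s = SPEED_OF_SOUND * delay_s / mic_spacing_m
    s = max(-1.0, min(1.0, s))  # clamp against measurement noise
    return math.degrees(math.asin(s))

# Example: a 0.1 ms delay across microphones 10 cm apart
# corresponds to a source about 20 degrees off the front.
print(source_direction_deg(0.1e-3, 0.10))
```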
[0036]
Further, the directivity control unit 26 can delay the sounds collected by the individual microphones of the array microphone 25 before outputting them to the audio input processing unit 22. By controlling the delay time applied to the output of each microphone, the gain can be maximized when sounds arriving from a predetermined oblique direction are added. In other words, by electrically controlling and delaying the outputs of the individual microphones in the directivity control unit 26, the array microphone 25 can be given directivity in a desired direction.
[0037]
The gain of the output of the array microphone 25, given directivity by this delay control, is maximal for sound arriving from the steered direction and decreases for sound arriving from directions that deviate from it. If the delay control uniformly shifts the sound collected by each microphone according to the arrangement interval of the microphones, the array microphone 25 can be controlled to have narrow directivity, narrowing the sound collection range (the angular range, centered on the pointing direction, in which sound can be collected). If, instead, non-uniform delay times obtained in advance by calculation or the like are applied to the microphones, the array microphone 25 can be controlled to have wide directivity, widening the sound collection range. Furthermore, if the microphones are divided into several sets and the delay control differs for each set, the array microphone 25 can be given a plurality of pointing directions. In the present embodiment, based on this operating principle, the directivity control unit 26 delays the sound collected by each microphone according to calculations by the CPU 11, thereby controlling the pointing direction and the sound collection range of the array microphone 25. As described above, in the present embodiment the angular range in which the array microphone 25 can collect sound is treated as the sound collection range.
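A minimal delay-and-sum sketch of this operating principle is shown below, assuming a uniform linear array and integer-sample delays; the helper names, spacing, and sample rate are illustrative assumptions rather than details from the patent.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s
SAMPLE_RATE = 16_000    # Hz, assumed

def steering_delays(n_mics: int, spacing_m: float, angle_deg: float):
    """Per-microphone delays (seconds) that align a plane wave
    arriving from angle_deg off the front of a uniform linear array."""
    theta = np.radians(angle_deg)
    positions = np.arange(n_mics) * spacing_m
    return positions * np.sin(theta) / SPEED_OF_SOUND

def delay_and_sum(signals: np.ndarray, delays_s: np.ndarray) -> np.ndarray:
    """Delay each channel by the nearest whole sample and average.
    signals has shape (n_mics, n_samples)."""
    out = np.zeros(signals.shape[1])
    for channel, delay in zip(signals, delays_s):
        shift = int(round(delay * SAMPLE_RATE))
        out += np.roll(channel, shift)  # crude integer-sample delay
    return out / len(signals)

# Steer a 3-microphone array (5 cm spacing) 30 degrees off the front.
delays = steering_delays(3, 0.05, 30.0)
mics = np.random.randn(3, SAMPLE_RATE)  # stand-in for captured audio
beam = delay_and_sum(mics, delays)
```

Widening the range or producing several pointing directions, as described above, would correspond to applying non-uniform delays or splitting the microphones into differently delayed sets.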
[0038]
Further, in the conference terminal 1 of the present embodiment, the pointing direction and the sound collection range of the array microphone 25 are controlled according to the analysis results of the image captured by the camera 28 so that the voice emitted by a person shown in the image can be reliably picked up. Specifically, two image analyses are performed: one that determines whether a person's face is included in the image captured by the camera 28, and one that determines whether the direction has been changed by horizontal rotation (panning) of the camera 28. The image analysis for detecting a person's face is carried out by known methods, for example, extracting parts having facial features such as the eyes, nose, and mouth from the image and comparing their relative positions, sizes, and so on with a template, or analyzing them geometrically.
[0039]
In the present embodiment, even if the relative positions of the parts having facial features match the template, the region is not detected as a person's face if its size is less than a predetermined size. In other words, even if image analysis finds a part of the image having the features of a human face, it is not determined to be a person's face if it is smaller than the predetermined size. Thus, for example, even if a person far away from the conference terminal 1 falls within the imaging range of the camera 28 and appears in the image, that person is excluded from the objects detected as persons.
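The patent does not name a specific detector; as one hedged illustration, the size-based filtering of this paragraph could sit on top of a conventional cascade face detector as below (OpenCV and the minimum-size value are assumptions made for the example).

```python
import cv2

# Hypothetical threshold: face-like regions smaller than this are
# treated as distant bystanders and ignored, per this paragraph.
MIN_FACE_PX = (60, 60)

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def count_faces(frame) -> int:
    """Count face-like regions at least MIN_FACE_PX in size."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(
        gray, scaleFactor=1.1, minNeighbors=5, minSize=MIN_FACE_PX)
    return len(faces)
```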
[0040]
The image analysis for detecting the orientation of the camera 28 is performed by a known method, for example, detecting whether the position of a feature that appears in both the latest image and the previously captured image has shifted within the image. As described above, the conference terminal 1 is used, for example, placed on a desktop, and panning and tilting of the camera 28 are performed by the conference participants manually moving the conference terminal 1. In other words, the conference terminal 1 has no drive mechanism for panning and tilting, and panning and tilting are not performed under control of operations on the PC 9. For this reason, the conference terminal 1 cannot detect the imaging direction of the camera 28 through a pan/tilt control mechanism, and instead detects a change in the direction of the camera 28 based on analysis of the images captured by the camera 28. The feature is, for example, an object whose closed contour can be detected. When the position of the feature has shifted, the conference terminal 1 obtains the magnitude of the shift within the image (for example, the number of dots in the horizontal direction) and determines how far the camera 28 has rotated and in which direction. Note that these image analysis methods are merely examples, and various known image analysis methods can be applied.
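Under the simplifying assumption that horizontal pixels map linearly onto the angle of view, the pan angle can be recovered from the feature's pixel shift roughly as follows (a sketch; the linear mapping and the function name are my assumptions).

```python
def pan_angle_deg(shift_px: int, image_width_px: int,
                  view_angle_deg: float) -> float:
    """Approximate pan angle from a feature's horizontal shift,
    assuming pixels map linearly to angle across the imaging range
    (exact only for small angles)."""
    return (shift_px / image_width_px) * view_angle_deg

# Example: a feature shifted 320 px in a 1280 px wide image with a
# 70-degree angle of view implies a pan of about 17.5 degrees.
print(pan_angle_deg(320, 1280, 70.0))
```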
[0041]
Next, the specific flow of the process by which the pointing direction and the sound collection range of the array microphone 25 are controlled in the conference terminal 1 will be described according to the flowchart of FIG. 3, with reference to FIGS. 4 to 10. The program for executing the process shown in FIG. 3 is stored in the ROM 12 and executed by the CPU 11.
[0042]
The conference terminal 1 is installed, for example, in a conference room, in the in-use posture and facing the conference participants, and is connected to the PC 9. When the power button of the operation unit 29 is turned on by the user (who may be one of the participants), communication with the PC 9 is started and the conference terminal 1 enters a standby state (S11: NO). When the user operates the PC 9 and an instruction signal to start imaging is received from the PC 9 (S11: YES), the CPU 11 starts imaging with the camera 28 and sound collection with the array microphone 25. The CPU 11 also starts emitting (outputting) the audio of the other site from the speaker 27 based on the audio data the PC 9 receives from the communication device located at the other site.
[0043]
In the present embodiment, as shown in FIG. 4, it is assumed that the state of the conference is imaged by the conference terminal 1 installed on the near side of a table 52 placed at the center of a conference room 50. In the conference room 50, three participants 53, 54, and 55 are seated around the table 52, on which a document 51 is placed; a whiteboard 56 stands at the front right of the table 52, and a flower 57 is placed at the back right as decoration.
[0044]
At the start of imaging, the camera 28 is not zoomed, and the image captured by the camera 28 shows the objects included in the maximum angular range that the camera 28 can capture. The front direction of the conference terminal 1 faces the center of the table 52; in the following description this direction is, for convenience, called the imaging direction A1. The signal of the camera 28 capturing the state of the conference room 50 is input to the video input processing unit 24, and data of an image P1 is generated. The image P1 shows the persons (participants 53, 54, 55) and objects (document 51, table 52, flower 57) included in the imaging range B1 (indicated by a thick solid line) that the camera 28 can capture.
[0045]
At the start of imaging by the conference terminal 1, the pointing direction C1 of the array microphone 25 is set to the front direction of the camera 28, that is, for convenience, the same direction as the imaging direction A1 (the front direction of the conference terminal 1). Further, in order to match the sound collection range D1 of the array microphone 25 to the initial angle of view of the camera 28, the CPU 11 performs a calculation based on the imaging direction A1 and the imaging range B1, using an arithmetic expression or table prepared in advance according to the operating principle described above. The directivity control unit 26 sets the delay time of each microphone of the array microphone 25 according to the result of the calculation performed by the CPU 11. The audio signals of the conference room 50 collected by the array microphone 25, with its pointing direction C1 and sound collection range D1 thus controlled, are input to the audio input processing unit 22 and added together to generate audio data. The image data generated by the video input processing unit 24 and the audio data generated by the audio input processing unit 22 are transmitted in streaming format to the PC 9 via the USB cable 2.
[0046]
Next, as shown in FIG. 3, the CPU 11 analyzes the image P1, detects the faces of the persons (that is, the participants 53 to 55) shown in the image P1, and counts the number of detected participants (S12). From the image P1, the faces (parts having human facial features) of the three participants 53 to 55 are recognized. The CPU 11 temporarily stores the number of conference participants, 3, in the RAM 13 (or the flash memory 14) (S12).
[0047]
Imaging by the camera 28 and sound collection by the array microphone 25 are performed continuously, and the generated image data and audio data are streamed to the PC 9. During this time, even if the conference terminal 1 is rotated horizontally (panned), the video input processing unit 24 generates image data of images captured in whatever direction the camera 28 faces. When the CPU 11 receives a zoom instruction signal from the PC 9 as a result of the user's operation on the PC 9, the video input processing unit 24 trims and enlarges the image according to the instructed zoom magnification to generate the image data. In this case, the angle of view corresponding to the zoom magnification is calculated using a predetermined formula or table and temporarily stored in the RAM 13 (or the flash memory 14) as the current imaging range.
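As a hedged sketch of this trimming-and-enlargement step (the patent gives no implementation), a center crop followed by scaling back to the original size could look like this; OpenCV and the function name are assumptions.

```python
import cv2

def digital_zoom(frame, zoom: float):
    """Crop the central 1/zoom portion of the frame and scale it
    back up, emulating the pseudo zoom described in [0027]."""
    h, w = frame.shape[:2]
    crop_h, crop_w = int(h / zoom), int(w / zoom)
    y0, x0 = (h - crop_h) // 2, (w - crop_w) // 2
    crop = frame[y0:y0 + crop_h, x0:x0 + crop_w]
    return cv2.resize(crop, (w, h), interpolation=cv2.INTER_LINEAR)
```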
[0048]
Imaging by the camera 28 and sound collection by the array microphone 25 continue for a predetermined time (S13: NO), and when the predetermined time has passed (S13: YES), the processing of S15 to S30 is performed. In the processing of S15 to S30, the pointing direction and the sound collection range of the array microphone 25 are controlled. Each time the processing of S15 to S30 is performed, the latest image from the camera 28 is stored in the RAM 13 and used for image analysis by the CPU 11. Note that each time the processing of S15 to S30 is performed, two images, the latest image and the previously captured image, are kept in the RAM 13, and the images stored before that are overwritten and erased.
[0049]
First, the CPU 11 analyzes the newly captured latest image stored in the RAM 13 and determines whether the face of a person shown in the image has been detected (S15). When neither panning nor zooming has been performed on the conference terminal 1 and the latest image is, for example, substantially the same as the previously captured image P1 of FIG. 4, the CPU 11 recognizes the faces of the three participants 53, 54, and 55, that is, persons are detected (S15: YES). If the number of detected faces is 3 and is not smaller than the number of conference participants stored in S12 (S22: NO), the CPU 11 concludes that all participants are within the imaging range of the camera 28 and performs processing to match the sound collection range of the array microphone 25 to the current angle of view of the camera 28 (S23). That is, as shown in FIG. 4, the CPU 11 sets the pointing direction of the array microphone 25 to C1, the same as the imaging direction A1. Then, as above, it performs the calculation based on the imaging direction A1 and the imaging range B1 so that the sound collection range becomes D1, and instructs the directivity control unit 26 to set the delay time of each microphone of the array microphone 25.
[0050]
As described above, if all of the participants 53 to 55 appear in the latest image P1, it can be determined that the sounds emitted by all of the participants 53 to 55 can be collected by collecting sound from the imaging range B1. Therefore, the CPU 11 sets the imaging direction A1 as the pointing direction C1, and calculates and sets the sound collection range D1 to the same size as the imaging range B1. The voices emitted by the participants 53 to 55 can thereby be reliably collected. Thereafter, the process returns to S13.
[0051]
Next, suppose the conference terminal 1 is panned to show the whiteboard 56. For example, as shown in FIG. 5, when the imaging direction of the camera 28 is turned to A2, the participants 53 to 55 may no longer fall within the imaging range B1. In this case, the participants 53 to 55 do not appear in the latest image P2, and the CPU 11 cannot detect a person's face even when it analyzes the image P2 in S15 (S15: NO). If no zoom instruction signal has been received from the PC 9 (S16: NO), the rotation angle is estimated (the imaging direction is detected) by comparing the latest image P2 stored in the RAM 13 with the previously captured image P1 stored in the RAM 13 (S17).
[0052]
As described above, in the image analysis for detecting the imaging direction (the direction of the camera 28), the CPU 11 detects the feature (for example, the flower 57) shown in the previous image P1 within the latest image P2 and determines whether its position has shifted, using a known method. As shown in FIG. 5, the flower 57 is located near the left edge of the image P2 but near the right edge of the previous image P1; the positional shift indicated by the arrow E1 therefore occurs, and it is detected that the conference terminal 1 has been panned. Furthermore, since the imaging range B1 (angle of view) is known, the magnitude of the pan applied to the conference terminal 1, that is, its rotation angle, is calculated from the magnitude of the positional shift of the flower 57 in the images P1 and P2 relative to the width of those images.
[0053]
If the rotation angle of the conference terminal 1 can be estimated (calculated) (S17: YES), the CPU 11 rotates back from the current imaging direction A2 by the calculated rotation angle and sets the resulting direction as the pointing direction. In the example shown in FIG. 5, the image analysis described above shows that the imaging direction has turned from A1 (see FIG. 4) to A2 (see FIG. 5), so the pointing direction is set to C1, the direction before rotation. The CPU 11 also sets the sound collection range of the array microphone 25 to D1, the sound collection range before rotation (S20).
[0054]
As described above, if the participants 53 to 55 who appeared in the previous image P1 do not appear in the latest image P2, it can be determined that only the conference terminal 1 has been panned. Therefore, when the rotation angle of the pan is known from the image analysis, the CPU 11 matches the pointing direction and the sound collection range of the array microphone 25 to the pointing direction and the sound collection range before the rotation. The voices emitted by the participants 53 to 55 can thereby be reliably collected; that is, even though the participants 53 to 55 do not appear in the image P2 because the whiteboard 56 is being shown, the sounds they emit can be reliably collected. Thereafter, the process returns to S13.
[0055]
Furthermore, when the conference terminal 1 is panned to show the whiteboard 56, the imaging direction of the camera 28 may, for example, as shown in FIG. 6, be turned to A3, where neither the participants 53 to 55 nor the feature (the flower 57) is included in the imaging range B1. In this case, the CPU 11 cannot detect a person's face in the latest image P3 (S15: NO). If no zoom instruction signal has been received (S16: NO), the rotation angle is estimated (the imaging direction is detected) as described above (S17). Since the CPU 11 cannot detect the feature (the flower 57) shown in the previous image P1 (see FIG. 4) within the latest image P3 by the image analysis described above, it determines that the rotation angle cannot be estimated (S17: NO).
[0056]
In this case, the CPU 11 sets the direction C3, opposite to the current imaging direction A3, as the pointing direction, and sets the sound collection range of the array microphone 25 to D3, the range remaining after excluding the imaging range B1, which is the current angle of view of the camera 28, from the entire 360° range (S18). In other words, the directivity of the array microphone 25 is set outside the angle of view of the camera 28.
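In angular terms, this step amounts to pointing the array away from the camera and opening the sound collection range to the complement of the current angle of view; a tiny sketch under those assumptions (names and example values are mine):

```python
def complement_of_view(imaging_dir_deg: float, view_angle_deg: float):
    """Pointing direction and sound collection range for S18:
    opposite the imaging direction, covering everything outside the
    camera's current angle of view."""
    pointing = (imaging_dir_deg + 180.0) % 360.0
    collection_range = 360.0 - view_angle_deg
    return pointing, collection_range

# Example: imaging direction A3 at 90 degrees with a 70-degree angle
# of view -> point at 270 degrees with a 290-degree range (D3).
print(complement_of_view(90.0, 70.0))
```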
[0057]
As described above, if the participants 53 to 55 who appeared in the previous image P1 do not appear in the latest image P3, it can be determined, as before, that only the conference terminal 1 has been panned. When the rotation angle of the pan cannot be determined from the image analysis, the CPU 11 sets the pointing direction and the sound collection range of the array microphone 25 outside the current angle of view of the camera 28. As a result, no sound is collected from the range known not to contain the participants 53 to 55, and sound is collected from all other ranges. That is, even though the participants 53 to 55 do not appear in the image P3 because the whiteboard 56 is being shown, the sounds they emit can be reliably collected. Thereafter, the process returns to S13.
[0058]
Next, for example, as shown in FIG. 7, when the CPU 11 receives a zoom instruction signal from the PC 9 and performs trimming and enlargement of the captured image P1, the participants 53 to 55 may no longer be included in the imaging range B4 narrowed by the zoom. In this case, the CPU 11 cannot detect a person's face in the latest image P4 (S15: NO). Since the zoom instruction signal has been received (S16: YES), the process proceeds to S21, where the pointing direction of the array microphone 25 is set to the pointing direction C1 used before zooming, and the sound collection range of the array microphone 25 is set to D4, the range obtained by excluding the imaging range B4 (the angle of view narrowed by the zoom) from the imaging range B1 (the angle of view of the camera 28 before zooming) (S21).
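Treating the ranges as angular intervals about the shared pointing direction C1, the range D4 of step S21 is the pre-zoom interval minus the post-zoom interval, which leaves two flanking sectors; a sketch under that interpretation (the function name and values are hypothetical):

```python
def zoom_excluded_sectors(pre_view_deg: float, post_view_deg: float):
    """Angular sectors, as offsets from the pointing direction C1,
    left when the zoomed imaging range B4 is cut out of the pre-zoom
    range B1. Returns the two flanking (start, end) offsets in degrees."""
    pre, post = pre_view_deg / 2.0, post_view_deg / 2.0
    return [(-pre, -post), (post, pre)]

# Example: a 70-degree range before zoom and 35 degrees after zooming
# leaves two 17.5-degree flanks as the sound collection range D4.
print(zoom_excluded_sectors(70.0, 35.0))
```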
[0059]
As described above, if the participants 53 to 55 who appeared in the previous image P1 do not appear in the latest image P4, and the CPU 11 has received a zoom signal at that time, it can be determined that the participants 53 to 55 are not visible in the image P4 because the angle of view was narrowed by the zoom. Therefore, the CPU 11 sets the sound collection range of the array microphone 25 to D4, the range obtained by excluding the current imaging range B4, which is known to contain no participants, from the pre-zoom imaging range B1. As a result, even though the participants 53 to 55 are not shown in the image P4 because of the zoom, the sounds they emit can be reliably collected. Thereafter, the process returns to S13.
[0060]
Next, the processing in the case where the conference terminal 1 is panned or zoomed and the number of participants in the captured image decreases will be described. Even when the CPU 11 detects persons' faces in the image at S15 (S15: YES), if their number is smaller than the number of conference participants stored at S12 (S22: YES), the following processing is performed.
[0061]
For example, as shown in FIG. 8, when the CPU 11, having received a zoom instruction signal, trims and enlarges the image P1, only some of the participants, 53 and 54, may be included in the imaging range B5 narrowed by the zoom. In this case, the CPU 11 detects persons' faces in the latest image P5 (S15: YES), their number is smaller than the number of participants (S22: YES), and a zoom instruction signal has been received (S25: YES), so the process proceeds to S30. The CPU 11 sets the pointing direction of the array microphone 25 to the pointing direction C1 used before zooming, and sets the sound collection range to D1, the same as the imaging range B1, the angle of view of the camera 28 before zooming (S30).
[0062]
As described above, if only some of the participants, 53 and 54, appear in the latest image P5 and the CPU 11 has received a zoom signal at that time, it can be determined that the participant 55 no longer appears in the image P5 because the angle of view was narrowed by the zoom. Therefore, the CPU 11 sets the sound collection range of the array microphone 25 to D1, the same as the pre-zoom imaging range B1. As a result, the voices emitted by the participants 53 and 54, who appear in the zoomed image P5, and by the participant 55, who does not, can all be reliably collected. Thereafter, the process returns to S13.
[0063]
The conference terminal 1 may also be panned so that only some of the participants are included in the imaging range. For example, as shown in FIG. 9, when the imaging direction of the camera 28 is turned to A6, where the participant 54 and the feature (the flower 57) are included in the imaging range B1, the CPU 11 detects the face of a person (the participant 54) in the image P6 by image analysis (S15: YES). Since the faces of the other participants 53 and 55, who are not shown in the image P6, cannot be detected, the number of detected persons is smaller than the number of participants stored in S12 (S22: YES).
[0064]
If no zoom instruction signal has been received (S25: NO), the CPU 11 estimates the rotation angle (detects the imaging direction) (S26). The feature (the flower 57), shown near the right edge of the previous image P1 (see FIG. 4), appears slightly left of center in the image P6, producing the positional shift indicated by the arrow E2; from this, it is detected that the conference terminal 1 has been panned. Furthermore, based on the imaging range B1, the magnitude of the pan applied to the conference terminal 1, that is, its rotation angle, is calculated from the magnitude of the positional shift of the flower 57 between the images P1 and P6.
[0065]
When the rotation angle of the conference terminal 1 can be estimated (calculated) (S26: YES), the CPU 11 sets the pointing direction of the array microphone 25 to C6, a direction intermediate between the imaging direction A6 and the pre-rotation pointing direction C1. It then sets the sound collection range of the array microphone 25 to D6, obtained by adding together the angle-of-view range of the current imaging range B1 about the imaging direction A6 and the sound collection range D1 about the previous pointing direction C1 (S28).
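Interpreted as angular intervals, step S28 takes the midpoint of the two directions and the union of the two ranges; a small sketch under that interpretation (the names and the assumption that the two sectors form one contiguous sector are mine):

```python
def merged_pointing_and_range(dir_a_deg: float, range_a_deg: float,
                              dir_b_deg: float, range_b_deg: float):
    """Midpoint direction (C6) and combined range (D6) for S28,
    assuming the two sectors meet or overlap so their union is a
    single contiguous sector."""
    lo = min(dir_a_deg - range_a_deg / 2, dir_b_deg - range_b_deg / 2)
    hi = max(dir_a_deg + range_a_deg / 2, dir_b_deg + range_b_deg / 2)
    pointing = (dir_a_deg + dir_b_deg) / 2.0
    return pointing, hi - lo

# Example: imaging direction A6 at 40 degrees (70-degree view) and
# prior direction C1 at 0 degrees (70-degree range D1) -> point at
# 20 degrees and collect over about 110 degrees.
print(merged_pointing_and_range(40.0, 70.0, 0.0, 70.0))
```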
[0066]
As described above, when some participant 54 appears in the latest image P6 after panning, it can be determined that the conference terminal 1 was panned in order to show the participant 54. Therefore, when the rotation angle of the pan is known from the image analysis, the CPU 11 sets the pointing direction of the array microphone 25 to a direction intermediate between the pre-rotation pointing direction and the post-rotation imaging direction of the camera 28, and sets the sound collection range of the array microphone 25 to the range obtained by adding the imaging range about the post-rotation imaging direction to the pre-rotation sound collection range. As a result, the voices emitted by the participant 54, on whom the pan focuses, and by the participants 53 and 55, who are not shown in the image P6, can all be reliably collected. Thereafter, the process returns to S13.
[0067]
When one participant 54 gives an explanation using the whiteboard 56, he or she may move from the original position, and the conference terminal 1 may be panned to keep the participant 54 in view. For example, as shown in FIG. 10, the imaging direction of the camera 28 is turned to A7, where the participant 54 is included in the imaging range B1 but the feature (the flower 57) is not. As before, the CPU 11 detects the face of a person (the participant 54) in the image P7 by image analysis (S15: YES), but the number of detected persons is smaller than the number of participants (S22: YES).
[0068]
Further, if the feature (the flower 57) shown in the previous image P1 (see FIG. 4) cannot be detected in the latest image P7, the CPU 11 determines that the rotation angle cannot be estimated (S26: NO). In this case, the CPU 11 sets the pointing direction of the array microphone 25 to all 360° directions (omnidirectional), and sets the sound collection range of the array microphone 25 to D7, the entire 360° range.
[0069]
As described above, when the rotation angle of the pan cannot be determined from the image analysis, the CPU 11 sets the sound collection range to D7 and collects sound from the entire 360° range, thereby covering not only the voice of the participant 54, who appears in the image P7, but also the voices of the participants 53 and 55, who do not. That is, the voices emitted by the participant 54, on whom the pan focuses, and by the participants 53 and 55, who are not shown in the image P7, can all be reliably collected. Thereafter, the process returns to S13.
[0070]
As described above, in the conference terminal 1 according to the present embodiment, when no person is included in the imaging range of the conference terminal 1, at least a part of the area outside the imaging range can be set as the target area for sound collection, so the area where a person is present can be included in the sound collection range and the sound emitted by the person can be reliably collected. In addition, since the area within the imaging range, which contains no person, is removed from the target area for sound collection, sound is not collected from that area even if a noise source is located there, so the person's voice can be collected more clearly.
[0071]
When the imaging range changes, the person is expected to be within the imaging range before the change. Therefore, by controlling the pointing direction and the sound collection range of the array microphone 25 based on the content of the change of the imaging range, the area where the person is present can reliably be included in the target area for sound collection, and the voice emitted by the person can be collected reliably and more clearly.
[0072]
When the imaging range changes, the person is expected to be within the imaging range before the change; however, if the imaging direction after the change cannot be identified from that before the change, the pointing direction and the sound collection range before the change cannot be determined from the imaging direction after the change. In that case, by controlling the pointing direction and the sound collection range of the array microphone 25 so that sound is collected from the entire area outside the imaging range of the conference terminal 1 within the area in which sound can be collected, the area where the person is present is reliably included in the target area for sound collection and no sound is collected from the area known to contain no person, so the voice emitted by the person can be collected reliably and more clearly.
[0073]
When the imaging range changes and the imaging direction after the change can be identified from that before the change, the pointing direction and the sound collection range before the change can be determined from the imaging direction after the change. Therefore, by controlling the pointing direction and the sound collection range of the array microphone 25 so that sound is collected from the target area for sound collection as it was before the change of the imaging range, the area where the person is present is reliably kept as the sound collection target, sound collection from the area containing no person is avoided, and the voice emitted by the person can be collected reliably and more clearly.
[0074]
When the imaging range changes due to a change in the angle of view, the person is expected to be in the area obtained by excluding the imaging range after the change from the imaging range before the change. Therefore, by controlling the pointing direction and the sound collection range of the array microphone 25 so that sound is collected from the area obtained by excluding the area overlapping the post-change imaging range from the pre-change target area for sound collection, the area where the person is present is reliably kept as the sound collection target, sound collection from the area containing no person is avoided, and the voice emitted by the person can be collected reliably and more clearly.
[0075]
If a portion having the features of a human face included in the captured image is not detected as a person when its size is equal to or less than the predetermined size, then even if a person who is not an intended subject of the imaging device happens to enter the imaging range, that person does not become a condition for controlling the array microphone 25. As a result, an erroneous pointing direction and sound collection range can be prevented from being set, and the sound emitted by the person whose voice is to be collected can be reliably picked up.
[0076]
The present invention is not limited to the above embodiment, and various modifications are possible. Although a fixed-focus digital camera is used as the camera 28 and zooming is performed by the pseudo digital zoom realized by trimming and enlarging the captured image, the camera 28 may instead be provided with a zoom lens that mechanically changes the focal distance to realize an optical zoom.
[0077]
Although three microphones are provided in the array microphone 25 as an example, the number may be two or more, preferably three or more; the larger the number, the more accurately the sound collection range can be set. Moreover, although omnidirectional microphones are used for the microphones constituting the array microphone 25 in this embodiment, directional microphones may be used, or the array microphone 25 may combine omnidirectional and directional microphones.
[0078]
Although the rotation angle of a pan of the conference terminal 1 is calculated by detecting the positions of features in the images before and after rotation by image analysis, the conference terminal 1 may instead be provided with an acceleration sensor to keep track of its orientation. Alternatively, markers may be placed at several points in the conference room 50 as features, so that the orientation of the conference terminal 1 can be determined by image analysis from the markers appearing in the image. Considering the cost of providing an acceleration sensor and the effort of preparing markers, the method of determining the orientation of the conference terminal 1 by image analysis, as in the present embodiment, is preferable, since the processing can be done in software alone.
[0079]
The conference terminal 1 may be installed in any orientation. For example, the conference terminal 1 may be tilted 90 degrees and attached to a wall or the like, in which case the pan of this embodiment corresponds to a tilt operation. In this case, the pointing direction and the sound collection range of the array microphone 25 can be controlled by detecting the vertical movement of the feature in the image by image analysis and determining the rotation angle.
[0080]
In the present embodiment, the conference terminal 1 corresponds to the "imaging device" of the present invention. The camera 28 corresponds to the "imaging means". The array microphone 25 corresponds to the "sound collection means". The CPU 11, which performs the calculations that determine the pointing direction and the sound collection range of the array microphone 25 according to the various conditions, and the directivity control unit 26, which controls the delay time of each microphone of the array microphone 25 based on the calculation results of the CPU 11 and thereby controls the pointing direction and the sound collection range of the array microphone 25, correspond to the "control means". The CPU 11, in detecting a person at S15 and determining whether a person is included in the image, corresponds to the "first determination means". The CPU 11, in determining at S15 that no person can be detected, or at S22 that the number of detected persons has decreased, corresponds to the "second determination means".
[0081]
1: conference terminal, 11: CPU, 13: RAM, 25: array microphone, 26: directivity control unit, 28: camera