Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2016513410
Generally, techniques are described for capturing multi-channel audio data. A device comprising one or more processors may be configured to implement the techniques. The processors may analyze the captured audio data to identify audio objects, and analyze video data captured concurrently with the capture of the audio data to identify video objects. The processors may then associate at least one of the audio objects with at least one of the video objects, and generate multi-channel audio data from the audio data based on the association of the at least one of the audio objects with the at least one of the video objects. [Selected figure] Figure 1B
Video Analysis Assisted Generation of Multi-Channel Audio Data

Related Applications
[0001] This application claims the benefit of US Provisional Application No. 61 / 765,556, filed
Feb. 15, 2013.
[0002] The present disclosure relates to capturing audio data, and more particularly to capturing
multi-channel audio data.
[0003]
Typically, video capture devices, such as video camcorders, tablet or slate computers, mobile phones (including so-called "smartphones"), personal gaming devices, personal media devices, and the like, are equipped with a camera that captures a series of images at a given frame rate to generate video data. Often, these video capture devices include a microphone to capture mono audio data of the scene depicted in the video data.
Higher performance video capture devices may include more than one microphone to increase
the number of audio channels that can be captured (from a single channel in mono audio data).
These more sophisticated video recording devices may include at least two microphones to
capture stereo audio data (referring to audio data having left and right channels).
[0004]
Given the increasing adoption of so-called smartphones, smartphones are increasingly becoming
a dominant way of capturing video data. Often, due to the characteristics of smartphones and
their use as audio communication devices, smartphones may include two, three, four or five
microphones. Additional microphones may be used by the smartphone for the purpose of noise
cancellation during calls, video conferencing, or other forms of communication, including voice
communication. Although smartphones are equipped with a number of microphones, these microphones are often located at positions on the smartphone that limit their ability to properly capture anything other than stereo audio data. As a result, these microphones are generally not used to capture multi-channel audio data other than stereo audio data.
[0005]
In general, the present disclosure describes techniques by which video capture devices may use video analysis to assist in capturing multi-channel audio data. Video capture devices may facilitate the generation of surround sound audio data (often having five or more channels) using video scene analysis (or computer vision) techniques. In some examples, a video capture device may capture both audio data and video data, processing the audio data to identify audio objects while processing the video data to identify video objects. The video capture device may perform video scene analysis techniques to identify these video objects and generate various metadata about these objects. The video capture device may also perform auditory scene analysis in an attempt to identify audio objects and the various metadata pertaining to these objects. By comparing these objects, the video capture device may identify those video objects that are likely to be the sources of the audio objects.
[0006]
Given that video analysis techniques can identify the position of a video object relative to the video capture device more precisely than analysis of the audio object alone, the video capture device may localize the audio object better than when it relies solely on relatively inaccurate beamforming techniques. These audio objects may then be rendered for one or more channels, using decibel differences to better position the audio objects with respect to one or more forward channels, thereby enabling better generation of surround sound audio data compared to that generated by a conventional video capture device.
[0007]
In one aspect, a method comprises analyzing audio data captured by a device to identify one or more audio objects, and analyzing video data captured by the device concurrently with the capture of the audio data to identify one or more video objects. The method further comprises associating at least one of the one or more audio objects with at least one of the one or more video objects, and generating multi-channel audio data from the audio data based on the association of the at least one of the one or more audio objects with the at least one of the one or more video objects.
[0008] In another aspect, a device comprises one or more processors configured to obtain an audio object, obtain a video object, associate the audio object with the video object, compare the audio object to the associated video object, and render the audio object based on the comparison of the audio object with the associated video object.
[0009]
In another aspect, a device for generating an audio output signal comprises means for identifying a first audio object associated with a counterpart first video object based on a first comparison of a data component of the first audio object and a data component of the first video object, and means for identifying a second audio object not associated with a counterpart second video object based on a second comparison of a data component of the second audio object and a data component of the second video object. The device further comprises means for rendering the first audio object in a first zone, means for rendering the second audio object in a second zone, and means for generating an audio output signal based on combining the first audio object rendered in the first zone and the second audio object rendered in the second zone.
[0010] In another aspect, a non-transitory computer-readable storage medium has stored thereon instructions that, when executed, cause one or more processors of a device to analyze audio data captured by the device to identify one or more audio objects, analyze video data captured by the device concurrently with the capture of the audio data to identify one or more video objects, associate at least one of the one or more audio objects with at least one of the one or more video objects, and generate multi-channel audio data from the audio data based on the association of the at least one of the one or more audio objects with the at least one of the one or more video objects.
[0011]
The details of one or more aspects of the technology are set forth in the accompanying drawings
and the description below.
Other features, objects, and advantages of the technology will be apparent from the description, the drawings, and the claims.
[0012]
FIG. 1 illustrates various views of an example video capture device 10 in accordance with the
techniques described in this disclosure. FIG. 7 is a block diagram illustrating in more detail a
video capture device performing the techniques described in this disclosure. FIG. 5 illustrates the
operations performed by the video capture device of FIG. 1 in associating a video object with an
audio object in accordance with the techniques described in this disclosure. FIG. 5 illustrates the
operations performed by the video capture device of FIG. 1 in associating a video object with an
audio object in accordance with the techniques described in this disclosure. FIG. 5 illustrates the
operations performed by the video capture device of FIG. 1 in associating a video object with an
audio object in accordance with the techniques described in this disclosure. FIG. 5 illustrates the
operations performed by the video capture device of FIG. 1 in associating a video object with an
audio object in accordance with the techniques described in this disclosure. FIG. 2 is a block
diagram illustrating the assisted audio rendering unit of FIG. 1B in more detail. FIG. 2 is a
diagram illustrating a scene captured by a camera of the video capture device shown in the
example of FIG. 1B and processed according to the techniques described in this disclosure. FIG. 7
is a diagram illustrating another scene captured by a camera of the video capture device shown
in the example of FIG. 1B and processed in accordance with an augmented reality aspect of the
technology described in this disclosure. 5 is a flow chart illustrating an example operation of a
video capture device in performing the techniques described in this disclosure. FIG. 7 illustrates
how various audio objects may be rendered with foreground and background of multi-channel
audio data in accordance with the techniques described in this disclosure.
[0013]
[0020]
FIG. 1A is a diagram illustrating various views 8A-8C (respectively front, plan, and side) of an
exemplary video capture device 10 that performs the techniques described in this disclosure.
Video capture device 10 may represent any type of device capable of capturing video and audio data, such as a video camcorder, a tablet or slate computer, a mobile phone (including a so-called "smartphone"), a personal gaming device, a personal media device, or the like. For purposes of illustration, video capture device 10 is assumed to represent a smartphone. Although this disclosure is described in the context of a particular type of device, i.e., a smartphone, the techniques may be implemented by any type of device capable of capturing video data and multi-channel audio data.
[0014]
[0021]
In the example of FIG. 1A, video capture device 10 is shown from three different views 8A-8C.
View 8A shows video capture device 10 from the front. View 8B shows video capture device 10
from the back. View 8C shows video capture device 10 from the side.
[0015]
[0022]
As shown in view 8A, video capture device 10 includes an earpiece 9, loudspeakers 11A, 11B,
and microphones 16A, 16B, and 16E. The earpiece 9 represents a small speaker used for the
reproduction of sound or audio data when listening to the audio with the device 10 close to the
user's ear. The speakers 11A and 11B each represent a speaker used for the playback of sound or audio data when the device 10 is listened to from farther away from the user (e.g., when playing music, watching video, or when the device is used as a speakerphone). The speaker 11A may be referred to as the left speaker 11A (or "speaker L") because the speaker 11A can reproduce the left channel of multi-channel audio data. The speaker 11B may be referred to as the right speaker 11B (or "speaker R") because the speaker 11B can reproduce the right channel of multi-channel audio data. Microphones 16A, 16B, and 16E are described in more detail below.
[0016]
[0023]
As shown in view 8B, in one example, video capture device 10 also includes a camera 14 and microphones 16C and 16D. Camera 14 may represent any type of device capable of capturing an image. The camera 14 may capture a series of images at a given rate (commonly referred to as a "frame rate") to form video data. The camera 14 may include lenses and other components that facilitate the capture of light so as to generate an image. The camera 14 may also interface with a flash or other light-generating element (not shown in the example of FIG. 1A), and in some cases the camera 14 may be integrated with the flash. In the assumed context of a smartphone, the camera 14 typically comprises a digital camera that includes a light-sensitive sensor (such as a complementary metal-oxide-semiconductor (CMOS) image sensor or a charge-coupled device (CCD) image sensor) to sense the intensity and chromaticity of light entering the lens, in contrast to the celluloid media for sensing light that is common with film cameras. The camera 14 may capture light and generate a series of images, shown as video data 18 in the example of FIG. 1B below.
[0017]
[0024]
The microphones 16A-16E ("microphones 16") may each represent any type of device capable of
capturing audio data. The microphone 16 may generally refer to any type of acoustic to electrical
converter or sensor that can convert sound into an electrical signal. There are several different
types of microphones, which differ in the way they capture sound. To provide some examples, the microphones 16 may comprise dynamic microphones (referring to microphones that capture sound using electromagnetic induction), condenser microphones (referring to microphones that capture sound using a change in capacitance), and piezoelectric microphones. Although shown as being integrated within video capture device 10, one or more of the microphones 16 may be external to video capture device 10 and coupled to the video capture device 10 through either a wired or a wireless connection. Each of the microphones 16 may capture separate audio data 20A-20E, as shown in more detail in connection with the example of FIG. 1B.
[0018]
[0025]
Typically, video capture devices, such as video camcorders, tablet or slate computers, mobile phones (including so-called "smartphones"), personal gaming devices, personal media devices, and the like, are equipped with a camera that captures a series of images at a given frame rate to generate video data. Often, these video capture devices include microphones to capture mono audio data of the scene depicted in the video data. Higher performance video capture devices may include more than one microphone to increase the number of channels that can be captured (from the single channel of mono audio data). These more sophisticated video recording devices may include at least two microphones to capture stereo audio data (referring to audio data having left and right channels).
[0019]
[0026]
Three or more microphones, such as the five microphones shown in FIG. 1A as microphones 16, allow the video capture device to perform what is called "beamforming" to distinguish between front and rear and between left and right, which may facilitate the capture of surround sound audio having so-called "channels" of audio data, such as a center channel, a front left channel, a front right channel, a rear left channel, and a rear right channel. After capturing the microphone signals (which may also be called "audio data"), the smartphone may algorithmically create spatial beams (which may refer to the process by which sound arriving from a particular direction is amplified relative to other spatial directions). By filtering the captured sound separately with these beams, the smartphone may generate the different output surround sound channels. In some examples, the smartphone may generate a beam such that the difference between the beam region and the corresponding null beam region corresponds to a volume level difference of 6 dB. As one example, a smartphone may generate 5.1 surround sound audio data based on these beams.
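For readers who want a concrete picture of the beamforming step described above, the following is a minimal delay-and-sum beamforming sketch in Python/NumPy. It is not taken from the patent; the microphone geometry, sample rate, azimuth convention, channel directions, and function names (e.g., steer_beam) are illustrative assumptions only.

```python
# Minimal delay-and-sum beamforming sketch (illustrative, not the patent's method):
# time-align and sum the microphone signals so that sound arriving from a chosen
# direction adds constructively. Circular shifting via np.roll is acceptable for
# a sketch but not for production audio.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def steer_beam(mic_signals, mic_positions, azimuth_deg, sample_rate):
    """mic_signals: (num_mics, num_samples); mic_positions: (num_mics, 2) in meters."""
    azimuth = np.deg2rad(azimuth_deg)
    direction = np.array([np.cos(azimuth), np.sin(azimuth)])  # unit vector toward the source
    # Relative arrival lead of each microphone, converted to samples.
    delays = mic_positions @ direction / SPEED_OF_SOUND * sample_rate
    delays -= delays.min()
    # Delay the earlier-arriving microphones so all signals line up, then average.
    aligned = [np.roll(sig, int(round(d))) for sig, d in zip(mic_signals, delays)]
    return np.mean(aligned, axis=0)

if __name__ == "__main__":
    rate = 48_000
    # Assumed 5-microphone planar layout (meters), loosely mimicking a smartphone.
    mics = np.array([[0.0, 0.0], [0.05, 0.0], [0.0, 0.05], [0.05, 0.05], [0.025, 0.10]])
    signals = np.random.randn(5, rate)  # stand-in for captured audio data 20A-20E
    # One beam per notional surround channel direction (front = +y in this convention).
    channels = {name: steer_beam(signals, mics, az, rate)
                for name, az in [("C", 90), ("FL", 135), ("FR", 45), ("RL", 225), ("RR", 315)]}
```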
[0020]
[0027]
While smartphones can capture surround sound audio using beamforming technology, which may capture more realistic audio compared to video capture devices having only one or two microphones, the microphone placement in some smartphones, as shown in views 8A-8C of the example of FIG. 1A, often does not allow for the highest quality surround sound audio. Typically, the decibel difference with respect to the corners is less pronounced. That is, the 6 dB difference obtained when combining beams does not create a large enough difference, so the sound assigned to the identified beam does not feel well localized when it is reproduced. When generating surround sound audio data, for example, when audio should be localized to the front right channel, the smartphone may place what should be localized audio in both the center channel and the front right channel.
[0021]
[0028]
In addition, given the proximity between some of the front and back microphones, for example between microphones 16B and 16C, the smartphone may not be able to sufficiently distinguish between front and back audio. This inability to sufficiently distinguish between forward and backward audio may result in the smartphone generating surround sound or multi-channel audio data that does not present sufficient distinction between the audio in the forward and backward channels. In other words, the rear sound may be played by the front speakers (often mixed with the front sound) and the front sound may be played by the rear speakers (often mixed with the rear sound), so that the front and back channels can sound mixed together.
[0022]
[0029]
Video capture device 10 may implement the techniques described in this disclosure to facilitate
the generation of surround sound or multi-channel audio data that better replicates the audio
data that is heard when capturing video data. In order to generate this multi-channel audio data
in accordance with the techniques described in this disclosure, video capture device 10 may use
video analysis to assist in capturing multi-channel audio data. Video capture device 10 may
facilitate the generation of multi-channel audio data (often with five or more channels) using
video scene analysis (or computer vision) techniques. In some examples, video capture device 10 may capture both audio data and video data, and may process the audio data to identify audio objects while processing the video data to identify video objects. Video capture device 10 may perform
video scene analysis techniques to identify these video objects and the various metadata
associated with these objects. Video capture device 10 may also perform auditory scene analysis
in an attempt to identify audio objects and the various metadata pertaining to those objects. By
comparing these objects, the video capture device may identify those video objects that may be
sources of audio data.
[0023]
[0030]
Given that video analysis techniques can identify the position of a video object relative to video capture device 10 more precisely than analysis of the audio objects alone, video capture device 10 may localize the audio objects better than when it relies solely on relatively inaccurate beamforming techniques. These audio objects may then be rendered for one or more channels, using decibel differences to better position the audio objects relative to one of the forward channels, thereby enabling better generation of surround sound or other types of multi-channel audio data as compared to that generated by a conventional video capture device. The
techniques performed by video capture device 10 are described in more detail in connection with
FIG. 1B below.
[0024]
[0031]
FIG. 1B is a block diagram illustrating in more detail a video capture device 10 performing the
techniques described in this disclosure. In the example of FIG. 1B, video capture device 10
includes control unit 12, camera 14, and microphones ("mic") 16A-16E ("microphone 16" or "mic
16"). Although not shown in the example of FIG. 1B for purposes of illustration ease, the video
capture device 10 may generally include additional modules, elements, and / or / or perform
03-05-2019
9
various other functions associated with the video capture device 10. Or, as with the unit, it may
also include an earpiece 9, speakers 11A and 11B.
[0025]
[0032] In any event, control unit 12 may represent one or more central processing units ("CPUs," not shown in FIG. 1), graphics processing units ("GPUs," again not shown in FIG. 1), or other processing units that execute software instructions, such as those used to define a computer program, stored in a non-transitory computer-readable storage medium (again, not shown in FIG. 1), such as a storage device (e.g., a disk drive or an optical drive) or memory (such as flash memory, random access memory, or RAM), or any other type of volatile or non-volatile memory that stores instructions for causing the one or more processing units to perform the techniques described herein.
[0026]
[0033]
Alternatively, or additionally, control unit 12 may represent dedicated hardware, such as one or more integrated circuits, one or more application specific integrated circuits (ASICs), one or more application specific special processors (ASSPs), one or more field programmable gate arrays (FPGAs), or any combination of one or more of the above examples of dedicated hardware, for performing the techniques described herein.
Regardless of whether it is implemented as a CPU and/or GPU executing software, as dedicated hardware, or as some combination thereof, control unit 12 may be referred to as a "processor" in some contexts.
[0027]
[0034]
As described above, the camera 14 may represent any type of device capable of capturing an image, while the microphones 16 may each represent any type of device capable of capturing audio data. The camera 14 may capture light and generate a series of images, shown as video data 18 in the example of FIG. 1B. Each of the microphones 16 may capture separate audio data 20A-20E.
[0028]
[0035]
As further shown in the example of FIG. 1, the control unit 12 includes a visual analysis unit 22, an auditory analysis unit 24, an object association unit 26, rendering units 28A-28C ("rendering units 28"), and an audio mixing unit 30. Visual analysis unit 22 may represent hardware or a combination of hardware and software that performs visual scene analysis of video data, such as video data 18. Visual scene analysis may involve aspects of computer vision, which refers to a process by which a computer or other device processes and analyzes an image to detect and identify various objects, elements, and/or aspects of the image. Because computer vision and machine vision have many overlapping or related concepts, computer vision may, in some instances, be referred to as machine vision. Often machine vision uses aspects and concepts of computer vision, albeit in different contexts. While the present disclosure refers to computer vision when describing the techniques, the techniques may also be performed using machine vision in conjunction with, or as an alternative to, computer vision. For this reason, the terms "machine vision" and "computer vision" may be used interchangeably.
[0029]
[0036]
Although not shown in the example of FIG. 1, visual analysis unit 22 may, in some examples,
communicate with an image server or other database external to video capture device 10 when
performing visual scene analysis. The visual analysis unit 22 may communicate with the image
server to offload various aspects of the resource-intensive visual scene analysis process, often
meaning processing resources and / or memory resources. For example, visual analysis unit 22
may perform some initial analysis to detect objects and pass these objects to an image server for
identification. The image server may then classify or otherwise identify the object and return the
classified object to the visual analysis unit 22. Typically, visual analysis unit 22 communicates
with the image server via a wireless session. As such, video capture device 10 may include one or more interfaces (not shown in the example of FIG. 1) that allow video capture device 10 to communicate with peripheral devices, servers, and any other type of device or accessory via a wireless or wired connection. Visual analysis unit 22 may output video objects 32 as a result of performing the visual scene analysis.
[0030]
[0037]
Auditory analysis unit 24 may perform auditory scene analysis of audio data, such as audio data
20A-20N ("audio data 20"), to generate audio object 34. Auditory analysis unit 24 may analyze
audio data to detect and identify audio objects. Audio objects may refer to discrete or
recognizable sounds that may be classified or otherwise associated with a given object. For
example, the engine of a car can emit a sound that is easily recognizable. Auditory scene analysis
may attempt to detect, identify or classify these sounds in audio data.
[0031]
[0038]
Similar to the visual analysis unit 22, the auditory analysis unit 24 may, in some instances, communicate with an audio network server or other database remote from the video capture device 10 (not shown in FIG. 1) when performing auditory scene analysis. Auditory analysis unit 24 may communicate with the audio server to offload various aspects of the resource-intensive (meaning processing and/or memory resources) auditory scene analysis process. For example, auditory analysis unit 24 may perform
some initial analysis to detect objects and pass these objects to an audio server for identification.
The audio server may then classify or otherwise identify the object and return the classified
object to the auditory analysis unit 24. Auditory analysis unit 24 may communicate with this
audio server using the interface described above in describing visual analysis unit 22. Auditory
analysis unit 24 may output audio object 34 as a result of performing auditory scene analysis.
[0032]
[0039]
Object association unit 26 represents hardware or a combination of hardware and software that attempts to associate video objects 32 with audio objects 34. Video objects 32 and audio objects 34 may each be defined according to a compatible or common format, in the sense that video objects 32 and audio objects 34 are both defined in a manner that facilitates association between objects 32 and objects 34. Each of the objects 32 and 34 may include metadata defining, to provide some examples, one or more of a predicted position (e.g., x, y, z coordinates) of the corresponding object, a size (or predicted size) of the corresponding object, a shape (or predicted shape) of the corresponding object, a velocity (or predicted velocity) of the corresponding object, a confidence level of the position, and whether the object is in focus or whether the object is in the near foreground, far foreground, near background, or far background. Object association unit 26 may associate one or more of video objects 32 with one or more of audio objects 34 based on the metadata (often, a single one of video objects 32 may be associated with a single one of audio objects 34).
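The shared object format described in this paragraph could be modeled, purely for illustration, as the following Python sketch. The field names and types are assumptions; the disclosure only lists the kinds of metadata (position, size, shape, velocity, confidence, foreground/background zone), not a concrete layout.

```python
# Illustrative sketch of the common metadata format shared by audio and video
# objects. Field names and types are assumptions, not the patent's data layout.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ObjectMetadata:
    label: str                                              # e.g. "car", "person", "stereo"
    position: Optional[Tuple[float, float, float]] = None   # predicted x, y, z
    size: Optional[float] = None                             # predicted size
    shape: Optional[str] = None                              # predicted shape descriptor
    velocity: Optional[Tuple[float, float, float]] = None    # predicted velocity
    position_confidence: float = 0.0                         # 0.0 (unknown) .. 1.0 (certain)
    zone: Optional[str] = None                               # "near foreground", "far background", ...

@dataclass
class AudioObject:
    samples_id: int            # handle to the captured audio segment this object refers to
    metadata: ObjectMetadata

@dataclass
class VideoObject:
    frame_id: int              # handle to the image(s) in which the object was detected
    metadata: ObjectMetadata
```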
[0033]
[0040]
Object association unit 26 may classify objects 32 and 34 into one of three classes. The first class
includes those of audio objects 34 having metadata associated with one of the video objects 32
having metadata. The second class includes audio objects 34 that are not associated with any of
the video objects 32. The third class includes video objects 32 that are not associated with any of
audio objects 34. Object association unit 26 may pass audio objects 34 classified into the first class (shown as audio objects 34') to assisted audio rendering unit 28A. Object association unit 26 may pass audio objects 34 classified into the second class (shown as audio objects 34'') to unassisted audio rendering unit 28B. Object association unit 26 may pass video objects 32 classified into the third class (shown as video objects 32') to augmented reality audio rendering unit 28C.
[0034]
[0041]
Although described in connection with three classes, the techniques may be implemented in
connection with only the first two classes. In other words, the third class may be formed adaptively based on the available resources. In some instances, the third class is not specifically
utilized for power limited or resource limited devices. In some instances, these power limited or
resource limited devices may not include the augmented reality audio rendering unit 28C as the
third class is not utilized. Furthermore, the object association unit 26 may not pass the video
object or may not otherwise classify it into the third class. Thus, this technique should not be
limited to the examples described in this disclosure, but may be performed on the first and
second classes rather than the third class.
[0035]
[0042]
In any event, the rendering units 28 represent hardware or a combination of hardware and software configured to render audio data 38A-38C from one or more of audio objects 34', audio objects 34'', and video objects 32', respectively. Assisted audio rendering unit 28A may be called the "assisted" audio rendering unit 28A in that the audio objects 34' it receives have metadata that is potentially extended by a matching or associated one of video objects 32. In this sense, the rendering unit 28A may receive assistance from the corresponding or associated ones of the video objects 32 in rendering the audio objects 34' more accurately. Considering that the assisted audio rendering unit 28A receives audio objects associated with video objects, and that these video objects are captured by the camera and thus are likely present in the foreground of the scene, the assisted audio rendering unit 28A may also be referred to as the foreground rendering unit 28A.
[0036]
[0043]
The unassisted audio rendering unit 28B may be called "unassisted" in the sense that the rendering unit 28B renders the audio objects 34'' classified into the second class, which are not associated with any of the video objects 32. Thus, rendering unit 28B does not receive any assistance from any of video objects 32 in rendering audio objects 34''. Because the audio objects processed by unit 28B are not associated with any video objects, these objects may be in the background of the scene or behind the user capturing the video data 18, and the unassisted audio rendering unit 28B may therefore also be referred to as the background rendering unit 28B.
[0037]
[0044]
The augmented reality audio rendering unit 28C may be said to "augment reality" in the sense that the rendering unit 28C may access an audio library (located either internal or external to the device 10), or other audio repositories, to obtain audio objects corresponding to the unmatched or unassociated video objects 32' and render audio data 38C that augments the audio data 20 captured by the microphones 16 (rendered as audio data 38A and 38B). The augmented reality audio rendering unit 28C may process video objects 32' detected in the scene captured by the camera 14 as video data 18 so as to render the foreground audio data provided by the unit 28C.
[0038]
[0045]
Each of rendering units 28 may render audio data 38A-38C in a spatialized manner. In other words, rendering units 28 may generate spatialized audio data 38A-38C, where each of audio objects 34', 34'', and 34''' (where audio objects 34''' refer to the augmented reality audio objects obtained by the augmented reality audio rendering unit 28C) is assigned to and rendered for specific channels assuming a particular speaker calibration for playback. The rendering units 28 may render audio objects 34', 34'', and 34''' using head-related transfer functions (HRTFs) and other algorithms commonly used in rendering spatialized audio data.
[0039]
[0046]
Audio mixing unit 30 represents hardware or a combination of hardware and software that mixes
audio data 38A-38C ("audio data 38") into a particular multi-channel audio data format.
References to multi-channel audio data in this disclosure may refer to stereo or higher order
multi-channel audio data. Higher order multi-channel audio data may include 5.1 surround sound
audio data or 7.1 surround sound audio data, where the number before the period refers to the number of channels and the number after the period refers to the number of bass or low-frequency channels. For example, 5.1 surround sound audio data includes a left channel, a center
channel, a right channel, a left rear or surround left channel, and a right rear or surround right
channel with a single low frequency channel. Mixing unit 30 may mix audio data 38 into one or
more of these multi-channel audio data formats to generate multi-channel audio data 40.
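As a rough illustration of the mixing step, the sketch below accumulates per-object rendered channel signals into a 5.1 channel bed. The channel ordering and function name (mix_to_5_1) are assumptions; the disclosure does not prescribe a particular layout.

```python
# Sketch of mixing per-object rendered signals into a 5.1 channel bed, as audio
# mixing unit 30 might do. Channel ordering is an illustrative assumption.
import numpy as np

SURROUND_5_1 = ["L", "R", "C", "LFE", "Ls", "Rs"]

def mix_to_5_1(rendered_objects, num_samples):
    """rendered_objects: iterable of dicts mapping a channel name to a 1-D signal."""
    bed = {name: np.zeros(num_samples) for name in SURROUND_5_1}
    for obj in rendered_objects:
        for name, signal in obj.items():
            if name not in bed:                      # ignore channels outside this layout
                continue
            bed[name][: len(signal)] += signal[:num_samples]  # accumulate this object's part
    return np.stack([bed[name] for name in SURROUND_5_1])      # shape: (6, num_samples)
```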
[0040]
[0047]
In operation, video capture device 10 may be configured to invoke camera 14 to capture video data 18 and, at the same time, to invoke one or more, often all, of microphones 16 to capture audio data 20A-20E ("audio data 20"). In response to receiving video data 18 and
audio data 20, control unit 12 of video capture device 10 may be configured to perform the
techniques described herein for generating multi-channel audio data 40.
[0041]
[0048]
Upon receiving the audio data 20, the control unit 12 may invoke the auditory analysis unit 24, which may analyze the audio data 20 to identify one or more audio objects 34. As briefly described above, auditory analysis unit 24 may perform auditory scene analysis to identify and generate audio objects 34. Similarly, upon receiving video data 18, control unit 12 may be configured to invoke visual analysis unit 22, which may analyze video data 18, concurrently with the analysis and/or capture of audio data 20, to identify one or more video objects 32. Also, as briefly described above, the visual analysis unit 22 may perform visual scene analysis (using computer vision algorithms) to identify and generate one or more video objects 32.
[0042]
[0049]
Visual analysis unit 22 and auditory analysis unit 24 may be configured to generate video objects 32 and audio objects 34, respectively, using a common or shared format. Often, this shared format includes text components that may be referred to as metadata. This metadata may describe various characteristics or aspects of the corresponding one of the video objects 32 and the audio objects 34. The video metadata describing the corresponding one of the video objects 32 may specify, to provide several non-limiting examples, one or more of the position, the shape, the velocity, and the position confidence level of the corresponding video object. The audio metadata describing the corresponding one of the audio objects 34 may likewise specify, to provide several non-limiting examples, one or more of the position of the corresponding audio object, the shape of the audio object, the velocity of the audio object, and the confidence level of the position.
[0043]
[0050]
Since both the audio and video metadata are abstracted to this same semantic level, i.e., the same textual semantic level in this example, the video capture device 10 may directly compare and map (in other words, associate) the respective tags specified by this metadata (where a tag may refer to each of the different types of metadata described above) in the text domain. Using the mapped objects, video capture device 10 may directly associate how the device "sees" an object with how the device "hears" that object in the scene.
[0044]
[0051]
Control unit 12 may receive video objects 32 and audio objects 34 and may invoke object association unit 26. Object association unit 26 may associate at least one of audio objects 34 with at least one of video objects 32. In making this association, the object association unit 26 may classify each of the audio objects 34 as a type of audio object, typically based on the metadata (which, in some instances, may define the type of the audio object). Similarly, in making this association, the object association unit 26 may classify each of the video objects 32 as a type of video object, typically based on the corresponding metadata (which, in some cases, may also define the type of the video object). Exemplary types of video objects may comprise cars, beaches, waves, running water, music, people, dogs, cats, and the like. Object association unit 26 may then determine that the type of one of audio objects 34 is the same as the type of one of video objects 32. In response to determining that the type of one of the audio objects 34 is the same as the type of one of the video objects 32, the object association unit 26 may associate that one of the audio objects 34 with that one of the video objects 32.
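As an illustration of the type-based association just described, the following sketch matches audio objects to video objects whose textual type tags are the same, leaving unmatched objects for the second and third classes. The dictionary keys and the confidence tie-break are assumptions, not the patent's algorithm.

```python
# Hedged sketch of type-based association in the text domain: an audio object is
# paired with a video object carrying the same type tag; leftovers fall into the
# second class (unmatched audio) and third class (unmatched video).
def associate_by_type(audio_objects, video_objects):
    """Each object is a dict with at least a "type" tag and an optional "confidence"."""
    associations, unclaimed = [], list(video_objects)
    for audio in audio_objects:
        candidates = [v for v in unclaimed if v["type"] == audio["type"]]
        if not candidates:
            associations.append((audio, None))        # no counterpart: second class
            continue
        best = max(candidates, key=lambda v: v.get("confidence", 0.0))
        unclaimed.remove(best)
        associations.append((audio, best))            # matched pair: first class
    return associations, unclaimed                    # leftover video objects: third class

# Example: the "car" sound matches the car the camera sees; "music" stays unmatched.
pairs, leftover = associate_by_type(
    [{"type": "car", "confidence": 0.7}, {"type": "music", "confidence": 0.4}],
    [{"type": "car", "confidence": 0.9}, {"type": "person", "confidence": 0.8}],
)
```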
[0045]
[0052]
Object association unit 26 may generate various audio objects based on the classification of
audio object 34 into one of three different classes described above. Again, the first class includes
those of the audio objects 34 that have metadata associated with one of the video objects 32 that
have metadata. The second class includes audio objects 34 that are not associated with any of the
video objects 32. The third class includes video objects 32 that are not associated with any of
audio objects 34.
[0046]
[0053]
Object association unit 26 may pass audio objects 34 classified into the first class (shown as audio objects 34') to assisted audio rendering unit 28A. Object association unit 26 may pass audio objects 34 classified into the second class (shown as audio objects 34'') to unassisted audio rendering unit 28B. Object association unit 26 may pass video objects 32 classified into the third class (shown as video objects 32') to augmented reality audio rendering unit 28C.
[0047]
[0054]
With respect to those of audio objects 34 determined to belong to the first class, object association unit 26 may be configured to determine a level of correlation between the audio metadata of one of audio objects 34 and the video metadata of the one of video objects 32 associated with it and, based on the determined level of correlation, generate composite metadata for that one of the audio objects 34 with which the one of video objects 32 is associated. In some examples, object association unit 26 may replace the audio metadata, or portions thereof (such as the position specified by the audio metadata), with the corresponding video metadata or portions thereof. The object association unit 26 may then pass this audio object 34 to the assisted audio rendering unit 28A as one of the audio objects 34'. The assisted audio rendering unit 28A may then render this one of the audio objects 34' in one or more foreground channels of the multi-channel audio data 40 based on the composite metadata generated for that one of the audio objects 34'. Assisted audio rendering unit 28A passes this portion of multi-channel audio data 40 to audio mixing unit 30 as audio data 38A.
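One way the correlation level and composite metadata described here might be computed is sketched below. The exponential distance-to-correlation mapping and the weighting rule favoring the video metadata are illustrative assumptions only, not the disclosure's definitions.

```python
# Hedged sketch of composite metadata for a first-class audio object: measure how
# well the audio-derived position agrees with the video-derived position, then
# blend the two, leaning on the (usually more accurate) video metadata.
import numpy as np

def correlation_level(audio_pos, video_pos, scale=1.0):
    """Return a value in (0, 1]; 1.0 means the two positions coincide."""
    distance = np.linalg.norm(np.asarray(audio_pos) - np.asarray(video_pos))
    return float(np.exp(-distance / scale))

def composite_position(audio_pos, video_pos, correlation, video_bias=0.8):
    # Low correlation -> lean even harder on the video metadata (the object would
    # also be rendered more diffusely elsewhere in the pipeline).
    w_video = video_bias + (1.0 - video_bias) * (1.0 - correlation)
    return tuple(w_video * v + (1.0 - w_video) * a for a, v in zip(audio_pos, video_pos))

corr = correlation_level((1.0, 0.2, 0.0), (1.3, 0.0, 0.0))
fused = composite_position((1.0, 0.2, 0.0), (1.3, 0.0, 0.0), corr)
```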
[0048]
[0055]
For those of the audio objects 34 determined to belong to the second class, the object association unit 26 may determine that one of the audio objects 34 is not associated with any of the video objects 32. Object association unit 26 may pass these audio objects 34 to unassisted audio rendering unit 28B as the audio objects 34''. Unassisted audio rendering unit 28B may generate multi-channel audio data 40 such that one of audio objects 34'' originates in one or more background channels of multi-channel audio data 40. That is, because these audio objects 34 are not associated with any of the video objects 32, the unassisted audio rendering unit 28B may be configured to assume that these audio objects 34'' originate outside the scene captured by the camera 14. As such, the unassisted audio rendering unit 28B may be configured to render the audio objects 34'' in the background, often as diffuse sound. Unassisted audio rendering unit 28B passes this portion of multi-channel audio data 40 to audio mixing unit 30 as audio data 38B.
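A minimal sketch of rendering a second-class audio object as diffuse background sound follows; it simply copies the signal to the surround channels with small, differing delays and reduced gain as a crude decorrelator. The delay and gain values are arbitrary illustrative choices, not values from the disclosure.

```python
# Sketch of diffuse background rendering for an audio object with no matching
# video object: duplicate it into the surround channels with unequal delays and
# a gain drop so it is perceived as spread out rather than localized.
import numpy as np

def render_diffuse_background(mono_signal, sample_rate, num_samples):
    out = {"Ls": np.zeros(num_samples), "Rs": np.zeros(num_samples)}
    for channel, delay_ms in (("Ls", 11.0), ("Rs", 17.0)):   # assumed, slightly different delays
        delay = int(sample_rate * delay_ms / 1000.0)
        if delay >= num_samples:
            continue                                          # nothing audible within this buffer
        end = min(num_samples, delay + len(mono_signal))
        out[channel][delay:end] += 0.5 * mono_signal[: end - delay]
    return out

sr = 48_000
background = render_diffuse_background(np.random.randn(sr), sr, sr)
```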
[0049]
[0056]
With respect to those video objects 32 determined to belong to the third class, i.e., those video objects 32 that are not associated with any of the audio objects 34 in the example of FIG. 1B, object association unit 26 may pass these video objects 32 to augmented reality audio rendering unit 28C as video objects 32'. The augmented reality audio rendering unit 28C, in response to receiving the video objects 32', may obtain, from an audio library, a reference audio object that would have been associated with each (if possible) of the video objects 32'. The augmented reality audio rendering unit 28C may then render each of the reference audio objects (which may also be referred to as audio objects 34''') to generate at least a portion of the multi-channel audio data 40. The augmented reality audio rendering unit 28C passes this portion of multi-channel audio data 40 to the audio mixing unit 30 as audio data 38C.
[0050]
[0057]
Audio mixing unit 30 receives audio data 38 and mixes the audio data 38 to form multi-channel
audio data 40. Audio mixing unit 30 may mix this audio data 38 in the manner described above
to generate any form of multi-channel audio data 40. These formats may include 5.1 surround
sound format, 7.1 surround sound format, 10.1 surround sound format, 22.2 surround sound
format, or any other proprietary or non proprietary format.
[0051]
[0058]
In this manner, control unit 12 of video capture device 10 may be configured to analyze audio data 20 to identify one or more audio objects and to analyze video data 18, captured by the device concurrently with the capture of the audio data, to identify one or more video objects. The control unit 12 may further be configured to associate one of the audio objects 34 with one of the video objects 32 and to generate multi-channel audio data 40 from the audio data 20 based on the association of the one of the audio objects 34 with the one of the video objects 32.
[0052]
[0059]
Given that video scene analysis can identify the position of a video object relative to the video capture device 10 more precisely than analysis of the audio object alone, the video capture device 10 may localize the audio objects better than when it relies solely on relatively inaccurate beamforming techniques. These audio objects may then be rendered for one or more channels, using decibel differences to better position the audio objects relative to one of the forward channels, thereby enabling better generation of surround sound or multi-channel audio data as compared to that generated by a conventional video capture device.
[0053]
[0060]
Additionally, the video capture device may, in some examples, render each such audio object as a separate audio source in the foreground (the 180 degrees in front of the listener). With respect to audio objects that the video capture device 10 "hears" but does not "see," the video capture device 10 may render these audio objects in the background, as these audio objects are likely to be behind the listener.
[0054]
[0061]
Although described above as being performed by the video capture device 10, the techniques may be implemented by a device different from the device that captured the video data 18 and the audio data 20. In other words, a smartphone or other video capture device may capture video data and audio data and upload this video data and audio data to a different device, such as a dedicated processing server, a desktop computer, a laptop computer, a tablet or slate computer, or any other type of device capable of processing the data. This other device may then perform the techniques described in this disclosure to facilitate the generation of what may be considered more accurate surround sound or multi-channel audio data. Thus, although described as being performed by the device that captured the video and audio data, the techniques may be performed by a device different from the device that captured the video and audio data, and the techniques should not be limited in this regard to the examples described in this disclosure.
[0055]
[0062]
FIGS. 2A-2D illustrate the operations performed by the video capture device 10 of FIG. 1 in associating a video object 32 with an audio object 34 in accordance with the techniques described in this disclosure. In the example of FIG. 2A, one of the audio objects 34 (denoted as "audio object 34A" in the example of FIG. 2A) and one of the video objects 32 (denoted as "video object 32A" in the example of FIG. 2A) include audio metadata 54A and video metadata 52A, respectively. The object association unit 26 of the video capture device 10 may associate audio object 34A with video object 32A, using audio metadata 54A and video metadata 52A, to generate an extended audio object 34A' (which is one of the audio objects 34' shown in the example of FIG. 1B) having extended metadata 56A. This extended metadata 56A may include both audio metadata 54A and video metadata 52A, where, in some examples, video metadata 52A may replace some or all of audio metadata 54A. In some examples, object association unit 26 may determine that audio metadata 54A and video metadata 52A have a high correlation.
[0056]
[0063]
In another example, object association unit 26 may determine that audio metadata 54A and video metadata 52A have a low correlation. In this example, object association unit 26 may weight video metadata 52A so as to favor video metadata 52A over audio metadata 54A when generating extended metadata 56A. When rendering and mixing this audio object 34A' to generate multi-channel audio data 40, the assisted audio rendering unit 28A may, due to the lack of correlation between the audio metadata 54A and the video metadata 52A, render the audio object 34A' more diffusely, spreading the audio object 34A' across more channels in the foreground. Video capture device 10 may apply various diffusion algorithms, such as sound decorrelation, to these objects to diffuse them.
[0057]
[0064]
In the example of FIG. 2B, the auditory analysis unit 24 identifies another one of the audio objects 34 (denoted as audio object 34B in the example of FIG. 2B), but is unable to identify any metadata for the audio object 34B. This example reflects a case where multiple microphones are not available on the video capture device 10 and, as a result, the video capture device 10 cannot determine audio metadata. As a result, the object association unit 26 may utilize the video metadata 52B of the associated video object 32B in place of audio metadata when rendering this audio object, generating an audio object 34B' (referring to one of the audio objects 34'). As shown in the example of FIG. 2B, audio object 34B' includes video metadata 52B.
[0058]
[0065]
In the example of FIG. 2C, the auditory analysis unit 24 identifies one of the audio objects 34 (denoted as "audio object 34C") and determines audio metadata 54C for this audio object, but cannot identify any of the video objects 32 that corresponds to this audio object 34C. Because no video object has been identified for this audio object 34C, the object association unit 26 may determine that the audio object 34C is located behind the video capture device 10. Based on this determination, object association unit 26 passes audio object 34C to unassisted rendering unit 28B as one of audio objects 34'' (i.e., audio object 34C'' in the example of FIG. 2C). The unassisted rendering unit 28B may then render this audio object in the background channels of the multi-channel audio data 40. When rendering this audio object 34C'', the unassisted audio rendering unit 28B may render the audio object 34C'' either based on the predicted position in the audio metadata 54C or spread very diffusely across the background channels. That is, the video capture device 10 may estimate the actual position based on the audio metadata, or, since the object has a cloud-like shape in space without a particular perceptual angle, may render the identified object very diffusely using a sound diffusion process.
[0059]
[0066]
In the example of FIG. 2D, the object association unit 26 receives one of the video objects 32 (denoted as "video object 32D" in the example of FIG. 2D) that includes video metadata 52D, but cannot associate the video object 32D with any of the audio objects 34. As a result, object association unit 26 passes video object 32D to augmented reality audio rendering unit 28C as one of video objects 32' (i.e., video object 32D' in the example of FIG. 2D). Video object 32D' includes video metadata 52D. The augmented reality audio rendering unit 28C may access a library of reference audio objects 34''' and may utilize video metadata 52D to obtain the one of the reference audio objects 34''' that would have been associated with the video object 32D' (e.g., a reference audio object 34''' that matches the type identified by video metadata 52D). The augmented reality audio rendering unit 28C may then render this reference audio object 34''', using video metadata 52D to refine or otherwise spatialize the rendering of the audio object 34'''.
[0060]
[0067]
Thus, the video capture device 10 may render an audio object based on the correlation of the metadata specified by the audio object with the metadata specified by the associated video object, and may attempt to localize the audio object as originating from the video object or a part of it. Considering that video scene analysis is often much more accurate than auditory scene analysis, video capture device 10 may, in some instances (as in FIG. 2A), favor (using weights) the video object metadata over the audio object metadata. Video capture devices may, in some instances, generate audio objects with no metadata at all or with very uncertain metadata (as shown in the example of FIG. 2B); the video capture device may then import the "matching" video object metadata for use as the metadata used when rendering the audio object.
[0061]
[0068]
To illustrate, enhanced metadata 56A may include both audio metadata 54A and video metadata 52A, where, in some examples, video metadata 52A may replace audio metadata 54A. In some examples, video capture device 10 may determine that audio metadata 54A and video metadata 52A have a high correlation. In other words, the video capture device 10 may determine that the position of the object that generated the sound, as specified by the audio metadata 54A, correlates with the position of the corresponding object defined by the video metadata 52A to a high degree (for example, up to some confidence threshold, often expressed as a percentage). Video capture device 10 may then render and mix the audio objects with high confidence to generate multi-channel audio data 40.
[0062]
[0069]
In another example, video capture device 10 may determine that audio metadata 54A and video metadata 52A have a low correlation. In this example, video capture device 10 may weight video metadata 52A so as to favor video metadata 52A over audio metadata 54A when generating enhanced metadata 56A. When rendering and mixing audio object 34A' to generate multi-channel audio data 40, the video capture device 10 may, due to the lack of correlation between audio metadata 54A and video metadata 52A, render the audio object 34A' more diffusely, spreading audio object 34A' across more foreground channels.
[0063]
[0070]
FIG. 3 is a block diagram illustrating the assisted audio rendering unit 28A of FIG. 1B in more detail. In the example of FIG. 3, assisted audio rendering unit 28A includes several spatial audio rendering units 60A-60N ("spatial audio rendering units 60"). Although several spatial audio rendering units 60 are shown in the example of FIG. 3, the assisted audio rendering unit 28A may, in some examples, include only a single spatial audio rendering unit 60 capable of processing multiple objects in parallel. Alternatively, assisted audio rendering unit 28A may include a single spatial audio rendering unit 60 capable of processing only a single audio object. The techniques should therefore not be limited in this respect to the example of FIG. 3.
[0064]
[0071]
In the example of FIG. 3, each of the spatial audio rendering units 60 may represent a separate audio rendering process that performs spatial audio rendering with respect to the audio objects 34A'-34N' ("audio objects 34'", shown in the example of FIG. 1B) to generate the audio data 38A. Spatial audio rendering may refer to various algorithms or processes for rendering audio data, and may include, as a few examples, ambisonics, wave field synthesis (WFS), and vector-based amplitude panning (VBAP). Spatial audio rendering units 60 may process each of audio objects 34' based on enhanced metadata 56A-56N ("enhanced metadata 56"). That is, spatial audio rendering units 60 may use the enhanced metadata 56 to further refine the corresponding ones of audio objects 34' so that they are more accurately localized when multi-channel audio data 40 is reproduced, or otherwise to render the audio objects 34' for more accurate localization. Spatial audio rendering units 60 may output rendered audio data 38A to audio mixing unit 30, which may then mix the rendered audio data 38A to generate multi-channel audio data 40. In some examples, the audio data 38A corresponding to a given audio object 34' may be mixed across two or more channels of multi-channel audio data 40.
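For illustration, the sketch below shows pairwise amplitude panning (the two-loudspeaker special case of VBAP, one of the rendering approaches named above) applied to a single audio object using an azimuth taken from its enhanced metadata. The tangent panning law and the assumed loudspeaker angles are standard textbook choices, not details from the patent.

```python
# Sketch of pairwise amplitude panning, as a spatial audio rendering unit might
# apply to place an object between the front-left and front-right loudspeakers.
import numpy as np

def pan_between_pair(signal, source_az_deg, left_az_deg=30.0, right_az_deg=-30.0):
    """Tangent-law panning between a loudspeaker pair; azimuths in degrees."""
    base = np.deg2rad((left_az_deg - right_az_deg) / 2.0)     # half the pair aperture
    center = (left_az_deg + right_az_deg) / 2.0
    theta = np.deg2rad(np.clip(source_az_deg - center, np.degrees(-base), np.degrees(base)))
    ratio = np.tan(theta) / np.tan(base)    # tangent law: (gL - gR) / (gL + gR)
    g_left, g_right = (1.0 + ratio) / 2.0, (1.0 - ratio) / 2.0
    norm = np.hypot(g_left, g_right)        # constant-power normalization
    return {"FL": (g_left / norm) * signal, "FR": (g_right / norm) * signal}

# Place a one-second noise burst slightly to the left of center.
channels = pan_between_pair(np.random.randn(48_000), source_az_deg=10.0)
```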
[0065]
[0072]
As described with respect to assisted audio rendering unit 28A in the example of FIG. 3, each of rendering units 28 may include spatial audio rendering units similar to spatial audio rendering units 60, which may similarly process audio objects 34'' and 34''' (again referring to the reference audio objects 34''' that were obtained from the reference audio library and would have been associated with the video objects 32') to generate audio data 38B and 38C. Additionally, although described as including a rendering unit 28C, the video capture device 10 may not include the rendering unit 28C, in which case the video capture device 10 may not perform the augmented reality audio rendering aspects of the techniques described in this disclosure.
[0066]
[0073]
FIG. 4 is a diagram illustrating a scene 70 captured by the camera 14 of the video capture device
10 shown in the example of FIG. 1B and processed according to the techniques described in this
disclosure. Scene 70 may represent a portion of video data 18 shown in the example of FIG. 1B.
Video capture device 10 may invoke visual analysis unit 22 in response to receiving scene 70,
wherein visual analysis unit 22 processes scene 70 to identify video object 32.
[0067]
[0074]
As shown in FIG. 4, a scene 70 includes, for example, a first frame or image 72A, a second frame
or image 72B, and a third frame or image 72C, in a temporal sequence of frames. Although
illustrated as including only three frames or images 72A-72C ("images 72") for ease of illustration, scene 70 may include many images 72 or a single image 72. The techniques should not be limited in this respect to the example shown in FIG. 4.
[0068]
[0075]
In any event, visual analysis unit 22 may process image 72A using computer vision algorithms to identify video objects 32A-32G. Visual analysis unit 22 may generate video objects 32A-32G so as to include video metadata 52A-52G, or otherwise associate video objects 32A-32G with video metadata 52A-52G. Video metadata 52A-52G may define the corresponding positions of video objects 32A-32G relative to the camera 14 that captured scene 70. The video metadata 52A-52G may also generally identify the type of the corresponding one of the video objects 32 based on, for example, machine-vision-based object recognition, where the machine-vision-based object recognition may be supported entirely by the visual analysis unit 22 or performed in conjunction with one or more external, and possibly remote, network servers. For example, video metadata 52A associated with video object 32A may identify video object 32A as a car. Video metadata 52B-52F may, as another example, identify the corresponding types of video objects 32B-32F as persons. Video metadata 52G may, as yet another example, identify the type of the corresponding video object 32G as a stereo.
[0069]
[0076]
Visual analysis unit 22 may analyze one or more of images 72 concurrently to generate positional information, in the form of video metadata 52A-52G, that represents metrics associated with movement, velocity, or other location-related attributes describing how video objects 32A-32G move within scene 70. To illustrate, consider video object 32A from image 72A to image 72C, where video object 32A moves generally along a horizontal line from a first position to a second position and then to a third position. The visual analysis unit 22 may identify the object 32A and generate video metadata 52A indicating that the video object 32A moves from the first position to the second position and then to the third position from image 72A to image 72B and then to image 72C. The video metadata 52A, when associated with a corresponding one of the audio objects 34 (e.g., audio object 34A), may allow the object association unit 26 to extend the audio metadata 54A to more accurately specify the location of the object emitting the audio data identified as audio object 34A (as visual scene analysis is generally considered more accurate than auditory scene analysis). Object association unit 26 may then generate one of audio objects 34' with enhanced metadata 56A (e.g., as shown in FIG. 2A).
[0070]
[0077]
As another example, consider video object 32G as it moves within scene 70. Initially, image 72A
shows video object 32G in a first position. Image 72B shows video object 32G in a second
position. Image 72C does not include video object 32G, suggesting that video object 32G has left
the scene and is either in the background or off to the left or right of scene 70 captured by
camera 14. Object association unit 26 may then generate video object 32G to include video
metadata 52G that specifies the position of video object 32G as video object 32G moves through
scene 70. Object association unit 26 may associate video object 32G with one of audio objects 34
having metadata of the same type, i.e., stereo in this example. However, considering that video
object 32G leaves the scene, object association unit 26 may not replace or otherwise utilize the
location information specified by video metadata 52G, and may instead maintain the location
information specified by the audio metadata 54 associated with this one of audio objects 34.
[0071]
[0078]
Object association unit 26 may utilize the location specified by video metadata 52G when
rendering the associated one of audio objects 34, e.g., audio object 34G, for playback with
respect to images 72A and 72B. However, while video metadata 52G may specify high confidence
levels for these time positions, it may specify low to zero confidence levels for the position
information corresponding to image 72C. As a result, object association unit 26 may not replace
or otherwise use the position information specified by video metadata 52G when rendering the
associated audio object 34G for playback when image 72C is presented. Instead, object
association unit 26 may utilize the position information specified by audio object 34G when
rendering audio object 34G during the time that image 72C is to be presented.
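A minimal sketch of the per-image fallback logic described above: the video-derived position is used while its confidence level remains high, and the audio-derived position is used once the video confidence drops to zero (as when video object 32G leaves the scene). The function name, the threshold, and the frame values are hypothetical.

```python
def choose_render_position(video_position, video_confidence,
                           audio_position, confidence_threshold=0.5):
    """Fall back to the audio-derived position whenever the video metadata reports
    low (or zero) confidence for the current frame; otherwise prefer the
    video-derived position."""
    if video_position is not None and video_confidence >= confidence_threshold:
        return video_position
    return audio_position

# Per-frame confidence as described for images 72A-72C: high, high, then zero.
frames = [
    {"video_pos": (-2.0, 0.0, 4.0), "video_conf": 0.9},
    {"video_pos": (0.0, 0.0, 4.0), "video_conf": 0.9},
    {"video_pos": None, "video_conf": 0.0},  # the object has left the scene
]
audio_pos = (0.5, 0.0, 4.5)  # position estimated by auditory scene analysis
positions = [choose_render_position(f["video_pos"], f["video_conf"], audio_pos) for f in frames]
```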
[0072]
[0079]
As noted above, object association unit 26 may not be able to identify a video object 32G
corresponding to audio object 34G, as in the example of image 72C. That is, video object 32G may
have left scene 70 as shown in image 72C, but the music playing from the stereo may still be
captured and identified as audio object 34G. In this example, object association unit 26 may
perform the operations described above with respect to FIG. 2C. That is, object association unit
26 may reclassify audio object 34G from its current classification as an audio object associated
with a video object to an audio object not associated with any of video objects 32, and may
process audio object 34G as described above with respect to FIG. 2C. Object association unit 26
may generate audio object 34G'' and pass this audio object 34G'' to unassisted audio rendering
unit 28B. In this regard, audio object 34G may transition from being processed in the manner
described above with respect to FIG. 2A to being processed as described above with respect to
FIG. 2C.
[0073]
[0080]
As such, video capture device 10 may dynamically perform the techniques described in this
disclosure to potentially generate more accurate multi-channel audio data 40. To this end, video
capture device 10 may adaptively classify audio objects 34 and migrate these audio objects 34
and video objects 32 among various ones of the three classes described above. In some
examples, video capture device 10 may adaptively classify audio objects 34 and video objects 32,
transitioning the processing of audio objects 34 and video objects 32 from one of the methods
described above with respect to FIGS. 2A-2D to a different one of the methods described above
with respect to FIGS. 2A-2D.
[0074]
[0081]
FIG. 5 is a diagram illustrating another scene 80 captured by the camera 14 of the video capture
device 10 shown in the example of FIG. 1B and processed in accordance with the augmented
reality aspects of the techniques described in this disclosure. In the example of FIG. 5, scene 80
may represent a portion of the video data 18 shown in the example of FIG. 1B. Video capture
device 10 may invoke visual analysis unit 22 in response to receiving scene 80, whereupon visual
analysis unit 22 processes scene 80 to identify video objects 32I and 32H. Scene 80 includes an
image 82. Although shown as including a single image, image 82, for ease of illustration, scene
80 may include additional images, and the techniques should not be limited in this respect to the
example shown in FIG. 5.
[0075]
[0082]
In any event, visual analysis unit 22 may identify and generate video objects 32I and 32H to
include video metadata 52I and 52H, respectively. Visual analysis unit 22 may pass video objects
32I and 32H to object association unit 26, which may attempt to associate video objects 32I and
32H with one of audio objects 34. Object association unit 26 is assumed, for purposes of
example, to associate video object 32I with one of audio objects 34, e.g., audio object 34I. Object
association unit 26 may then process audio object 34I in view of associated video object 32I in a
manner similar to that described above with respect to the example of FIG. 2A. Object
association unit 26 may then generate audio object 34I' with expanded metadata 56I.
[0076]
[0083]
In addition to the human identified as video object 32I, scene 80 includes a beach that visual
analysis unit 22 has identified as video object 32H, where, for purposes of illustration, it is
assumed that the sound of the waves has not been captured by microphones 16. That is, it is
assumed that video capture device 10 is sufficiently far from the beach that the sound of waves
crashing into the sand is not heard, whether due to distance, talking people, wind noise, or any
other disturbance. Object association unit 26 may consequently classify video object 32H as
belonging to the third class, i.e., in the example of the present disclosure, those of video objects
32 that are not associated with any of audio objects 34. As a result, object association unit 26
may process video object 32H and generate video object 32H' in the manner described above
with respect to the example of FIG. 2D. Object association unit 26 may then pass video object
32H' to augmented reality audio rendering unit 28C.
[0077]
[0084]
Augmented reality audio rendering unit 28C receives video object 32H' and obtains a
corresponding one of reference audio objects 34''' of the same type, which in this example may
be a type such as waves, beach, or the like. Audio rendering unit 28C may then render this one of
reference audio objects 34''', e.g., reference audio object 34H''', based on video metadata 52H.
Augmented reality audio rendering unit 28C may pass this rendered audio data as audio data
38C to mixing unit 30, which mixes audio data 38A-38C in the manner described above to
generate multi-channel audio data 40.
[0078]
[0085]
FIG. 6 is a flow chart illustrating exemplary operation of a video capture device, such as the
video capture device 10 shown in the example of FIG. 1B, in performing the techniques described
in this disclosure. Initially, video capture device 10 may be configured to invoke camera 14 to
capture video data 18 (90) and, simultaneously, to invoke one or more (and often all) of
microphones 16 to capture audio data 20 (92). In response to receiving video data 18 and audio
data 20, control unit 12 of video capture device 10 may be configured to perform the techniques
described in this disclosure for generating multi-channel audio data 40.
[0079]
[0086]
Upon receiving video data 18, control unit 12 may be configured to invoke visual analysis unit
22, which may perform visual scene analysis with respect to video data 18 to identify one or
more video objects 32 (94). Upon receiving audio data 20, control unit 12 may invoke auditory
analysis unit 24, which may perform auditory scene analysis with respect to audio data 20 to
identify one or more audio objects 34 (96).
[0080]
[0087]
Control unit 12 may receive video objects 32 and audio objects 34 and may invoke object
association unit 26. Object association unit 26 may compare audio objects 34 to video objects 32
in an attempt to associate at least one of audio objects 34 with at least one of video objects 32
(98). As explained above, when making this association, object association unit 26 may classify
each of audio objects 34 as a type of audio object, typically based on corresponding metadata
(which in some instances may define the type of audio object). Similarly, when making this
association, object association unit 26 may classify each of video objects 32 as a type of video
object, typically based on corresponding metadata (which may also define the type of video
object in some instances). Exemplary types may comprise a car, a beach, waves, running water,
music, a human, a dog, a cat, wind, and the like. Object association unit 26 may then determine
whether the type of one of audio objects 34 is the same as the type of one of video objects 32,
thereby determining whether there is a match (100). In response to determining that the type of
the one of audio objects 34 is the same as the type of the one of video objects 32, i.e., that a
match has been identified ("yes" 100), object association unit 26 may associate the one of audio
objects 34 with the matching one of video objects 32 (102).
[0081]
[0088]
With respect to those of audio objects 34 determined to belong to the first class, object
association unit 26 may determine the level of correlation between the audio metadata of one of
audio objects 34 and the video metadata of the associated one of video objects 32 and, based on
the determined level of correlation, generate composite metadata for the one of audio objects 34
with which the one of video objects 32 is associated. In some examples, object association unit
26 may also replace the audio metadata or portions thereof, such as the location specified by the
audio metadata, with the corresponding video metadata or portions thereof. Thus, object
association unit 26 may update one or more of audio objects 34 based on the associated ones of
video objects 32 in order to generate updated or expanded audio objects 34' (104).
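The following sketch shows one plausible way composite position metadata could be formed from the determined level of correlation; the inverse-distance correlation measure, the threshold, and the averaging fallback are assumptions for illustration rather than the specific rule used by object association unit 26.

```python
import math

def composite_position(audio_pos, video_pos, correlation_threshold=0.75):
    """Blend or replace the audio-derived position with the video-derived one,
    depending on how well the two estimates agree."""
    distance = math.dist(audio_pos, video_pos)
    correlation = 1.0 / (1.0 + distance)          # closer estimates -> higher correlation
    if correlation >= correlation_threshold:
        # Estimates agree: keep the (typically more precise) video position.
        return video_pos, correlation
    # Estimates disagree: average them so neither modality dominates.
    blended = tuple((a + v) / 2.0 for a, v in zip(audio_pos, video_pos))
    return blended, correlation

pos, level = composite_position(audio_pos=(0.4, 0.0, 4.2), video_pos=(0.0, 0.0, 4.0))
```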
[0082]
[0089]
Object association unit 26 may then pass these audio objects 34' to assisted audio rendering unit
28A. Assisted audio rendering unit 28A may then render one of audio objects 34' in one or more
foreground channels of multi-channel audio data 40 based on the composite metadata generated
for that one of audio objects 34' (106). Assisted audio rendering unit 28A passes this portion of
multi-channel audio data 40 to audio mixing unit 30 as audio data 38A.
[0083]
[0090]
With regard to those of audio objects 34 determined to belong to the second class, i.e.,
determined not to correspond to any of video objects 32 in the example of the present disclosure
(or, in other words, those for which no match exists, "no" 100, and which are audio objects, "yes"
108), object association unit 26 may pass these audio objects 34 to unassisted audio rendering
unit 28B as audio objects 34''. Unassisted audio rendering unit 28B may generate multi-channel
audio data 40 such that one of audio objects 34'' occurs in one or more background channels of
multi-channel audio data 40. Unassisted audio rendering unit 28B may be configured to render
unmatched audio objects 34'' in the background, often as diffuse sound (110). Unassisted audio
rendering unit 28B passes this portion of multi-channel audio data 40 to audio mixing unit 30 as
audio data 38B.
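A brief sketch of rendering an unmatched audio object as diffuse sound in the background channels, assuming the same illustrative 5.1 channel ordering as in the earlier panning sketch; the equal-gain spread over the surround channels is a stand-in for whatever diffusion method unassisted audio rendering unit 28B actually applies.

```python
import numpy as np

def render_diffuse_background(mono: np.ndarray, num_channels: int = 6,
                              background_channels=(4, 5)) -> np.ndarray:
    """Render an unmatched audio object as diffuse sound by spreading it equally,
    at reduced gain, over the background (surround) channels. Channel indices
    assume the illustrative L, R, C, LFE, Ls, Rs ordering used earlier."""
    out = np.zeros((num_channels, mono.shape[0]))
    gain = 1.0 / np.sqrt(len(background_channels))  # keep overall power roughly constant
    for ch in background_channels:
        out[ch] = gain * mono
    return out
```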
[0084]
[0091]
For those of video objects 32 determined to belong to the third class, i.e., video objects 32 not
associated with any of audio objects 34 in the example of FIG. 1B (or, in other words, those that
object association unit 26 does not match to any of audio objects 34 and that are video objects,
"no" 100, "no" 108), object association unit 26 may pass these video objects 32 to augmented
reality audio rendering unit 28C as video objects 32'. Augmented reality audio rendering unit
28C, in response to receiving video objects 32', may obtain from a reference audio library the
reference audio object that would have been associated with each one of video objects 32' (when
possible), and may then render each of the reference audio objects (which may be referred to as
audio objects 34''') to generate at least a portion of multi-channel audio data 40 (112).
Augmented reality audio rendering unit 28C passes this portion of multi-channel audio data 40
to audio mixing unit 30 as audio data 38C.
[0085]
[0092]
Audio mixing unit 30 receives audio data 38 and mixes audio data 38 to form multi-channel
audio data 40 (114). Audio mixing unit 30 may mix this audio data 38 as described above to
generate any form of multi-channel audio data 40. These formats may include a 5.1 surround
sound format, a 7.1 surround sound format, a 10.1 surround sound format, a 22.2 surround
sound format, or any other proprietary or non-proprietary format. Audio mixing unit 30 may
then output this multi-channel audio data 40 (116).
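To illustrate the final mixing step, the sketch below sums three stems (corresponding to audio data 38A-38C) into a single multi-channel buffer and applies a simple peak normalization; the 5.1 channel count and the normalization rule are illustrative assumptions.

```python
import numpy as np

def mix_stems(stems, num_channels: int = 6) -> np.ndarray:
    """Sum the foreground (38A), background (38B), and augmented-reality (38C)
    stems into one multi-channel buffer, padding shorter stems with silence.
    The 5.1 channel count is just one of the possible output formats."""
    length = max(s.shape[1] for s in stems)
    mix = np.zeros((num_channels, length))
    for s in stems:
        mix[:, : s.shape[1]] += s
    # Simple peak normalization so the summed stems do not clip.
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix
```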
[0086]
[0093]
Thus, control unit 12 of video capture device 10 may be configured to analyze the audio data to
identify one or more audio objects and to analyze the video data, captured by the device
concurrently with the capture of the audio data, to identify one or more video objects. Control
unit 12 may further be configured to associate one of audio objects 34 with one of video objects
32 and to generate multi-channel audio data 40 from audio data 20 based on the association of
the one of audio objects 34 with the one of video objects 32.
[0087]
[0094]
Although described in the context of generating multi-channel audio data 40, video capture
device 10 may further encode the audio data. When encoding, diffusing audio objects into the
background may allow video capture device 10 to encode these audio objects using fewer bits.
That is, audio objects in the background, behind the listener, or far away may be less important
than audio objects up close; when an audio object is not seen or focused on by the eye and is
presented together with other audio objects, it may not need to be rendered in high quality
because it is very likely to be masked. As a result, video capture device 10 may assign fewer bits
to these audio objects when encoding and transmitting them for the playback system.
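A hedged sketch of the bit-allocation idea described above: background or diffuse objects receive a smaller share of an assumed overall bit budget than focused foreground objects. The class labels, weights, and budget are hypothetical.

```python
def allocate_bits(audio_objects, total_kbps=256):
    """Split an overall bit budget across audio objects, favoring focused foreground
    objects over diffuse/background ones that are likely masked."""
    weights = {"foreground_focused": 3.0, "foreground_diffuse": 2.0, "background": 1.0}
    total_weight = sum(weights[o["class"]] for o in audio_objects)
    return {o["id"]: total_kbps * weights[o["class"]] / total_weight for o in audio_objects}

budget = allocate_bits([
    {"id": "34A", "class": "foreground_focused"},
    {"id": "34G", "class": "background"},
])
# e.g. {"34A": 192.0, "34G": 64.0}
```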
[0088]
[0095]
Also, although described as being performed after capture of the audio and video data (or
"offline," as this type of processing is commonly referred to), rather than as a real-time or
near-real-time system, the techniques may be implemented in a real-time or near-real-time
system during capture of the audio data and at least a portion of the video data.
Implementations of video scene analysis exist for near-real-time or real-time systems, and audio
scene analysis is typically less complex than video scene analysis, meaning that auditory scene
analysis can likewise be performed by near-real-time or real-time devices.
[0089]
[0096]
Furthermore, although described with respect to the audio and visual domains, the techniques
may be performed with respect to other domains. For example, touch, motion, compass, altitude,
temperature, and other sensor domains may also be considered together to improve media
rendering quality, with potential focus on 3D spatial properties. Thus, the techniques should not
in this regard be limited to the examples described in the present disclosure.
[0090]
[0097]
FIG. 7 is a diagram illustrating how various audio objects 126A-126K may be rendered in the
foreground and background of multi-channel audio data in accordance with the techniques
described in this disclosure. The diagram of FIG. 7 depicts a view 120 showing what is commonly
referred to as a "sweet spot" from a top-down or bird's-eye point of view. The sweet spot refers to
the location in a room where the surround sound experience is most optimal when the speakers
are properly configured for 5.1 or higher-order surround sound playback.
[0091]
[0098]
In the example of FIG. 7, view 120 is divided into two portions, shown as foreground portion
122A and background portion 122B. Within the circle, listener 124 is positioned at the center of
the sweet spot, on the horizontal line separating foreground portion 122A from background
portion 122B. During playback of multi-channel audio data 40, listener 124 may hear audio
objects 126A-126K in the sound field as presented in view 120. That is, audio objects 126A-126D
appear to originate from the more distant foreground, from the perspective of listener 124.
Audio objects 126A-126D may have been processed by object association unit 26 as described
above with respect to FIG. 2B, so that assisted audio rendering unit 28A renders them in the far
foreground as more diffuse audio objects due to the lack of any audio metadata.
[0092]
[0099]
Audio objects 126E-126G may appear, from the perspective of listener 124, as occurring in the
closer foreground as more focused objects. Audio objects 126E-126G may have been processed
by object association unit 26 in the manner described above with respect to FIG. 2A, such that
assisted audio rendering unit 28A renders them in the more focused foreground, given the
ability of object association unit 26 to provide enhanced metadata that correlates the audio
metadata with the video metadata.
[0093]
[0100]
One or more of audio objects 126A-126G may be reference audio objects obtained from a
reference library in the manner described above with respect to augmented reality audio
rendering unit 28C. In this sense, object association unit 26 may identify those of video objects
32 that do not match any of audio objects 34 and pass those of video objects 32 to augmented
reality audio rendering unit 28C as video objects 32'. Augmented reality audio rendering unit
28C may then obtain the one of reference audio objects 34''' that corresponds to or matches one
of video objects 32', and render this one of reference audio objects 34''' based on the video
metadata contained within the associated one of video objects 32'.
[0094]
[0101]
Audio objects 126H-126K may appear as occurring in the background from the perspective of
listener 124. Audio objects 126H-126K may have been processed by object association unit 26 in
the manner described above with respect to FIG. 2C, so that unassisted audio rendering unit 28B
renders these audio objects 34'' in the background, given the inability of object association unit
26 to associate them with any one of video objects 32. That is, since auditory scene analysis is
typically not as accurate in locating the source of a sound as visual scene analysis, unassisted
audio rendering unit 28B may at times be unable to position the source correctly. At most,
unassisted audio rendering unit 28B may simply render audio objects 34'' based on the
corresponding audio metadata 54, which may result in audio rendering unit 28B rendering these
audio objects 34'' in the background as more diffuse objects.
[0095]
[0102]
In this manner, the techniques may enable a device to analyze audio data captured by the device
to identify one or more audio objects, and to analyze video data captured by the device,
concurrently with the capture of the audio data, to identify one or more video objects. The
device may further associate at least one of the one or more audio objects with at least one of
the one or more video objects, and generate multi-channel audio data from the audio data based
on the association of the at least one of the one or more audio objects with the at least one of
the one or more video objects.
[0096]
[0103]
In some examples, when analyzing the audio data, the device may perform auditory scene
analysis of the audio data to identify the one or more audio objects and audio metadata
describing the one or more audio objects, where the audio metadata comprises one or more of a
position, a shape, a velocity, and a confidence level of the corresponding audio object. When
analyzing the video data, the device may perform visual scene analysis of the video data to
identify the one or more video objects and video metadata describing the one or more video
objects. The video metadata may comprise one or more of a position, a shape, a velocity, and a
confidence level of the corresponding video object.
[0097]
[0104] In some examples, when the device associates at least one of the one or more audio
objects with at least one of the one or more video objects, the device may classify each of the
one or more audio objects as a type of audio object, classify each of the one or more video
objects as a type of video object, determine that the type of the at least one of the audio objects
is the same as the type of the at least one of the video objects, and, in response to determining
that the type of the at least one of the one or more audio objects is the same as the type of the
at least one of the one or more video objects, associate the at least one of the one or more audio
objects with the at least one of the one or more video objects.
[0098]
[0105] In some examples, when generating the multi-channel audio data, the device may
determine a level of correlation between the audio metadata of the at least one of the one or
more audio objects and the video metadata of the at least one of the one or more video objects
with which it is associated, generate, based on the determined level of correlation, composite
metadata for the at least one of the one or more audio objects with which the at least one of the
one or more video objects is associated, and render, based on the composite metadata generated
for the at least one of the one or more audio objects, the at least one of the one or more audio
objects in one or more foreground channels of the multi-channel audio data.
[0099]
[0106]
In some examples, at least one of the one or more audio objects comprises a first of the one or
more audio objects.
In some instances, the device may further determine that a second one of the one or more audio
objects is not associated with any of the one or more video objects and, when generating the
multi-channel audio data, may generate the multi-channel audio data such that the second one
of the audio objects occurs within one or more background channels of the multi-channel audio
data.
[0100]
[0107] When generating the multi-channel audio data, the device may generate the
multi-channel audio data such that the second one of the audio objects occurs as a diffuse audio
object in the one or more background channels of the multi-channel audio data.
[0101]
[0108]
In some examples, at least one of the one or more video objects comprises a first one of the one
or more video objects.
In these examples, the device may determine that a second one of the one or more video objects
is not associated with any of the one or more audio objects.
In response to determining that the second one of the one or more video objects is not
associated with any of the one or more audio objects, the device may obtain, from an audio
library, a reference audio object that would have been associated with the second one of the one
or more video objects. Additionally, the device may render the reference audio object based on
the second one of the one or more video objects to generate at least a portion of the
multi-channel audio data.
[0102]
[0109]
In some instances, when analyzing the audio data, the device may perform auditory scene
analysis of the audio data to identify the one or more audio objects and audio metadata
describing the one or more audio objects. When analyzing the video data, the device may
perform visual scene analysis of the video data to identify the one or more video objects and
video metadata describing the one or more video objects. In these examples, the audio metadata
is defined in a text format common to the text format used to define the video metadata.
[0103]
[0110]
In some instances, when analyzing the audio data, the device may perform auditory scene
analysis of the audio data to identify the one or more audio objects and audio metadata
describing the one or more audio objects. When analyzing the video data, the device may
perform visual scene analysis of the video data to identify the one or more video objects and
video metadata describing the one or more video objects. In these examples, when generating
the multi-channel audio data, the device may determine a level of correlation between the audio
metadata identified for at least one of the audio objects and the video metadata identified for
the associated one of the video objects, and may render the at least one of the audio objects as a
diffuse audio object based on the determined level of correlation. Often, the level of correlation
is based on some form of confidence interval, where the level of confidence may be derived as a
function of the percentage difference between the metadata of the audio object and that of the
peer video object, and of the confidence interval.
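The sketch below illustrates one possible mapping from the percentage difference between an audio metadata value and its peer video metadata value to a confidence level, given a confidence interval; the piecewise-linear mapping is an assumption introduced for illustration only.

```python
def correlation_confidence(audio_value: float, video_value: float,
                           confidence_interval: float = 0.2) -> float:
    """Derive a confidence level from the percentage difference between an audio
    metadata value and its peer video metadata value: differences within the
    confidence interval map toward 1.0, larger differences decay toward 0.0."""
    reference = max(abs(audio_value), abs(video_value), 1e-9)
    pct_diff = abs(audio_value - video_value) / reference
    if pct_diff <= confidence_interval:
        return 1.0 - pct_diff / (2.0 * confidence_interval)   # stays in [0.5, 1.0]
    return max(0.0, 0.5 - (pct_diff - confidence_interval))    # decays below 0.5

conf = correlation_confidence(audio_value=4.2, video_value=4.0)
```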
[0104]
[0111] Various aspects of the techniques may also enable a device comprising one or more
processors to obtain an audio object, obtain a video object, associate the audio object with the
video object, compare the audio object to the associated video object, and render the audio
object based on the comparison between the audio object and the associated video object.
[0105]
[0112]
In some instances, an audio object includes audio metadata.
In some examples, audio metadata comprises size and position. In some examples, a video object
includes video metadata. In some examples, video metadata comprises size and position.
[0106]
[0113] In some examples, the one or more processors are further configured to, when comparing
the audio object to the associated video object, at least partially generate composite metadata
comprising one or more of a size and a position.
[0107]
[0114]
In some examples, the audio object includes position metadata and the video object includes
position metadata.
When generating the composite metadata, the one or more processors are further configured to
compare the position metadata of the audio object to the position metadata of the video object
to determine a correlation value, and to generate position metadata of the composite metadata
based on a determination of whether the correlation value has exceeded a confidence threshold.
[0108]
[0115] In addition, various aspects of the techniques may provide a method comprising
acquiring an audio object, acquiring a video object, associating the audio object with the video
object, comparing the audio object with the associated video object, and rendering the audio
object based on the comparison between the audio object and the associated video object.
[0109]
[0116] Additionally, when comparing the audio object to the associated video object, the method
may further include at least partially generating composite metadata comprising one or more of
a size and a position.
[0110]
[0117] Also, when the audio object includes position metadata and the video object includes
position metadata, generating the composite metadata may include comparing the position
metadata of the audio object to the position metadata of the video object to determine a
correlation value, and generating position metadata of the composite metadata based on the
determination of whether the correlation value has exceeded a confidence threshold.
[0111]
[0118] Furthermore, various aspects of the techniques may provide a device comprising means
for acquiring an audio object, means for acquiring a video object, means for associating the
audio object with the video object, means for comparing the audio object with the associated
video object, and means for rendering the audio object based on the comparison between the
audio object and the associated video object.
[0112]
[0119] In addition, the means for comparing the audio object to the associated video object may
comprise means for at least partially generating composite metadata comprising one or more of
a size and a position.
[0113]
[0120] Also, when the audio object comprises position metadata and the video object comprises
position metadata, the means for generating the composite metadata may comprise means for
comparing the position metadata of the audio object with the position metadata of the video
object to determine a correlation value, and means for generating position metadata of the
composite metadata based on the determination of whether the correlation value has exceeded
the confidence threshold.
[0114]
[0121] In some instances, a non-transitory computer-readable storage medium has stored
thereon instructions that, when executed, cause one or more processors to obtain the audio
object, obtain the video object, associate the audio object with the video object, compare the
audio object with the associated video object, and render the audio object based on the
comparison between the audio object and the associated video object.
[0115]
[0122]
Various aspects of the techniques described in this disclosure may also be performed by a device
that generates an audio output signal.
The device may comprise means for identifying a first audio object associated with a counterpart
first video object based on a first comparison between a data component of the first audio object
and a data component of the first video object, and means for identifying a second audio object
not associated with a counterpart second video object based on a second comparison between a
data component of the second audio object and a data component of the second video object.
The device additionally comprises means for rendering the first audio object in a first zone,
means for rendering the second audio object in a second zone, and means for generating the
audio output signal based on combining the first audio object rendered in the first zone and the
second audio object rendered in the second zone.
The various means described herein may comprise one or more processors configured to
perform the functions described with respect to each of the means.
[0116]
[0123]
In some examples, the data component of the first audio object comprises one of position and
size.
In some examples, the data component of the first video object comprises one of position and
size.
In some examples, the data component of the second audio object comprises one of position and
size.
In some examples, the data component of the second video object comprises one of position and
size.
[0117]
[0124]
In some examples, the first zone and the second zone are different zones in the audio foreground,
or different zones in the audio background.
In some examples, the first zone and the second zone are the same zone in the audio foreground,
or the same zone in the audio background. In some instances, the first zone is in the audio
foreground and the second zone is in the audio background. In some instances, the first zone is in
the audio background and the second zone is in the audio foreground.
[0118]
[0125] In some examples, the data component of the first audio object, the data component of
the second audio object, the data component of the first video object, and the data component of
the second video object each comprise metadata.
[0119]
[0126]
In some examples, the device further comprises means for determining whether the first
comparison is outside a confidence interval, and means for weighting the data component of the
first audio object and the data component of the first video object based on the determination of
whether the first comparison is outside the confidence interval.
In some examples, the means for weighting comprises means for averaging the data component
of the first audio object and the data component of the first video object.
[0120]
[0127] In some examples, the device may also include means for assigning different numbers of
bits based on one or more of the first comparison and the second comparison.
[0121]
[0128] In some examples, the techniques may provide a non-transitory computer-readable
storage medium having stored thereon instructions that, when executed, cause the one or more
processors to identify a first audio object associated with a counterpart first video object based
on a first comparison of a data component of the first audio object and a data component of the
first video object, identify a second audio object not associated with a counterpart second video
object based on a second comparison of a data component of the second audio object and a data
component of the second video object, render the first audio object in a first zone, render the
second audio object in a second zone, and generate an audio output signal based on combining
the first audio object rendered in the first zone and the second audio object rendered in the
second zone.
[0122]
[0129]
Depending on the example, certain operations or events of any of the methods described herein
may be performed in a different order, may be added, merged, or left out altogether (e.g., it
should be understood that not all described operations or events are necessary for the practice
of the method). Further, in certain examples, operations or events may be performed
simultaneously rather than sequentially, e.g., through multi-threaded processing, interrupt
processing, or multiple processors. In addition, while certain aspects of this disclosure are
described as being performed by a single module or unit for purposes of clarity, it should be
understood that the techniques of this disclosure may be performed by a combination of units or
modules associated with a video coder.
[0123]
[0130]
In one or more examples, the functions described may be implemented in hardware, software,
firmware, or any combination thereof. If implemented in software, the functions may be stored
on or transmitted as one or more instructions or code on a computer-readable medium and
executed by a hardware-based processing unit. Computer-readable media may include, for
example, computer-readable storage media, which corresponds to a tangible medium such as
data storage media, or communication media, including any medium that facilitates transfer of a
computer program from one place to another according to a communication protocol.
[0124]
[0131]
In this way, a computer-readable medium may generally correspond to (1) a tangible
computer-readable storage medium that is non-transitory, or (2) a communication medium such
as a signal or carrier wave. Data storage media may be any available media that can be accessed
by one or more computers or one or more processors to retrieve instructions, code, and/or data
structures for implementation of the techniques described in this disclosure. A computer
program product may include a computer-readable medium.
[0125]
[0132]
By way of example, and not limitation, such computer-readable storage media can comprise
RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other
magnetic storage devices, flash memory, or any other medium that can be used to store desired
program code in the form of instructions or data structures and that can be accessed by a
computer. Also, any connection is properly termed a computer-readable medium. For example, if
the instructions are transmitted from a website, server, or other remote source using a coaxial
cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such
as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or
wireless technologies such as infrared, radio, and microwave are included in the definition of
medium.
[0126]
[0133]
However, it should be understood that computer-readable storage media and data storage media
do not include connections, carrier waves, signals, or other transitory media, but are instead
directed to non-transitory, tangible storage media. Disk and disc, as used herein, include compact
disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray (registered
trademark) disc, where disks usually reproduce data magnetically, while discs reproduce data
optically with lasers. Combinations of the above should also be included within the scope of
computer-readable media.
[0127]
[0134]
The instructions may be executed by one or more processors, such as one or more digital signal
processors (DSPs), general-purpose microprocessors, application-specific integrated circuits
(ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic
circuitry. Thus, the term "processor," as used herein, may refer to any of the foregoing structures
or any other structure suitable for implementation of the techniques described herein. In
addition, in some aspects, the functionality described herein may be provided within dedicated
hardware and/or software modules configured for encoding and decoding, or incorporated in a
combined codec. Also, the techniques could be fully implemented in one or more circuits or logic
elements.
[0128]
[0135]
The techniques of this disclosure may be implemented in a wide variety of devices or
apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs (e.g., a chip
set). Various components, modules, or units are described in this disclosure to emphasize
functional aspects of devices configured to perform the disclosed techniques, but they do not
necessarily require realization by different hardware units. Rather, as described above, the
various units may be combined in a codec hardware unit or provided by a collection of
interoperative hardware units, including one or more processors as described above, in
conjunction with suitable software and/or firmware.
[0129]
[0136]
Various embodiments of the present technique have been described. These and other
embodiments are within the scope of the following claims.