Patent Translate
Powered by EPO and Google
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
Abstract: Embodiments are directed to an interconnect for coupling components in an object-based rendering system. The interconnect couples a renderer to an array of individually addressable drivers that project sound in a listening environment, and comprises a first network channel for transmitting audio signals and control data from the renderer to the array, and a second network channel for transmitting, from a microphone located in the listening environment to a calibration component of the renderer, acoustic information generated by the microphone for calibration control.
Bidirectional interconnection for communication between renderer and array of individually
addressable drivers
This application claims priority to US Provisional Patent Application Ser. No. 61/696,030, filed Aug. 31, 2012, the contents of which are hereby incorporated by reference in their entirety.
FIELD OF THE INVENTION
One or more implementations relate generally to bi-directional interconnects for audio signal processing, and more particularly to interconnects for systems that render audio and direct audio signals through individually addressable drivers using reflected sound.
The subject matter discussed in the Background section should not be assumed to be prior art merely as a result of its mention in the Background section. Similarly, any problems mentioned in the Background section, or associated with the subject matter of the Background section, should not be assumed to have been previously recognized in the prior art. The subject matter in the Background section merely represents different approaches, which in and of themselves may also be inventions.
The interconnection system for audio applications is typically a simple one-way link that sends
speaker feed signals from a source or renderer to an array of speakers. With the advent of
advanced audio content such as object-based audio, however, the complexity of the rendering process and the nature of the audio content transmitted to the various speaker arrays now possible have increased significantly. For example, movie soundtracks typically comprise many different sound elements corresponding to on-screen images, dialog, noises and sound effects that emanate from different places on the screen and combine with background music and ambient effects to create the overall audience experience. Accurate playback requires that sounds be reproduced in a way that corresponds as closely as possible to what is shown on screen with respect to sound source position, intensity, movement and depth. Traditional
channel-based audio systems send audio content in the form of speaker feeds to individual
speakers in a listening environment. In this case, usually a normal one-way interconnection to the
speaker is sufficient.
However, with the introduction of digital cinema and the development of truly three-dimensional
("3D") or virtual 3D content, a new standard for sound is being created. For example, the
incorporation of multi-channel audio allows content creators greater creativity and allows the
audience to have a more immersive, realistic listening experience. It is critical to extend beyond traditional speaker feeds and channel-based audio as a means of distributing spatial audio, and there has been considerable interest in a model-based audio description that allows the listener to select a desired playback configuration, with the audio rendered specifically for that chosen configuration. The spatial presentation of sound utilizes audio objects. An
audio object is an audio signal with an associated parametric source description, such as apparent source position (e.g., 3D coordinates), apparent source width and other parameters. As a
further advancement, a next-generation spatial audio format (also referred to as "adaptive audio") has been developed that comprises a mix of audio objects and traditional channel-based speaker feeds, along with positional metadata for the audio objects. In a spatial audio decoder, the channels are sent directly to their associated speakers (if the appropriate speakers are present) or down-mixed to an existing speaker set, while the audio objects are rendered by the decoder in a flexible manner. The parametric source description associated with each object, such as a positional trajectory in 3D space, is taken as an input along with the number and position of the speakers connected to the decoder. The renderer then distributes the audio associated with each object across a set of
attached speakers, using some sort of algorithm such as the Pan Law. In this way, the authored
spatial intent of each object is best presented through the particular speaker configuration
present in the listening room.
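The pan-law distribution described above can be sketched with a minimal example. This is an illustration only: the two-speaker mapping and the use of a constant-power (sine/cosine) law are assumptions for exposition, not the specific algorithm of the embodiments.

```python
import math

def constant_power_pan(position: float) -> tuple[float, float]:
    """Constant-power pan of an object between two speakers.

    position: -1.0 (full left) .. +1.0 (full right).
    Returns (left_gain, right_gain); the squared gains always sum to 1,
    so perceived loudness stays constant as the object moves.
    """
    theta = (position + 1.0) * math.pi / 4.0  # map [-1, 1] -> [0, pi/2]
    return math.cos(theta), math.sin(theta)

# An object panned to the center is fed equally to both speakers.
left, right = constant_power_pan(0.0)
```

A renderer for an individually addressable array would apply an analogous law across however many drivers span the object's position.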
Current interconnection systems do not take full advantage of the features and functionality of such next-generation audio systems. Such interconnections are limited to sending speaker-feed audio signals and perhaps some limited control signals, and do not provide sufficient structure to exploit the full system-wide rendering, configuration and calibration capabilities. There is therefore a need for an interconnect system that transmits appropriate information from the listening environment to the renderer, allows the renderer to transmit speaker feeds tailored to a particular speaker array, and allows certain automated configuration and calibration routines to be invoked for optimized playback of object-based audio content.
Embodiments of an interconnection system are described for use in rendering spatial audio content in a listening environment. The components of the system are coupled together through a physical/logical interconnection system and include a renderer configured to generate a plurality of audio channels, including information specifying the playback position of each audio channel; an array of individually addressable drivers for placement around the listening environment; and a calibration/configuration component that processes acoustic information provided by microphones placed in the listening environment. The interconnection may be implemented as a bi-directional link for the transmission of audio and control signals between the renderer/calibration unit and the speaker drivers.
Embodiments are particularly directed to an interconnect for coupling components in an object-based rendering system. The interconnect couples a renderer to an array of individually addressable drivers that project sound in a listening environment, and comprises a first network channel for transmitting audio signals and control data from the renderer to the array, and a second network channel for transmitting, from a microphone located in the listening environment to a calibration component of the renderer, acoustic information generated by the microphone for calibration control.
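The two network channels of the interconnect described above can be pictured with a small data-model sketch. The packet structures and field names below are illustrative assumptions, not the claimed protocol format.

```python
from dataclasses import dataclass, field

@dataclass
class DownstreamPacket:
    """First network channel: renderer -> individually addressable driver array."""
    driver_id: int                 # which driver in the array is addressed
    audio_samples: list[float]     # speaker-feed audio for that driver
    control_data: dict = field(default_factory=dict)  # e.g. gain, delay, EQ settings

@dataclass
class UpstreamPacket:
    """Second network channel: listening-environment microphone -> calibration component."""
    microphone_id: int
    acoustic_samples: list[float]  # room response captured in the listening environment

def route(packet) -> str:
    """Toy dispatcher illustrating the bi-directional nature of the interconnect."""
    return "to drivers" if isinstance(packet, DownstreamPacket) else "to calibration"
```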
The rendering system described herein may implement a new spatial description format, enabled by an advanced set of content creation tools built for cinema sound mixers, together with updated content creation tools, new distribution methods based on adaptive audio systems with new speaker and channel configurations, and audio formats and systems that provide an improved user experience.
Audio streams (generally including channels and objects) are transmitted along with metadata
that describes the intent of the content creator or sound mixer, including the desired position of
the audio stream. The position can be expressed as a named channel (from among the predefined
channel configuration settings) or as 3D spatial position information. Embodiments are also directed to systems and methods for rendering adaptive audio content that includes reflected sound and direct sound, intended to be played through a speaker or driver array that includes both direct (front-firing) drivers and reflected (upward-firing or side-firing) drivers.
Incorporation by Reference: Each publication, patent and/or patent application mentioned in this specification is hereby incorporated by reference in its entirety, to the same extent as if each individual publication and/or patent application were specifically and individually indicated to be incorporated by reference.
Like reference symbols are used to refer to like elements throughout the drawings.
Although the following drawings depict various examples, the one or more implementations are not limited to the examples depicted in the drawings.
FIG. 1 illustrates an example speaker placement in a surround system (e.g., 9.1 surround) that provides height speakers for reproduction of height channels.
FIG. 2 illustrates the combination of channel-based and object-based data to produce an adaptive audio mix, under an embodiment.
FIG. 3 is a block diagram of a playback architecture for use in an adaptive audio system, under an embodiment.
FIG. 4A is a block diagram illustrating the functional components for adapting cinema-based audio content for use in a consumer environment, under an embodiment.
FIG. 4B is a detailed block diagram of the components of FIG. 4A, under an embodiment.
FIG. 4C is a block diagram of the functional components of a consumer-based adaptive audio environment, under an embodiment.
FIG. 4D illustrates a distributed rendering system in which a portion of the rendering function is performed in the speaker units, under an embodiment.
FIG. 5 illustrates the deployment of an adaptive audio system in an example home theater environment.
FIG. 6 illustrates the use of an upward-firing driver using reflected sound to simulate an overhead speaker in a home theater.
FIG. 7A illustrates a speaker having a plurality of drivers in a first configuration, for use in an adaptive audio system having a reflected sound renderer, under an embodiment.
FIG. 7B illustrates a speaker having drivers distributed over multiple enclosures, for use in an adaptive audio system having a reflected sound renderer, under an embodiment.
FIG. 7C illustrates an example configuration for a soundbar used in an adaptive audio system having a reflected sound renderer, under an embodiment.
FIG. 8 illustrates an example placement of speakers having individually addressable drivers, including upward-firing drivers, placed within a listening room.
FIG. 9A illustrates a speaker configuration for an adaptive audio 5.1 system utilizing multiple addressable drivers for reflected audio, under an embodiment.
FIG. 9B illustrates a speaker configuration for an adaptive audio 7.1 system utilizing multiple addressable drivers for reflected audio, under an embodiment.
FIG. 10A illustrates the composition of a bi-directional interconnection, under an embodiment.
FIG. 10B illustrates the composition of a unidirectional interconnection, under an embodiment.
FIG. 11 illustrates an automated configuration and system calibration process for use in an adaptive audio system, under an embodiment.
FIG. 12 is a flow chart illustrating the process steps for a calibration method used in an adaptive audio system, under an embodiment.
FIG. 13 illustrates the use of an adaptive audio system in an example television and soundbar consumer use case.
FIG. 14 shows a simplified representation of a three-dimensional binaural headphone virtualization in an adaptive audio system, under an embodiment.
FIG. 15 is a table illustrating certain metadata definitions for use in an adaptive audio system utilizing a reflected sound renderer for a consumer environment, under an embodiment.
Systems and methods for interconnection between an object-based renderer and an array of
individually addressable speaker drivers are described. The interconnect supports the
transmission of audio and control signals to the driver and the transmission of audio information
from the listening environment to the renderer. The renderer includes or is coupled to a
calibration unit that processes acoustic information about the listening environment for
automatic configuration and calibration of the renderer and driver. The driver array includes
drivers that are configured and oriented to propagate sound waves directly at one location or
reflected from one or more surfaces, or otherwise diffused within the listening area. Aspects of the one or more embodiments described herein may be implemented in an audio or audio-visual system that processes source audio information in a mixing, rendering and playback system that includes one or more computers or processing devices executing software instructions. Any of the described embodiments may be used alone or together with one another in any combination. Although the various embodiments may have been motivated by various deficiencies of the prior art that may be discussed or alluded to in one or more places in the specification, the embodiments do not necessarily address any of these deficiencies. In other words, different embodiments may address different deficiencies that may be discussed in the specification; some embodiments may only partially address some deficiencies, or just one deficiency, and some embodiments may not address any of them.
For purposes of the present description, the following terms have the associated meanings: the term "channel" means an audio signal plus metadata in which the position is coded as a channel identifier, e.g., left-front or right-top surround; "channel-based audio" is audio formatted for playback through a predefined set of speaker zones with associated nominal positions, e.g., 5.1, 7.1, and so on; the term "object" or "object-based audio" means one or more audio channels with a parametric source description, such as apparent source position (e.g., 3D coordinates), apparent source width, etc.; "adaptive audio" means channel-based and/or object-based audio signals plus metadata that render the audio signals based on the playback environment, using an audio stream plus metadata in which the position is coded as a 3D position in space; and "listening environment" means any open, partially enclosed or fully enclosed area, such as a room, that can be used to play back audio content alone or with video or other content, and that may be embodied in a home, cinema, theater, auditorium, studio, game console and the like. Such an area may have one or more surfaces disposed therein, such as walls or baffles, that can directly or diffusely reflect sound waves.
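The terms defined above can be summarized with a small illustrative data model; the class and field names are assumptions chosen for exposition, not part of the specification.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Channel:
    """Audio signal plus metadata whose position is coded as a channel identifier."""
    samples: list
    channel_id: str  # e.g. "left front", "top right surround"

@dataclass
class AudioObject:
    """One or more audio channels with a parametric source description."""
    samples: list
    position: tuple                # apparent source position, e.g. (x, y, z)
    width: Optional[float] = None  # apparent source width, if specified

@dataclass
class AdaptiveAudioProgram:
    """Channel-based and/or object-based signals plus positional metadata."""
    beds: list     # Channel instances (channel-based submixes)
    objects: list  # AudioObject instances rendered from 3D positions
```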
Adaptive Audio Formats and Systems
In one embodiment, the present interconnect system is implemented as part of a sound format and processing system that may be referred to as a "spatial audio system" or "adaptive audio system". Such a system is based on an audio format and rendering technology that allows enhanced audience immersion, greater artistic control, and system flexibility and scalability. The overall adaptive audio system generally comprises an audio encoding, distribution and decoding system configured to generate one or more bitstreams containing both conventional channel-based audio elements and audio object coding elements. Such a combined approach provides greater coding efficiency and rendering flexibility than either a channel-based or an object-based approach taken separately. An example of an adaptive audio system that may be used in conjunction with the present embodiments is described in US Provisional Application No. 61/636,429, filed on August 20, 2012 and entitled "System and Method for Adaptive Audio Signal Generation, Encoding and Rendering", the contents of which are hereby incorporated by reference.
An exemplary implementation of the adaptive audio system and the associated audio format is
the Dolby Atmos platform. Such a system incorporates a height (up/down) dimension that may be implemented as a 9.1 surround system or a similar surround-sound configuration. FIG. 1 shows the speaker placement in a current surround system (e.g., 9.1 surround) that provides height speakers for reproduction of height channels. The speaker configuration of the 9.1 system 100 consists of five speakers 102 in the floor plane and four speakers 104 in the height plane. In general, these speakers may be used to produce sound designed to emanate, more or less accurately, from any position within the room. Predefined speaker configurations, such as the one shown in FIG. 1, naturally limit the ability to accurately represent the position of a given sound source. For example, a sound source cannot be panned further left than the left speaker itself. This is true of every speaker, and thus the speakers form a one-dimensional (e.g., left-right), two-dimensional (e.g., front-back) or three-dimensional (e.g., left-right, front-back, up-down) geometric shape within which the downmix is constrained. A variety of different speaker
configurations and types may be used in such speaker configurations. For example, certain
enhanced audio systems may use speakers in 9.1, 11.1, 13.1, 19.4 or other configurations.
Speaker types may include full range direct speakers, speaker arrays, surround speakers,
subwoofers, tweeters and other types of speakers.
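The panning constraint noted above (a phantom source cannot be placed beyond the outermost speakers) can be illustrated with a toy sketch; the 9.1 channel labels below are common conventions assumed for illustration, not taken from this specification.

```python
# Hypothetical 9.1 layout: five floor-plane speakers, four height-plane speakers,
# plus one subwoofer (the ".1").
FLOOR = ["L", "C", "R", "Ls", "Rs"]
HEIGHT = ["Ltf", "Rtf", "Ltr", "Rtr"]

def clamp_pan(x: float, speaker_min: float = -1.0, speaker_max: float = 1.0) -> float:
    """A phantom source is confined to the geometry spanned by the speakers:
    positions beyond the outermost speaker collapse onto that speaker."""
    return max(speaker_min, min(speaker_max, x))

assert len(FLOOR) + len(HEIGHT) == 9  # the "9" in 9.1
```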
An audio object can be thought of as a group of sound elements that can be perceived as
originating from one or more specific physical locations in the listening environment. Such
objects can be static (i.e. stationary) or dynamic (i.e. moving). Audio objects are controlled by
metadata that, together with other functions, define the position of the sound at a given point in
time. When the object is played back, the object is not necessarily output to a predefined physical
channel, but is rendered using the existing speakers according to the location metadata. Tracks in a session can be audio objects, and standard panning data is analogous to positional metadata. In this way, content placed on the screen may pan in effectively the same way as channel-based content, but content placed in the surrounds can be rendered to an individual speaker, if desired. While the use of audio objects provides the desired control of discrete effects, other aspects of a soundtrack may work better in a channel-based environment. For example, many ambient effects or reverberations actually benefit from being fed to arrays of speakers. Although these could be treated as objects with sufficient width to fill an array, it is beneficial to retain some channel-based functionality.
An adaptive audio system is configured to support "beds" in addition to audio objects. Here, the
bed is, in effect, a channel based submix or stem. These can be delivered individually or in a
single bed for ultimate playback (rendering), depending on the content creator's intent. These
beds can be created in different channel-based configurations, such as 5.1, 7.1 and 9.1, and in arrays that include overhead speakers, as shown in FIG. 1. FIG. 2 illustrates the combination of
channel and object based data to generate adaptive audio mixing under an embodiment. As
shown in process 200, channel-based data 202, which may be 5.1 or 7.1 surround sound data,
for example provided in the form of pulse-code modulated (PCM) data, is combined with audio object data 204 to produce an adaptive audio mix 208. The audio object data 204 is
generated by combining elements of the original channel based data with associated metadata
specifying certain parameters related to the position of the audio object. As conceptually shown
in FIG. 2, the authoring tool provides the ability to generate an audio program that
simultaneously includes a combination of speaker channel groups and object channels. For
example, the audio program may include one or more speaker channels optionally organized into groups (or tracks, e.g., stereo or 5.1 tracks), descriptive metadata for the one or more speaker channels, one or more object channels, and descriptive metadata for the one or more object channels.
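The combination of a channel bed with object data to produce the adaptive mix, as shown in FIG. 2, can be sketched as follows; the function name and dictionary keys are illustrative assumptions.

```python
def make_adaptive_mix(bed_channels: dict, audio_objects: list) -> dict:
    """Combine a channel-based bed (e.g. 5.1/7.1 PCM stems keyed by channel
    name) with positional audio objects into one adaptive audio program,
    mirroring elements 202 + 204 -> 208 of FIG. 2."""
    return {
        "beds": bed_channels,                       # played through nominal speaker zones
        "objects": [
            {"audio": audio, "metadata": metadata}  # rendered from position metadata
            for audio, metadata in audio_objects
        ],
    }

mix = make_adaptive_mix(
    {"L": [0.0], "R": [0.0]},
    [([0.1, 0.2], {"position": (0.5, 0.5, 1.0)})],
)
```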
The adaptive audio system effectively moves beyond simple "speaker feeds" as a means of distributing spatial audio, and advanced model-based audio descriptions have been developed that give the listener the freedom to choose a playback configuration that suits their individual needs or budget, with the audio rendered specifically for that chosen configuration. At a high level, there are four main spatial audio description formats: (1) speaker feed, in which the audio is described as signals intended for loudspeakers located at nominal speaker positions; (2) microphone feed, in which the audio is described as signals captured by real or virtual microphones in a predefined configuration (the number of microphones and their relative positions); (3) model-based description, in which the audio is described in terms of a sequence of audio events at described times and positions; and (4) binaural, in which the audio is described by the signals arriving at the listener's two ears.
These four description formats are often associated with the following common rendering technologies, where the term "rendering" means conversion to the electrical signals used as speaker feeds: (1) panning, in which the audio stream is converted to speaker feeds using a set of panning laws and known or assumed speaker positions (typically rendered prior to distribution); (2) Ambisonics, in which the microphone signals are converted to feeds for a scalable array of loudspeakers (typically rendered after distribution); (3) Wave Field Synthesis (WFS), in which sound events are converted to the appropriate speaker signals to synthesize the sound field (typically rendered after distribution); and (4) binaural, in which the L/R binaural signals are delivered to the L/R ears, typically through headphones, but also through speakers in conjunction with crosstalk cancellation.
In general, any format can be converted to any other format (although this may require blind source separation or similar techniques) and rendered using any of the aforementioned techniques; in practice, however, not all transformations yield good results. The speaker feed
format is the most common because it is simple and effective. The best sound results (i.e. the
most accurate and reliable) are achieved by mixing / monitoring and then delivering the speaker
feed. That is because no processing is required between the content creator and the listener.
While the speaker feed description provides the highest fidelity if the playback system is known
in advance, the playback system and its configuration are often not known in advance. In
contrast, model-based descriptions are the most adaptable as they make no assumptions about
the playback system and are therefore most easily applied to fit multiple rendering techniques.
Model-based descriptions can efficiently capture spatial information, but become very inefficient
as the number of audio sources increases.
Adaptive audio systems combine the benefits of both channel-based and model-based systems. Specific benefits include: high sound quality and optimal reproduction of artistic intent when mixing and rendering with the same channel configuration; a single inventory item with downward adaptation to the rendering configuration; relatively low impact on the system pipeline; and increased immersion via finer horizontal speaker spatial resolution and new height channels. The adaptive audio system also provides several new features, including: a single inventory item with downward and upward adaptation to a specific cinema rendering configuration, i.e., delayed rendering and optimal use of the speakers available in the playback environment; enhanced envelopment, including optimized downmixing to avoid inter-channel correlation (ICC) artifacts; increased spatial resolution via steer-thru arrays (e.g., allowing an audio object to be dynamically assigned to one or more loudspeakers within a surround array); and increased front-channel resolution via a high-resolution center or similar speaker configuration.
The spatial effects of the audio signal are crucial in providing an immersive experience for the
listener. Sounds intended to emanate from a particular area of the viewing screen or room should
be reproduced through the loudspeaker (s) located at the same relative position. Thus, the
primary audio metadata for sound events in model-based descriptions is position. However, other
parameters such as size, orientation, velocity and acoustic dispersion can also be described. In
order to convey position, model-based 3D audio spatial description requires a 3D coordinate
system. The coordinate system used for transmission (e.g. Euclidean, spherical, cylindrical) is
generally chosen for convenience or compactness, but other coordinate systems may be used for
the rendering process. In addition to the coordinate system, a reference frame is required to
represent the position of the object in space. In order for the system to accurately reproduce
position-based sound in a variety of different environments, selection of the appropriate
reference frame may be critical. In an other-centered reference frame, audio source positions are defined relative to features of the rendering environment, such as the room walls and corners, standard speaker locations, and the screen location. In an egocentric (self-centered) reference frame, positions are represented with respect to the listener's point of view, such as "in front of me" or "slightly to the left."
Scientific studies of spatial perception (hearing and others) indicate that the egocentric point of
view is almost universally used. However, for cinemas, other-centred reference frames are
generally more appropriate. For example, the precise location of an audio object is most
important when there is an associated object on the screen. When using other-centred
references, the sound is localized at the same relative position on the screen, eg, "one-third from
the center of the screen", for all listening positions and for any screen size. Another reason is that
mixers tend to think and mix with others, pan tools are laid out with others centered frames (ie
room walls) and mixers are rendered that way For example, expect "this sound should be on the
screen", "this sound should be off the screen" or "from the left wall" and so on.
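The transmission coordinate systems mentioned above (e.g., spherical for compactness, Euclidean/Cartesian for rendering) can be illustrated with the standard conversion below; this is textbook geometry under assumed angle conventions, not a formula from this specification.

```python
import math

def spherical_to_cartesian(azimuth_deg: float, elevation_deg: float, radius: float):
    """Convert a spherical source position (as might be transmitted) to the
    Cartesian coordinates a renderer might use internally.

    Assumed conventions: azimuth 0 = straight ahead, positive to the left;
    elevation 0 = ear level, positive upward.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = radius * math.cos(el) * math.sin(az)  # left-right
    y = radius * math.cos(el) * math.cos(az)  # front-back
    z = radius * math.sin(el)                 # up-down
    return x, y, z
```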
Despite the use of other-centered reference frames in the cinema environment, there are several cases in which a self-centered reference frame may be useful and more appropriate. These include non-diegetic sounds, i.e., sounds that are not present in the "story space", such as mood music, for which a self-centered, uniform presentation may be desirable. Another case is near-field effects (e.g., a mosquito buzzing in the listener's left ear) that require a self-centered representation. Furthermore, infinitely distant sound sources (and the resulting plane waves) appear to come from a constant self-centered position (e.g., 30 degrees to the left), and such sounds are easier to describe in self-centered terms than in other-centered terms. In some cases, an other-centered reference frame can be used as long as a nominal listening position is defined, while some examples require a self-centered representation that cannot yet be rendered. Although other-centered references may be more useful and appropriate, the audio description should be extensible, since many new features, including self-centered representations, may be more desirable in certain applications and listening environments.
Embodiments of the adaptive audio system use a hybrid spatial description approach that includes a recommended channel configuration, for optimal fidelity and for rendering diffuse or complex multi-point sources (e.g., stadium crowds, ambience) using self-centered references, together with an other-centered, model-based sound description that allows efficient spatial resolution and scalability. The system of FIG. 3 includes processing blocks that perform legacy, object and channel audio decoding, object rendering, channel remapping and signal processing before the audio is sent to the post-processing and/or amplification and speaker stages.
The playback system 300 is configured to render and play audio content generated through one
or more capture, preprocessing, authoring and encoding components. The adaptive audio preprocessor may include source separation and content type detection functionality. This
automatically generates the appropriate metadata through analysis of the input audio. For
example, location metadata may be derived from multi-channel recordings through analysis of
relative levels of correlated inputs between channel pairs. Detection of content types, such as
speech or music, may be achieved, for example, by feature extraction and classification. Some
authoring tools allow audio programs to be authored by optimizing the input and codification of the sound engineer's creative intent, allowing the engineer to create the final audio mix once, optimized for playback in virtually any playback environment. This is accomplished through the use of audio objects and positional data that are associated and encoded with the original audio content. In order to accurately place
the sound around the listening space, the sound engineer needs control over how the sound is
finally rendered, based on the actual constraints and features of the playback environment. The
adaptive audio system provides this control by allowing the sound engineer to change how audio
content is designed and mixed through the use of audio objects and position data. Once the
adaptive audio content is authored and encoded at the appropriate codec device, it is decoded
and rendered at various components of the playback system 300.
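The derivation of location metadata from the relative levels of correlated channel pairs, mentioned above, can be sketched for a single stereo pair; this crude level-difference estimate is an illustrative assumption, not the preprocessor's actual analysis.

```python
import math

def estimate_pan_from_levels(left: list, right: list) -> float:
    """Estimate a source's left-right position from the relative levels of a
    correlated channel pair, returning -1.0 (full left) .. +1.0 (full right).

    A single-pair illustration of deriving positional metadata from a
    multichannel recording; real preprocessors would also use correlation.
    """
    rms = lambda xs: math.sqrt(sum(x * x for x in xs) / len(xs)) if xs else 0.0
    l, r = rms(left), rms(right)
    if l + r == 0.0:
        return 0.0  # silence: default to center
    return (r - l) / (r + l)
```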
As shown in FIG. 3, (1) legacy surround-sound audio 302, (2) object audio including object metadata 304, and (3) channel audio including channel metadata 306 are input to decoder stages 308, 309 within processing block 310. The object metadata is rendered in object renderer 312, while the channel metadata may be remapped as necessary. Room configuration information 307 is provided to the object renderer and to the channel remapping component. The hybrid audio data is then processed through one or more signal processing stages, such as equalizer and limiter 314, prior to playback through B-chain processing stage 316 and the speakers. System 300 represents an example of a playback system for adaptive audio; other configurations, components and interconnections are also possible.
Playback Application
As mentioned above, an initial implementation of the adaptive audio format and system is in the digital cinema (D-cinema) context, which includes content capture (objects and channels) that is authored using novel authoring tools, packaged using an adaptive audio cinema encoder, and distributed using PCM or a proprietary lossless codec via the existing Digital Cinema Initiative (DCI) distribution mechanism.
In this case, the audio content is intended to be decoded and rendered in a digital cinema to create an immersive spatial audio cinema experience. However, as with previous cinema improvements such as analog surround sound and digital multi-channel audio, there is an imperative to deliver the enhanced user experience provided by the adaptive audio format directly to consumers in their homes. This requires that certain characteristics of the format and system be adapted for use in more limited listening environments. For example, homes, rooms, small listening spaces or similar locations may have reduced space, acoustic properties and equipment capabilities compared with a cinema or theater environment. For purposes of the present description, the term "consumer-based environment" is intended to include any non-cinema listening environment used by ordinary consumers or professionals, such as a house, studio, room, console area, listening space and so on. The audio content may be sourced and rendered standalone, or it may be associated with graphic content, e.g., still images, lighting, video and the like.
FIG. 4A is a block diagram illustrating functional components for adapting movie theater based
audio content for use in a consumer environment, under an embodiment. As shown in FIG. 4A,
cinema content, which typically includes movie soundtracks, is captured and / or authored at
block 402 using appropriate equipment and tools. In the adaptive audio system, this content is
processed at block 404 through encoding / decoding and rendering components and interfaces.
The resulting object and channel audio feeds are then sent to the appropriate speakers in the
cinema or theater (406). In system 400, the cinema content is also processed for playback in a
consumer environment, such as a home theater system 416. Due to limited space, a reduced
number of speakers, and so on, the consumer environment may not have the capacity or
capability to reproduce all of the sound content as intended by the content creator. However,
embodiments are directed to systems and methods that allow the original audio content to be
rendered in a manner that minimizes the constraints imposed by the reduced capability of the
consumer environment, and that allow positional cues to be processed in a manner that
maximizes the available equipment. As shown in FIG. 4A, the movie theater audio content is
processed through a movie theater-to-consumer converter component 408, where it is processed
through the consumer content encoding and rendering chain 414. This chain also processes
original consumer audio content that is captured and/or authored at block 412. The original
consumer content and/or converted cinema content is then played back in the consumer
environment 416. In this way, even with the potentially limited speaker configuration of the
home or consumer environment 416, the spatial information encoded in the audio content can
be used to render the sound in a more immersive manner.
FIG. 4B shows the components of FIG. 4A in greater detail, illustrating an exemplary delivery
mechanism for adaptive audio cinema content through the consumer ecosystem. As shown in
drawing 420, original cinema and TV content is captured 422 and authored 423 for playback in a
variety of different environments to provide a cinema experience 427 or a consumer
environment experience 434. Similarly, certain user-generated content (UGC) or consumer
content may be captured 424 and authored 425 for playback in the consumer environment 434.
Cinema content intended for playback in the cinema environment 427 is processed through a
known cinema processor 426. However, in system 420, the output of the cinema authoring
toolbox 423 also consists of audio objects, audio channels and metadata that convey the artistic
intent of the sound mixer. This output can be thought of as a mezzanine-style audio package that
can be used to create multiple versions of the cinema content for consumer playback. In one
embodiment, this functionality is provided by a cinema-to-consumer adaptive audio converter
430. This converter takes the adaptive audio content as input and extracts from it the audio and
metadata content appropriate for the desired consumer endpoint 434. The converter produces
separate, and possibly different, audio and metadata outputs depending on the consumer
delivery mechanism and endpoint.
As shown in the example of system 420, the cinema-to-consumer converter 430 feeds the
sound-for-picture (e.g., broadcast, disc, OTT, etc.) and game audio bitstream creation modules
428. These two modules, which are suitable for delivering cinema content, can feed multiple
delivery pipelines 432, all of which may deliver to the consumer endpoints. For example,
adaptive audio cinema content may be encoded using a codec suitable for broadcast purposes,
such as Dolby Digital Plus, which may be modified to convey channels, objects and associated
metadata; the content is then transmitted through the broadcast chain via cable or satellite, and
then decoded and rendered in the consumer's home for home theater or television playback.
Similarly, the same content could be encoded using a codec suitable for online delivery where
bandwidth is limited, in which case it is transmitted over a 3G or 4G mobile network and then
decoded and rendered for playback via a mobile device using headphones. Other content sources
such as TV, live broadcast, games and music may also use the adaptive audio format to create
and provide content for next-generation consumer audio formats.
The system of FIG. 4B provides an enhanced user experience throughout the entire consumer
audio ecosystem, which may include home theater (e.g., A/V receiver, soundbar and Blu-ray),
E-media (e.g., PC, tablet, mobile including headphone playback), broadcast (e.g., TV and set-top
box), music, gaming, live sound, user-generated content, and so on. Such a system provides:
enhanced immersion for the consumer audience across all endpoint devices; expanded artistic
control for audio content creators; improved content-dependent (content-descriptive) metadata
for improved rendering; expanded flexibility and scalability for consumer playback systems;
preservation and matching of audio quality; and opportunities for dynamic rendering of content
based on user position and interaction. The system includes new mixing tools for content
creators, updated and new packaging and encoding tools for delivery and playback, in-home
dynamic mixing and rendering (appropriate for different consumer configurations), and
additional speaker locations and designs.
The consumer-based adaptive audio ecosystem is configured to be a completely inclusive,
end-to-end, next-generation audio system using an adaptive audio format that encompasses
content creation, packaging, delivery and playback/rendering across a wide number of endpoint
devices and use cases. As shown in FIG. 4B, the system originates with content captured from
and for several different use cases 422 and 424. These capture points include all relevant
content formats, including cinema, TV, live broadcast (and sound), UGC, games and music. The
content passes through several key phases as it travels through the ecosystem. These phases
include: pre-processing and authoring tools; conversion tools (i.e., conversion of adaptive audio
content for cinema to consumer content delivery applications); packaging/bitstream encoding of
the specific adaptive audio content (which captures the audio essence data as well as additional
metadata and audio reproduction information); delivery encoding using existing or new codecs
(e.g., DD+, TrueHD, Dolby Pulse) for efficient delivery through the various consumer audio
channels; transmission through the relevant consumer delivery channels (e.g., broadcast, disc,
mobile, Internet, etc.); and, finally, dynamic rendering at an endpoint that is aware of the spatial
audio experience, in order to reproduce and convey the adaptive audio user experience defined
by the content creator. The consumer-based adaptive audio system can render for a widely
varying number of consumer endpoints, and the rendering techniques that are applied can be
optimized depending on the endpoint device. For example, home theater systems and soundbars
may have two, three, five, seven or even nine separate speakers in various locations. Many other
types of systems have only two speakers (e.g., TV, laptop, music dock), and nearly all commonly
used devices have a headphone output (e.g., PC, laptop, tablet, mobile phone, music player, etc.).
Current authoring and delivery systems for consumer audio create and deliver audio that is
intended for reproduction at predefined, fixed speaker locations, with limited knowledge of the
type of content conveyed in the audio essence (i.e., the actual audio that is played back by the
consumer playback system). The adaptive audio system, however, provides a new, hybrid
approach to audio creation that includes the option of both fixed-speaker-location-specific audio
(left channel, right channel, etc.) and object-based audio elements that have generalized 3D
spatial information including position, size and velocity. This hybrid approach provides a
balanced approach for fidelity (provided by fixed speaker locations) and flexibility in rendering
(generalized audio objects). The system also provides additional useful information about the
audio content via new metadata that is paired with the audio essence by the content creator at
the time of content creation/authoring. This information provides detailed information about the
attributes of the audio that can be used during rendering. Such attributes include content type
(e.g., dialog, music, effects, sound effects, background/ambience, etc.) as well as audio object
information such as spatial attributes (e.g., 3D position, object size, velocity, etc.) and useful
rendering information (e.g., snap to speaker location, channel weights, gain, bass management
information, etc.). The audio content and reproduction intent metadata can either be created
manually by the content creator or created through the use of automatic media intelligence
algorithms that run in the background during the authoring process; in the latter case, the
metadata can be reviewed by the content creator during a final quality control phase.
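The per-object metadata attributes described above can be pictured as a simple record attached to each audio object. The following Python sketch is purely illustrative; the class and field names are hypothetical and do not correspond to the actual adaptive audio bitstream syntax:

```python
from dataclasses import dataclass, field

# Illustrative per-object metadata record; all field names are hypothetical
# and do not correspond to any actual bitstream element.
@dataclass
class ObjectMetadata:
    content_type: str                        # e.g. "dialog", "music", "effects", "ambience"
    position: tuple = (0.0, 0.0, 0.0)        # spatial attribute: normalized 3D position
    velocity: tuple = (0.0, 0.0, 0.0)        # spatial attribute: object motion
    size: float = 0.0                        # apparent object size; 0 = point source
    snap_to_speaker: bool = False            # rendering hint: use nearest speaker
    channel_weights: dict = field(default_factory=dict)  # rendering hint
    gain_db: float = 0.0                     # gain / bass management information

meta = ObjectMetadata(content_type="dialog", position=(0.5, 1.0, 0.0),
                      snap_to_speaker=True)
print(meta.content_type, meta.snap_to_speaker)
```

A renderer could inspect such a record at playback time to decide, for example, whether to pan an object between speakers or snap it to a single nearby speaker.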
FIG. 4C is a block diagram of functional components of a consumer based adaptive audio
environment, under an embodiment. As shown in drawing 450, the system processes the
encoded bitstream 452, which carries both the hybrid object and the channel based audio
stream. The bitstream is processed by the rendering / signal processing block 454. In one
embodiment, at least a portion of this functional block may be implemented in the rendering
block 312 shown in FIG. 3. The rendering function 454 implements various rendering algorithms for
adaptive audio and certain post-processing algorithms such as up-mixing, processing of direct
versus reflected sound, and the like. The output from the renderer is provided to the speaker 458
through a bi-directional interconnect 456. In one embodiment, the speaker 458 has several
separate drivers that may be arranged in surround sound or similar configuration. The drivers
are individually addressable and may be embodied in individual enclosures or multiple driver
cabinets or arrays. System 450 may also include a microphone 460 that provides measurements
of room characteristics that can be used to calibrate the rendering process. System configuration
settings and calibration functions are provided at block 462. These functions may be included as
part of the rendering component or may be implemented as a separate component functionally
coupled to the renderer. The bi-directional interconnect 456 provides a feedback signal path
from the speaker environment (listening room) back to the calibration component 462.
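The two logical channels of the bi-directional interconnect can be sketched as follows. This is a minimal illustrative model, assuming simple message queues for the downstream (renderer-to-speaker) path and the upstream (microphone-to-calibration) feedback path; the class and method names are hypothetical:

```python
from collections import deque

# Illustrative model of bi-directional interconnect 456: audio and
# configuration data flow downstream to the drivers, while microphone
# (calibration) data flows upstream to calibration component 462.
class BidirectionalInterconnect:
    def __init__(self):
        self.downstream = deque()   # renderer -> speakers (audio + config)
        self.upstream = deque()     # microphone -> calibration component

    def send_to_speakers(self, audio_frame, config=None):
        self.downstream.append({"audio": audio_frame, "config": config})

    def send_to_calibration(self, mic_samples):
        self.upstream.append(mic_samples)

link = BidirectionalInterconnect()
link.send_to_speakers([0.1, -0.2], config={"driver": "upper"})
link.send_to_calibration([0.01, 0.02])
print(len(link.downstream), len(link.upstream))
```

The essential point captured here is simply that the same physical link carries traffic in both directions, closing the loop between rendering and calibration.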
<Distributed/Centralized Rendering> In one embodiment, renderer 454 comprises functional
processes embodied in a central processor associated with the network. Alternatively, the
renderer may comprise functional processes embodied at least in part in circuits located within,
or coupled to, each driver of the array of individually addressable audio drivers. In the case of a centralized
process, rendering data is sent to the individual drivers in the form of audio signals sent through
the individual audio channels. In the case of distributed processing, the central processor may
not perform rendering, or may perform at least some partial rendering of audio data, and final
rendering may be performed at the driver. In this case, powered speakers / drivers are needed to
enable on-board processing functions. One exemplary implementation is the use of a speaker
with an integrated microphone. Here, rendering is adapted based on the microphone data and
adjustments are made at the speaker itself. This eliminates the need to send the microphone
signal back to the central renderer for calibration and / or configuration purposes.
FIG. 4D shows, under an embodiment, a distributed rendering system in which part of the
rendering function is implemented in the speaker unit. As shown in drawing 470, the encoded bit
stream 471 is input to a signal processing stage 472 that includes partial rendering components.
The partial renderer may perform any suitable proportion of the rendering functionality, from
no rendering at all up to 50%, 75%, and so on. The original encoded bitstream or partially rendered
bitstream is then transmitted to the speaker 472 through the interconnect 476. In this
embodiment, the speaker is a self-powered unit that includes a driver and a direct power
connection or on-board battery. Speaker unit 472 also includes one or more integrated
microphones. A renderer and an optional calibration function 474 are also integrated into the
speaker unit 472. The renderer 474 performs a final or complete rendering operation on the
encoded bitstream, depending on how much rendering should have been performed by the
partial renderer 472. In a fully distributed implementation, the speaker calibration unit 474 may
perform calibration directly to the speaker driver 472 using sound information generated by the
microphone. In this case, interconnect 476 may simply be a one-way interconnect. In an
alternative or partially distributed implementation, the integrated or other microphones may
provide sound information back to an optional calibration unit 473 associated with the signal
processing stage 472, in which case interconnect 476 is a bi-directional interconnect.
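The centralized/distributed split described above can be sketched as a division of a rendering pipeline between the central signal processing stage and the speaker-side renderer. The stage names and the fraction-based split below are illustrative assumptions, not the actual partitioning used by the system:

```python
# Illustrative split of rendering work between a central partial renderer and
# a renderer integrated into the speaker unit. Stage names are hypothetical.
def render_pipeline(bitstream, central_fraction):
    stages = ["decode", "object_pan", "eq", "driver_mix"]
    n_central = round(central_fraction * len(stages))
    central = stages[:n_central]        # performed by the central processor
    speaker_side = stages[n_central:]   # performed in the speaker unit
    return central, speaker_side

# Fully distributed: no central rendering, the speaker does everything.
print(render_pipeline("bits", 0.0))
# Partial (50%) central rendering, remainder finished at the speaker.
print(render_pipeline("bits", 0.5))
```

With `central_fraction = 1.0` the sketch reduces to the fully centralized case, in which only finished per-driver audio signals travel over the interconnect.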
<Listening Environment> Implementations of the adaptive audio system are intended to be
deployed in a variety of different environments. These include three primary application areas:
full cinema or home theater systems, televisions and soundbars, and headphones. FIG. 5
illustrates the deployment of the adaptive audio system in an exemplary cinema or home theater
environment. The system of FIG. 5 represents a superset of the components and functions that
may be provided by the adaptive audio system; certain aspects may be reduced or removed
based on the needs of the user while still providing an enhanced experience. System 500 includes
various different speakers and drivers in a variety of different cabinets or arrays 504. The
speakers include individual drivers that provide dynamic virtualization of audio using forward,
side and upper launch options as well as certain audio processing techniques. Drawing 500
shows several speakers deployed in a standard 9.1 speaker configuration. These include left and
right height speakers (LH, RH), left and right speakers (L, R), a center speaker (shown as a
modified center speaker), and left and right surround and back speakers (LS, RS, LB and RB; the
low-frequency effect (LFE) speaker is not shown).
FIG. 5 illustrates the use of a central channel speaker 510 used in a central location of a room or
theater. In one embodiment, the speaker is implemented using a modified center channel or high
resolution center channel 510. Such a speaker may be a front-firing center channel array with
individually addressable speakers that allows discrete panning of audio objects through the
array to match the motion of video objects on the screen. Such a speaker may also be
embodied as a high-resolution center channel (HRC) speaker as described in International
Application No. PCT / US2011 / 028783, which is incorporated herein by reference. The HRC
speaker 510 may also include side-fired speakers as shown. These can be activated and used
when the HRC speaker is used not only as a central speaker but also as a speaker with sound bar
function. The HRC speakers may be incorporated on and / or laterally of the screen 502 to
provide a two-dimensional, high-resolution pan option for audio objects. The center speaker 510
may also include additional drivers and may implement steerable sound beams with separately
controlled sound zones.
System 500 also includes near field effect (NFE) speakers 512, which may be located directly in
front of or near the front of the listener. With adaptive audio, it is possible to bring an audio
object into a room rather than simply locking the audio object to the edge of the room. Therefore,
moving the object through three-dimensional space is an option. In one example, the object may
originate at the L speaker, travel across the room through the NFE speaker, and terminate at the
RS speaker. A variety of different speakers, such as wireless, battery-powered speakers, may be
suitable for use as NFE speakers.
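The trajectory in the example above (originating at the L speaker, crossing the room through the NFE speaker, and terminating at the RS speaker) can be sketched as a time-varying set of speaker gains. The simple linear crossfade below is purely illustrative and is not the panning law of the actual system:

```python
# Illustrative gain trajectory for an object moving L -> NFE -> RS.
# A linear crossfade between adjacent speakers is assumed for simplicity.
def trajectory_gains(t):
    """t in [0, 1]; returns (L, NFE, RS) gains along the path."""
    if t <= 0.5:                      # first segment: L toward NFE
        a = t / 0.5
        return (1.0 - a, a, 0.0)
    a = (t - 0.5) / 0.5               # second segment: NFE toward RS
    return (0.0, 1.0 - a, a)

for t in (0.0, 0.5, 1.0):
    print(trajectory_gains(t))
```

At t = 0 the object is entirely in the L speaker, at t = 0.5 entirely in the NFE speaker, and at t = 1 entirely in the RS speaker, with intermediate positions blended between the two nearest speakers.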
FIG. 5 illustrates the use of dynamic speaker virtualization to provide an immersive user
experience in a listening environment. Dynamic speaker virtualization is enabled through
dynamic control of speaker virtualization algorithm parameters based on object spatial
information provided by the adaptive audio content. This dynamic virtualization is shown for the
L and R speakers in FIG. 5, where it is natural to think of it as creating the perception of an
object moving along the side of the room. A separate virtualizer may be used for each relevant
object, and the combined signals may be sent to the L and R speakers to create a multi-object
virtualization effect. Dynamic virtualization effects are also shown for the NFE speaker, which is
intended to be a stereo speaker (with two independent inputs), as well as for the L and R
speakers. This speaker can be used to create a
diffuse or point source near field audio experience, along with audio object size and position
information. Similar virtualization effects can be applied to any or all of the other speakers in the
system. In one embodiment, a camera may provide additional listener position and identity
information that can be used by the adaptive audio renderer to provide a more compelling
experience that is more faithful to the artistic intent of the mixer.
The adaptive audio renderer understands the spatial relationship between the mixing and
playback system. In some examples of playback environments, as shown in FIG. 1, discrete
speakers may be available in all relevant areas of the room, including overhead locations. In
these cases where discrete speakers are available at certain locations, the renderer can be
configured to "snap" objects to the closest speakers instead of creating phantom images between
two or more speakers through the use of panning or speaker virtualization algorithms. While
this slightly distorts the spatial representation of the mix, it also allows the renderer to avoid
unintended phantom images. For example, if the angular position of the left speaker on the
mixing stage does not correspond to the angular position of the left speaker in the playback
system, enabling this function would avoid having a constant phantom image of the original left
channel.
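The snap behavior described above can be sketched as a simple decision rule: if a discrete speaker lies within some angular threshold of the object's intended position, the object is rendered at that speaker rather than as a phantom image. The threshold value and function names below are illustrative assumptions:

```python
# Illustrative "snap to speaker" decision. Angles are in degrees, measured
# around the listening position; the 10-degree threshold is an assumption.
def render_target(object_angle_deg, speaker_angles_deg, snap_threshold_deg=10.0):
    nearest = min(speaker_angles_deg, key=lambda a: abs(a - object_angle_deg))
    if abs(nearest - object_angle_deg) <= snap_threshold_deg:
        return ("snap", nearest)          # feed the single closest speaker
    return ("phantom", object_angle_deg)  # pan between speakers instead

print(render_target(-28.0, [-30.0, 0.0, 30.0]))   # near the L speaker: snap
print(render_target(-15.0, [-30.0, 0.0, 30.0]))   # between speakers: phantom
```

Snapping trades a small positional error for the elimination of an unintended phantom image when the playback layout does not match the mixing stage.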
In many cases, some speakers, such as ceiling-mounted overhead speakers, are not available. In
this case, some virtualization techniques are implemented by the renderer to reproduce overhead
audio content through existing floor or wall mounted speakers. In one embodiment, the adaptive
audio system includes modifications to the standard configuration through the inclusion of both
forward and top (or "up") launch functions for each speaker. In traditional home applications,
speaker manufacturers that have attempted to introduce driver configurations other than
forward-fired transducers have been faced with the problem of trying to identify which of the
original audio signals (or modifications to them) should be sent to these new drivers. In the
adaptive audio system, there is very specific information regarding which audio objects
should be rendered above a standard horizontal plane. In one embodiment, height information
present in the adaptive audio system is rendered using an upper launch driver.
Similarly, side-fired speakers can be used to render certain other content, such as environmental
effects. Side-fired drivers can also be used to render certain types of reflected content, such as
sounds that are reflected from the walls or other surfaces of the listening room.
One advantage of the upper launch driver is that it can be used to reflect sound from a hard
ceiling surface to simulate the presence of overhead / height speakers located on the ceiling. One
attractive attribute of adaptive audio content is that spatially diverse audio is rendered using an
array of overhead speakers. However, as mentioned above, installing overhead speakers is often
too expensive or impractical in a home environment. By simulating a height speaker with
speakers normally located in a horizontal plane, a convincing 3D experience can be created with
speakers that are easy to position. In this case, the adaptive audio system uses the upper launch
drivers in a novel way to simulate height speakers, in that audio objects and their spatial
reproduction information are used to create the audio played by the upper launch drivers. This
same advantage can be realized in an effort to provide a more immersive experience through the
use of side-emitting speakers that reflect sound from the walls to create a diffuse, reverberant
effect.
FIG. 6 illustrates the use of an upward launch driver that uses the reflected sound to simulate a
single overhead speaker in a home theater. It should be noted that any number of upper launch
drivers may be used in combination to create multiple simulated height speakers. Alternatively,
several upper launch drivers may be configured to transmit sound to substantially the same spot
on the ceiling to achieve some sort of sound intensity or effect. The drawing 600 shows an
example in which the normal listening position 602 is located at a specific position in the room.
The system does not include any height speaker for transmitting audio content including height
cues. Instead, the speaker cabinet or speaker array 604 includes an upper launch driver with
forward launch driver(s). The upper launch driver is configured (with respect to position and
tilt angle) to send its sound wave 606 to a particular point on the ceiling 608, from which the
sound wave is reflected back down to the listening position 602. It is assumed that the ceiling is
made of a suitable material and composition to adequately reflect sound down into the room.
Relevant characteristics (eg, size, power, position, etc.) of the upper launch driver may be
selected based on ceiling composition, room size and other relevant characteristics of the
listening environment. Although only one upper launch driver is shown in FIG. 6, in some
embodiments multiple upper launch drivers may be incorporated into the playback system.
Although FIG. 6 illustrates the use of an upper launch speaker, it should be noted that
embodiments are also directed to systems in which side launch speakers are used to reflect
sound from the walls of the room.
<Speaker Configuration> A primary consideration of the adaptive audio system is the speaker
configuration. The system utilizes individually addressable drivers, and an array of such drivers
is configured to provide a combination of both direct and reflected sound sources. A
bi-directional link to the system controller (e.g., A/V receiver, set-top box) allows audio and
configuration data to be sent to the speakers, and speaker and sensor information to be sent
back to the controller, creating an active, closed-loop system.
For the purpose of description, the term "driver" means a single electroacoustic transducer that
produces sound in response to an electrical audio input signal. The driver may be implemented in
any suitable type, geometry and size, and may include horns, cones, ribbon transducers and the
like. The term "speaker" means one or more drivers in a single enclosure. FIG. 7A shows a
speaker with multiple drivers in a first configuration, under an embodiment. As shown in FIG. 7A,
the speaker enclosure 700 has several individual drivers mounted in the enclosure. Typically, the
enclosure includes one or more forward launch drivers 702, such as a woofer, mid-range speaker
or tweeter or any combination thereof. One or more side launch drivers 704 may also be
included. The forward and side launch drivers are typically mounted flush with the face of the
enclosure and project sound perpendicularly outward from the vertical plane defined by the
speaker. These drivers are usually permanently fixed within the cabinet 700. For adaptive audio systems with
reflected sound rendering capabilities, one or more upwardly inclined drivers 706 are also
provided. These drivers are positioned to project sound at an upward angle toward the ceiling,
allowing the sound to bounce back down to the listener, as shown in FIG. 6. The degree of tilt
may be set depending on room characteristics and system requirements. For example, the upper
driver 706 may be tilted upward between 30 and 60 degrees, and may be positioned above the
forward launch driver 702 in the speaker enclosure 700 so as to minimize interference with the
sound waves generated by the forward launch driver 702. The upper launch
driver 706 may be installed at a fixed angle or may be installed such that the tilt angle may be
manually adjusted. Alternatively, a servo mechanism may be used to allow automatic or electrical
control of the tilt angle and projection direction of the upper launch driver. For certain sounds,
such as environmental sounds, the upper launch driver may be directed directly above the top
surface of the speaker enclosure 700 to create what may be referred to as a "top launch" driver.
In this case, a large percentage of the sound may be reflected back onto the speaker, depending
on the acoustic characteristics of the ceiling. In most cases, however, some tilt angle is typically
used, as shown in FIG. 6, to help project the sound, through reflection off the ceiling, to a
different or more central location within the room.
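The reflection geometry of FIG. 6 can be sketched with elementary trigonometry. Assuming a specular ("mirror image") bounce off a flat ceiling, the tilt angle that lands the reflection at the listening position follows from the equal angles of incidence and reflection; the dimensions used below are illustrative assumptions, in meters:

```python
import math

# Illustrative computation of the upward tilt angle for a ceiling bounce.
# Assumes a flat, specular ceiling; all dimensions are hypothetical.
def tilt_angle_deg(speaker_h, ceiling_h, listener_h, horiz_dist):
    up = ceiling_h - speaker_h        # vertical rise from driver to ceiling
    down = ceiling_h - listener_h     # vertical drop from ceiling to ears
    # Equal incidence/reflection angles: the bounce point splits the
    # horizontal distance in proportion to the two vertical legs.
    x_bounce = horiz_dist * up / (up + down)
    return math.degrees(math.atan2(up, x_bounce))

# e.g. driver at 1.0 m, ceiling at 2.4 m, ears at 1.1 m, listener 3.0 m away
print(round(tilt_angle_deg(1.0, 2.4, 1.1, 3.0), 1))
```

For these example dimensions the result falls within the 30 to 60 degree tilt range mentioned above, which is consistent with typical domestic ceiling heights and listening distances.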
FIG. 7A is intended to show an example of a speaker and driver configuration, and many other
configurations are possible. For example, the upper launch driver may be provided in its own
enclosure to allow use with existing speakers. FIG. 7B shows a speaker system with drivers
distributed in multiple enclosures under an embodiment. As shown in FIG. 7B, the upper launch
driver 712 is provided in a separate enclosure 710, which can be placed in close proximity to, or
on top of, the enclosure 714 containing the forward and/or side launch drivers 716 and 718. The
driver may be enclosed within a speaker sound bar as used in many home theater environments.
Within the sound bar, several small or medium sized drivers are arranged along an axis in a
single horizontal or vertical enclosure. FIG. 7C shows the placement of the drivers in the
soundbar under an embodiment. In this example, the soundbar enclosure 730 is a horizontal
soundbar that includes a side launch driver 734, an upper launch driver 736 and a forward
launch driver(s) 732. FIG. 7C is intended to show merely one exemplary configuration, and any
practical number of drivers may be used for the forward, side and upper launch functions. It
should be noted that for the embodiments of FIGS. 7A-C, the drivers may be of any suitable
shape, size and type, depending on the required frequency response characteristics and any
other relevant constraints such as size, power rating, component cost, and so on.
In a typical adaptive audio environment, several speaker enclosures will be included in the
listening room.
FIG. 8 shows an exemplary arrangement of speakers with individually addressable drivers,
including an upper launch driver located in the listening room. As shown in FIG. 8, the room 800
includes four individual speakers 806, each having at least one forward, side and upper launch
driver. The room may also include center speakers 802 and subwoofers or fixed drivers used for
surround sound applications such as an LFE 804. As can be seen in FIG. 8, depending on the size
of the room and the respective speaker units, proper placement of the speakers 806 within the
room can provide a rich audio environment resulting from sound reflections off the ceiling and
walls from a number of the upper and side launch drivers. The speakers can be aimed to provide
reflection from one or more points on the appropriate surface planes, depending on the content,
room size, listener position, acoustic characteristics and other relevant parameters.
The speakers used in the adaptive audio system may use a configuration based on existing
surround sound configurations (e.g., 5.1, 7.1, 9.1, etc.). In this case, a number of drivers are
provided and defined as per the known surround sound convention, and additional drivers and
definitions are provided for the reflected (upper- and side-emitting) sound components in
addition to the direct (forward-emitting) components.
FIG. 9A shows a speaker configuration for an adaptive audio 5.1 system that utilizes multiple
addressable drivers for reflected audio, under an embodiment. In configuration 900, a standard
5.1 speaker footprint including LFE 901, center speaker 902, L / R forward speakers 904/906
and L / R backward speakers 908/910 is provided with eight additional drivers. A total of 14
addressable drivers are provided. These eight additional drivers, labeled "upward" and
"sideward", supplement the "forward" (or "front") driver in each speaker unit 902-910. The
direct forward drivers are driven by subchannels containing adaptive audio objects and any
other components that are designed to be highly directional. The upward launch (reflecting)
drivers may be driven by subchannel content that is more omnidirectional or directionless, but
are not so limited. Examples include background music or environmental sounds. If the input to
the system is legacy surround sound content, this content may be intelligently factored into
direct and reflected subchannels and fed to the appropriate drivers.
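The intelligent factoring of legacy surround content into direct and reflected subchannels can be sketched with a simple heuristic: strongly correlated left/right content is treated as directional and routed to the direct (forward launch) drivers, while decorrelated, ambient content is routed to the reflected (upper launch) drivers. The correlation test below is an illustrative assumption, not the system's actual algorithm:

```python
# Illustrative direct/reflected classification of a stereo channel pair.
# Highly correlated L/R content -> "direct" subchannel (forward drivers);
# decorrelated content -> "reflected" subchannel (upper launch drivers).
def factor_channel(samples_l, samples_r, diffuse_threshold=0.5):
    n = len(samples_l)
    mean_l = sum(samples_l) / n
    mean_r = sum(samples_r) / n
    cov = sum((a - mean_l) * (b - mean_r) for a, b in zip(samples_l, samples_r))
    var_l = sum((a - mean_l) ** 2 for a in samples_l)
    var_r = sum((b - mean_r) ** 2 for b in samples_r)
    corr = cov / ((var_l * var_r) ** 0.5 or 1.0)   # normalized correlation
    return "direct" if corr > diffuse_threshold else "reflected"

print(factor_channel([0.1, 0.5, -0.3, 0.2], [0.1, 0.5, -0.3, 0.2]))
print(factor_channel([0.1, 0.5, -0.3, 0.2], [-0.2, 0.3, 0.4, -0.5]))
```

A real implementation would operate per frequency band and over short time frames, but the same directional/diffuse distinction underlies the routing decision.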
For the direct subchannels, the speaker enclosure includes drivers whose median axis bisects
the acoustic center of the room, or some other optimal listening position (the "sweet spot"). The
upper launch drivers are positioned such that the angle between the median plane of the driver
and the acoustic center is in the range of 45 to 180 degrees. If a driver is positioned at 180
degrees, the rear-facing driver can provide sound diffusion by reflecting off the rear wall. This
arrangement uses the acoustic principle that, after time-aligning the upper launch driver with
the direct driver(s), the earlier-arriving signal components are coherent, while the
later-arriving components benefit from the natural diffusion provided by the room.
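The time alignment mentioned above can be sketched as a simple delay computation: since the ceiling-reflected path is longer than the direct path, the direct driver's feed can be delayed so that the first-arriving components from both drivers reach the listener coherently. The geometry and sample rate below are illustrative assumptions:

```python
# Illustrative time-alignment delay between a direct driver and an upper
# launch (ceiling-reflected) driver. Path lengths are hypothetical, in meters.
SPEED_OF_SOUND = 343.0  # m/s at room temperature

def alignment_delay_samples(direct_path_m, reflected_path_m, sample_rate=48000):
    extra = reflected_path_m - direct_path_m        # reflected path is longer
    # Delay applied to the direct driver's feed, in whole samples.
    return round(extra / SPEED_OF_SOUND * sample_rate)

# e.g. direct path 3.1 m; path up to the ceiling and back down ~4.2 m
print(alignment_delay_samples(3.1, 4.2))
```

After this delay is applied, the early portions of the two signals arrive together, while the room's diffusion acts on the later-arriving reflected energy as described above.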
In order to achieve the height cues provided by the adaptive audio system, the upper launch
drivers can be angled upward from the horizontal plane and, in the extreme case, can be
positioned to radiate straight up and reflect off a reflective surface such as a flat ceiling, or off
an acoustic diffuser placed directly above the enclosure. To provide additional directivity, a
high-resolution center channel configuration (as shown in FIG. 7C), in which the center speaker
has the ability to steer sound across the screen, can be used.
The 5.1 configuration of FIG. 9A can be expanded by adding two additional rear enclosures
similar to the standard 7.1 configuration. FIG. 9B shows a speaker configuration for an adaptive
audio 7.1 system that utilizes multiple addressable drivers for reflected audio, under an
embodiment. As shown in configuration 920, the two additional enclosures 922 and 924 are
placed in the "left side surround" and "right side surround" positions, halfway between the
existing front and rear pairs, with their side speakers facing the side walls in the same manner
as the front enclosures, and with their upper launch drivers set to bounce sound off the ceiling.
Such incremental additions can be made as many times as desired, with additional pairs filling
in the gaps along the side or rear walls. FIGS. 9A and 9B show only some examples of possible
configurations of expanded surround sound speaker layouts that can be used in connection with
upper- and side-emitting speakers in an adaptive audio system for a consumer environment, and
many other configurations are possible.
As an alternative to the above n.1 configurations, a more flexible pod-based system may be utilized, in which each driver is contained within its own enclosure that can be placed in any convenient location, using a driver configuration such as that shown in FIG. 7B. These individual units
may then be clustered in the same manner as in the n. 1 configuration, or may be spread
individually around the room. The pods are not necessarily restricted to being located at the end
of the room, but can also be arranged on any surface in the room (e.g. coffee table, bookshelf
etc). Such a system is easy to expand and allows the user to add more speakers over time to
create a more immersive experience. If the speakers are wireless, the pod system can include the
ability to dock the speakers for charging purposes. In this design, the pods can be docked together to act as a single speaker while charging (for example, to listen to stereo music) and then undocked and positioned around the room for playback of adaptive audio content.
Several sensors and feedback devices can be added to the enclosures to notify the renderer of properties that can be used in the rendering algorithm, improving the configurability and accuracy of an adaptive audio system that uses top-firing addressable drivers. For example, the
microphones installed in each enclosure allow the system to measure the phase, frequency and
reverberation characteristics of the room, as well as the positions of the speakers relative to one another, using triangulation and the HRTF-like features of the enclosure itself. Inertial sensors (e.g., gyroscopes, compasses) can be used to detect the direction and angle of each enclosure, and optical and visual sensors (e.g., a laser-based infrared rangefinder) can be used to provide location information relative to the room. These represent just a few of the additional sensors that can be used in the system; others are also possible.
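The triangulation mentioned above can be sketched as follows. This is a minimal illustrative example, not the patent's method: it assumes each enclosure's microphone yields distance estimates (e.g., from acoustic time-of-flight) to three reference speakers at known positions, and solves the resulting circle equations in 2D.

```python
def trilaterate_2d(p1, p2, p3, r1, r2, r3):
    """Estimate a speaker's (x, y) position from distances r1..r3 to three
    reference points p1..p3 (e.g., measured via acoustic time-of-flight)."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    # Linearize the three circle equations into a 2x2 linear system.
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = r1**2 - r2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = r1**2 - r3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("reference points are collinear")
    x = (b1 * a22 - b2 * a12) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y
```

A 3D variant would add a fourth reference point; noisy real-world distance measurements would call for a least-squares fit instead of an exact solve.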
Such sensor systems can be further enhanced by making the driver's position and/or the enclosure's acoustic modifiers automatically adjustable via electromechanical servos. This allows the directionality of a driver to be changed at run time to suit its positioning in the room relative to the walls and the other drivers ("active steering"). Likewise, any acoustic modifiers (such as baffles, horns or waveguides) can be tuned to provide the correct frequency and phase response for optimal reproduction in any room configuration ("active tuning"). Both active steering and active tuning can be performed during initial room configuration (e.g., in the context of an automatic EQ / automatic room configuration system) or during playback in response to the content being rendered.
Bidirectional Interconnection
Once configured, the speakers need to be connected to the rendering system. Traditional interconnections are typically of two types: speaker-level inputs for passive speakers and line-level inputs for active speakers.
passive speakers and line level input for active speakers. As shown in FIG. 4C, adaptive audio
system 450 includes bi-directional interconnect functionality. This interconnection is embodied
within the set of physical and logical connections between the rendering stage 454 and the
amplifier / speaker 458 and the microphone stage 460. The ability to address multiple drivers in
each speaker cabinet is supported by such intelligent interconnection between sound sources
and speakers. Bidirectional interconnection allows transmission of the signal from the source
(renderer) to the speaker to include both control and audio signals. The signal from the
loudspeaker to the sound source consists of both control and audio signals. Here, the audio
signal in this case is audio from an optional built-in microphone. Power may be provided as part
of a bi-directional interconnect, at least if the speakers / drivers do not receive power separately.
FIG. 10A is a drawing 1000 illustrating the composition of a bi-directional interconnect under an
embodiment. A sound source 1002, which may represent a renderer plus an amplifier / sound
processor chain, is logically and physically coupled to the speaker cabinet 1004 through a pair of
interconnecting links 1006 and 1008. The interconnect 1006 from the sound source 1002 to the
driver 1005 in the speaker cabinet 1004 includes an electrical acoustic signal for each driver,
one or more control signals and optional power. The interconnect 1008 from the speaker cabinet
1004 back to the sound source 1002 includes the sound signal from the microphone 1007 or
other sensor for calibration of the renderer or similar sound processing function. The feedback
interconnect 1008 also includes certain driver definitions and parameters used by the renderer
to modify or process the sound signals sent to the drivers through interconnect 1006.
In one embodiment, each driver in each cabinet of the system is assigned an identifier (e.g.,
numerical assignment) during system setup. Each speaker cabinet can also be uniquely identified.
This numerical assignment is used by the speaker cabinet to determine which audio signal is sent
to which driver in the cabinet. The assignments are stored in the speaker cabinet in an
appropriate memory device. Alternatively, each driver may be configured to store its own
identifier in a local memory. In further alternatives, such as when the driver / speaker does not
have local storage capacity, the identifier can be stored in a rendering stage or other component
within the sound source 1002. During the speaker discovery process, each speaker (or central
database) is queried for its profile by the sound source. The profile is the number of drivers in
the speaker cabinet or other defined array, the acoustic characteristics of each driver (e.g. driver
type, frequency response etc), x of the center of each driver relative to the center of the front of
the speaker cabinet Define certain driver definitions including y, z position, angle of each driver
with respect to a defined plane (e.g. ceiling, floor, cabinet vertical axis etc) and the number of
microphones and microphone characteristics. Other related driver and microphone / sensor
parameters may also be defined. In one embodiment, driver definitions and speaker cabinet
profiles may be expressed as one or more XML documents used by the renderer.
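Since the driver definitions and speaker cabinet profiles may be expressed as XML documents, a profile returned by the discovery query might be parsed along these lines. The element and attribute names here are illustrative assumptions, not a schema defined by this description:

```python
import xml.etree.ElementTree as ET

# Hypothetical cabinet profile; tag and attribute names are assumptions.
PROFILE = """
<cabinet id="front-left">
  <driver id="1" type="woofer" x="0.0" y="0.1" z="0.0" angle="0"/>
  <driver id="2" type="tweeter" x="0.0" y="0.3" z="0.0" angle="0"/>
  <driver id="3" type="upward" x="0.0" y="0.4" z="0.0" angle="70"/>
  <microphone count="1"/>
</cabinet>
"""

def parse_profile(xml_text):
    """Return (cabinet_id, list of driver dicts) from a profile document."""
    root = ET.fromstring(xml_text)
    drivers = [
        {
            "id": int(d.get("id")),
            "type": d.get("type"),
            # Position of the driver center relative to the cabinet front.
            "pos": (float(d.get("x")), float(d.get("y")), float(d.get("z"))),
            # Angle relative to a defined plane (e.g., the horizontal).
            "angle": float(d.get("angle")),
        }
        for d in root.findall("driver")
    ]
    return root.get("id"), drivers
```

The renderer would build its driver table from such profiles, keyed by the cabinet and driver identifiers assigned during setup.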
In one possible implementation, an Internet Protocol (IP) control network is created between the
sound source 1002 and the speaker cabinet 1004. Each speaker cabinet and sound source acts
as a single network endpoint and is given a link-local address upon initialization or power up. An
auto-discovery mechanism such as zero configuration networking (zeroconf) may be used to
allow the sound source to locate each speaker on the network. Zero-configuration networking is an
example of a process that automatically creates a usable IP network without manual operator
intervention or a specialized configuration server, and other similar techniques may be used.
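As a sketch of the link-local addressing each endpoint might perform at power-up, the following draws a candidate address from the RFC 3927 range used by zeroconf (169.254.1.0 through 169.254.254.255); seeding from a device identifier is an illustrative assumption:

```python
import random

def pick_link_local(seed):
    """Pick a candidate IPv4 link-local address in 169.254.1.0-169.254.254.255.

    RFC 3927 reserves the first and last /24 of 169.254/16, so the third
    octet is drawn from 1..254. A real endpoint would then probe the
    address (e.g., via ARP) and retry with a new one on conflict.
    """
    rng = random.Random(seed)  # e.g., seeded from the device MAC address
    return "169.254.%d.%d" % (rng.randint(1, 254), rng.randint(0, 255))
```

After each cabinet claims an address, a discovery protocol such as zeroconf/mDNS lets the sound source enumerate the speakers without a configuration server.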
Given an intelligent network system, multiple sources may exist as speakers on an IP network.
This allows multiple sources to drive the speakers directly without routing the sound through a
"master" audio source (e.g., a traditional A/V receiver). If another source attempts to address the speakers, communication is performed among all the sources to determine which source is currently "active", whether it needs to remain active, and whether control can be transitioned to the new sound source. Sources may be pre-assigned priorities during manufacturing based on their
classification. For example, the telecommunication source may have a higher priority than the
entertainment source. In a multi-room environment, such as a typical home environment, all the
speakers in the overall environment may be on a single network, but need not be addressed
simultaneously. During setup and automatic configuration, the sound levels provided back
through interconnect 1008 can be used to determine which speakers are located in the same
physical space. Once this information is determined, the speakers may be grouped together. In
this case, a cluster ID may be assigned and be part of the driver definition. The cluster ID is sent
to each speaker, and each cluster can be addressed simultaneously by the sound source 1002.
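The grouping step described above might be sketched as follows: speakers that register one another's test signals above a threshold are assumed to share a physical space and receive the same cluster ID. The threshold value and data shapes are illustrative assumptions:

```python
def assign_clusters(levels_db, threshold_db=-40.0):
    """Group speakers into clusters by measured sound level.

    levels_db maps (emitting_speaker, hearing_speaker) -> measured level in dB.
    Two speakers land in the same cluster when either hears the other's test
    signal above threshold_db. Returns a speaker -> cluster ID mapping.
    """
    speakers = sorted({s for pair in levels_db for s in pair})
    # Union-find over speakers that can hear one another.
    parent = {s: s for s in speakers}

    def find(s):
        while parent[s] != s:
            s = parent[s]
        return s

    for (a, b), level in levels_db.items():
        if level > threshold_db:
            parent[find(a)] = find(b)

    roots = sorted({find(s) for s in speakers})
    cluster_ids = {root: i for i, root in enumerate(roots)}
    return {s: cluster_ids[find(s)] for s in speakers}
```

The resulting cluster ID would then be stored with each speaker's driver definition so the sound source can address a whole room at once.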
As shown in FIG. 10A, an optional power signal can be transmitted through the bi-directional
interconnect. The speakers can be passive (requiring external power from a source) or active
(requiring power from an electrical outlet). If the speaker system consists of an active speaker
without wireless support, the input to the speaker consists of an IEEE 802.3 compliant wired
Ethernet input. If the speaker system comprises an active speaker with wireless support, the
input to the speaker may be an IEEE 802.11 compliant wireless Ethernet input or alternatively a
wireless standard defined by the WISA organization. Passive speakers may be supplied with an appropriate power signal directly by the sound source.
In a distributed processing embodiment, where the configuration, calibration and/or rendering functions are performed in a speaker enclosure or other component closely coupled to the drivers, or in a listening environment including all or most of the drivers, links 1006 and 1008 may be combined within a single unidirectional interconnect, such as interconnect 476 shown in FIG. 4D. In this case, the sound source sends out the appropriate
audio signals, along with control signals or instructions that cause the configuration and
calibration functions to be performed by the respective processes provided by the speaker
system itself. The direct sound signal from the microphone to these functions in the speaker
essentially constitutes a second channel providing environmental information to the
configuration / calibration function. On the other hand, the link between the sound source and
the driver remains the one-way first channel link. Such an embodiment is shown in FIG. 10B. As
shown in FIG. 10B, system 1010 has a sound source 1012 coupled to driver 1015 in speaker
enclosure 1014 through link 1016. The speaker cabinet 1014 contains several components
including a driver 1015, a circuit 1019 for performing functions and one or more microphones
1017. The functions performed by component 1019 may include calibration, configuration and /
or partial rendering of the audio signal generated by sound source 1012. Link 1016 transmits an
audio signal or speaker feed from the source to driver 1015. The appropriate instructions,
commands or triggers are sent to function block 1019 over this link. Sound information related
to the listening environment is also transmitted from the microphone 1017 to the functional
block 1019. This information is then used to configure or calibrate the driver 1015 for proper
rendering of the audio signal transmitted from the sound source 1012 through the link 1016.
It should be noted that either of the components 1019 and 1017 may be embodied in circuits or
components physically located external to the enclosure 1014 but closely coupled or linked to
the driver 1015.
System Configuration and Calibration
As shown in FIG. 4C, the functions of the adaptive audio system include a calibration function 462. This function is enabled by the microphone 1007 and interconnect 1008 links shown in FIG. 10A. The
function of the microphone components in system 1000 is to measure the responses of the
individual drivers in the room to derive an overall system response. For this purpose, multiple
microphone topologies can be used, including a single microphone or a microphone array. In the
simplest case, a single omnidirectional microphone located at the center of the room is used to
measure the response of each driver. Multiple microphones can be used instead if the room and
playback conditions warrant more sophisticated analysis. The most convenient location for
multiple microphones is within the physical speaker cabinets of the particular speaker configuration
used in the room. The microphones installed in each enclosure allow the system to measure the
response of each driver at multiple locations in the room. An alternative to this topology is to use
multiple omnidirectional measurement microphones placed at possible listener locations in the room. The microphone(s) are used to enable automatic configuration and calibration of the renderer
and post-processing algorithm. In an adaptive audio system, the renderer is responsible for converting hybrid object- and channel-based audio streams into the individual audio signals specified for particular addressable drivers in one or more physical speakers. Post-processing components may include delay, equalization, gain, speaker virtualization and upmixing. The speaker configuration is often critical information that the renderer components can use to convert hybrid object- and channel-based audio streams into individual driver-specific audio signals and provide optimal playback of the audio content. System configuration information includes: (1) the number of physical speakers in the system, (2) the number of individually addressable drivers in each speaker, and (3) the location and orientation of each individually addressable driver relative to the room geometry. Other characteristics are also possible. FIG. 11 illustrates the functionality of the
automatic configuration and system calibration component, under an embodiment. As shown in
drawing 1100, an array of one or more microphones 1102 provides acoustic information to the
configuration and calibration component 1104. This acoustic information captures certain
relevant characteristics of the listening environment. The configuration and calibration component 1104 then provides this information to the renderer 1106 and any associated post-processing component 1108, so that the audio signal ultimately sent to the speakers is conditioned and optimized for the listening environment.
The number of physical speakers in the system and the number of individually addressable
drivers in each speaker are physical speaker attributes. These attributes are sent directly from
the speaker to renderer 454 via bi-directional interconnect 456. The renderer and the speaker
use a common discovery protocol so that when the speaker is connected or disconnected from
the system, the renderer is notified of the change and the system can be reconfigured accordingly.
The geometry (size and shape) of the listening room is a necessary item of information in the
configuration and calibration process. The geometry can be determined in several different ways.
In a manual configuration mode, the listener or technician inputs the width, length and height of the minimum bounding cube for the room to the renderer or other processing unit in the adaptive audio system through a user interface. A variety of different user interface techniques and tools can be used for this purpose. For example,
room geometry can be sent to the renderer by a program that automatically maps and traces
room geometry. Such systems may use a combination of computer vision, sonar and 3D laser
based physical mapping.
The position of the speakers within the room geometry is used to derive the audio signal for each individually addressable driver, including the direct and reflected (upper launch) drivers. A direct driver is aimed so that the majority of its dispersion pattern intersects the listening position before being diffused by one or more reflective surfaces (floor, wall, ceiling, etc.). A reflected driver is aimed so that the majority of its dispersion pattern is reflected before crossing the listening position, as shown in FIG. If the system is in manual configuration mode, the 3D coordinates of each direct driver may be input into the system through the UI. For a reflected driver, the 3D coordinates of the primary reflection point are input into the UI. A laser or similar technique may be used to visualize the dispersion pattern of a diffuse driver on the surfaces of the room so that the 3D coordinates can be measured and manually entered into the system.
Driver location and aiming are typically performed using manual or automatic techniques. In some cases, inertial sensors may be incorporated into each speaker. In this aspect, the center speaker is designated as the "master" and its compass measurement is taken as the reference. The other speakers then transmit the dispersion pattern and compass position of each of their individually addressable drivers. Coupled with the room geometry, the difference between the reference angle of the center speaker and that of each additional driver provides sufficient information for the system to automatically determine whether a driver is direct or reflected.
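An illustrative sketch of that direct-versus-reflected determination follows. The 45-degree boundary is an assumption chosen for illustration; the description does not fix a specific threshold:

```python
def classify_driver(master_compass_deg, driver_compass_deg, aim_elevation_deg,
                    direct_cone_deg=45.0):
    """Classify a driver as 'direct' or 'reflected' from orientation data.

    master_compass_deg: reference heading of the "master" center speaker.
    driver_compass_deg: heading reported by the driver's inertial sensor.
    aim_elevation_deg:  upward tilt of the driver from the horizontal plane.
    A driver aimed steeply upward, or pointing far away from the reference
    heading, is assumed to reach the listener by reflection.
    """
    # Smallest angular difference between the two headings, in [0, 180].
    diff = abs((driver_compass_deg - master_compass_deg + 180.0) % 360.0 - 180.0)
    if aim_elevation_deg > direct_cone_deg or diff > direct_cone_deg:
        return "reflected"
    return "direct"
```

In practice the room geometry would refine this decision, e.g., by checking whether the aimed ray actually intersects the listening area before hitting a surface.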
The speaker position configuration can be fully automated if a 3D position (i.e., ambisonic) microphone is used. In this aspect, the system sends a test signal to each driver and records the response. Depending on the microphone type, the signals may need to be converted into an x, y, z representation. These signals are analyzed to find the x, y and z components of the dominant first arrival. Coupled with the room geometry, this typically provides sufficient information to automatically set the 3D coordinates of all the speaker positions, whether direct or reflected. Depending on the room geometry, a hybrid combination of the three methods described for determining the speaker coordinates may be more effective than using any one technique alone.
The speaker configuration information is one component required to configure the renderer.
Speaker calibration information is also needed to configure the post-processing chain: delay,
equalization and gain. FIG. 12 is a flow chart illustrating process steps for performing automatic
speaker calibration using a single microphone, under an embodiment. In this aspect, the delay,
equalization and gain are calculated automatically by the system using a single omnidirectional
measurement microphone located at the center of the listening position. As shown in drawing
1200, the process begins by measuring the room impulse response for each single driver alone
(block 1202). The delay of each driver is then calculated by finding the offset of the peak of the
cross-correlation of the acoustic impulse response (captured by the microphone) with the directly
captured electrical impulse response (block 1204). At block 1206, the calculated delay is applied
to the directly captured (reference) impulse response. The process then determines the wideband and per-band gain values that, when applied to the measured impulse response, minimize the difference between it and the directly captured (reference) impulse response (block 1208). This is done by taking a windowed FFT of the measured and reference impulse responses, calculating the per-bin magnitude ratio between the two signals, applying a median filter to the per-bin magnitude ratios, calculating the per-band gain values by averaging the gains of all bins that fall completely within a band, calculating the wideband gain by taking the average of all the per-band gains, removing the wideband gain from the per-band gains, and applying a room X curve (-2 dB/octave above 2 kHz). Once these gain values are determined at block 1208, the process determines the final delay values by subtracting the minimum delay from the other delays, so that at least one driver in the system will always have an additional delay of zero.
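The delay step of blocks 1202-1204 and the final normalization can be sketched as follows, with a plain cross-correlation on short illustrative signals; a real implementation would operate on full-length measured impulse responses and account for fractional-sample offsets:

```python
def cross_correlation_lag(measured, reference):
    """Return the lag (in samples) at which `reference`, shifted right,
    best matches `measured` -- i.e., the acoustic propagation delay."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(measured)):
        score = sum(m * r for m, r in zip(measured[lag:], reference))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def relative_delays(measured_irs, reference_ir):
    """Per-driver delays with the minimum subtracted, so that at least one
    driver always has an additional delay of zero."""
    delays = [cross_correlation_lag(m, reference_ir) for m in measured_irs]
    base = min(delays)
    return [d - base for d in delays]
```

The peak of the cross-correlation of the acoustic impulse response with the electrical reference gives each driver's absolute delay; subtracting the minimum yields the additional delay each driver must apply.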
For automatic calibration with multiple microphones, delay, equalization and gain are calculated
automatically by the system using multiple omnidirectional measurement microphones. The
process is substantially identical to the single microphone technique, but repeated for each
microphone and the results averaged.
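The per-band gain computation of block 1208 can be sketched like this, starting from per-bin magnitude ratios that a windowed FFT would provide. The band edges, the 3-tap median filter and the dB conversion details are illustrative assumptions; the -2 dB/octave room X curve above 2 kHz follows the text:

```python
import math

def median3(values):
    """3-tap median filter (edge samples passed through unchanged)."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = sorted(values[i - 1:i + 2])[1]
    return out

def band_gains_db(bin_ratios, bin_freqs, band_edges):
    """Average smoothed per-bin magnitude ratios over each band, in dB."""
    smoothed = median3(bin_ratios)
    gains = []
    for lo, hi in band_edges:
        in_band = [r for r, f in zip(smoothed, bin_freqs) if lo <= f < hi]
        gains.append(20.0 * math.log10(sum(in_band) / len(in_band)))
    return gains

def room_x_curve_db(freq_hz):
    """Target X-curve roll-off: flat up to 2 kHz, then -2 dB per octave."""
    if freq_hz <= 2000.0:
        return 0.0
    return -2.0 * math.log2(freq_hz / 2000.0)
```

The wideband gain would be the mean of the band gains, and the X-curve value is applied as the per-band target above 2 kHz.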
Alternative Applications
Instead of implementing the adaptive audio system in a whole room or theater, it is possible to implement aspects of the adaptive audio system in more localized applications, such as televisions, computers, game consoles or similar devices. This case effectively relies on speakers arranged in a flat plane corresponding to the viewing screen or monitor surface. FIG.
13 illustrates the use of the adaptive audio system in an exemplary television and sound bar
consumer use case. In general, television use cases present difficulties for creating an immersive consumer experience, due to the often reduced quality and spatial resolution of the equipment (TV speakers, sound bar speakers, etc.) and speaker locations/configurations that may be limited (e.g., no surround or rear speakers). The system 1300 of FIG. 13 includes speakers in the standard television left and right positions (TV-L and TV-R) as well as left and right upper launch drivers (TV-LH and TV-RH). The television 1302 may also include a sound bar 1304
or speakers in any type of height array. In general, the size and quality of television speakers is
reduced relative to single or home theater speakers due to cost constraints and design choices.
However, the use of dynamic virtualization can help to overcome these deficiencies. In FIG. 13,
dynamic virtualization effects are shown for TV-L and TV-R speakers. This causes people at a
particular listening position 1308 to hear the horizontal elements associated with the
appropriate audio objects rendered individually in the horizontal plane. In addition, height
elements associated with the appropriate audio objects are rendered correctly through the
reflected audio sent by the LH and RH drivers. The use of stereo virtualization in the television L and R speakers is similar to that in the L and R home theater speakers. Here, a potentially immersive dynamic
speaker virtualization user experience may be possible through dynamic control of speaker
virtualization algorithm parameters based on object spatial information provided by the adaptive
audio content.
This dynamic virtualization can be used to create a perception that objects are moving along the
sides of a room.
The television environment may also include HRC speakers as shown in the sound bar 1304.
Such HRC speakers may be steerable units that allow panning through the HRC array. There may be a benefit in having a front-firing center channel with individually addressable speakers that allow discrete panning of audio objects through the array, matching the motion of video objects on the screen (especially for larger screens). This speaker is also shown having side-firing speakers, which can be activated and used when the unit is employed as a sound bar, so that the side-firing drivers provide greater immersion despite the lack of surround or rear speakers.
Dynamic virtualization concepts are also shown for HRC / Soundbar Speakers. Dynamic
virtualization is shown for the L and R speakers on the farthest side of the forward launch
speaker array. Again, this can be used to create the perception of objects moving along the plane
of the room. The modified central speaker may also include more speakers and implement a
steerable sound beam with separately controlled sound zones. Also shown in the exemplary
implementation of FIG. 13 is an NFE speaker 1306 located in front of the main listening position
Inclusion of the NFE speaker can provide greater envelopment by moving the sound away from the front of the room and closer to the listener. For headphone rendering, the adaptive audio system maintains the creator's original intent by
matching the HRTFs to spatial locations. When audio is played through headphones, binaural spatial virtualization can be achieved by applying a Head Related Transfer Function (HRTF) that processes the audio to add the perceptual cues that create the perception of audio played back in three-dimensional space, rather than over standard stereo headphones. The accuracy of the spatial reproduction depends on the selection of an appropriate HRTF, which can vary based on several factors, including the spatial position of the audio channel or object being rendered. Using the spatial information provided by the adaptive audio system can result in the selection of one HRTF, or a continuously varying number of HRTFs, representing 3D space, significantly improving the playback experience.
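Selecting an HRTF from an object's spatial position might look like the following nearest-neighbor sketch. The sparse measurement grid and the great-circle distance metric are illustrative assumptions; production systems typically interpolate between measured HRTFs rather than switching discretely:

```python
import math

# Hypothetical measurement grid: (azimuth, elevation) in degrees -> HRTF ID.
HRTF_TABLE = {
    (0, 0): "front", (90, 0): "left", (-90, 0): "right",
    (180, 0): "rear", (0, 60): "overhead",
}

def angular_distance(a, b):
    """Great-circle angle (degrees) between two (azimuth, elevation) pairs."""
    az1, el1 = map(math.radians, a)
    az2, el2 = map(math.radians, b)
    cos_angle = (math.sin(el1) * math.sin(el2)
                 + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))

def select_hrtf(azimuth, elevation):
    """Pick the measured HRTF closest to the object's rendered direction."""
    return min(HRTF_TABLE,
               key=lambda k: angular_distance(k, (azimuth, elevation)))
```

As an object's position metadata changes over time, re-running the selection (or cross-fading between the two nearest HRTFs) tracks the object through 3D space.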
The system also facilitates adding guided, three-dimensional binaural rendering and virtualization. As in the case of spatial rendering, three-dimensional HRTFs can be used to create cues that simulate sounds coming from both the horizontal and vertical axes. Previous audio formats that provided only channel and fixed speaker position information were more limited in this regard. With adaptive audio format information, a binaural three-dimensional rendering system has detailed and useful information indicating which elements of the audio are suitable to be rendered in the horizontal and vertical planes. Some content may rely on the use of overhead speakers to provide a greater sense of envelopment; these audio objects and this information can be used for binaural rendering that is perceived to be above the listener's head when headphones are used.
FIG. 14 shows a simplified representation of a three-dimensional binaural headphone
virtualization experience for use in an adaptive audio system, under an embodiment. As shown in
FIG. 14, a headphone set 1402 used to play audio from the adaptive audio system includes audio
signals 1404 in the standard x, y and z planes, whereby the heights associated with certain audio objects or sounds are reproduced so that they sound as if they emanate from above or below the horizontal plane.
Metadata Definition
In one embodiment, the adaptive audio system includes components that generate metadata from the original spatial audio format. The methods and components of
system 300 include an audio rendering system configured to process one or more bitstreams
that include both regular channel based audio elements and audio object coding elements. A new
enhancement layer containing audio object coding elements is defined and added to either the
channel based audio codec bitstream or the audio object bitstream. This approach allows the
bitstream containing the enhancement layer to be processed by a renderer for use with existing
speaker and driver designs or next generation speakers utilizing individually addressable drivers
and driver definitions. Spatial audio content from the spatial audio processor includes audio
objects, channel and location metadata. When an object is rendered, the object is assigned to one
or more speakers according to the position metadata and the position of the playback speaker.
Additional metadata may be associated with the object to change the playback position or
otherwise limit the speakers used for playback. Metadata is generated at the audio workstation in response to the engineer's mixing inputs to control spatial parameters (e.g., position, velocity, intensity, timbre, etc.) and to provide rendering cues specifying which driver(s) or speaker(s) in the listening environment should play each sound. The metadata is associated with the respective audio data in the workstation for packaging and transport by the spatial audio processor.
FIG. 15 is a table that illustrates certain metadata definitions for use in an adaptive audio system for consumer environments, under an embodiment. As shown in Table 1500, the metadata definitions include: audio content type; driver definitions (number, characteristics, position, projection angle); control signals for active steering/tuning; and calibration information, including room and speaker information.
Features and Functions
As noted above, the adaptive audio ecosystem allows content creators to embed the spatial intent of the mix (position, size, velocity, etc.) into the bitstream via metadata. This allows for incredible flexibility in the spatial reproduction of the audio. From a
spatial rendering point of view, the adaptive audio format allows the playback system to adapt to the precise positions of the loudspeakers in the room, avoiding the spatial distortion caused by a playback-system geometry that is not identical to that of the authoring system. In current consumer audio playback, where only audio for each speaker channel is sent, the content creator's intent is not known for locations in the room other than the fixed speaker locations. Under the current channel/speaker paradigm, the only information known is that a particular audio channel should be sent to a particular speaker at a predefined position in the room. In an adaptive audio system, using metadata conveyed through the generation and delivery pipeline, the playback system can use this information to play back the content in a manner that matches the content creator's original intent. For example, the relationship between the speakers is
known for various audio objects. By providing a spatial location for the audio object, the content
creator's intent is known, which can be "mapped" to the consumer's speaker configuration
including that location. In a dynamic audio rendering system, this rendering can be updated and improved by the addition of more speakers.
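A minimal sketch of "mapping" an object's authored position onto whatever speakers the consumer actually has follows. Inverse-distance panning here is an illustrative stand-in for the renderer's actual algorithm, and the gains address only level distribution, not delay or equalization:

```python
def map_object_to_speakers(obj_pos, speaker_positions):
    """Distribute an object's level across the available speakers.

    obj_pos and the speaker positions are (x, y, z) tuples; returns a
    speaker -> gain mapping with gains normalized to sum to 1. Adding
    another speaker simply adds an entry, so the rendering can improve
    as the layout grows.
    """
    weights = {}
    for name, pos in speaker_positions.items():
        d2 = sum((a - b) ** 2 for a, b in zip(obj_pos, pos))
        if d2 == 0.0:  # object sits exactly on a speaker
            return {n: (1.0 if n == name else 0.0) for n in speaker_positions}
        weights[name] = 1.0 / d2
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}
```

Because the mapping is computed from the object's position metadata at playback time, the same content adapts to a 2-speaker, 5-speaker or larger configuration without re-authoring.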
The system also makes it possible to add guided three-dimensional spatial rendering. There have
been many attempts to create a more immersive audio rendering experience through the use of
new speaker designs and configurations. These include the use of bipole and dipole speakers,
side launch, rear launch and top launch drivers. In previous channel and fixed speaker position
systems, it was at best guesswork to decide which elements of the audio should be sent to these
modified speakers. Using the adaptive audio format, the rendering system has detailed and useful
information about which elements of the audio (object or otherwise) are suitable for being sent to
the new speaker configuration. That is, the system allows control of which audio signals are sent
to the forward launch driver and which are sent to the upper launch driver. For example,
adaptive audio cinema content relies heavily on the use of overhead speakers to give a greater
sense of envelopment. These audio objects and information may be sent to the upper launch
driver to provide the reflected audio in consumer space to create a similar effect.
The system also allows adapting the mix to the exact hardware configuration of the playback
system. There are many different possible speaker types and configurations in consumer
rendering facilities such as televisions, home theaters, sound bars, portable music player docks
and the like. When these systems are sent channel-specific audio information (ie left and right
channels or standard multi-channel audio), the system needs to process the audio to match the
capabilities of the rendering facility appropriately. A typical example is when standard stereo
(left, right) audio is sent to a sound bar with three or more speakers. In current consumer
systems where only audio for the speaker channels is sent, the intent of the content creator is unknown, and the more immersive audio experience enabled by the enhanced equipment must be produced by an algorithm that makes assumptions about how to modify the audio for the playback hardware. One example of this is the use of PLII, PLII-z or next-generation
surround that "mixes up" channel-based audio to more speakers than the original number of
channel feeds. In an adaptive audio system that uses metadata conveyed through the generation
and delivery pipeline, the playback system can use this information to play back the content in a
manner that better matches the content creator's original intent. For example, some sound bars
have side-firing speakers to create a wrap-around feel. In adaptive audio, the spatial and content-type information (i.e., dialog, music, environmental effects, etc.) can be used by the sound bar when it is controlled by a rendering system, such as a TV or A/V receiver, that sends only the appropriate audio to these side-firing speakers.
The spatial information conveyed by the adaptive audio allows dynamic rendering of the content
with an awareness of the position and type of speakers present. Further, information about the listeners' relationship to the audio playback facility is now potentially available and may be
used in rendering. Most game consoles include camera accessories and intelligent image
processing that can determine the position and identity of a person in the room. This information
may be used by the adaptive audio system to modify the rendering to more accurately convey the
content creator's creative intent based on the listener's position. For example, in almost all cases, audio rendered for consumer playback assumes that the listener is located in an ideal "sweet spot": often equidistant from each speaker, at the same position the sound mixer occupied during content generation. However, in many cases people are not in this ideal position, and their experience does not match the creative intent of the mixer. A typical example is when the listener is sitting in a chair or on a couch on the left side of the living room. In this case, the sound played from the closer speakers on the left is perceived as louder, distorting the spatial perception of the audio mix to the left. By understanding the position of the listener, the system can adjust the rendering of the audio, lowering the level of the left speakers and raising the level of the right speakers to rebalance the audio mix and make it perceptually correct. It is also possible to delay the audio to compensate for the listener's distance from the sweet spot. The listener position can be detected through the use of a modified remote control with built-in signaling that signals the listener position to the camera or rendering system.
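The gain and delay adjustments described above can be sketched as follows. This is a minimal illustration, not part of the original disclosure; the inverse-distance gain model and all names are assumptions:

```python
import math

SPEED_OF_SOUND = 343.0  # meters per second

def compensate(speaker_positions, listener_position):
    """Return per-speaker (gain, delay_ms) pairs that rebalance playback
    for an off-center listener: closer speakers are attenuated (simple
    inverse-distance model normalized to the farthest speaker) and
    delayed so that all arrivals are time-aligned."""
    distances = [math.dist(p, listener_position) for p in speaker_positions]
    d_max = max(distances)
    compensation = []
    for d in distances:
        gain = d / d_max                                   # attenuate closer speakers
        delay_ms = (d_max - d) / SPEED_OF_SOUND * 1000.0   # delay closer speakers
        compensation.append((gain, delay_ms))
    return compensation
```

For a listener seated 0.5 m from the left speaker and 2.5 m from the right, the left feed would be attenuated to 0.2 of full gain and delayed by roughly 5.8 ms.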
In addition to using standard speakers and speaker positions to convey sound to the listening position, it is also possible to use beam-steering technology to create sound-field "zones" that vary with listener position and content. Audio beamforming uses an array of speakers (typically 8 to 16 horizontally spaced speakers) together with phase manipulation and processing to
produce steerable sound beams. A beamforming speaker array allows for the creation of an audio
zone where the audio is primarily audible, which can be used to direct a particular sound or
object to a particular spatial location through selective processing. One obvious use case is to process the dialog in a soundtrack with a dialog-enhancement post-processing algorithm and to beam that audio object directly at a hearing-impaired user.
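As an illustration of the phase/delay manipulation involved, the following sketch computes classic delay-and-sum steering delays for a horizontal line array. This is illustrative only; the geometry and function names are assumptions, and a real beamformer would also shade the driver amplitudes:

```python
import math

SPEED_OF_SOUND = 343.0  # meters per second

def steering_delays(num_speakers, spacing_m, angle_deg):
    """Per-driver delays (seconds) that steer the beam of a horizontal
    line array toward angle_deg (0 = broadside, positive = toward the
    far end of the array). Classic delay-and-sum: successive drivers are
    delayed by (d * sin(theta)) / c, offset so all delays are >= 0."""
    theta = math.radians(angle_deg)
    raw = [i * spacing_m * math.sin(theta) / SPEED_OF_SOUND
           for i in range(num_speakers)]
    offset = min(raw)
    return [t - offset for t in raw]
```

At broadside (0 degrees) all delays are zero; steering off-axis produces a monotonic ramp of delays across the array.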
Matrix Encoding In some cases, audio objects may be a desired component of the adaptive audio content, but bandwidth limitations may make it impossible to send both channel/speaker audio and audio objects. In the past, matrix encoding has been used to convey more audio information than is possible for a given delivery system. For example, in the early days of movies where multi-channel audio was generated by the sound mixer, the film format could only provide stereo audio. Matrix encoding was used to intelligently downmix the multi-channel audio into two stereo channels; the stereo channels were then processed with certain algorithms to regenerate a close approximation of the multi-channel mix from the stereo audio. Similarly, audio objects can be intelligently downmixed into the base speaker channels, then extracted using adaptive audio metadata and sophisticated time- and frequency-sensitive next-generation surround algorithms, and rendered spatially correctly by a consumer adaptive audio rendering system.
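A highly simplified sketch of the downmix step follows. It is illustrative only; real matrix encoders also apply +/-90 degree phase shifts to the surround channels, which is omitted here, and the coefficients are the conventional -3 dB values rather than anything specified by this system:

```python
import math

C = 1.0 / math.sqrt(2.0)  # -3 dB weight for center and surrounds

def matrix_downmix(frame):
    """Fold one 5.1 sample frame (L, R, C, LFE, Ls, Rs) into a stereo
    (Lt, Rt) pair: center is split equally, each surround folds into
    its own side, and (in this simplified sketch) LFE is dropped."""
    L, R, Cc, LFE, Ls, Rs = frame
    Lt = L + C * Cc + C * Ls
    Rt = R + C * Cc + C * Rs
    return (Lt, Rt)
```

A center-only frame lands equally in both output channels, while a left-only frame stays entirely in Lt, which is what allows a decoder to approximately re-separate the channels later.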
In addition, when there are bandwidth limitations in the audio transmission system (e.g., 3G and 4G wireless applications), there are also benefits to transmitting spatially diverse multi-channel beds matrix-encoded together with individual audio objects. One use case of such a transmission methodology would be the transmission of a sports broadcast using two different audio beds and multiple audio objects. The audio beds could represent multi-channel audio captured in two different team seating sections, and the audio objects could represent different announcers who may be sympathetic to one team or the other. Using a standard 5.1 encoding of each bed together with the two or more objects could exceed the bandwidth constraints of the transmission system. In this case, if each of the 5.1 beds is matrix-encoded to a stereo signal, the two beds originally captured as 5.1 channels can be transmitted as only four channels of audio (two-channel bed 1, two-channel bed 2, object 1 and object 2) rather than 5.1 + 5.1 + 2, or 12.1, channels.
Position- and Content-Dependent Processing The adaptive audio ecosystem allows the content
creator to create individual audio objects and add information about the content that can be
conveyed to the playback system. This allows a great deal of flexibility in the processing of the
audio before playback. Processing can be adapted to object position and type through dynamic
control of speaker virtualization based on object position and size. Speaker virtualization refers to
a method of processing audio such that a virtual speaker is perceived by a listener. This method
is often used for stereo speaker reproduction when the source audio is multi-channel audio
including surround speaker channel feeds. Virtual speaker processing modifies the surround speaker channel audio so that, when played on the stereo speakers, the surround audio elements are virtualized to the side and back of the listener as if virtual speakers were located there. Currently, the position attribute of the virtual speaker is static, since the intended position of the surround speakers was fixed. With adaptive audio content, however, the spatial positions of the various audio objects are dynamic and distinct (i.e., unique to each object). Post-processing such as virtual speaker virtualization can now be controlled in a more informed way, dynamically adjusting parameters such as speaker position angle for each object and then combining the rendered outputs of the several virtualized objects, to create a more immersive audio experience that more closely represents the intent of the sound mixer.
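The per-object control described above can be illustrated with a minimal constant-power panner whose angle is re-derived from each object's metadata. This is a sketch under assumed names; an actual virtualizer would apply HRTF-style filtering rather than simple gains:

```python
import math

def pan_gains(angle_deg):
    """Constant-power stereo panning gains for a virtual source at
    angle_deg, where -45 is hard left, 0 is center, +45 is hard right."""
    # Map [-45, +45] degrees onto [0, pi/2] for the sine/cosine pan law.
    x = (angle_deg + 45.0) / 90.0 * (math.pi / 2.0)
    return (math.cos(x), math.sin(x))  # (left_gain, right_gain)

def render_objects(objects, num_samples):
    """Mix per-object audio to stereo, re-deriving the pan gains from
    each object's (possibly time-varying) position metadata."""
    left = [0.0] * num_samples
    right = [0.0] * num_samples
    for obj in objects:
        gl, gr = pan_gains(obj["angle_deg"])
        for i, s in enumerate(obj["samples"]):
            left[i] += gl * s
            right[i] += gr * s
    return left, right
```

Because the gains are recomputed per object, a moving object simply carries a changing `angle_deg` in its metadata instead of being baked into fixed channel feeds.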
In addition to the standard horizontal virtualization of audio objects, perceptual height cues can be used to process fixed-channel and dynamic object audio so that a perception of height reproduction is obtained from a standard pair of stereo speakers in a normal horizontal position.
Certain effects or enhancement processes can be applied to the appropriate type of audio content
based on prudent judgment.
For example, dialog enhancement may be applied only to dialog objects. Dialog enhancement
refers to a method of processing audio that includes dialog so that the dialog's audibility and / or
intelligibility is enhanced and / or improved. In many cases, the audio processing applied to the
dialog is inappropriate for non-dialog audio content (i.e., music, environmental effects, etc.) and
can lead to unpleasant audible artifacts. In adaptive audio, an audio object may contain only the dialog in a piece of content and can be labeled accordingly, so that the rendering solution selectively applies dialog enhancement only to the dialog content. Furthermore, if the audio object contains only dialog (and not a mixture of dialog and other content, as is often the case), the dialog enhancement processing can be applied to the dialog exclusively, thereby restricting any such processing from being applied to any other content.
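The metadata-gated behavior can be sketched as follows, with a simple level boost standing in for a full dialog-enhancement chain; the object schema and names are assumptions:

```python
def enhance_dialog(objects, gain_db=6.0):
    """Apply a level boost (a stand-in for real dialog enhancement)
    only to objects whose metadata labels them as dialog; music,
    effects and other content pass through untouched."""
    gain = 10.0 ** (gain_db / 20.0)
    out = []
    for obj in objects:
        samples = obj["samples"]
        if obj.get("content_type") == "dialog":
            samples = [gain * s for s in samples]
        out.append({**obj, "samples": samples})
    return out
```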
Similarly, audio response or equalization management can be tailored to particular audio characteristics: for example, bass management (filtering, attenuation, gain) targeted at specific objects based on object type. Bass management refers to selectively isolating and processing only the bass (or lower) frequencies in a particular piece of content. In current audio systems and delivery mechanisms this is a "blind" process applied to all of the audio. With adaptive audio, the specific audio objects for which bass management is appropriate can be identified by the metadata, and the rendering process can apply it appropriately.
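A minimal sketch of metadata-selective bass management follows. It is illustrative only: a one-pole low-pass stands in for a proper crossover filter, and the object schema is an assumption:

```python
def one_pole_lowpass(samples, alpha=0.1):
    """Crude one-pole low-pass filter used here as a stand-in for the
    bass-extraction stage of bass management."""
    out, y = [], 0.0
    for x in samples:
        y = y + alpha * (x - y)
        out.append(y)
    return out

def bass_manage(objects):
    """Route the low frequencies of bass-managed objects to a subwoofer
    feed. Each object is a dict with 'samples' and a 'bass_manage'
    metadata flag; non-flagged objects (e.g. dialog) bypass the process."""
    sub_feed = [0.0] * max(len(o["samples"]) for o in objects)
    for obj in objects:
        if obj.get("bass_manage"):
            for i, v in enumerate(one_pole_lowpass(obj["samples"])):
                sub_feed[i] += v
    return sub_feed
```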
Adaptive audio systems also facilitate object-based dynamic range compression. Traditional audio
tracks have the same duration as the content itself. On the other hand, audio objects may appear
for a limited amount of time in the content. The metadata associated with an object may include
level relationship information for its average and peak signal amplitude and its onset or attack
time (especially for transient material). This information allows the compressor to adapt its compression behavior and time constants (attack, release, etc.) to better fit the content.
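How the metadata might steer those time constants can be sketched as follows. This is a hypothetical illustration, not the patent's method; the metadata key, the chosen times, and the envelope-follower form are all assumptions:

```python
import math

def compressor_coefficients(metadata, sample_rate=48000.0):
    """Choose compressor time constants from object metadata: transient
    material gets fast attack/release so onsets are caught; sustained
    material gets slower constants to avoid pumping. Returns the
    one-pole smoothing coefficients for the envelope follower."""
    if metadata.get("transient"):
        attack_s, release_s = 0.001, 0.050
    else:
        attack_s, release_s = 0.010, 0.200
    a = math.exp(-1.0 / (attack_s * sample_rate))
    r = math.exp(-1.0 / (release_s * sample_rate))
    return a, r
```

Shorter time constants yield smaller smoothing coefficients, so a transient-flagged object's envelope follower reacts faster in both directions.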
The system also facilitates automatic loudspeaker-room equalization. Loudspeaker and room acoustics introduce audible coloration to the sound and thereby play a significant role in the timbre of the reproduced sound. Furthermore, the acoustics are position-dependent due to room reflections and loudspeaker directivity, and because of this variation the perceived timbre differs significantly between listening positions. The automatic EQ (automatic room equalization) function provided in this system helps to mitigate several of these issues by providing automatic loudspeaker-room spectrum measurement and equalization; automated time-delay compensation, which provides proper sound imaging and, possibly, (least-squares based) detection of relative speaker positions; level setting; bass redirection based on loudspeaker headroom capability; and optimal splicing of the main loudspeakers with the subwoofer(s). In a home theater or other consumer environment, the adaptive audio system also includes certain additional functions, such as: (1) automated target-curve calculation based on the playback-room acoustics (which is considered an open question in research on equalization in domestic listening rooms); (2) the influence of modal damping control using time-frequency analysis; (3) understanding the parameters derived from measurements that govern envelopment / spread / source width / intelligibility, and controlling them to provide the best possible listening experience; (4) directional filtering incorporating head models for matching timbre between the front and "other" loudspeakers; and (5) detecting the spatial positions of the loudspeakers in a discrete setup relative to the listener, and spatial remapping (e.g., Summit Wireless would be an example). Timbre mismatch between loudspeakers is particularly revealed by certain panned content between the front-anchor loudspeaker (e.g., the center) and the surround/back/wide/height loudspeakers.
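One piece of the automatic EQ chain, correcting the measured response toward a target curve while limiting boost to protect loudspeaker headroom, can be sketched as follows. The band layout, cap value, and names are assumptions for illustration:

```python
def room_eq_gains(measured_db, target_db, max_boost_db=6.0):
    """Per-band correction gains (dB): the difference between a target
    response and the measured loudspeaker-room response, with boosts
    capped so the correction never demands more output headroom than
    the loudspeaker can deliver. Cuts are left uncapped."""
    return [min(t - m, max_boost_db) for m, t in zip(measured_db, target_db)]
```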
Overall, the adaptive audio system also enables a compelling audio/video playback experience, particularly in home environments with larger screen sizes, when the reproduced spatial positions of some audio elements match the corresponding image elements on the screen. An example is spatially matching the dialog in a movie or television program with the person or character speaking on the screen. With ordinary speaker-channel-based audio there is no simple way to determine where the dialog should be spatially positioned to match the position of the person or character on the screen. With the audio information available in the adaptive audio system, this kind of audio/visual alignment can be achieved easily, even in home theater systems with increasingly large screens. Visual and spatial alignment of audio can also be used for non-character/dialog objects such as cars, trucks, animation, and so on.
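The screen/audio alignment can be sketched as a mapping from a normalized on-screen position to a playback azimuth. This is a hypothetical helper; the screen's angular extent as seen from the listening position is an assumed parameter:

```python
def screen_to_azimuth(x_norm, screen_half_angle_deg=20.0):
    """Map a normalized on-screen horizontal position (0.0 = left edge,
    1.0 = right edge) to a playback azimuth in degrees, assuming the
    screen spans +/- screen_half_angle_deg from the listening position."""
    return (2.0 * x_norm - 1.0) * screen_half_angle_deg
```

A dialog object tagged with the speaking character's on-screen position could then be rendered at the matching azimuth regardless of screen size.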
The adaptive audio ecosystem also allows for enhanced content management by allowing the content creator to create individual audio objects and add information about the content that can be conveyed to the playback system. This allows great flexibility in audio content management. From a content management point of view, adaptive audio enables various things, such as changing the language of the audio content just by replacing the dialog objects, and reducing the content file size and/or download time. Movies, television and other
entertainment programs are typically distributed internationally. This often requires that the
language in the piece of content be changed depending on where it is played (French for movies
screened in France, German for television shows screened in Germany, etc.). Today, this often
requires that completely independent audio soundtracks be generated, packaged and distributed
for each language. The inherent notion of adaptive audio systems and audio objects allows the
dialog of a piece of content to be an independent audio object. This allows the language of the
content to be easily changed without updating or changing other elements of the audio
soundtrack, such as music, effects and the like. This applies not only to foreign languages, but also to language inappropriate for certain audiences, targeted advertising, and the like.
Aspects of the audio environments described herein represent the playback of audio or audio/visual content through appropriate speakers and playback devices, and may represent any environment in which a listener experiences playback of the captured content, such as a cinema, concert hall, outdoor theater, home or room, listening booth, automobile, game console, headphone or headset system, public address (PA) system, or any other playback environment. Although the embodiments have mainly been described with respect to examples and implementations in a home theater environment in which spatial audio content is associated with television content, it should be noted that the embodiments may be implemented in other consumer-based systems as well. Spatial audio content comprising object-based audio and channel-based audio may be used in conjunction with any related content (associated audio, video, graphics, etc.), or it may constitute standalone audio content. The playback environment may be any suitable listening environment, from headphones or near-field monitors to small or large rooms, cars, outdoor arenas, concert halls, and the like.
Aspects of the systems described herein can be implemented in an appropriate computer-based sound processing network environment for processing digital or digitized audio files. Portions of the adaptive audio system may include one or more networks comprising any desired number of individual machines, including one or more routers (not shown) that serve to buffer and route the data transmitted among the computers. Such networks may be built on various different network protocols, and may be the Internet, a wide area network (WAN), a local area network (LAN), or any combination thereof. In embodiments in which the network comprises the Internet, one or more machines may be configured to access the Internet through web browser programs.
One or more of the components, blocks, processes or other functional components described above may be implemented through a computer program that controls execution of a processor-based computing device of the system. It should be noted that the various functions disclosed herein may be described using any number of combinations of hardware, firmware, and/or data and/or instructions embodied in various machine-readable or computer-readable media, in terms of their behavioral, register-transfer, logic-component, and/or other characteristics. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, physical (non-transitory), non-volatile storage media in various forms, such as optical, magnetic or semiconductor storage media.
Unless the context clearly requires otherwise, throughout this description and the claims, the words "having," "including," and the like are to be construed in an inclusive sense rather than an exclusive or exhaustive sense; that is, in the sense of "including, but not limited to." Words using the singular or plural number also include the plural or singular number, respectively. Additionally, the words "herein," "hereunder," "above," "below," and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word "or" is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of the items in the list.
Although one or more implementations are described by way of example with specific
embodiments, it is to be understood that one or more implementations are not limited to the
disclosed embodiments. To the contrary, it is intended to cover various modifications and similar
arrangements as would be apparent to those skilled in the art. Therefore, the scope of the
appended claims should be accorded the broadest interpretation so as to encompass all such
modifications and similar arrangements.