Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2016533045
Abstract: Embodiments of the present invention relate to adaptive audio content generation. In particular, a method is provided for generating adaptive audio content. The method includes extracting at least one audio object from channel-based source audio content, and generating adaptive audio content based at least in part on the at least one audio object. Corresponding systems and computer program products are also disclosed.
Surround sound field generation
[0001]
Cross Reference to Related Applications: This application claims the benefit of priority from Chinese Patent Application No. 201310246729.2, filed on June 18, 2013, and US Provisional Patent Application No. 61/839,474, filed on June 26, 2013. The contents of both applications are hereby incorporated by reference in their entirety.
[0002]
TECHNICAL FIELD This application relates to signal processing. More specifically, embodiments
of the present invention relate to the generation of a surround sound field.
[0003]
Traditionally, surround sound fields are generated by dedicated surround recording facilities, or by professional sound mixing engineers or software applications that pan sound sources into different channels. Neither of these two approaches is easily accessible to end users. In the past few decades, increasingly popular mobile devices, such as cell phones, tablets, media players and gaming consoles, have become equipped with audio capture and/or processing capabilities. However, most such mobile devices are only capable of mono audio capture.
[0004]
Several approaches have been proposed for surround sound field generation using mobile devices. However, those approaches do not take into account the nature of ad hoc networks of commonly used, non-professional mobile devices. For example, when creating a surround sound field using an ad hoc network of heterogeneous user devices, the recording times of the different mobile devices may not be synchronized, and the locations and topology of the mobile devices may be unknown. Furthermore, the gains and frequency responses of the audio capture devices may differ. As a result, it is currently not possible to generate a surround sound field effectively and efficiently using the audio capture devices of everyday users.
[0005]
In view of the above, there is a need in the art for a solution that can generate a surround sound
field in an effective and efficient manner.
[0006]
To address the above and other potential problems, embodiments of the present invention
propose methods, apparatus and computer program products for generating a surround sound
field.
[0007]
In one aspect, embodiments of the present invention provide a method of generating a surround
sound field.
The method comprises: receiving audio signals captured by a plurality of audio capture devices; estimating a topology of the plurality of audio capture devices; and generating a surround sound field from the received audio signals based at least in part on the estimated topology. Embodiments of this aspect also include a corresponding computer program product having a computer program tangibly embodied on a machine readable medium for performing the above method.
[0008]
In another aspect, embodiments of the present invention provide an apparatus for generating a surround sound field. The apparatus comprises: a receiving unit configured to receive audio signals captured by a plurality of audio capture devices; a topology estimation unit configured to estimate the topology of the plurality of audio capture devices; and a generation unit configured to generate a surround sound field from the received audio signals based at least in part on the estimated topology.
[0009]
These embodiments of the invention can be implemented to realize one or more of the following
advantages. According to embodiments of the present invention, surround sound may be
generated by the use of an ad hoc network of end user audio capture devices, such as
microphones provided on mobile phones. Thus, the need for expensive and complex professional
equipment and / or human professionals can be eliminated. Furthermore, by dynamically
generating the surround sound field based on the estimation of the topology of the audio capture
device, the quality of the surround sound field can be maintained at a higher level.
[0010]
Other features and advantages of the embodiments of the present invention will be understood
from the following description of the exemplary embodiments when read in conjunction with the
accompanying drawings. The drawings illustrate, by way of example, the spirit and principles of
the present invention.
[0011]
The details of one or more embodiments of the invention are set forth in the accompanying
drawings and the description below. Other features, aspects and advantages of the present
invention will be apparent from the description, the drawings and the claims. FIG. 1 is a block diagram illustrating a system in which an exemplary embodiment of the invention may be implemented. FIGS. 2A-2C are schematic diagrams showing some examples of the topology of audio capture devices according to exemplary embodiments of the invention. FIG. 3 is a flow chart illustrating a method of generating a surround sound field according to an exemplary embodiment of the present invention. FIGS. 4A-4C are schematic diagrams showing the polar patterns for the W, X and Y channels, respectively, in B-format processing for various frequencies when using an exemplary mapping matrix. FIGS. 5A-5C are schematic diagrams showing the polar patterns for the W, X and Y channels, respectively, in B-format processing for various frequencies when using another exemplary mapping matrix. FIG. 6 is a block diagram of an apparatus for generating a surround sound field according to an exemplary embodiment of the present invention. FIG. 7 is a block diagram of a user terminal for implementing an exemplary embodiment of the invention. FIG. 8 is a block diagram illustrating a computer system for implementing certain exemplary embodiments of the present invention. In the drawings, the same or similar reference symbols indicate the same or similar elements.
[0012]
In general, embodiments of the present invention provide methods, apparatus and computer
program products for surround sound field generation. According to embodiments of the present
invention, a surround sound field may be effectively and accurately generated by use of an ad
hoc network of audio capture devices such as end user's mobile phones. Some embodiments of
the present invention are detailed below.
[0013]
Referring first to FIG. 1, there is shown a system 100 in which embodiments of the present invention can be implemented. As shown, the system 100 includes a plurality of audio capture devices 101 and a server 102. According to embodiments of the present invention, the audio capture devices 101 may, among other things, capture, record and/or process audio signals. Examples of the audio capture device 101 include, but are not limited to, cell phones, personal digital assistants (PDAs), laptops, tablet computers, personal computers (PCs), or any other suitable user terminals with audio capture capabilities. For example, commercially available mobile phones typically include at least one microphone, and thus can be used as the audio capture device 101.
[0014]
According to embodiments of the present invention, the audio capture devices 101 may be arranged in one or more ad hoc networks or groups 103, each including one or more audio capture devices. The audio capture devices may be grouped according to a predetermined strategy or dynamically, as will be described later. Different groups can be located at the same or different physical locations. Within each group, the audio capture devices are located at the same physical location, close to one another.
[0015]
FIGS. 2A-2C show some examples of groups of three audio capture devices. In the exemplary embodiments shown in FIGS. 2A-2C, the audio capture device 101 may be any portable user terminal, such as a cell phone or a PDA, comprising an audio capture element 201, such as one or more microphones, to capture audio signals. In particular, in the exemplary embodiment shown in FIG. 2C, the audio capture device 101 further comprises a video capture element 202, such as a camera, and may be configured to capture video and/or images while capturing audio signals.
[0016]
It should be noted that the number of audio capture devices in a group is not limited to three.
Rather, any suitable number of audio capture devices may be arranged as a group. Furthermore,
within a group, the plurality of audio capture devices may be arranged in any desired topology.
In some embodiments, the audio capture devices in a group may communicate with each other via a computer network, Bluetooth, infrared or telecommunication connections, to name but a few.
[0017]
With continued reference to FIG. 1, as shown, the server 102 is communicatively connected to the groups of audio capture devices 101 via network connections. The audio capture devices 101 and the server 102 may communicate with each other, for example, by means of a computer network such as a local area network (LAN), a wide area network (WAN) or the Internet, a communication network, a near field communication connection, or any combination thereof. The scope of the present invention is not limited in this regard.
[0018]
In operation, generation of the surround sound field may be initiated by the audio capture device
101 or by the server 102. Specifically, in some embodiments, the audio capture device 101 may
log into the server 102 and request the server 102 to generate a surround sound field. In that
case, the audio capture device 101 sending the request becomes the master device and then
sends an invitation to the other capture devices to join the audio capture session. In this regard,
there may be predefined groups to which the master device belongs. In these embodiments,
other audio capture devices in this group receive an invitation from the master device and
participate in an audio capture session accordingly. Alternatively or additionally, another audio
capture device or devices may be identified dynamically and grouped together with the master
device. For example, if a location service such as GPS (Global Positioning System) is available to the audio capture devices 101, one or more audio capture devices located in the vicinity of the master device may be automatically invited to join the audio capture group. In some alternative embodiments, audio capture device discovery and grouping may be performed by the server 102.
[0019]
In forming a group of audio capture devices, the server 102 sends capture commands to all audio
capture devices in the group. Alternatively, the capture command may be sent by one of the
audio capture devices 101 in the group, for example by the master device. Each audio capture
device in the group starts capturing and recording audio signals immediately after receiving the
capture command. An audio capture session ends when any audio capture device ceases capture.
During audio capture, audio signals may be recorded locally on the audio capture device 101 and
transmitted to the server 102 after completion of the capture session. Alternatively, captured
audio signals may be streamed to server 102 in real time.
[0020]
According to embodiments of the present invention, the audio signals captured by a single group of audio capture devices 101 are assigned the same group identification (ID), such that the server 102 can identify whether incoming audio signals belong to the same group. Further, in addition to the audio signals, any information associated with the audio capture session may be transmitted to the server 102, including the number of audio capture devices 101 in the group, the parameters of one or more of the audio capture devices 101, and the like.
[0021]
Based on the audio signals captured by a group of capture devices 101, server 102 performs a
series of operations to process the audio signals to generate a surround sound field. In this
regard, FIG. 3 shows a flow chart of a method of generating a surround sound field from audio
signals captured by a plurality of capture devices 101.
[0022]
As shown in FIG. 3, upon receiving the audio signals captured by a group of audio capture devices 101 in step S301, the topology of these audio capture devices is estimated in step S302. Estimating the topology of the positions of the audio capture devices 101 within a group is important for the subsequent spatial processing, which has a direct influence on the reproduction of the sound field. According to embodiments of the present invention, the topology of the audio capture devices may be estimated in various ways. For example, in some embodiments, the topology of the audio capture devices 101 may be predefined and thus known to the server 102. In this case, the server 102 may use the group ID to determine the source group of the audio signals, and then obtain the predefined topology associated with the determined group as the topology estimate.
[0023]
Alternatively or additionally, the topology of the audio capture device 101 may be estimated
based on the distance between each pair of audio capture devices 101 in a group. There are
many possible ways in which the distance between a pair of audio capture devices 101 can be
obtained. For example, in embodiments where the audio capture device can play audio, each
audio capture device 101 may be configured to simultaneously play audio pieces and receive
audio signals from other devices in the group. That is, each audio capture device 101 broadcasts
a unique audio signal to the other members of the group. As an example, each audio capture
device may reproduce a linear chirp signal that spans a unique frequency range and / or has any
other unique acoustic features. By recording the times at which the linear chirp signal is received,
the distance between each pair of audio capture devices 101 can be calculated by an acoustic
ranging process. The acoustic ranging process is known to the person skilled in the art and is
therefore not detailed here.
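To make the ranging step concrete, the following is a minimal sketch of one well-known pairwise scheme of this kind (a BeepBeep-style two-way beep exchange); it is an illustration, not the patent's specific procedure, and it reuses the receive-time notation R_AA, R_BA, R_AB, R_BB introduced in paragraph [0031] below.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def pairwise_distance(r_AA, r_BA, r_BB, r_AB, c=SPEED_OF_SOUND):
    """Estimate the distance between devices A and B from local receive times.

    r_AA: time (on A's clock) at which A hears its own chirp
    r_BA: time (on A's clock) at which A hears B's chirp
    r_BB: time (on B's clock) at which B hears its own chirp
    r_AB: time (on B's clock) at which B hears A's chirp

    Each device only measures time differences on its own clock, so the
    unknown clock offsets and emission times cancel out of the sum below.
    """
    delta_A = r_BA - r_AA          # measured entirely by device A
    delta_B = r_AB - r_BB          # measured entirely by device B
    return c * (delta_A + delta_B) / 2.0
```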
[0024]
Such distance calculations may be performed at the server 102, for example. Alternatively, the distance calculations may be performed on the client side if the audio capture devices can communicate directly with one another. At the server 102, no additional processing is required if there are only two audio capture devices 101 in the group. When there are three or more audio capture devices 101, in some embodiments, multidimensional scaling (MDS) analysis or a similar process may be performed on the obtained distances to estimate the topology of the audio capture devices. Specifically, using an input matrix that indicates the pairwise distances of the audio capture devices 101, MDS may be applied to generate the coordinates of the audio capture devices 101 in a two-dimensional space. For example, assume that the measured distance matrix for a group of three devices is given as follows. Then the output of the two-dimensional (2D) MDS, indicating the topology of the audio capture devices 101, is M1 (0, -0.0441), M2 (-0.0750, 0.0220) and M3 (0.0750, 0.0220).
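As an illustration of this step, here is a minimal classical-MDS sketch in Python. Since the measured distance matrix itself is not reproduced in this text, the matrix below is an assumption chosen to be consistent with the quoted MDS output (pairwise distances of 0.10 m, 0.10 m and 0.15 m).

```python
import numpy as np

def classical_mds(D, dims=2):
    """Classical multidimensional scaling: pairwise distances -> coordinates."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    eigval, eigvec = np.linalg.eigh(B)
    order = np.argsort(eigval)[::-1][:dims]    # keep the largest eigenvalues
    return eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0.0))

# Assumed distance matrix (meters), consistent with the quoted MDS output:
# |M1M2| = |M1M3| = 0.10, |M2M3| = 0.15.
D = np.array([[0.00, 0.10, 0.10],
              [0.10, 0.00, 0.15],
              [0.10, 0.15, 0.00]])
coords = classical_mds(D)
print(coords.round(4))  # recovers the topology up to rotation/reflection
```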
[0025]
It should be noted that the scope of the present invention is not limited to the examples given
above. Any suitable method capable of estimating the distance between a pair of audio capture
devices, now known or later developed, may be used in the context of embodiments of the
present invention. For example, instead of playing back an audio signal, audio capture device 101
may be configured to broadcast electrical and / or optical signals to one another to facilitate
distance estimation.
[0026]
Method 300 then proceeds to step S303. Here, time alignment is performed on the audio signals
received in step S301. The audio signals captured by the different capture devices 101 are
thereby aligned with one another in time. According to embodiments of the present invention,
time alignment of audio signals may be done in many possible ways. In some embodiments,
server 102 may implement a protocol based clock synchronization process. For example,
Network Time Protocol (NTP) provides accurate and synchronized time across the Internet.
When connected to the Internet, each audio capture device 101 may be configured to
synchronize separately with an NTP server while performing audio capture. There is no need to
adjust the local clock. Instead, the offset between the local clock and the NTP server can be
calculated and stored as metadata. Once audio capture is complete, the local time and its offset
are sent to server 102 along with the audio signal. The server 102 then aligns the received audio
signal based on such time information.
[0027]
Alternatively or additionally, time alignment in step S303 may be realized by a peer-to-peer clock
synchronization process. In these embodiments, the audio capture devices may be in peer-to-peer
communication with one another via a protocol such as, for example, Bluetooth or infrared
connection. One of the audio capture devices may be selected as the synchronization master, and
clock offsets of all other capture devices may be calculated relative to the synchronization
master.
[0028]
Another possible implementation is cross-correlation based time alignment. As is known, a series of cross-correlation coefficients between a pair of input signals x(i) and y(i) can be calculated as
[0029]
$r_{xy}(d) = \dfrac{\sum_{i=1}^{N}\,(x(i)-\bar{x})\,(y(i-d)-\bar{y})}{\sqrt{\sum_{i=1}^{N}(x(i)-\bar{x})^{2}}\;\sqrt{\sum_{i=1}^{N}(y(i-d)-\bar{y})^{2}}}$
where $\bar{x}$ and $\bar{y}$ represent the averages of x(i) and y(i), N represents the length of x(i) and y(i), and d represents the time lag between the two sequences. The delay between the two signals can then be calculated as $D = \arg\max_{d}\, r_{xy}(d)$.
[0030]
The signal y(i) can then be time aligned to x(i) by y′(i) = y(i − D), using x(i) as a reference.
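A simplified sketch of such cross-correlation based alignment is shown below; a practical implementation would use FFT-based correlation and proper zero-padding instead of the brute-force loop and circular shift used here.

```python
import numpy as np

def align_by_xcorr(x, y, max_lag):
    """Estimate the lag D that best aligns y to x, then shift y accordingly.

    Restricting the search to [-max_lag, max_lag] keeps the computation
    cheap; the calibration-signal trick described below shrinks it further.
    """
    x = x - x.mean()
    y = y - y.mean()
    best_lag, best_r = 0, -np.inf
    for d in range(-max_lag, max_lag + 1):
        if d >= 0:
            a, b = x[d:], y[:len(y) - d]
        else:
            a, b = x[:d], y[-d:]
        n = min(len(a), len(b))
        r = np.dot(a[:n], b[:n]) / (
            np.linalg.norm(a[:n]) * np.linalg.norm(b[:n]) + 1e-12)
        if r > best_r:
            best_r, best_lag = r, d
    return np.roll(y, best_lag), best_lag  # y(i - D) aligned to x(i)
```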
[0031]
Temporal alignment can be achieved by applying a cross-correlation process, but if the search range is large, this process can be time consuming and prone to errors. However, in practice, the search range must be quite long to accommodate large network delay variations. To address this issue, information about the calibration signal emitted by the audio capture devices 101 may be collected and sent to the server 102, to be used to reduce the search range of the cross-correlation process. As mentioned above, in some embodiments of the present invention, the audio capture devices 101 may broadcast audio signals to the other members of the group at the beginning of the audio capture, which facilitates the calculation of the distance between each pair of audio capture devices 101. In these embodiments, the broadcasted audio signal can be used as a calibration signal to reduce the time taken for signal correlation. Specifically, considering two audio capture devices A and B in a group: S_A is the time at which device A issues a command to reproduce a calibration signal; S_B is the time at which device B issues a command to reproduce a calibration signal; R_AA is the time at which device A receives the signal transmitted by device A; R_BA is the time at which device A receives the signal transmitted by device B; R_BB is the time at which device B receives the signal transmitted by device B; and R_AB is the time at which device B receives the signal transmitted by device A. One or more of these times may be recorded by the audio capture devices 101 and sent to the server 102 for use in the cross-correlation process.
[0032]
In general, the acoustic propagation delay from device A to device B is less than the network delay difference; that is, S_B − S_A > R_AB − S_A. Thus, the time points R_BA and R_BB can be used to initiate the cross-correlation based time alignment process. In other words, only audio signal samples after the time points R_BA and R_BB are included in the correlation calculation. In this way, the search range can be reduced, thus improving the efficiency of the time alignment.
[0033]
However, the network delay difference may sometimes be smaller than the acoustic propagation delay. This can occur when the network has very low jitter and/or the two devices are located far apart. In this case, the time points S_B and S_A can be used as starting points for the cross-correlation process. Specifically, since the audio signal after the time points S_B and S_A contains the calibration signal, R_BA can be used as a starting point of correlation for device A, and S_B + (R_BA − S_A) can be used as a starting point of correlation for device B.
[0034]
It will be appreciated that the above described mechanisms for time alignment may be combined
in any suitable manner. For example, in some embodiments of the present invention, time
alignment can be a three step process. First, coarse time synchronization may be performed
between the audio capture device 101 and the server 102. Next, calibration signals as discussed
above may be used to refine synchronization. Finally, cross correlation analysis is applied to
complete the time alignment of the audio signal.
[0035]
It should be noted that the time alignment in step S303 is optional. For example, if
communication and / or device conditions are good enough, it makes sense to consider that all
audio capture devices 101 receive capture commands almost simultaneously, thus initiating
audio capture at the same time. Furthermore, it will be readily appreciated that in some
applications where the quality of the surround sound field is not very sensitive, some
misalignment of the audio capture start time may be acceptable or negligible. In these situations,
the time alignment in step S303 can be omitted.
[0036]
In particular, it should be noted that step S302 is not necessarily performed prior to S303.
Instead, in some alternative embodiments, time alignment of the audio signal may be performed
prior to or even in parallel with topology estimation. For example, clock synchronization
processes such as NTP synchronization or peer-to-peer synchronization can be performed prior
to topology estimation. Depending on the acoustic ranging approach, such clock synchronization
process may be useful for acoustic ranging in topology estimation.
[0037]
Still referring to FIG. 3, in step S304, the surround sound field is generated from the received audio signals (possibly after time alignment), based at least in part on the topology estimated in step S302. To this end, according to some embodiments, the mode for processing the audio
signal may be selected based on the number of audio capture devices. For example, if there are
only two audio capture devices 101 in a group, those two audio signals may simply be combined
to produce a stereo output. Optionally, some post-processing may be performed, including but
not limited to stereo sound image widening, multi-channel upmixing, etc. On the other hand,
when there are three or more audio capture devices 101 in a group, Ambisonics or B format
processing may be applied to generate a surround sound field. It should be noted that an
adaptive choice of processing mode is not necessarily required. For example, even if there are
only two audio capture devices, the surround sound field may be generated by processing the
captured audio signal with B format processing.
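A minimal sketch of such a mode-selection rule is given below; the two-device and three-or-more-device branches follow the text above, while the single-device fallback is an added assumption.

```python
def select_processing_mode(num_devices):
    """Pick a processing mode from the group size, per step S304."""
    if num_devices < 2:
        return "mono"       # nothing to spatialize (assumed fallback)
    if num_devices == 2:
        return "stereo"     # combine, then optional widening/upmixing
    return "b-format"       # Ambisonics processing for 3+ devices
```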
[0038]
Next, some embodiments of the invention concerning how to generate a surround sound field are discussed with reference to Ambisonics processing. However, it should be noted that the scope of
the present invention is not limited in this regard. Any suitable technique capable of generating a
surround sound field from the received audio signal based on the estimated topology may be
used in connection with embodiments of the present invention. For example, binaural or 5.1
channel surround sound generation techniques may be utilized.
[0039]
Ambisonics is known as a flexible spatial audio processing technique that provides sound field and source position recoverability. In Ambisonics, a 3D surround sound field is recorded as a four-channel signal called B-format, with W, X, Y and Z channels. The W channel contains omnidirectional sound pressure information, while the remaining three channels X, Y and Z represent sound velocity information measured along the three corresponding axes of 3D Cartesian coordinates. Specifically, given a localized sound source S at azimuth angle φ and elevation angle θ, the ideal B-format representation of the surround sound field is:
$W = S\cdot\tfrac{1}{\sqrt{2}},\quad X = S\cdot\cos\varphi\,\cos\theta,\quad Y = S\cdot\sin\varphi\,\cos\theta,\quad Z = S\cdot\sin\theta.$
[0040]
For simplicity, in the following discussion of directional pattern generation for B-format signals,
only the W, X and Y channels in the horizontal plane are considered, and the height axis Z is
ignored. This is a reasonable assumption, as there is generally no height information in the way
audio signals are captured by the audio capture device 101 according to embodiments of the
present invention.
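Under this horizontal-only assumption (θ = 0), the ideal encoding of paragraph [0039] reduces to a few lines; the sketch below illustrates it for a mono source at a given azimuth.

```python
import numpy as np

def encode_b_format_horizontal(s, azimuth):
    """Encode a mono source s (1-D sample array) at the given azimuth
    (radians) into horizontal-only B-format channels W, X, Y (Z ignored)."""
    w = s * (1.0 / np.sqrt(2.0))   # omnidirectional pressure channel
    x = s * np.cos(azimuth)        # front-back velocity component
    y = s * np.sin(azimuth)        # left-right velocity component
    return w, x, y
```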
[0041]
Given a plane wave, the directivity of the discrete array can be expressed as
$D(f,\varphi) = \sum_{n=1}^{N} A_n(f,r)\, e^{\,j\frac{2\pi f}{c}\,\mathbf{r}_n\cdot\boldsymbol{\alpha}}$
[0042]
Here, $\mathbf{r}_n = R\,[\cos\varphi_n,\ \sin\varphi_n,\ 0]$ represents the spatial position of the n-th audio capture device, with distance R to the center and angle φ_n, and the vector α = [cos φ, sin φ, 0] represents the source position at angle φ. Furthermore, A_n(f, r) represents the weight for the audio capture device, which is the product of a user-defined weight and the gain of the audio capture device at a particular frequency and angle: A_n(f, r) = W_n(f)·Γ(φ), where Γ(φ) = β + (1 − β)cos(φ). Here, β = 0.5 yields a cardioid polar pattern, β = 0.7 a subcardioid pattern, and β = 1 an omnidirectional pattern.
[0043]
Once the polar pattern and position topology of the audio capture devices have been determined, it can be seen that the weight W_n(f) for each captured audio signal affects the quality of the generated surround sound field. Different weights W_n(f) produce different qualities of the B-format signal. The weights for the various audio signals may be expressed as a mapping matrix. Taking the topology shown in FIG. 2A as an example, a mapping matrix (W) of the audio signals M1, M2 and M3 to the W, X and Y channels may be defined as follows.
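In signal terms, applying such a mapping matrix is a single matrix product per block of samples, as sketched below; the numeric matrix shown is a hypothetical placeholder, since the pre-designed matrix itself is not reproduced in this text.

```python
import numpy as np

def apply_mapping_matrix(A, mic_signals):
    """Map captured signals to B-format channels:
    [W, X, Y]^T = A @ [M1, M2, M3]^T, where A holds per-channel weights."""
    return A @ mic_signals                       # shape: (3, num_samples)

# Hypothetical placeholder weights, for illustration only.
A_example = np.array([[0.33,  0.33,  0.33],      # W: roughly equal pressure sum
                      [0.00, -0.71,  0.71],      # X: hypothetical difference weights
                      [0.94, -0.47, -0.47]])     # Y: hypothetical difference weights

mics = np.random.randn(3, 48000)                 # M1, M2, M3: one second at 48 kHz
w, x, y = apply_mapping_matrix(A_example, mics)
```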
[0044]
Traditionally, B format signals are generated using a specially designed (often very expensive)
microphone array, such as a professional sound field microphone. In this case, the mapping
matrix may be designed in advance and may remain unchanged during operation. However,
according to an embodiment of the present invention, audio signals are captured by an ad hoc
network of audio capture devices that are dynamically grouped, with potentially changing topologies. As a result, existing solutions may not be applicable for generating the W, X and Y channels from raw audio signals captured by user devices that are not specifically designed and positioned. For example, suppose a group includes three audio capture devices 101 at angles of π/2, 3π/4 and 3π/2, each at the same distance of 4 cm to the center. FIGS. 4A-4C show the polar patterns for the W, X and Y channels, respectively, for various frequencies when using the original mapping matrix described above. As can be seen, the outputs of the X and Y channels are not correct, because they are no longer orthogonal to one another. Furthermore, problems occur for the W channel even at frequencies as low as 1000 Hz. Therefore, it is desirable that the mapping matrix can be flexibly adapted to ensure high quality of the generated surround sound field.
[0045]
To this end, according to embodiments of the present invention, the weights for each audio signal, represented by the mapping matrix, may be dynamically adapted based on the topology of the audio capture devices estimated in step S302. Continuing with the above exemplary topology, in which the three audio capture devices 101 are at angles of π/2, 3π/4 and 3π/2 with the same distance of 4 cm to the center, better results can be achieved if the mapping matrix is adapted to this particular topology. This can be seen from FIGS. 5A-5C, which show the polar patterns for the W, X and Y channels, respectively, for various frequencies in this situation.
[0046]
According to some embodiments, it is possible to select weights for the audio signal on the basis
of the estimated topology of the audio capture device. Alternatively or additionally, the
adaptation of the mapping matrix may be realized based on a predefined template. In these
embodiments, server 102 may maintain a repository that stores a set of predefined topology
templates. Each topology template corresponds to a pre-tuned mapping matrix. For example, the
topology template may be represented by coordinates and / or positional relationships of the
audio capture device. For a given estimated topology, a template may be determined that
matches the estimated topology. There are many ways to identify matched topology templates.
As an example, in one embodiment, the Euclidean distance between the estimated coordinates of
the audio capture device and the coordinates in the template is calculated. The topology template
with the smallest distance is determined as the matching template. Thus, a pre-tuned mapping
matrix corresponding to the determined matched topology template is selected for use in
generating a surround sound field in the form of a B-format signal.
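A minimal sketch of this template-matching step follows; the dictionary layout and the direct coordinate comparison are assumptions for illustration (a real system would also normalize for device ordering and rotation before comparing).

```python
import numpy as np

def match_topology_template(estimated_coords, templates):
    """Return the id of the stored template nearest to the estimated topology.

    templates: dict mapping template_id -> (coords array, mapping matrix).
    Templates are compared by total Euclidean distance between coordinate
    sets, as described in paragraph [0046] above.
    """
    best_id, best_dist = None, np.inf
    for template_id, (coords, _matrix) in templates.items():
        dist = np.linalg.norm(estimated_coords - coords)
        if dist < best_dist:
            best_id, best_dist = template_id, dist
    return best_id
```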
[0047]
In some embodiments, in addition to the determined topology template, the weights of the audio
signals captured by each device can be selected based further on the frequency of those audio
signals. Specifically, for higher frequencies, it is observed that spatial aliasing begins to appear
due to the relatively large spacing between the audio capture devices. In order to further improve
the performance, the choice of mapping matrix in B format processing may be made based on
audio frequency. For example, in some embodiments, each topology template may correspond to at least two mapping matrices. Upon determining the matching topology template, the frequency of the received audio signal is compared to a predefined threshold, and based on the comparison, one of the mapping matrices corresponding to the determined topology template can be selected and used. Using the selected mapping matrix, B-format processing is applied to the received audio signals, thereby generating a surround sound field as discussed above.
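A sketch of this frequency-dependent selection is shown below, assuming two pre-tuned matrices per template and a hypothetical threshold value; the text above only specifies that the signal frequency is compared against a predefined threshold.

```python
import numpy as np

def select_matrix_for_band(template, spectrum_freqs, magnitudes,
                           f_threshold=2000.0):
    """Choose one of a template's mapping matrices by dominant frequency.

    template: (low_freq_matrix, high_freq_matrix) pre-tuned pair.
    f_threshold: hypothetical cutoff in Hz (an assumption for illustration).
    """
    dominant = spectrum_freqs[np.argmax(magnitudes)]  # strongest spectral bin
    low_matrix, high_matrix = template
    return low_matrix if dominant < f_threshold else high_matrix
```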
[0048]
It should be noted that although the surround sound field is shown to be generated based on topology estimation, the scope of the invention is not limited in this regard. For example, in some alternative embodiments where clock synchronization and distance/topology estimation are not available, the sound field may be generated directly from the cross-correlation process applied to the captured audio signals. For example, even if the topology of the audio capture devices is unknown, it is possible to perform a cross-correlation process to achieve some time alignment of the audio signals and to generate the sound field simply by applying a fixed mapping matrix in B-format processing. In this way, time delay differences for dominant sources between different channels can be essentially eliminated. As a result, the sensor distance of the array of audio capture devices is effectively shortened, thereby producing a coincident array.
[0049]
Optionally, method 300 proceeds to step S305, in which a direction of arrival (DOA) of the generated surround sound field with respect to the rendering device is estimated. The surround sound field is then rotated based at least in part on the estimated DOA in step S306. Rotating the generated surround sound field according to the DOA mainly serves to improve the spatial rendering of the surround sound field. When performing B-format based spatial rendering, there is a nominal front, i.e. 0 azimuth, between the left and right audio capture devices. A sound source from this direction is perceived as coming from the front during binaural playback. It is desirable for the target sound source to come from the front, because this is the most natural listening condition. However, due to the ad hoc positioning of the audio capture devices within the group, the user cannot be required to always point the left and right devices towards the main target sound source, e.g. the performance stage. To address this issue, DOA estimation may be performed on the multi-channel input in order to rotate the surround sound field according to the estimated angle θ. In this regard, generalized cross correlation with phase transform (GCC-PHAT), steered response power with phase transform (SRP-PHAT), multiple signal classification (MUSIC), or any other suitable DOA estimation algorithm may be used in the context of embodiments of the present invention. Sound field rotation can then be easily achieved for the B-format signal using a standard rotation matrix, e.g. W′ = W, X′ = X cos θ + Y sin θ, Y′ = −X sin θ + Y cos θ.
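The rotation itself is a two-channel operation on X and Y, as sketched below; the sign convention is a choice, so the sign of θ may need to be flipped depending on the DOA convention used.

```python
import numpy as np

def rotate_b_format(w, x, y, theta):
    """Rotate a horizontal B-format sound field by theta radians.

    W is rotation-invariant; X and Y transform with a standard 2-D rotation.
    """
    x_rot = np.cos(theta) * x + np.sin(theta) * y
    y_rot = -np.sin(theta) * x + np.cos(theta) * y
    return w, x_rot, y_rot
```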
[0050]
In some embodiments, in addition to the DOA, the sound field may be further rotated based on the energy of the generated sound field. In other words, it is possible to find the most dominant sound source in terms of both energy and duration, the goal being to find the best listening angle for the user in the sound field. Let θ_n and E_n represent the short-term estimated DOA and energy for frame n of the generated sound field, respectively, and let N be the total number of frames of the entire generated sound. Furthermore, the central plane is at 0 degrees, and the angle is measured counterclockwise. Each frame then corresponds to a point (θ_n, E_n) in polar-coordinate representation. In one embodiment, the rotation angle θ′ can be determined, for example, by maximizing an objective function over these points.
[0051]
Next, method 300 proceeds to optional step S307, where the generated sound field may be converted to any target format suitable for playback on a rendering device. Consider an example in which the surround sound field is generated as a B-format signal. It will be readily appreciated that, once the B-format signal is generated, the W, X, Y channels can be converted to various formats suitable for spatial rendering. Ambisonics decoding and playback depend on the speaker system used for spatial rendering. In general, decoding of an Ambisonics signal to a set of speaker signals is based on the assumption that the "virtual" Ambisonics signal, which would be recorded at the geometric center of the speaker array when the decoded speaker signals are reproduced, should be identical to the Ambisonics signal used for decoding. This can be expressed as CL = B, where L = {L1, L2, ..., Ln}^T represents the set of speaker signals, B = {W, X, Y, Z}^T represents the "virtual" Ambisonics signal assumed to be identical to the input Ambisonics signal for decoding, and C is known as the "re-encoding" matrix, defined by the geometry of the loudspeaker array, i.e. the azimuth and elevation angles of each loudspeaker. For example, for a square speaker array in which the speakers are placed horizontally at azimuths {45°, −45°, 135°, −135°} and elevation angles {0°, 0°, 0°, 0°}, each column of C contains the B-format encoding coefficients of the corresponding speaker direction, consistent with the encoding equations given above.
[0052]
Based on this, the speaker signals can be derived as L = DB.
[0053]
Here, D represents a decoding matrix typically defined as a pseudoinverse of C.
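For the square array above, the re-encoding and decoding matrices can be built directly from the speaker azimuths, as in the following sketch; the W-row gain of 1/√2 follows the B-format encoding convention used earlier in this text.

```python
import numpy as np

# Square array at azimuths {45, -45, 135, -135} degrees, elevation 0.
az = np.deg2rad([45.0, -45.0, 135.0, -135.0])

# Re-encoding matrix C: each column holds the B-format (W, X, Y, Z)
# coefficients of one speaker direction (Z row is zero, horizontal array).
C = np.vstack([np.full(4, 1.0 / np.sqrt(2.0)),  # W
               np.cos(az),                       # X
               np.sin(az),                       # Y
               np.zeros(4)])                     # Z

D = np.linalg.pinv(C)                            # decoding matrix D = pinv(C)

def decode_to_speakers(b_format):
    """b_format: array of shape (4, num_samples) holding W, X, Y, Z."""
    return D @ b_format                          # L = D B, one row per speaker
```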
[0054]
According to some embodiments, binaural rendering may be desired, where the audio is played through a pair of earphones or headphones, because the user is expected to listen to the audio file on a mobile device. The B-format to binaural conversion can be approximately achieved by summing the speaker array feeds, each filtered by the HRTF matching the speaker position. In spatial listening, a directional sound source travels two different propagation paths to reach the left and right ears, respectively. The result is a difference in arrival time and intensity between the two ear entrance signals, which the human auditory system utilizes to achieve localized hearing. These two propagation paths can be well modeled by a pair of direction-dependent acoustic filters called head-related transfer functions (HRTFs). For example, given a sound source S located in direction φ, the ear entrance signals S_left and S_right can be modeled as S_left = S · H_left,φ and S_right = S · H_right,φ.
[0055]
Here, H_left,φ and H_right,φ represent the HRTFs for direction φ. In practice, the HRTFs for a given direction can be measured using a probe microphone inserted into the subject's (human or dummy head) ear, picking up the response to an impulse or known stimulus located in that direction.
[0056]
These HRTF measurements can be used to synthesize virtual ear entrance signals from a monophonic source. By filtering the source with a pair of HRTFs corresponding to a certain direction and presenting the resulting left and right signals to the listener via headphones or earphones, a sound field with a virtualized sound source in the desired direction can be simulated. Using the four-speaker array described above, the W, X, Y channels can be converted to binaural signals as
$S_{left} = \sum_{n=1}^{4} L_n \cdot H_{left,n},\quad S_{right} = \sum_{n=1}^{4} L_n \cdot H_{right,n}.$
[0057]
Here, H_left,n represents the transfer function from the n-th speaker to the left ear, and H_right,n represents the transfer function from the n-th speaker to the right ear. This can be extended to more speakers.
[0058]
$S_{left} = \sum_{i=1}^{n} L_i \cdot H_{left,i},\quad S_{right} = \sum_{i=1}^{n} L_i \cdot H_{right,i},$ where n represents the total number of speakers.
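A sketch of this speaker-feed summation with HRTF filtering is given below, using time-domain head-related impulse responses (HRIRs); the HRIR data themselves are assumed to be supplied from a measured database, which this text does not specify.

```python
import numpy as np
from scipy.signal import fftconvolve

def binauralize(speaker_feeds, hrirs_left, hrirs_right):
    """Convert virtual speaker feeds to a binaural pair by HRIR filtering.

    speaker_feeds: list of 1-D arrays, one per virtual speaker (L_n above).
    hrirs_left/right: matching lists of head-related impulse responses;
    real HRIR data must be supplied by the caller.
    """
    s_left = sum(fftconvolve(feed, h)
                 for feed, h in zip(speaker_feeds, hrirs_left))
    s_right = sum(fftconvolve(feed, h)
                  for feed, h in zip(speaker_feeds, hrirs_right))
    return s_left, s_right
```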
[0059]
After converting the generated surround sound field into a signal of a suitable format, the server 102 may send the signal to a rendering device for playback. In some embodiments, the rendering device and the audio capture device may be co-located on the same physical terminal.
[0060]
Method 300 ends in step S307.
[0061]
Reference is now made to FIG. 6, which shows a block diagram of an apparatus 600 for generating a surround sound field according to embodiments of the present invention. According to embodiments of the present invention, the apparatus 600 may be located at the server 102 shown in FIG. 1, or otherwise associated with the server 102, and may be configured to perform the method 300 described above with reference to FIG. 3.
[0062]
As shown, in accordance with an embodiment of the present invention, apparatus 600 includes a
receiving unit 601 configured to receive audio signals captured by a plurality of audio capture
devices. The apparatus 600 also comprises a topology estimation unit 602 configured to
estimate the topology of the plurality of audio capture devices. Furthermore, the apparatus 600
comprises a generation unit 603 configured to generate a surround sound field from the received
audio signal based at least in part on the estimated topology.
[0063]
In some exemplary embodiments, the topology estimation unit 602 may comprise: a distance acquisition unit configured to acquire the distance between each pair of the plurality of audio capture devices; and an MDS unit configured to estimate the topology by performing multidimensional scaling (MDS) on the acquired distances.
[0064]
In some exemplary embodiments, the generation unit 603 may include a mode selection unit configured to select a mode for processing the audio signals based on the number of the plurality of audio capture devices. Alternatively or additionally, in some exemplary embodiments, the generation unit 603 may comprise: a template determination unit configured to determine a topology template that matches the estimated topology of the plurality of audio capture devices; a weight selection unit configured to select weights for the audio signals based at least in part on the determined topology template; and a signal processing unit configured to process the audio signals using the selected weights to generate the surround sound field. In some exemplary embodiments, the weight selection unit may comprise a unit configured to select the weights based on the determined topology template and the frequency of the audio signals.
[0065]
In some exemplary embodiments, the apparatus 600 may further include a time alignment unit
604 configured to perform time alignment on the audio signal. In some exemplary embodiments,
time alignment unit 604 is configured to apply at least one of a protocol based clock
synchronization process, a peer to peer clock synchronization process and a cross correlation
process.
[0066]
In some exemplary embodiments, the apparatus 600 further comprises: a DOA estimation unit 605 configured to estimate a direction of arrival (DOA) of the generated surround sound field with respect to the rendering device; and a rotation unit 606 configured to rotate the generated surround sound field based at least in part on the estimated DOA. In some exemplary embodiments, the rotation unit may comprise a unit configured to rotate the generated surround sound field based on the estimated DOA and the energy of the generated surround sound field.
[0067]
In some exemplary embodiments, the apparatus 600 may further include a conversion unit 607 configured to convert the generated surround sound field into a target format for playback on a rendering device. For example, a B-format signal may be converted to a binaural signal or a 5.1-channel surround sound signal.
[0068]
It should be noted that the various units in the device 600 correspond to the steps of the method
300 described above with reference to FIG. As a result, everything that has been said with regard
to FIG. 3 also applies to the device 600, which will not be detailed here.
[0069]
FIG. 7 is a block diagram illustrating a user terminal 700 for implementing an exemplary
embodiment of the invention. User terminal 700 may operate as audio capture device 101 as
discussed herein. In some embodiments, the user terminal 700 may be embodied as a mobile
phone. However, mobile phones only exemplify one type of device that would benefit from
embodiments of the present invention, and thus should not be construed as limiting the scope of
the embodiments of the present invention.
[0070]
As shown, user terminal 700 includes antenna (s) 712 in operative communication with
transmitter 714 and receiver 716. User terminal 700 further includes at least one processor or
controller 720. For example, controller 720 may be comprised of a digital signal processor, a
microprocessor and various analog to digital converters, digital to analog converters, and other
support circuits. The control and information processing functions of the user terminal 700 are
assigned among these devices according to their respective functions. The user terminal 700 also has a user interface including a ringer generator 722, an output device such as an earphone or speaker 724, one or more microphones 726 for capturing audio, a display 728, and user input devices such as a keyboard 730, a joystick or other user input interface, all of which are coupled to the controller 720. The user terminal 700 further includes a battery 734, such as a vibrating battery pack, for providing power to the various circuits required to operate the user terminal 700 and optionally providing mechanical vibration as a detectable output.
[0071]
In some embodiments, user terminal 700 includes a media capture element, such as a camera,
video and / or audio module, in communication with controller 720. The media capture element
may be any means of capturing images, video and / or audio for storage, display or transmission.
For example, in an exemplary embodiment where the media capture element is a camera module
736, the camera module 736 may include a digital camera capable of forming a digital image file
from the captured image. When embodied as a mobile phone, the user terminal 700 may further include a universal identity module (UIM) 738. The UIM 738 is typically a memory device having a processor built in. The UIM 738 may be, for example, a subscriber identity module (SIM), a universal integrated circuit card (UICC), a universal subscriber identity module (USIM), a removable user identity module (R-UIM), or the like. The UIM 738 typically stores information elements related to the subscriber.
[0072]
The user terminal 700 may comprise at least one memory. For example, user terminal 700 may
include volatile memory 740, such as volatile random access memory (RAM), which includes
cache area for temporary storage of data. User terminal 700 may also include other non-volatile
memory 742 that may be embedded and / or removable. Non-volatile memory 742 may
additionally or alternatively include EEPROM, flash memory, and the like. The memory may store
any number of pieces of information, programs and data that the user terminal 700 uses to
implement the functionality of the user terminal 700.
[0073]
Referring to FIG. 8, there is shown a block diagram of an exemplary computer system 800 for implementing embodiments of the present invention. For example, the computer system 800 may function as the server 102 described above. As shown, a central processing unit (CPU) 801 performs various processes in accordance with a program stored in a read only memory (ROM) 802 or a program loaded from a storage unit 808 into a random access memory (RAM) 803. In the RAM 803, data required when the CPU 801 performs the various processes is also stored as needed. The CPU 801, the ROM 802 and the RAM 803 are connected to one another via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
[0074]
The following components are connected to the I/O interface 805: an input unit 806 including a keyboard, a mouse, or the like; an output unit 807 including a display, such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; the storage unit 808 including a hard disk or the like; and a communication unit 809 including a network interface card such as a LAN card, a modem or the like. The communication unit 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 810 as required, so that a computer program read therefrom is installed into the storage unit 808 as needed.
[0075]
When the above-described steps and processes (for example, the method 300) are implemented by software, a program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 811.
[0076]
In general, the various illustrative embodiments of the present invention may be implemented in
hardware or special purpose circuitry, software, logic or any combination thereof.
Some aspects may be implemented in hardware while other aspects may be implemented in
firmware or software that may be executed by a controller, microprocessor or other computing
device. Although various aspects of the exemplary embodiments of the present invention are illustrated and described as block diagrams, flowcharts, or other pictorial representations, it will be appreciated that the blocks, apparatus, systems, techniques, or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controllers or other computing devices, or some combination thereof.
10-05-2019
23
[0077]
For example, apparatus 600 described above may be implemented as hardware, software /
firmware, or any combination thereof. In some embodiments, one or more units in device 600
may be implemented as software modules. Alternatively or additionally, some or all of those units may be implemented using hardware modules such as integrated circuits (ICs), application specific integrated circuits (ASICs), systems on chip (SOCs), field programmable gate arrays (FPGAs), and the like. The scope of the present invention is not limited in this regard.
[0078]
In addition, the various blocks shown in FIG. 3 can be viewed as method steps, as operations resulting from execution of computer program code, and/or as a plurality of coupled logic circuit elements constructed to carry out the associated function(s). For example, embodiments of the present invention include a computer program product having a computer program tangibly embodied on a machine readable medium, the computer program containing program code configured to carry out the method 300 as detailed above.
[0079]
In the context of the present disclosure, a machine-readable medium may be any tangible
medium that can include or store a program for use by or in connection with an instruction
execution system, apparatus or device. The machine readable medium may be a machine
readable signal medium or a machine readable storage medium. The machine readable medium
may comprise an electronic, magnetic, optical, electromagnetic, infrared or semiconductor
system, apparatus or device, or any suitable combination of the foregoing. More specific examples of machine readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read only memory (ROM), an erasable programmable read only memory (EPROM or flash memory), an optical fiber, a portable compact disc read only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
[0080]
Computer program code for performing the methods of the present invention may be written in
any combination of one or more programming languages. The computer program code may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor of the computer or other programmable data processing apparatus, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer, or entirely on the remote computer or server.
[0081]
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Similarly, while several specific implementation details are contained in the above discussion, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of a particular invention. Certain features that are described herein in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
[0082]
Various modifications and adaptations to the above illustrative embodiments of the present
invention will be apparent to those skilled in the art in view of the above description when read
in conjunction with the accompanying drawings. Any and all modifications are still within the
scope of the non-limiting, exemplary embodiments of the present invention. Furthermore, other
embodiments of the invention described herein will occur to those skilled in the art having the
benefit of the teachings presented in the foregoing description and drawings.
[0083]
Thus, the present invention may be embodied in any of the forms described herein. For example, the following enumerated example embodiments (EEEs) describe the structure, features, and functions of some aspects of the present invention.
[EEE1] A method of generating a surround sound field, comprising: receiving audio signals captured by a plurality of audio capture devices; performing time alignment of the received audio signals by applying a cross-correlation process to the received audio signals; and generating a surround sound field from the time aligned audio signals.
[EEE2] The method according to EEE 1, further comprising: receiving information about a calibration signal emitted by the plurality of audio capture devices; and reducing the search range of the cross-correlation process based on the received information about the calibration signal.
[EEE3] The method according to EEE 1 or 2, wherein generating the surround sound field comprises: generating the surround sound field based on a predefined estimate of the topology of the plurality of audio capture devices.
[EEE4] The method according to any one of EEEs 1-3, wherein generating the surround sound field comprises: selecting a mode for processing the audio signals based on the number of the plurality of audio capture devices.
[EEE5] The method according to any one of EEEs 1-4, further comprising: estimating a direction of arrival (DOA) of the generated surround sound field with respect to a rendering device; and rotating the generated surround sound field based at least in part on the estimated DOA.
[EEE6] The method according to EEE 5, wherein rotating the generated surround sound field comprises: rotating the generated surround sound field based on the estimated DOA and the energy of the generated surround sound field.
[EEE7] The method according to any one of EEEs 1-6, further comprising: converting the generated surround sound field into a target format for playback on a rendering device.
[EEE8] An apparatus for generating a surround sound field, comprising: a first receiving unit configured to receive audio signals captured by a plurality of audio capture devices; a time alignment unit configured to perform time alignment of the received audio signals by applying a cross-correlation process to the received audio signals; and a generation unit configured to generate a surround sound field from the time aligned audio signals.
[EEE9] The apparatus according to EEE 8, further comprising: a second receiving unit configured to receive information about a calibration signal emitted by the plurality of audio capture devices; and a reduction unit configured to reduce the search range of the cross-correlation process based on the received information about the calibration signal.
[EEE10] The apparatus according to EEE 8 or 9, wherein the generation unit comprises: a unit configured to generate the surround sound field based on a predefined estimate of the topology of the plurality of audio capture devices.
[EEE11] The apparatus according to any one of EEEs 8-10, wherein the generation unit comprises: a mode selection unit configured to select a mode for processing the audio signals based on the number of the plurality of audio capture devices.
[EEE12] The apparatus according to any one of EEEs 8-11, further comprising: a DOA estimation unit configured to estimate a direction of arrival (DOA) of the generated surround sound field with respect to a rendering device; and a rotation unit configured to rotate the generated surround sound field based at least in part on the estimated DOA.
[EEE13] The apparatus according to EEE 12, wherein the rotation unit comprises: a unit configured to rotate the generated surround sound field based on the estimated DOA and the energy of the generated surround sound field.
[EEE14] The apparatus according to any one of EEEs 8-13, further comprising: a conversion unit configured to convert the generated surround sound field into a target format for playback on a rendering device.
[0084]
It will be understood that the embodiments of the present invention are not to be limited to the specific embodiments disclosed, and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.