Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2011239036
An audio signal conversion apparatus capable of converting an audio signal for a multi-channel system without generating noise caused by discontinuous points is provided. The audio signal conversion apparatus (exemplified by an audio signal processing unit 113) includes: a transform unit that performs a discrete Fourier transform on input audio signals of two channels; a correlation signal extraction unit that extracts a correlation signal from the two-channel audio signals after the discrete Fourier transform by the transform unit while ignoring the DC component; an inverse transform unit that performs an inverse discrete Fourier transform on the correlation signal extracted by the correlation signal extraction unit, on the correlation signal and the non-correlation signal, or on an audio signal generated from the correlation signal or from the correlation signal and the non-correlation signal; and a noise removal unit 122 that removes waveform discontinuities from the audio signal after the inverse discrete Fourier transform by the inverse transform unit.
[Selected figure] Figure 12
Audio signal conversion device, method, program, and recording medium
[0001]
The present invention relates to an audio signal conversion apparatus, method, program, and recording medium for converting an audio signal for a multi-channel reproduction system.
[0002]
Among conventionally proposed sound reproduction methods are the stereo (2ch) method and the 5.1ch surround method (ITU-R BS.775-1), both widely used in consumer applications.
The 2ch system outputs different audio data from the left speaker 11L and the right speaker 11R, as schematically shown in FIG. 1. The 5.1ch surround system, as schematically illustrated in FIG. 2, inputs and outputs different audio data to a left front speaker 21L, a right front speaker 21R, a center speaker 22C disposed between them, a left rear speaker 23L, a right rear speaker 23R, and a subwoofer (not shown) dedicated to the low frequency band (generally 20 Hz to 100 Hz).
[0003]
In addition to the 2ch and 5.1ch surround methods, various sound reproduction methods such as 7.1ch, 9.1ch, and 22.2ch have been proposed. In all of these systems the speakers are arranged on a circle or on a sphere centered on the listener, and it is considered preferable to listen at the listening position equidistant from the speakers, the so-called sweet spot: for example, the sweet spot 12 in the 2ch system and the sweet spot 24 in the 5.1ch surround system. When listening at the sweet spot, the synthetic sound image formed by the balance of sound pressures is localized as the producer intended. Conversely, when listening at a position other than the sweet spot, the sound image and the sound quality generally deteriorate. Hereinafter, these systems are collectively referred to as multi-channel reproduction systems.
[0004]
Apart from the multi-channel reproduction method, there is also a sound-source object-oriented reproduction method. In this method every sound is treated as a sound emitted by some sound source object, and each sound source object (hereinafter referred to as a "virtual sound source") contains its own position information and audio signal. Taking music content as an example, each virtual sound source comprises the sound of one instrument and the position information of where that instrument is placed. The sound-source object-oriented method is usually reproduced by a method that synthesizes the wavefront of sound with a linear or planar speaker group (i.e., a wavefront synthesis reproduction method). Among such wavefront synthesis reproduction methods, the Wave Field Synthesis (WFS) method described in Non-Patent Document 1 is one practical implementation using a linear speaker group (hereinafter referred to as a speaker array) and has been actively researched in recent years.
[0005]
Unlike the multi-channel reproduction method described above, such a wavefront synthesis reproduction method can simultaneously present both a good sound image and good sound quality to a listener at any position in front of the arranged speaker group 31, as schematically illustrated in FIG. 3. That is, the sweet spot 32 of the wavefront synthesis reproduction method is wide, as illustrated. Moreover, a listener facing the speaker array in the acoustic space provided by the WFS method perceives the sound radiated from the speaker array as if it were emitted from a sound source (virtual sound source) virtually present behind the speaker array.
[0006]
This wavefront synthesis reproduction system requires an input signal representing virtual sound sources. In general, one virtual sound source needs to comprise an audio signal for one channel and the position information of the virtual sound source. Taking the above-mentioned music content as an example, this means, for instance, an audio signal recorded for each instrument together with the position information of that instrument. However, the audio signal of each virtual sound source does not necessarily have to correspond to one instrument; it is only necessary that the arrival direction and magnitude of each sound intended by the content producer be expressed using the concept of virtual sound sources.
[0007]
Since the most widely spread of the above-mentioned multi-channel systems is the stereo (2ch) system, stereo music content is considered here. As shown in FIG. 4, using two speakers 41L and 41R, the L (left) channel and R (right) channel audio signals of stereo music content are reproduced by the speaker 41L installed on the left and the speaker 41R installed on the right, respectively. Only when listening at points equidistant from the speakers 41L and 41R, i.e., at the sweet spot 43, is the sound image localized and heard as the producer intended: as shown in FIG. 4, the vocal and bass are heard from the middle position 42b, the piano from the position 42a on the left, and the drums from the position 42c on the right. Now consider reproducing such content by the wavefront synthesis reproduction method so that sound image localization as intended by the content producer is provided to a listener at any position, which is the feature of that method. For this purpose, the sound image perceived when listening in the sweet spot 43 of FIG. 4 must be perceivable from any listening position, as in the sweet spot 53 shown in FIG. 5. That is, within the wide sweet spot 53 produced by the linear or planar speaker group 51, the vocal and bass should be heard from the middle position 52b, the piano from the position 52a on the left, and the drums from the position 52c on the right, with the sound images localized as the producer intended.
[0008]
For example, as shown in FIG. 6, consider arranging the sound of the L channel and the sound of the R channel as virtual sound sources 62a and 62b. In this case, since the L and R channels do not each independently represent one sound source but instead generate a synthetic sound image between the two channels, the sweet spot 63 remains narrow even when reproduced by the wavefront synthesis reproduction method: the sound image is localized as in FIG. 4 only at the sweet spot 63. That is, in order to realize the desired sound image localization, it is necessary to separate the 2ch stereo data into the sound of each sound image by some means and to generate virtual sound source data from each sound.
[0009]
To solve this problem, the method described in Patent Document 1 separates 2-channel stereo data into a correlated signal and an uncorrelated signal for each frequency band based on the correlation coefficient of the signal powers, estimates the direction of the synthetic sound image for the correlated signal, and generates virtual sound sources from the results.
[0010]
European Patent Application Publication No. 1761110
[0011]
A. J. Berkhout, D. de Vries, and P. Vogel, "Acoustic control by wave field synthesis", J. Acoust. Soc. Am., Vol. 93(5), Acoustical Society of America, May 1993, pp. 2764-2778
[0012]
However, in the method described in Patent Document 1, when the original audio signal is analyzed, the DC components of the left and right channels after the discrete Fourier transform are ignored. FIG. 7 is a schematic view showing an example of the result of a discrete Fourier transform of an audio signal. In FIG. 7, the vertical axis represents the real part, the axis in the depth direction represents the imaginary part, and reference numeral 71 denotes the DC component. Because the DC component 71 is ignored in the method of Patent Document 1, the continuity of the waveform between segments after the inverse Fourier transform is not ensured, and the waveform becomes discontinuous at segment boundaries. Especially in content containing many low-band signals, the generated audio signal waveform contains many discontinuities, which are perceived as noise.
[0013]
This noise is explained using the music content 80 shown in FIG. 8 as an example. When the left-channel audio signal 81 and the right-channel audio signal 82 of the music content 80 are converted into, for example, five channels using the method described in Patent Document 1, the result is the music content 90 shown in FIG. 9, which has five channels of audio signals 91 to 95. FIG. 10 is an enlarged view of the vicinity of 9 seconds in the audio signal 93 of the third channel from the top of FIG. 9; a discontinuity 101 occurs in the audio signal 100 shown in FIG. 10. Because many such discontinuities are included, they are perceived as unpleasant noise.
[0014]
Such a problem is not limited to converting an audio signal for a multi-channel system into an audio signal for reproduction by a wavefront synthesis reproduction system; it may also occur when converting it into an audio signal for another multi-channel system (whose number of channels may be the same or different). This is because the discrete Fourier transform and inverse transform described above may also be applied in such a conversion, with the DC components of the left and right channels ignored.
[0015]
The present invention has been made in view of the above circumstances, and an object thereof is to provide an audio signal conversion apparatus, method, program, and recording medium capable of converting an audio signal for a multi-channel system such as 2ch or 5.1ch without generating noise due to discontinuous points.
[0016]
In order to solve the above problems, a first technical means of the present invention is an audio signal conversion apparatus for converting multi-channel input audio signals into audio signals to be reproduced by a speaker group, comprising: a transform unit that performs a discrete Fourier transform on input audio signals of two channels; a correlation signal extraction unit that extracts a correlation signal from the two-channel audio signals after the discrete Fourier transform by the transform unit while ignoring the DC components; an inverse transform unit that performs an inverse discrete Fourier transform on the correlation signal extracted by the correlation signal extraction unit, on the correlation signal and the non-correlation signal, or on an audio signal generated from the correlation signal or from the correlation signal and the non-correlation signal; and a removal unit that removes discontinuous points of the waveform from the audio signal after the inverse discrete Fourier transform by the inverse transform unit.
[0017]
In a second technical means according to the first technical means, the removal unit removes the discontinuous points by adding a DC component to the audio signal after the inverse discrete Fourier transform so as to maintain the differential value of the waveform at the boundary of the processing segment.
[0018]
A third technical means is, in the second technical means, characterized in that the removal unit decreases the magnitude of the amplitude of the added DC component in proportion to the time elapsed from the time of addition.
[0019]
A fourth technical means is, in the third technical means, characterized in that the removal unit changes the proportionality constant used for the decrease according to the magnitude of the amplitude of the DC component obtained for the addition.
[0020]
A fifth technical means is, in the fourth technical means, characterized in that the removal unit performs the addition of the DC component only at places other than those where the number of times the waveform of the audio signal after the inverse discrete Fourier transform crosses zero within a predetermined time is equal to or greater than a predetermined number.
[0021]
A sixth technical means is, in any one of the second to fifth technical means, characterized in that the removal unit performs the addition of the DC component only when the amplitude of the DC component obtained for the addition is less than a predetermined value.
[0022]
A seventh technical means is, in any one of the first to third technical means, characterized in that the removal unit performs the removal of the discontinuous points except at places where the number of times the waveform of the audio signal after the inverse discrete Fourier transform crosses zero within a predetermined time is equal to or greater than a predetermined number.
[0023]
An eighth technical means is, in any one of the first to seventh technical means, characterized in that the audio signal after the inverse discrete Fourier transform to be processed by the removal unit is an audio signal obtained by applying scaling processing, in the time domain or the frequency domain, to the correlation signal, or to the correlation signal and the non-correlation signal.
[0024]
A ninth technical means is, in any one of the first to eighth technical means, characterized in that the multi-channel input audio signals are input audio signals of three or more channels; audio signals to be reproduced by the speaker group are generated by removing the discontinuous points, by means of the transform unit, the correlation signal extraction unit, the inverse transform unit, and the removal unit, from any two of the multi-channel input audio signals; and the audio signal conversion apparatus further comprises an addition unit that adds the input audio signals of the remaining channels to the generated audio signals.
[0025]
A tenth technical means is, in any one of the first to ninth technical means, characterized by further comprising: a digital content input unit for inputting digital content including the multi-channel input audio signals; a decoder unit for decoding the digital content; an audio signal extraction unit for separating audio signals from the digital content decoded by the decoder unit; and an audio signal processing unit that converts the audio signals extracted by the audio signal extraction unit into multi-channel audio signals of three or more channels different from the input audio signals, the audio signal processing unit including the transform unit, the correlation signal extraction unit, the inverse transform unit, and the removal unit.
[0026]
An eleventh technical means is, in the tenth technical means, characterized in that the digital content input unit inputs digital content from a recording medium storing digital content, from a server distributing digital content via a network, or from a broadcasting station broadcasting digital content.
[0027]
A twelfth technical means is, in any one of the first to eleventh technical means, characterized by further comprising a switching unit for switching, according to a user operation, whether or not the processing in the audio signal processing unit is performed.
[0028]
A thirteenth technical means is an audio signal conversion method for converting multi-channel input audio signals into audio signals to be reproduced by a speaker group, comprising: a transform step in which a transform unit performs a discrete Fourier transform on input audio signals of two channels; an extraction step in which a correlation signal extraction unit extracts a correlation signal from the two-channel audio signals after the discrete Fourier transform in the transform step while ignoring the DC components; an inverse transform step in which an inverse transform unit performs an inverse discrete Fourier transform on the correlation signal extracted in the extraction step, on the correlation signal and the non-correlation signal, or on an audio signal generated from the correlation signal or from the correlation signal and the non-correlation signal; and a removal step in which a removal unit removes discontinuous points of the waveform from the audio signal after the inverse discrete Fourier transform in the inverse transform step.
[0029]
A fourteenth technical means is a program causing a computer to execute: a transform step of performing a discrete Fourier transform on input audio signals of two channels; an extraction step of extracting a correlation signal from the two-channel audio signals after the discrete Fourier transform in the transform step while ignoring the DC components; an inverse transform step of performing an inverse discrete Fourier transform on the correlation signal extracted in the extraction step, on the correlation signal and the non-correlation signal, or on an audio signal generated from the correlation signal or from the correlation signal and the non-correlation signal; and a removal step of removing discontinuous points of the waveform from the audio signal after the inverse discrete Fourier transform in the inverse transform step.
A fifteenth technical means is a computer-readable recording medium on which the program according to the fourteenth technical means is recorded.
[0030]
According to the present invention, it is possible to convert audio signals for multi-channel
systems such as 2ch and 5.1ch without generating noise caused by discontinuous points.
[0031]
FIG. 1 is a schematic diagram for explaining the 2ch system.
FIG. 2 is a schematic diagram for explaining the 5.1ch surround system.
FIG. 3 is a schematic diagram for explaining the wavefront synthesis reproduction system.
FIG. 4 is a schematic diagram showing music content, in which vocal, bass, piano, and drum sounds are recorded in stereo, being reproduced using two speakers on the left and right.
FIG. 5 is a schematic diagram showing the ideal sweet spot when the music content of FIG. 4 is reproduced by the wavefront synthesis reproduction system.
FIG. 6 is a schematic diagram showing the actual sweet spot when virtual sound sources are set at the positions of the left and right speakers and the left- and right-channel audio signals of the music content of FIG. 4 are reproduced by the wavefront synthesis reproduction system.
FIG. 7 is a schematic diagram showing an example of the result of a discrete Fourier transform of an audio signal.
FIG. 8 is a diagram showing an example of the waveform of music content consisting of left-channel and right-channel audio signals.
FIG. 9 is a diagram showing the waveform resulting from converting the music content of FIG. 8 into five channels using the conventional method.
FIG. 10 is an enlarged view of a part of the audio signal of one channel of the music content of FIG. 9.
FIG. 11 is a block diagram showing a configuration example of an audio data reproduction device provided with an audio signal conversion device according to the present invention.
FIG. 12 is a block diagram showing a configuration example of the audio signal processing unit (the audio signal conversion device according to the present invention) in the audio data reproduction device of FIG. 11.
FIG. 13 is a flowchart for explaining an example of the audio signal processing in the audio signal separation and extraction unit and the noise removal unit in the audio signal processing unit of FIG. 12.
FIG. 14 is a diagram showing how audio data are stored in a buffer in the audio signal processing unit of FIG. 12.
FIG. 15 is a schematic diagram for explaining an example of the positional relationship between a listener, left and right speakers, and a synthetic sound image.
FIG. 16 is a schematic diagram for explaining an example of the positional relationship between the speaker group used in the wavefront synthesis reproduction system and the virtual sound sources.
FIG. 17 is a schematic diagram for explaining an example of the positional relationship between the virtual sound sources of FIG. 16, a listener, and a synthetic sound image.
FIG. 18 is a schematic diagram for explaining the waveform discontinuities that arise at segment boundaries after the inverse discrete Fourier transform when the left- and right-channel audio signals are discrete Fourier transformed and the DC components of the left and right channels are ignored.
FIG. 19 is a schematic diagram for explaining an example of the discontinuous point removal processing according to the present invention.
FIG. 20 is a diagram showing the waveform resulting from converting certain music content consisting of left- and right-channel audio signals into five channels while applying the discontinuous point removal processing of FIG. 19.
FIG. 21 is a diagram showing the waveform resulting from converting the same music content as in FIG. 20 into five channels while applying another discontinuous point removal processing according to the present invention.
FIG. 22 is a diagram showing the waveform resulting from converting music content with sharp changes in the audio signal waveform, different from the music content of FIGS. 20 and 21, into five channels while applying the same discontinuous point removal processing as in FIG. 20.
FIG. 23 is a diagram showing the waveform resulting from converting the same music content as in FIG. 22 into five channels while applying another discontinuous point removal processing according to the present invention.
FIG. 24 is a diagram showing the waveform resulting from converting the music content of FIG. 8 into five channels while applying the discontinuous point removal processing of FIG. 19.
FIG. 25 is an enlarged view of a part of the audio signal of one channel of the music content of FIG. 24.
FIG. 26 is a schematic diagram for explaining an example of the positional relationship between the speaker group and the virtual sound sources used when a 5.1ch audio signal is reproduced by the wavefront synthesis reproduction system.
FIG. 27 is a diagram showing a configuration example of a television apparatus provided with the audio data reproduction device of FIG. 11.
FIG. 28 is a diagram showing another configuration example of a television apparatus provided with the audio data reproduction device of FIG. 11.
FIG. 29 is a diagram showing another configuration example of a television apparatus provided with the audio data reproduction device of FIG. 11.
FIG. 30 is a diagram showing a configuration example of a video projection system provided with the audio data reproduction device of FIG. 11.
FIG. 31 is a diagram showing an example of a vehicle equipped with the audio data reproduction device of FIG. 11.
FIG. 32 is a diagram showing a configuration example of a system consisting of a television board provided with the audio data reproduction device of FIG. 11 and a television apparatus.
FIG. 33 is a diagram showing an example of the speakers targeted for reproduction by the audio data reproduction device of FIG. 11.
[0032]
An audio signal conversion apparatus according to the present invention is an apparatus for converting an audio signal for a multi-channel reproduction system into an audio signal for reproduction by a speaker group with the same or a different number of channels, or into an audio signal for a wavefront synthesis reproduction system. It may also be called an audio signal processing device, an audio data conversion device, and so on, and it can be incorporated into an audio data reproduction device. The audio signal is, of course, not limited to a recorded voice signal and may also be called an acoustic signal. As described above, the wavefront synthesis reproduction method is a reproduction method in which the wavefront of sound is synthesized by a speaker group arranged in a straight line or in a plane.
[0033]
Hereinafter, configuration examples and processing examples of the audio signal conversion device according to the present invention will be described with reference to the drawings. In the following description, an example is first given in which the audio signal conversion device according to the present invention generates an audio signal for the wavefront synthesis reproduction method by conversion. FIG. 11 is a block diagram showing a configuration example of an audio data reproduction device provided with an audio signal conversion device according to the present invention, and FIG. 12 is a block diagram showing a configuration example of the audio signal processing unit (the audio signal conversion device according to the present invention) in the audio data reproduction device of FIG. 11.
[0034]
The audio data reproduction device 110 illustrated in FIG. 11 includes a decoder 111, an audio signal extraction unit 112, an audio signal processing unit 113, a D/A converter 114, an amplifier group 115, and a speaker group 116. The decoder 111 decodes audio-only content or video content with audio, converts it into a format that can be processed, and outputs the converted signal to the audio signal extraction unit 112. The content is acquired by receiving digital broadcast content transmitted from a broadcast station, by downloading from a server that distributes digital content via a network such as the Internet, or by reading from a recording medium such as an external storage device. Thus, although not shown in FIG. 11, the audio data reproduction device 110 includes a digital content input unit for inputting digital content including multi-channel input audio signals, and the decoder 111 decodes the digital content input there. The audio signal extraction unit 112 separates and extracts the audio signals from the decoded signal; here they are a 2ch stereo signal. The signals of the two channels are output to the audio signal processing unit 113.
[0035]
The audio signal processing unit 113 generates, from the obtained two-channel signals, multi-channel audio signals of three or more channels different from the input audio signals (described below as signals corresponding to the number of virtual sound sources). That is, the input audio signals are converted into other multi-channel audio signals, which the audio signal processing unit 113 outputs to the D/A converter 114. Any number of virtual sound sources equal to or greater than a predetermined number poses no problem in principle, but the larger the number of virtual sound sources, the larger the amount of computation; it is therefore desirable to determine the number in consideration of the performance of the device on which it is mounted. In this example, the number is five.
[0036]
The D/A converter 114 converts the obtained signals into analog signals and outputs each signal to the amplifier group 115. Each amplifier 115 amplifies its input analog signal and transmits it to the corresponding speaker 116, which outputs it as sound into the space.
[0037]
The detailed configuration of the audio signal processing unit is shown in FIG. 12. The audio signal processing unit 113 includes an audio signal separation and extraction unit 121, a noise removal unit 122, and an audio output signal generation unit 123.
[0038]
The audio signal separation and extraction unit 121 generates an audio signal corresponding to each virtual sound source from the two-channel signals and outputs these audio signals to the noise removal unit 122. The noise removal unit 122 removes the portions that are perceived as noise from the obtained audio signal waveforms and outputs the audio signals after noise removal to the audio output signal generation unit 123. The audio output signal generation unit 123 generates an output audio signal waveform corresponding to each speaker from the obtained audio signals; it performs processing such as wavefront synthesis reproduction processing, for example assigning the obtained audio signal of each virtual sound source to each speaker and generating an audio signal for each speaker. A part of the wavefront synthesis reproduction processing may be borne by the audio signal separation and extraction unit 121.
[0039]
Next, an example of the audio signal processing in the audio signal separation and extraction unit 121 and the noise removal unit 122 will be described with reference to FIG. 13. FIG. 13 is a flowchart for explaining an example of the audio signal processing in the audio signal separation and extraction unit and the noise removal unit in the audio signal processing unit of FIG. 12, and FIG. 14 is a diagram showing how audio data are stored in a buffer in the audio signal processing unit of FIG. 12.
[0040]
First, the audio signal separation and extraction unit 121 reads audio data of half the length of one segment from the extraction result of the audio signal extraction unit 112 of FIG. 11 (step S131). Here, audio data refers to a discrete audio signal waveform sampled at a sampling frequency such as 48 kHz. A segment is an audio data section consisting of a fixed number of sample points; here it refers to the section length subjected to the discrete Fourier transform later, for example 1024. In this example, 512 points of audio data, half the length of one segment, are read.
[0041]
The read 512 points of audio data are stored in the buffer 140 as illustrated in FIG. 14. This buffer is designed to hold the audio signal waveform of the preceding one segment, and the oldest data are discarded. The previous half segment of data and the newest half segment of data are concatenated to form one segment of audio data, and the process proceeds to the window function calculation (step S132). In this way, every sample is read twice over the course of the window function operation.
[0042]
In the window function calculation of step S132, one segment of audio data is multiplied by the well-known Hann window, which can be written as w(m) = sin²((m/M)π) (1), where m is a natural number and M is the segment length, an even number. Denoting the stereo input signals by xL(m) and xR(m), the audio signals x'L(m) and x'R(m) after multiplication by the window function are calculated as follows.
[0043]
x'L(m) = w(m)·xL(m), x'R(m) = w(m)·xR(m) (2)
When this Hann window is used, an input sample xL(m0) at a sample point m0 (where M/2 ≤ m0 < M) is multiplied by sin²((m0/M)π), and in the next read the same sample point is read as m0 − M/2.
[0044]
It is then multiplied by sin²(((m0 − M/2)/M)π) = cos²((m0/M)π). Since sin²((m0/M)π) + cos²((m0/M)π) = 1, adding the signal read without any correction and the signal read with a half-segment shift restores the original signal completely.
[0045]
The audio data thus obtained are subjected to a discrete Fourier transform as in the following equation (3) to obtain audio data in the frequency domain (step S133), where DFT denotes the discrete Fourier transform, k is a natural number with 0 ≤ k < M, and XL(k) and XR(k) are complex numbers.
XL(k) = DFT(x'L(m)), XR(k) = DFT(x'R(m)) (3)
[0046]
Next, the obtained frequency-domain audio data are divided into small bands, and the processing of steps S135 to S138 is executed for each divided band (steps S134a and S134b). The individual processes are described in detail below.
[0047]
First, as for the division method, the band from 0 Hz to half the sampling frequency is divided at intervals of one equivalent rectangular bandwidth (ERB). The number of divisions up to a given upper-limit frequency fmax [Hz], that is, the maximum value I of the index of the bands divided by the ERB, is given by the following equation: I = floor(21.4·log10(0.00437·fmax + 1)) (4), where floor(a) is the floor function, which gives the largest integer not exceeding the real number a.
[0048]
Then, the center frequency Fc^(i) (1 ≤ i ≤ I) [Hz] of each ERB band (hereinafter, small band) is given by the following equation (5).
[0049]
Further, the bandwidth b^(i) [Hz] of the ERB at that point is obtained by the following equation:
b^(i) = 24.7·(0.00437·Fc^(i) + 1) (6)
Therefore, by shifting the center frequency toward the low side and toward the high side by ERB/2 each, the boundary frequencies FL^(i) and FU^(i) on the two sides of the i-th small band are obtained. The i-th small band thus contains the KL^(i)-th through KU^(i)-th line spectra, where KL^(i) and KU^(i) are given by the following equations (7) and (8):
KL^(i) = ceil(21.4·log10(0.00437·FL^(i) + 1)) (7)
KU^(i) = floor(21.4·log10(0.00437·FU^(i) + 1)) (8)
where ceil(a) is the ceiling function, which gives the smallest integer not smaller than the real number a. Furthermore, the line spectrum after the discrete Fourier transform is symmetric about M/2 (M being even), except for the DC component, e.g., XL(0); that is, XL(k) and XL(M − k) are complex conjugates in the range 0 < k < M/2. In the following, therefore, only the range KU^(i) ≤ M/2 is considered for analysis, and the range k > M/2 is treated in the same way as the symmetric line spectrum with which it is in a complex-conjugate relationship.
[0050]
A concrete example: if the sampling frequency is 48000 Hz, then I = 49, resulting in a division into 49 small bands. Line spectrum components corresponding to frequencies above the highest small band remain, but they have little audible impact and are usually very small, so there is no problem in including them in the highest small band.
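The band division of equations (4) to (8) can be sketched as follows (Python with numpy; a sketch, not the patent's implementation). Since equation (5) is not reproduced in this translation, the center frequencies are taken here from the inverse of the ERB scale of equation (4) at half-integer indices, which is an assumption; the line-spectrum index formulas follow equations (7) and (8) as printed.

```python
import numpy as np

def erb_scale(f_hz):
    """ERB number of a frequency in Hz, as in equation (4)."""
    return 21.4 * np.log10(0.00437 * f_hz + 1.0)

def erb_scale_inv(e):
    """Frequency in Hz for an ERB number (assumed form of equation (5))."""
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

fmax = 48000.0                               # with fmax = 48000 Hz, I = 49
I = int(np.floor(erb_scale(fmax)))           # equation (4)

bands = []
for i in range(1, I + 1):
    fc = erb_scale_inv(i - 0.5)              # assumed center of band i
    bw = 24.7 * (0.00437 * fc + 1.0)         # equation (6)
    fl, fu = fc - bw / 2.0, fc + bw / 2.0    # boundaries at +/- ERB/2
    kl = int(np.ceil(erb_scale(fl)))         # equation (7), as printed
    ku = int(np.floor(erb_scale(fu)))        # equation (8), as printed
    bands.append((fc, fl, fu, kl, ku))

print(I)  # 49
```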
[0051]
Next, in each small band determined in this way, the normalized correlation coefficient between the left and right channels is obtained by the following equation (9) (step S135).
[0052]
The normalized correlation coefficient d^(i) indicates how strongly the audio signals of the left and right channels are correlated, and it takes a real value between 0 and 1: it is 1 when the two signals are identical and 0 when they are completely uncorrelated.
When the powers PL^(i) and PR^(i) of the left- and right-channel audio signals are both 0, the correlated and uncorrelated signals cannot be extracted for that small band, so the processing skips to the next small band. When only one of PL^(i) and PR^(i) is 0, equation (9) cannot be evaluated; in that case the normalized correlation coefficient is set to d^(i) = 0 and the processing of that small band continues.
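Equation (9) itself is not reproduced in this translation. A common form consistent with the description above and with the power definitions of equations (10) to (12) is d = Re{Σ XL(k)·conj(XR(k))} / sqrt(PL·PR); the following sketch assumes that form, including the special cases for zero power.

```python
import numpy as np

def normalized_correlation(XL, XR):
    """Normalized correlation d of one small band (assumed form of eq. (9)).

    XL, XR: the complex DFT lines K_L..K_U of the left/right channel.
    Returns (d, PL, PR); d is None when the band must be skipped.
    """
    PL = np.sum(np.abs(XL) ** 2)             # left-channel power (eq. (12))
    PR = np.sum(np.abs(XR) ** 2)             # right-channel power (eq. (12))
    if PL == 0.0 and PR == 0.0:
        return None, PL, PR                  # band is skipped entirely
    if PL == 0.0 or PR == 0.0:
        return 0.0, PL, PR                   # eq. (9) undefined: use d = 0
    cross = np.real(np.sum(XL * np.conj(XR)))
    d = float(np.clip(cross / np.sqrt(PL * PR), 0.0, 1.0))
    return d, PL, PR
```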
[0053]
Next, using this normalized correlation coefficient d^(i), conversion coefficients for separating and extracting the correlated signal and the uncorrelated signals from the left- and right-channel audio signals are determined (step S136), and the correlated signal and the uncorrelated signals are separated and extracted from the left- and right-channel audio signals using the obtained conversion coefficients (step S137). Both the correlated signal and the uncorrelated signals are extracted as estimated audio signals.
[0054]
An example of the processing of steps S136 and S137 is described below. Here, as in Patent Document 1, a model is adopted in which the signal of each of the left and right channels consists of an uncorrelated signal and a correlated signal, and the same correlated signal is output from the left and right. The direction of the sound image synthesized by the correlated signal output from the left and right is determined by the left-right balance of the sound pressure of the correlated signal. According to this model, the input signals xL(m) and xR(m) are expressed as
xL(m) = s(m) + nL(m), xR(m) = α·s(m) + nR(m) (13)
where s(m) is the correlated signal of the left and right channels; nL(m) is the left-channel audio signal minus the correlated signal s(m), which can be defined as the uncorrelated signal of the left channel; and nR(m) is the right-channel audio signal minus the correlated signal s(m), which can be defined as the uncorrelated signal of the right channel. Furthermore, α is a positive real number representing the left-right sound pressure balance of the correlated signal.
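A minimal test-signal generator for this model (equation (13)), useful for exercising the separation described below; the choice of signals, noise levels, and α here is illustrative only and not taken from the document.

```python
import numpy as np

def make_stereo_model(n_samples, alpha=0.7, seed=0):
    """Synthesize xL = s + nL, xR = alpha*s + nR per equation (13)."""
    rng = np.random.default_rng(seed)
    s = rng.standard_normal(n_samples)          # correlated signal s(m)
    nL = 0.3 * rng.standard_normal(n_samples)   # uncorrelated signal nL(m)
    nR = 0.3 * rng.standard_normal(n_samples)   # uncorrelated signal nR(m)
    return s + nL, alpha * s + nR

xL, xR = make_stereo_model(48000)
```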
[0055]
The audio signals x'L(m) and x'R(m) after the window function multiplication described in equation (2) are, by equation (13), expressed by the following equation (14), where s'(m), n'L(m), and n'R(m) are s(m), nL(m), and nR(m) multiplied by the window function:
x'L(m) = w(m){s(m) + nL(m)} = s'(m) + n'L(m), x'R(m) = w(m){α·s(m) + nR(m)} = α·s'(m) + n'R(m) (14)
[0056]
Applying the discrete Fourier transform to equation (14) gives the following equation (15), where S(k), NL(k), and NR(k) are the discrete Fourier transforms of s'(m), n'L(m), and n'R(m), respectively:
XL(k) = S(k) + NL(k), XR(k) = α·S(k) + NR(k) (15)
[0057]
Therefore, the audio signals XL^(i)(k) and XR^(i)(k) in the i-th small band are expressed as
XL^(i)(k) = S^(i)(k) + NL^(i)(k), XR^(i)(k) = α^(i)·S^(i)(k) + NR^(i)(k), for KL^(i) ≤ k ≤ KU^(i) (16)
where α^(i) denotes α in the i-th small band. The correlated signal S^(i)(k) and the uncorrelated signals NL^(i)(k) and NR^(i)(k) in the i-th small band are defined as
S^(i)(k) = S(k), NL^(i)(k) = NL(k), NR^(i)(k) = NR(k), for KL^(i) ≤ k ≤ KU^(i) (17)
[0058]
From equation (16), the sound pressures PL^(i) and PR^(i) in equation (12) become
PL^(i) = PS^(i) + PN^(i), PR^(i) = [α^(i)]²·PS^(i) + PN^(i) (18)
where PS^(i) and PN^(i) are the powers of the correlated signal and of the uncorrelated signal in the i-th small band, respectively, given by equation (19). Here the sound pressures of the left and right uncorrelated signals are assumed to be equal.
[0059]
Further, equation (9) can be expressed by the following equations (10) to (12). In this calculation, the cross power between any two of S(k), NL(k), and NR(k) is assumed to be zero, i.e., they are taken to be mutually orthogonal.
[0060]
Solving equations (18) and (20) yields the following equations (21) and (22).
[0061]
These values are used to estimate the correlated and uncorrelated signals in each small band.
Writing the estimate est(S^(i)(k)) of the correlated signal S^(i)(k) in the i-th small band with parameters μ1 and μ2 as
est(S^(i)(k)) = μ1·XL^(i)(k) + μ2·XR^(i)(k) (23)
the estimation error ε is expressed as ε = est(S^(i)(k)) − S^(i)(k) (24), where est(A) denotes an estimate of A. When the squared error ε² is minimized, ε is orthogonal to XL^(i)(k) and XR^(i)(k), so that
E[ε·XL^(i)(k)] = 0, E[ε·XR^(i)(k)] = 0 (25)
Using equations (16), (19), and (21) to (24), the following simultaneous equations are derived from equation (25):
(1 − μ1 − μ2·α^(i))·PS^(i) − μ1·PN^(i) = 0, α^(i)·(1 − μ1 − μ2·α^(i))·PS^(i) − μ2·PN^(i) = 0 (26)
[0062]
Solving equation (26) gives each parameter, as in equation (27). The power Pest(S)^(i) of the estimate est(S^(i)(k)) obtained in this way is found by squaring both sides of equation (23):
Pest(S)^(i) = (μ1 + α^(i)·μ2)²·PS^(i) + (μ1² + μ2²)·PN^(i) (28)
Since the power of the estimate needs to match that of the correlated signal, the estimate is scaled based on this equation as in equation (29), where est'(A) denotes the scaled estimate of A.
[0063]
[0064]
Then, the estimates est(NL^(i)(k)) and est(NR^(i)(k)) of the left- and right-channel uncorrelated signals NL^(i)(k) and NR^(i)(k) in the i-th small band are written as
est(NL^(i)(k)) = μ3·XL^(i)(k) + μ4·XR^(i)(k) (30)
est(NR^(i)(k)) = μ5·XL^(i)(k) + μ6·XR^(i)(k) (31)
from which the parameters μ3 to μ6
[0065]
can be obtained, as in equations (32) and (33).
The estimates est(NL^(i)(k)) and est(NR^(i)(k)) obtained in this way are also scaled in the same manner as described above, according to equations (34) and (35).
[0066]
[0067]
The parameters μ1 to μ6 given by equations (27), (32), and (33) and the scaling coefficients given by equations (29), (34), and (35) correspond to the conversion coefficients obtained in step S136.
Then, in step S137, the correlated signal and the uncorrelated signals of the left and right channels are estimated by the calculations of equations (23), (30), and (31) using these conversion coefficients.
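Equations (20) to (22), (27), (29), and (32) to (35) are not reproduced in this translation, so the following Python sketch fills them in under stated assumptions: that equation (20) gives the cross power of the two channels as d^(i)·sqrt(PL^(i)·PR^(i)) = α^(i)·PS^(i), and that equations (27), (32), and (33) are the least-squares solutions of equation (26) and its analogues. The scaling follows the power-matching idea of equations (28) and (29).

```python
import numpy as np

def separation_coeffs(PL, PR, d, eps=1e-12):
    """Conversion coefficients of step S136 for one small band (a sketch
    under the assumptions stated in the lead-in, not the patent's exact
    equations). Returns ((mu1..mu6), alpha, PS, PN)."""
    c = d * np.sqrt(PL * PR)                 # assumed eq. (20): alpha * PS
    if c < eps:                              # no correlated component
        return (0.0, 0.0, 1.0, 0.0, 0.0, 1.0), 1.0, 0.0, PL
    u = (PR - PL) / c                        # u = alpha - 1/alpha
    alpha = (u + np.sqrt(u * u + 4.0)) / 2.0     # assumed eq. (21)
    PS = c / alpha                               # assumed eq. (22)
    PN = max(PL - PS, 0.0)
    D = (1.0 + alpha ** 2) * PS + PN
    mu1 = PS / D                             # est(S)  = mu1*XL + mu2*XR
    mu2 = alpha * mu1
    mu3 = (alpha ** 2 * PS + PN) / D         # est(NL) = mu3*XL + mu4*XR
    mu4 = -alpha * PS / D
    mu5 = mu4                                # est(NR) = mu5*XL + mu6*XR
    mu6 = (PS + PN) / D
    return (mu1, mu2, mu3, mu4, mu5, mu6), alpha, PS, PN

def separate_band(XL, XR, d):
    """Steps S136-S137: estimate the band signals (eqs. (23), (30), (31))."""
    PL = np.sum(np.abs(XL) ** 2)
    PR = np.sum(np.abs(XR) ** 2)
    (mu1, mu2, mu3, mu4, mu5, mu6), a, PS, PN = separation_coeffs(PL, PR, d)
    estS = mu1 * XL + mu2 * XR
    # Power-matched scaling in the spirit of eqs. (28)-(29); the
    # uncorrelated estimates would be scaled analogously (eqs. (34)-(35)).
    pS = (mu1 + a * mu2) ** 2 * PS + (mu1 ** 2 + mu2 ** 2) * PN
    if pS > 0.0:
        estS = estS * np.sqrt(PS / pS)
    return estS, mu3 * XL + mu4 * XR, mu5 * XL + mu6 * XR, a
```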
[0068]
Next, allocation processing to the virtual sound sources is performed (step S138).
As pre-processing for this allocation, the direction of the synthetic sound image generated by the correlated signal estimated for each small band is estimated. This estimation is described with reference to FIGS. 15 to 17. FIG. 15 is a schematic diagram for explaining an example of the positional relationship between a listener, the left and right speakers, and a synthetic sound image; FIG. 16 is a schematic diagram for explaining an example of the positional relationship between the speaker group used in the wavefront synthesis reproduction method and the virtual sound sources; and FIG. 17 is a schematic diagram for explaining an example of the positional relationship between the virtual sound sources of FIG. 16, a listener, and a synthetic sound image.
[0069]
Now, as in the positional relationship 150 shown in FIG. 15, let θ0 be the spread angle between the line drawn from the listener 153 to the midpoint of the left and right speakers 151L and 151R and the line drawn from the listener 153 to either one of the speakers 151L/151R, and let θ be the spread angle between the former line and the line drawn from the listener 153 to the position of the estimated synthetic sound image 152. When the same audio signal is output from the left and right speakers 151L and 151R with different sound pressure balances, it is generally known that the direction of the synthetic sound image 152 produced by the output sound can be approximated, using the parameter α introduced above to represent the sound pressure balance, by the following equation (36) (hereinafter referred to as the sine law of stereophonic sound).
[0070]
sin θ / sin θ0 = (1 − α) / (1 + α) (36)
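A small helper for this direction estimate. Equation (36) is reconstructed above as the standard stereophonic sine law, so this sketch should be checked against the original document before use.

```python
import numpy as np

def image_direction(alpha, theta0):
    """Direction of the synthetic sound image per the sine law (eq. (36)).

    alpha : left/right sound-pressure balance of the correlated signal.
    theta0: spread angle from the center line to one speaker [rad].
    Returns theta [rad], positive toward the left (alpha < 1) side.
    """
    ratio = (1.0 - alpha) / (1.0 + alpha)
    return float(np.arcsin(np.clip(ratio * np.sin(theta0), -1.0, 1.0)))

print(image_direction(1.0, np.pi / 6))  # 0.0: image in the middle
print(image_direction(0.0, np.pi / 6))  # ~pi/6: image at the left speaker
```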
[0071]
Here, the audio signal separation and extraction unit 121 shown in FIG. 12 converts the 2ch signal into a multi-channel signal so that the 2ch stereo audio signal can be reproduced by the wavefront synthesis reproduction method.
For example, assuming the number of channels after conversion is five, the channels are treated as virtual sound sources 162a to 162e of the wavefront synthesis reproduction method and arranged as in the positional relationship 160 shown in FIG. 16, with equal intervals between adjacent virtual sound sources. The conversion here is therefore a conversion of the 2ch audio signal into an audio signal with as many channels as there are virtual sound sources. As described above, the audio signal separation and extraction unit 121 first separates the 2ch audio signal into one correlated signal and two uncorrelated signals for each small band. It must further be decided in advance how those signals are assigned to the given number of virtual sound sources (here, five). The assignment method may be user-selectable from among several methods, and the selectable methods presented to the user may change according to the number of virtual sound sources.
[0072]
The following assignment method is taken as an example. First, the left and right uncorrelated signals are assigned to the two ends of the five virtual sound sources (virtual sound sources 162a and 162e), respectively. Next, the synthetic sound image generated by the correlated signal is assigned to two adjacent virtual sound sources among the five. As to which two adjacent virtual sound sources it is assigned, it is assumed that the synthetic sound image generated by the correlated signal lies inside the two end virtual sound sources 162a and 162e, that is, that the five virtual sound sources 162a to 162e are arranged so as to lie within the spread angle formed by the two speakers in 2ch stereo reproduction. From the estimated direction of the synthetic sound image, the two adjacent virtual sound sources sandwiching the synthetic sound image are determined, the sound pressure balance assigned to those two virtual sound sources is adjusted, and reproduction is performed so that the two virtual sound sources generate the synthetic sound image.
[0073]
Accordingly, as in the positional relationship 170 shown in FIG. 17, let θ'0 be the spread angle between the line drawn from the listener 173 to the midpoint of the end virtual sound sources 162a and 162e and the line drawn from the listener 173 to the end virtual sound source 162e, and let θ' be the spread angle between the former line and the line drawn from the listener 173 to the synthetic sound image 171. Furthermore, let φ0 be the spread angle between the line drawn from the listener 173 to the midpoint of the two virtual sound sources 162c and 162d sandwiching the synthetic sound image 171 and the line drawn from the listener 173 to the virtual sound source 162c, and let φ be the spread angle between the line drawn from the listener 173 to that midpoint and the line drawn from the listener 173 to the synthetic sound image 171. Here φ0 is a positive real number. Using these variables, the method of assigning the synthetic sound image 152 of FIG. 15 (corresponding to the synthetic sound image 171 of FIG. 17), whose direction has been estimated as in equation (36), to the virtual sound sources is described.
[0074]
First, scaling based on the difference in spread angle is performed as
θ' = (θ'0 / θ0)·θ (37)
which takes into account the difference in spread angle caused by the arrangement of the virtual sound sources. The values of θ'0 and θ0 may be adjusted when the audio data reproduction device is installed, and no problem arises even if they are not equal. In the following, θ0 = π/6 [rad] and θ'0 = π/4 [rad].
[0075]
Next, suppose the direction θ^(i) of the i-th synthetic sound image is estimated by equation (36) as, for example, θ^(i) = π/15 [rad]; then equation (37) gives θ'^(i) = π/10 [rad]. With five virtual sound sources, as shown in FIG. 17, the synthetic sound image 171 is then positioned between the third virtual sound source 162c and the fourth virtual sound source 162d, counting from the left. With five virtual sound sources and θ'0 = π/4 [rad], φ0 ≈ 0.078π [rad] for the midpoint between the third virtual sound source 162c and the fourth virtual sound source 162d. Writing φ in the i-th small band as φ^(i), we obtain φ^(i) = θ'^(i) − φ0 ≈ 0.022π [rad]. In this way, the direction of the synthetic sound image generated by the correlated signal in each small band is represented by the angle relative to the directions of the two virtual sound sources that sandwich it. The synthetic sound image is then generated by these two virtual sound sources 162c and 162d; to that end, the sound pressure balance of the audio signals output from the two virtual sound sources is adjusted, and as the adjustment method the sine law of stereophonic sound used in equation (36) is used again.
[0076]
Let g1 be the scaling coefficient for the third virtual sound source 162c and g2 the scaling coefficient for the fourth virtual sound source 162d, the two virtual sound sources sandwiching the synthetic sound image generated by the correlated signal in the i-th small band. Then the third virtual sound source 162c outputs the audio signal g1·est'(S^(i)(k)) and the fourth virtual sound source 162d outputs the audio signal g2·est'(S^(i)(k)). According to the sine law of stereophonic sound, g1 and g2 should satisfy the following equation (38).
[0077]
On the other hand, normalizing g1 and g2 so that the total power from the third virtual sound source 162c and the fourth virtual sound source 162d equals the power of the correlated signal of the original 2ch stereo gives
g1² + g2² = 1 + [α^(i)]² (39)
[0078]
Combining these yields equation (40), and g1 and g2 are calculated by substituting the above φ^(i) and φ0 into it. Based on the scaling coefficients calculated in this way, as described above, the audio signal g1·est'(S^(i)(k)) is assigned to the third virtual sound source 162c and the audio signal g2·est'(S^(i)(k)) to the fourth virtual sound source 162d. The uncorrelated signals are assigned to the virtual sound sources 162a and 162e at the two ends, as described above: est'(NL^(i)(k)) is assigned to the first virtual sound source 162a, and est'(NR^(i)(k)) to the fifth virtual sound source 162e.
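Equations (38) and (40) are not reproduced in this translation. The sketch below assumes that equation (38) has the form sin φ / sin φ0 = (g2 − g1)/(g1 + g2), with the sign chosen so that φ > 0 moves the image toward the source with gain g2, and combines it with the power constraint of equation (39); the closed form standing in for equation (40) follows from these two assumptions.

```python
import numpy as np

def pan_gains(phi, phi0, alpha):
    """Gains g1, g2 of the two virtual sources sandwiching the image.

    Assumed eq. (38): sin(phi)/sin(phi0) = (g2 - g1)/(g1 + g2).
    Eq. (39):         g1**2 + g2**2 = 1 + alpha**2.
    """
    r = np.sin(phi) / np.sin(phi0)     # in [-1, 1] when |phi| <= phi0
    P = 1.0 + alpha ** 2               # target total power, eq. (39)
    if r >= 1.0:
        return 0.0, float(np.sqrt(P))  # image exactly at the g2 source
    t = (1.0 + r) / (1.0 - r)          # ratio g2/g1 implied by eq. (38)
    g1 = float(np.sqrt(P / (1.0 + t ** 2)))
    return g1, t * g1

# phi and phi0 from the worked example in the text; alpha = 1 is assumed.
print(pan_gains(0.022 * np.pi, 0.078 * np.pi, 1.0))
```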
[0079]
Unlike this example, if the estimated direction of the synthetic sound image is between the first and second virtual sound sources, both g1·est'(S^(i)(k)) and est'(NL^(i)(k)) are assigned to the first virtual sound source. Likewise, if the estimated direction of the synthetic sound image is between the fourth and fifth virtual sound sources, both g2·est'(S^(i)(k)) and est'(NR^(i)(k)) are assigned to the fifth virtual sound source.
[0080]
In this way, the correlated signal and the uncorrelated signals of the left and right channels are allocated for the i-th small band in step S138. This is performed for all small bands by the loop of steps S134a and S134b. As a result, with J virtual sound sources, the frequency-domain output audio signals Y1(k), ..., YJ(k) for the individual virtual sound sources (output channels) are obtained.
[0081]
Then, the processing of steps S140 to S142 is executed for each of the obtained output channels
(steps S139a and S139b). The processes of steps S140 to S142 will be described below.
[0082]
First, the time-domain output audio signal y'j(m) is obtained by applying the inverse discrete Fourier transform to each output channel (step S140):
y'j(m) = DFT⁻¹(Yj(k)) (1 ≤ j ≤ J) (41)
where DFT⁻¹ denotes the inverse discrete Fourier transform. As described for equation (3), the discrete Fourier transformed signal is a signal after window function multiplication, so the signal y'j(m) obtained by the inverse transform is also still multiplied by the window function. Since the window function is the function of equation (1) and reading is performed with a half-segment shift, the converted data are obtained, as described above, by adding each segment into the output buffer with a half-segment shift from the head of the previously processed segment.
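A sketch of step S140 and the output overlap-add for one channel; np.fft.ifft stands in for DFT⁻¹ in equation (41), and the buffer layout is an assumption.

```python
import numpy as np

M = 1024  # segment length

def synthesize_segment(out_buf, pos, Yj):
    """Inverse-transform one frequency-domain segment (eq. (41)) and
    overlap-add it into the output buffer with a half-segment shift."""
    yj = np.real(np.fft.ifft(Yj))   # windowed time-domain segment y'_j(m)
    out_buf[pos:pos + M] += yj      # add onto the previous segment's tail
    return pos + M // 2             # the next segment starts half a segment on
```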
[0083]
However, as described above in relation to the prior art, the converted data then contain many discontinuities, such as the one at the center of FIG. 10 (reference numeral 101), which are perceived as noise during reproduction. As described above, such discontinuities result from not taking the line spectrum of the DC component into account. FIG. 18 shows a waveform schematically illustrating this. More precisely, FIG. 18 is a schematic diagram for explaining the waveform discontinuities that occur at segment boundaries after the inverse discrete Fourier transform when the audio signals of the left and right channels are discrete Fourier transformed and the DC components of the left and right channels are ignored. In the graph 180 shown in FIG. 18, the horizontal axis represents time; for example, the symbol (M−2)^(l) denotes the (M−2)-th sample point of the l-th segment. The vertical axis of the graph 180 is the value of the output signal at those sample points. As can be seen from the graph 180, a discontinuity occurs between the end of the l-th segment and the beginning of the (l+1)-th segment.
[0084]
In order to solve the problem described with reference to FIG. 18, the audio signal conversion device according to the present invention is configured as follows. That is, the audio signal conversion device according to the present invention includes a conversion unit, a correlation signal extraction unit, an inverse conversion unit, and a removal unit. The conversion unit performs a discrete Fourier transform on the input audio signals of the two channels. The correlation signal extraction unit extracts a correlation signal from the audio signals of the two channels after the discrete Fourier transform in the conversion unit, ignoring the DC components. That is, the extraction unit extracts the correlation signal of the input audio signals of the two channels. The inverse conversion unit performs an inverse discrete Fourier transform on (a1) the correlation signal extracted by the correlation signal extraction unit, or (a2) the correlation signal and the uncorrelated signal (the signal excluding the correlation signal), or on (b1) an audio signal generated from the correlation signal, or (b2) an audio signal generated from the correlation signal and the uncorrelated signal. This example describes the case (b2), in which discontinuities are removed from the audio signal after assignment to the virtual sound sources for the wavefront synthesis reproduction method, but the invention is not limited to this. For example, as in cases (a1) and (a2), the discontinuities may be removed from the audio signal before assignment to the virtual sound sources, that is, from the extracted correlation signal, or from the extracted correlation signal and uncorrelated signal, and the assignment may be performed afterwards.
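To make the extraction step concrete, here is a minimal per-segment sketch. Only the zeroing of the DC bin (k = 0) is taken directly from the text; the normalized cross-correlation estimator and the equal split of the common component between the channels are assumptions introduced for illustration, not the patent's own formulas:

    import numpy as np

    def split_correlated(xl_seg, xr_seg):
        Xl, Xr = np.fft.fft(xl_seg), np.fft.fft(xr_seg)
        Xl[0] = 0.0                      # ignore the DC component (k = 0)
        Xr[0] = 0.0
        # assumed estimator: inter-channel correlation of the two spectra
        num = np.real(np.vdot(Xl, Xr))
        den = np.sqrt(np.vdot(Xl, Xl).real * np.vdot(Xr, Xr).real) + 1e-12
        rho = num / den
        corr = 0.5 * rho * (Xl + Xr)     # correlated (common) component
        # correlation signal plus the uncorrelated residues of L and R
        return corr, Xl - corr, Xr - corr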
[0085]
The removal unit removes waveform discontinuities from the audio signal after the inverse discrete Fourier transform in the inverse conversion unit. That is, for the correlation signal or the audio signal generated from it, the removal unit removes the waveform discontinuities from the signal after the inverse discrete Fourier transform. In the example of the audio signal processing unit 113 in FIG. 12, the above-described conversion unit, correlation signal extraction unit, and inverse conversion unit are included in the audio signal separation and extraction unit 121, and the above-described removal unit is exemplified by the noise removal unit 122.
[0086]
The processing for solving the problem described with reference to FIG. 18 will now be described concretely with reference to FIG. 19. FIG. 19 is a schematic diagram for explaining an example of the discontinuity removal processing according to the present invention, namely a method of removing the waveform discontinuities that occur at segment boundaries after the inverse discrete Fourier transform when the audio signals of the left and right channels are discrete-Fourier-transformed and the DC components of the left and right channels are ignored.
[0087]
In the discontinuity removal processing according to the present invention, as shown in the graph 190 of FIG. 19 (an example of removal applied to the graph 180 of FIG. 18), the derivative value at the end of the waveform of the l-th segment and the derivative value at the beginning of the (l+1)-th segment are made to match. Specifically, the noise removal unit 122 adds a DC component (bias) to the waveform of the (l+1)-th segment so that the value at the head of the (l+1)-th segment maintains the slope of the last two points of the l-th segment. As a result, the processed output audio signal y″j(m) becomes y″j(m) = y′j(m) + B (42), where B is a constant representing the bias. After the previous output audio signal and the output audio signal of this processing are added in the output buffer, the waveform becomes continuous as shown in the graph 190 of FIG. 19.
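A minimal sketch of this bias computation follows. The slope is estimated from the last two samples of the previous segment, as the text describes; the function name and the convention that the segments are NumPy arrays are illustrative:

    def add_boundary_bias(prev_seg, seg):
        # prev_seg, seg: NumPy arrays holding the l-th and (l+1)-th
        # inverse-transformed segments, respectively
        slope = prev_seg[-1] - prev_seg[-2]   # derivative at segment end
        # bias B chosen so the head of the (l+1)-th segment continues
        # that slope: equation (42), y''_j(m) = y'_j(m) + B
        B = (prev_seg[-1] + slope) - seg[0]
        return seg + B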
[0088]
Thus, the noise removal unit 122 preferably removes the discontinuities by adding a DC component to the audio signal after the inverse discrete Fourier transform (the correlation signal or the audio signal generated from it) so as to maintain the derivative value of the waveform at the boundary of the processing segments. Although a negative bias is applied in this example, a positive bias may of course be applied instead in order to make the derivative values coincide.
[0089]
As described above, according to the present invention, an audio signal for a multi-channel system such as 2ch or 5.1ch can be converted into an audio signal for reproduction by the wavefront synthesis reproduction system without generating noise caused by discontinuities. Thereby, it is possible to obtain the characteristic benefit of the wavefront synthesis reproduction system, namely providing listeners at any position with the sound image localization intended by the content producer.
[0090]
In addition, the audio signal after the inverse discrete Fourier transform processed by the noise removal unit 122 may be a signal that has undergone scaling processing in the time domain or frequency domain applied to the correlation signal, or to the correlation signal and the uncorrelated signal, as illustrated in the respective equations. That is, scaling processing may be performed on the correlation signal or the uncorrelated signal, and the discontinuities may then be removed from the correlation signal or the uncorrelated signal after the scaling processing.
[0091]
A more preferable example of the present invention will be described with reference to FIGS. 20 and 21. FIG. 20 is a diagram showing the waveform of the result of converting certain music content consisting of left- and right-channel audio signals into five channels by applying the discontinuity removal processing of FIG. 19. FIG. 21 is a diagram showing the waveform of the result of converting the same music content as that targeted in FIG. 20 into five channels by applying another discontinuity removal processing according to the present invention. That is, FIG. 21 illustrates another method of removing the waveform discontinuities that occur at segment boundaries after the inverse discrete Fourier transform when the audio signals of the left and right channels are discrete-Fourier-transformed and the DC components of the left and right channels are ignored.
[0092]
With the discontinuity removal processing described with reference to FIG. 19 alone, the bias component may accumulate and the amplitude of the waveform may overflow. In the converted music content 200 illustrated in FIG. 20, accumulation of bias components is frequently seen in the audio signals 202 and 203 of the second and third channels among the audio signals 201 to 205 of the five channels, and it can be seen that the audio signal 203 has overflowed.
[0093]
Therefore, in the present invention, it is preferable to make the added bias component (DC component) converge by decreasing its magnitude over time as in the following equation. Note that "decreasing over time" means decreasing in proportion to the time elapsed from the addition point, for example the elapsed time from the start of each processing segment or from the discontinuity. y″j(m) = y′j(m) + B × ((M − σm)/M) (43), where σ is a parameter that adjusts the degree of decrease and is, for example, 0.5. In order to obtain the decrease, σ is positive (B itself may be of either sign, as noted above). Furthermore, when the absolute value of the bias obtained for addition exceeds a certain value, σ may be dynamically increased or decreased according to that value; the timing of the increase or decrease may be the next processing segment. By changing σ, which corresponds to the proportionality constant of the decrease, according to the absolute value of the bias (the magnitude of the amplitude of the DC component), a feedback function works and the same effect is obtained. However, these methods do not guarantee that the amplitude of the speech waveform will not overflow.
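A sketch of equation (43) follows; the computation of B reuses the slope-matching idea of equation (42), and the calling convention is illustrative:

    import numpy as np

    def add_decaying_bias(prev_seg, seg, sigma=0.5):
        M = len(seg)
        slope = prev_seg[-1] - prev_seg[-2]
        B = (prev_seg[-1] + slope) - seg[0]
        m = np.arange(M)
        # equation (43): the bias shrinks linearly across the segment,
        # so it converges instead of accumulating from segment to segment
        return seg + B * (M - sigma * m) / M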
[0094]
Therefore, as a safety-valve function, a process may be added in which the bias term of the second term of equation (43) is not added when, for example, the bias value becomes equal to or greater than a certain value (predetermined value). That is, it is preferable that the noise removal unit 122 executes the addition of the DC component (executes the removal of the discontinuity) only when the amplitude of the DC component obtained for the addition is less than the predetermined value. By adopting this method, the output that was the music content 200 of FIG. 20 becomes an output such as the music content 210 shown in FIG. 21, and the bias component does not accumulate. In particular, it can be seen that no bias component accumulates in the audio signals corresponding to the audio signals 202 and 203 of the music content 200, that is, in the audio signals 212 and 213 of the second and third channels from the top among the audio signals 211 to 215 of the five channels of the music content 210.
[0095]
Another preferable example of the present invention will be described with reference to FIGS. 22 and 23. FIG. 22 is a diagram showing the waveform of the result of converting music content whose audio signal waveform changes sharply, different from the music content targeted in FIGS. 20 and 21, into five channels by applying the discontinuity removal processing described above. FIG. 23 is a diagram showing the waveform of the result of converting the same music content as that targeted in FIG. 22 into five channels by applying yet another discontinuity removal processing according to the present invention. That is, FIG. 23 relates to a method of removing the waveform discontinuities that occur at segment boundaries after the inverse discrete Fourier transform when audio signals of the left and right channels with sharply changing waveforms are discrete-Fourier-transformed and the DC components of the left and right channels are ignored.
[0096]
For example, when the audio signal is close to white noise, as in a consonant portion of speech, the waveform of the audio signal may change sharply and the original waveform may already be close to discontinuous. When the discontinuity removal processing of the present invention is applied in converting such left- and right-channel audio signals into audio signals for the wavefront synthesis reproduction system, the waveform may be distorted. In other words, applying the discontinuity removal processing of the present invention to an audio signal whose original waveform is close to discontinuous forces a waveform that is inherently nearly discontinuous to become continuous, and this can conversely distort the waveform. An example is shown in FIG. 22. In the converted music content 220 shown in FIG. 22, distortion is particularly large, and is perceived, at the portions indicated by the arrows in the first and fifth audio signals 221 and 225 among the audio signals 221 to 225 of the five channels.
[0097]
In order to solve this problem, it is preferable to adopt the following method in the discontinuity removal processing of the audio signal conversion processing according to the present invention. The method uses the property that, when a signal is close to white noise, as in a consonant portion of speech, the number of times the waveform of the input audio signal crosses 0 within a predetermined time (for example, within a processing segment or within half of one) increases extremely compared with other portions. The level regarded as 0 may be decided arbitrarily. Accordingly, the number of times the output audio signal (at least the audio signal after the inverse discrete Fourier transform) crosses 0 within a half segment length is counted, and if it is equal to or greater than a predetermined value (predetermined number of times), the segment is regarded as containing such a noise-like portion, and the bias term of the second term on the right side of equation (42) or equation (43) is not added in the subsequent segment processing. That is, the discontinuity removal processing is performed only at other places. Note that the counting may be performed on the speech waveform for a fixed time regardless of segment boundaries, or on the speech waveforms for a plurality of segment processings; in either case, whether or not to add the bias term in the next segment processing may be decided from the counting result, as in the sketch below.
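A sketch of this zero-crossing test follows; the block passed in would be, for example, a half-segment of the inverse-transformed signal. The threshold count is an assumption, and where to place the 0 level is, as the text notes, arbitrary:

    import numpy as np

    def zero_crossings(block):
        s = np.sign(block)
        s[s == 0] = 1                 # treat exact zeros as positive
        return int(np.count_nonzero(np.diff(s) != 0))

    def skip_bias_for_segment(block, max_crossings=20):
        # many crossings of 0 within the block indicate a noise-like
        # (consonant-like) portion; suppress the bias term there
        return zero_crossings(block) >= max_crossings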
[0098]
By adopting such a method, the portions indicated by the arrows in the audio signals 221 and 225 of the music content 220 correspond, in the converted music content 230 shown in FIG. 23, to the portions indicated by the arrows in the audio signals 231 and 235 among the audio signals 231 to 235 of the five channels; there, the distortion is eliminated and no noise occurs.
[0099]
The effect of the more preferable discontinuity removal processing described with reference to FIG. 23 will be described in comparison with FIG. 8.
FIG. 24 is a diagram showing the result of converting the music content of FIG. 8 into five channels by applying the discontinuity removal processing of FIG. 23, and FIG. 25 is an enlarged view of part of the audio signal of one of the channels of the (converted) music content of FIG. 24. When the music content 80 shown in FIG. 8 is used as the input audio signal, the discontinuity removal processing (noise removal processing) described above converts it into the audio signals 241 to 245 of the five channels in the music content 240 shown in FIG. 24. In particular, in the audio signal 243 of the third channel from the top in the music content 240, it can be seen that the discontinuity is eliminated and the waveform becomes continuous, as shown by the audio signal 250 in FIG. 25. In this way, the discontinuities can be eliminated and the noise can be removed. Although FIGS. 24 and 25 have been described as the result of applying the preferable discontinuity removal processing described with reference to FIG. 23, processing such as equation (42) or equation (43) also yields a continuous audio signal, as shown by the audio signal 250, although with some differences.
[0100]
The audio signal conversion processing according to the present invention has been described above using an example in which the input audio signal is a 2ch audio signal. Next, it will be described that the present invention is also applicable to audio signals of other multi-channel systems. Here, a 5.1ch input audio signal is taken as an example with reference to FIG. 26, but the present invention can be applied similarly to input audio signals of other multi-channel systems.
[0101]
FIG. 26 is a schematic diagram for explaining an example of the positional relationship between the speaker group to be used and the virtual sound sources when a 5.1ch audio signal is reproduced by the wavefront synthesis reproduction method. Consider applying the audio signal conversion processing according to the present invention to 5.1ch input audio. The 5.1ch speakers are generally arranged as shown in FIG. 2, with three speakers 21L, 22C, and 21R placed in front of the listener. In content such as movies in particular, the so-called center channel at the front center is often used for purposes such as human speech. That is, there are few places where sound pressure control is performed to generate a synthesized sound image between the center channel and the left channel, or between the center channel and the right channel.
[0102]
Using this property, as in the positional relationship 260 shown in FIG. 26, the input audio signals to the 5.1ch front left and right speakers 262a and 262c are converted by this method (the audio signal conversion processing according to the present invention) and assigned to, for example, five virtual sound sources 263a to 263e, after which the audio signal of the center channel (the channel for the center speaker) is added to the middle virtual sound source 263c. The output audio signal obtained in this way is reproduced by the speaker array 261 as sound images for the virtual sound sources by the wavefront synthesis reproduction method. As for the input audio signals for the rear left and right channels, speakers 262d and 262e may be installed at the rear as in the 5.1ch system, and the signals may be output from there without modification.
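A sketch of this routing follows. It assumes the front L/R pair has already been converted into five virtual-source signals indexed 0 to 4; the index of the middle source is an assumption matching the arrangement of FIG. 26:

    def route_5_1(virtual_sources, center, rear_l, rear_r):
        # front L/R were converted into five virtual-source signals;
        # the center channel is simply added to the middle source (263c)
        virtual_sources[2] = virtual_sources[2] + center
        # rear channels pass through to their own rear speakers unchanged
        return virtual_sources, rear_l, rear_r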
[0103]
Thus, given a multi-channel input audio signal of three or more channels, the audio signal conversion processing according to the present invention may be performed on any two channels of the multi-channel input audio signal to generate an audio signal to be reproduced by the wavefront synthesis reproduction method, and the generated audio signal may be added to the input audio signals of the remaining channels and output. For this addition, for example, an addition unit may be provided in the audio output signal generation unit 123.
[0104]
Next, implementations of the present invention will be briefly described. The present invention can be used, for example, in a device accompanied by video, such as a television. Various examples of apparatuses to which the present invention can be applied will be described with reference to FIGS. 27 to 33. FIGS. 27 to 29 each show a configuration example of a television device provided with the audio data reproduction device shown in FIG. 11; FIGS. 30 and 31 each show a configuration example of a video projection system provided with the audio data reproduction device shown in FIG. 11; FIG. 32 shows a configuration example of a system consisting of a television board equipped with the audio data reproduction device of FIG. 11 and a television device; and FIG. 33 shows an example of an automobile equipped with the audio data reproduction device of FIG. 11. Although each of FIGS. 27 to 33 shows an example in which eight speakers, denoted LSP1 to LSP8, are arranged as the speaker array, the number of speakers only needs to be plural.
[0105]
The audio signal conversion device according to the present invention and an audio data reproduction device provided with it can be used in a television device. The arrangement of these devices in a television set may be determined freely. As in the television device 270 shown in FIG. 27, a speaker group 272 in which the speakers LSP1 to LSP8 of the audio data reproduction device are linearly arranged may be provided below the television screen 271. As in the television device 280 shown in FIG. 28, a speaker group 282 in which the speakers LSP1 to LSP8 of the audio data reproduction device are linearly arranged may be provided above the television screen 281. As in the television device 290 shown in FIG. 29, a speaker group 292 in which transparent film-type speakers LSP1 to LSP8 of the audio data reproduction device are linearly arranged may be embedded in the television screen 291.
[0106]
Further, the audio signal conversion device according to the present invention and an audio data reproduction device provided with it can be used in a video projection system. As in the video projection system 300 shown in FIG. 30, a speaker group 302 of the speakers LSP1 to LSP8 may be embedded in the projection screen 301b on which video is projected by the video projector 301a. As in the video projection system shown in FIG. 31, a speaker group 312 in which the speakers LSP1 to LSP8 are arranged may be placed behind a sound-transmitting screen 311b on which video is projected by the video projector 311a. In addition, the audio signal conversion device according to the present invention and an audio data reproduction device provided with it can be embedded in a television stand (television board). As in the system (home theater system) 320 shown in FIG. 32, a speaker group 322b in which the speakers LSP1 to LSP8 are arranged may be embedded in a television stand 322a on which the television device 321 is mounted. Furthermore, the audio signal conversion device according to the present invention and an audio data reproduction device provided with it can be applied to car audio. As in the car 330 shown in FIG. 33, a speaker group 332 in which the speakers LSP1 to LSP8 are arranged along a curve may be embedded in the dashboard of the car.
[0107]
When the audio signal conversion processing according to the present invention is applied to the apparatuses described with reference to FIGS. 27 to 33, a switching unit may be provided that switches whether or not to perform this conversion processing (the processing in the audio signal processing unit 113 of FIG. 11 or FIG. 12) in response to a user operation, such as pressing a button provided on the apparatus body or operating a remote controller. When this conversion processing is not performed, for the reproduction of 2ch audio data by the wavefront synthesis reproduction method, virtual sound sources may be arranged as shown in FIG. 6, or, as in the positional relationship 340 shown in FIG. 34, reproduction may be performed using only the speakers 341L and 341R at both ends of the array speaker 341. Similarly, 5.1ch audio data may be assigned to three virtual sound sources, or may be reproduced using only one or two speakers at both ends and in the middle of the array.
[0108]
As the wavefront synthesis reproduction method applicable to the present invention, any method may be used as long as it is provided with a speaker array (a plurality of speakers) as described above and outputs from those speakers sound images for virtual sound sources. Besides the WFS method described in Patent Document 1, various methods may be used, such as a method that uses the precedence effect (Haas effect), a phenomenon of human sound image perception. Here, the precedence effect refers to the effect that, when the same sound is reproduced from a plurality of sound sources and there is a small time difference among the sounds reaching the listener from the respective sound sources, the sound image is localized in the direction of the sound source whose sound arrived earlier. By using this effect, a sound image can be made to be perceived at a virtual sound source position. However, it is difficult to make the sound image clearly perceptible by this effect alone. Humans also have the property of perceiving the sound image in the direction from which the sound pressure is felt to be highest. Therefore, in the audio data reproduction device, the precedence effect and the effect of maximum-sound-pressure-direction perception can be combined so that a sound image can be perceived in the direction of the virtual sound source even with a small number of speakers.
[0109]
The above description has been given on the premise that the audio signal conversion device according to the present invention converts an audio signal for a multi-channel system into an audio signal for reproduction by the wavefront synthesis reproduction system, but the present invention is similarly applicable to conversion into an audio signal for another multi-channel system (the number of channels may be the same or different). The converted audio signal may be any audio signal for reproduction by a speaker group including at least a plurality of speakers, regardless of their arrangement. This is because, for such conversion as well, the discrete Fourier transform and inverse transform described above can be applied and the DC component can be ignored to obtain the correlation signal. As a method of reproducing the audio signal converted in this way, for example, one speaker may be made to correspond to each of the signals extracted for each virtual sound source, and ordinary reproduction may be performed instead of the wavefront synthesis reproduction method. Furthermore, various other reproduction methods are conceivable, such as assigning the uncorrelated signals of the two sides to different side and rear speakers, respectively.
[0110]
Further, each component of the audio signal conversion device according to the present invention, such as each component in the audio signal processing unit 113 illustrated in FIG. 12, and each component of an audio data reproduction device including the device can be realized by, for example, hardware such as a microprocessor (or DSP: Digital Signal Processor), memory, buses, interfaces, and peripheral devices, together with software executable on that hardware. Part or all of the hardware can be mounted as an integrated circuit (IC) chip set, in which case the software may be stored in the memory. Alternatively, all of the components of the present invention may be configured by hardware, and in that case as well, part or all of that hardware can be mounted as an integrated circuit (IC) chip set.
[0111]
The object of the present invention is also achieved by supplying a recording medium on which the program code of software realizing the functions of the various configuration examples described above is recorded to a device such as a general-purpose computer serving as the audio signal conversion device, and having a microprocessor or DSP in the device execute the program code. In this case, the program code of the software itself realizes the functions of the various configuration examples described above, and the present invention can be configured by this program code itself, or by a recording medium (an external recording medium or an internal storage device) on which the program code is recorded, with the controlling side reading and executing the code. Examples of the external recording medium include optical discs such as CD-ROMs and DVD-ROMs, and nonvolatile semiconductor memories such as memory cards. Examples of the internal storage device include hard disks and semiconductor memories. The program code can also be downloaded from the Internet and executed, or received from broadcast waves and executed.
[0112]
The audio signal conversion device according to the present invention has been described above, but as the flow of processing illustrated in the flowchart shows, the present invention can also take the form of an audio signal conversion method for converting a multi-channel input audio signal into an audio signal for reproduction by a speaker group.
[0113]
This audio signal conversion method has the following conversion step, extraction step, inverse conversion step, and removal step.
The conversion step is a step in which the conversion unit performs a discrete Fourier transform on the input audio signals of the two channels. The extraction step is a step in which the correlation signal extraction unit extracts the correlation signal from the audio signals of the two channels after the discrete Fourier transform in the conversion step, ignoring the DC components. The inverse conversion step is a step in which the inverse conversion unit performs an inverse discrete Fourier transform on the correlation signal extracted in the extraction step, or on the correlation signal and the uncorrelated signal, or on an audio signal generated from the correlation signal, or on an audio signal generated from the correlation signal and the uncorrelated signal. The removal step is a step in which the removal unit removes waveform discontinuities from the audio signal after the inverse discrete Fourier transform in the inverse conversion step. Other application examples are as described for the audio signal conversion device, and their description is omitted.
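Tying the four steps together, the following self-contained sketch operates on contiguous segments for readability, whereas the method itself uses half-overlapping windowed segments; the correlation estimator (a plain average of the two spectra) and the bias limit are illustrative assumptions:

    import numpy as np

    def convert_stereo(xl, xr, M=1024, bias_limit=0.1):
        out = np.zeros(len(xl))
        prev = None
        for s in range(0, len(xl) - M + 1, M):
            Xl = np.fft.fft(xl[s:s + M])            # conversion step
            Xr = np.fft.fft(xr[s:s + M])
            Xl[0] = Xr[0] = 0.0                     # extraction step: ignore DC
            y = np.fft.ifft(0.5 * (Xl + Xr)).real   # inverse conversion step
            if prev is not None:                    # removal step, eq. (42)
                B = (prev[-1] + (prev[-1] - prev[-2])) - y[0]
                if abs(B) < bias_limit:             # safety valve of [0094]
                    y = y + B
            out[s:s + M] = y
            prev = y
        return out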
04-05-2019
37
[0114]
The program code itself is, in other words, a program for causing a computer to execute the audio signal conversion method. That is, this program causes the computer to execute: a conversion step of performing a discrete Fourier transform on the input audio signals of the two channels; an extraction step of extracting the correlation signal from the audio signals of the two channels after the discrete Fourier transform in the conversion step, ignoring the DC components; an inverse conversion step of performing an inverse discrete Fourier transform on the correlation signal extracted in the extraction step, or on the correlation signal and the uncorrelated signal, or on an audio signal generated from the correlation signal, or on an audio signal generated from the correlation signal and the uncorrelated signal; and a removal step of removing waveform discontinuities from the audio signal after the inverse discrete Fourier transform in the inverse conversion step.
[0115]
DESCRIPTION OF SYMBOLS 110: audio data reproduction device, 111: decoder, 112: audio signal extraction unit, 113: audio signal processing unit, 114: D/A converter, 115: amplifier, 116: speaker, 121: audio signal separation and extraction unit, 122: noise removal unit, 123: audio output signal generation unit.