JP2010041484

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2010041484
The present invention provides a sound localization technique that does not cause a sense of
incongruity even in a scene where background sounds such as sound effects and BGM are playing. A
video / audio output device comprises: a video analysis unit 11 that analyzes video and specifies
the position of a speaker; a speaker voice localization parameter setting unit 12 that sets the
value of a speaker voice localization parameter so as to localize the voice at the specified
position of the speaker; a voice analysis unit 13 that analyzes the voice and calculates a feature
amount of the background sound; a speaker voice localization parameter adjustment unit 14 that,
when the loudness of the background sound is determined to be equal to or larger than a
predetermined threshold value based on the analyzed feature amount, adjusts the value of the
speaker voice localization parameter so that the change of the localization position is smaller
than with the set value; a localization processing unit 15 that changes the localization of the
voice according to the adjusted value of the speaker voice localization parameter; and a voice
output unit 17 that outputs the voice whose localization has been changed by the localization
processing unit 15. [Selected figure] Figure 1
Video and audio output device
[0001]
The present invention relates to an audio / video output apparatus that outputs content data
including video and audio, and more particularly, to an audio / video output apparatus that
determines audio localization according to a speaker position of an image and performs audio
output control.
09-05-2019
1
[0002]
When program content such as a television broadcast is received, the video is shown on a display,
and the audio is output from a loudspeaker, monaural audio causes the human voice to be heard from
the position of the loudspeaker.
With stereo / surround sound, the human voice is in many cases localized at the center of the
screen, so that it is heard from the center of the screen.
[0003]
However, it is generally known that the sense of reality increases when the human voice is
localized at the position of the speaker (the person speaking) on the display. Speech localization
techniques have therefore been disclosed that specify the speaker position by video analysis and
localize the speech at that position.
[0004]
For example, in Patent Document 1, the position of the speaker is detected, and the volume of the
sound output from each of a plurality of loudspeakers is controlled according to the detected
position.
In Patent Document 2, the position of the speaker is specified, effects and volume adjustment are
performed according to the specified position, and audio data is output from the optimum
loudspeaker.
[0005]
Patent Document 1: JP 11-313272 A
Patent Document 2: JP 2007-110582 A
[0006]
However, because the conventional technology described above localizes the voice at the speaker
position without considering the content of the scene, in some scenes it causes stress rather than
enhancing the sense of realism.
For example, in a scene where background sound such as sound effects and BGM is playing, the
background sound is also output from the speaker position, so the viewer watching the scene feels
stress instead.
[0007]
As described above, because the prior art uniformly localizes the voice at the speaker position
without considering the content of the scene, in a scene where background sounds such as sound
effects and BGM are playing it produces a sense of incongruity rather than enhancing the sense of
presence.
[0008]
The present invention has been made in view of the above circumstances. One example of its object
is to provide, for a voice localization technique that specifies a speaker position and localizes
the sound at the specified speaker position, an audio / video output apparatus that does not cause
a sense of incongruity even in a scene in which background sound such as sound effects and BGM is
playing.
[0009]
According to a first aspect of the present invention, there is provided an audio / video output
apparatus that controls audio localization on the basis of a voice localization parameter,
characterized by comprising: speaker position identification means for identifying the position of
a speaker; voice localization parameter setting means for setting the value of the voice
localization parameter so that the voice is localized at the position of the speaker identified by
the speaker position identification means; background sound analysis means for calculating a
feature amount of the background sound; voice localization parameter adjustment means for
adjusting, when the loudness of the background sound is determined to be equal to or greater than
a predetermined threshold based on the feature amount calculated by the background sound analysis
means, the value of the voice localization parameter so that the change of the localization
position is smaller than with the value set by the voice localization parameter setting means; and
localization change output means for changing the localization of the audio according to the value
adjusted by the voice localization parameter adjustment means and outputting the video and audio.
[0010]
Hereinafter, embodiments of the present invention will be described using the drawings.
[0011]
FIG. 1 is a schematic configuration diagram of a video and audio output device 1 according to an
embodiment of the present invention.
The video / audio output device 1 outputs voice with a localization matched to the speaker
position while taking the background sound of the voice data into consideration. Specifically, it
comprises a video analysis unit 11, a speaker voice localization parameter setting unit 12, a
voice analysis unit 13, a speaker voice localization parameter adjustment unit 14, a localization
processing unit 15, a video display unit 16, and a voice output unit 17.
[0012]
Here, the video and audio output device 1 may be any device that has a function of reproducing
externally input content data including video and audio and outputting it to the outside.
Specifically, a television (TV), a DVD player or recorder, a BD player or recorder, a personal
computer (PC), and the like are assumed.
"Background sound" refers to sounds other than human speech, such as sound effects and BGM.
"Speaker" refers to a person speaking in the video data (on the screen), and "speaker position"
refers to the position of that person on the screen, or more precisely the position near the
person's face (especially the mouth).
"Outputting the voice with voice localization matched to the speaker position" means outputting
the voice so that it is heard from the position of the speaker. For example, when the speaker is
on the left side of the screen, the volume of the voice output from the loudspeaker provided on
the left side of the screen is made large.
[0013]
The video analysis unit 11 outputs the input video data to the video display unit 16 (delaying the
video data as needed in order to synchronize it with the audio data), and also specifies the
speaker position from the input video data. The speaker position is identified using a known
technique. For example, the speaker may be specified by detecting the areas of human faces in the
video data and detecting the movement of the mouth in each face. For the mouth movement, a feature
amount such as the difference in brightness of the mouth area is calculated using the video data
of several frames before and after. If the person whose mouth area has the largest calculated
feature amount is determined to be the speaker, the speaker can be identified even if a plurality
of faces are detected.
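The mouth-movement criterion above can be sketched as follows. This is only an illustration under stated assumptions: the text says a brightness difference over several frames is used but gives no formula, so the score here (sum of absolute changes in mean mouth-area brightness between consecutive frames) and the names `mouth_motion_score` and `pick_speaker` are hypothetical.

```python
def mouth_motion_score(mouth_frames):
    """Sum of absolute changes in mean brightness between consecutive frames.

    mouth_frames: list of frames, each a list of rows of luminance values
    cropped to one person's mouth area (assumed input format).
    """
    def mean_brightness(frame):
        pixels = [p for row in frame for p in row]
        return sum(pixels) / len(pixels)

    means = [mean_brightness(frame) for frame in mouth_frames]
    return sum(abs(b - a) for a, b in zip(means, means[1:]))


def pick_speaker(faces):
    """faces: dict mapping a face id to its mouth-area frames.

    Returns the id of the face whose mouth area has the largest score.
    """
    return max(faces, key=lambda face_id: mouth_motion_score(faces[face_id]))
```

With two detected faces, the one whose mouth brightness fluctuates most across the frame window is selected, which matches the rule that the largest feature amount identifies the speaker.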
[0014]
In addition, the video analysis unit 11 is configured to output the identified speaker position to
the speaker voice localization parameter setting unit 12.
[0015]
The speaker voice localization parameter setting unit 12 is configured to set the value of a
parameter (hereinafter referred to as the speaker voice localization parameter) for localizing the
voice data at the speaker position input from the video analysis unit 11.
Here, "the value of the parameter for localizing voice data at the speaker position" is the value
of a parameter for outputting the voice so that it is heard from the speaker position. One example
is a parameter value relating to volume adjustment (a volume setting value for each of a plurality
of loudspeakers) that increases the volume of the loudspeaker installed near the speaker position
and decreases the volume of the other loudspeakers.
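As a rough illustration of a per-loudspeaker volume setting value, the sketch below assumes two loudspeakers (left and right) and a simple linear pan law, neither of which is specified in the text. The default width of 1440 pixels is taken from the example screen size used later in this description; the name `loudspeaker_gains` and the variable `talker_x` (the horizontal position of the person speaking) are hypothetical.

```python
def loudspeaker_gains(talker_x, screen_width=1440):
    """Return (left_gain, right_gain) for a talker at pixel column talker_x.

    Assumption: linear panning between two loudspeakers, so the gains sum
    to 1.0 and the loudspeaker nearer the talker gets the larger gain.
    """
    pan = min(max(talker_x / screen_width, 0.0), 1.0)  # 0.0 = left edge, 1.0 = right edge
    return (1.0 - pan, pan)
```

A talker on the left half of the screen yields a larger left gain, and a talker at the center yields equal gains.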
[0016]
The speaker voice localization parameter setting unit 12 is configured to output the set speaker
voice localization parameter value to the speaker voice localization parameter adjustment unit
14.
[0017]
The voice analysis unit 13 outputs the input voice data to the localization processing unit 15, and
analyzes the input voice data to calculate the feature amount of the background sound.
In the present embodiment, the feature amount is calculated by converting the voice data into the
frequency domain by Fourier transform and deriving the feature amount of the background sound from
the frequency spectrum (power spectrum). Background sound is generally concentrated in the high
band or the low band, so the power (energy) of the high band or the low band is used as the
feature amount of the background sound. For example, power existing in a frequency range of 80 Hz
or less or 3000 Hz or more may be used as the feature amount of the background sound.
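The band-power feature can be sketched with a naive discrete Fourier transform. The 80 Hz and 3000 Hz limits come from the text; the normalization, the handling of the DC bin (counted as part of the low band), and the function name `background_feature` are assumptions made for illustration. A real implementation would more likely use an FFT library.

```python
import cmath
import math


def background_feature(samples, sample_rate, low_hz=80.0, high_hz=3000.0):
    """Power at or below low_hz plus power at or above high_hz.

    Uses a direct O(n^2) DFT over the non-negative frequency bins, which is
    adequate for short analysis windows in a sketch.
    """
    n = len(samples)
    feature = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if low_hz < freq < high_hz:
            continue  # the band between the two limits is not counted
        bin_value = sum(
            s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples)
        )
        feature += abs(bin_value) ** 2 / n
    return feature
```

A pure 1000 Hz tone contributes almost nothing to this feature, while a 40 Hz or 3500 Hz tone of the same amplitude contributes strongly.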
[0018]
Further, the voice analysis unit 13 is configured to output the calculated background feature
amount to the speaker voice localization parameter adjustment unit 14.
[0019]
The speaker voice localization parameter adjustment unit 14 receives the value of the speaker
voice localization parameter set by the speaker voice localization parameter setting unit 12 and
the feature amount of the background sound output by the voice analysis unit 13, and adjusts the
set value of the speaker voice localization parameter.
Specifically, it determines whether the feature amount of the background sound is equal to or
greater than a predetermined threshold. When the feature amount is equal to or greater than the
threshold, that is, when the background sound is determined to be loud, the value of the speaker
voice localization parameter is adjusted (corrected) so as to reduce the amount of voice
localization change needed to localize the voice at the speaker position.
[0020]
Here, "adjusting the value of the speaker voice localization parameter so as to reduce the amount
of voice localization change needed to localize the voice at the current speaker position" means
the following. Take as an example one person moving from left to right on the screen while
background sound is playing. If the background sound were not considered at all, the speaker voice
localization parameter would be set to a value P1 that outputs the right-side loudspeaker at
volume A1. The adjustment changes this to a value P2 that outputs the right-side loudspeaker at
volume A2 (< A1). That is, even if the speaker position moves from left to right, the value of the
speaker voice localization parameter does not follow the speaker position with an extreme change;
it is adjusted to change gradually, for example by localizing the voice at the center of the
screen. As a result, even if the speaker position changes while loud background sound is being
output, the localization position of the background sound does not change extremely, and the
viewer does not feel discomfort. The amount of voice localization change may be reduced relative
to the input voice data (which is usually localized at the center of the screen), or relative to
the value of the speaker voice localization parameter set immediately before.
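One way to picture this adjustment is as a damped step from a reference value toward the value that would follow the speaker. The sketch below is an assumption-laden illustration: the `damping` factor and the function name `adjust_localization` are hypothetical, and `reference` stands for either the localization of the input voice data or the value set immediately before, the two options the text allows.

```python
def adjust_localization(set_value, reference, background_feature, threshold, damping=0.5):
    """Return the localization parameter value to apply.

    set_value: value chosen by the setting unit to follow the speaker (A1).
    reference: value the change is measured from (input localization or
               the previously set value).
    damping:   hypothetical factor in [0, 1]; 0 keeps the reference and
               1 applies the full change.
    """
    if background_feature < threshold:
        return set_value  # background sound is not loud: follow the speaker
    # Background sound is loud: apply only part of the change (A2 < A1).
    return reference + damping * (set_value - reference)
```

With `reference=0.5` and `set_value=1.0`, a loud background gives 0.75 instead of 1.0, so the right-side volume rises less than it would if the background sound were ignored.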
[0021]
Further, the speaker voice localization parameter adjustment unit 14 is configured to output the
value of the adjusted speaker voice localization parameter to the localization processing unit 15.
[0022]
The localization processing unit 15 receives the voice data and the value of the speaker voice
localization parameter output from the speaker voice localization parameter adjustment unit 14,
and performs localization change processing on the voice data based on that value.
Further, the localization processing unit 15 is configured to output the audio data subjected to
localization change processing to the audio output unit 17.
[0023]
The video display unit 16 is configured to output the video data output from the video analysis
unit 11 for display on a display or the like.
[0024]
The audio output unit 17 is configured to output the audio data subjected to localization change
processing to a loudspeaker.
[0025]
[0026]
Next, with reference to FIG. 2, the function of the speaker speech localization parameter
adjustment unit 14, that is, adjustment of the speaker speech localization parameter when the
speaker moves in a scene with a large background sound will be specifically described.
[0027]
In addition, in the specific example shown in FIG. 2, it demonstrates using a coordinate system as
shown in FIG.
That is, for an image size of 1440 × 1080, the coordinate system is defined in pixel units, with
the upper left of the screen as the origin, the horizontal direction as the X axis, and the
vertical direction as the Y axis.
Here, the position of the speaker SP specified on the screen is the position of the face. In the
present embodiment, the coordinates of the four corners of the rectangular face area F are used as
the position of the speaker SP.
Specifically, the position of the speaker SP is identified by the upper left vertex S0 (X0, Y0),
the upper right vertex S1 (X1, Y1), the lower left vertex S2 (X2, Y2), and the lower right vertex
S3 (X3, Y3) of the face area F.
[0028]
Further, in the specific example shown in FIG. 2, the above-described speaker voice localization
parameter is expressed as a speaker voice localization position P (Px, Py), and the voice is
output so that it is heard from the speaker voice localization position P.
In this example, the speaker voice localization position P is normally set to the center position
of the face area F of the specified speaker. When the speaker position changes in a scene with
loud background sound, the speaker voice localization position P is set to the center position of
the screen.
[0029]
FIG. 2A shows the case where the speaker A exists at the left position on the screen. Specifically,
as shown in FIG. 2A, the face area F of the speaker A is S0 (200, 220), S1 (580, 220), S2 (200,
600), S3 (580, 600). Therefore, the speaker's voice localization position P is P1 (390, 410) which
is the center of the face area F.
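The center of the face area follows directly from two opposite vertices of the rectangle. A minimal check of the numbers above, with the helper name `face_center` chosen here for illustration:

```python
def face_center(s0, s3):
    """Center of the rectangular face area F from its upper left (S0) and
    lower right (S3) vertices, in pixel coordinates."""
    return ((s0[0] + s3[0]) / 2, (s0[1] + s3[1]) / 2)


print(face_center((200, 220), (580, 600)))  # (390.0, 410.0)
```

The result agrees with P1 (390, 410).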
[0030]
On the other hand, FIG. 2B shows a case where the speaker A has moved from the left side to the
right side of the screen. Specifically, as shown in FIG. 2B, the face area F of the speaker A is
S0 (860, 220), S1 (1240, 220), S2 (860, 600), S3 (1240, 600). The center of the face area F is
P2 (1050, 410), but the speaker voice localization position P is set to P3 (720, 540), which is
the center position of the screen.
[0031]
As described above, when the speaker A moves and the speaker position changes in a scene with loud
background sound, the voice (speaker voice and background sound) is localized at the center
position of the screen so that the viewer does not feel discomfort. If the background sound were
not considered, the speaker's voice would be localized following the speaker's position, and the
speaker voice localization position P would become P2 (1050, 410).
[0032]
That is, when the speaker voice localization position P is determined in consideration of the
background sound, it changes from P1 (390, 410) to P3 (720, 540). When it is determined without
considering the background sound, it changes from P1 (390, 410) to P2 (1050, 410). The change in
position P1 (390, 410) → P3 (720, 540) is smaller than the change in position P1 (390, 410) →
P2 (1050, 410). This is a concrete instance of "adjusting the value of the speaker voice
localization parameter so as to reduce the amount of voice localization change" described above.
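The size of the two changes can be compared numerically. The text does not say how the amount of change is measured, so the Euclidean distance in pixels used here is an assumption.

```python
import math

p1, p2, p3 = (390, 410), (1050, 410), (720, 540)

following_speaker = math.dist(p1, p2)  # background sound not considered
toward_center = math.dist(p1, p3)      # background sound considered

print(following_speaker, round(toward_center, 1))  # 660.0 354.7
```

Under this measure the adjusted change is a little over half of the change that follows the speaker.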
[0033]
Next, with reference to FIG. 4, the video and audio output processing of the video and audio
output device 1 according to the present embodiment will be described. FIG. 4 is a flow chart
showing a flow of video / audio output processing for performing audio localization control in
consideration of the background sound of the video / audio output device 1.
[0034]
First, the video analysis unit 11 of the video and audio output apparatus 1 analyzes the input
video data, and specifies the speaker position of the video data (step S10).
[0035]
Next, the speaker voice localization parameter setting unit 12 of the video and audio output
device 1 sets the value of the speaker voice localization parameter based on the identified
speaker position (step S20).
[0036]
Next, the audio analysis unit 13 of the video and audio output device 1 analyzes the input audio
data, converts the audio data into a frequency domain, and calculates the feature amount of the
background sound (step S30).
For example, energy existing in a frequency region of 80 Hz or less or 3000 Hz or more is
calculated as a background feature amount.
[0037]
Next, the speaker voice localization parameter adjustment unit 14 of the video and audio output
device 1 determines, based on the calculated feature amount of the background sound, whether the
energy of the low band or the high band of the input voice data is high (step S40).
That is, it determines whether the calculated feature amount of the background sound is equal to
or more than a predetermined threshold.
[0038]
When the feature amount of the background sound is equal to or more than the threshold, that is,
the energy of the low band or high band of the input audio data is high (step S40: YES), the
background sound is determined to be loud (step S50). When the feature amount of the background
sound is less than the threshold, that is, the energy of the low band and the high band of the
input audio data is low (step S40: NO), the background sound is determined not to be loud (step
S60).
[0039]
When the background sound is determined to be loud (step S50), the speaker voice localization
parameter adjustment unit 14 of the video and audio output device 1 adjusts the value of the
speaker voice localization parameter so that the amount of change in voice localization toward the
speaker position becomes small (step S110).
[0040]
Next, the localization processing unit 15 of the video and audio output apparatus 1 changes the
localization of audio data according to the set value of the speaker audio localization parameter
(step S120).
That is, when it is determined that the background sound is large (step S50), the voice
localization change of the voice data is performed with the value of the speaker voice localization
parameter adjusted so that the voice localization change amount to the speaker position becomes
small. When it is determined that the background sound is not large (step S60), the voice
localization change of the voice data is performed with the value of the speaker voice localization
parameter set in step S20.
[0041]
Next, the video display unit 16 of the video and audio output device 1 outputs video data, and the
audio output unit 17 outputs audio data for which the audio localization change has been
performed.
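The decision part of this flow (steps S20 and S40 to S110) can be condensed into one function. This is a sketch under assumptions: the adjusted position is taken to be the screen center, as in the FIG. 2 example, and the function name `choose_localization_position` and its arguments are hypothetical. Video analysis (S10), feature calculation (S30) and the actual audio processing (S120) are outside its scope.

```python
def choose_localization_position(face_center, screen_center, background_feature, threshold):
    """Return the position at which the voice should be localized."""
    # S20: set the parameter so that the voice follows the speaker.
    position = face_center
    # S40 to S60: the background sound is loud if the feature reaches the threshold.
    background_is_loud = background_feature >= threshold
    # S110: when loud, adjust the parameter (here: move it to the screen center).
    if background_is_loud:
        position = screen_center
    # S120 would then change the localization of the audio to `position`.
    return position
```

For the FIG. 2B values, a loud background gives (720, 540), while a quiet one gives (1050, 410).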
[0042]
FIG. 5 is a flowchart showing a modification of the flow of background sound determination
processing of the video and audio output processing of the video and audio output device 1
according to the present embodiment, showing steps replacing step S30 to step S60 of FIG. .
This modification uses the fact that, in stereo audio, the speaker's voice is usually localized at
the center of the screen and therefore shows no difference between the left and right voice
signals, whereas background sound does show a difference between them. The feature amount of the
background sound is calculated using this difference.
That is, the voice analysis unit 13 uses the difference between the left and right voice signals as
the feature amount of the background sound.
[0043]
Specifically, first, the audio analysis unit 13 of the video and audio output device 1 analyzes the
input audio data, and calculates a difference signal between the left signal and the right signal
of the audio data as the feature amount of the background sound (step S70).
[0044]
Next, the speaker voice localization parameter adjustment unit 14 of the video and audio output
device 1 determines, based on the calculated feature amount of the background sound, whether the
difference signal of the input voice data is equal to or more than a predetermined threshold value
(step S80).
[0045]
If the feature amount of the background sound is greater than or equal to the threshold (step S80:
YES), the left and right audio data differ by a certain amount or more, so the background sound is
determined to be loud (step S90). If the feature amount is less than the threshold (step S80: NO),
there is no such difference between the left and right audio data, so the background sound is
determined not to be loud (step S100).
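Steps S70 to S100 can be sketched as below. The text only says that a difference signal between the left and right signals is used; reducing it to a single number by taking the mean absolute difference is an assumption, and the function names are hypothetical.

```python
def stereo_difference_feature(left, right):
    """Mean absolute difference between left and right samples (S70)."""
    return sum(abs(l - r) for l, r in zip(left, right)) / len(left)


def background_is_loud(left, right, threshold):
    """S80 to S100: loud if the difference feature reaches the threshold."""
    return stereo_difference_feature(left, right) >= threshold
```

A voice localized at the center has identical left and right samples and gives a feature of 0.0, whereas sound that differs between the channels gives a positive value.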
[0046]
As described above, the video and audio output device 1 according to the above-described
embodiment comprises: the video analysis unit 11, which analyzes the video and identifies the
position of the speaker; the speaker voice localization parameter setting unit 12, which sets the
value of the speaker voice localization parameter so that the voice is localized at the position
of the speaker identified by the video analysis unit 11; the voice analysis unit 13, which
analyzes the voice and calculates the feature amount of the background sound; the speaker voice
localization parameter adjustment unit 14, which, when the loudness of the background sound is
determined to be equal to or greater than a predetermined threshold based on the feature amount
calculated by the voice analysis unit 13, adjusts the value of the speaker voice localization
parameter so that the change of the localization position is smaller than with the value set by
the speaker voice localization parameter setting unit 12; the localization processing unit 15,
which changes the localization of the voice according to the value adjusted by the speaker voice
localization parameter adjustment unit 14; and the voice output unit 17, which outputs the voice
whose localization has been changed by the localization processing unit 15. Consequently, in a
device equipped with voice localization technology that localizes voice at the specified speaker
position, the viewer does not feel discomfort even if the speaker position changes in a scene
where background sounds such as sound effects and BGM are playing.
For example, even if one speaker moves from the left side to the right side on the screen in the
middle of a scene where BGM flows, the BGM does not follow the movement of the person, and
the viewer does not feel uncomfortable.
[0047]
Further, the voice analysis unit 13 may analyze the frequency characteristics of the voice,
calculate the energy contained at or above a first frequency in the high band or at or below a
second frequency in the low band, and use the calculated energy as the feature amount of the
background sound.
In this case, the background sound can be easily extracted from the audio data.
[0048]
Note that, as a method of calculating the feature amount of the background sound, the difference
signal of the left signal and the right signal of the sound may be calculated, and the calculated
difference signal may be used as the feature amount of the background sound. In this case as well,
the background sound can be easily extracted.
[0049]
Further, the speaker voice localization parameter adjustment unit 14 may adjust the value of the
speaker voice localization parameter so that the voice is localized at a position toward the
center of the display screen.
In this case, even if the speaker position changes in a scene where background sounds such as
sound effects and BGM are playing, the audio is localized at the screen center, so the viewer can
view the content comfortably without feeling discomfort.
[0050]
Although the embodiments of the present invention have been described above, the present
invention is not limited to the above-described embodiments. Various modifications and changes can
be made without departing from the scope of the present invention, and embodiments accompanied by
such modifications and changes are also included in the technical scope of the present invention.
[0051]
FIG. 1 is a schematic configuration diagram of a video and audio output device according to an
embodiment of the present invention.
FIG. 2 is a diagram showing how the speaker position changes in the video data input to the
video and audio output device according to the embodiment of the present invention.
It is an example of the video data input to the video and audio output device according to the
embodiment of the present invention.
FIG. 4 is a flowchart showing the flow of the video and audio output processing that performs audio
localization control in consideration of the background sound in the video and audio output device
according to the embodiment of the present invention.
FIG. 5 is a flowchart showing a modification of the flow of the background sound determination
processing in the video and audio output processing that performs audio localization control in
consideration of the background sound in the video and audio output device according to the
embodiment of the present invention.
Explanation of sign
[0052]
DESCRIPTION OF SYMBOLS
1 video / audio output device
11 video analysis unit
12 speaker voice localization parameter setting unit
13 voice analysis unit
14 speaker voice localization parameter adjustment unit
15 localization processing unit
16 video display unit
17 voice output unit