Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2008048342
[PROBLEM] To provide a sound collection device that accurately converts the speech speed of a speaker present at an arbitrary position around the device, based on the collected sound, without converting the speech speed of the background sound. [SOLUTION] An audio signal processing unit adds predetermined delays to the audio signals collected by the microphones and forms sound collection beams around the microphone array. The controller 8 generates information indicating the area where the speaker is present (speaker position information) based on the area corresponding to the sound collection beam having the highest level, and outputs the information to the storage unit 3 for recording. The sound collection beam corresponding to the speaker position information is output to the speech speed conversion unit 5 as the speaker voice signal, and the other sound collection beams are output to the mixer 6 as background sound signals. As a result, only the voice of the speaker undergoes speech speed conversion, and the background sound can be emitted and recorded without speech speed conversion. [Selected figure] Figure 1
Sound pickup device
[0001]
The present invention relates to a sound collection device which is used for a meeting or the like
and collects speech voices of meeting participants.
[0002]
2. Description of the Related Art
Conventionally, there has been proposed an apparatus that makes the contents of an utterance easier to listen to by expanding an input voice signal along the time axis and converting the speech speed.
04-05-2019
1
However, when the input voice signal is expanded, sounds other than the voice of the speaker (for example, BGM) are expanded at the same time. The BGM is also stretched when no speaker's voice is being input. If the listener is listening to the BGM at the same time as (in parallel with) the voice of the speaker, expanding the BGM causes the problem that the atmosphere of the original musical piece cannot be felt.
[0003]
Therefore, an apparatus has been proposed which analyzes the input voice signal and performs
speech speed conversion processing only when it is determined to be a speaker voice (for
example, see Patent Document 1).
[0004]
In addition, a device has been proposed in which a plurality of microphones are installed, and the sounds arriving in phase from points equidistant from the microphones are treated as speech, while the other collected sounds are separated as background sound (for example, see Patent Document 2).
[0005]
There is also proposed an apparatus configured to handle speech and background sound with a
plurality of independent channels and perform speech speed conversion processing only on the
speech channel (see, for example, Patent Document 3).
JP 2000-152394, JP 2005-208173, JP 2004-244081
[0006]
However, in the device of Patent Document 1, there is a problem that background sound picked up at the same time as the uttered voice undergoes speech speed conversion in the same manner as the uttered voice.
[0007]
Further, in the device of Patent Document 2, only voice from a point equidistant from each microphone can be processed as speech, so there was a problem that the speech speed of a speaker located anywhere else could not be converted.
[0008]
Further, in the device of Patent Document 3, it is necessary when recording to record the speech and the background sound on separate channels, and the speaker must speak into the microphone assigned to the specific channel.
[0009]
It is an object of the present invention to provide a sound pickup apparatus which accurately converts the speech speed of a speaker present at an arbitrary position around the apparatus, based on the picked-up sound, without changing the speech speed of the background sound.
[0010]
The sound collection device according to the present invention is characterized by including: a microphone array formed by arranging a plurality of microphones; a sound collection control unit that forms sound collection beams for a plurality of directions and identifies the speaker direction by comparing the sound collection beam levels; sound signal selection means for selecting the sound collection beam of the speaker direction as the speech signal and selecting the sound collection beams other than that of the speaker direction as background sound signals; speech speed conversion means for converting the speech speed of the speech signal; and a mixer for mixing the speech signal converted by the speech speed conversion means with the background sound signal selected by the sound signal selection means.
[0011]
In the present invention, predetermined delays are given to the collected voice signal of each microphone, and a plurality of sound collection beams having strong directivity in specific directions are formed. The speaker direction is identified by comparing the levels of these sound collection beams. For example, the direction of the sound collection beam with the highest level is set as the speaker direction. The sound collection beam of the speaker direction undergoes speech speed conversion as the speaker voice signal and is then output to the mixer, while the sound collection beams in the other directions are output to the mixer as they are, without speech speed conversion.
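The selection logic described here — compare the levels of the formed beams, take the strongest as the speaker voice signal, and pass the remaining beams through unchanged — can be sketched as follows. This is an illustrative sketch, not the patented implementation; the function names and the RMS level measure are assumptions.

```python
import math

def rms(signal):
    """Root-mean-square level of one sound collection beam."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))

def route_beams(beams):
    """Pick the highest-level beam as the speaker voice signal; the
    remaining beams are treated as background sound (unmodified)."""
    levels = [rms(b) for b in beams]
    speaker_idx = max(range(len(beams)), key=lambda i: levels[i])
    speaker_beam = beams[speaker_idx]
    background = [b for i, b in enumerate(beams) if i != speaker_idx]
    return speaker_idx, speaker_beam, background

# Beam 1 carries the loudest signal, so it is selected as the speaker beam.
beams = [[0.1, -0.1, 0.1], [0.9, -0.8, 0.9], [0.2, -0.2, 0.2]]
idx, spk, bg = route_beams(beams)
```

In the device, the speaker beam would go on to the speech speed conversion unit and the background beams directly to the mixer.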
[0012]
Further, in the sound collection device according to the present invention, the sound signal selection means selects, as the background sound signal, a sound collection beam having at least a predetermined level from among the directions other than that of the sound collection beam selected as the speech signal.
[0013]
According to the present invention, when a high-level sound collection beam exists in a direction other than the direction in which the speaker is determined to be present, the sound collection beam in that direction is output as the background sound signal, on the assumption that the sound source of the background sound is present in that direction. This makes it possible to collect the background sound accurately as well.
[0014]
Further, in the sound collection device according to the present invention, the sound signal selection unit inputs to the speech speed conversion means, as the speech signal, the difference signal between the sound collection beam selected as the speech signal and the sound collection beams in the directions adjacent to it.
[0015]
In the present invention, the sound collection beams in the adjacent directions are subtracted from the sound collection beam selected as the speaker voice signal. As a result, the level of the background sound included in the sound collection beam selected as the speaker voice signal can be reduced, and the speech speed of only the speaker's voice can be converted more accurately.
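The adjacent-beam subtraction can be illustrated as follows; the subtraction weight is a free parameter of this sketch, not a value given in the text.

```python
def suppress_adjacent(speaker_beam, adjacent_beams, weight=0.5):
    """Subtract a scaled average of the adjacent beams from the speaker
    beam to reduce background sound leaking into it. `weight` is a
    tuning assumption of this sketch."""
    out = []
    for i, s in enumerate(speaker_beam):
        leak = sum(b[i] for b in adjacent_beams) / len(adjacent_beams)
        out.append(s - weight * leak)
    return out

# Background leaking equally into the speaker beam and an adjacent beam
# is attenuated, leaving (approximately) the speech component.
spk = [1.0 + 0.2, 1.0 + 0.2]   # speech plus leaked background
adj = [[0.2, 0.2]]             # adjacent beam: mostly background
clean = suppress_adjacent(spk, adj, weight=1.0)
```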
[0016]
Further, the sound collection device according to the present invention further includes a speech signal extraction unit that extracts speech components from the plurality of sound collection beams formed by the sound collection control unit, and the sound collection control unit determines, as the speaker direction, the direction of the sound collection beam from which the speech signal extraction unit extracted a speech component.
[0017]
In the present invention, speech components are extracted from the respective sound collection beams. For example, a voice feature amount is extracted from each sound collection beam and compared with a voice feature amount of speech stored in advance; if they match, the beam is estimated to contain speech. Since the sound collection control unit selects, as the speaker voice signal, the sound collection beam that has the highest level and is estimated to contain speech, the speech speed of only the speaker's voice can be converted more appropriately.
[0018]
According to the present invention, the direction of the speaker is determined from the sound collection beams formed by the microphone array, only the sound collection beam for the speaker's direction undergoes speech speed conversion, and the sound collection beams in the other directions are output as they are. Thus, the speech speed of the speaker's voice can be converted accurately, and the background sound can be picked up without speech speed conversion.
[0019]
A sound emission and collection device according to an embodiment of the present invention will
be described with reference to the drawings.
This sound emitting and collecting device is used as a loudspeaker, a recorder, etc. in a
conference.
FIG. 1 is a block diagram showing the configuration of the sound emission and collection device.
As shown in the figure, the sound emitting and collecting apparatus includes a speaker 1, a plurality of microphones 2A to 2M, a storage unit 3, an audio signal processing unit 4, a speech speed conversion unit 5, a mixer 6, a recording and reproduction unit 7, a controller 8, and an input / output I / F 9.
[0020]
The plurality of microphones 2A to 2M are arranged in a straight line (or in a matrix form, a
honeycomb form) at regular intervals to constitute a microphone array. Each microphone 2 is generally a dynamic microphone, but another type such as a condenser microphone may be used. Further, the number of microphones arranged and their spacing are
appropriately set according to the environment in which the sound emitting and collecting
apparatus is installed, the required frequency band and the like.
[0021]
When sound is emitted at a certain position around the microphones 2A to 2M, each microphone
2 picks up the sound. The microphone 2 outputs an audio signal from the collected voice to the
audio signal processing unit 4. In FIG. 1, an amplifier at the front end and an A / D converter for
converting an analog audio signal into a digital audio signal are omitted. The audio signals output
from the microphones 2 are synthesized by the audio signal processing unit 4 and output to the
speech speed conversion unit 5 or the mixer 6. The audio signal processing unit 4 selectively outputs the audio signals from the microphones 2 in accordance with instructions from the controller 8. When sound is picked up by the microphones 2, it propagates with a propagation time according to the distance between each microphone 2 and the sound source, so the microphones 2 pick it up with a time difference in timing.
[0022]
Here, for example, when sound waves arrive at all the microphones 2 from the front at the same
timing, the audio signals output from the respective microphones 2 are strengthened by
synthesis. On the other hand, when sound waves arrive from directions other than this, the sound
signals output from the respective microphones 2 are weakened by being synthesized because
they have different phases. In this way, the sensitivity of the microphone array is narrowed into a beam shape, forming the main sensitivity lobe (sound collection beam) only toward the front.
[0023]
The audio signal processing unit 4 can steer the sound collection beam obliquely by giving predetermined delay times to the audio signals output from the respective microphones 2. To steer the beam obliquely, the delays are set so that, starting from the microphone 2 at one end, each adjacent microphone's signal is output after a further predetermined time. For example, when the sound source is in front of one end of the microphone array, the sound wave arrives first at the end closest to the sound source and last at the opposite end. A delay time is added to the audio signal of each microphone 2 so as to correct this propagation time difference, and the signals are then synthesized. Thereby, the audio signal from this direction is enhanced by the synthesis. Therefore, by sequentially delaying the audio signals output from the microphones 2 arranged in a line from one end to the other, the sound collection beam is inclined according to the delay times.
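The per-microphone delays for steering a beam off-axis can be computed as in the following sketch for a uniform line array under a plane-wave assumption. The spacing, sampling rate, and speed of sound are illustrative values not taken from the text.

```python
import math

def steering_delays(num_mics, spacing_m, angle_deg, fs, c=343.0):
    """Per-microphone delays (in samples) that align a plane wave
    arriving from `angle_deg` off broadside, for a uniform line array.
    The smallest delay is shifted to zero."""
    # Inter-microphone arrival-time difference, in seconds.
    tau = spacing_m * math.sin(math.radians(angle_deg)) / c
    delays = [m * tau * fs for m in range(num_mics)]
    base = min(delays)
    return [d - base for d in delays]

# Broadside (0 degrees): no relative delay is needed between microphones.
d0 = steering_delays(4, 0.05, 0.0, 48000)
# 30 degrees off-axis: delays grow linearly along the array.
d30 = steering_delays(4, 0.05, 30.0, 48000)
```

Summing the channels after applying such delays reinforces sound from the steered direction, which is the delay-and-sum principle the paragraph describes.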
[0024]
Also, it is possible to form a plurality of sound collection beams simultaneously. FIG. 2 is a block
diagram showing the configuration of the main part of the audio signal processing unit 4
connected to the microphone 2. The microphones 2A to 2M are connected to the digital filters
41A to 41M of the audio signal processing unit 4, respectively. The voices collected by the microphones 2A to 2M are input to the digital filters 41A to 41M as digital voice signals. Although FIG. 2 shows a detailed block diagram of only the digital filter 41A among the digital filters 41A to 41M, the other digital filters 41B to 41M have the same structure and perform the same operation.
[0025]
The digital filter 41A includes a delay buffer 42A having a plurality of stages of outputs. The
delay amount of each stage of the delay buffer 42A is set according to the arrangement of the
microphones 2 of the microphone array and the area in front of the microphone array (the area
for detecting a speaker). In this example, the delay buffer 42A has four stages of outputs, and
these output signals are input to the FIR filters 431A to 434A.
[0026]
The delay buffer 42A buffers, at each stage, audio signals obtained by adding different delay
times to the audio signals output from the microphone 2A, and outputs each delayed audio signal
to the FIR filters 431A to 434A. Here, the delayed voice signals output to the FIR filters 431A to
434A correspond to the respective regions in front of the microphone array. FIG. 3 is a diagram showing an example of a sound source direction detection method. FIG. 3A shows the positional relationship between a sound source and the microphones, together with the delays with which sound generated at the sound source is collected by each microphone; FIGS. 3B and 3C illustrate how the delay correction amounts are determined on the basis of the delays of the collected audio signals.
[0027]
As shown in the figure, in this sound emitting and collecting apparatus, four partial areas 101 to
104 are set in front of the microphone array. The sound generated in the partial area 101 is
picked up first by the closest microphone 2A. Then, in accordance with the distance between the
partial area 101 and the microphones 2, the sound is collected by the microphones in order, and
the sound is collected finally by the farthest microphone (the microphone 2L in the figure). On
the other hand, the sound generated in the partial area 104 is first collected by the closest microphone 2L, is then collected by each microphone in order according to its distance from the partial area 104, and is collected last by the farthest microphone 2A. As described above, the sound generated in each area is picked up with a delay according to the distance to each microphone.
[0028]
Here, for the partial area 101, the audio signals collected by the microphones 2A to 2L are subjected to the delay processing shown in FIG. 3B. That is, the delay correction amounts are set so as to correct the delays shown in FIG. 3A. On the other hand, for the partial area 104, the audio signals collected by the microphones 2A to 2L are subjected to the delay processing shown in FIG. 3C.
[0029]
A delayed sound signal for forming a sound collection beam corresponding to the partial region
101 is generated in the delay buffer 42A and output to the FIR filter 431A. Further, a delayed
sound signal for forming a sound collection beam corresponding to the partial region 102 is
output to the FIR filter 432A. Similarly, a delayed voice signal for forming a collected beam
corresponding to partial region 103 is output to FIR filter 433A, and a delayed voice signal for
forming a collected beam corresponding to partial region 104 to FIR filter 434A. It is output. The
amount of delay of these delayed audio signals is set according to the distance between the
microphone 2 and each area as shown in FIG. For example, the delay voice signal corresponding
to partial region 101 has a large delay amount because the distance between microphone 2A and
partial region 101 is short, and the delay voice signal corresponding to partial region 104 has a
distance between microphone 2A and partial region 104 The delay amount is small because it is
the farthest.
[0030]
In FIG. 2, the FIR filters 431A to 434A all have the same configuration, and filter and output the delayed audio signals input to them. The FIR filters 431A to 434A can realize fine delay times that cannot be achieved by the delay buffer 42A alone. That is, by setting the sampling period and the number of taps of each FIR filter to desired values, the delay buffer 42A realizes the integer part of the delay time (in units of the sampling period) and the FIR filter realizes the fractional part.
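A common way to realize the fractional part of a delay with an FIR filter is a windowed-sinc design, sketched below. The patent only states that the FIR filters realize delays finer than the delay buffer, so this specific design is an assumption.

```python
import math

def fractional_delay_fir(frac, num_taps=8):
    """Windowed-sinc FIR coefficients approximating a delay of `frac`
    samples (0 <= frac < 1), plus a fixed num_taps//2-sample latency.
    A standard fractional-delay technique, used here for illustration."""
    center = num_taps // 2 + frac
    taps = []
    for n in range(num_taps + 1):
        x = n - center
        # Ideal delay is a shifted sinc; window it to a finite length.
        sinc = 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.54 + 0.46 * math.cos(2 * math.pi * x / (num_taps + 1))
        taps.append(sinc * window)
    return taps

# Half-sample delay: 9 taps whose DC gain stays close to unity.
taps = fractional_delay_fir(0.5)
```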
[0031]
The delayed voice signals output from the FIR filters 431A to 434A are amplified by the
respective amplifiers 441A to 444A and input to the adders 45A to 45D. The other digital filters
41B to 41M also have the same configuration as the digital filter 41A, and output delayed voice
signals to the adders 45A to 45D in accordance with delay conditions set in advance.
[0032]
The adder 45A synthesizes the delayed audio signals input from the digital filters 41A to 41M to generate the sound collection beam corresponding to the partial area 101 in FIG. 3. Similarly, the adder 45B synthesizes the delayed audio signals input from the digital filters 41A to 41M to generate the sound collection beam corresponding to the partial area 102 in FIG. 3, the adder 45C synthesizes them to generate the sound collection beam corresponding to the partial area 103, and the adder 45D synthesizes them to generate the sound collection beam corresponding to the partial area 104.
[0033]
The sound collection beam output from each of the adders 45A to 45D is output to a band pass
filter (BPF) 46. The BPF 46 filters each sound collection beam and outputs the sound collection
beam of a predetermined frequency band to the level determination unit 47. Here, since the frequency band over which a beam can be formed depends on the width of the microphone array and the spacing of the microphones 2, the pass band of the BPF 46 is set to the frequency band of the sound to be picked up by each sound collection beam. For example, if the sound to be picked up is the speech of a speaker, the frequency band of the human voice may be set as the pass band.
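A crude band-pass of the kind described, limiting each beam to a voice band before level judgment, can be sketched as the difference of two one-pole low-pass filters. The 300–3400 Hz telephone voice band used here is an assumed example, not a figure from the text.

```python
import math

def one_pole_lowpass(signal, cutoff_hz, fs):
    """Simple one-pole low-pass filter (exponential smoothing)."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / fs)
    out, y = [], 0.0
    for x in signal:
        y = (1.0 - a) * x + a * y
        out.append(y)
    return out

def voice_bandpass(signal, fs, lo=300.0, hi=3400.0):
    """Crude voice-band band-pass built as the difference of two
    low-pass filters; `lo`/`hi` are assumed cutoff values."""
    wide = one_pole_lowpass(signal, hi, fs)
    narrow = one_pole_lowpass(signal, lo, fs)
    return [w - n for w, n in zip(wide, narrow)]

# DC (0 Hz) lies outside the pass band, so it is rejected at steady state.
out = voice_bandpass([1.0] * 5000, 8000)
```

A practical device would use a properly designed filter; this sketch only illustrates restricting each beam to the band of interest before measuring its level.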
[0034]
The level determination unit 47 outputs information indicating the level of each sound collection beam to the controller 8. The controller 8 compares the levels of the input sound collection beams and selects the sound collection beam having the highest level. When the level of a sound collection beam is high, the sound source (speaker) is present in the region corresponding to that beam, so in the four-region division shown in FIG. 3, the region where the speaker is present can be detected.
[0035]
Here, the controller 8 generates information indicating the speaker's presence region (hereinafter referred to as speaker position information) based on the region corresponding to the sound collection beam having the highest level. When the level (absolute level) of the highest-level sound collection beam is less than a predetermined threshold (for example, the level of typical speech), the controller 8 may assume that no speaker is present and not generate speaker position information.
[0036]
Based on the generated speaker position information, the controller 8 sets the signal selection unit 48 to select the sound collection beam corresponding to the speaker position information and output it to the speech speed conversion unit 5 as the speaker voice signal. In addition, the controller 8 sets the signal selection unit 48 to select any one of the sound collection beams corresponding to directions other than the area indicated by the speaker position information and output it to the mixer 6 as a background sound signal. The controller 8 may also set the signal selection unit 48 to select a plurality of sound collection beams corresponding to directions other than the area indicated by the speaker position information, combine them, and output the result to the mixer 6. Of course, all the sound collection beams corresponding to directions other than that area may be synthesized and output to the mixer 6.
[0037]
Here, depending on the level of each sound collection beam, the following two patterns can be considered for the speaker voice signal and background sound signal to be output.
(1) When the background sound is a point sound source: in this case, a high level appears in one of the sound collection beams corresponding to directions other than the area indicated by the speaker position information. Therefore, when, as a result of comparing the levels of the respective sound collection beams, the controller 8 detects that a sound collection beam corresponding to a direction other than the area indicated by the speaker position information has a level that is at least a predetermined value but below the above-mentioned threshold, it sets the signal selection unit 48 to output the sound collection beam in that direction as the background sound signal.
(2) When the background sound is not localized: in this case, high levels appear in a plurality of sound collection beams corresponding to directions other than the area indicated by the speaker position information. Therefore, when, as a result of comparing the levels of the respective sound collection beams, the controller 8 detects that a certain proportion (for example, a majority or more) of the sound collection beams corresponding to directions other than the area indicated by the speaker position information have levels that are at least a predetermined value but below the above-mentioned threshold, it sets the signal selection unit 48 to output the sound collection beam having the highest level among them as the background sound signal. At this time, since a component of the background sound is also included in the sound collection beam corresponding to the speaker position information, the controller 8 sets the signal selection unit 48 to output the difference between the sound collection beam corresponding to the speaker position information and the adjacent sound collection beams as the speaker voice signal.
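The two-case decision above (point-source versus unlocalized background) can be summarized in code. The level thresholds and the majority criterion are tunable assumptions; the text gives only "a predetermined value" and "a majority or more".

```python
def classify_background(levels, speaker_idx, floor, speech_threshold,
                        majority=0.5):
    """Decide which non-speaker beam to use as the background sound
    signal, following the two cases in the text. `floor` stands for
    the 'predetermined value' and `speech_threshold` for the typical
    speech level; both are assumptions. Returns (beam index or None,
    whether adjacent-beam subtraction is needed on the speaker beam)."""
    others = [(i, lv) for i, lv in enumerate(levels) if i != speaker_idx]
    candidates = [(i, lv) for i, lv in others
                  if floor <= lv < speech_threshold]
    if not candidates:
        return None, False
    if len(candidates) / len(others) > majority:
        # Case (2): unlocalized background. Many beams are active, so
        # take the loudest and flag the speaker beam for subtraction.
        best = max(candidates, key=lambda t: t[1])[0]
        return best, True
    # Case (1): point-source background. A single direction suffices.
    return candidates[0][0], False
```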
[0038]
As described above, the audio signal processing unit 4 can separate the speech of the speaker from the other sounds and output them to the subsequent stage.
[0039]
Although FIG. 2 shows an example in which four partial areas 101 to 104 are set in front of the microphone array and a sound collection beam is formed for each area, it is possible to form sound collection beams for a larger number of regions by increasing the number of output stages of the delay buffer 42 shown in FIG. 2 and providing an FIR filter, an amplifier, and an adder for each output stage.
In addition, by arranging the microphones in two rows and connecting the audio signal processing unit shown in FIG. 2 to each row, a sound collection beam can be formed in the front direction of each row, so that sound collection beams can be formed on both sides of the microphone array (i.e., over approximately 360 degrees).
[0040]
Further, the controller 8 may extract voice feature amounts from the respective sound collection beams and distinguish between speech and musical sound (including, for example, singing voice). The voice feature amount typically represents the formants, pitch, and the like of the speaker, and is extracted from the frequency spectrum (power spectrum) obtained by Fourier-transforming the voice data, or from the cepstrum obtained by taking the logarithm of the power spectrum and applying an inverse Fourier transform. The voice feature amounts of speech and of musical sound are stored in the storage unit 3 in advance; a sound collection beam whose voice feature amount matches that of speech may be selected as the speaker voice signal, and one that matches the voice feature amount of musical sound may be selected as a background sound signal. In addition, when there are a plurality of high-level sound collection beams, the voice feature amount of each sound collection beam may be analyzed, and the one having the features of speech may be determined to be the speaker's sound collection beam.
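Matching a beam's voice feature amount against stored templates can be sketched with a nearest-template comparison. The two-dimensional toy features below stand in for the formant/pitch/cepstrum features described, and all values are hypothetical.

```python
def nearest_template(feature, templates):
    """Compare a beam's voice feature vector with stored templates
    (e.g. 'speech', 'music') and return the closest label, using
    squared Euclidean distance."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(templates, key=lambda label: dist(feature, templates[label]))

# Toy feature space: each stored template is a pre-computed feature vector.
templates = {"speech": [1.0, 0.2], "music": [0.2, 1.0]}
label = nearest_template([0.9, 0.3], templates)
```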
[0041]
It should be noted that, prior to the meeting, the chairperson may operate the sound emission and collection device and have each meeting participant make a statement, so that the speaker position information is generated in advance and recorded in the storage unit 3. In this case, during the conference, the controller 8 sets the signal selection unit 48, based on the speaker position information stored in the storage unit 3, to select the sound collection beam corresponding to the speaker position information and output it to the speech speed conversion unit 5 as the speaker voice signal. Further, the controller 8 sets the signal selection unit 48 to select any one of the sound collection beams corresponding to directions other than the area indicated by the speaker position information stored in the storage unit 3 and output it to the mixer 6 as a background sound signal.
[0042]
Next, the speech speed conversion unit 5 performs speech speed conversion processing on the input speaker voice signal according to the instruction of the controller 8. The speech speed conversion is not performed by simply playing back at a lower speed, but as follows. That is, in the speech speed conversion process, the voice signal is divided into single-period waveforms, a new periodic waveform is synthesized by combining each pair of adjacent periodic waveforms, and the newly synthesized periodic waveform is inserted between the original pair. By inserting such waveforms, the number of periodic waveforms in the signal is increased, extending the signal along the time axis while keeping the pitch.
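The expansion step — cross-fade two adjacent periods into an intermediate period and insert it between them — can be sketched as follows, assuming equal-length period buffers; this is an illustration of the described idea, not the patented algorithm.

```python
def expand_one_period(a, b):
    """Cross-fade period A (decaying gain) with period B (rising gain)
    to synthesize an intermediate period, then insert it between them.
    `a` and `b` are equal-length lists of samples for one pitch period."""
    n = len(a)
    blended = [a[i] * (1.0 - i / n) + b[i] * (i / n) for i in range(n)]
    # Output A, the synthesized period, then B: the pitch is unchanged
    # but the duration grows from two periods to three.
    return a + blended + b

a = [0.0, 1.0, 0.0, -1.0]
b = [0.0, 0.8, 0.0, -0.8]
out = expand_one_period(a, b)
```

Repeating this for every pair of periods doubles the duration; doing it every second pair gives a 3/2 expansion, matching the variable-rate idea discussed later in the text.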
[0043]
FIG. 4A is a flowchart showing the procedure of the expansion process, and FIG. 4B is a diagram explaining the expansion method. In FIG. 4A, first, the number of samples in one period (sampling frequency × 1 / signal frequency) at the head of the input voice signal is detected (S91). Two periodic waveforms, i.e., sample data for two periods, are taken out, and as shown in FIG. 4B, an attenuating wave is created by multiplying the first periodic waveform A by a decreasing gain coefficient, and an increasing wave is created by multiplying the second periodic waveform B by an increasing gain coefficient (S92). These are then added together to synthesize a periodic waveform having a shape intermediate between A and B (S93). As shown in FIG. 5A, this synthesized waveform is inserted between periodic waveform A and periodic waveform B and output (S94), realizing an acoustically natural time-axis expansion.
[0044]
In the case of compressing voice data, as shown in FIG. 5B, the synthesized waveform of intermediate shape between A and B produced at S93 is output in place of the periodic waveforms A and B. The voice data can thus be compressed to half its length in the time-axis direction.
[0045]
Further, by defining the cycle at which the speech speed conversion process is performed, the conversion rate can be made variable. For example, as shown in FIG. 5C, the voice data can be expanded twofold in the time-axis direction by synthesizing a waveform for every period and inserting it between each pair of periodic waveforms; and as shown in FIG. 5D, by synthesizing a waveform only every two periods, it can be expanded by a factor of 3/2.
[0046]
Also, so as not to expand more than necessary, the speech speed conversion expands only the head portion (for example, 700 msec) of a voice section and outputs the subsequent portion at normal speed. Note that the head portion may be expanded and the subsequent portion compressed. The distinction between a voice section and a noise section may be determined from the periodicity of the voice signal. For example, a correlation value is calculated by dividing the voice signal into segments of a predetermined length and correlating the corresponding sample data. As shown in FIG. 6, when this correlation value is lower than a predetermined threshold value, the segment is judged to be a noise section, and when it is higher, a voice section. For a highly periodic signal such as speech the correlation value is high, and for a signal of low periodicity such as noise the correlation value is low.
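The periodicity test for distinguishing voice sections from noise sections can be sketched as a normalized correlation of the signal with itself shifted by one pitch period. The period and threshold values are assumptions; the text only describes comparing a correlation value against a threshold (FIG. 6).

```python
import math

def is_voice_section(signal, period, threshold=0.5):
    """Classify a segment as voice (periodic) if the normalized
    correlation between the signal and itself shifted by `period`
    samples exceeds `threshold`; otherwise treat it as noise."""
    n = len(signal) - period
    num = sum(signal[i] * signal[i + period] for i in range(n))
    den = sum(s * s for s in signal[:n]) or 1.0
    return (num / den) > threshold

# A pure tone with an 8-sample period correlates strongly at that lag,
# and anti-correlates at half the period.
periodic = [math.sin(2 * math.pi * i / 8) for i in range(64)]
```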
[0047]
In the present embodiment, an example is shown in which speech speed conversion is performed on the first 700 msec of a voice section; however, the conversion may be performed on a longer or a shorter section. In addition, the expansion rate may be changed during the speech speed conversion interval. For example, when the section length is 700 msec, the first 600 msec may be expanded twofold and the subsequent 100 msec by a factor of 3/2.
[0048]
The speaker voice signal whose speech speed has been converted by the speech speed
conversion unit 5 as described above is input to the mixer 6 and mixed with the background
sound signal input from the sound signal processing unit 4 in the mixer 6. The mixed audio
signal is input to the recording / reproducing unit 7. The recording / reproducing unit 7 supplies
the input audio signal to the speaker 1 and the input / output I / F 9, converts the audio signal
into audio data (for example, compressed data such as MP3), and inputs it to the storage unit 3.
Further, the recording / reproducing unit 7 reads the audio data recorded in the storage unit 3
and supplies an audio signal based on the audio data to the speaker 1 and the input / output I / F
9.
[0049]
The speaker 1 emits an audio signal input from the recording / reproducing unit 7. Although a
cone type speaker is generally used as the speaker 1, other types such as a horn type speaker
may be used. In FIG. 1, a D / A converter for converting a digital audio signal into an analog
audio signal, an amplifier for amplifying the signal, and the like are omitted.
[0050]
The storage unit 3 records the audio data input from the recording / reproducing unit 7. In addition, as described above, the speaker position information input from the controller 8 is recorded.
[0051]
As a result, among the voices collected by the sound emission and collection device, only the
voice of the speaker is subjected to speech speed conversion, and the background sound is
emitted or recorded as it is without speech speed conversion.
[0052]
The input / output I / F 9 supplies an audio signal to another device. It has an interface corresponding to the device to which the signal is supplied; for example, it may be a network interface that converts the audio signal into a form suitable for network transmission and outputs it to another sound emitting device connected via the network. Further, the input / output I / F 9 receives an audio signal from another sound emitting and collecting apparatus connected via the network and inputs it to the recording / reproducing unit 7. The recording / reproducing unit 7 records in the storage unit 3 both the sound picked up by the own device and the sound input from the other device.
[0053]
Although a single speaker 1 was shown on the sound emission side in the above embodiment, a speaker array may be formed by arranging multiple speakers 1 linearly. In this case, by sequentially delaying the audio signal supplied to each speaker, the audio beam can be focused, and sound image localization can be performed as if the sound were emitted from the position of the speaker.
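The per-speaker delays for such beam focusing can be sketched as follows. This is an illustration under a simple point-focus geometry assumption; the array spacing, focus position, and sound speed values are hypothetical, not taken from the embodiment:

```python
import math

SOUND_SPEED = 343.0  # m/s, approximate speed of sound at room temperature

def focus_delays(speaker_xs, focus_x, focus_y):
    """Delay (in seconds) to apply to each speaker of a linear array so
    that the emitted wavefronts arrive simultaneously at the focus point."""
    dists = [math.hypot(x - focus_x, focus_y) for x in speaker_xs]
    farthest = max(dists)
    # nearer speakers are delayed more, so all paths align at the focus
    return [(farthest - d) / SOUND_SPEED for d in dists]

# four speakers 10 cm apart, focusing 1 m in front of the array center
xs = [0.0, 0.1, 0.2, 0.3]
delays = focus_delays(xs, focus_x=0.15, focus_y=1.0)
print([round(d * 1e6, 1) for d in delays])  # microseconds per channel
```

Since the focus point sits on the array's axis of symmetry here, the outer pair and the inner pair each share a delay, with the inner (nearer) speakers delayed more.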
[0054]
When the collected voice signal is output to another device and the speaker array is configured on the other device side, the above-described speaker position information is also output, so that the sound emitted by the other device can likewise be localized as if it were emitted from the position of the speaker.
[0055]
Further, in the case of connecting a plurality of the sound emitting and collecting apparatuses of
the above embodiments via a network, the following application examples are possible.
FIG. 7 is a view showing an example of configuring a voice conference system by connecting a
plurality of the sound emitting and collecting devices of the above embodiment via a network.
This audio conference system has sound emitting and collecting devices 111A to 111C
connected via a network 100. Since the sound emission and collection devices 111A to 111C
have the same configuration and function as the sound emission and collection device described
in the above embodiment, the detailed description of the respective configurations and functions
will be omitted.
[0056]
The sound emission and collection devices 111A to 111C are disposed at distant points a to c,
respectively. The sound emitting and collecting device 111A is disposed at the point a, the sound
emitting and collecting device 111B is disposed at the point b, and the sound emitting and
collecting device 111C is disposed at the point c.
[0057]
At the point a, the conferees A and B are present in the directions Dir11 and Dir13 with respect to the sound emission and collection device 111A, respectively. At the point b, the sound source A exists in the direction Dir22 with respect to the sound emission and collection device 111B. At the point c, the conferees C and D are present in the directions Dir31 and Dir32 with respect to the sound emission and collection device 111C, respectively. The azimuths Dir11 to Dir14, Dir21 to Dir24, and Dir31 to Dir34 respectively correspond to the four partial areas 101 to 104 in the above-described embodiment, and each sound emitting and collecting apparatus collects voices from these azimuths.
[0058]
In this audio conference system, each sound emitting and collecting device transmits the sound
collected by its own device to all other sound emitting and collecting devices. Also, each sound
emitting and collecting device records the sound transmitted from another device together with
the sound collected by its own device.
[0059]
When the conferees A and B speak, the sound emitting and collecting apparatus 111A converts the speech speed of their voices and transmits them to the other apparatuses. Similarly, when the conferees C and D speak, the sound emitting and collecting apparatus 111C converts the speech speed of their voices and transmits them to the other apparatuses.
[0060]
Here, the sound emission and collection device 111B outputs the tone generated by the sound source A to the other devices without converting its speech speed. The device 111B transmits the sound without speech speed conversion even if the level of the tone generated by the sound source A is very large, for example even if it exceeds the above-described predetermined threshold (the level of general speech). That is, in FIG. 1, when the controller 8 is instructed by an operation unit (not shown) not to convert the speech speed, the controller 8 sets the voice signal processing unit 4 so that the collected voice is always output to the mixer 6. As a result, this sound emitting and collecting device always outputs a voice whose speech speed is not converted. In this case, the controller 8 outputs the sound collection beam with the highest level without having to judge the absolute value of its level (that is, whether it is equal to or higher than the level of typical speech).
[0061]
Conversely, the controller 8 may set the voice signal processing unit 4 to always output the collected voice to the speech speed conversion unit 5. In this case, the sound emitting and collecting device always outputs speech whose speech speed has been converted.
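The three routing behaviors the controller can set for the highest-level sound collection beam (level-based routing, never convert, always convert) might be sketched as below. The mode names and threshold representation are hypothetical; the actual controller in the embodiment is realized in hardware/firmware:

```python
def route_beam(beam_level, speech_threshold, mode="auto"):
    """Decide the destination of the highest-level sound collection beam.
    mode: 'auto'       - convert only if the level looks like speech,
          'background' - never convert (background-sound-dedicated device),
          'convert'    - always convert."""
    if mode == "background":
        return "mixer"               # bypass speech speed conversion
    if mode == "convert":
        return "speed_converter"
    # 'auto': compare against the level of typical speech
    return "speed_converter" if beam_level >= speech_threshold else "mixer"

print(route_beam(0.9, 0.5))                     # speed_converter
print(route_beam(0.9, 0.5, mode="background"))  # mixer
```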
[0062]
As described above, even when one of the sound emitting and collecting apparatuses in the audio conference system is used as an apparatus dedicated to background sound output (one that does not convert the speech speed), the conferees at each point can listen to the background sound at normal speed while listening to the speaker's voice slowly. Also, in each audio conference device, the background sound is recorded at normal speed, and only the speech of the speaker is recorded with its speed converted.
[0063]
Block diagram showing the configuration of the sound emitting and collecting apparatus according to the embodiment of the present invention
Block diagram showing the configuration of the main part of the audio signal processing unit
Diagram showing the sound source detection area
Diagram showing the speech speed conversion process
Diagram showing the fast conversion processing
Diagram showing an example of calculation of the correlation value of input voice data
Diagram showing an example of configuring a voice conference system by connecting a plurality of the sound emitting and collecting devices of the above embodiment via a network
Explanation of sign
[0064]
1: Speaker, 2: Microphone, 3: Storage unit, 4: Audio signal processing unit, 5: Speech speed conversion unit, 6: Mixer, 7: Recording / reproducing unit, 8: Controller