Patent Translate Powered by EPO and Google

Notice: This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate, complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or financial decisions, should not be based on machine-translation output.

DESCRIPTION JP2012058360

Noise removal apparatus and noise removal method

[Abstract] An object of the present invention is to enable noise removal processing that is independent of the microphone spacing. A target sound emphasizing unit 105 performs target sound emphasis processing on the observation signals of microphones 101a and 101b to obtain a target sound estimation signal. A noise estimation unit 106 performs noise estimation processing on the observation signals of the microphones 101a and 101b to obtain a noise estimation signal. A post-filtering unit 109 removes the noise components remaining in the target sound estimation signal by post-filtering processing using the noise estimation signal, thereby obtaining a noise-suppressed signal. A correction coefficient calculation unit 107 calculates a correction coefficient that corrects the post-filtering processing, that is, matches the gain of the noise component remaining in the target sound estimation signal to the gain of the noise estimation signal. A correction coefficient changing unit 108 changes, among the correction coefficients calculated by the correction coefficient calculation unit 107, the coefficients of the band in which spatial aliasing occurs so as to flatten the peaks that can arise at specific frequencies. [Selected figure] Figure 1

[0001] The present invention relates to a noise removal apparatus and a noise removal method, and more particularly to a noise removal apparatus and the like that removes noise by target sound emphasis and post-filtering processing.
03-05-2019 1

[0002] For example, consider a situation in which a user listens to music played back by a mobile phone, a personal computer, or the like through noise canceling headphones. In this situation, when an incoming voice call, chat call, or the like arrives, it is very troublesome for the user to have to prepare a microphone before starting to talk. It is desirable for the user to be able to start a call as is, hands-free, without preparing a microphone.

[0003] Noise canceling microphones are installed at the ears of noise canceling headphones, and it is conceivable to make a call using these microphones. This makes it possible to hold a call with the headphones on. In this case, ambient noise is a problem, so it is desirable to suppress the noise and transmit only the voice.

[0004] For example, Patent Document 1 describes a technique for removing noise by target sound emphasis and post-filtering processing. FIG. 31 shows a configuration example of the noise removal apparatus described in Patent Document 1. In this noise removal apparatus, the beam former (11) emphasizes speech and the blocking matrix (12) emphasizes noise. Since not all of the noise disappears in the speech enhancement, the noise reduction means (13) uses the emphasized noise to reduce the remaining noise component.

[0005] Furthermore, in this noise removal apparatus, the post-filtering means (14) removes the noise that could not be eliminated. In this case, although the outputs of the noise reduction means (13) and the processing means (15) are used, spectral errors occur in the filter characteristics. Therefore, correction is performed in the adaptation unit (16).

[0006] In this case, correction is performed so that the output S1 of the noise reduction means (13) and the output S2 of the adaptation unit (16) become equal in a section where there is no target sound and only noise exists. This is expressed by the following equation (1).
In equation (1), the left side indicates the expected value of the output S2 of the adaptation unit (16), and the right side indicates the expected value of the output S1 of the noise reduction means (13) in the section where there is no target sound.

[0007] E [S2] = E [S1] (1)

[0008] By such correction, in the post-filtering means (14), there is no error between S1 and S2 in the noise-only section, so all the noise can be removed, and in the (voice + noise) section, only the noise component can be removed, leaving the voice.

[0009] This correction can be interpreted as correcting the directivity of the filter. FIG. 32 (a) shows an example of the directivity of the filter before correction, and FIG. 32 (b) shows an example of the directivity of the filter after correction. In these figures, the vertical axis shows the gain; the higher the position, the higher the gain.

[0010] In FIG. 32 (a), the solid line a indicates the directivity characteristic created by the beam former (11) to emphasize the target sound. This directivity characteristic emphasizes the target sound at the front and lowers the gain of sounds coming from other directions. Further, in FIG. 32 (a), the broken line b indicates the directivity characteristic produced by the blocking matrix (12). This directivity characteristic lowers the gain in the target sound direction in order to estimate the noise.

[0011] Before correction, there is a gain error in the direction of the noise between the directivity of the target sound emphasis (solid line a) and the directivity of the noise estimation (broken line b). Therefore, when the noise estimation signal is subtracted from the target sound estimation signal in the post-filtering means (14), the noise is either left unremoved or removed excessively.

[0012] Further, in FIG. 32 (b), the solid line a' indicates the directivity characteristic of the target sound emphasis after correction, and the broken line b' indicates the directivity of the noise estimation after correction.
By means of the correction coefficient, the gains in the direction of the noise can be matched between the directivity of the target sound emphasis and the directivity of the noise estimation. Therefore, when the noise estimation signal is subtracted from the target sound estimation signal in the post-filtering means (14), under-removal and over-removal of the noise are reduced.

[0013] JP 2009-49998 A

[0014] The noise suppression technique described in the above-mentioned Patent Document 1 has the problem that the microphone spacing is not taken into consideration. That is, with the technique described in Patent Document 1, there are cases where the correction coefficient cannot be calculated correctly, depending on the microphone spacing. If the correction coefficient is calculated incorrectly, the target sound may be distorted. When the microphone spacing is wide, spatial aliasing occurs in the directivity characteristic, which amplifies or attenuates the gain in unintended azimuths.

[0015] FIG. 33 shows an example of the directivity of the filter in the case of spatial aliasing, where the solid line a shows the directivity of the target sound emphasis produced by the beam former (11), and the broken line b shows the directivity of the noise estimation produced by the blocking matrix (12). In the directivity characteristic example shown in FIG. 33, noise is amplified simultaneously with the target sound. In this case, obtaining a correction coefficient is meaningless, and the noise suppression performance is degraded.

[0016] The noise suppression technique described in the above-mentioned Patent Document 1 is premised on the microphone spacing being known in advance and, furthermore, on it being a spacing at which spatial aliasing does not occur. This premise is a considerable limitation.
For example, at the sampling frequency of the telephone band (8000 Hz), the microphone spacing that does not cause spatial aliasing is about 4.3 cm.

[0017] In order to prevent spatial aliasing, it is necessary to set the distance between the microphones (element spacing) in advance. Here, assuming that the velocity of sound is c, the distance between the microphones (element spacing) is d, and the frequency is f, the following equation (2) needs to be satisfied in order to prevent spatial aliasing. ｄ＜ｃ／２ｆ ・・・（２）

[0018] For example, in the case of noise canceling microphones installed in noise canceling headphones, the microphone spacing is the spacing between the left and right ears. That is, in this case, a microphone spacing of about 4.3 cm, at which spatial aliasing does not occur as described above, is impossible.

[0019] Moreover, the noise suppression technique described in the above-mentioned Patent Document 1 has the problem that the number of sound sources of the ambient noise is not taken into consideration. That is, in a situation where there are innumerable noise sources in the surroundings, the ambient sound is input randomly at each frame and each frequency. In this case, the locations at which the gain should be matched between the directivity characteristic of the target sound emphasis and the directivity characteristic of the noise estimation move around at each frame and each frequency. Therefore, the correction coefficient constantly changes with time and becomes unstable, which adversely affects the output sound.

[0020] FIG. 34 shows a situation where there are innumerable noise sources in the surroundings. The solid line a indicates the directivity characteristic of the target sound emphasis, similar to the solid line a in FIG. 32 (a), and the broken line b indicates the directivity characteristic of the noise estimation, similar to the broken line b in FIG. 32 (a).
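Equation (2) and the 4.3 cm figure can be checked numerically. The following is a minimal sketch, not part of the patent; the speed of sound is assumed to be about 343 m/s, since the exact value used in the text is not stated:

```python
def max_spacing_m(fs_hz, c=343.0):
    """Largest microphone spacing d (meters) satisfying d < c / (2 f)
    for all frequencies up to the Nyquist frequency fs / 2."""
    nyquist = fs_hz / 2.0            # highest frequency after sampling
    return c / (2.0 * nyquist)       # equation (2) evaluated at f = fs / 2

def aliasing_onset_hz(d_m, c=343.0):
    """Frequency above which a given spacing d violates equation (2)."""
    return c / (2.0 * d_m)

# Telephone-band sampling (8000 Hz): about 4.3 cm, as stated above.
print(round(max_spacing_m(8000) * 100, 2), "cm")
```

For comparison, with the 20 cm spacing discussed later, aliasing_onset_hz(0.20) is about 858 Hz, so most of the telephone band would be affected.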
If there are innumerable noise sources in the surroundings, there are many locations where the gains of the two directivity characteristics should be matched. Since, in a real environment, there are innumerable noise sources in the surroundings as described above, the noise suppression technique described in the above-mentioned Patent Document 1 cannot cope with it.

[0021] An object of the present invention is to enable noise removal processing that is independent of the microphone spacing. Another object of the present invention is to enable noise removal processing in accordance with the ambient noise conditions.

[0022] The concept of the present invention includes: a target sound emphasizing unit that performs target sound emphasis processing on observation signals of a first microphone and a second microphone arranged at a predetermined interval to obtain a target sound estimation signal; a noise estimation unit that performs noise estimation processing on the observation signals of the first microphone and the second microphone to obtain a noise estimation signal; a post-filtering unit that removes noise components remaining in the target sound estimation signal obtained by the target sound emphasizing unit by post-filtering processing using the noise estimation signal obtained by the noise estimation unit; a correction coefficient calculation unit that calculates, for each frequency, a correction coefficient for correcting the post-filtering processing performed by the post-filtering unit, based on the target sound estimation signal obtained by the target sound emphasizing unit and the noise estimation signal obtained by the noise estimation unit; and a correction coefficient changing unit that changes, among the correction coefficients calculated by the correction coefficient calculation unit, the correction coefficients of the band in which spatial aliasing occurs so as to flatten the peaks that can arise at specific frequencies.
[0023] In the present invention, the target sound emphasizing unit subjects the observation signals of the first microphone and the second microphone, which are arranged at a predetermined interval, to target sound emphasis processing to obtain the target sound estimation signal. As the target sound emphasis processing, for example, conventionally known DS (Delay and Sum) processing, adaptive beam former processing, or the like is used. Further, the noise estimation unit performs noise estimation processing on the observation signals of the first microphone and the second microphone to obtain a noise estimation signal. As the noise estimation processing, for example, conventionally known NBF (Null Beam Former) processing, adaptive beam former processing, or the like is used.

[0024] The post-filtering unit removes the noise components remaining in the target sound estimation signal obtained by the target sound emphasizing unit by post-filtering processing using the noise estimation signal obtained by the noise estimation unit. As the post-filtering processing, for example, a conventionally known spectral subtraction method, MMSE-STSA method, or the like is used. Further, the correction coefficient calculation unit calculates, for each frequency, a correction coefficient for correcting the post-filtering processing performed by the post-filtering unit, based on the target sound estimation signal obtained by the target sound emphasizing unit and the noise estimation signal obtained by the noise estimation unit.

[0025] Among the correction coefficients calculated by the correction coefficient calculation unit, the correction coefficient changing unit changes the correction coefficients of the band in which spatial aliasing occurs so as to flatten the peaks that can arise at specific frequencies.
For example, in the correction coefficient changing unit, the correction coefficients calculated by the correction coefficient calculation unit are smoothed in the frequency direction in the band in which spatial aliasing occurs, and the changed correction coefficient of each frequency is obtained. Alternatively, for example, in the correction coefficient changing unit, the correction coefficient of each frequency in the band in which spatial aliasing occurs is changed to 1.

[0026] When the distance between the first microphone and the second microphone, that is, the microphone spacing, is wide, spatial aliasing occurs, and the directivity of the target sound emphasis becomes such that sounds from azimuths other than that of the target sound are also emphasized. Among the correction coefficients of each frequency calculated by the correction coefficient calculation unit, a peak can arise at a specific frequency in the band in which spatial aliasing occurs. Therefore, if these correction coefficients are used as they are, the peak formed at a specific frequency as described above adversely affects the output sound and degrades the sound quality.

[0027] In the present invention, the correction coefficients of the band in which spatial aliasing occurs are changed so as to flatten the peak that can arise at a specific frequency, so the adverse effect of this peak on the output sound can be reduced and the degradation of sound quality can be suppressed. This enables noise removal processing that is independent of the microphone spacing.
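The two changing methods described above can be sketched as follows. This is an illustrative sketch with NumPy, not the patent's implementation: the aliasing band is assumed to start at the FFT bin corresponding to f = c / 2d from equation (2), and the 5-bin smoothing window is an arbitrary assumed choice.

```python
import numpy as np

def alias_start_bin(d_m, fs_hz, n_fft, c=343.0):
    """First FFT bin index at which equation (2), d < c / (2 f), fails."""
    f_alias = c / (2.0 * d_m)
    return int(np.ceil(f_alias * n_fft / fs_hz))

def change_smooth(beta, k0, win=5):
    """Method 1: smooth the coefficients in the frequency direction
    within the aliasing band (bins >= k0) to flatten isolated peaks."""
    out = beta.astype(float).copy()
    smoothed = np.convolve(out, np.ones(win) / win, mode="same")
    out[k0:] = smoothed[k0:]
    return out

def change_to_one(beta, k0):
    """Method 2: replace the coefficients in the aliasing band by 1,
    i.e. apply no correction there."""
    out = beta.astype(float).copy()
    out[k0:] = 1.0
    return out
```

With d = 20 cm, fs = 8000 Hz and a 512-point FFT, the aliasing band begins around bin 55; with d = 2 cm it lies above the Nyquist frequency, so no coefficient is changed.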
[0028] The present invention further includes, for example, a target sound section detection unit that detects a section in which the target sound is present, based on the target sound estimation signal obtained by the target sound emphasizing unit and the noise estimation signal obtained by the noise estimation unit, and the correction coefficient calculation unit is configured to calculate the correction coefficient in a section in which there is no target sound, based on the target sound section information obtained by the target sound section detection unit. In this case, only the noise component is included in the target sound estimation signal, so the correction coefficient can be calculated accurately without being influenced by the target sound.

[0029] For example, in the target sound section detection unit, the energy ratio of the target sound estimation signal to the noise estimation signal is determined, and when the energy ratio is larger than a threshold, the section is determined to be a target sound section. Also, for example, in the correction coefficient calculation unit, the correction coefficient β (f, t) of frame t of the f-th frequency is calculated by the following equation, using the target sound estimation signal Z (f, t) and the noise estimation signal N (f, t) of frame t of the f-th frequency and the correction coefficient β (f, t-1) of frame t-1 of the f-th frequency.

[0030] Further, according to another concept of the present invention, there is provided: a target sound emphasizing unit that obtains a target sound estimation signal by subjecting observation signals of a first microphone and a second microphone arranged at a predetermined interval to target sound emphasis processing;
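The energy-ratio detection of [0029] can be sketched as follows. This is a minimal per-frame illustration; the threshold value of 2.0 is an assumed placeholder, since the patent does not specify it here.

```python
import numpy as np

def is_target_section(Z, N, threshold=2.0, eps=1e-12):
    """Return True when the frame is judged to contain the target sound.
    Z and N are the target-sound-estimate and noise-estimate spectra of
    one frame; their energy ratio is compared against a threshold."""
    ratio = np.sum(np.abs(Z) ** 2) / (np.sum(np.abs(N) ** 2) + eps)
    return bool(ratio > threshold)
```

In frames where this returns False (noise only), the target sound estimate contains only noise, which is exactly the condition under which the correction coefficient can be calculated without interference from the target sound.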
a noise estimation unit that performs noise estimation processing on the observation signals of the first microphone and the second microphone to obtain a noise estimation signal; a post-filtering unit that removes noise components remaining in the target sound estimation signal obtained by the target sound emphasizing unit by post-filtering processing using the noise estimation signal obtained by the noise estimation unit; a correction coefficient calculation unit that calculates, for each frequency, a correction coefficient for correcting the post-filtering processing performed by the post-filtering unit, based on the target sound estimation signal obtained by the target sound emphasizing unit and the noise estimation signal obtained by the noise estimation unit; an ambient noise state estimation unit that processes the observation signals of the first microphone and the second microphone to obtain sound source number information of the ambient noise; and a correction coefficient changing unit that, based on the sound source number information of the ambient noise obtained by the ambient noise state estimation unit, increases the number of smoothing frames as the number of sound sources increases and smooths the correction coefficients calculated by the correction coefficient calculation unit in the frame direction to obtain the changed correction coefficient of each frame. A noise removal apparatus comprising the above is provided.

[0031] In the present invention, the target sound emphasizing unit subjects the observation signals of the first microphone and the second microphone, which are arranged at a predetermined interval, to target sound emphasis processing to obtain the target sound estimation signal. As the target sound emphasis processing, for example, conventionally known DS (Delay and Sum) processing, adaptive beam former processing, or the like is used. Further, the noise estimation unit performs noise estimation processing on the observation signals of the first microphone and the second microphone to obtain a noise estimation signal.
As the noise estimation processing, for example, conventionally known NBF (Null Beam Former) processing, adaptive beam former processing, or the like is used.

[0032] The post-filtering unit removes the noise components remaining in the target sound estimation signal obtained by the target sound emphasizing unit by post-filtering processing using the noise estimation signal obtained by the noise estimation unit. As the post-filtering processing, for example, a conventionally known spectral subtraction method, MMSE-STSA method, or the like is used. Further, the correction coefficient calculation unit calculates, for each frequency, a correction coefficient for correcting the post-filtering processing performed by the post-filtering unit, based on the target sound estimation signal obtained by the target sound emphasizing unit and the noise estimation signal obtained by the noise estimation unit.

[0033] The ambient noise state estimation unit processes the observation signals of the first microphone and the second microphone to obtain sound source number information of the ambient noise. For example, in the ambient noise state estimation unit, the correlation coefficient of the observation signals of the first microphone and the second microphone is calculated, and the calculated correlation coefficient is used as the sound source number information of the ambient noise. Based on the sound source number information of the ambient noise obtained by the ambient noise state estimation unit, the correction coefficient changing unit increases the number of smoothing frames as the number of sound sources increases, smooths the correction coefficients calculated by the correction coefficient calculation unit in the frame direction, and obtains the changed correction coefficient of each frame.
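The source-count estimation and frame-direction smoothing described above can be sketched as follows. This is a minimal illustration; the linear mapping from correlation to smoothing-frame count (between 1 and gamma_max frames) is an assumed stand-in for the calculation function that the patent defines only in a figure.

```python
import numpy as np

def correlation_coeff(x1, x2):
    """Correlation coefficient of the two microphone observation signals.
    Values near 1 suggest few sound sources; lower values suggest many."""
    return float(np.corrcoef(x1, x2)[0, 1])

def smoothing_frames(corr, gamma_max=20):
    """Map the correlation to a smoothing-frame count gamma: the lower
    the correlation (the more sources), the more frames are averaged."""
    corr = min(max(corr, 0.0), 1.0)
    return max(1, int(round((1.0 - corr) * gamma_max)))

def smooth_in_frame_direction(beta_history, gamma):
    """Average the last gamma frames of correction coefficients.
    beta_history: array of shape (n_frames, n_freq)."""
    return beta_history[-gamma:].mean(axis=0)
```

With a single source the correlation is high, gamma stays near 1, and the correction coefficient tracks the current frame; with many sources gamma grows and the coefficient is stabilized over time.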
[0034] In a situation where there are innumerable noise sources in the surroundings, the sound from each noise source is input randomly at each frame and each frequency, and the locations at which the gain should be matched between the directivity characteristic of the target sound emphasis and the directivity characteristic of the noise estimation move around at each frame and each frequency. That is, the correction coefficient calculated by the correction coefficient calculation unit constantly changes with time and becomes unstable, which adversely affects the output sound.

[0035] In the present invention, as the number of sound sources of the ambient noise increases, the number of smoothing frames is increased, and the coefficient smoothed in the frame direction is used as the correction coefficient of each frame. This makes it possible, in a situation where there are innumerable noise sources in the surroundings, to suppress the change of the correction coefficient in the time direction and reduce the influence on the output sound. This enables noise removal processing in accordance with the ambient noise conditions (a realistic environment in which there are innumerable ambient noise sources).

[0036] According to the present invention, the correction coefficients of the band in which spatial aliasing occurs are changed so as to flatten the peak that can arise at a specific frequency, so the adverse effect of this peak on the output sound can be reduced, the degradation of sound quality can be suppressed, and noise removal processing that is independent of the microphone spacing becomes possible.
Further, according to the present invention, the number of smoothing frames is increased as the number of sound sources of the ambient noise increases, and the coefficient smoothed in the frame direction is used as the correction coefficient of each frame; in a situation where there are innumerable noise sources in the surroundings, it is thereby possible to suppress the change of the correction coefficient in the time direction and reduce the influence on the output sound, and noise removal processing in accordance with the ambient noise conditions becomes possible.

[0037] FIG. 1 is a block diagram showing a configuration example of a voice input system as a first embodiment of the present invention. The remaining drawings are: a diagram for describing the target sound emphasizing unit; a diagram for describing the noise estimation unit; a diagram for describing the post-filtering unit; a diagram for describing the correction coefficient calculation unit; a diagram showing an example of the correction coefficient for each frequency calculated by the correction coefficient calculation unit (microphone spacing d = 2 cm, no spatial aliasing); a diagram showing an example of the correction coefficient for each frequency calculated by the correction coefficient calculation unit (microphone spacing d = 20 cm, with spatial aliasing); a diagram showing noise (female speaker) present at an azimuth of 45 degrees; a diagram showing an example of the correction coefficient for each frequency calculated by the correction coefficient calculation unit (microphone spacing d = 2 cm, no spatial aliasing, number of noise sources = 2); a diagram showing an example of the correction coefficient for each frequency calculated by the correction coefficient calculation unit (microphone spacing d = 20 cm, with spatial aliasing, number of noise sources = 2);
a diagram showing noise (female speaker) present at an azimuth of 45 degrees and noise (male speaker) present at an azimuth of -30 degrees; two diagrams for describing the method (first method) of smoothing in the frequency direction in order to change the coefficients of the band in which spatial aliasing occurs so as to flatten the peak that can arise at a specific frequency; a diagram for describing the method (second method) of replacing the coefficients with 1 in order to change the coefficients of the band in which spatial aliasing occurs so as to flatten the peak that can arise at a specific frequency; a flowchart showing the procedure of the processing in the correction coefficient changing unit; a block diagram showing a configuration example of the voice input system as a second embodiment of the present invention; a bar graph showing an example of the relationship between the number of noise sound sources and the correlation coefficient corr; a diagram showing an example of the correction coefficient for each frequency calculated by the correction coefficient calculation unit (microphone spacing d = 2 cm) when noise is present at an azimuth of 45 degrees; a diagram showing noise present at an azimuth of 45 degrees; a diagram showing an example of the correction coefficient for each frequency calculated by the correction coefficient calculation unit (microphone spacing d = 2 cm) when noise is present at multiple azimuths; a diagram showing noise present at multiple azimuths;
a diagram showing that the correction coefficient calculated by the correction coefficient calculation unit changes randomly at each frame; a diagram showing an example of the smoothing-frame-count calculation function used when obtaining the smoothing frame count (gamma) based on the correlation coefficient corr (sound source number information of the ambient noise); a diagram for describing how the changed correction coefficient is obtained by smoothing the correction coefficient calculated by the correction coefficient calculation unit in the frame direction (time direction); a flowchart showing the procedure of the processing in the ambient noise state estimation unit and the correction coefficient changing unit; a block diagram showing a configuration example of the voice input system as a third embodiment of the present invention; a flowchart showing the procedure of the processing in the correction coefficient changing unit, the ambient noise state estimation unit, and the correction coefficient changing unit; a block diagram showing a configuration example of the voice input system as a fourth embodiment of the present invention; a diagram for describing the target sound section detection unit; a diagram for describing the principle of the target sound section detection unit; a block diagram showing a configuration example of a conventional noise removal apparatus; a diagram showing an example of the directivity characteristic of the target sound emphasis and the directivity characteristic of the noise estimation before and after correction in the conventional noise removal apparatus; a diagram showing an example of the directivity characteristic of the filter when spatial aliasing occurs; and a diagram showing a situation where innumerable noise sources exist in the surroundings.
[0038] Hereinafter, modes for carrying out the invention (hereinafter referred to as "embodiments") will be described. The description will be made in the following order. 1. First embodiment 2. Second embodiment 3. Third embodiment 4. Fourth embodiment 5. Modified examples

[0039] <1. First Embodiment> [Configuration Example of Voice Input System] FIG. 1 shows a configuration example of a voice input system 100 according to the first embodiment. The voice input system 100 is a system that performs voice input using the noise canceling microphones installed on the left and right headphones of noise canceling headphones.

[0040] The voice input system 100 includes microphones 101a and 101b, an A/D converter 102, a frame division unit 103, a fast Fourier transform (FFT) unit 104, a target sound emphasizing unit 105, and a noise estimation unit (target sound suppression unit) 106. The voice input system 100 further includes a correction coefficient calculation unit 107, a correction coefficient changing unit 108, a post-filtering unit 109, an inverse fast Fourier transform (IFFT) unit 110, and a waveform synthesis unit 111.

[0041] The microphones 101a and 101b collect ambient sound to obtain observation signals. The microphones 101a and 101b are arranged side by side at a predetermined interval. In this embodiment, the microphones 101a and 101b are noise canceling microphones installed on the left and right headphones of the noise canceling headphones.

[0042] The A/D converter 102 converts the observation signals obtained from the microphones 101a and 101b from analog signals to digital signals. The frame division unit 103 divides the observation signal converted into a digital signal by the A/D converter 102 into frames of a predetermined time length in order to perform processing on a frame-by-frame basis.
The fast Fourier transform unit 104 performs fast Fourier transform (FFT) processing on the framed signal obtained by the frame division unit 103 to transform it into a frequency spectrum X (f, t) in the frequency domain. Here, (f, t) indicates the frequency spectrum of frame t at the f-th frequency; that is, f indicates the frequency and t indicates the time index.

[0043] The target sound emphasizing unit 105 performs target sound emphasis processing on the observation signals of the microphones 101a and 101b, and obtains a target sound estimation signal for each frequency in each frame. As shown in FIG. 2, when the observation signal of the microphone 101a is X1 (f, t) and the observation signal of the microphone 101b is X2 (f, t), the target sound emphasizing unit 105 obtains the target sound estimation signal Z (f, t). The target sound emphasizing unit 105 uses, for example, conventionally known DS (Delay and Sum) processing or adaptive beam former processing as the target sound emphasis processing.

[0044] DS is a technique for matching the phases of the signals input to the microphones 101a and 101b to the direction of the target sound. The microphones 101a and 101b are noise canceling microphones installed on the left and right headphones of the noise canceling headphones, and the user's mouth is always at the front as viewed from the microphones 101a and 101b.

[0045] Therefore, in the case of using the DS processing, the target sound emphasizing unit 105 adds the observation signal X1 (f, t) and the observation signal X2 (f, t) based on the following equation (3) and then divides by two to obtain the target sound estimation signal Z (f, t). Z (f, t) = {X1 (f, t) + X2 (f, t)} / 2 (3)

[0046] DS is a technique called a fixed beam former, which controls the directivity by changing the phases of the input signals.
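Equation (3) can be written directly. A minimal sketch for one frame, operating on the complex FFT spectra X1 and X2:

```python
import numpy as np

def ds_target_estimate(X1, X2):
    """Delay-and-sum for a front target: with the mouth at the front,
    the target arrives in phase at both microphones, so equation (3)
    reduces to the average of the two spectra."""
    return (X1 + X2) / 2.0
```

A component common to both channels (the front target) passes unchanged, while a component of opposite sign in the two channels cancels.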
When the microphone spacing is known in advance, the target sound emphasizing unit 105 can also obtain the target sound estimation signal Z (f, t) using processing such as adaptive beam former processing instead of the DS processing described above.

[0047] Returning to FIG. 1, the noise estimation unit (target sound suppression unit) 106 performs noise estimation processing on the observation signals of the microphones 101a and 101b, and obtains a noise estimation signal for each frequency in each frame. The noise estimation unit 106 estimates sounds other than the target sound (the user's voice) as noise. That is, the noise estimation unit 106 performs processing for removing only the target sound and leaving the noise.

[0048] As shown in FIG. 3, when the observation signal of the microphone 101a is X1 (f, t) and the observation signal of the microphone 101b is X2 (f, t), the noise estimation unit 106 obtains the noise estimation signal N (f, t). The noise estimation unit 106 uses, for example, conventionally known NBF (Null Beam Former) processing, adaptive beam former processing, or the like as the noise estimation processing.

[0049] As described above, the microphones 101a and 101b are noise canceling microphones installed on the left and right headphones of the noise canceling headphones, and the user's mouth is always at the front as viewed from the microphones 101a and 101b. Therefore, when using the NBF processing, the noise estimation unit 106 subtracts the observation signal X2 (f, t) from the observation signal X1 (f, t) based on the following equation (4) and then divides by two to obtain the noise estimation signal N (f, t). N (f, t) = {X1 (f, t) - X2 (f, t)} / 2 (4)

[0050] NBF is a technique called a fixed beam former, which controls the directivity by changing the phases of the input signals.
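Equation (4) is the complementary fixed beam former; a minimal sketch:

```python
import numpy as np

def nbf_noise_estimate(X1, X2):
    """Null beam former for a front target: the in-phase (front) target
    component cancels in the difference, leaving an estimate of the
    off-axis noise, as in equation (4)."""
    return (X1 - X2) / 2.0
```

The front target, being identical in both channels, cancels exactly, which is how the unit removes only the target sound and leaves the noise.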
When the microphone interval is known in advance, the noise estimation unit 106 can also obtain the noise estimation signal N(f, t) using processing such as adaptive beamformer processing instead of the NBF processing described above. [0051] Returning to FIG. 1, the post-filtering unit 109 removes the noise components remaining in the target sound estimation signal Z(f, t) obtained by the target sound enhancement unit 105 by post-filtering processing using the noise estimation signal N(f, t) obtained by the noise estimation unit 106. That is, as shown in FIG. 4, the post-filtering unit 109 obtains the noise suppression signal Y(f, t) based on the target sound estimation signal Z(f, t) and the noise estimation signal N(f, t). [0052] The post-filtering unit 109 obtains the noise suppression signal Y(f, t) using a known technique such as the spectral subtraction method or the MMSE-STSA method. The spectral subtraction method is described, for example, in S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113-120, 1979. The MMSE-STSA method is described in Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109-1121, 1984. [0053] Referring back to FIG. 1, the correction coefficient calculation unit 107 calculates the correction coefficient β(f, t) for each frequency in each frame. The correction coefficient β(f, t) is used to correct the post-filtering processing performed by the above-described post-filtering unit 109, that is, to match the gain of the noise component remaining in the target sound estimation signal Z(f, t) with the gain of the noise estimation signal N(f, t). As shown in FIG.
5, the correction coefficient calculation unit 107 calculates the correction coefficient β(f, t) for each frequency in each frame based on the target sound estimation signal Z(f, t) obtained by the target sound enhancement unit 105 and the noise estimation signal N(f, t) obtained by the noise estimation unit 106. [0054] In this embodiment, the correction coefficient calculation unit 107 calculates the correction coefficient β(f, t) based on the following equation (5). [0055] Because a coefficient computed from the current frame alone varies from frame to frame, the correction coefficient calculation unit 107 obtains a stable correction coefficient β(f, t) by smoothing with the correction coefficient β(f, t-1) of the previous frame. The first term on the right side of equation (5) carries over the correction coefficient β(f, t-1) of the previous frame, and the second term on the right side of equation (5) computes the coefficient of the current frame. Note that α is a smoothing coefficient, a fixed value such as 0.9 or 0.95, which places the weight on the previous frame. [0056] When the noise suppression signal Y(f, t) is obtained using the known spectral subtraction method, the above-described post-filtering unit 109 uses the correction coefficient β(f, t) as in the following equation (6). In this case, the post-filtering unit 109 corrects the noise estimation signal N(f, t) by multiplying it by the correction coefficient β(f, t). In equation (6), when the correction coefficient β(f, t) = 1, no correction is performed.
Y(f, t) = Z(f, t) - β(f, t) * N(f, t) (6) [0057] The correction coefficient changing unit 108 changes, among the correction coefficients β(f, t) calculated by the correction coefficient calculation unit 107 in each frame, the coefficients of the band causing spatial aliasing so as to crush the peaks that can arise at specific frequencies. In practice, the post-filtering unit 109 uses not the correction coefficient β(f, t) itself calculated by the correction coefficient calculation unit 107 but the correction coefficient β′(f, t) after this change. [0058] As described above, when the microphone interval is wide, the directivity curve folds back and causes spatial aliasing, so that the directivity characteristic of the target sound emphasis also emphasizes sounds from azimuths other than that of the target sound. Among the correction coefficients of each frequency calculated by the correction coefficient calculation unit 107, a peak can therefore be generated at a specific frequency in a band causing spatial aliasing. If this correction coefficient is used as it is, the peak formed at the specific frequency adversely affects the output sound and degrades the sound quality. [0059] FIG. 6 and FIG. 7 show examples of the correction coefficient when noise (a female speaker) exists at the 45° azimuth as shown in FIG. FIG. 6 shows the case where the microphone spacing d is 2 cm and there is no spatial aliasing. On the other hand, FIG. 7 shows the case where the microphone spacing d is 20 cm and there is spatial aliasing, and a peak is generated at a specific frequency. [0060] The examples of the correction coefficients in FIGS. 6 and 7 described above show the case where there is one noise source. In a real environment, however, the noise source is not single. FIGS. 9 and 10 show examples of the correction coefficient when, as shown in FIG. 11, noise (a female speaker) is present at an azimuth of 45° and noise (a male speaker) is present at an azimuth of -30°. [0061] FIG.
9 shows the case where the microphone spacing d is 2 cm and there is no spatial aliasing. On the other hand, FIG. 10 shows the case where the microphone spacing d is 20 cm, there is spatial aliasing, and a peak is generated at a specific frequency. In this case, although the peaks of the coefficient are more complicated than in the case of one noise source (see FIG. 7), there is a frequency at which the value of the coefficient falls, as in the case of one noise source. [0062] The correction coefficient changing unit 108 checks the correction coefficients β(f, t) calculated by the correction coefficient calculation unit 107 and finds the first frequency Fa(t) on the low-frequency side where the value of the coefficient falls. As shown in FIG. 7 and FIG. 10, the correction coefficient changing unit 108 determines that spatial aliasing occurs in the band of Fa(t) or higher. Then, as described above, the correction coefficient changing unit 108 changes, among the correction coefficients β(f, t) calculated by the correction coefficient calculation unit 107, the coefficients of the band causing spatial aliasing so as to crush the peaks that can arise at specific frequencies. [0063] The correction coefficient changing unit 108 changes the correction coefficients of the band causing spatial aliasing by using, for example, a first method or a second method. When the first method is used, the correction coefficient changing unit 108 obtains the changed correction coefficient β′(f, t) of each frequency as follows. As shown in FIGS. 12 and 13, the correction coefficient changing unit 108 smooths, among the correction coefficients β(f, t) calculated by the correction coefficient calculation unit 107, the correction coefficients of the band causing spatial aliasing in the frequency direction to obtain the changed correction coefficient β′(f, t) for each frequency.
[0064] By smoothing in the frequency direction in this way, it is possible to crush the coefficient peaks that appear excessively. The section length of the smoothing can be set arbitrarily; in FIG. 12, the arrow is drawn short to indicate that the section length is set short, and in FIG. 13, the arrow is drawn long to indicate that the section length is set long. [0065] On the other hand, when the second method is used, the correction coefficient changing unit 108 replaces, among the correction coefficients β(f, t) calculated by the correction coefficient calculation unit 107, the correction coefficients of the band causing spatial aliasing by 1 to obtain the changed correction coefficient β′(f, t), as shown in FIG. 14. Note that FIG. 14 uses logarithmic notation, so the value appears as 0 rather than 1. The second method exploits the fact that the correction coefficient approaches 1 when the smoothing in the first method is made extremely long. This second method has the advantage that the smoothing operation can be omitted. [0066] The flowchart of FIG. 15 shows the procedure of the processing (for one frame) in the correction coefficient changing unit 108. The correction coefficient changing unit 108 starts the processing in step ST1 and then proceeds to the processing of step ST2. In step ST2, the correction coefficient changing unit 108 acquires the correction coefficients β(f, t) from the correction coefficient calculation unit 107. Then, in step ST3, the correction coefficient changing unit 108 searches the coefficient of each frequency f from the low band in the current frame t, and the first frequency Fa(t) on the low-band side where the coefficient value drops is located. [0067] Next, in step ST4, the correction coefficient changing unit 108 checks a flag indicating whether or not to smooth the band higher than Fa(t), that is, the band causing spatial aliasing.
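The steps of FIG. 15 can be sketched as follows. This is a hedged illustration: the interpretation of "where the coefficient value drops" as the first bin lower than its predecessor, and the moving-average section length `win`, are assumptions not fixed by the source.

```python
import numpy as np

def change_correction_coefficient(beta, smooth_flag=True, win=5):
    """Sketch of the correction coefficient changing unit 108 (FIG. 15).

    beta: correction coefficients beta(f, t) of one frame over frequency.
    Fa(t) is located as the first frequency bin, searched from the low
    band, where the coefficient value drops (assumed: first bin whose
    value falls below its predecessor). The band of Fa(t) or higher is
    treated as causing spatial aliasing: with the flag ON (first
    method), it is smoothed in the frequency direction by a moving
    average of section length `win`; with the flag OFF (second method),
    it is replaced by 1.
    """
    beta = np.asarray(beta, dtype=float)
    drops = np.where(np.diff(beta) < 0)[0]
    fa = int(drops[0] + 1) if drops.size else beta.size  # no drop: no aliasing band
    out = beta.copy()
    if fa < beta.size:
        if smooth_flag:
            kernel = np.ones(win) / win
            out[fa:] = np.convolve(beta, kernel, mode="same")[fa:]
        else:
            out[fa:] = 1.0
    return out
```

With a monotonically rising coefficient curve no drop is found, so no band is changed; with the flag off, everything above the first drop is clamped to 1.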
Note that this flag is set in advance by a user operation. When the flag is on (ON), the correction coefficient changing unit 108 smooths, in step ST5, among the correction coefficients β(f, t) calculated by the correction coefficient calculation unit 107, the coefficients in the band of Fa(t) or higher in the frequency direction to obtain the changed correction coefficient β′(f, t) for each frequency f. After the processing of step ST5, the correction coefficient changing unit 108 ends the processing in step ST6. [0068] On the other hand, when the flag is off (OFF) in step ST4, the correction coefficient changing unit 108 replaces, in step ST7, among the correction coefficients β(f, t) calculated by the correction coefficient calculation unit 107, the correction coefficients of the band of Fa(t) or higher with "1" to obtain the changed correction coefficient β′(f, t). After the processing of step ST7, the correction coefficient changing unit 108 ends the processing in step ST6. [0069] Returning to FIG. 1, the inverse fast Fourier transform (IFFT) unit 110 performs inverse fast Fourier transform processing on the noise suppression signal Y(f, t) output from the post-filtering unit 109 for each frame. The inverse fast Fourier transform unit 110 performs processing reverse to that of the above-described fast Fourier transform unit 104, converting the frequency-domain signal into a time-domain signal to obtain a framed signal. [0070] The waveform synthesis unit 111 synthesizes the framed signals of the respective frames obtained by the inverse fast Fourier transform unit 110 and restores them into a time-series continuous speech signal. The waveform synthesis unit 111 constitutes a frame synthesis unit. The waveform synthesis unit 111 outputs the noise-suppressed speech signal SAout as the output of the speech input system 100. [0071] The operation of the voice input system 100 shown in FIG. 1 will be briefly described.
Ambient sound is collected by the microphones 101a and 101b arranged at a predetermined interval, and observation signals are obtained. The observation signals obtained by the microphones 101a and 101b are converted from analog signals to digital signals by the A/D converter 102 and then supplied to the frame division unit 103. Then, in the frame division unit 103, the observation signals from the microphones 101a and 101b are divided into frames of a predetermined time length and framed. [0072] The framed signals of each frame obtained by the frame division unit 103 are sequentially supplied to the fast Fourier transform unit 104. The fast Fourier transform unit 104 performs fast Fourier transform (FFT) processing on the framed signals, and the observation signal X1(f, t) of the microphone 101a and the observation signal X2(f, t) of the microphone 101b are obtained as signals in the frequency domain. [0073] The observation signals X1(f, t) and X2(f, t) obtained by the fast Fourier transform unit 104 are supplied to the target sound emphasis unit 105. The target sound emphasizing unit 105 subjects the observation signals X1(f, t) and X2(f, t) to conventionally known DS processing, adaptive beamformer processing, or the like to obtain the target sound estimation signal Z(f, t). For example, when DS processing is used, the observation signal X1(f, t) and the observation signal X2(f, t) are added and then divided by 2 to obtain the target sound estimation signal Z(f, t) (see equation (3)). [0074] Also, the observation signals X1(f, t) and X2(f, t) obtained by the fast Fourier transform unit 104 are supplied to the noise estimation unit 106. The noise estimation unit 106 subjects the observation signals X1(f, t) and X2(f, t) to conventionally known NBF processing, adaptive beamformer processing, or the like to obtain the noise estimation signal N(f, t).
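The surrounding analysis/synthesis chain (frame division unit 103, FFT unit 104, per-frame spectral processing, IFFT unit 110, waveform synthesis unit 111) can be sketched as an overlap-add loop. The Hann window and 50% overlap here are assumptions; the source specifies only fixed-length framing.

```python
import numpy as np

def process_frames(x, per_frame=lambda spec: spec, frame_len=512, hop=256):
    """Frame division -> FFT -> per-frame processing -> IFFT -> overlap-add.

    `per_frame` stands in for the spectral processing between FFT and
    IFFT (target sound emphasis, noise estimation, post-filtering);
    the identity default just reconstructs the input. A periodic Hann
    window with 50% overlap sums to 1, so overlap-add reconstructs the
    interior of the signal exactly.
    """
    win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
    out = np.zeros(len(x))
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * win          # frame division 103
        spec = np.fft.rfft(frame)                         # FFT unit 104
        spec = per_frame(spec)                            # spectral processing
        out[start:start + frame_len] += np.fft.irfft(spec, frame_len)  # IFFT 110 + synthesis 111
    return out
```

With the identity `per_frame`, a constant input is reproduced exactly in the region covered by two overlapping windows.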
For example, when NBF processing is used, the observation signal X2(f, t) is subtracted from the observation signal X1(f, t) and the result is divided by 2 to obtain the noise estimation signal N(f, t) (see equation (4)). [0075] The target sound estimation signal Z(f, t) obtained by the target sound enhancement unit 105 and the noise estimation signal N(f, t) obtained by the noise estimation unit 106 are supplied to the correction coefficient calculation unit 107. In the correction coefficient calculation unit 107, the correction coefficients β(f, t) for correcting the post-filtering processing are calculated for each frequency in each frame based on the target sound estimation signal Z(f, t) and the noise estimation signal N(f, t) (see equation (5)). [0076] The correction coefficient β(f, t) calculated by the correction coefficient calculation unit 107 is supplied to the correction coefficient changing unit 108. The correction coefficient changing unit 108 changes, among the correction coefficients β(f, t) calculated by the correction coefficient calculation unit 107, the coefficients of the band causing spatial aliasing so as to crush the peaks that can arise at specific frequencies, and thus the changed correction coefficient β′(f, t) is obtained. [0077] In this correction coefficient changing unit 108, the correction coefficients β(f, t) calculated by the correction coefficient calculation unit 107 are checked to find the first frequency Fa(t) on the low-frequency side where the coefficient value drops, and it is determined that spatial aliasing occurs in the band higher than Fa(t). Then, in the correction coefficient changing unit 108, among the correction coefficients β(f, t) calculated by the correction coefficient calculation unit 107, the coefficients of the band higher than Fa(t) are changed so as to crush the peaks that can arise at specific frequencies.
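The recursive update of the correction coefficient (cf. equation (5), which is reproduced only as an image in this translation) can be sketched as below. The current-frame ratio |Z|/|N| is an assumption consistent with the stated goal of matching the gain of the residual noise in Z(f, t) to the gain of N(f, t); only the smoothing structure (previous-frame term weighted by α) is confirmed by the text.

```python
import numpy as np

def update_correction_coefficient(beta_prev, z, n, alpha=0.9, eps=1e-12):
    """One-frame update of beta(f, t), sketching equation (5).

    beta_prev: beta(f, t-1) of the previous frame.
    z, n: complex spectra Z(f, t), N(f, t) of the current frame.
    alpha: smoothing coefficient (e.g. 0.9 or 0.95), weighting the
    previous frame. The per-frequency gain ratio |Z|/|N| used for the
    current-frame term is an assumption (eps guards against division
    by zero).
    """
    current = np.abs(z) / (np.abs(n) + eps)
    return alpha * beta_prev + (1.0 - alpha) * current
```

With α = 1 the coefficient is simply carried over from the previous frame; smaller α lets the current frame contribute more.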
[0078] For example, among the correction coefficients β(f, t) calculated by the correction coefficient calculation unit 107, the correction coefficients in the band of Fa(t) or higher are smoothed in the frequency direction, and the changed correction coefficient β′(f, t) is obtained for each frequency (see FIGS. 12 and 13). Alternatively, for example, among the correction coefficients β(f, t) calculated by the correction coefficient calculation unit 107, the correction coefficients in the band of Fa(t) or higher are replaced by 1, and the changed correction coefficient β′(f, t) is obtained (see FIG. 14). [0079] The target sound estimation signal Z(f, t) obtained by the target sound enhancement unit 105 and the noise estimation signal N(f, t) obtained by the noise estimation unit 106 are supplied to the post-filtering unit 109. Further, the correction coefficient β′(f, t) changed by the correction coefficient changing unit 108 is supplied to the post-filtering unit 109. In this post-filtering unit 109, the noise components remaining in the target sound estimation signal Z(f, t) are removed by post-filtering processing using the noise estimation signal N(f, t). The correction coefficient β′(f, t) is used to correct this post-filtering processing, that is, to match the gain of the noise component remaining in the target sound estimation signal Z(f, t) with the gain of the noise estimation signal N(f, t). [0080] In this post-filtering unit 109, for example, a known technique such as the spectral subtraction method or the MMSE-STSA method is used to obtain the noise suppression signal Y(f, t). For example, when the spectral subtraction method is used, the noise suppression signal Y(f, t) is obtained based on the following equation (7). Y(f, t) = Z(f, t) - β′(f, t) * N(f, t) (7) [0081] The noise suppression signal Y(f, t) of each frequency output for each frame from the post-filtering unit 109 is supplied to the inverse fast Fourier transform unit 110.
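The corrected spectral subtraction of equation (7) can be sketched as follows. Applying the subtraction to magnitude spectra, flooring the result at zero (half-wave rectification), and retaining the phase of Z(f, t) are common-practice assumptions not spelled out in the source.

```python
import numpy as np

def post_filter(z, n, beta):
    """Post-filtering per equation (7): Y(f,t) = Z(f,t) - beta'(f,t) * N(f,t).

    z: target sound estimation signal Z(f, t) (complex spectrum).
    n: noise estimation signal N(f, t) (complex spectrum).
    beta: changed correction coefficient beta'(f, t), scalar or per-bin.
    The magnitude of the scaled noise estimate is subtracted from the
    magnitude of Z, floored at zero, and Z's phase is reapplied.
    """
    mag = np.maximum(np.abs(z) - np.asarray(beta) * np.abs(n), 0.0)
    return mag * np.exp(1j * np.angle(z))
```

With β′ = 1 this is plain spectral subtraction; over-subtraction is clipped to zero rather than producing negative magnitudes.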
The inverse fast Fourier transform unit 110 performs inverse fast Fourier transform processing on the noise suppression signal Y(f, t) of each frequency for each frame to obtain a framed signal converted into a time-domain signal. The framed signals of each frame are sequentially supplied to the waveform synthesis unit 111. The waveform synthesis unit 111 synthesizes the framed signals of the respective frames to obtain the noise-suppressed speech signal SAout, continuous in time series, as the output of the speech input system 100. [0082] As described above, in the voice input system 100 shown in FIG. 1, the correction coefficient β(f, t) calculated by the correction coefficient calculation unit 107 is changed by the correction coefficient changing unit 108. In this case, among the correction coefficients β(f, t) calculated by the correction coefficient calculation unit 107, the coefficients of the band causing spatial aliasing (the band higher than Fa(t)) are changed so as to crush the peaks that can arise at specific frequencies, and the changed correction coefficient β′(f, t) is obtained. The post-filtering unit 109 uses this changed correction coefficient β′(f, t). [0083] Therefore, it is possible to reduce the adverse effect that the coefficient peaks arising at specific frequencies of the band causing spatial aliasing have on the output sound, and to suppress the deterioration of the sound quality. This enables noise removal processing independent of the microphone spacing. Accordingly, even though the microphones 101a and 101b are noise cancellation microphones installed in the headphones and the microphone spacing is wide, the noise can be corrected efficiently, and noise removal processing with little distortion is performed. [0084] <2. Second Embodiment> [Configuration Example of Voice Input System] FIG. 16 shows a configuration example of a voice input system 100A according to a second embodiment.
Similarly to the voice input system 100 shown in FIG. 1 described above, the voice input system 100A is also a system that performs voice input using the noise canceling microphones installed in the left and right headphones of the noise canceling headphone. In FIG. 16, parts corresponding to FIG. 1 are given the same reference numerals, and detailed description thereof will be omitted as appropriate. [0085] The voice input system 100A includes the microphones 101a and 101b, the A/D converter 102, the frame division unit 103, the fast Fourier transform (FFT) unit 104, the target sound enhancement unit 105, and the noise estimation unit 106. In addition, the voice input system 100A includes the correction coefficient calculation unit 107, the post-filtering unit 109, the inverse fast Fourier transform (IFFT) unit 110, the waveform synthesis unit 111, an ambient noise state estimation unit 112, and a correction coefficient changing unit 113. [0086] The ambient noise state estimation unit 112 processes the observation signals of the microphones 101a and 101b to obtain information on the number of sources of ambient noise. The ambient noise state estimation unit 112 calculates, for each frame, the correlation coefficient corr of the observation signal of the microphone 101a and the observation signal of the microphone 101b based on the following equation (8), and uses it as the source-count information of the ambient noise. In equation (8), x1(n) indicates the time-axis data of the microphone 101a, x2(n) indicates the time-axis data of the microphone 101b, and N indicates the number of samples. [0087] [0088] The bar graph in FIG. 17 shows an example of the relationship between the number of noise sources and the correlation coefficient corr. Generally, as the number of sound sources increases, the correlation between the observation signals of the microphones 101a and 101b decreases.
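Equation (8) is reproduced only as an image in this translation; the normalized cross-correlation below is the standard form consistent with the description (sum of products over the N samples of the frame, normalized by the signal energies; no mean removal is assumed).

```python
import numpy as np

def correlation_coefficient(x1, x2):
    """Correlation coefficient corr of the two microphone frames (eq. (8)).

    x1, x2: time-axis data x1(n), x2(n) of microphones 101a and 101b
    for one frame. corr is 1 for identical frames and tends toward 0
    as the number of independent ambient noise sources grows, which is
    why it serves as source-count information.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    denom = np.sqrt(np.sum(x1 * x1) * np.sum(x2 * x2))
    return float(np.sum(x1 * x2) / denom) if denom else 0.0
```

Identical frames give corr = 1; orthogonal frames give corr = 0.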
Theoretically, the correlation coefficient corr approaches 0 as the number of sound sources increases. Therefore, the number of sources of ambient noise can be estimated from the correlation coefficient corr. [0089] Referring back to FIG. 16, the correction coefficient changing unit 113 changes, in each frame, the correction coefficient β(f, t) calculated by the correction coefficient calculation unit 107 based on the correlation coefficient corr (source-count information of the ambient noise) obtained by the ambient noise state estimation unit 112. That is, the correction coefficient changing unit 113 increases the number of smoothed frames as the number of sound sources increases, smooths the coefficients calculated by the correction coefficient calculation unit 107 in the frame direction, and obtains the changed correction coefficient β′(f, t). In practice, the post-filtering unit 109 uses not the correction coefficient β(f, t) itself calculated by the correction coefficient calculation unit 107 but the correction coefficient β′(f, t) after this change. [0090] FIG. 18 shows an example of the correction coefficient (microphone spacing d of 2 cm) when noise is present at the 45° azimuth as shown in FIG. On the other hand, FIG. 20 shows an example of the correction coefficient (microphone spacing d of 2 cm) when noise exists at a plurality of azimuths as shown in FIG. 21. As described above, even if the microphone spacing is a proper spacing that does not cause spatial aliasing, the correction coefficient becomes unstable as the number of noise sources increases. As a result, as shown in FIG. 22, the correction coefficient changes randomly from frame to frame. If this correction coefficient is used as it is, it adversely affects the output sound and degrades the sound quality.
[0091] The correction coefficient changing unit 113 calculates, in each frame, the number of smoothed frames γ based on the correlation coefficient corr (source-count information of the ambient noise) obtained by the ambient noise state estimation unit 112. The correction coefficient changing unit 113 obtains the number of smoothed frames γ using, for example, a smoothed-frame-number calculation function as shown in FIG. 23. In this case, when the correlation of the observation signals of the microphones 101a and 101b is large, that is, when the value of the correlation coefficient corr is large, the number of smoothed frames γ is determined to be small. [0092] On the other hand, when the correlation between the observation signals of the microphones 101a and 101b is small, that is, when the value of the correlation coefficient corr is small, the number of smoothed frames γ is determined to be large. Note that the correction coefficient changing unit 113 does not need to actually perform arithmetic processing; the number of smoothed frames γ may instead be read out according to the correlation coefficient corr from a table storing the correspondence between the correlation coefficient corr and the number of smoothed frames γ. [0093] The correction coefficient changing unit 113 smooths the correction coefficient β(f, t) calculated by the correction coefficient calculation unit 107 in each frame in the frame direction (time direction) as shown in FIG. 24 to obtain the changed correction coefficient β′(f, t) of each frame. In this case, smoothing is performed with the number of smoothed frames γ obtained as described above. The correction coefficient β′(f, t) of each frame changed in this manner varies gently in the frame direction (time direction). [0094] The flowchart in FIG. 25 shows the procedure of the processing (for one frame) in the ambient noise state estimation unit 112 and the correction coefficient changing unit 113.
Each unit starts the processing in step ST11. Thereafter, in step ST12, the ambient noise state estimation unit 112 acquires the data frames x1(t) and x2(t) of the observation signals of the microphones 101a and 101b. Then, in step ST13, the ambient noise state estimation unit 112 calculates the correlation coefficient corr(t) indicating the degree of correlation of the observation signals of the microphones 101a and 101b (see equation (8)). [0095] Next, in step ST14, the correction coefficient changing unit 113 calculates the number of smoothed frames γ from the value of the correlation coefficient corr(t) calculated by the ambient noise state estimation unit 112 in step ST13, using the smoothed-frame-number calculation function (see FIG. 23). Then, in step ST15, the correction coefficient changing unit 113 changes the correction coefficient β(f, t) calculated by the correction coefficient calculation unit 107 by smoothing with the number of smoothed frames γ calculated in step ST14, and the changed correction coefficient β′(f, t) is obtained. After the processing of step ST15, each unit ends the processing in step ST16. [0096] Although detailed description is omitted, the rest of the voice input system 100A shown in FIG. 16 is configured in the same manner as the voice input system 100 shown in FIG. 1. [0097] The operation of the voice input system 100A shown in FIG. 16 will be briefly described. Ambient sound is collected by the microphones 101a and 101b arranged at a predetermined interval, and observation signals are obtained. The observation signals obtained by the microphones 101a and 101b are converted from analog signals to digital signals by the A/D converter 102 and then supplied to the frame division unit 103. Then, in the frame division unit 103, the observation signals from the microphones 101a and 101b are divided into frames of a predetermined time length and framed.
[0098] The framed signals of each frame obtained by the frame division unit 103 are sequentially supplied to the fast Fourier transform unit 104. The fast Fourier transform unit 104 performs fast Fourier transform (FFT) processing on the framed signals, and the observation signal X1(f, t) of the microphone 101a and the observation signal X2(f, t) of the microphone 101b are obtained as signals in the frequency domain. [0099] The observation signals X1(f, t) and X2(f, t) obtained by the fast Fourier transform unit 104 are supplied to the target sound emphasis unit 105. The target sound emphasizing unit 105 subjects the observation signals X1(f, t) and X2(f, t) to conventionally known DS processing, adaptive beamformer processing, or the like to obtain the target sound estimation signal Z(f, t). For example, when DS processing is used, the observation signal X1(f, t) and the observation signal X2(f, t) are added and then divided by 2 to obtain the target sound estimation signal Z(f, t) (see equation (3)). [0100] Also, the observation signals X1(f, t) and X2(f, t) obtained by the fast Fourier transform unit 104 are supplied to the noise estimation unit 106. The noise estimation unit 106 subjects the observation signals X1(f, t) and X2(f, t) to conventionally known NBF processing, adaptive beamformer processing, or the like to obtain the noise estimation signal N(f, t). For example, when NBF processing is used, the observation signal X2(f, t) is subtracted from the observation signal X1(f, t) and the result is divided by 2 to obtain the noise estimation signal N(f, t) (see equation (4)). [0101] The target sound estimation signal Z(f, t) obtained by the target sound enhancement unit 105 and the noise estimation signal N(f, t) obtained by the noise estimation unit 106 are supplied to the correction coefficient calculation unit 107.
In the correction coefficient calculation unit 107, the correction coefficients β(f, t) for correcting the post-filtering processing are calculated for each frequency in each frame based on the target sound estimation signal Z(f, t) and the noise estimation signal N(f, t) (see equation (5)). [0102] Also, the framed signals of each frame obtained by the frame division unit 103, that is, the observation signals x1(n) and x2(n) of the microphones 101a and 101b, are supplied to the ambient noise state estimation unit 112. In the ambient noise state estimation unit 112, the correlation coefficient corr of the observation signals x1(n) and x2(n) of the microphones 101a and 101b is determined and used as the source-count information of the ambient noise (see equation (8)). [0103] The correction coefficient β(f, t) calculated by the correction coefficient calculation unit 107 is supplied to the correction coefficient changing unit 113. Further, the correlation coefficient corr obtained by the ambient noise state estimation unit 112 is also supplied to the correction coefficient changing unit 113. The correction coefficient changing unit 113 changes, in each frame, the correction coefficient β(f, t) calculated by the correction coefficient calculation unit 107 based on the correlation coefficient corr (source-count information of the ambient noise) obtained by the ambient noise state estimation unit 112. [0104] First, the correction coefficient changing unit 113 obtains the number of smoothed frames γ based on the correlation coefficient corr. In this case, the number of smoothed frames γ is small when the value of the correlation coefficient corr is large, and large when the value of the correlation coefficient corr is small (see FIG. 23).
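The γ mapping and the frame-direction smoothing can be sketched as follows. The linear mapping and the bounds `gamma_min`/`gamma_max` are assumptions: the source shows only the shape of the function in FIG. 23 (large corr → small γ, small corr → large γ), and averaging the most recent γ frames is one simple realization of the smoothing of FIG. 24.

```python
import numpy as np

def smoothing_frames_from_corr(corr, gamma_min=1, gamma_max=20):
    """Map corr to the number of smoothed frames gamma (cf. FIG. 23).

    Large corr (few ambient sources) -> small gamma; small corr (many
    sources) -> large gamma. Linear interpolation between assumed
    bounds; a lookup table, as the source notes, would do equally well.
    """
    corr = min(max(abs(corr), 0.0), 1.0)
    return int(round(gamma_max - (gamma_max - gamma_min) * corr))

def smooth_in_frame_direction(beta_history, gamma):
    """Changed coefficient beta'(f, t): mean of the last gamma frames of
    beta(f, t), i.e. smoothing in the frame (time) direction (FIG. 24)."""
    recent = np.asarray(beta_history[-gamma:], dtype=float)
    return recent.mean(axis=0)
```

With corr = 1 a single frame is used (no smoothing); with corr = 0 the maximum history length is averaged, so β′ varies gently over time.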
Next, in the correction coefficient changing unit 113, the correction coefficients β(f, t) calculated by the correction coefficient calculation unit 107 are smoothed in the frame direction (time direction) with the number of smoothed frames γ, and the changed correction coefficient β′(f, t) of each frame is obtained (see FIG. 24). [0105] The target sound estimation signal Z(f, t) obtained by the target sound enhancement unit 105 and the noise estimation signal N(f, t) obtained by the noise estimation unit 106 are supplied to the post-filtering unit 109. Further, the correction coefficient β′(f, t) changed by the correction coefficient changing unit 113 is supplied to the post-filtering unit 109. In this post-filtering unit 109, the noise components remaining in the target sound estimation signal Z(f, t) are removed by post-filtering processing using the noise estimation signal N(f, t). The correction coefficient β′(f, t) is used to correct this post-filtering processing, that is, to match the gain of the noise component remaining in the target sound estimation signal Z(f, t) with the gain of the noise estimation signal N(f, t). [0106] In this post-filtering unit 109, for example, a known technique such as the spectral subtraction method or the MMSE-STSA method is used to obtain the noise suppression signal Y(f, t). For example, when the spectral subtraction method is used, the noise suppression signal Y(f, t) is obtained based on the following equation (9). Y(f, t) = Z(f, t) - β′(f, t) * N(f, t) (9) [0107] The noise suppression signal Y(f, t) of each frequency output for each frame from the post-filtering unit 109 is supplied to the inverse fast Fourier transform unit 110. The inverse fast Fourier transform unit 110 performs inverse fast Fourier transform processing on the noise suppression signal Y(f, t) of each frequency for each frame to obtain a framed signal converted into a time-domain signal.
The framed signals of each frame are sequentially supplied to the waveform synthesis unit 111, which synthesizes them to obtain a noise-suppressed speech signal SAout, continuous in time series, as the output of the voice input system 100A.

[0108] As described above, in the voice input system 100A shown in FIG. 16, the correction coefficient β(f, t) calculated by the correction coefficient calculation unit 107 is changed by the correction coefficient changing unit 113. In this case, the ambient noise state estimation unit 112 obtains the correlation coefficient corr of the observation signals x1(n) and x2(n) of the microphones 101a and 101b as the sound source number information of the ambient noise. Based on this information, the correction coefficient changing unit 113 obtains the number of smoothing frames γ so that it increases as the number of sound sources increases, and smooths the correction coefficient β(f, t) in the frame direction to obtain the changed correction coefficient β′(f, t) of each frame. The post-filtering unit 109 uses this changed correction coefficient β′(f, t).

[0109] Therefore, in a situation where there are innumerable noise sources in the surroundings, it is possible to suppress the change of the correction coefficient in the frame direction (time direction) and reduce the influence on the output sound. This makes it possible to perform noise removal processing in accordance with the ambient noise conditions. Thus, even when the microphones 101a and 101b are noise canceling microphones installed in headphones and there are many noise sources in the surroundings, the noise can be corrected efficiently, and good noise removal processing with little distortion is performed.

[0110] <3. Third Embodiment> [Configuration Example of Voice Input System] FIG.
26 shows a configuration example of a voice input system 100B according to a third embodiment. Like the voice input systems 100 and 100A shown in FIGS. 1 and 16 described above, the voice input system 100B performs voice input using the noise canceling microphones installed on the left and right earpieces of noise canceling headphones. In FIG. 26, parts corresponding to those in FIGS. 1 and 16 are assigned the same reference numerals, and detailed description thereof is omitted as appropriate.

[0111] The voice input system 100B includes microphones 101a and 101b, an A/D converter 102, a frame division unit 103, a fast Fourier transform (FFT) unit 104, a target sound enhancement unit 105, a noise estimation unit 106, and a correction coefficient calculation unit 107. The voice input system 100B further includes a correction coefficient changing unit 108, a post-filtering unit 109, an inverse fast Fourier transform (IFFT) unit 110, a waveform synthesis unit 111, an ambient noise state estimation unit 112, and a correction coefficient changing unit 113.

[0112] Among the correction coefficients β(f, t) calculated by the correction coefficient calculation unit 107 in each frame, the correction coefficient changing unit 108 changes the coefficients of the band causing spatial aliasing so as to crush the peaks that can occur at specific frequencies, obtaining the correction coefficient β′(f, t). Although detailed description is omitted, the correction coefficient changing unit 108 is the same as the correction coefficient changing unit 108 of the voice input system 100 shown in FIG. 1. The correction coefficient changing unit 108 constitutes a first correction coefficient changing unit.
[0113] The ambient noise state estimation unit 112 calculates, for each frame, the correlation coefficient corr between the observation signal of the microphone 101a and that of the microphone 101b, and uses it as the sound source number information of the ambient noise. Although detailed description is omitted, the ambient noise state estimation unit 112 is the same as that of the voice input system 100A shown in FIG. 16.

[0114] In each frame, the correction coefficient changing unit 113 further changes the correction coefficient β′(f, t) changed by the correction coefficient changing unit 108, based on the correlation coefficient corr (the sound source number information of the ambient noise) obtained by the ambient noise state estimation unit 112, to obtain a correction coefficient β″(f, t). Although detailed description is omitted, the correction coefficient changing unit 113 is the same as that of the voice input system 100A shown in FIG. 16. The correction coefficient changing unit 113 constitutes a second correction coefficient changing unit. In practice, the post-filtering unit 109 uses not the correction coefficient β(f, t) itself calculated by the correction coefficient calculation unit 107 but the correction coefficient β″(f, t) after these changes.

[0115] The other parts of the voice input system 100B shown in FIG. 26 are configured in the same manner as the voice input systems 100 and 100A shown in FIGS. 1 and 16.

[0116] The flowchart in FIG. 27 shows the procedure of processing (for one frame) in the correction coefficient changing unit 108, the ambient noise state estimation unit 112, and the correction coefficient changing unit 113. Each unit starts processing in step ST21. Thereafter, in step ST22, the correction coefficient changing unit 108 acquires the correction coefficient β(f, t) from the correction coefficient calculation unit 107.
Then, in step ST23, the correction coefficient changing unit 108 searches the coefficients of the frequencies f from the low band in the current frame t and locates the first frequency Fa(t) on the low band side at which the coefficient value drops.

[0117] Next, in step ST24, the correction coefficient changing unit 108 checks a flag indicating whether or not to smooth the band higher than Fa(t), that is, the band causing spatial aliasing. This flag is set in advance by a user operation. When the flag is on, in step ST25 the correction coefficient changing unit 108 smooths, in the frequency direction, the coefficients of the band at or above Fa(t) among the correction coefficients β(f, t) calculated by the correction coefficient calculation unit 107, obtaining the changed correction coefficient β′(f, t) of each frequency f. When the flag is off in step ST24, in step ST26 the correction coefficient changing unit 108 replaces the coefficients of the band at or above Fa(t) among the correction coefficients β(f, t) with "1" to obtain the correction coefficient β′(f, t).

[0118] After the process of step ST25 or step ST26, in step ST27 the ambient noise state estimation unit 112 acquires data frames x1(t) and x2(t) of the observation signals of the microphones 101a and 101b. Then, in step ST28, the ambient noise state estimation unit 112 calculates a correlation coefficient corr(t) indicating the degree of correlation of the observation signals of the microphones 101a and 101b (see equation (8)).

[0119] Next, in step ST29, the correction coefficient changing unit 113 uses the value of the correlation coefficient corr(t) calculated in step ST28 to calculate the number of smoothing frames γ (see FIG. 23).
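Steps ST25 and ST26 above (smoothing the coefficients at or above Fa(t) in the frequency direction when the flag is on, or replacing them with "1" when it is off) can be sketched as follows; the moving-average window width and the function name are assumptions:

```python
import numpy as np

def change_coefficients(beta, fa_idx, smooth_flag, width=5):
    """Crush the aliasing peaks at or above the first dip frequency Fa(t).

    Flag on  (step ST25): smooth the coefficients at bins >= fa_idx in
    the frequency direction with a moving average of `width` bins.
    Flag off (step ST26): replace those coefficients with 1.
    `width` and the moving-average choice are assumptions.
    """
    beta_mod = beta.copy()
    if smooth_flag:
        kernel = np.ones(width) / width
        smoothed = np.convolve(beta, kernel, mode="same")
        beta_mod[fa_idx:] = smoothed[fa_idx:]
    else:
        beta_mod[fa_idx:] = 1.0
    return beta_mod
```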
Then, in step ST30, the correction coefficient changing unit 113 changes the correction coefficient β′(f, t) obtained by the correction coefficient changing unit 108 by smoothing it over the γ smoothing frames calculated in step ST29, obtaining the correction coefficient β″(f, t). After the process of step ST30, each unit ends processing in step ST31.

[0120] The operation of the voice input system 100B shown in FIG. 26 will be briefly described. Ambient sound is collected by the microphones 101a and 101b arranged at a predetermined interval, and observation signals are obtained. The observation signals obtained by the microphones 101a and 101b are converted from analog to digital by the A/D converter 102 and supplied to the frame division unit 103, which divides the observation signals from the microphones 101a and 101b into frames of a predetermined time length.

[0121] The framed signals of each frame obtained by the frame division unit 103 are sequentially supplied to the fast Fourier transform unit 104, which performs fast Fourier transform (FFT) processing on them to obtain, as frequency-domain signals, the observation signal X1(f, t) of the microphone 101a and the observation signal X2(f, t) of the microphone 101b.

[0122] The observation signals X1(f, t) and X2(f, t) obtained by the fast Fourier transform unit 104 are supplied to the target sound enhancement unit 105, which subjects them to conventionally known DS processing, adaptive beamformer processing, or the like to obtain the target sound estimation signal Z(f, t).
For example, when DS processing is used, the observation signals X1(f, t) and X2(f, t) are added and the result is divided by 2 to obtain the target sound estimation signal Z(f, t) (see equation (3)).

[0123] The observation signals X1(f, t) and X2(f, t) obtained by the fast Fourier transform unit 104 are also supplied to the noise estimation unit 106, which subjects them to conventionally known NBF processing, adaptive beamformer processing, or the like to obtain the noise estimation signal N(f, t). For example, when NBF processing is used, the observation signal X2(f, t) is subtracted from the observation signal X1(f, t) and the result is divided by 2 to obtain the noise estimation signal N(f, t) (see equation (4)).

[0124] The target sound estimation signal Z(f, t) obtained by the target sound enhancement unit 105 and the noise estimation signal N(f, t) obtained by the noise estimation unit 106 are supplied to the correction coefficient calculation unit 107, which calculates the correction coefficients β(f, t) for correcting the post-filtering process based on Z(f, t) and N(f, t). Within each frame, a coefficient is calculated for each frequency (see equation (5)).

[0125] The correction coefficient β(f, t) calculated by the correction coefficient calculation unit 107 is supplied to the correction coefficient changing unit 108. Among the correction coefficients β(f, t), the correction coefficient changing unit 108 changes the coefficients of the band causing spatial aliasing so as to crush the peaks that can occur at specific frequencies, obtaining the changed correction coefficient β′(f, t).
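The DS and NBF operations of paragraphs [0122] and [0123] (equations (3) and (4)) reduce to a half-sum and a half-difference of the two observation spectra. A minimal sketch for the front-target case:

```python
import numpy as np

def ds_beamformer(X1, X2):
    """Delay-and-sum: Z(f, t) = (X1(f, t) + X2(f, t)) / 2  (equation (3))."""
    return (X1 + X2) / 2.0

def null_beamformer(X1, X2):
    """NBF: N(f, t) = (X1(f, t) - X2(f, t)) / 2  (equation (4))."""
    return (X1 - X2) / 2.0
```

For a target sound arriving from the front, both microphones receive it with the same phase, so the difference in `null_beamformer` cancels the target and leaves mainly the noise components.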
[0126] The framed signals of each frame obtained by the frame division unit 103, that is, the observation signals x1(n) and x2(n) of the microphones 101a and 101b, are also supplied to the ambient noise state estimation unit 112, which obtains the correlation coefficient corr of the observation signals x1(n) and x2(n) and uses it as the sound source number information of the ambient noise (see equation (8)).

[0127] The changed correction coefficient β′(f, t) obtained by the correction coefficient changing unit 108 is further supplied to the correction coefficient changing unit 113, which is also supplied with the correlation coefficient corr obtained by the ambient noise state estimation unit 112. In each frame, the correction coefficient changing unit 113 further changes the correction coefficient β′(f, t) obtained by the correction coefficient changing unit 108, based on the correlation coefficient corr (the sound source number information of the ambient noise).

[0128] First, the correction coefficient changing unit 113 obtains the number of smoothing frames γ based on the correlation coefficient corr; γ is small when the value of corr is large, and large when the value of corr is small (see FIG. 23). Next, the correction coefficient changing unit 113 smooths the correction coefficient β′(f, t) in the frame direction (time direction) over the γ smoothing frames, obtaining the changed correction coefficient β″(f, t) of each frame (see FIG. 24).
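The processing of paragraphs [0126] to [0128] can be sketched as below. The Pearson form of equation (8) and the linear corr-to-γ mapping of FIG. 23 (including the γ range) are assumptions for illustration:

```python
import numpy as np

def frame_correlation(x1, x2):
    """Correlation coefficient corr of the two microphone frames (cf. equation (8)).

    A standard Pearson correlation is assumed.
    """
    x1 = x1 - np.mean(x1)
    x2 = x2 - np.mean(x2)
    return float(np.sum(x1 * x2) /
                 (np.sqrt(np.sum(x1 ** 2) * np.sum(x2 ** 2)) + 1e-12))

def smoothing_frames(corr, gamma_min=1, gamma_max=20):
    """Map corr to the smoothing-frame count gamma (cf. FIG. 23):
    small gamma for large corr (few sources), large gamma for small
    corr (many sources).  The linear mapping and range are assumptions.
    """
    c = min(max(abs(corr), 0.0), 1.0)
    return int(round(gamma_max - (gamma_max - gamma_min) * c))

def smooth_over_frames(beta_history, gamma):
    """Average the correction coefficients over the last gamma frames
    (frame-direction smoothing, cf. FIG. 24)."""
    return np.mean(beta_history[-gamma:], axis=0)
```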
[0129] The target sound estimation signal Z(f, t) obtained by the target sound enhancement unit 105 and the noise estimation signal N(f, t) obtained by the noise estimation unit 106 are supplied to the post-filtering unit 109, as is the correction coefficient β″(f, t) changed by the correction coefficient changing unit 113. The post-filtering unit 109 removes the noise components remaining in the target sound estimation signal Z(f, t) by post-filtering processing using the noise estimation signal N(f, t). The correction coefficient β″(f, t) is used to correct this post-filtering process, that is, to match the gain of the noise component remaining in the target sound estimation signal Z(f, t) with the gain of the noise estimation signal N(f, t).

[0130] The post-filtering unit 109 obtains the noise suppression signal Y(f, t) using a known technique such as the spectral subtraction method or the MMSE-STSA method. For example, when the spectral subtraction method is used, the noise suppression signal Y(f, t) is obtained based on the following equation (10). Y(f, t) = Z(f, t) − β″(f, t) * N(f, t) (10)

[0131] The noise suppression signal Y(f, t) of each frequency, output for each frame from the post-filtering unit 109, is supplied to the inverse fast Fourier transform unit 110, which performs inverse fast Fourier transform processing on it for each frame to obtain a framed signal converted into a time-domain signal. The framed signals of each frame are sequentially supplied to the waveform synthesis unit 111, which synthesizes them to obtain a noise-suppressed speech signal SAout, continuous in time series, as the output of the voice input system 100B.
[0132] As described above, in the voice input system 100B shown in FIG. 26, the correction coefficient β(f, t) calculated by the correction coefficient calculation unit 107 is changed by the correction coefficient changing unit 108. In this case, among the correction coefficients β(f, t), the coefficients of the band causing spatial aliasing (the band higher than Fa(t)) are changed so as to crush the peaks that can occur at specific frequencies, obtaining the changed correction coefficient β′(f, t).

[0133] Further, in the voice input system 100B shown in FIG. 26, the correction coefficient β′(f, t) changed by the correction coefficient changing unit 108 is further changed by the correction coefficient changing unit 113. In this case, the ambient noise state estimation unit 112 obtains the correlation coefficient corr of the observation signals x1(n) and x2(n) of the microphones 101a and 101b as the sound source number information of the ambient noise. Based on this information, the correction coefficient changing unit 113 obtains the number of smoothing frames γ so that it increases as the number of sound sources increases, and smooths the correction coefficient β′(f, t) in the frame direction to obtain the changed correction coefficient β″(f, t) of each frame. The post-filtering unit 109 uses this changed correction coefficient β″(f, t).

[0134] Therefore, it is possible to reduce the adverse effect that the coefficient peaks occurring at specific frequencies of the band causing spatial aliasing have on the output sound, and to suppress the deterioration of sound quality. This enables noise removal processing independent of the microphone spacing.
Thus, even when the microphones 101a and 101b are noise canceling microphones installed in headphones and the microphone spacing is wide, the noise can be corrected efficiently, and noise removal processing with little distortion is performed.

[0135] In addition, in a situation where there are innumerable noise sources in the surroundings, it is possible to suppress the change of the correction coefficient in the frame direction (time direction) and reduce the influence on the output sound. This makes it possible to perform noise removal processing in accordance with the ambient noise conditions. Thus, even when there are many noise sources in the surroundings, the noise can be corrected efficiently, and good noise removal processing with little distortion is performed.

[0136] <4. Fourth Embodiment> [Configuration Example of Voice Input System] FIG. 28 shows a configuration example of a voice input system 100C according to a fourth embodiment. Like the voice input systems 100, 100A, and 100B shown in FIGS. 1, 16, and 26, the voice input system 100C performs voice input using the noise canceling microphones installed on the left and right earpieces of noise canceling headphones. In FIG. 28, parts corresponding to those in FIG. 26 are assigned the same reference numerals, and detailed description thereof is omitted as appropriate.

[0137] The voice input system 100C includes microphones 101a and 101b, an A/D converter 102, a frame division unit 103, a fast Fourier transform (FFT) unit 104, a target sound enhancement unit 105, a noise estimation unit 106, and a correction coefficient calculation unit 107C.
The voice input system 100C further includes correction coefficient changing units 108 and 113, a post-filtering unit 109, an inverse fast Fourier transform (IFFT) unit 110, a waveform synthesis unit 111, an ambient noise state estimation unit 112, and a target sound section detection unit 114.

[0138] The target sound section detection unit 114 detects sections containing the target sound. As shown in FIG. 29, based on the target sound estimation signal Z(f, t) obtained by the target sound enhancement unit 105 and the noise estimation signal N(f, t) obtained by the noise estimation unit 106, it determines in each frame whether the frame is a target sound section and outputs target sound section information.

[0139] The target sound section detection unit 114 obtains the energy ratio of the target sound estimation signal Z(f, t) to the noise estimation signal N(f, t). The following equation (11) expresses this energy ratio.

[0140] The target sound section detection unit 114 determines whether the energy ratio is larger than a threshold. When the energy ratio is larger than the threshold, the target sound section detection unit 114 determines that the frame is a target sound section and outputs "1" as the target sound section information, as shown in the following equation (12); otherwise, it determines that the frame is not a target sound section and outputs "0".

[0141] [0142] When the target sound is in front as shown in FIG. 30 and the target sound is present, the gain difference between the target sound estimation signal Z(f, t) and the noise estimation signal N(f, t) is large; with noise only, the difference in their gains is small. The same processing can be performed when the microphone spacing is known and the target sound is not in front but in an arbitrary direction.
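The detection of equations (11) and (12) can be sketched as a frame-wise energy-ratio test; the threshold value is an assumption:

```python
import numpy as np

def is_target_sound_section(Z, N, threshold=2.0):
    """Frame-wise target-sound section detection (cf. equations (11), (12)).

    Returns 1 when the energy ratio of the target sound estimate Z to
    the noise estimate N exceeds the threshold, else 0.  The threshold
    value is an assumption.
    """
    ratio = np.sum(np.abs(Z) ** 2) / (np.sum(np.abs(N) ** 2) + 1e-12)
    return 1 if ratio > threshold else 0
```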
[0143] The correction coefficient calculation unit 107C calculates the correction coefficient β(f, t) in the same manner as the correction coefficient calculation unit 107 of the voice input systems 100, 100A, and 100B in FIGS. 1, 16, and 26. However, unlike the correction coefficient calculation unit 107, the correction coefficient calculation unit 107C determines whether to calculate the correction coefficient β(f, t) based on the target sound section information from the target sound section detection unit 114. That is, the correction coefficient calculation unit 107C newly calculates and outputs the correction coefficient β(f, t) in frames having no target sound; in the other frames, it does not calculate the correction coefficient β(f, t) and outputs the same correction coefficient β(f, t) as that of the previous frame.

[0144] The other parts of the voice input system 100C shown in FIG. 28 are configured in the same manner as the voice input system 100B shown in FIG. 26. Therefore, the voice input system 100C provides the same effects as the voice input system 100B shown in FIG. 26.

[0145] Further, in the voice input system 100C, the correction coefficient calculation unit 107C calculates the correction coefficient β(f, t) only in sections where there is no target sound. In this case, since the target sound estimation signal Z(f, t) contains only noise components, the correction coefficient β(f, t) can be calculated accurately without being affected by the target sound, and as a result, good noise removal processing is performed.

[0146] <5. Modified Example> In the above embodiments, the microphones 101a and 101b are noise canceling microphones provided on the left and right earpieces of noise canceling headphones. However, the microphones 101a and 101b may also be microphones installed in a personal computer main body.

[0147] Further, in the voice input systems 100 and 100A shown in FIGS.
1 and 16, as in the voice input system 100C shown in FIG. 28, a target sound section detection unit 114 may be provided so that the correction coefficient calculation unit 107 calculates the correction coefficient β(f, t) only in frames without the target sound.

[0148] The present invention can be applied to a system for making a call using noise canceling microphones installed in noise canceling headphones or microphones installed in a personal computer.

[0149] 100, 100A, 100B, 100C ... voice input system, 101a, 101b ... microphone, 102 ... A/D converter, 103 ... frame division unit, 104 ... fast Fourier transform (FFT) unit, 105 ... target sound enhancement unit, 106 ... noise estimation unit (target sound suppression unit), 107, 107C ... correction coefficient calculation unit, 108 ... correction coefficient changing unit, 109 ... post-filtering unit, 110 ... inverse fast Fourier transform (IFFT) unit, 111 ... waveform synthesis unit, 112 ... ambient noise state estimation unit, 113 ... correction coefficient changing unit, 114 ... target sound section detection unit
