J. Cent. South Univ. Technol. (2011) 18: 1595-1601
DOI: 10.1007/s11771-011-0877-1
Adaptive bands filter bank optimized by genetic algorithm for robust speech recognition system
HUANG Li-xia(黄丽霞)1, 2, G. Evangelista2, ZHANG Xue-ying(张雪英)1
1. Department of Information and Engineering, Taiyuan University of Technology, Taiyuan 030024, China;
2. Department of Science Technology (ITN), Linköping University, Norrköping SE-60174, Sweden
© Central South University Press and Springer-Verlag Berlin Heidelberg 2011
Abstract: Perceptual auditory filter banks such as the Bark-scale filter bank are widely used as front-end processing in speech recognition systems. However, the design of optimized filter banks that provide higher accuracy in recognition tasks is still an open problem. Based on spectral analysis in feature extraction, an adaptive bands filter bank (ABFB) is presented. The design adopts flexible bandwidths and center frequencies for the frequency responses of the filters and uses a genetic algorithm (GA) to optimize the design parameters. The optimization is realized by combining the front-end filter bank with the back-end recognition network in the performance evaluation loop. Deploying the ABFB together with the zero-crossing peak amplitude (ZCPA) feature as the front-end processing for a radial basis function (RBF) recognition system shows a significant improvement in robustness compared with the Bark-scale filter bank. In the ABFB, several sub-bands are still concentrated toward lower frequencies, but their exact locations are determined by the recognition performance rather than by perceptual criteria. For ease of optimization, only symmetrical bands are considered here, which still provide satisfactory results.
Key words: perceptual filter banks; Bark scale; speaker independent speech recognition systems; zero-crossing peak amplitude; genetic algorithm
1 Introduction
Since the first automatic speech recognition (ASR) system was developed, much research has aimed at improving the performance of such systems in adverse and noisy conditions. In the last few decades, this technology has achieved great success, not only in laboratories but also in practical and commercial systems. However, there is still a large discrepancy between current speech systems and the human hearing system. To reduce this discrepancy, two main aspects, namely feature extraction and the recognition network, have been intensively explored. Linear prediction cepstral coefficients (LPCC) [1], mel-frequency cepstral coefficients (MFCC) [2] and zero-crossing peak amplitude (ZCPA) [3] are among the main features used in speech recognition tasks. Hidden Markov models (HMM) [4], artificial neural networks (ANN) [5] and the sequential speaker clustering algorithm [6] are used in back-end speech processing. However, feature extraction depends closely on the front-end filter bank, whose main task is to subdivide the signal spectrum meaningfully so as to improve the extraction of useful information. Therefore, a careful design of the front-end filter bank can improve feature extraction, which in turn contributes to a higher recognition rate.
Early front-end processing was based on models of the physical structure of the auditory system, for which it is difficult to transform the structural details into parameters and the computational complexity is high. Much research has focused on perceptual properties of the hearing system [7-9]. Bark-scale spectrum splitting is one criterion used in the design of front-end filters [10-11]. An adaptive bands filter bank (ABFB) is proposed here, which allows higher flexibility in the design than perceptual criteria do. In this work, a simple mathematical model is introduced for the design of the basic filter bank for signal spectrum splitting. The parameters controlling the filter bandwidths and center frequencies are then optimized by means of a genetic algorithm (GA), which combines the front-end filter bank with the back-end network in the performance evaluation and yields an optimized filter bank (OFB). The final version of the ABFB is obtained by averaging the scaling factors of the OFB models. The ABFB was evaluated together with the extraction of the ZCPA feature and a radial basis function (RBF) based recognition network. The experiments showed that the ABFB significantly outperforms the Bark-scale filter bank.
2 Design of basic filter bank model
2.1 Bark-scale filter bank
Traditional FIR filters were replaced by IIR warped filter banks (WFBs) in order to introduce non-uniform frequency resolution and unsymmetrical bandwidths in speech recognition tasks [10]. The WFB bandwidths are warped by a single parameter, the warping factor α, which corresponds to the pole of the first-order real all-pass transformation. The typical warping factor α=0.48 gives a close approximation of the Bark-scale filter bank. Compared with generic FIR filters, WFBs show higher robustness in noisy environments. In this section, a new design that achieves higher flexibility in the choice of the bands is proposed.
2.2 Basic filter bank model
2.2.1 Shift and scale properties
Given an N-length low-pass impulse response sequence (window) h(n), with discrete Fourier transform (DFT):
$$H(e^{j\omega})=\sum_{n=0}^{N-1}h(n)\,e^{-j\omega n} \qquad (1)$$
one can obtain tuned bandpass versions of this filter by modulation, i.e., by changing the impulse response to $h(n)e^{j\omega_0 n}$. The resulting frequency response is
$$\sum_{n=0}^{N-1}h(n)\,e^{j\omega_0 n}e^{-j\omega n}=H(e^{j(\omega-\omega_0)}) \qquad (2)$$
which means H(ejω) shifts to angular frequency ω0. This process is shown in Figs.1 (a) and (c).
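The shift property can be checked numerically. The following is a minimal sketch, assuming modulation by a complex exponential with angular frequency ω0 and a symmetric 20-point Hamming window; the helper names (hamming, dtft, modulate) and the value ω0=0.3π are illustrative choices, not taken from the paper.

```python
import cmath
import math

def hamming(N):
    """N-point symmetric Hamming window."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def dtft(x, omega):
    """Evaluate X(e^{j omega}) = sum_n x(n) e^{-j omega n} at one frequency."""
    return sum(xn * cmath.exp(-1j * omega * n) for n, xn in enumerate(x))

def modulate(x, omega0):
    """Multiply x(n) by e^{j omega0 n}, which shifts the passband to omega0."""
    return [xn * cmath.exp(1j * omega0 * n) for n, xn in enumerate(x)]

N = 20                       # window length used in the text
h = hamming(N)
omega0 = 0.3 * math.pi       # illustrative target frequency (assumption)
s = modulate(h, omega0)

# Locate the magnitude peak of each response on a grid over [0, pi].
grid = [k * math.pi / 1000 for k in range(1001)]
peak_h = max(grid, key=lambda w: abs(dtft(h, w)))
peak_s = max(grid, key=lambda w: abs(dtft(s, w)))
```

Here peak_h sits at ω=0 (low-pass prototype) and peak_s at ω0, while the shape of the response around the peak is unchanged.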
The choice of the window is not highly critical. Since the Hamming window was adopted in designing Bark-scale filter bank [10], it was also used to design the basic filter bank model, though other windows can be adopted. In the present experiments, the window length is fixed to N=20.
Along with the frequency shift, in the present design, it is also necessary to change the bandwidth, which can be performed by scaling the window h(n).
The scaled impulse sequence g(n) is obtained by resampling the original impulse sequence h(n). As the window h(n) is only given numerically, resampling is used as an approximation for scaling. For example, g(n) is obtained by resampling with the ratio 1/3 in Eq.(3):
g(n)=resample(h(n), 1, 3) (3)
Fig.1 Shift and scale properties: (a) Hamming window h(n); (b) Resampled h(n); (c) Shift frequency; (d) Scale bandwidth
Then the corresponding frequency response G(ω) can be calculated by Eq.(1). Since the impulse sequence has been contracted, the frequency response has been dilated, disregarding the aliasing, compared to original frequency response. The process is indicated in Figs.1(b) and (d).
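The dilation of the frequency response can be illustrated with a short sketch. The resample operation of Eq.(3) is not specified further in the text, so plain linear interpolation without anti-aliasing is used here as a stand-in; resample_linear and half_power_width are hypothetical helper names, and the half-power point is only one possible way of measuring bandwidth.

```python
import cmath
import math

def hamming(N):
    """N-point symmetric Hamming window."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def resample_linear(x, p, q):
    """Resample x by the ratio p/q using linear interpolation.

    Stand-in for resample(x, p, q): no anti-aliasing filter is applied,
    and samples beyond the end of x are treated as zero.
    """
    out_len = -((-len(x) * p) // q)         # ceil(len(x) * p / q)
    out = []
    for m in range(out_len):
        t = m * q / p                        # position on the input index axis
        k = int(t)
        frac = t - k
        left = x[k] if k < len(x) else 0.0
        right = x[k + 1] if k + 1 < len(x) else 0.0
        out.append((1.0 - frac) * left + frac * right)
    return out

def magnitude(x, omega):
    """|X(e^{j omega})| computed directly from the definition in Eq.(1)."""
    return abs(sum(xn * cmath.exp(-1j * omega * n) for n, xn in enumerate(x)))

def half_power_width(x, step=math.pi / 2000):
    """Smallest grid frequency at which |X| falls below |X(0)| / sqrt(2)."""
    threshold = magnitude(x, 0.0) / math.sqrt(2.0)
    omega = 0.0
    while omega <= math.pi and magnitude(x, omega) >= threshold:
        omega += step
    return omega

h = hamming(20)
g = resample_linear(h, 1, 3)     # contracted window, as in Eq.(3)
width_h = half_power_width(h)
width_g = half_power_width(g)
```

With these assumptions the contracted window g(n) has 7 samples and width_g is clearly larger than width_h, in line with the dilation sketched in Figs.1(b) and (d).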
2.2.2 Basic 16-channel filter bank model
As described above, a Hamming sequence is chosen to design a basic 16-channel filter bank model. The Hamming sequence h(n) is resampled by Eq.(4) to give the new impulse sequence gi(n). Then, gi(n) is modulated by $e^{j\omega_i n}$ as in Eq.(5), so that the corresponding frequency response shifts to ωi, as shown in Eq.(6):
gi(n)=resample(h, pi, qi) (4)
$$s_i(n)=g_i(n)\,e^{j\omega_i n} \qquad (5)$$
$$S_i(e^{j\omega})=G_i(e^{j(\omega-\omega_i)}) \qquad (6)$$
where αi=pi/qi is called the i-th channel scaling factor, ωi is the i-th channel frequency, si(n) is the i-th channel impulse sequence, Si(ejω) is its frequency response, Gi(ejω) is the frequency response of gi(n), and 1≤i≤16. Thus, the basic 16-channel filter bank can be easily obtained by scaling and then shifting to the desired center frequencies. The basic 16-channel filter bank model is shown in Fig.2.
Fig.2 Basic 16-channel filter bank model
2.3 Parameters in basic 16-channel filter bank constituting Bark-scale filter bank model
The Bark-scale filter bank possesses good performance [10]. Since the GA is highly dependent on the initial population, the OFB model is optimized starting from the Bark-scale filter bank. The Bark-scale filter bank is shown in Fig.3. Table 1 gives the center frequencies of the Bark-scale filter bank.
Fig.3 Bark-scale filter bank
Table 1 Center frequencies of Bark-scale filter bank
The frequency response of the Hamming sequence h(n) is H(ω). The scaled sequence gi(n) is calculated by Eq.(4). In its frequency response, a frequency ω of H(ω) is mapped to ωqi/pi, i.e., ω/αi.
Since the Bark-scale filter bank provides the initial values, and adjacent bands in the Bark-scale filter bank cross at the level P=0.95, as shown in Fig.3, the distance C is taken as the frequency at which H(ω) crosses P=0.95, which is 151 Hz. The distances Di can be calculated by Eq.(7a). The center frequencies can be calculated by Eq.(7b):
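$$D_i=C/\alpha_i \quad (i=1, 2, \ldots, 16) \qquad (7a)$$
$$f_i=f_1+\sum_{k=1}^{i-1}(D_k+D_{k+1}) \quad (i=1, 2, \ldots, 16) \qquad (7b)$$
where αi is the i-th scaling factor; Di is the i-th channel distance; fi is the i-th channel center frequency, with f1 the first center frequency.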
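As a worked illustration, the distances and center frequencies can be computed from a list of scaling factors. This is a sketch under stated assumptions: adjacent center frequencies are taken to be separated by the sum of the two neighbouring distances, which is consistent with how the distances are summed in Eq.(8); the scaling factors and the first center frequency of 200 Hz below are hypothetical and are not the values of Table 2.

```python
C_HZ = 151.0  # distance C from the text: where H(omega) crosses P = 0.95

def band_distances(alphas, c_hz=C_HZ):
    """Eq.(7a): D_i = C / alpha_i for every channel."""
    return [c_hz / a for a in alphas]

def center_frequencies(first_center_hz, distances):
    """Assumed adjacency rule: f_{i+1} = f_i + D_i + D_{i+1}."""
    centers = [first_center_hz]
    for d_prev, d_next in zip(distances, distances[1:]):
        centers.append(centers[-1] + d_prev + d_next)
    return centers

# Hypothetical scaling factors (NOT Table 2), decreasing with frequency.
alphas = [2.0, 2.0, 1.5, 1.0, 0.5]
D = band_distances(alphas)
f = center_frequencies(200.0, D)
```

Larger scaling factors give smaller distances, so the bands with α=2 are packed closely at low frequencies while the band with α=0.5 is wide and far from its neighbour.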
The relationship between each pair of adjacent bands in the basic 16-channel filter bank model is shown in Fig.4. The scaling factors and center frequencies used for simulating the Bark-scale filter bank are listed in Table 2.
There are differences between the basic 16-channel filter bank model in Table 2 and Bark-scale filter bank in Table 1, since the former one adopts symmetrical bands.
Figure 5 describes the basic 16-channel filter bank model with parameters listed in Table 2, matching a Bark-scale filter bank.
Fig.4 Relationship between each two adjacent bands
Table 2 Basic 16-channel filter bank model scaling factors and center frequencies used for simulating Bark-scale filter bank
Fig.5 Basic 16-channel filter bank model simulating Bark-scale filter bank
3 Optimized filter bank by genetic algorithm
3.1 Genetic algorithm
The genetic algorithm (GA) is based on the concept of natural selection and evolution processes. Due to its good performance in many applications [12-14], it has become a popular method for optimization. This popularity can be attributed to the robustness in avoiding local optima, the low requirements on cost or error function derivatives and ease of use in parallel computing [15]. The main procedures of the GA are as follows [16]:
1) Choose a randomly generated population (feasible candidate solutions).
2) Calculate the fitness of each chromosome in the population.
3) Perform the selection process on the current population to choose the parents of the new population.
4) Create a new population by the genetic operators of crossover and mutation.
5) Check the termination condition.
3.2 Optimized filter bank
The Bark-scale filter bank, which is simulated by the selection of the first center frequency and by the scaling factors αi, shows a good recognition rate [10]. However, the optimum selection of the parameters αi, which control the center frequency and bandwidth of each sub-band, is still an open question. The objective function here is based on the recognition rate and can be expressed as fobj=1-f(α1, α2, …, α16), where f(α1, α2, …, α16) is the correct recognition rate, which can be obtained only by running feature extraction and the recognition network. Thus, the objective function is not an analytical expression, and there is no way to compute its derivatives. Therefore, the GA is particularly suitable for this optimization. Furthermore, the solution αi that minimizes the objective function corresponds to the maximum recognition rate. The filter bank obtained from the optimum selection of αi constitutes the OFB.
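The GA loop for a derivative-free objective can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: toy_objective is a stand-in for fobj (the real objective requires ZCPA extraction and RBF recognition), and the tournament selection, single-point crossover, Gaussian mutation, elitism, bounds and seed values are all assumptions made for the example. Only the population size of 32 and the three generations follow the text.

```python
import random

def toy_objective(alphas):
    """Stand-in for f_obj = 1 - f(alpha_1..alpha_16); NOT a recognition rate."""
    target = 1.2  # arbitrary illustrative optimum
    return sum((a - target) ** 2 for a in alphas) / len(alphas)

def run_ga(objective, seed_individual, lower, upper,
           pop_size=32, generations=3, mutation_sigma=0.1, rng=None):
    """Minimize objective over box-bounded real vectors with a simple GA."""
    rng = rng or random.Random(0)

    def mutate(ind):
        # Gaussian perturbation, clipped to the per-channel bounds.
        return [max(lo, min(hi, g + rng.gauss(0.0, mutation_sigma)))
                for g, lo, hi in zip(ind, lower, upper)]

    # 1) Initial population: the seed plus perturbed copies of it.
    population = [list(seed_individual)]
    population += [mutate(seed_individual) for _ in range(pop_size - 1)]
    history = []
    for _ in range(generations):
        # 2) Fitness of every chromosome (lower objective is better).
        scored = sorted(population, key=objective)
        history.append(objective(scored[0]))

        # 3) Selection: binary tournament between two random individuals.
        def pick():
            a, b = rng.sample(scored, 2)
            return a if objective(a) <= objective(b) else b

        # 4) Crossover and mutation; the best individual is kept (elitism).
        children = [scored[0]]
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, len(p1))
            children.append(mutate(p1[:cut] + p2[cut:]))
        population = children
    # 5) Termination: the generation budget is exhausted.
    best = min(population, key=objective)
    history.append(objective(best))
    return best, history

seed = [2.0] * 8 + [0.5] * 8      # hypothetical starting point, not Table 2
best, history = run_ga(toy_objective, seed, [0.4] * 16, [5.0] * 16)
```

Because the best chromosome is always carried over, the recorded best objective value never increases from one generation to the next; in the actual setting each call to the objective would be a full training and recognition run.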
4 Experimental conditions and related parameters
4.1 Corpus, ZCPA and RBF
Speaker-independent isolated-word experiments were conducted to evaluate the robustness of the ABFB with ZCPA in noisy environments. The speech data consist of vocabularies of 10, 20, 30, 40 and 50 Korean words uttered by 16 male speakers at different SNRs (15 dB, 20 dB, 25 dB, 30 dB and clean). Each speaker uttered each word three times. The utterances were sampled at 11.025 kHz with 16-bit resolution. The data were divided into two sets: nine speakers formed the training set and the other seven speakers formed the testing set.
The ZCPA feature was proposed by KIM [3]. Here 16 vectors for each frame and 1 024 vectors after normalization in all were used in these experiments [10]. According to the frequency domain method in ZCPA processing, each filter impulse sequence must not be longer than 100 samples because of the relationship between circular convolution and linear convolution [11]. Thus, with a window of 20 samples, the scaling factor cannot be larger than 100/20=5.
For example, starting from a window h(n) of 20 samples duration, the length of the resampled window
g(n)=resample(h(n), m, n)
is m×length(h(n))/n samples, where m/n≈αi.
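The length bound can be checked with a few lines of arithmetic. The sketch below assumes that a scaling factor is approximated by a ratio of small integers m/n and that a fractional length is rounded up; both choices, the helper names and the denominator limit of 20 are illustrative assumptions rather than details given in the text.

```python
from fractions import Fraction

WINDOW_LEN = 20     # samples in the prototype window h(n)
MAX_LEN = 100       # longest impulse sequence allowed by the ZCPA processing

def resample_ratio(alpha, max_den=20):
    """Approximate alpha by integers (m, n) with m/n close to alpha."""
    frac = Fraction(alpha).limit_denominator(max_den)
    return frac.numerator, frac.denominator

def resampled_length(window_len, m, n):
    """Length m * window_len / n, rounded up (rounding is an assumption)."""
    return -((-window_len * m) // n)

lengths = {}
for alpha in (0.4, 1.5, 5.0):
    m, n = resample_ratio(alpha)
    lengths[alpha] = resampled_length(WINDOW_LEN, m, n)
```

With these assumptions a factor of 5 gives exactly the 100-sample limit, while 0.4 gives an 8-sample window.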
For comparison with FIR and the Bark-scale filter bank, the same RBF was adopted as recognizer. The ZCPA features are 1 024-dimensional vectors and they constitute the sole input of the RBF based recognizer.
4.2 Related parameters
Since the GA convergence rate strongly depends on the initial population, in order to reduce complexity, the initial population in the present experiments is the set of values for the scaling factors αi, as listed in Table 2. The impulse sequence length is not larger than 100, so the maximum value of each of the 16 scaling factors αi is 5. The minimum values are listed in Table 3. The largest of these (2) provides narrow bandwidth, which allocates high frequency resolution to the low frequency domain, while the smallest (0.4) provides wide bandwidth, corresponding to low frequency resolution at high frequencies.
Table 3 Minimum values of scaling factors bound
Speech signal energy lies mainly between 200 Hz and 3 400 Hz; thus, with f1 denoting the first center frequency, the optimization is subject to the following constraint:
$$f_1+D_1+2(D_2+\cdots+D_{15})+D_{16}\le 3\,400 \qquad (8)$$
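A candidate set of scaling factors can be screened against the bounds and against Eq.(8) before the costly recognition run. The sketch below assumes that the leading term of Eq.(8) is the first center frequency and that Di=C/αi with C=151 Hz; the function names, the first center frequency of 200 Hz and the uniform bounds are hypothetical.

```python
def upper_center_frequency(first_center_hz, alphas, c_hz=151.0):
    """Left-hand side of Eq.(8): f_1 + D_1 + 2(D_2 + ... + D_15) + D_16."""
    d = [c_hz / a for a in alphas]
    return first_center_hz + d[0] + 2.0 * sum(d[1:-1]) + d[-1]

def is_feasible(alphas, first_center_hz, lower, upper, f_max_hz=3400.0):
    """True if every factor is within its bounds and Eq.(8) holds."""
    in_bounds = all(lo <= a <= hi for a, lo, hi in zip(alphas, lower, upper))
    return in_bounds and upper_center_frequency(first_center_hz, alphas) <= f_max_hz

lower = [0.4] * 16                # hypothetical uniform bounds, not Table 3
upper = [5.0] * 16
narrow = [2.0] * 16               # all bands narrow: 200 + 30 * 75.5 = 2465 Hz
wide = [1.0] * 16                 # all bands wider: 200 + 30 * 151 = 4730 Hz
```

Under these assumptions is_feasible(narrow, 200.0, lower, upper) holds, whereas the wide set exceeds 3 400 Hz and would be rejected or penalized.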
The larger the population and the number of generations, the longer the computation time. For efficiency, the number of generations was set to three and the population size was fixed at 32, which is twice the number of scaling factors.
5 Results and discussion
The results are listed in Table 4 and Table 5. The differences between the OFB and the Bark-scale filter bank show that the OFB improves the recognition accuracy by at least 2.8% (clean, 50 words) and by 7.2% for clean, 10 words. The highest average increase occurs for 10 words, where the average improvement is 6.84%. The second highest is 4.62% for 20 words; for the other three vocabulary sizes the average improvements are similar, at around 3.5%.
Across SNRs, although the differences between the lower SNRs and the clean condition are not large, a higher increase can still be observed at 30 dB. However, for the larger dictionary of 50 words, the highest improvement is 4.9%, which occurs at 15 dB, while the lowest improvement, 2.8%, occurs for clean speech. This is due to the 4% decrease in the performance of the Bark-scale filter bank based recognition from clean 50 words to 15 dB 50 words. These facts support the hypothesis that OFB based recognition is more robust.
As the optimized filter bank was obtained separately for each speech corpus unit, such as 15 dB 10 words or 30 dB 30 words, the OFBs listed in Table 4 improve the recognition rate significantly. However, a major flaw is that they are not general filter banks, which may limit their practical use. Therefore, the average values provided in Table 4 were considered, which were formed by averaging the optimum scaling factors over the different SNRs for the same number of words in the dictionary. The results show a decrease in performance compared with the OFB because of the lower individual precision in the choice of optimum parameters. The larger the dictionary, the smaller the observed decrease in accuracy. Nevertheless, the average OFB still shows an improvement over the Bark-scale filter bank. Furthermore, the overall average was considered, i.e., the one labeled as ABFB in Table 4. The results show that the ABFB still clearly outperforms the Bark-scale filter bank: the minimum increase is 1.6% for 15 dB 30 words, and the maximum is 4.8% for clean 10 words.
Three generations were used in the GA, and the optimization stops either when the generation budget is exhausted or when the change in the objective value falls within the desired accuracy. In this algorithm, around 470 parameter sets are evaluated, of which the best one is chosen to constitute the OFB model. The overall average values of the 16 scaling factors are listed in Table 6. The frequency response of the adaptive bands filter bank is shown in Fig.6.
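The averaging step that turns several optimized parameter sets into one general set is simple enough to sketch. The function name and the numbers below are hypothetical placeholders (three sets of four factors instead of the 16 factors per set used in the paper); they are not the values of Table 6.

```python
def average_scaling_factors(factor_sets):
    """Channel-wise mean over several sets of scaling factors."""
    n_channels = len(factor_sets[0])
    if any(len(s) != n_channels for s in factor_sets):
        raise ValueError("all sets must have the same number of channels")
    return [sum(column) / len(factor_sets) for column in zip(*factor_sets)]

# Hypothetical optimized sets, e.g. one per SNR condition (placeholder values).
ofb_sets = [
    [2.0, 1.5, 1.0, 0.5],
    [1.0, 1.5, 1.0, 0.5],
    [3.0, 1.5, 4.0, 2.0],
]
abfb_like = average_scaling_factors(ofb_sets)
```

Averaging over SNRs for a fixed vocabulary size gives the intermediate average models, and averaging over all conditions gives a single set in the manner of the ABFB.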
Table 4 Results of different front-end filter banks with ZCPA and RBF (%)
Table 5 Differences between OFB and Bark (%)
Table 6 16 scaling factors constituting adaptive bands filter bank
Fig.6 Adaptive bands filter bank frequency response
The ABFB is more concentrated at low frequencies, where nine filters are distributed from 215 Hz to 1 000 Hz, compared with only seven in the Bark scale. Moreover, the upper limit of the center frequencies is around 3 500 Hz, compared with 4 252 Hz in the Bark case. In addition, the ABFB has symmetrical bands while the Bark-scale filter bank has unsymmetrical ones.
6 Conclusions
1) A novel method of designing front-end filter banks is proposed. It is based on spectrum analysis with signal processing principles rather than on perceptual criteria. The OFB is obtained by optimizing the basic 16-channel filter bank. The front-end filters are combined with the back-end network into a single unit for the optimization. The recognition rate is taken as the benchmark for designing the filters in the provided parametric model.
2) The OFB comprises 25 different filter bank models for the entire corpus, which is rather complicated to apply. The average filter bank model is therefore adopted, and it still shows superior performance compared with the Bark-scale filter bank model. There are five different models for 10, 20, 30, 40 and 50 words in different SNR conditions. The ABFB is obtained by averaging the scaling factors over the entire corpus, and it outperforms the Bark-scale one. It largely reduces the complexity and gives a model that is easy to apply. Although the ABFB is less elaborate in band shape design, it significantly outperforms the Bark-scale model.
3) The approach is simple and the experimental results are encouraging. However, the optimization process is time consuming. Optimization for 10 words typically takes 4.5 h, while for 50 words the required time is longer than 30 h. This is mainly due to the choice of the GA as the optimization procedure and to the specific objective function. Moreover, the ABFB strongly depends on the corpus and features, which may limit its general use in other fields. More research is needed to investigate its general applicability.
References
[1] ATAL B S. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification [J]. Journal of the Acoustical Society of America, 1974, 55(6): 1304-1312.
[2] DAVIS S, MERMELSTEIN P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences [J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1980, 28(4): 357-366.
[3] KIM D S, LEE S Y, KIL R M. Auditory processing of speech signals for robust speech recognition in real-world noisy environments [J]. IEEE Transactions on Speech and Audio Processing, 1999, 7(1): 55-69.
[4] JUANG B H, RABINER L R. Hidden Markov models for speech recognition [J]. Technometrics, 1991, 33(3): 251-272.
[5] BROOMHEAD D S, LOWE D. Multivariable functional interpolation and adaptive networks [J]. Complex Systems, 1988, 2(3): 321-355.
[6] SAYOUD H, OUAMOUR S. Speaker clustering of stereo audio documents based on sequential gathering process [J]. Journal of Information Hiding and Multimedia Signal Processing, 2010, 1(4): 344-360.
[7] HANDEL S. Listening: An introduction to the perception of auditory events [M]. Massachusetts: MIT Press, 1993: 461-546.
[8] STROPE B, ALWAN A. A model of dynamic auditory perception and its application to robust word recognition [J]. IEEE Transactions on Speech and Audio Processing, 1997, 5(5): 451-464.
[9] HOLMBERG M, GELBART D, HEMMERT W. Automatic speech recognition with an adaptation model motivated by auditory processing [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(1): 44-49.
[10] ZHANG Xue-ying, HUANG Li-xia, EVANGELISTA G. Warped filter banks used in noisy speech recognition [C]// Proceedings of Innovative Computing, Information and Control. Kaohsiung: IEEE, 2009: 1385-1388.
[11] HUANG Li-xia, ZHANG Xue-ying, EVANGELISTA G. Speaker independent recognition on OLLO French corpus by using different features [C]// Proceedings of Pervasive Computing, Signal Processing and Applications. Harbin: IEEE, 2010: 332-335.
[12] HUANG Hsiang-cheh, PAN Jeng-shyang, LU Zhe-ming, SUN Sheng-he, HANG Hsueh-ming. Vector quantization based on genetic simulated annealing [J]. Signal Processing, 2001, 81(7): 1513-1523.
[13] LI Xi, CAO Guang-yi, ZHU Xin-jian, WEI Dong. Identification and analysis based on genetic algorithm for proton exchange membrane fuel cell stack [J]. Journal of Central South University of Technology, 2006, 13(4): 428-431.
[14] YU Shou-yi, KUANG Su-qiong. Fuzzy adaptive genetic algorithm based on auto-regulating fuzzy rules [J]. Journal of Central South University of Technology, 2010, 17(1): 123-128.
[15] GOSSELIN L, TYE-GINGRAS M, MATHIEU-POTVIN F. Review of utilization of genetic algorithms in heat transfer problems [J]. International Journal of Heat and Mass Transfer, 2009, 52(9/10): 2169-2188.
[16] PRAKOTPOL D, SRINOPHAKUN T. GAPinch: genetic algorithm toolbox for water pinch technology [J]. Chemical Engineering and Processing, 2004, 43(2): 203-217.
(Edited by HE Yun-bin)
Foundation item: Project(61072087) supported by the National Natural Science Foundation of China; Project(20093048) supported by Shanxi Provincial Graduate Innovation Fund of China
Received date: 2010-12-24; Accepted date: 2011-02-02
Corresponding author: ZHANG Xue-ying, Professor, PhD; Tel: +86-351-6014942; E-mail: tyzhangxy@163.com