PERTANIKA JOURNAL OF SCIENCE AND TECHNOLOGY

 

e-ISSN 2231-8526
ISSN 0128-7680


Low Resource Malay Dialect Automatic Speech Recognition Modeling Using Transfer Learning from a Standard Malay Model

Tien-Ping Tan, Lei Qin, Sarah Flora Samson Juan and Jasmina Yen Min Khaw

Pertanika Journal of Science & Technology, Volume 32, Issue 4, July 2024

DOI: https://doi.org/10.47836/pjst.32.4.06

Keywords: Automatic speech recognition, Malay dialects, Malay language, transfer learning

Published on: 25 July 2024

Approaches to automatic speech recognition (ASR) have transitioned from Hidden Markov Model (HMM)-based systems to deep neural networks. The advantages of deep neural network approaches are that they can be developed quickly and perform well given large language resources. Nevertheless, dialect speech recognition remains challenging because of the limited resources available. Transfer learning approaches have been proposed to improve speech recognition in low-resource settings. In the first approach, a model is pre-trained on a large and diverse labeled dataset to learn the acoustic and language patterns in the speech signal. The pre-trained model is then fine-tuned on the low-resource language dataset, usually by freezing the pre-trained layers and training the remaining layers of the model on the low-resource language corpus. Another approach is to use a pre-trained model to extract compact and meaningful features as input to the encoder. Pre-training in this approach usually involves unsupervised learning methods that train models on large amounts of unlabeled data, enabling the model to learn general patterns and relationships in the input speech signals. This paper proposes a training recipe that uses transfer learning and Standard Malay models to improve automatic speech recognition for the Kelantan and Sarawak Malay dialects.
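The fine-tuning recipe described above (freeze the pre-trained layers, update only the remaining ones on dialect data) can be sketched with a minimal, dependency-free toy example. All names, layer counts, and numbers here are illustrative assumptions, not from the paper:

```python
# Toy sketch of transfer learning by layer freezing:
# the early layers of a "pre-trained" Standard Malay model are frozen,
# and only the remaining layers are updated on the dialect corpus.

class Layer:
    """A single model layer with one scalar weight (for illustration)."""
    def __init__(self, weight, frozen=False):
        self.weight = weight
        self.frozen = frozen

def sgd_step(layers, grads, lr=0.1):
    """One gradient step that skips frozen (pre-trained) layers."""
    for layer, grad in zip(layers, grads):
        if not layer.frozen:
            layer.weight -= lr * grad

# "Pre-trained" model with three layers
model = [Layer(1.0), Layer(2.0), Layer(3.0)]

# Freeze the first two layers, which hold the pre-trained representations
for layer in model[:2]:
    layer.frozen = True

# One fine-tuning step on (dummy) dialect-data gradients
sgd_step(model, grads=[0.5, 0.5, 0.5])

print([layer.weight for layer in model])  # → [1.0, 2.0, 2.95]
```

In a real system the frozen layers would be the encoder of a model such as a wav2vec 2.0-style network (e.g. via each parameter's `requires_grad` flag in PyTorch), but the update logic is the same: gradients are applied only to the unfrozen layers.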


Article ID

JST-4578-2023
