Language Identification tool
Language identification is the process of determining the primary language spoken in a set of audio input samples. It is an important and challenging problem in natural language processing (NLP). Many language-related tasks, such as typing a message on your phone, finding news articles you enjoy, or discovering answers to questions you may have, are powered by NLP models. To decide which model to use at a given moment, we must first perform language identification.
This article provides a detailed solution and code sample for language identification using Intel Neural Compressor, a tool that speeds up AI inference without sacrificing accuracy, and Intel Extension for PyTorch, which extends the popular PyTorch AI framework with optimizations for Intel processors.
The code sample shows how to use the Hugging Face datasets collection, the Hugging Face SpeechBrain toolkit, and Intel AI tools to optimize a model for language identification. By modifying the code sample, the user can use the Common Voice dataset to identify up to 133 languages.
Suggested Approaches to Language Identification
In the suggested method, the user trains a model with Intel AI Tools and runs inference with Intel-optimized PyTorch libraries. To speed up inference, the trained model can also be quantized using Intel Neural Compressor.
Language Identification dataset
For this code sample, the Common Voice dataset, specifically Common Voice Corpus 11.0 for Swedish and Japanese, is used. Using the Hugging Face SpeechBrain library, an Emphasized Channel Attention, Propagation and Aggregation Time Delay Neural Network (ECAPA-TDNN) is trained on this dataset. Time Delay Neural Networks (TDNNs), also known as one-dimensional Convolutional Neural Networks (1D CNNs), are multilayer artificial neural network architectures designed to model context at each network layer and classify patterns with shift invariance. Building on the original x-vector architecture, ECAPA-TDNN is a TDNN-based speaker-embedding extractor for speaker verification that emphasizes channel attention, propagation, and aggregation.
Execution
Download the Dataset
Hugging Face Datasets hosts version 11 of the Common Voice dataset, and the code sample includes a convenient way to download it. The dataset download script (dataset.py) can be executed with the following option:
Input Option | Description |
---|---|
--output_dir | Specify the data path. Default is /data/commonVoice |
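For example, the following command downloads the dataset to the default location (this assumes the script is run from the code sample's directory; adjust the output path as needed):
python dataset.py --output_dir /data/commonVoice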
Following the download of the Common Voice dataset, the data is divided into training, validation, and testing sets and preprocessed by converting the MP3 files into WAV format to prevent information loss.
To focus on the languages of interest, the Hugging Face SpeechBrain package is used to retrain a pretrained VoxLingua107 model using the Common Voice dataset. A speech dataset called VoxLingua107 is used to train models for spoken language recognition that perform well with diverse and real-world speech data. Data for 107 languages are included in this dataset. The Swedish and Japanese datasets will be used by default to fine-tune the model, but additional languages can be added.
Inference is then performed with this fine-tuned model on the testing dataset or on a user-specified dataset. Another option is to use SpeechBrain's Voice Activity Detection (VAD), in which only the speech segments are extracted from the audio files and concatenated, and samples are then chosen at random to feed into the model. All the resources required to run VAD are available at this link. To improve performance, the user can quantize the trained model to integer-8 (INT8) with Intel Neural Compressor to reduce latency.
Training
The training scripts are copied into the current working directory: train.py carries out the actual training, train_ecapa.yaml specifies the training options, and create_wds_shards.py constructs the WebDataset shards. The script that creates the WebDataset shards and the YAML file are set up for the two languages selected in this code sample.
During the data preparation stage, the prepareAllCommonVoice.py script randomly selects a specified number of samples and converts them from MP3 to WAV format. Eighty percent of these samples are used for training, ten percent for validation, and ten percent for testing. At least 2000 input samples are recommended, which is also the default amount.
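As a rough illustration of this preprocessing step, the sketch below randomly selects MP3 clips, converts them to 16 kHz mono WAV with ffmpeg, and splits them 80/10/10. The folder paths, sample rate, and the ffmpeg dependency are assumptions; the sample's prepareAllCommonVoice.py may use different options.
import os
import random
import subprocess

SRC_DIR = "/data/commonVoice/sv-SE/clips"  # hypothetical folder of MP3 clips
OUT_DIR = "/data/commonVoice/processed"    # hypothetical output folder
NUM_SAMPLES = 2000                         # at least 2000 samples is recommended

mp3_files = [f for f in os.listdir(SRC_DIR) if f.endswith(".mp3")]
random.shuffle(mp3_files)
selected = mp3_files[:NUM_SAMPLES]

# 80% training, 10% validation, 10% testing
n_train = int(0.8 * len(selected))
n_dev = int(0.1 * len(selected))
splits = {
    "train": selected[:n_train],
    "dev": selected[n_train:n_train + n_dev],
    "test": selected[n_train + n_dev:],
}

for split, files in splits.items():
    split_dir = os.path.join(OUT_DIR, split)
    os.makedirs(split_dir, exist_ok=True)
    for name in files:
        wav_name = os.path.splitext(name)[0] + ".wav"
        # Convert MP3 to 16 kHz mono WAV with ffmpeg.
        subprocess.run(
            ["ffmpeg", "-y", "-i", os.path.join(SRC_DIR, name),
             "-ar", "16000", "-ac", "1", os.path.join(split_dir, wav_name)],
            check=True,
        )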
WebDataset shards are then created from the training and validation datasets. Storing the audio files as tar archives makes it possible to write purely sequential I/O pipelines for large-scale deep learning and to achieve high I/O rates from local storage, roughly three to ten times faster than random access.
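As an illustration, the sketch below writes WAV files into WebDataset tar shards using the webdataset package. The shard pattern, key names, and label convention are assumptions; create_wds_shards.py in the sample may differ.
import glob
import webdataset as wds

wav_files = sorted(glob.glob("/data/commonVoice/processed/train/*.wav"))

# maxcount controls how many samples are packed into each .tar shard.
with wds.ShardWriter("train-shard-%06d.tar", maxcount=1000) as sink:
    for i, path in enumerate(wav_files):
        with open(path, "rb") as f:
            sink.write({
                "__key__": f"sample{i:08d}",  # unique key for each sample
                "wav": f.read(),              # raw audio bytes
                "txt": "sv",                  # hypothetical language label for this shard
            })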
The user then makes changes to the YAML file: setting the batch size, the number of epochs to train over the full dataset, the number of output neurons to match the number of languages of interest, and the largest shard number for the WebDataset shards. If the CPU or GPU runs out of memory while the training script is running, reduce the batch size.
In this code sample, the training script runs on the CPU, so "cpu" is passed as an input parameter when the script is executed. The configuration specified in train_ecapa.yaml is also passed as an argument.
The following command will launch the model training script:
python train.py train_ecapa.yaml --device "cpu"
With future releases of Intel Extension for PyTorch, the train.py training script can be modified to run on Intel GPUs, including the Intel Data Center GPU Flex Series, Intel Data Center GPU Max Series, and Intel Arc A-Series.
Run the training script to train the model. A 4th Generation Intel Xeon Scalable processor or later is recommended for this transfer-learning application because its Intel Advanced Matrix Extensions (Intel AMX) instruction set improves performance.
Checkpoint files are available once training completes; they are used to load the model for inference.
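As a rough sketch, the fine-tuned checkpoints can be loaded for inference with SpeechBrain's EncoderClassifier. The model directory, save directory, and audio file name below are placeholders; the sample's inference scripts wrap these steps in their own helper classes.
from speechbrain.pretrained import EncoderClassifier

# Load the fine-tuned language-ID model from its checkpoint directory.
language_id = EncoderClassifier.from_hparams(
    source="./lang_id_commonvoice_model",  # directory containing the trained checkpoints
    savedir="tmp_lang_id_model",
)

signal = language_id.load_audio("sample.wav")          # load and resample one audio clip
prediction = language_id.classify_batch(signal.unsqueeze(0))
print(prediction[3])                                   # predicted language label(s)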
Inference
Users have the option of doing inference using their own custom data in WAV format or the Common Voice testing set. The inference scripts (inference_custom.py and inference_commonVoice.py) can be executed with the following parameters:
Input Option | Description |
---|---|
-p | Specify the data path. |
-d | Specify the duration, in seconds, of each wave sample. The default value is 3. |
-s | Specify the number of sample waves. The default value is 100. |
--vad | (inference_custom.py only) Enable the VAD model to detect active speech. The VAD option identifies speech segments in the audio file and constructs a new .wav file containing only the speech segments. This improves the quality of the speech data used as input to the language identification model. |
--ipex | Run inference with optimizations from Intel Extension for PyTorch. This option applies optimizations to the pretrained model and should improve latency. |
--ground_truth_compare | (inference_custom.py only) Enable comparison of prediction labels to ground-truth values. |
--verbose | Print additional debug information, such as latency. |
It is necessary to specify the path to the data. By default, the language identification model will use 100 randomly chosen 3-second audio samples from the original audio file as input.
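For example, to run inference on the Common Voice test set with the default duration and sample count, a command along these lines can be used (the path is a placeholder):
python inference_commonVoice.py -p <path_to_test_data>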
With the --vad option, the audio samples are processed by a small Convolutional Recurrent Deep Neural Network (CRDNN) pretrained on the LibriParty dataset, which outputs the segments in which speech activity is detected.
The CRDNN model provides the timestamps where speech is identified, and these timestamps are then used to create a new, shorter audio file containing only speech. Sampling from this new audio file gives a better estimate of the predominant language spoken.
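A minimal sketch of this voice activity detection step with SpeechBrain's pretrained CRDNN is shown below; the input file name is a placeholder, and inference_custom.py performs the equivalent steps internally when --vad is passed.
from speechbrain.pretrained import VAD

# Download and load the CRDNN VAD model pretrained on LibriParty.
vad = VAD.from_hparams(
    source="speechbrain/vad-crdnn-libriparty",
    savedir="pretrained_models/vad-crdnn-libriparty",
)

# Returns the start and end timestamps (in seconds) of the detected speech segments.
boundaries = vad.get_speech_segments("input_audio.wav")
print(boundaries)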

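Putting these options together, the following command runs inference on custom data with voice activity detection enabled:
python inference_custom.py -p data_custom -d 3 -s 50 --vad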
This performs inference on the data you supply in the data_custom folder, using 50 randomly selected 3-second audio samples with voice activity detection.
Download the Common Voice Corpus 11.0 datasets for other languages if you wish to run the code example in other languages.
Optimizations with Intel Extension for PyTorch and Intel Neural Compressor
Intel Extension for PyTorch
For an additional performance boost on Intel hardware, Intel Extension for PyTorch adds up-to-date features and optimizations to PyTorch. See the Intel Extension for PyTorch installation instructions. The extension can be loaded as a Python module or linked as a C++ library; Python users can enable it dynamically by importing intel_extension_for_pytorch.
- The CPU tutorial covers Intel Extension for PyTorch for Intel CPUs in detail. The source code is on the master branch.
- The GPU tutorial covers Intel Extension for PyTorch for Intel GPUs in detail. The source code is on the xpu-master branch.
To optimize the model for inference with Intel Extension for PyTorch, pass the --ipex option. The extension optimizes the model, and TorchScript is used so that PyTorch runs in graph mode, which speeds up inference. With this optimization, the command to run is:
python inference_custom.py -p data_custom -d 3 -s 50 --vad --ipex --verbose
Note: The --verbose option is required to view the latency measurements.
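The sketch below illustrates, under stated assumptions, the kind of optimization the --ipex option applies: ipex.optimize on an eval-mode model followed by TorchScript tracing so inference runs in graph mode. The tiny placeholder model and input stand in for the trained language-ID network.
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex

# Placeholder model and input standing in for the trained language-ID model.
model = nn.Sequential(nn.Linear(80, 64), nn.ReLU(), nn.Linear(64, 2)).eval()
example_input = torch.randn(1, 80)

# Apply Intel Extension for PyTorch optimizations to the eval-mode model.
model = ipex.optimize(model)

# TorchScript tracing lets PyTorch run in graph mode, which speeds up inference.
with torch.no_grad():
    traced = torch.jit.trace(model, example_input)
    traced = torch.jit.freeze(traced)
    output = traced(example_input)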
A later version of the code example will include support for auto-mixed precision, such as bfloat16 (BF16).
The Intel Neural Compressor
Intel Neural Compressor is an open-source Python library that runs on both CPUs and GPUs. It:
- Performs model quantization to reduce model size and speed up deep learning inference for deployment.
- Automates popular techniques such as quantization, compression, pruning, and knowledge distillation across multiple deep learning frameworks.
- Is included with the AI Kit.
The quantize_model.py script can be used to quantize the model from float32 (FP32) precision to integer-8 (INT8) by giving it the model's directory and a validation dataset. This INT8 model can be loaded for inference using the code below:
from neural_compressor.utils.pytorch import load
# Load the INT8 model produced by quantize_model.py; the original FP32 model
# (self.language_id) is required to reconstruct the quantized model.
model_int8 = load("./lang_id_commonvoice_model_INT8", self.language_id)
signal = self.language_id.load_audio(data_path)
prediction = model_int8(signal)
Keep in mind that the original model is required when loading the quantized model. The following command uses quantize_model.py to quantize the trained model from FP32 to INT8:
python quantize_model.py -p ./lang_id_commonvoice_model -datapath $COMMON_VOICE_PATH/commonVoiceData/commonVoice/dev
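As a rough sketch of what quantize_model.py automates, post-training static quantization with the Intel Neural Compressor 2.x API looks like the code below. The placeholder model and calibration DataLoader stand in for the trained language-ID model and a slice of the validation set; the sample script's actual configuration may differ.
import torch
from torch.utils.data import DataLoader, TensorDataset
from neural_compressor import PostTrainingQuantConfig, quantization

# Placeholder FP32 model and calibration data.
model = torch.nn.Sequential(torch.nn.Linear(80, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2)).eval()
calib_data = TensorDataset(torch.randn(32, 80), torch.zeros(32, dtype=torch.long))
calib_loader = DataLoader(calib_data, batch_size=8)

# Post-training static quantization: calibrate on representative data, then save the INT8 model.
conf = PostTrainingQuantConfig(approach="static")
q_model = quantization.fit(model=model, conf=conf, calib_dataloader=calib_loader)
q_model.save("./lang_id_commonvoice_model_INT8")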