Explaining Speech Classification Models with Feature Attribution

Author: Gaofei Shen

This notebook is part of the Interspeech 2025 tutorial on Interpretability Techniques for Speech Models. It explores how feature attribution methods can be applied to speech models.

#@title Installing Python packages
# %%capture
from IPython.display import clear_output

# Utility script for updating ffmpeg if needed
# !bash <(curl -s https://raw.githubusercontent.com/XniceCraft/ffmpeg-colab/master/install)
# clear_output()

!pip install uv
clear_output()
!uv pip install --system -U torch==2.8.0 torchaudio==2.8.0 torchvision praat-textgrids captum seaborn "datasets<=3.6.0" transformers
clear_output()

########## Checking numpy version ###########
import numpy as np
print("Current Numpy version is:", np.__version__)
if int(np.__version__.split(".")[0]) > 1:  # Parse the full major version, not just its first character
  import os
  # Kill the colab kernel process to force restart
  os.kill(os.getpid(), 9)
########## End checking numpy version #######

The Colab runtime will automatically crash after running the code cell above.

Don’t worry! This is expected: as of July 2025, Colab ships with numpy==2.0.2, which is not compatible with the captum library that we’re using for feature attribution.

The Colab runtime should restart automatically. After the runtime restarts, you can safely proceed to the following code cells.
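
The version check in the install cell can be sketched as a small standalone function. This is only an illustration: the name `needs_numpy_downgrade` is hypothetical and not part of the notebook, and the version strings other than 2.0.2 are made-up inputs. It shows why splitting on the dot is safer than reading only the first character of the version string.

```python
def needs_numpy_downgrade(version: str) -> bool:
    """Return True when the major version in `version` is 2 or higher."""
    # Split on "." so that a multi-digit major version is parsed correctly.
    major = int(version.split(".")[0])
    return major > 1


print(needs_numpy_downgrade("2.0.2"))   # the Colab default mentioned above
print(needs_numpy_downgrade("1.26.4"))  # a 1.x release, no restart needed
```

After the runtime restarts, the same function can be called with `np.__version__` to confirm that a 1.x release is active.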

#@title Setting up environments and resource files
# Check if the tar exists already
from IPython.display import clear_output

import os
if not os.path.isfile('/content/feature_attribution.tgz'):
  print("Resources are not downloaded yet! Downloading now")
  !gdown 16fV8OGuWw9tVe8QarQhakJJOKnJN_yKE
  clear_output()
# Check if the tar has been extracted already
if not os.path.isdir('/content/interspeech_2025_interp_tutorial/'):
  print("Resource files haven't been extracted yet! Extracting now")
  !tar -xzvf feature_attribution.tgz
  clear_output()
# Change directory into the directory with the resource files
%cd /content/interspeech_2025_interp_tutorial/

/content/interspeech_2025_interp_tutorial

import copy

import IPython.display as ipd
import matplotlib.pyplot as plt
import textgrids
import torch
from datasets import load_from_disk
from transformers import AutoModelForAudioClassification

from helper import (
    WrapperModelMultiHead,
    attribute,
    choose_attr_method,
    align_score_with_forced_alignment,
)

The example dataset

Similar to the implementation of Pastor et al. (2023), we are using the Fluent Speech Commands (FSC) dataset. I have sampled 100 examples from the test split of the FSC dataset and formatted them in the Hugging Face dataset format.

Pastor, E., Koudounas, A., Attanasio, G., Hovy, D., & Baralis, E. (2023, September 14). Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features. https://doi.org/10.48550/arXiv.2309.07733

#@title Loading dataset
dataset = load_from_disk("FSC_test_sampled")
# Shuffle the dataset to ensure a random selection of examples
dataset = dataset.shuffle(seed=42)
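
To get a feel for what one example contains, here is a minimal sketch that summarizes a single record. The field names (`audio`, `transcription`, `action`, `object`, `location`, `textgrid`) are the ones accessed by the cells in this notebook, and 16000 Hz is the rate used for audio playback. The function `describe_example` and the `toy_example` record are hypothetical stand-ins with made-up values, so the sketch runs without the downloaded resources.

```python
SAMPLE_RATE = 16_000  # Hz, the rate used when playing audio in this notebook


def describe_example(example: dict) -> dict:
    """Summarize duration, aligned words, and integer labels of one example."""
    n_samples = len(example["audio"]["array"])
    # Empty "text" entries in the alignment mark silence, so they are skipped.
    words = [w["text"] for w in example["textgrid"]["words"] if w["text"]]
    return {
        "duration_s": n_samples / SAMPLE_RATE,
        "words": words,
        "labels": (example["action"], example["object"], example["location"]),
    }


# Hypothetical record: the audio, timings, and label ids are invented.
toy_example = {
    "audio": {"array": [0.0] * 32_000},
    "transcription": "Turn down the volume",
    "action": 1,
    "object": 2,
    "location": 0,
    "textgrid": {
        "words": [
            {"text": "", "xmin": 0.0, "xmax": 0.2},
            {"text": "turn", "xmin": 0.2, "xmax": 0.5},
            {"text": "down", "xmin": 0.5, "xmax": 0.9},
            {"text": "the", "xmin": 0.9, "xmax": 1.1},
            {"text": "volume", "xmin": 1.1, "xmax": 1.8},
        ]
    },
}

print(describe_example(toy_example))
```

With the real data, `describe_example(dataset[0])` should work in the same way, assuming the stored record has the structure shown here.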

The example model

We will be using a model that was fine-tuned as part of Shen et al. (2025). This model uses three classifier heads for the three intent “slots” in the FSC dataset. This implementation differs slightly from the commonly used, publicly available intent classification models.

Shen, G., Mohebbi, H., Bisazza, A., Alishahi, A., & Chrupała, G. (2025). On the reliability of feature attribution methods for speech classification. arXiv preprint arXiv:2505.16406.

#@title Load the pre-trained model from Hugging Face
# This model is a multi classifier head, multi class classification model fine-tuned on the FSC-IC dataset
# It has three heads for action, object, and location classification
# The model is based on Wav2Vec2 and uses a multi-label classification approach with a customized model class.

model = AutoModelForAudioClassification.from_pretrained(
    "techsword/wav2vec2-base-fsc-ic-multiheadmultilabel", trust_remote_code=True
)

device = torch.accelerator.current_accelerator() if torch.accelerator.is_available() else "cpu"
model.to(device)
model.eval()

A new version of the following files was downloaded from https://huggingface.co/techsword/wav2vec2-base-fsc-ic-multiheadmultilabel:
- configuration_wav2vec2multihead.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.

A new version of the following files was downloaded from https://huggingface.co/techsword/wav2vec2-base-fsc-ic-multiheadmultilabel:
- modeling_wav2vec2multihead.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.

Wav2Vec2ForMultiHeadMultiLabelClassification(
  (wav2vec2): Wav2Vec2Model(
    (feature_extractor): Wav2Vec2FeatureEncoder(
      (conv_layers): ModuleList(
        (0): Wav2Vec2GroupNormConvLayer(
          (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,), bias=False)
          (activation): GELUActivation()
          (layer_norm): GroupNorm(512, 512, eps=1e-05, affine=True)
        )
        (1-4): 4 x Wav2Vec2NoLayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
          (activation): GELUActivation()
        )
        (5-6): 2 x Wav2Vec2NoLayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
          (activation): GELUActivation()
        )
      )
    )
    (feature_projection): Wav2Vec2FeatureProjection(
      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (projection): Linear(in_features=512, out_features=768, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): Wav2Vec2Encoder(
      (pos_conv_embed): Wav2Vec2PositionalConvEmbedding(
        (conv): ParametrizedConv1d(
          768, 768, kernel_size=(128,), stride=(1,), padding=(64,), groups=16
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (padding): Wav2Vec2SamePadLayer()
        (activation): GELUActivation()
      )
      (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (layers): ModuleList(
        (0-11): 12 x Wav2Vec2EncoderLayer(
          (attention): Wav2Vec2Attention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (dropout): Dropout(p=0.1, inplace=False)
          (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (feed_forward): Wav2Vec2FeedForward(
            (intermediate_dropout): Dropout(p=0.0, inplace=False)
            (intermediate_dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
            (output_dense): Linear(in_features=3072, out_features=768, bias=True)
            (output_dropout): Dropout(p=0.1, inplace=False)
          )
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
  )
  (dropout): Dropout(p=0.0, inplace=False)
  (classifier1): Linear(in_features=768, out_features=6, bias=True)
  (classifier2): Linear(in_features=768, out_features=14, bias=True)
  (classifier3): Linear(in_features=768, out_features=4, bias=True)
)

You can see at the bottom of the output from the cell above that this custom model ends in three classifiers (classifier1, classifier2, and classifier3, with 6, 14, and 4 outputs). Each classifier is responsible for one intent classification target in the Fluent Speech Commands dataset.
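
The printed feature extractor also tells us how many waveform samples each CNN embedding frame covers. The sketch below reads the kernel sizes and strides off the printout above and assumes no padding and no dilation (none is shown in the printed layers). The function names are hypothetical.

```python
# (kernel_size, stride) pairs read off the printed conv_layers above:
# layer 0, then layers 1-4, then layers 5-6.
CONV_LAYERS = [(10, 5)] + [(3, 2)] * 4 + [(2, 2)] * 2


def total_stride(layers):
    """Number of waveform samples between two consecutive output frames."""
    stride = 1
    for _, layer_stride in layers:
        stride *= layer_stride
    return stride


def num_frames(num_samples, layers):
    """Output length of the conv stack, assuming no padding or dilation."""
    length = num_samples
    for kernel, layer_stride in layers:
        length = (length - kernel) // layer_stride + 1
    return length


print(total_stride(CONV_LAYERS))        # 320
print(num_frames(16_000, CONV_LAYERS))  # 49
```

A stride of 320 samples at 16000 Hz corresponds to 20 ms per frame. One second of audio yields 49 frames rather than exactly 50 because the kernels need some context at the edges.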

Let’s make sure the model works!

We will run the model on five examples from our sampled dataset and compare its predictions with the real labels.

# Let's check the model's performance on the first 5 examples from the dataset
for example in dataset.select(range(5)):  # type: ignore
    audio = example["audio"]["array"]  # type: ignore
    audio = torch.tensor(audio, dtype=torch.float).unsqueeze(0).to(device)
    with torch.no_grad():
        output = model(audio)

    # We can save the three logit values for each of the classifiers
    logits1 = output.logits1
    logits2 = output.logits2
    logits3 = output.logits3

    class_1 = logits1.argmax(dim=-1).item()
    class_2 = logits2.argmax(dim=-1).item() + model.config.num_labels_1
    class_3 = (
        logits3.argmax(dim=-1).item()
        + model.config.num_labels_1
        + model.config.num_labels_2
    )

    ipd.display(ipd.Audio(audio.squeeze().cpu().numpy(), rate=16000))
    print(f"Transcription: {example['transcription']}")  # type: ignore
    print(
        f"Predicted labels: {model.config.id2label[class_1]}, {model.config.id2label[class_2]}, {model.config.id2label[class_3]}"
    )

    real_class_1 = example["action"]  # type: ignore
    real_class_2 = example["object"] + model.config.num_labels_1  # type: ignore
    real_class_3 = (
        example["location"] + model.config.num_labels_1 + model.config.num_labels_2  # type: ignore
    )
    print(
        f"Real labels: {model.config.id2label[real_class_1]}, {model.config.id2label[real_class_2]}, {model.config.id2label[real_class_3]}"
    )
    print("-" * 50)

Transcription: I couldn't hear anything, turn up the volume
Predicted labels: increase, volume, none_location
Real labels: increase, volume, none_location
--------------------------------------------------

Transcription: Fetch the newspaper
Predicted labels: bring, newspaper, none_location
Real labels: bring, newspaper, none_location
--------------------------------------------------

Transcription: Turn the bedroom heat down
Predicted labels: decrease, heat, bedroom
Real labels: decrease, heat, bedroom
--------------------------------------------------

Transcription: Switch the lights on
Predicted labels: activate, lights, none_location
Real labels: activate, lights, none_location
--------------------------------------------------

Transcription: Turn down the volume
Predicted labels: decrease, volume, none_location
Real labels: decrease, volume, none_location
--------------------------------------------------
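
The loop above adds `num_labels_1` (and `num_labels_2`) to the per-head argmax before looking up `id2label`, because the three heads share one label table. A minimal sketch of that offset arithmetic follows. It assumes the head sizes 6, 14, and 4 from the printed classifiers correspond to `num_labels_1`, `num_labels_2`, and the third head; the function `to_global_ids` is hypothetical.

```python
from itertools import accumulate

HEAD_SIZES = (6, 14, 4)  # out_features of classifier1, classifier2, classifier3


def to_global_ids(head_indices, head_sizes=HEAD_SIZES):
    """Map one argmax index per head to positions in a shared label table."""
    # Each head starts where the previous heads end: offsets are [0, 6, 20].
    offsets = [0, *accumulate(head_sizes)][:-1]
    return tuple(index + offset for index, offset in zip(head_indices, offsets))


print(to_global_ids((2, 3, 1)))  # (2, 9, 21)
```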

The Feature attribution part!

We follow a procedure similar to the one laid out in the Captum tutorial for BERT models.
The following code cell defines a helper function that runs the feature attribution pipeline and then plots the results.
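
Before running the pipeline, it may help to see what a gradient-based method such as Integrated Gradients computes. The sketch below is a pure-Python illustration on a toy quadratic model with a hand-written gradient. It approximates the path integral with a midpoint Riemann sum, and it is not how Captum or helper.py implement the method. All names (`integrated_gradients`, `toy_model`, `toy_grad`, `WEIGHTS`) are hypothetical.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    diffs = [xi - bi for xi, bi in zip(x, baseline)]
    grad_sums = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th interval in [0, 1]
        point = [bi + alpha * d for bi, d in zip(baseline, diffs)]
        grad_sums = [s + g for s, g in zip(grad_sums, grad_fn(point))]
    # Attribution = (input - baseline) * average gradient along the path.
    return [d * s / steps for d, s in zip(diffs, grad_sums)]


WEIGHTS = [0.5, -1.0, 2.0]  # made-up weights for the toy model


def toy_model(x):
    """Toy scalar model: a weighted sum of squared inputs."""
    return sum(w * xi ** 2 for w, xi in zip(WEIGHTS, x))


def toy_grad(x):
    """Analytic gradient of toy_model with respect to each input."""
    return [2 * w * xi for w, xi in zip(WEIGHTS, x)]


x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline)

# Completeness: the attributions add up to f(x) - f(baseline).
print(sum(attributions), toy_model(x) - toy_model(baseline))
```

In the notebook, the inputs are waveform samples, spectrogram bins, or CNN embedding frames, and the gradients come from automatic differentiation rather than a hand-written formula.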

#@title Helper functions
input_type_lookup = {
    "input": "Waveform",
    "spec": "Spectrogram",
    "embedding": "CNN Embedding",
}
attr_method_lookup = {
    "ig": "Integrated Gradients",
    "saliency": "Saliency",
    "lime": "LIME",
    "occlusion": "Occlusion",
    "featureablation": "Feature Ablation"
}


def plot_feat_attribution(
    subtask="action",
    input_type="input",
    attr_method_name="ig",
    random_seed=42,
    word_level=False,
):
    """Plot the feature attribution for a given subtask, input type, and attribution method."""

    wrapper_model = WrapperModelMultiHead(model, input_type=input_type)
    attr_method = choose_attr_method(attr_method_name, wrapper_model)


    attr_method_fullname = attr_method_lookup[attr_method_name]
    input_type_fullname = input_type_lookup[input_type]

    renamed_dataset = copy.deepcopy(dataset).rename_column(subtask, "label")
    # Shuffle the renamed_dataset to ensure a random selection of examples
    renamed_dataset = renamed_dataset.shuffle(random_seed)

    classifier_to_idx = {"action": 0, "object": 1, "location": 2}
    additional_forward_args = (classifier_to_idx[subtask],)

    example = renamed_dataset[0]  # Get the first example from the dataset

    # Run feature attribution on the sample
    results = attribute(
        attr_method,
        example,
        batch_size=8,
        additional_forward_args=additional_forward_args,
        word_level=word_level,
    ).squeeze()

    if input_type != "input":
        # We average over the feature dimension if the input type is not waveform
        results = results.mean(-1)

    real_label = dataset.features[subtask].int2str(example["label"])  # type: ignore

    if word_level:
        # If word_level is True, we need to summarize the attribution scores over the
        # words using the timestamp information from the textgrid
        word_level_results = align_score_with_forced_alignment(
            example, attributions=results, input_type=input_type
        )

        # Plot the attribution scores for word-level attribution
        # Word level results is a list of tuples (word, score)
        plt.figure(figsize=(12, 6), dpi=150)
        plt.bar(
            [word[0] for word in word_level_results],
            [word[1] for word in word_level_results],
            color="blue",
        )

        # Put the word labels on the x-axis
        plt.xticks(rotation=45, ha="right")
        plt.title(
            f"Attribution for {subtask} classification in FSC dataset, real label: {real_label}. \n Input type: {input_type_fullname}, Attribution method: {attr_method_fullname}"
        )
        plt.xlabel("Words")
        plt.ylabel("Attribution score")
        plt.tight_layout()
        plt.show()

    else:  # Plotting the attribution scores for frame-level attribution
        plt.figure(figsize=(12, 6), dpi=150)
        plt.plot(results)
        # alignments = textgrids.TextGrid(example["textgrid_path"])
        alignments = example["textgrid"]  # type: ignore
        # Use the alignments to plot some vertical red lines for each word in the alignment
        for word in alignments["words"]:  # type: ignore
            if len(word["text"]) == 0:
                # Skip silence
                continue
            start = word["xmin"] * 16000  # Convert timestamp from seconds to samples
            end = word["xmax"] * 16000  # Convert timestamp from seconds to samples
            if input_type != "input":
                # For frame-level inputs, convert sample indices to frame indices.
                # wav2vec2 has a 20 ms stride: 16000 samples per second / 50 frames
                # per second = 320 samples per frame.
                start = start // 320
                end = end // 320

            # print(f"Word: {word['text']}, Start: {start}, End: {end}")
            plt.plot(
                [start, start],
                [results.min(), results.max()],
                color="red",
                linewidth=0.5,
            )
            # Add the text of the word centered between the start and end times
            plt.text(
                (start + end) / 2,
                results.max(),
                word["text"],
                color="brown",
                fontsize=8,
                horizontalalignment="center",
                verticalalignment="bottom",
            )

        plt.title(
            f"Attribution for {subtask} classification in FSC dataset, real label: {real_label}. \n Input type: {input_type_fullname}, Attribution method: {attr_method_fullname}"
        )
        plt.xlabel("Time (samples or frames)")
        plt.ylabel("Attribution score")

        plt.show()

After setting up everything, we can now test the effects of different variables in the speech model feature attribution setup.

subtask is one of the three intent slots (action, object, or location) that the model can classify.

Input_Signal_for_FA is the type of input we want to attribute back to: the waveform, the spectrogram, or the CNN embedding.

Attribution_Method selects the attribution method. The options include perturbation-based methods such as “Feature Ablation” and gradient-based methods such as “Integrated Gradients”.

Random_seed controls which example is sampled for feature attribution.

word_level does two things. For perturbation-based methods, it controls whether entire word chunks are perturbed at once. It also controls the summarization step: the word-level alignment timestamps are used to pool feature attribution scores into one score per word.
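
The summarization step can be sketched as follows. This is an assumption-laden illustration, not the code of `align_score_with_forced_alignment` in helper.py: the function `pool_scores_by_word` is hypothetical, it pools by summing (the helper may use a different reduction), and the scores and timings are made up. The 16000 Hz sample rate and the 320 samples per frame come from the plotting code above.

```python
SAMPLE_RATE = 16_000      # samples per second
SAMPLES_PER_FRAME = 320   # wav2vec2 stride for frame-level inputs


def pool_scores_by_word(scores, words, input_type="input"):
    """Sum attribution scores inside each word's time span."""
    pooled = []
    for word in words:
        if not word["text"]:
            continue  # skip silence, as the plotting code does
        # round() avoids off-by-one errors from floating-point timestamps.
        start = round(word["xmin"] * SAMPLE_RATE)
        end = round(word["xmax"] * SAMPLE_RATE)
        if input_type != "input":
            # Frame-level inputs are indexed in frames, not samples.
            start //= SAMPLES_PER_FRAME
            end //= SAMPLES_PER_FRAME
        pooled.append((word["text"], sum(scores[start:end])))
    return pooled


# Made-up frame-level scores (6 frames) and alignment.
toy_scores = [0.0, 1.0, 3.0, 2.0, 2.0, 1.0]
toy_words = [
    {"text": "", "xmin": 0.0, "xmax": 0.02},
    {"text": "turn", "xmin": 0.02, "xmax": 0.06},
    {"text": "up", "xmin": 0.06, "xmax": 0.12},
]

print(pool_scores_by_word(toy_scores, toy_words, input_type="embedding"))
# [('turn', 4.0), ('up', 5.0)]
```

The result is a list of (word, score) tuples, which is the shape the bar plot in the helper cell expects.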

When running the code cell below, feel free to copy-and-paste or duplicate the cell so you can compare across different subtasks, input signals and/or attribution methods!

# Choose the subtask within the intent classification task
subtask = "action"  #@param ["action", "object", "location"]

# Choose the input types
Input_Signal_for_FA = "CNN Embedding" #@param ["Waveform", "Spectrogram", "CNN Embedding"]
input_type = {v: k for k, v in input_type_lookup.items()}[Input_Signal_for_FA]

# Choose what attribution method you want to use
Attribution_Method = "Integrated Gradients" #@param ["Integrated Gradients", "Saliency", "LIME", "Feature Ablation"]
attr_method_name = {v: k for k, v in attr_method_lookup.items()}[Attribution_Method]

# Seed for random selection of examples
random_seed = 85 #@param {type: "slider", min: 0, max: 999}
# Whether to use word-level attribution or not
word_level = False  #@param {type: "boolean"}

# Use feature attribution to find out which parts of the audio are important for the prediction
plot_feat_attribution(
    subtask=subtask,
    input_type=input_type,
    attr_method_name=attr_method_name,
    word_level=word_level,
    random_seed=random_seed,
)

[Figure: frame-level attribution scores for the action subtask (input type: CNN Embedding, attribution method: Integrated Gradients), with word boundaries marked]

# Choose the subtask within the intent classification task
subtask = "action"  #@param ["action", "object", "location"]

# Choose the input types
Input_Signal_for_FA = "CNN Embedding" #@param ["Waveform", "Spectrogram", "CNN Embedding"]
input_type = {v: k for k, v in input_type_lookup.items()}[Input_Signal_for_FA]

# Choose what attribution method you want to use
Attribution_Method = "Integrated Gradients" #@param ["Integrated Gradients", "Saliency", "LIME", "Feature Ablation"]
attr_method_name = {v: k for k, v in attr_method_lookup.items()}[Attribution_Method]

# Seed for random selection of examples
random_seed = 85 #@param {type: "slider", min: 0, max: 999}
# Whether to use word-level attribution or not
word_level = True  #@param {type: "boolean"}

# Use feature attribution to find out which parts of the audio are important for the prediction
plot_feat_attribution(
    subtask=subtask,
    input_type=input_type,
    attr_method_name=attr_method_name,
    word_level=word_level,
    random_seed=random_seed,
)

[Figure: word-level attribution scores for the action subtask (input type: CNN Embedding, attribution method: Integrated Gradients), one bar per word]
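
For comparison with the gradient-based runs above, here is a minimal sketch of what a perturbation-based method does when word_level is enabled: each word segment is replaced by a baseline value, and the drop in the model score is taken as that word's attribution. The function `ablate_segments`, the toy scorer, and the segment boundaries are all hypothetical. The real pipeline perturbs the model input through Captum and the helper functions instead.

```python
def ablate_segments(score_fn, signal, segments, baseline=0.0):
    """Attribute to each segment the score drop caused by ablating it."""
    full_score = score_fn(signal)
    attributions = []
    for name, start, end in segments:
        perturbed = list(signal)  # copy so the original stays untouched
        perturbed[start:end] = [baseline] * (end - start)
        attributions.append((name, full_score - score_fn(perturbed)))
    return attributions


def toy_score(signal):
    """Toy 'model' that only looks at the energy in positions 4 to 7."""
    return sum(value * value for value in signal[4:8])


toy_signal = [1.0] * 12
toy_segments = [("turn", 0, 4), ("up", 4, 8), ("volume", 8, 12)]

print(ablate_segments(toy_score, toy_signal, toy_segments))
# [('turn', 0.0), ('up', 4.0), ('volume', 0.0)]
```

Only the segment the toy scorer depends on receives a non-zero score. Switching Attribution_Method to “Feature Ablation” in the cells above lets you check whether the real model behaves similarly for the words you expect to matter.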

Last words

Feel free to open the Files tab on the left and look in the interspeech_2025_interp_tutorial folder to see what the actual attribute function looks like in the helper.py file.