Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features

Kyle Janse van Rensburg, Benjamin van Niekerk, Herman Kamper

[arXiv]

How do speech models trained through self-supervised learning structure their representations? Previous studies have looked at how information is encoded in feature vectors across different layers. But few studies have considered whether speech characteristics are captured within individual dimensions of SSL features. In this paper we specifically look at speaker information using PCA on utterance-averaged representations. For a range of SSL models, we find that the principal dimension that explains most variance encodes pitch and associated characteristics like gender. Other individual principal dimensions correlate with intensity, noise levels, the second formant, and higher frequency characteristics. We then use synthesis analyses to show that the dimensions for most characteristics are isolated from each other's influence. We further show that characteristics can be changed by manipulating the corresponding dimensions.
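
The analysis described above can be illustrated with a minimal sketch: fit PCA on utterance-averaged features, then correlate each principal dimension with a per-utterance characteristic. Everything here is an assumption for illustration only: the data is synthetic, the hidden `pitch` variable stands in for a measured characteristic such as mean pitch, and the helper names `fit_pca` and `correlate_dimensions` are hypothetical, not the authors' code.

```python
import numpy as np

def fit_pca(utterance_means):
    """Fit PCA on utterance-averaged features of shape (n_utterances, dim)."""
    mean = utterance_means.mean(axis=0)
    centered = utterance_means - mean
    # Rows of vt are principal directions, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    # Standard deviation of the data along each principal direction.
    stds = singular_values / np.sqrt(len(utterance_means) - 1)
    return mean, vt, stds

def correlate_dimensions(utterance_means, mean, components, characteristic):
    """Pearson correlation of every principal dimension with one characteristic."""
    scores = (utterance_means - mean) @ components.T
    return np.array([
        np.corrcoef(scores[:, i], characteristic)[0, 1]
        for i in range(scores.shape[1])
    ])

# Synthetic stand-in for utterance-averaged SSL features (assumption):
# a hidden "pitch" variable drives most of the variance along one direction.
rng = np.random.default_rng(0)
n_utterances, dim = 200, 16
pitch = rng.normal(size=n_utterances)
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
features = 5.0 * pitch[:, None] * direction + 0.5 * rng.normal(size=(n_utterances, dim))

mean, components, stds = fit_pca(features)
correlations = correlate_dimensions(features, mean, components, pitch)
print("strongest correlation is with principal dimension",
      int(np.argmax(np.abs(correlations))) + 1)
```

The sign of a principal direction is arbitrary, so the absolute value of the correlation is the quantity to compare across dimensions.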

We now demonstrate the result of changing a principal dimension of an utterance to control a specific speaker characteristic. All utterances are from the LibriSpeech dev-clean and test-clean subsets.
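
As a rough sketch of what "N standard deviations higher" could mean in practice, the snippet below shifts an utterance along one principal direction by a multiple of that dimension's standard deviation. This is an assumption-laden illustration: the page does not specify how the edit is applied to frame-level features or how audio is resynthesised, so here the same offset is simply added to every frame, the data is random, and `shift_principal_dimension` is a hypothetical helper name.

```python
import numpy as np

def shift_principal_dimension(frames, components, stds, index, num_stds):
    """Shift all frames along one principal direction.

    frames:     (n_frames, dim) features of a single utterance
    components: (n_components, dim) orthonormal principal directions
    stds:       standard deviation of the data along each direction
    index:      zero-based dimension ("Principal dimension 1" is index 0)
    num_stds:   signed number of standard deviations to move
    """
    offset = num_stds * stds[index] * components[index]
    return frames + offset

# Random stand-ins for utterance-averaged features and one utterance (assumption).
rng = np.random.default_rng(0)
utterance_means = rng.normal(size=(100, 8))
pca_mean = utterance_means.mean(axis=0)
_, singular_values, components = np.linalg.svd(
    utterance_means - pca_mean, full_matrices=False
)
stds = singular_values / np.sqrt(len(utterance_means) - 1)

frames = rng.normal(size=(50, 8))
shifted = shift_principal_dimension(frames, components, stds, index=0, num_stds=3.0)

# Compare the utterance mean in PCA space before and after the shift.
before = (frames.mean(axis=0) - pca_mean) @ components.T
after = (shifted.mean(axis=0) - pca_mean) @ components.T
delta = after - before
```

Because the principal directions are orthonormal, `delta` is nonzero only at the chosen index, where it equals `num_stds * stds[index]`; the other principal dimensions of the utterance mean are left unchanged.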

Pitch modification (Principal dimension 1)

Male Speaker Female Speaker
3 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
3 Standard deviations lower

Intensity modification (Principal dimension 2)

Male Speaker Female Speaker
3 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
3 Standard deviations lower

F2 modification (Principal dimension 4)

Male Speaker Female Speaker
5 Standard deviations higher
3 Standard deviations higher
Original utterance
3 Standard deviations lower
5 Standard deviations lower

No correlation modification

These control experiments show that changing a principal dimension that, according to our analysis, correlates with no speaker characteristic produces no significant change in the original audio.

Principal dimension 3

Male Speaker Female Speaker
5 Standard deviations higher
Original utterance
5 Standard deviations lower

Principal dimension 5

Male Speaker Female Speaker
5 Standard deviations higher
Original utterance
5 Standard deviations lower

Principal dimension 7

Male Speaker Female Speaker
5 Standard deviations higher
Original utterance
5 Standard deviations lower