Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features

Kyle Janse van Rensburg, Benjamin van Niekerk, Herman Kamper

[arXiv]

How do speech models trained through self-supervised learning (SSL) structure their representations? Previous studies have looked at how information is encoded in feature vectors across different layers. But few studies have considered whether speech characteristics are captured within individual dimensions of SSL features. In this paper, we look specifically at speaker information using PCA on utterance-averaged representations. Using WavLM, we find that the principal dimension that explains the most variance encodes pitch and associated characteristics like gender. Other individual principal dimensions correlate with intensity, noise levels, the second formant, and higher-frequency characteristics. Finally, in subsequent synthesis experiments, we show that most characteristics can be controlled by changing the corresponding dimensions. We show that this control is isolated and results in high-quality audio synthesis.

We now demonstrate the result of changing a single principal dimension of an utterance to control a specific speaker characteristic. All utterances are from the dev-clean and test-clean subsets of LibriSpeech.
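The kind of modification demonstrated here can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that PCA is fitted on utterance-averaged features and that a modification adds the same offset to every frame of an utterance along one principal direction. The helper names `fit_pca` and `shift_dimension` are hypothetical, and random arrays stand in for WavLM features.

```python
import numpy as np


def fit_pca(utterance_means):
    """Fit PCA on utterance-averaged features (one row per utterance)."""
    mean = utterance_means.mean(axis=0)
    centered = utterance_means - mean
    # Rows of vt are unit-norm principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Standard deviation of the utterance-level scores along each direction.
    stds = (centered @ vt.T).std(axis=0)
    return mean, vt, stds


def shift_dimension(frames, components, stds, dim, num_stds):
    """Shift every frame along one principal direction by num_stds standard deviations."""
    # Assumption: a constant per-frame offset; the paper may apply the change differently.
    return frames + num_stds * stds[dim] * components[dim]


# Stand-in data: 200 utterances of 50 frames with 16-dimensional features.
# Real SSL features have many more dimensions and a variable number of frames.
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(50, 16)) + rng.normal(size=16) for _ in range(200)]
utterance_means = np.stack([u.mean(axis=0) for u in utterances])

mean, components, stds = fit_pca(utterance_means)

# Index 0 corresponds to principal dimension 1 in the headings on this page.
original = utterances[0]
modified = shift_dimension(original, components, stds, dim=0, num_stds=3.0)


def project(frames):
    """Project the utterance-averaged features onto the principal directions."""
    return (frames.mean(axis=0) - mean) @ components.T


# Change in the utterance-level scores caused by the modification.
delta = project(modified) - project(original)
```

Because the principal directions are orthonormal, `delta` is nonzero only in the shifted dimension, which is the sense in which such a change is isolated in feature space. Whether the change is also isolated perceptually depends on the synthesis model, which this sketch does not cover.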

Pitch modification (Principal dimension 1)

Male Speaker Female Speaker
3 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
3 Standard deviations lower

Intensity modification (Principal dimension 2)

Male Speaker Female Speaker
3 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
3 Standard deviations lower

F2 modification (Principal dimension 4)

Male Speaker Female Speaker
5 Standard deviations higher
3 Standard deviations higher
Original utterance
3 Standard deviations lower
5 Standard deviations lower

No correlation modification

These are control experiments. They show that changing a principal dimension that, according to our analysis, does not correlate with any speaker characteristic produces no significant change in the original audio.

Principal dimension 3

Male Speaker Female Speaker
5 Standard deviations higher
Original utterance
5 Standard deviations lower

Principal dimension 5

Male Speaker Female Speaker
5 Standard deviations higher
Original utterance
5 Standard deviations lower

Principal dimension 7

Male Speaker Female Speaker
5 Standard deviations higher
Original utterance
5 Standard deviations lower