How do speech models trained through self-supervised learning (SSL) structure their representations?
Previous studies have examined how information is encoded in feature vectors across different layers,
but few have asked whether speech characteristics are captured within individual dimensions of SSL features.
In this paper, we investigate speaker information by applying PCA to utterance-averaged representations.
Using WavLM, we find that the principal dimension explaining the most variance encodes pitch, along with associated characteristics such as gender.
Other individual principal dimensions correlate with intensity, noise level, the second formant (F2), and higher-frequency characteristics.
Finally, in synthesis experiments, we show that most of these characteristics can be controlled by modifying the corresponding dimension.
This control is largely isolated, in that other characteristics remain unchanged, and it yields high-quality audio synthesis.
Below we demonstrate the result of shifting a principal dimension of an utterance to control a specific speaker characteristic.
All utterances are from LibriSpeech's dev-clean and test-clean sets.
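The modification procedure behind these samples can be sketched as follows, under stated assumptions: frame-level features, the PCA mean and basis, and per-dimension standard deviations are placeholders (in practice they come from WavLM and the fitted PCA), and the modified frames would be passed to a vocoder for resynthesis. The idea is to project each frame into PCA space, shift one principal dimension by a multiple of its standard deviation, and project back.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup (illustrative, not the paper's exact pipeline): frame-level
# features for one utterance, plus a PCA mean, an orthonormal basis, and
# per-dimension standard deviations estimated over the training data.
n_frames, dim = 120, 768
frames = rng.normal(size=(n_frames, dim))             # placeholder WavLM frames
pca_mean = np.zeros(dim)                              # placeholder PCA mean
basis = np.linalg.qr(rng.normal(size=(dim, dim)))[0]  # orthonormal basis
stds = np.ones(dim)                                   # std of each principal dim

def shift_principal_dimension(frames, k, n_std):
    """Shift principal dimension k by n_std standard deviations for every
    frame, then map the result back to the original feature space."""
    scores = (frames - pca_mean) @ basis   # project into PCA space
    scores[:, k] += n_std * stds[k]        # shift only the chosen dimension
    return scores @ basis.T + pca_mean     # invert the projection

# e.g. raise principal dimension 1 (pitch, per the analysis above) by 3 stds
modified = shift_principal_dimension(frames, k=0, n_std=3.0)
# `modified` would then be resynthesized to audio with a vocoder.
```

Because the basis is orthonormal, only the chosen dimension changes; all other principal-dimension scores are preserved, which is what makes the control isolated.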
Pitch modification (Principal dimension 1)
Male Speaker
Female Speaker
3 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
3 Standard deviations lower
Intensity modification (Principal dimension 2)
Male Speaker
Female Speaker
3 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
3 Standard deviations lower
F2 modification (Principal dimension 4)
Male Speaker
Female Speaker
5 Standard deviations higher
3 Standard deviations higher
Original utterance
3 Standard deviations lower
5 Standard deviations lower
No-correlation modification (control)
These control experiments show that changing a principal dimension which, according to our analysis,
has no associated correlations produces no significant change in the original audio.