How do speech models trained through self-supervised learning structure their representations?
Previous studies have looked at how information is encoded in feature vectors across different layers.
But few studies have considered whether speech characteristics are captured \textit{within} individual dimensions of SSL features.
In this paper we specifically look at speaker information using PCA on utterance-averaged representations.
For a range of SSL models, we find that the principal dimension that explains most variance encodes pitch and associated characteristics like gender.
Other individual principal dimensions correlate with intensity, noise levels, the second formant, and higher frequency characteristics.
We then use synthesis analyses to show that the dimensions for most characteristics are isolated from each other's influence.
We further show that characteristics can be changed by manipulating the corresponding dimensions.
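The analysis described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the random arrays stand in for frame-level SSL features, and the helper names `utterance_average` and `fit_pca` are hypothetical. Each utterance is mean-pooled over time, and PCA is then fitted on the pooled vectors.

```python
import numpy as np

def utterance_average(frames: np.ndarray) -> np.ndarray:
    """Mean-pool a (num_frames, feat_dim) array into one (feat_dim,) vector."""
    return frames.mean(axis=0)

def fit_pca(X: np.ndarray):
    """Fit PCA on (num_utterances, feat_dim) data via SVD.

    Returns the data mean, the principal directions (one per row, sorted by
    explained variance), and the standard deviation of the scores along each
    direction.
    """
    mean = X.mean(axis=0)
    _, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    std = S / np.sqrt(X.shape[0] - 1)
    return mean, Vt, std

# Assumption: random data stands in for real SSL features (40 utterances, 16 dims).
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(rng.integers(50, 100), 16)) for _ in range(40)]

X = np.stack([utterance_average(u) for u in utterances])
mean, components, std = fit_pca(X)

# Scores: coordinates of each utterance along the principal dimensions.
scores = (X - mean) @ components.T
```

Column `j` of `scores` holds every utterance's value along principal dimension `j + 1`; these per-dimension values are what one would compare against measured characteristics such as pitch.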
We now demonstrate the result of changing a principal dimension of an utterance to control a specific speaker characteristic.
All utterances are from the dev-clean and test-clean subsets of LibriSpeech.
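One way to read "N standard deviations higher" in the tables below is as a shift of the utterance's features along a single principal direction. The sketch below assumes the same offset is added to every frame, which is an assumption rather than the confirmed procedure. The function `shift_principal_dimension` is hypothetical, random arrays stand in for SSL features, and the step that resynthesises audio from the modified features is not shown.

```python
import numpy as np

def shift_principal_dimension(frames, component, std, k):
    """Move every frame by k standard deviations along one unit-norm principal direction."""
    return frames + k * std * component

# Assumption: synthetic stand-ins for utterance-averaged features of a corpus.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 8))
mean = X.mean(axis=0)
_, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
std = S / np.sqrt(len(X) - 1)

# Frame-level features of one utterance, shifted 3 standard deviations
# higher along principal dimension 1 (row 0 of Vt).
frames = rng.normal(size=(60, 8))
shifted = shift_principal_dimension(frames, Vt[0], std[0], k=3.0)

# Compare the utterance's principal-dimension scores before and after.
before = (frames.mean(axis=0) - mean) @ Vt.T
after = (shifted.mean(axis=0) - mean) @ Vt.T
delta = after - before
```

Because the principal directions are orthonormal, `delta` is nonzero only in its first entry: the shift moves the utterance along the chosen dimension and leaves its scores on all other dimensions unchanged.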
Pitch modification (Principal dimension 1)
Male Speaker
Female Speaker
3 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
3 Standard deviations lower
Intensity modification (Principal dimension 2)
Male Speaker
Female Speaker
3 Standard deviations higher
1 Standard deviation higher
Original utterance
1 Standard deviation lower
3 Standard deviations lower
F2 modification (Principal dimension 4)
Male Speaker
Female Speaker
5 Standard deviations higher
3 Standard deviations higher
Original utterance
3 Standard deviations lower
5 Standard deviations lower
No correlation modification
These control experiments show that changing a principal dimension which, according to our analysis,
does not correlate with any characteristic produces no significant change in the original audio.
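A dimension with no associated correlation could be identified by correlating each principal dimension's scores with a measured characteristic. The following is a rough sketch using Pearson correlation on synthetic data. The function `pearson_per_dimension` and the toy `pitch` values are hypothetical, and the correlation measure is an assumption for illustration.

```python
import numpy as np

def pearson_per_dimension(scores, characteristic):
    """Pearson correlation between each column of scores and a 1-D characteristic."""
    s = scores - scores.mean(axis=0)
    c = characteristic - characteristic.mean()
    return (s.T @ c) / (np.linalg.norm(s, axis=0) * np.linalg.norm(c))

# Assumption: synthetic scores for 500 utterances on 6 principal dimensions.
rng = np.random.default_rng(2)
scores = rng.normal(size=(500, 6))

# Toy "pitch" that depends only on the first dimension, plus a little noise.
pitch = 2.0 * scores[:, 0] + 0.1 * rng.normal(size=500)

r = pearson_per_dimension(scores, pitch)
```

In this toy setup only the first entry of `r` is close to 1, and the remaining dimensions play the role of the control dimension: their correlation with the characteristic is near zero.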