Categorisation of my Spotify listening history using k-means clustering.
Author
Harry Zhong
Published
March 23, 2024
Introduction
The motivation for this project was:
I thought it would be fun.
That’s it.
So, let’s get into how we can use R and Spotify’s Web API to categorise songs that we have listened to.
Data Extraction
For this project, there are two key datasets that we need:
Listening activity history.
Track features.
The first can be obtained by requesting your Spotify data. For this project, we will be using the extended streaming history option, which takes longer to process but gives us our full listening history as opposed to only the most recent year.
The second dataset can be generated from our listening activity using Spotify’s Web API to pull track features for each song in our streaming history.
Activity History Data
Our Spotify activity history data is given as a set of .json files. We can extract the data from all the .json files into a dataframe and perform some preliminary cleaning using the code below.
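A sketch of that extraction step is below. It assumes the export was unzipped into a `spotify_data/` folder; the file pattern and column names (such as `spotify_track_uri` and `ms_played`) are taken from Spotify's extended streaming history format and may differ slightly between exports.

```r
library(jsonlite)
library(purrr)
library(dplyr)

# Each Streaming_History*.json file parses to a data frame of plays
files <- list.files("spotify_data",
                    pattern = "Streaming_History.*\\.json$",
                    full.names = TRUE)

activity <- files |>
  map(fromJSON) |>
  bind_rows() |>
  as_tibble() |>
  filter(!is.na(spotify_track_uri)) |>  # drop podcast episodes
  mutate(track_id = sub("spotify:track:", "", spotify_track_uri))
```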
Next, we’ll need to use Spotify’s Web API to obtain track features, which requires the track ID of each track we’re interested in. Fortunately, since we requested our full activity history, this data is included as a column.
Note
On the topic of Spotify’s Web API, it’s interesting to note that it also exposes genres. However, genres are linked to artists, not tracks, which makes them less useful for clustering individual songs than track features.
We can use the httr and jsonlite packages to create a function that takes a Spotify track ID and returns its track features.
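A minimal sketch of such a function is below, hitting the Web API's audio-features endpoint. It assumes `access_token` holds a valid OAuth token, and does no retry or rate-limit handling.

```r
library(httr)
library(jsonlite)

# Fetch audio features for one track ID from the Spotify Web API
get_track_features <- function(track_id, access_token) {
  res <- GET(
    paste0("https://api.spotify.com/v1/audio-features/", track_id),
    add_headers(Authorization = paste("Bearer", access_token))
  )
  stop_for_status(res)
  fromJSON(content(res, as = "text", encoding = "UTF-8"))
}
```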
Given the large number of track IDs, using this function on all tracks in our dataset is a long and painful process, where we will get rate-limited many times by Spotify. Conveniently, I have a local file containing all of our tracks and their associated track features, which I will load in.
The description for each feature can be found in Spotify’s documentation.
| track_name | artist_name | track_uri | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Psychedelic Switch | Carly Rae Jepsen | 7zy2kNoeD72x2NEDaAsJOX | 0.681 | 0.805 | 4 | -6.676 | 1 | 0.0480 | 0.00182 | 1.58e-02 | 0.341 | 0.304 | 127.907 | 272035 | 4 |
| Sick Feeling | boy pablo | 7zxLkZbUxITHabPzGN8Xgc | 0.415 | 0.504 | 9 | -10.003 | 1 | 0.0318 | 0.02200 | 3.80e-06 | 0.363 | 0.401 | 165.860 | 155714 | 4 |
| You Get Me So High | The Neighbourhood | 7zwn1eykZtZ5LODrf7c0tS | 0.551 | 0.881 | 7 | -6.099 | 0 | 0.0542 | 0.18600 | 7.91e-02 | 0.152 | 0.387 | 88.036 | 153000 | 4 |
| No Different | Epik High | 7ztlf9mCrjoLXAYYf0LCYx | 0.732 | 0.600 | 1 | -6.127 | 1 | 0.0535 | 0.03070 | 8.25e-05 | 0.137 | 0.238 | 131.912 | 200362 | 4 |
| Sorry | The Rose | 7zmrZMinkTMJ2kZgM9Kqgp | 0.388 | 0.642 | 10 | -4.659 | 1 | 0.0337 | 0.47100 | 0.00e+00 | 0.306 | 0.402 | 173.610 | 215477 | 4 |
| Livin It Up (with Post Malone & A$AP Rocky) | Young Thug | 7zjEyeBsaw9gV0jofJLfOM | 0.767 | 0.313 | 7 | -12.059 | 1 | 0.0798 | 0.83800 | 0.00e+00 | 0.105 | 0.765 | 82.582 | 210907 | 4 |
We can then remove the discrete track features (key, mode and time_signature) and scale the remaining features so that the clusters are not affected by the difference in magnitude between features.
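One way to build this scaled feature matrix is sketched below; the input data frame `track_features` and its column names are assumed to match the table above.

```r
library(dplyr)

feature_matrix <- track_features |>
  mutate(label = paste(track_name, artist_name, sep = " - ")) |>
  select(label, danceability, energy, loudness, speechiness,
         acousticness, instrumentalness, liveness, valence,
         tempo, duration_ms) |>
  tibble::column_to_rownames("label") |>
  scale()  # centre each feature to mean 0 and unit variance
```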
| track | danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|
| Livin It Up (with Post Malone & A$AP Rocky) - Young Thug | 1.0746895 | -0.9667206 | -0.7933436 | 0.0892422 | 1.2803337 | -0.3915657 | -0.5228246 | 1.3975087 | -1.1808328 | 0.2545290 |
K-means
Now that we have the required data, we can move on to the machine learning algorithm of interest, k-means clustering.
What is K-means?
The k-means algorithm basically goes:
Choose \(k\) random points within the domain of your factors.
Create \(k\) clusters by assigning each observation to its nearest point, which is now referred to as a mean.
The centroid of each cluster then becomes the new mean.
Repeat until convergence.
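These steps are exactly what base R's `kmeans` function implements, so a toy run on two well-separated groups gives a feel for the output:

```r
# Two obvious clusters: points around (0, 0) and points around (5, 5)
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

# nstart = 25 reruns the random initialisation 25 times and keeps the best
fit <- kmeans(x, centers = 2, nstart = 25)
fit$centers  # one centre near (0, 0), the other near (5, 5)
```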
The obvious question is: how do we determine the value of \(k\) for a given set of factors? One method would be to use something called a silhouette value. We can understand the silhouette value by considering a group of clustered data points, shown below.
We’ll let \(s_i\) be the silhouette value for point \(i\) which belongs to cluster \(C_I\), then
\[
\begin{split}
s_i&=\frac{b_i-a_i}{\text{max}(a_i,b_i)},\text{ if }|C_I|>1,\\
s_i&=0,\text{ if }|C_I|=1,
\end{split}
\]
where
\[
\begin{split}
a_i&=\frac{1}{|C_I|-1}\sum_{j\in C_I,i\neq j}\text{d}(i,j),\\
b_i&=\min_{J\neq I}\frac{1}{|C_J|}\sum_{j\in C_J}\text{d}(i,j).
\end{split}
\]
Simply put, \(a_i\) is the average of some measure of distance between point \(i\) and every other point in cluster \(C_I\) besides itself, and \(b_i\) is the minimum average of some measure of distance between \(i\) and every other point in some other cluster \(C_J\). The cluster \(C_J\), used to determine \(b_i\), is sometimes referred to as the neighboring cluster of point \(i\) as it is the next closest cluster after \(C_I\).
Thus, given the definition of \(s_i\), higher values of \(s_i\) indicate a better fit of a point \(i\) in its cluster \(C_I\).
Following the definition of a silhouette value for a single point, the clustering performance of the entire dataset is calculated via the average silhouette value of all points.
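As a sketch, the cluster package computes these values directly from a clustering and a distance matrix; the toy data here stands in for our features:

```r
library(cluster)

set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
fit <- kmeans(x, centers = 2, nstart = 25)

# One silhouette value per point, based on Euclidean distances
sil <- silhouette(fit$cluster, dist(x))
mean(sil[, "sil_width"])  # average silhouette value for the whole fit
```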
Thus, we can determine the optimal number of clusters by:
Running the k-means algorithm using \(n\) clusters.
Evaluating the average silhouette value.
Repeating for a reasonable range of \(n\).
Ranking \(n\) by maximum average silhouette value.
Conveniently, the fviz_nbclust function takes care of this process for us.
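A typical call looks like the following; `feature_matrix` is an assumed name for the scaled matrix built earlier.

```r
library(factoextra)

# Plots average silhouette width for k = 2..10; the peak is the suggested k
fviz_nbclust(feature_matrix, kmeans, method = "silhouette", k.max = 10)
```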
Feature selection
The next problem is finding the optimal features to include in our model. Generally speaking, an easy way to determine the relevant features to include in a model is to use domain knowledge. However, we do not have this luxury as I know nothing about audio engineering. So, we will do it the hard way, by trying every combination of features and selecting the ones with the best performance.
To do this, we can use a function I wrote that does the following:
Takes inputs n, data, nstart and itermax, where n is the number of factors to consider and data is the feature matrix previously generated. The nstart and itermax inputs are passed on to the kmeans function.
Finds all combinations of n factors within data.
For each combination, determines the optimal number of clusters using average silhouette value, and fits k-means clusters.
Records performance metrics and cluster plot.
The function then outputs a dataframe that contains a row for each combination of n factors.
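A hypothetical reconstruction of that function is sketched below (the original is not shown in the post, so names and details here are assumptions). It skips the cluster-plot step and records only the chosen \(k\) and its average silhouette value.

```r
library(cluster)
library(purrr)

# Pick k in 2..k.max by maximum average silhouette value
best_k <- function(x, nstart, itermax, k.max = 10) {
  sil <- sapply(2:k.max, function(k) {
    fit <- kmeans(x, centers = k, nstart = nstart, iter.max = itermax)
    mean(silhouette(fit$cluster, dist(x))[, "sil_width"])
  })
  c(k = (2:k.max)[which.max(sil)], sil = max(sil))
}

# One row per combination of n factors from data
kmeans_search <- function(n, data, nstart = 25, itermax = 50) {
  combos <- combn(colnames(data), n, simplify = FALSE)
  map_dfr(combos, function(cols) {
    res <- best_k(data[, cols, drop = FALSE], nstart, itermax)
    tibble::tibble(features = paste(cols, collapse = ", "),
                   k = res["k"], avg_silhouette = res["sil"])
  })
}
```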
The function can then be used for all values of n, from 2 to the total number of factors. As this process is computationally intensive, and most R packages do not support multi-threading, we can use the foreach and doParallel packages to write a parallel for loop which utilises all cores of the local computer. If we paid for all our CPU cores, we might as well use them, right?
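The parallel loop might look like this sketch, where `kmeans_search(n, data)` stands in for the combination-search function described above and `feature_matrix` for the scaled feature matrix; both names are assumptions.

```r
library(foreach)
library(doParallel)

# One worker per core; each iteration handles one value of n
cl <- makeCluster(parallel::detectCores())
registerDoParallel(cl)

results <- foreach(n = 2:ncol(feature_matrix), .combine = rbind,
                   .packages = c("cluster", "purrr", "tibble")) %dopar% {
  kmeans_search(n, feature_matrix)
}

stopCluster(cl)
```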
This results in a dataframe containing all possible combinations of factors for our dataset, and their k-means clustering results and performance, based on a value of \(k\) determined by average silhouette value.
Results
We can now visualise the top 20 k-means results from our previous calculation using a shiny application. It’s interesting that the highest performing clustering results only contain 2-3 factors.
Just for fun, and to see if our subjective interpretation of grouping songs together aligns at all with k-means and Spotify’s API, we can calculate the top 5 tracks by hours listened for each cluster in the highest performing k-means result.
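That summary could be computed along these lines, assuming an `activity` data frame from the streaming history (with `track_id`, `track_name`, `artist_name` and `ms_played` columns) and a `clusters` data frame mapping `track_id` to a cluster label; these names are assumptions.

```r
library(dplyr)

top_tracks <- activity |>
  inner_join(clusters, by = "track_id") |>
  group_by(cluster, track_name, artist_name) |>
  summarise(hours = sum(ms_played) / 3.6e6, .groups = "drop") |>
  group_by(cluster) |>
  slice_max(hours, n = 5)  # top 5 tracks by hours listened per cluster
```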