Analysis of My Spotify Data

K-means
R
Categorisation of my Spotify listening history using k-means clustering.
Author

Harry Zhong

Published

March 23, 2024

Introduction

The motivation for this project was:

  1. I thought it would be fun.
  2. That’s it.

So, let’s get into how we can use R and Spotify’s Web API to categorise songs that we have listened to.

Data Extraction

For this project, there are two key datasets that we need:

  1. Listening activity history.
  2. Track features.

The first can be obtained by requesting your Spotify data. For this project, we will using the extended streaming history option, which takes longer to process but gives us our full listening history as opposed to only the most recent year.

The second dataset can be generated from our listening activity using Spotify’s Web API to pull track features for each song in our streaming history.

Activity History Data

Our Spotify activity history data is given as a set of .json files. We can extract the data from all the .json files into a dataframe and perform some preliminary cleaning using the code below.

Activity History Code
files <- list.files(
  "data/full_history_data", 
  pattern = "*.json", 
  full.names = TRUE
)

full_streaming_history <- foreach(
  file = files, 
  .packages = c("jsonlite"),
  .combine = rbind
) %do% {
  fromJSON(file, flatten = TRUE)
} %>%
  rename(
    track_name = "master_metadata_track_name",
    artist_name = "master_metadata_album_artist_name"
  ) %>%
  mutate(
    track_uri = gsub(
      "spotify:track:", 
      "", 
      spotify_track_uri
    ),
    month = ts %>%
      substring(1, 7) %>%
      paste0("-01") %>%
      ymd()
  ) %>%
  select(
    -spotify_track_uri,
    -username,
    -platform,
    -ip_addr_decrypted
  ) %>%
  filter(month >= ymd("2019-04-01"))
ts ms_played conn_country user_agent_decrypted track_name artist_name master_metadata_album_album_name episode_name episode_show_name spotify_episode_uri reason_start reason_end shuffle skipped offline offline_timestamp incognito_mode track_uri month
2024-02-08T23:57:04Z 2072 AU unknown Kiss Me More (feat. SZA) Doja Cat Planet Her NA NA NA trackdone fwdbtn FALSE TRUE FALSE 1707436622 FALSE 3DarAbFujv6eYNliUTyqtz 2024-02-01
2024-02-08T23:57:01Z 219724 AU unknown vampire Olivia Rodrigo GUTS NA NA NA fwdbtn trackdone FALSE FALSE FALSE 1707436403 FALSE 1kuGVB7EU95pJObxwvfwKS 2024-02-01
2024-02-08T23:53:23Z 2284 AU unknown Judas Lady Gaga Born This Way NA NA NA fwdbtn fwdbtn FALSE TRUE FALSE 1707436400 FALSE 7F25roCtYi55JouckaayPC 2024-02-01
2024-02-08T23:53:20Z 1721 AU unknown Heads Will Roll Yeah Yeah Yeahs It’s Blitz! NA NA NA fwdbtn fwdbtn FALSE TRUE FALSE 1707436398 FALSE 2WRFD9WczJ975X2K1Y9YVs 2024-02-01
2024-02-08T23:53:18Z 4018 AU unknown What You Waiting For? Gwen Stefani Love Angel Music Baby NA NA NA fwdbtn fwdbtn FALSE TRUE FALSE 1707436394 FALSE 0ny5zITdmyNwyTPVzRGscU 2024-02-01
2024-02-08T23:53:14Z 2404 AU unknown Space Song Beach House Depression Cherry NA NA NA fwdbtn fwdbtn FALSE TRUE FALSE 1707436392 FALSE 3CLhX1JkJZ4s5umNnOqCRh 2024-02-01

From here, we can use the ggplot2 and shiny packages to visualise trends in my most listened to artists and tracks.

app.R
library(shiny)
library(tidyverse)
library(here)
library(foreach)

files <- list.files(
  paste0(here(), "/data/full_history_data"), 
  pattern = "*.json", 
  full.names = TRUE
)

full_streaming_history <- foreach(
  file = files, 
  .packages = c("jsonlite"),
  .combine = rbind
) %do% {
  fromJSON(file, flatten = TRUE)
} %>%
  rename(
    track_name = "master_metadata_track_name",
    artist_name = "master_metadata_album_artist_name"
  ) %>%
  mutate(
    track_uri = gsub("spotify:track:", "", spotify_track_uri),
    month = ts %>%
      substring(1, 7) %>%
      paste0("-01") %>%
      ymd()
  ) %>%
  select(-spotify_track_uri) %>%
  filter(month >= ymd("2019-04-01"))

min_date <- full_streaming_history %>%
  pull(month) %>%
  min()

max_date <- full_streaming_history %>%
  pull(month) %>%
  max()

ui <- fluidPage(
  titlePanel("Spotify Streaming History"),
  sidebarLayout(
    sidebarPanel(
      sliderInput(
        "top_n",
        "Top n",
        min = 1,
        max = 20,
        value = 10
      ),
      sliderInput(
        "dates",
        "Streaming History Date Range",
        min = min_date,
        max = max_date,
        value = c(max_date %m-% months(6), max_date)
      )
    ),
    mainPanel(
      tabsetPanel(
        type = "tabs",
        tabPanel(
          "Artist History Plot", 
          h2(textOutput("artist_history_title")),
          plotOutput("artist_history_plot")
        ),
        tabPanel(
          "Track History Plot",
          h2(textOutput("track_history_title")),
          plotOutput("track_history_plot")
        )
      )
    )
  )
)

server <- function(input, output) {
  output$artist_history_title <- renderText({
    top_n <- input$top_n
    
    paste0(
      "Proportion of Hours Listened: Top ",
      top_n,
      " Artists"
    )
  })
  
  output$artist_history_plot <- renderPlot({
    top_artists <- data.frame(
      month = full_streaming_history %>%
        select(month) %>%
        distinct()
    ) %>%
      mutate(
        top_artists = map(
          month,
          ~full_streaming_history %>%
            filter(month == .x) %>%
            group_by(artist_name) %>%
            summarise(time = sum(ms_played)) %>%
            slice_max(time, n = as.numeric(input$top_n)) %>%
            pull(artist_name)
        )
      )
    
    artist_summary <- full_streaming_history %>%
      filter(
        month %>%
          between(input$dates[1], input$dates[2])
      ) %>%
      left_join(
        top_artists,
        by = "month"
      ) %>%
      rowwise() %>%
      filter(artist_name %in% top_artists) %>%
      group_by(
        artist_name, 
        month
      ) %>%
      summarise(hours_listened = sum(ms_played/(1000*60)))
    
    ggplot(artist_summary, aes(x = month, y = hours_listened, fill = artist_name, label = artist_name)) +
      xlab("Date") +
      ylab("Proportion") +
      geom_bar(position = "fill", stat = "identity") +
      geom_text(size = 3, position = position_fill(vjust = 0.5)) +
      theme(legend.position = "none")
  })
  
  output$track_history_title <- renderText({
    top_n <- input$top_n
    
    paste0(
      "Proportion of Hours Listened: Top ",
      top_n,
      " Tracks"
    )
  })
  
  output$track_history_plot <- renderPlot({
    top_tracks <- data.frame(
      month = full_streaming_history %>%
        select(month) %>%
        distinct()
    ) %>%
      mutate(
        top_tracks = map(
          month,
          ~full_streaming_history %>%
            filter(month == .x) %>%
            mutate(track_artist_name = paste(track_name, artist_name, sep = "\n")) %>%
            group_by(track_artist_name) %>%
            summarise(time = sum(ms_played)) %>%
            slice_max(time, n = as.numeric(input$top_n)) %>%
            pull(track_artist_name)
        )
      )
    
    track_summary <- full_streaming_history %>%
      filter(
        month %>%
          between(input$dates[1], input$dates[2])
      ) %>%
      mutate(track_artist_name = paste(track_name, artist_name, sep = "\n")) %>%
      left_join(
        top_tracks,
        by = "month"
      ) %>%
      rowwise() %>%
      filter(track_artist_name %in% top_tracks) %>%
      group_by(
        track_artist_name, 
        month
      ) %>%
      summarise(hours_listened = sum(ms_played/(1000*60)))
    
    ggplot(track_summary, aes(x = month, y = hours_listened, fill = track_artist_name, label = track_artist_name)) +
      xlab("Date") +
      ylab("Proportion") +
      geom_bar(position = "fill", stat = "identity") +
      geom_text(size = 3, position = position_fill(vjust = 0.5)) +
      theme(legend.position = "none")
  })
}

shinyApp(ui = ui, server = server)

Track Feature Data

Next, we’ll need to use Spotify’s Web API to obtain track features, which requires the track ID of each track we’re interested in. Fortunately, since we requested our full activity history, this data is included as a column.

Note

On the topic of Spotify’s Web API, it’s interesting to note that it also includes genres. However, genres are linked to artists, not tracks, which makes this feature less noteworthy compared to track features.

We can use the httr and jsonlite packages to create a function that takes a Spotify track ID and returns its track features.

get_audio_features <- function(track_id) {
  url = paste0("https://api.spotify.com/v1/audio-features/", track_id)
  response <- GET(
    url,
    add_headers(Authorization = paste("Bearer", spotify_token))
  )
  data <- fromJSON(
    content(
      response, 
      "text", 
      encoding = "UTF-8"
    )
  )
  return(data)
}

Given the large number of track IDs, using this function on all tracks in our dataset is a long and painful process, where we will get rate limited many times by Spotify. Conveniently, I have a local file containing all of our tracks and their associated track features, which I will load in.

The description for each feature can be found in Spotify’s documentation.

track_name artist_name track_uri danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms time_signature
Psychedelic Switch Carly Rae Jepsen 7zy2kNoeD72x2NEDaAsJOX 0.681 0.805 4 -6.676 1 0.0480 0.00182 1.58e-02 0.341 0.304 127.907 272035 4
Sick Feeling boy pablo 7zxLkZbUxITHabPzGN8Xgc 0.415 0.504 9 -10.003 1 0.0318 0.02200 3.80e-06 0.363 0.401 165.860 155714 4
You Get Me So High The Neighbourhood 7zwn1eykZtZ5LODrf7c0tS 0.551 0.881 7 -6.099 0 0.0542 0.18600 7.91e-02 0.152 0.387 88.036 153000 4
No Different Epik High 7ztlf9mCrjoLXAYYf0LCYx 0.732 0.600 1 -6.127 1 0.0535 0.03070 8.25e-05 0.137 0.238 131.912 200362 4
Sorry The Rose 7zmrZMinkTMJ2kZgM9Kqgp 0.388 0.642 10 -4.659 1 0.0337 0.47100 0.00e+00 0.306 0.402 173.610 215477 4
Livin It Up (with Post Malone & A$AP Rocky) Young Thug 7zjEyeBsaw9gV0jofJLfOM 0.767 0.313 7 -12.059 1 0.0798 0.83800 0.00e+00 0.105 0.765 82.582 210907 4

We can then remove discrete track features and scale the remaining features so that the clusters are not affected by the difference in magnitude of different features.

feature_matrix <- full_track_features %>%
  mutate(track_artist = paste(track_name, artist_name, sep = " - ")) %>%
  select(
    -track_name, 
    -artist_name
  ) %>%
  column_to_rownames(var = "track_artist") %>%
  select(
    -track_uri,
    -key,
    -mode,
    -time_signature
  ) %>%
  scale()
danceability energy loudness speechiness acousticness instrumentalness liveness valence tempo duration_ms
Psychedelic Switch - Carly Rae Jepsen 0.5034878 1.3400006 0.4206709 -0.3325940 -1.3132909 -0.3305910 1.3463448 -0.5950222 0.3365591 1.3459625
Sick Feeling - boy pablo -1.2632525 -0.0712252 -0.3296591 -0.5474916 -1.2506975 -0.3915509 1.5205894 -0.1757695 1.6071509 -0.7309358
You Get Me So High - The Neighbourhood -0.3599567 1.6963234 0.5508003 -0.2503492 -0.7420099 -0.0863063 -0.1505748 -0.2362802 -0.9982436 -0.7793940
No Different - Epik High 0.8422237 0.3788667 0.5444855 -0.2596349 -1.2237122 -0.3912473 -0.2693779 -0.8802870 0.4706386 0.0662492
Sorry - The Rose -1.4425833 0.5757820 0.8755599 -0.5222876 0.1419900 -0.3915657 1.0691375 -0.1714473 1.8666057 0.3361258
Livin It Up (with Post Malone & A$AP Rocky) - Young Thug 1.0746895 -0.9667206 -0.7933436 0.0892422 1.2803337 -0.3915657 -0.5228246 1.3975087 -1.1808328 0.2545290

K-means

Now that we have the required data, we can move on to the machine learning algorithm of interest, k-means clustering.

What is K-means?

The k-means algorithm basically goes:

  1. Choose \(k\) random points within the domain of your factors.
  2. Create \(k\) clusters by assigning each observation to its nearest point, which is now referred to as a mean.
  3. The centroid of each cluster then becomes the new mean.
  4. Repeat until convergence.

The obvious question is: how do we determine the value of \(k\) for a given set of factors? One method would be to use something called a silhouette value. We can understand the silhouette value by considering a group of clustered data points, shown below.

Chart Code
set.seed(2687)

rand_data <- data.frame(
  x = rnorm(50),
  y = rnorm(50)
)

km <- kmeans(rand_data, 3)

fviz_cluster(
  km,
  data = rand_data,
  geom = "point"
)

We’ll let \(s_i\) be the silhouette value for point \(i\) which belongs to cluster \(C_I\), then

\[ \begin{split} s_i&=\frac{b_i-a_i}{\text{max}(a_i,b_i)},\text{ if }|C_I|>1,\\ s_i&=0,\text{ if }|C_I|=0, \end{split} \]

where

\[ \begin{split} a_i&=\frac{1}{|C_I|-1}\sum_{j\in C_I,i\neq j}\text{d}(i,j),\\ b_i&=\text{min}\frac{1}{|C_J|}\sum_{j\in C_J}\text{d}(i,j),\text{ where }J\neq I. \end{split} \]

Simply put, \(a_i\) is the average of some measure of distance between point \(i\) and every other point in cluster \(C_I\) besides itself, and \(b_i\) is the minimum average of some measure of distance between \(i\) and every other point in some other cluster \(C_J\). The cluster \(C_J\), used to determine \(b_i\), is sometimes referred to as the neighboring cluster of point \(i\) as it is the next closest cluster after \(C_I\).

Thus, given the definition of \(s_i\), higher values of \(s_i\) indicate a better fit of a point \(i\) in its cluster \(C_I\).

Following the definition of a silhouette value for a single point, the clustering performance of the entire dataset is calculated via the average silhouette value of all points.

Thus, we can determine the optimal number of clusters by:

  1. Running the k-means algorithm using \(n\) clusters.
  2. Evaluating the average silhouette value.
  3. Repeat for a reasonable range of \(n\).
  4. Ranking \(n\) by maximum average silhouette value.

Conveniently, the fviz_nbclust function takes care of this process for us.

Feature selection

The next problem is finding the optimal features to include our model. Generally speaking, an easy way to determine the relevant features to include in a model is to use domain knowledge. However, we do not have this luxury as I know nothing about audio engineering. So, we will do it the hard way, by trying every combination of features and selecting the ones with the best performance.

To do this, we can use a function I wrote that does the following:

  1. Takes inputs n, data, nstart, itermax. Where n is is the number of factors to consider, and data is the feature matrix previously generated. The nstart and itermax inputs are values passed on to the kmeans function.
  2. Finds all combinations of n factors within data.
  3. For each combination, determine the optimal number of clusters using using average silhouette value, and fit k-means clusters.
  4. Records performance metrics and cluster plot.

The function then outputs a dataframe that contains a row for each combination of n factors.

Function Code
kmeans_select_features <- function(n, data, nstart, itermax) {
  comb_n <- data %>%
    colnames() %>%
    combn(n, simplify = FALSE)
  
  old_cols <- seq(1, n)
  new_cols <- paste0("factor_", seq(1, n))
  
  new_cols_sym <- syms(new_cols)
  
  factor_combinations_n <- do.call(rbind.data.frame, comb_n) %>%
    rename_with(~new_cols, all_of(old_cols)) %>%
    mutate(factors = pmap(list(!!!new_cols_sym), c)) %>%
    mutate(n_factors = n) %>%
    select(-(!!new_cols)) %>%
    mutate(data = map(factors,
                      ~data %>%
                        as.data.frame() %>%
                        select(all_of(.x)))) %>%
    mutate(n_clusters = map(data,
                            ~fviz_nbclust(.x, 
                                          kmeans, 
                                          nstart = nstart, 
                                          iter.max = itermax)[["data"]] %>%
                              slice(which.max(y)) %>%
                              select(clusters) %>%
                              as.numeric(),
                            .progress = paste("Finding optimal n_clusters:", 
                                              n, 
                                              "factors")) %>%
             as.numeric()) %>% 
    mutate(km = map2(data,
                     n_clusters,
                     ~kmeans(.x, 
                             .y, 
                             nstart = nstart, 
                             iter.max = itermax,
                             algorithm = "MacQueen"),
                     .progress = paste("Calculating kmeans:", 
                                       n, 
                                       "factors"))) %>%
    mutate(total_withinss = map(km,
                                ~.x$tot.withinss) %>%
             as.numeric(),
           bsstssRatio = map(km,
                             ~.x$betweenss/.x$totss) %>%
             as.numeric()) %>%
    mutate(km_plot = map2(km,
                          data,
                          ~fviz_cluster(.x,
                                        data = .y,
                                        geom = "point",
                                        ellipse.type = "convex"))) %>%
    arrange(desc(bsstssRatio))  
  
  return(factor_combinations_n)
}

The function can then be used on all values of n, from 2 to the total number of factors. As this process is computationally intensive, and most R packages do not support multi-threading, we can use the foreach and doParallel packages to write a multi-threaded for loop which utilises all cores of the local computer. If we paid for all our CPU cores, we might as well use them right?

set.seed(2687)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

kmeans_nfact <- foreach(n = seq(2, ncol(feature_matrix)),
                        .packages = c(
                          "tidyverse",
                          "cluster",
                          "factoextra",
                          "rlang"
                        ),
                        .combine = bind_rows) %dopar% {
                          kmeans_select_features(n, feature_matrix, 25, 1000)
                        } %>% 
  arrange(desc(bsstssRatio))

stopCluster(cl)

This results in a dataframe containing all possible combinations of factors for our dataset, and their k-means clustering results and performance, based on a value of \(k\) determined by average silhouette value.

Results

We can now visualise the top 20 k-means results from our previous calculation using a shiny application. It’s interesting that the highest performing clustering results only contain 2-3 factors.

app.R
library(shiny)
library(tidyverse)
library(here)
library(foreach)
library(english)

# Load Data

load("data/R_data/kmeans_nfact.RData")

files <- list.files(
  paste0(here(), "/data/full_history_data"), 
  pattern = "*.json", 
  full.names = TRUE
)

full_streaming_history <- foreach(
  file = files, 
  .packages = c("jsonlite"),
  .combine = rbind
) %do% {
  fromJSON(file, flatten = TRUE)
} %>%
  rename(
    track_name = "master_metadata_track_name",
    artist_name = "master_metadata_album_artist_name"
  ) %>%
  mutate(
    track_uri = gsub("spotify:track:", "", spotify_track_uri),
    month = ts %>%
      substring(1, 7) %>%
      paste0("-01") %>%
      ymd()
  ) %>%
  select(-spotify_track_uri) %>%
  filter(month >= ymd("2019-04-01"))

min_date <- full_streaming_history %>%
  pull(month) %>%
  min()

max_date <- full_streaming_history %>%
  pull(month) %>%
  max()

# UI

ui <- fluidPage(
  titlePanel("Spotify Clustering"),
  sidebarLayout(
    sidebarPanel(
      selectInput(
        "index", 
        "Clustering Performance Rank", 
        choices = seq(1, nrow(kmeans_nfact_save))
      ),
      sliderInput(
        "dates",
        "Streaming History Date Range",
        min = min_date,
        max = max_date,
        value = c(max_date %m-% months(6), max_date)
      )
    ),
    mainPanel(
      tabsetPanel(
        type = "tabs",
        tabPanel(
          "Cluster Plot",
          h1(textOutput("cluster_title")),
          textOutput("cluster_description"),
          h2("Cluster Centres"),
          tableOutput("cluster_table"),
          h2("Cluster Plot"),
          plotOutput("cluster_plot")
        ),
        tabPanel(
          "Cluster History Plot", 
          h1("Cluster History Plot"),
          p("This plot shows the proportion of hours listened for each cluster."),
          plotOutput("cluster_history_plot")
        ),
        tabPanel(
          "Track Clusters", 
          h1("Track Clusters"),
          textOutput("track_clusters_desc"),
          dataTableOutput("track_clusters")
        )
      )
    )
  )
)

# SERVER

server <- function(input, output) {
  output$cluster_title <- renderText({
    rank <- input$index
    
    paste0(
      "K-means Cluster: Rank ",
      rank
    )
  })
  
  output$cluster_description <- renderText({
    total_withinss <- kmeans_nfact_save %>%
      slice(as.numeric(input$index)) %>%
      pull(total_withinss) %>%
      round(digits = 2)
    
    bsstssRatio <- kmeans_nfact_save %>%
      slice(as.numeric(input$index)) %>%
      pull(bsstssRatio) %>%
      round(digits = 2)
    
    index_ordinal <- input$index %>%
      as.numeric() %>%
      ordinal()
    
    paste0(
      "The graph below shows the k-means cluster with the ", 
      index_ordinal,
      " highest BSS/TSS ratio of ",
      bsstssRatio,
      " and a total within sum of squares of ",
      total_withinss,
      "."
    )
  })
  
  output$track_clusters_desc <- renderText({
    start_date <- input$dates[1]
    end_date <- input$dates[2]
    
    paste0(
      "This table shows the cluster of all tracks listened to from ",
      start_date,
      " to ",
      end_date,
      "."
    )
  })
  
  output$track_clusters <- renderDataTable({
    kmeans_nfact_save %>%
      slice(as.numeric(input$index)) %>%
      pull(km) %>%
      pluck(1) %>%
      pluck("cluster") %>%
      as.data.frame() %>%
      rownames_to_column("track_artist") %>%
      rename(cluster = 2) %>% 
      mutate(cluster = paste0("Cluster ", cluster)) %>%
      left_join(
        full_streaming_history %>%
          filter(
            month %>%
              between(input$dates[1], input$dates[2])
          ) %>%
          select(
            track_name,
            artist_name
          ) %>%
          na.omit() %>%
          distinct() %>%
          mutate(
            track_artist = paste(
              track_name,
              artist_name,
              sep = " - "
            )
          ),
        .,
        by = "track_artist"
      ) %>%
      select(
        cluster,
        artist_name,
        track_name
      ) %>%
      arrange(
        cluster,
        artist_name,
        track_name
      ) %>%
      rename(
        Cluster = "cluster",
        `Artist Name` = "artist_name",
        `Track Name` = "track_name"
      )
  })
  
  output$cluster_plot <- renderPlot({
    kmeans_nfact_save %>%
      slice(as.numeric(input$index)) %>%
      pull(km_plot)
  })
  
  output$cluster_table <- renderTable({
    kmeans_nfact_save %>%
      slice(as.numeric(input$index)) %>%
      pull(km) %>%
      pluck(1) %>%
      pluck("centers") %>%
      as.data.frame() %>%
      rownames_to_column("cluster") %>%
      mutate(cluster = paste0("Cluster ", cluster))
  })
  
  output$cluster_history_plot <- renderPlot({
    track_clusters <- kmeans_nfact_save %>%
      slice(as.numeric(input$index)) %>%
      pull(km) %>%
      pluck(1) %>%
      pluck("cluster") %>%
      as.data.frame() %>%
      rownames_to_column("track_artist") %>%
      rename(cluster = 2) %>% 
      mutate(cluster = paste0("Cluster ", cluster)) %>%
      left_join(
        full_streaming_history %>%
          select(
            track_name,
            artist_name
          ) %>%
          na.omit() %>%
          distinct() %>%
          mutate(
            track_artist = paste(
              track_name,
              artist_name,
              sep = " - "
            )
          ),
        .,
        by = "track_artist"
      ) %>%
      select(-track_artist) %>%
      arrange(
        artist_name,
        track_name
      )
    
    cluster_summary <- full_streaming_history %>%
      select(
        month,
        track_name,
        artist_name,
        ms_played
      ) %>%
      na.omit() %>%
      filter(
        month %>%
          between(input$dates[1], input$dates[2])
      ) %>%
      left_join(
        track_clusters, 
        by = c("track_name", "artist_name")
      ) %>%
      group_by(
        month, 
        cluster
      ) %>%
      summarise(hours_listened = sum(ms_played/(1000*60)))
    
    ggplot(cluster_summary, aes(x = month, y = hours_listened, fill = cluster, label = cluster)) +
      xlab("Date") +
      ylab("Proportion") +
      geom_bar(position = "fill", stat = "identity")
  })
}

shinyApp(ui = ui, server = server)

Cluster Playlists

Just for fun, and to see if our subjective interpretation of grouping songs together aligns at all with k-means and Spotify’s API, we can calculate the top 5 tracks by hours listened for each cluster in the highest performing k-means result.

Top 5 Tracks Code
cluster_top_tracks <- kmeans_nfact_save %>%
  slice(1) %>%
  pull(km) %>%
  pluck(1) %>%
  pluck("cluster") %>%
  as.data.frame() %>%
  rownames_to_column("track_artist") %>%
  rename(cluster = 2) %>% 
  mutate(cluster = paste0("Cluster ", cluster)) %>%
  left_join(
    full_streaming_history %>%
      select(
        track_name,
        artist_name,
        ms_played
      ) %>%
      na.omit() %>%
      group_by(
        track_name,
        artist_name
      ) %>%
      summarise(
        hours_listened = sum(ms_played/(1000*60))
      ) %>%
      mutate(
        track_artist = paste(
          track_name,
          artist_name,
          sep = " - "
        )
      ),
      .,
      by = "track_artist"
    ) %>%
  select(-track_artist) %>%
  group_by(cluster) %>%
  slice_max(hours_listened, n = 5) %>%
  ungroup() %>%
  select(
    cluster,
    track_name,
    artist_name
  )
cluster track_name artist_name
Cluster 1 i <3 u boy pablo
Cluster 1 Apocalypse Cigarettes After Sex
Cluster 1 Nothing Bruno Major
Cluster 1 Sunsetz Cigarettes After Sex
Cluster 1 Wonder ADOY
Cluster 2 May I Ask Luke Chiang
Cluster 2 drunk keshi
Cluster 2 is your bedroom ceiling bored? (feat. Rxseboy) - Fudasca Remix Sody
Cluster 2 drivers license Olivia Rodrigo
Cluster 2 Location Unknown ◐ HONNE
Cluster 3 change ur mind Sarcastic Sounds
Cluster 3 stay4ever (feat. Mounika.) Powfu
Cluster 3 affection BETWEEN FRIENDS
Cluster 3 boyfriend (with Social House) Ariana Grande
Cluster 3 not ur friend Jeremy Zucker

Then, we can make some playlists:

Cluster 1 Playlist
Cluster 2 Playlist
Cluster 3 Playlist