Facebook Messenger Data Part 1: Latent Dirichlet Allocation

LDA
Python
Identification and classification of topics in Facebook Messenger conversations using Latent Dirichlet Allocation (LDA).
Author

Harry Zhong

Published

January 5, 2025

Introduction

After analysing my Spotify data, I thought it would be interesting to investigate another dataset generated by my online activity: Facebook Messenger chat logs. However, since chat logs are composed of text data we’ll have to learn some basic natural language processing techniques to gain any meaningful insights from this dataset.

This is part 1 of 2 of my analysis on my Facebook Messenger conversations, where I’ll use latent Dirichlet allocation to (hopefully) identify topics within conversations.

Data Extraction

First we’ll need to load the data, which Facebook provides as a series of json files. The raw data is structured like the sample below.

with open('sample_messages.json') as file:
    sample = file.read()

print(sample)
{
    "participants": [
      {
        "name": "Alice"
      },
      {
        "name": "Bob"
      }
    ],
    "messages": [
      {
        "sender_name": "Alice",
        "timestamp_ms": 1710762618535,
        "content": "Hey, Bob! Did you see the new movie?",
        "is_geoblocked_for_viewer": false
      },
      {
        "sender_name": "Bob",
        "timestamp_ms": 1710762591958,
        "content": "Yeah, I just watched it last night. What did you think of it?",
        "is_geoblocked_for_viewer": false
      },
      {
        "sender_name": "Bob",
        "timestamp_ms": 1710762583886,
        "content": "I loved the plot twist at the end. It was so unexpected!",
        "is_geoblocked_for_viewer": false
      },
      {
        "sender_name": "Alice",
        "timestamp_ms": 1710749703948,
        "content": "I know, right? I was totally surprised when it happened.",
        "is_geoblocked_for_viewer": false
      },
      {
        "sender_name": "Bob",
        "timestamp_ms": 1710749659963,
        "content": "What did you think of the special effects?",
        "reactions": [
          {
            "reaction": "\ud83d\ude04",
            "actor": "Alice"
          }
        ],
        "is_geoblocked_for_viewer": false
      },
      {
        "sender_name": "Bob",
        "timestamp_ms": 1710749648743,
        "content": "I thought they were really well done. The explosions were so cool!",
        "is_geoblocked_for_viewer": false
      },
      {
        "sender_name": "Alice",
        "timestamp_ms": 1710749547622,
        "content": "Definitely! I loved the action scenes.",
        "is_geoblocked_for_viewer": false
      },
      {
        "sender_name": "Alice",
        "timestamp_ms": 1710749516271,
        "content": "I'm so glad we watched it together. It was a lot more fun that way.",
        "is_geoblocked_for_viewer": false
      },
      {
        "sender_name": "Bob",
        "timestamp_ms": 1710749502826,
        "content": "Definitely! We should do it again soon.",
        "reactions": [
          {
            "reaction": "\ud83d\ude04",
            "actor": "Alice"
          }
        ],
        "is_geoblocked_for_viewer": false
      },
      {
        "sender_name": "Alice",
        "timestamp_ms": 1710749462947,
        "content": "Sounds like a plan to me!",
        "is_geoblocked_for_viewer": false
      }
    ]
  }

To work with this data, we’ll need to extract the relevant information from each json file and store it in a dataframe; in this case only the messages property is required. To do this, we’ll first import the libraries we’ll need.

import os
import json
import pandas as pd
import re
import gensim as gs

Then, we can define a function which takes a directory containing json files as input and returns the messages property of those files as a pandas dataframe.

def ms_import_data(directory: str) -> pd.DataFrame:
    # list every json file in the given directory
    data_file_names = os.listdir(directory)
    data_files = [os.path.join(directory, data_file_name) for data_file_name in data_file_names]

    messenger_data = pd.DataFrame()

    # flatten the messages property of each file and append it to a single dataframe
    for data_file in data_files:
        with open(data_file) as file:
            data = json.load(file)
        json_data = pd.json_normalize(data['messages'])
        messenger_data = pd.concat([messenger_data, json_data])

    # keep the relevant columns, drop empty messages and sort chronologically
    messenger_data = (
        messenger_data[[
            'sender_name',
            'timestamp_ms',
            'content'
        ]]
        .dropna()
        .sort_values('timestamp_ms', ascending=True)
    )

    # convert the millisecond timestamp into a datetime column
    messenger_data['timestamp'] = pd.to_datetime(messenger_data['timestamp_ms'], unit='ms')

    messenger_data = messenger_data[[
        'sender_name',
        'timestamp',
        'content'
    ]]

    return messenger_data

My analysis will focus on a group chat called “The Office”, which should have the most data, spanning from 2019 to early 2024, when Facebook sent me the data. Using the function we just made, we can import the chat logs and look at the resulting dataframe.

chat_data = ms_import_data('data/the_office')
chat_data.head(3)
sender_name timestamp content
6328 Harry Zhong 2019-04-08 11:23:07.519 You created the group.
6327 Harry Zhong 2019-04-08 11:23:08.454 Hoi
6326 Dhruv Jobanputra 2019-04-08 11:23:19.718 Hoi

Now that we have the raw data, we can move on to the additional cleaning required for LDA:

  1. Convert all text to lower case.
  2. Remove non-text characters.
  3. Remove messages that aren’t part of a conversation (reactions and links).
  4. Remove stop words (words which can belong to any topic, such as conjunctions and abbreviations).
  5. Group messages into conversations which will be used as documents for LDA.

Points 1 to 3 are fairly straightforward to implement using pandas. However, point 4 brings us to our first usage of gensim, a Python topic modelling library, where it’s used to remove generic stop words from the chat log. Additional stop words specific to the chat are then defined and removed on top of the words from gensim.

The code used to implement points 1 to 4 is shown below.

# import custom stopwords
from modules.preprocessing import custom_stopwords

def remove_custom_stopwords(document: str, stopwords: list) -> str:
    for word in stopwords:
        pattern = r'\b'+word+r'\b'
        document = re.sub(pattern, '', document).replace('  ', ' ')
    
    return document

def lda_preprocess(dataframe: pd.DataFrame, content_col: str, cleaned_col: str, rm_stopwords: bool=True) -> pd.DataFrame:
    dataframe[cleaned_col] = (
        dataframe[content_col]
        .str.lower()
        .str.strip()
        .str.replace('[^a-z\\s]', '', regex=True) # remove anything that isn't text or spaces
        .str.replace('\\s{2,}', ' ', regex=True) # replace 2+ spaces with single space
    )

    chat_actions = [
        'reacted to your message',
        'https'
    ]

    # remove chat actions which aren't part of the conversation
    dataframe = dataframe[
        ~dataframe[cleaned_col]
        .str.contains(
            '|'.join(chat_actions)
        )
    ]

    if rm_stopwords:
        # remove gensim and custom stopwords
        dataframe.loc[:, cleaned_col] = (
            dataframe[cleaned_col]
            .apply(gs.parsing.preprocessing.remove_stopwords)
            .apply(remove_custom_stopwords, args=(custom_stopwords,))
            .str.strip()
            .str.replace('\\s{2,}', ' ', regex=True)
        )

    return dataframe

We can again look at the resulting dataframe.

chat_data = lda_preprocess(chat_data, 'content', 'clean_content')
chat_data.head(3)
sender_name timestamp content clean_content
6328 Harry Zhong 2019-04-08 11:23:07.519 You created the group. created group
6327 Harry Zhong 2019-04-08 11:23:08.454 Hoi hoi
6326 Dhruv Jobanputra 2019-04-08 11:23:19.718 Hoi hoi

Next we’ll move on to separating the chat log into conversations, since LDA assumes our complete body of text is composed of a number of documents. There are a number of ways to split a chat log into sections, such as dividing it into chunks of fixed length. However, ideally each document would have a primary topic that LDA can pick up. Thus, I settled on grouping messages based on the length of time between them: if the gap between two messages exceeds a certain threshold, those messages belong to different conversations. I chose the cutoff to be 10 minutes using the educated guess methodology.

Implementing this logic in pandas is fairly simple; one approach is to:

  1. Create a column with the difference between the current row and the previous row.
  2. Create a boolean column indicating if the time difference is greater than the cutoff.
  3. Count the cumulative number of True rows as the conversation number.
  4. Group by conversation number and concatenate the message content column.
  5. Convert conversations from dataframe to list.

This approach is implemented in the function below.

def lda_getdocs(dataframe: pd.DataFrame, content_col: str, ts_col: str, conv_cutoff: int=600):
    # calculate difference between each message
    dataframe['time_diff'] = (
        dataframe[ts_col]
        .diff()
        .fillna(pd.Timedelta(seconds=0))
    )
    dataframe['time_diff'] = (
        dataframe['time_diff']
        .dt.total_seconds()
    )

    # group dataframe into different conversations based on cutoff value
    dataframe['new_conv'] = dataframe['time_diff'] > conv_cutoff
    dataframe['conv_num'] = 'Conv ' + (
        dataframe['new_conv']
        .cumsum()
        .astype(str)
    )

    # join together messages in the same conversation
    conversations = (
        dataframe
        .groupby('conv_num')
        [content_col]
        .apply(lambda x: ' '.join(map(str, x)))
        .str.strip()
        .str.replace('\\s{2,}', ' ', regex=True)
        .reset_index()
    )
    conversations = conversations[conversations[content_col] != '']

    documents = conversations[content_col].tolist()

    return documents

We can then apply the function to our dataset and view the new columns that were created in our dataframe.

documents = lda_getdocs(chat_data, 'clean_content', 'timestamp')
chat_data.head(3)
sender_name timestamp content clean_content time_diff new_conv conv_num
6328 Harry Zhong 2019-04-08 11:23:07.519 You created the group. created group 0.000 False Conv 0
6327 Harry Zhong 2019-04-08 11:23:08.454 Hoi hoi 0.935 False Conv 0
6326 Dhruv Jobanputra 2019-04-08 11:23:19.718 Hoi hoi 11.264 False Conv 0

And view our documents.

for doc in documents[0:3]:
    print(doc[0:50] + '...')
created group hoi hoi premium tech support named g...
episodes grand tour season episode season episode ...
cute dog sohails shitzoo pubg visited nepam wild a...

Latent Dirichlet Allocation

LDA is an unsupervised machine learning algorithm used for topic discovery within a collection of documents, where a topic is considered to be a distribution over words. LDA models the creation of documents (in this case, chat members writing messages in a conversation) as a generative process involving two Dirichlet distributions and two multinomial distributions (each with one trial).

Dirichlet Distribution

The Dirichlet distribution is a continuous multivariate distribution. LDA typically uses the symmetric case of the Dirichlet distribution with a sparse parameter, defined by the probability density function:

\[ f(x_1,...,x_K;\alpha)=\frac{\Gamma(\alpha K)}{\Gamma(\alpha)^K}\prod_{i=1}^{K}x_i^{\alpha-1} \]

Where \(\Gamma(z)\) is the Gamma function, and the following constraints are satisfied:

  1. \(\alpha <1\) (sparse condition).
  2. \(K\ge 2\)
  3. \(\sum_{i=1}^K x_i=1\)
  4. \(x_i\in [0,1]\)

We can think of a Dirichlet distribution as a distribution over multinomial distributions, where \(x_1,...,x_K\) are the probabilities assigned to each of the \(K\) outcomes of a multinomial. The sparse condition (\(\alpha <1\)) means the probability density is concentrated near the corners and edges of the simplex, i.e. near multinomial distributions that place most of their mass on only a few outcomes. Constraints 2 to 4 simply ensure that the domain of the Dirichlet PDF is the set of valid probability vectors for a multinomial distribution.

This distribution can intuitively be understood as a topic distribution for documents, as documents will typically have very few primary topics, and a sparse Dirichlet distribution will assign a higher probability to multinomial distributions that are heavily weighted to a few outcomes (topics). This works similarly for word distributions for topics, as topics will typically be defined by a handful of words.
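
To get an intuition for this, we can draw a few samples from symmetric Dirichlet distributions with different values of \(\alpha\) and see how concentrated they are. The snippet below is a quick sketch using numpy; it isn’t part of the analysis pipeline, and the choices of \(K\) and \(\alpha\) are arbitrary.

import numpy as np

rng = np.random.default_rng(2687)
K = 5  # pretend there are 5 topics

for alpha in [0.1, 1.0, 10.0]:
    # one draw from a symmetric Dirichlet is itself a multinomial distribution over the K topics
    sample = rng.dirichlet([alpha] * K)
    print(f'alpha = {alpha}: {sample.round(2)}')

With \(\alpha=0.1\) most of the probability mass lands on one or two topics, while \(\alpha=10\) produces samples close to the uniform distribution, which is why sparse parameters suit documents with only a few primary topics.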

Document Generating Process

Now that we understand the relevant type of Dirichlet distribution, we’ll assign some parameters which define how documents are generated under LDA. Let:

  • \(M\) be the number of documents.
  • \(N_i\) be the number of words in document \(i\in \{1,...,M\}\).
  • \(K\) be the total number of topics across the \(M\) documents.
  • \(\alpha\) be the symmetric parameter of the Dirichlet distribution for topics within documents.
  • \(\beta\) be the symmetric parameter of the Dirichlet distribution for words within topics.
  • \(\theta_i\) be the multinomial topic distribution for document \(i\in \{1,...,M\}\).
  • \(\phi_k\) be the multinomial word distribution for topic \(k\in \{1,...,K\}\)

So we have \(M\) documents which each have a length of \(N_i\). To generate documents, we’ll first sample \(\phi_k\sim Dir(\beta)\) for all \(k\). This gives us the multinomial word distribution for each of the \(K\) topics. Then, for each document \(i\):

  1. Sample \(\theta_i\sim Dir(\alpha)\). This gives us the multinomial topic distribution for document \(i\).
  2. Generate word \(j\in \{1,...,N_i\}\) in document \(i\) by:
    • Sampling a topic from \(\theta_i\).
    • Sampling a word from the corresponding \(\phi_k\) where \(k\) is the topic chosen previously.
  3. Repeat for all \(i\).
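
To make this concrete, below is a minimal sketch of the generative process in numpy, using a toy vocabulary and arbitrary values for \(K\), \(\alpha\), \(\beta\) and the document lengths. It only illustrates the model LDA assumes; it has nothing to do with how gensim fits the model later.

import numpy as np

rng = np.random.default_rng(2687)

vocab = ['game', 'play', 'car', 'buy', 'work', 'uni']  # toy vocabulary
K, M, alpha, beta = 2, 3, 0.5, 0.5                     # arbitrary hyperparameters
N = [6, 4, 5]                                          # words per document

# word distribution for each topic: phi_k ~ Dir(beta)
phi = rng.dirichlet([beta] * len(vocab), size=K)

simulated_docs = []
for i in range(M):
    # topic distribution for document i: theta_i ~ Dir(alpha)
    theta = rng.dirichlet([alpha] * K)
    words = []
    for _ in range(N[i]):
        topic = rng.choice(K, p=theta)          # sample a topic for this word
        word = rng.choice(vocab, p=phi[topic])  # sample a word from that topic
        words.append(word)
    simulated_docs.append(' '.join(words))

print(simulated_docs)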

Training

The goal of model training is to find the \(K\) word distributions and \(M\) topic distributions which maximise the likelihood of the model generating the training data, given the hyperparameters \(\alpha\), \(\beta\) and \(K\). There are various methods to achieve this; we’ll be using gensim.models.ldamodel.LdaModel, which implements online learning.
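
Online learning here means the model can be updated incrementally with new batches of documents rather than being re-trained from scratch. The sketch below is only an illustration of that idea using a made-up toy corpus; it isn’t part of the analysis.

import gensim as gs

# toy batches of tokenised documents, standing in for a stream of new data
batches = [
    [['play', 'game', 'pubg'], ['watch', 'movie', 'fun']],
    [['buy', 'car', 'money'], ['work', 'uni', 'job']],
]
dictionary = gs.corpora.Dictionary([doc for batch in batches for doc in batch])

# create the model untrained, then feed it one batch at a time
lda = gs.models.ldamodel.LdaModel(num_topics=2, id2word=dictionary)
for batch in batches:
    bow = [dictionary.doc2bow(doc) for doc in batch]
    lda.update(bow)  # update the topic-word distributions with the new batch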

Coherence Score

Next, to evaluate the performance of trained models we’ll use UMass coherence. The idea behind UMass coherence is that words from the same topic should appear together in documents more often than not, so we take each word pair from the top \(N\) (gensim sets this to 20 by default) words of each topic and compare the probability of the pair appearing together with the probability of one of the words appearing on its own. UMass coherence for a topic is defined by the formula below.

\[ UMass=\frac{2}{N(N-1)}\sum_{i=1}^N\sum_{j=1}^{i-1}\text{log}(\frac{\text{P}(w_i,w_j)+\epsilon}{\text{P}(w_j)}) \]

Where:

  • \((w_i,w_j)\) is a word pair.
  • \(\text{P}()\) is the probability of the word/pair occurring calculated using the frequency observed in the training documents.
  • \(\epsilon=1\) to avoid \(\text{log}(0)\).
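
To make the formula concrete, here’s a rough sketch of how the UMass score for a single topic could be computed directly from the documents, with \(\text{P}()\) estimated from document frequencies. gensim’s CoherenceModel handles this for us below, so this is purely illustrative, and the list of topic words passed in is assumed to come from a trained topic.

import numpy as np

def umass_topic_coherence(topic_words: list, documents: list, epsilon: float = 1) -> float:
    # treat each document as a set of words for co-occurrence counting
    docs = [set(doc.split()) for doc in documents]

    def p(*words):
        # fraction of documents containing all of the given words
        return sum(all(w in doc for w in words) for doc in docs) / len(docs)

    n = len(topic_words)
    score = 0.0
    for i in range(1, n):
        for j in range(i):
            # assumes topic_words[j] appears in at least one document
            score += np.log((p(topic_words[i], topic_words[j]) + epsilon) / p(topic_words[j]))

    return 2 / (n * (n - 1)) * score

For example, umass_topic_coherence(['game', 'play', 'pubg'], documents) would score a hypothetical topic made up of those three words against our conversations.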

Implementation

We can then create a class that uses gensim to train the model, show the discovered topics and calculate the coherence score as an evaluation metric. For simplicity, we’ll leave \(\alpha\) and \(\beta\) at the default value in LdaModel, which is 1/num_topics; this leaves us with only num_topics as a tunable hyperparameter.

class fbm_lda:
    def __init__(self, documents):
        self.documents = documents
        # tokenise each document, then build the word-id dictionary and bag-of-words corpus
        self.texts = [doc.split() for doc in documents]
        self.dictionary = gs.corpora.Dictionary(self.texts)
        self.corpus = [self.dictionary.doc2bow(text) for text in self.texts]
    
    def train_lda(self, num_topics, passes=10, random_state=2687):
        self.num_topics = num_topics

        self.lda_model = gs.models.ldamodel.LdaModel(
            corpus=self.corpus,
            num_topics=num_topics,
            id2word=self.dictionary,
            passes=passes,
            random_state=random_state
        )

        return self.lda_model

    def print_topics(self, num_words=5):
        for topic in self.lda_model.print_topics(num_topics=self.num_topics, num_words=num_words):
            print(topic)

    def get_coherence(self, coherence_type='u_mass'):
        coherence = gs.models.CoherenceModel(
            model=self.lda_model,
            texts=self.texts,
            dictionary=self.dictionary,
            coherence=coherence_type
        )

        return coherence.get_coherence()

This allows us to train an LDA model using our documents given a number of topics and calculate its coherence score.

model = fbm_lda(documents)
model.train_lda(5)
model.get_coherence()
-2.206920240598983

Hyperparameter Tuning

Naturally, the next step is to test a range of values for num_topics and evaluate which one gives the best UMass coherence score. This is easily done by extending the LDA class we created previously.

Note

As someone who mainly uses R at work, extending a class with additional functionality like this is pretty neat.

import numpy as np

class tune_lda(fbm_lda):
    def tune(self, n_start, n_stop, step, sort: bool=False):
        n_topics_values = np.arange(n_start, n_stop, step)

        tuning_results = pd.DataFrame(
            columns=[
                'n_topics',
                'umass_coherence'
            ]
        )

        for index in range(len(n_topics_values)):
            n_topics = n_topics_values[index]
            self.train_lda(num_topics=n_topics)
            umass_coherence = self.get_coherence()

            tuning_results.loc[index] = (
                [n_topics] + 
                [umass_coherence]
            )

        if sort:
            tuning_results = tuning_results.sort_values('umass_coherence', ascending=False)

        return tuning_results

The new class can then easily be used to test values 2 to 20 for num_topics.

tuning = tune_lda(documents)
tuning_results = tuning.tune(2, 21, 1)
tuning_results.plot(x='n_topics')

Results

Based on the hyperparameter tuning plot, we can observe that UMass coherence tends to decrease (lower performance) as the number of topics increases. While 2 to 4 topics gives the best performance as measured by UMass coherence, I found that 7 topics produces the most sensible topics based on my knowledge of the source data.

model.train_lda(7)
model.print_topics(10)
(0, '0.029*"play" + 0.018*"game" + 0.012*"playing" + 0.010*"dog" + 0.010*"pubg" + 0.008*"watch" + 0.006*"fun" + 0.006*"games" + 0.006*"unlucky" + 0.005*"movie"')
(1, '0.016*"car" + 0.014*"buy" + 0.011*"house" + 0.011*"drive" + 0.010*"th" + 0.010*"money" + 0.008*"stonks" + 0.007*"rich" + 0.006*"dbd" + 0.005*"new"')
(2, '0.023*"phone" + 0.023*"turn" + 0.022*"test" + 0.018*"laptop" + 0.015*"pc" + 0.012*"mac" + 0.010*"apple" + 0.009*"iphone" + 0.009*"gb" + 0.009*"windows"')
(3, '0.009*"tennis" + 0.009*"play" + 0.007*"fat" + 0.007*"guy" + 0.007*"gym" + 0.006*"win" + 0.006*"eat" + 0.006*"beat" + 0.006*"team" + 0.005*"point"')
(4, '0.011*"girls" + 0.010*"girl" + 0.007*"tell" + 0.006*"haram" + 0.006*"talk" + 0.005*"pic" + 0.005*"longer" + 0.005*"available" + 0.005*"mum" + 0.005*"shes"')
(5, '0.014*"engineer" + 0.014*"group" + 0.013*"premium" + 0.013*"nickname" + 0.012*"windmill" + 0.012*"support" + 0.012*"scammer" + 0.011*"racquet" + 0.010*"tech" + 0.009*"changed"')
(6, '0.013*"work" + 0.007*"uni" + 0.005*"data" + 0.005*"money" + 0.005*"job" + 0.005*"hard" + 0.004*"unit" + 0.003*"science" + 0.003*"long" + 0.003*"working"')

Based on the word distributions, the topics could be named as follows:

  • Topic 0: Gaming
  • Topic 1: Cars
  • Topic 2: Technology
  • Topic 3: Fitness
  • Topic 4: Relationships
  • Topic 5: Chat actions
  • Topic 6: Study and work

These topics are pretty unsurprising given the chat participants (18-23 year old dudes); however, it’s reassuring that the results produced by LDA are as expected. As a final bit of analysis, we can plot the topic distributions over documents (conversations) to visualise how topics changed over time.

def lda_plot_topics(model: fbm_lda):
    # topic probabilities for every document (conversation)
    document_topics = model.lda_model.get_document_topics(model.corpus, minimum_probability=0)
    topic_dist_df = pd.DataFrame(document_topics)
    # each cell is a (topic_id, probability) tuple, keep only the probability
    topic_dist_df = topic_dist_df.map(lambda x: x[1])

    topics = [
        'Gaming', 
        'Cars', 
        'Technology', 
        'Fitness', 
        'Relationships', 
        'Chat actions', 
        'Study and work'
    ]
    plot = (
        topic_dist_df
        .plot(
            kind='bar',
            xticks=[],
            title=['', '', '', '', '', '', ''],
            subplots=True,
            sharey=True
        )
    )

    # label each subplot with its topic name
    plot = plot.flat
    for i, p in enumerate(plot):
        p.legend([topics[i]], bbox_to_anchor=(1, 1))

    return plot

lda_plot_topics(model)