Facebook Messenger Data Part 1: Latent Dirichlet Allocation
LDA
Python
Identification and classification of topics in Facebook Messenger conversations using Latent Dirichlet Allocation (LDA).
Author
Harry Zhong
Published
January 5, 2025
Introduction
After analysing my Spotify data, I thought it would be interesting to investigate another dataset generated by my online activity: Facebook Messenger chat logs. However, since chat logs are composed of text data we’ll have to learn some basic natural language processing techniques to gain any meaningful insights from this dataset.
This is part 1 of 2 of my analysis on my Facebook Messenger conversations, where I’ll use latent Dirichlet allocation to (hopefully) identify topics within conversations.
Data Extraction
First we’ll need to load the data, which Facebook provides as a series of json files. The raw data is structured like the sample below.
{
"participants": [
{
"name": "Alice"
},
{
"name": "Bob"
}
],
"messages": [
{
"sender_name": "Alice",
"timestamp_ms": 1710762618535,
"content": "Hey, Bob! Did you see the new movie?",
"is_geoblocked_for_viewer": false
},
{
"sender_name": "Bob",
"timestamp_ms": 1710762591958,
"content": "Yeah, I just watched it last night. What did you think of it?",
"is_geoblocked_for_viewer": false
},
{
"sender_name": "Bob",
"timestamp_ms": 1710762583886,
"content": "I loved the plot twist at the end. It was so unexpected!",
"is_geoblocked_for_viewer": false
},
{
"sender_name": "Alice",
"timestamp_ms": 1710749703948,
"content": "I know, right? I was totally surprised when it happened.",
"is_geoblocked_for_viewer": false
},
{
"sender_name": "Bob",
"timestamp_ms": 1710749659963,
"content": "What did you think of the special effects?",
"reactions": [
{
"reaction": "\ud83d\ude04",
"actor": "Alice"
}
],
"is_geoblocked_for_viewer": false
},
{
"sender_name": "Bob",
"timestamp_ms": 1710749648743,
"content": "I thought they were really well done. The explosions were so cool!",
"is_geoblocked_for_viewer": false
},
{
"sender_name": "Alice",
"timestamp_ms": 1710749547622,
"content": "Definitely! I loved the action scenes.",
"is_geoblocked_for_viewer": false
},
{
"sender_name": "Alice",
"timestamp_ms": 1710749516271,
"content": "I'm so glad we watched it together. It was a lot more fun that way.",
"is_geoblocked_for_viewer": false
},
{
"sender_name": "Bob",
"timestamp_ms": 1710749502826,
"content": "Definitely! We should do it again soon.",
"reactions": [
{
"reaction": "\ud83d\ude04",
"actor": "Alice"
}
],
"is_geoblocked_for_viewer": false
},
{
"sender_name": "Alice",
"timestamp_ms": 1710749462947,
"content": "Sounds like a plan to me!",
"is_geoblocked_for_viewer": false
}
]
}
To work with this data, we’ll need to extract the relevant information from each json file and store it in a dataframe; in this case, only the messages property is required. To do this, we’ll first import the libraries we need.
import os
import json
import pandas as pd
import re
import gensim as gs
Then we can define a function which takes a directory of json files as input and collects the messages property of each file into a single pandas dataframe.
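A minimal sketch of what such a function could look like is below; the function name load_messages and the timestamp handling are assumptions for illustration rather than my exact implementation.

def load_messages(directory: str) -> pd.DataFrame:
    # collect the messages property of every json file in the chat folder
    messages = []
    for file in os.listdir(directory):
        if file.endswith('.json'):
            with open(os.path.join(directory, file), 'r') as f:
                data = json.load(f)
            messages.extend(data['messages'])
    dataframe = pd.DataFrame(messages)
    # convert the millisecond timestamps to datetimes and sort chronologically
    dataframe['timestamp'] = pd.to_datetime(dataframe['timestamp_ms'], unit='ms')
    return dataframe.sort_values('timestamp').reset_index(drop=True)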
My analysis will focus on a group chat called “The Office”, which should have the most data, spanning 2019 to early 2024 when Facebook sent me the data. Using the function we just made, we can import the chat logs and look at the resulting dataframe.
Now that we have the raw data, we can move on to additional cleaning which is required for LDA:
Convert all text to lower case.
Remove non-text characters.
Remove messages that aren’t part of a conversation (reactions and links).
Remove stop words (words which can belong to any topic, such as conjunctions and abbreviations).
Group messages into conversations which will be used as documents for LDA.
Points 1 to 3 are fairly straightforward to implement using pandas. However, point 4 brings us to our first usage of gensim, a Python topic modelling library, where it’s used to remove generic stop words from the chat log. Additional stop words specific to the chat are then defined and removed on top of the words from gensim.
The code used to implement points 1 to 4 is shown below.
# import custom stopwords
from modules.preprocessing import custom_stopwords

def remove_custom_stopwords(document: str, stopwords: list) -> str:
    for word in stopwords:
        pattern = r'\b' + word + r'\b'
        document = re.sub(pattern, '', document).replace('  ', ' ')
    return document

def lda_preprocess(dataframe: pd.DataFrame, content_col: str, cleaned_col: str, rm_stopwords: bool = True) -> pd.DataFrame:
    dataframe[cleaned_col] = (
        dataframe[content_col]
        .str.lower()
        .str.strip()
        .str.replace('[^a-z\\s]', '', regex=True)  # remove anything that isn't text or spaces
        .str.replace('\\s{2,}', ' ', regex=True)   # replace 2+ spaces with a single space
    )

    # remove chat actions which aren't part of the conversation
    chat_actions = [
        'reacted to your message',
        'https'
    ]
    dataframe = dataframe[
        ~dataframe[cleaned_col]
        .str.contains('|'.join(chat_actions))
    ]

    if rm_stopwords:
        # remove gensim and custom stopwords
        dataframe.loc[:, cleaned_col] = (
            dataframe[cleaned_col]
            .apply(gs.parsing.preprocessing.remove_stopwords)
            .apply(remove_custom_stopwords, args=(custom_stopwords,))
            .str.strip()
            .str.replace('\\s{2,}', ' ', regex=True)
        )

    return dataframe
Next we’ll move on to separating the chat log into conversations, since LDA assumes our complete body of text is composed of a number of documents. There are several ways to split a chat log into sections, such as dividing it into chunks of fixed text length. However, ideally each document would have a primary topic which is picked up by LDA. Thus, I settled on grouping messages based on the length of time between them: if the gap between two messages exceeds a certain threshold, they belong to different conversations. I chose the cutoff to be 10 minutes using the educated guess methodology.
Implementing this logic in pandas is fairly simple; one approach is to:
Create a column with the difference between the current row and the previous row.
Create a boolean column indicating if the time difference is greater than the cutoff.
Count the cumulative number of True rows as the conversation number.
Group by conversation number and concatenate the message content column.
Convert conversations from dataframe to list.
This approach is implemented in the function below.
def lda_getdocs(dataframe: pd.DataFrame, content_col: str, ts_col: str, conv_cutoff: int = 600):
    # calculate the time difference between each message
    dataframe['time_diff'] = (
        dataframe[ts_col]
        .diff()
        .fillna(pd.Timedelta(seconds=0))
    )
    dataframe['time_diff'] = (
        dataframe['time_diff']
        .dt.total_seconds()
    )

    # group the dataframe into different conversations based on the cutoff value
    dataframe['new_conv'] = dataframe['time_diff'] > conv_cutoff
    dataframe['conv_num'] = 'Conv ' + (
        dataframe['new_conv']
        .cumsum()
        .astype(str)
    )

    # join together messages in the same conversation
    conversations = (
        dataframe
        .groupby('conv_num')
        [content_col]
        .apply(lambda x: ' '.join(map(str, x)))
        .str.strip()
        .str.replace('\\s{2,}', ' ', regex=True)
        .reset_index()
    )
    conversations = conversations[conversations[content_col] != '']
    documents = conversations[content_col].tolist()

    return documents
We can then apply the function to our dataset, check the new columns that were created in our dataframe, and look at a sample of the resulting documents.
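For reference, the preprocessing and document-splitting functions could be chained roughly like this; the column names content, content_cleaned and timestamp are assumptions about the actual dataframe (matching the loader sketch above).

# clean the raw messages, then split them into conversation documents
df = lda_preprocess(df, content_col='content', cleaned_col='content_cleaned')
documents = lda_getdocs(df, content_col='content_cleaned', ts_col='timestamp', conv_cutoff=600)

print(len(documents))  # number of conversations found
print(documents[:3])   # peek at the first few documents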
created group hoi hoi premium tech support named g...
episodes grand tour season episode season episode ...
cute dog sohails shitzoo pubg visited nepam wild a...
Latent Dirichlet Allocation
LDA is an unsupervised machine learning algorithm used for topic discovery within a collection of documents, where a topic is considered to be a distribution over words. LDA models the creation of documents (conversations between chat members, in this case) as a generative process involving two Dirichlet distributions and two multinomial distributions (each with one trial).
Dirichlet Distribution
The Dirichlet distribution is a continuous multivariate distribution. LDA typically uses the symmetric case of the Dirichlet distribution, where the parameter is sparse, defined by the probability density function:
\[
f(x_1,...,x_K;\alpha)=\frac{\Gamma(\alpha K)}{\Gamma(\alpha)^K}\prod_{i=1}^K x_i^{\alpha -1}
\]
where \(\Gamma(z)\) is the Gamma function, and the following constraints are satisfied:
\(\alpha <1\) (sparse condition).
\(K\ge 2\).
\(\sum_{i=1}^K x_i=1\).
\(x_i\in [0,1]\).
We can think of a Dirichlet distribution as a distribution over multinomial distributions, where \(x_1,...,x_K\) are the outcome probabilities of a multinomial distribution. The sparse condition (\(\alpha <1\)) means the probability density is concentrated near the corners of the domain, i.e. on multinomial distributions whose mass sits on only a few outcomes. Constraints 2 to 4 ensure that the domain of the Dirichlet PDF matches the set of valid multinomial probability vectors.
This distribution can intuitively be understood as a topic distribution for documents, as documents will typically have very few primary topics, and a sparse Dirichlet distribution will assign a higher probability to multinomial distributions that are heavily weighted to a few outcomes (topics). This works similarly for word distributions for topics, as topics will typically be defined by a handful of words.
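To make the sparsity intuition concrete, here's a small numpy sketch (with arbitrary parameter values) comparing samples from a sparse and a non-sparse symmetric Dirichlet.

import numpy as np

rng = np.random.default_rng(0)
K = 5  # pretend there are 5 topics

# each row is one sampled multinomial distribution over the K topics
sparse_samples = rng.dirichlet([0.1] * K, size=3)  # alpha < 1: sparse
flat_samples = rng.dirichlet([10.0] * K, size=3)   # alpha > 1: spread out

print(np.round(sparse_samples, 2))  # most of the mass sits on one or two topics
print(np.round(flat_samples, 2))    # mass is spread fairly evenly across the topics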
Document Generating Process
Now that we understand the relevant type of Dirichlet distribution, we'll assign some parameters which define how documents are generated under LDA. Let:
\(M\) be the number of documents.
\(N_i\) be the number of words in document \(i\in \{1,...,M\}\).
\(K\) be the total number of topics across the \(M\) documents.
\(\alpha\) be the symmetric parameter of the Dirichlet distribution for topics within documents.
\(\beta\) be the symmetric parameter of the Dirichlet distribution for words within topics.
\(\theta_i\) be the multinomial topic distribution for document \(i\in \{1,...,M\}\).
\(\phi_k\) be the multinomial word distribution for topic \(k\in \{1,...,K\}\).
So we have \(M\) documents, each with length \(N_i\). To generate documents, we'll first sample \(\phi_k\sim Dir(\beta)\) for all \(k\). This gives us the multinomial word distribution for each of the \(K\) topics. Then, for each document \(i\) (a code sketch of this process follows the list below):
Sample \(\theta_i\sim Dir(\alpha)\). This gives us the multinomial topic distribution for document \(i\).
Generate word \(j\in \{1,...,N_i\}\) in document \(i\) by:
Sampling a topic from \(\theta_i\).
Sampling a word from the corresponding \(\phi_k\) where \(k\) is the topic chosen previously.
Repeat for all \(i\).
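To make this concrete, here's a minimal numpy sketch of the generative process; the vocabulary, hyperparameter values and document lengths are made up purely for illustration.

import numpy as np

rng = np.random.default_rng(2687)

vocab = ['movie', 'plot', 'car', 'engine', 'gym', 'protein']  # toy vocabulary
V, K, M = len(vocab), 3, 4       # vocabulary size, number of topics, number of documents
alpha, beta = 0.1, 0.1           # sparse symmetric Dirichlet parameters
N = [8, 6, 10, 7]                # words per document

# one word distribution per topic: phi_k ~ Dir(beta)
phi = rng.dirichlet([beta] * V, size=K)

toy_documents = []
for i in range(M):
    theta_i = rng.dirichlet([alpha] * K)   # topic distribution for document i
    words = []
    for _ in range(N[i]):
        z = rng.choice(K, p=theta_i)       # sample a topic for this word
        w = rng.choice(V, p=phi[z])        # sample a word from that topic
        words.append(vocab[w])
    toy_documents.append(' '.join(words))

print(toy_documents)

Because both \(\alpha\) and \(\beta\) are sparse here, each toy document tends to repeat words drawn from only one or two topics.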
Training
The goal of model training is to find the \(K\) word distributions and \(M\) topic distributions that maximise the likelihood of the model generating the training data, given the hyperparameters \(\alpha\), \(\beta\) and \(K\). There are various methods to achieve this; we'll be using gensim.models.ldamodel.LdaModel, which implements online learning.
Coherence Score
Next, to evaluate the performance of trained models we'll use UMass coherence. The idea behind UMass coherence is that words from the same topic should appear together in documents more often than not, so we take each word pair from the top \(N\) words of each topic (gensim sets \(N=20\) by default) and compare the probability of the pair appearing together against the probability of the higher-ranked word appearing on its own. UMass coherence for a topic is defined by the formula below (a toy calculation follows these definitions).
\[
C_{\text{UMass}}=\frac{2}{N(N-1)}\sum_{i=2}^{N}\sum_{j=1}^{i-1}\log\frac{\text{P}(w_i,w_j)+\epsilon}{\text{P}(w_j)}
\]
\(w_1,...,w_N\) are the top \(N\) words of the topic, ordered from most to least probable.
\(\text{P}()\) is the probability of the word/pair occurring, calculated using the frequency observed in the training documents.
\(\epsilon=1\) to avoid \(\text{log}(0)\).
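To see how the formula behaves, here's a toy calculation over three made-up documents and a hypothetical topic; it uses raw document counts as a stand-in for the probabilities, which is how the measure is typically computed.

import math

# three made-up documents (as sets of words) and a hypothetical topic's top words,
# ordered from most to least probable under the topic
docs = [
    {'movie', 'plot', 'twist', 'night'},
    {'movie', 'effects', 'explosions'},
    {'movie', 'plot', 'action'},
]
top_words = ['movie', 'plot', 'twist']

def doc_count(*words):
    # number of documents containing all of the given words
    return sum(all(w in d for w in words) for d in docs)

eps = 1
scores = []
for m in range(1, len(top_words)):
    for l in range(m):
        w_m, w_l = top_words[m], top_words[l]
        # pairwise score: log((count of the pair + eps) / count of the higher-ranked word)
        scores.append(math.log((doc_count(w_m, w_l) + eps) / doc_count(w_l)))

print(sum(scores) / len(scores))  # average pairwise score, i.e. the topic's coherence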
Implementation
We can then create a class that uses gensim to train the model, show the discovered topics and calculate the coherence score as an evaluation metric. For simplicity, we'll leave \(\alpha\) and \(\beta\) at their default value in LdaModel, which is 1/num_topics; this leaves us with only num_topics as a tunable hyperparameter.
class fbm_lda:
    def __init__(self, documents):
        self.documents = documents
        self.texts = [doc.split() for doc in documents]
        self.dictionary = gs.corpora.Dictionary(self.texts)
        self.corpus = [self.dictionary.doc2bow(text) for text in self.texts]

    def train_lda(self, num_topics, passes=10, random_state=2687):
        self.num_topics = num_topics
        self.lda_model = gs.models.ldamodel.LdaModel(
            corpus=self.corpus,
            num_topics=num_topics,
            id2word=self.dictionary,
            passes=passes,
            random_state=random_state
        )
        return self.lda_model

    def print_topics(self, num_words=5):
        for topic in self.lda_model.print_topics(num_topics=self.num_topics, num_words=num_words):
            print(topic)

    def get_coherence(self, coherence_type='u_mass'):
        coherence = gs.models.CoherenceModel(
            model=self.lda_model,
            texts=self.texts,
            dictionary=self.dictionary,
            coherence=coherence_type
        )
        return coherence.get_coherence()
This allows us to train an LDA model on our documents for a given number of topics and calculate its coherence score.
model = fbm_lda(documents)
model.train_lda(5)
model.get_coherence()
-2.206920240598983
Hyperparameter Tuning
Naturally, the next step is to test a range of values for num_topics and evaluate which one gives the best UMass coherence score. This is easily done by extending the LDA class we created previously, as sketched below.
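The sketch below shows one way such an extension could look; the class name fbm_lda_tune, the method names and the plotting details are assumptions for illustration rather than my exact implementation.

import matplotlib.pyplot as plt

class fbm_lda_tune(fbm_lda):
    def tune_num_topics(self, topic_range):
        # train one model per candidate number of topics and record its coherence
        self.results = {}
        for k in topic_range:
            self.train_lda(k)
            self.results[k] = self.get_coherence()
        return self.results

    def plot_coherence(self):
        # plot UMass coherence against the number of topics
        plt.plot(list(self.results.keys()), list(self.results.values()), marker='o')
        plt.xlabel('Number of topics')
        plt.ylabel('UMass coherence')
        plt.show()

Something like fbm_lda_tune(documents) followed by tune_num_topics(range(2, 16)) and plot_coherence() would then produce a tuning plot of the kind discussed below.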
Note
As someone who mainly uses R at work, extending a class with additional functionality like this is pretty neat.
Based on the hyperparameter tuning plot, we can observe that UMass coherence tends to decrease (lower performance) as the number of topics increases. While 2 to 4 topics gives the best performance as measured by UMass coherence, I found that 7 topics produces the most sensible topics based on my knowledge of the source data.
Based on the word distributions, the topics could be named as follows:
Topic 0: Gaming
Topic 1: Cars
Topic 2: Technology
Topic 3: Fitness
Topic 4: Relationships
Topic 5: Chat actions
Topic 6: Study and work
These topics are pretty unsurprising given the chat participants (18-23 year old dudes), but it's reassuring that the results produced by LDA match expectations. As a final bit of analysis, we can plot the topic distributions over documents (conversations) to visualise how topics changed over time.
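The plot itself isn't reproduced here, but one way to pull the per-conversation topic mixtures out of gensim and chart them is sketched below; the 7-topic model and the stacked-area styling are assumptions based on the discussion above.

import numpy as np
import matplotlib.pyplot as plt

model = fbm_lda(documents)
model.train_lda(7)

# per-conversation topic distributions as an (n_documents x n_topics) array
topic_dist = np.array([
    [prob for _, prob in model.lda_model.get_document_topics(bow, minimum_probability=0)]
    for bow in model.corpus
])

# stacked area chart showing how the topic mixture shifts across conversations
plt.stackplot(range(len(topic_dist)), topic_dist.T,
              labels=[f'Topic {k}' for k in range(topic_dist.shape[1])])
plt.xlabel('Conversation number')
plt.ylabel('Topic proportion')
plt.legend(loc='upper right')
plt.show()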