Introduction

Recently, I had the opportunity to work on two topics that I really like: data analysis and human behavior. On the professional side, I began working with behavioral data from the Bees Customer App (Bees is part of AB InBev and operates in over 20 countries). On the personal development front, I’ve been specializing in Data and Decision Science through the Master’s program at Sirius. During this journey, I had the privilege of getting to know Jones Madruga, who served as an inspiration for this project when he shared the article I know what you did last session: clustering users with ML.

This analysis, conducted initially for Brazil and then for Mexico, Ecuador, and Peru, provides a comprehensive exploration of user behavior within mobile applications. The journey begins by addressing the fundamental question of how to identify and segment users effectively based on session data and clustering techniques. This investigation represents a critical step toward understanding user engagement, conversion rates, and overall app performance. By leveraging session-level metrics and behavioral variables, it’s possible to uncover valuable insights that can inform strategic decisions and drive app optimization.

Clustering and the K-Means Algorithm

Clustering is an unsupervised machine learning method: it uses unlabeled data and groups points into clusters based on their similarities. Points inside a cluster are more similar to each other than to points in other clusters.

Common applications include customer segmentation for marketing, fraud detection in finance, and patient grouping in healthcare. It helps organizations gain insights, improve decision-making, and enhance personalization efforts.

There are several types of clustering (source here):
- Centroid-based Clustering: it uses a pre-determined number of clusters (k) and assigns each data point to its closest centroid. K-Means is the most common algorithm of this type, and of clustering in general;
- Density-based Clustering: it groups points based on the density of points in a region; outlier points are not assigned to any cluster;
- Distribution-based Clustering: this approach assumes the data follow distributions such as Gaussian distributions, meaning that a point is more likely to belong to a cluster the closer it is to the cluster’s center;
- Hierarchical Clustering: points are grouped hierarchically into nested clusters, progressively forming larger groups;
Below is a depiction of how these distinct algorithms divide the data points:

Types of Clustering. Source: Google Developers

For some clustering models, the ideal number of clusters must be provided up front. This value can be estimated by different methods; the most common are the Elbow Method and the Silhouette Method, although other methods have also been widely used, as one can read here.

The K-Means Algorithm

As seen above, K-Means groups points based on their distance to k centroids, where k is chosen by the user.
In the example below, 5 centroids are moved iteratively until every point is assigned to the centroid at minimal distance (normally Euclidean distance). All points assigned to the same centroid form a cluster.

How K-Means Clustering Works. Source: Tech-Ni Simulator
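To make the mechanics concrete, below is a minimal sketch that runs scikit-learn’s K-Means on synthetic data (make_blobs is used here purely for illustration; the real analysis uses the session metrics described later):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 5 well-separated groups, just for illustration
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)

# Fit K-Means with k=5 and assign each point to a cluster
kmeans = KMeans(n_clusters=5, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final positions of the 5 centroids
print(labels[:10])              # cluster assigned to the first 10 points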

Why choose K-Means for this type of analysis?

Pros:

  • It is effective for large datasets (while other approaches, such as density-based and hierarchical methods, do not scale as well);
  • It’s simple to apply and doesn’t require long processing times or high costs;
  • The results are easy to interpret and explain;

Cons:

  • The number of clusters, k, must be chosen using a suitable method;
  • It is sensitive to outliers — this is an analysis one must run before training the model, especially with session data, and then decide: run the model with outliers or not? If you decide to keep the outliers, another clustering method may be a better fit. In my case, I decided to remove the outliers and look at them separately (a minimal filtering sketch follows this list).
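Below is a minimal sketch of that kind of outlier filtering, assuming a hypothetical numeric user-level DataFrame called df_users; rows farther than 3 standard deviations from the mean on any metric are set aside for separate analysis:

import numpy as np

# Z-score of every metric for every user (df_users is a hypothetical DataFrame)
z_scores = np.abs((df_users - df_users.mean()) / df_users.std())

# A user is flagged as an outlier if any metric exceeds 3 standard deviations
outlier_mask = (z_scores > 3).any(axis=1)

df_outliers = df_users[outlier_mask]    # analyzed separately
df_filtered = df_users[~outlier_mask]   # used to train the model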

Data Preprocessing

Data Sources

Currently, there are a couple of platforms that enable businesses to capture, manage, and route customer and app event data to various tools and destinations, including data lakes, analytics, and marketing services. Examples include Segment, used by Domino’s and Fox Sports; Snowplow Analytics, favored by Datadog and Strava; and mParticle, utilized by Airbnb and Spotify. These platforms facilitate efficient data utilization in data-driven business strategies.

Once the raw data has been generated by one of the platforms above, it can be organized logically using the medallion architecture, as seen below. The bronze layer is the raw, unprocessed data stage, where data is collected and stored. The silver layer involves data cleansing and transformation, making it more structured and usable. Finally, the gold layer represents the highest quality data, typically used for advanced analytics and reporting.

Medallion Architecture: Source

This kind of structure is needed for each of the event tables triggered from the app. Below is an example of the events triggered between the moment a user gets an ad impression and the moment the page is viewed:

A sequence of events: Source

The sequence of events can be called a Session. According to Google,
“a session is a group of user interactions with your website that take place within a given time frame”. Sessions usually start when the first event is triggered, but there are two methods for defining when a session ends: time-based expiration (at midnight, or after a number of minutes of inactivity) or campaign change (each distinct campaign accessed starts a new session).

A session can be summarized by a single table of grouped events, the session level table:

Session Level table with aggregated events: Source
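As a reference, here is a hedged pandas sketch of how raw events could be sessionized with a 30-minute inactivity rule and then aggregated into a session-level table; the column names (user_id, event_name, event_timestamp) are hypothetical:

import pandas as pd

# Raw events sorted per user and time (hypothetical columns)
events = events.sort_values(["user_id", "event_timestamp"])

# Start a new session after 30 minutes of inactivity (time-based expiration)
gap = events.groupby("user_id")["event_timestamp"].diff() > pd.Timedelta(minutes=30)
events["session_id"] = gap.groupby(events["user_id"]).cumsum()

# Aggregate the events of each session into one row
session_level = (
    events.groupby(["user_id", "session_id"])
          .agg(session_start=("event_timestamp", "min"),
               session_end=("event_timestamp", "max"),
               total_events=("event_name", "count"))
          .reset_index()
)
session_level["session_duration"] = session_level["session_end"] - session_level["session_start"]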

Session datasets grow with the number of users that access the application. Considering that users can trigger dozens or hundreds of events per session, and that an application can have thousands or millions of sessions daily, it’s easy to end up with a huge dataset. Koalas or Dask can be good library options to replace the ones that don’t handle this data volume. Another option is Apache Spark, an open-source framework widely used for large-scale data processing and distributed data analysis.
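For illustration, a minimal PySpark sketch of the same kind of aggregation might look like the following, assuming the raw events are already available as a table named events with a session_id column (both names hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("session-aggregation").getOrCreate()

# Hypothetical events table with user_id, session_id, event_name, event_timestamp
events = spark.table("events")

session_level = (
    events.groupBy("user_id", "session_id")
          .agg(F.min("event_timestamp").alias("session_start"),
               F.max("event_timestamp").alias("session_end"),
               F.count("event_name").alias("total_events"))
)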

Creating Aggregated Session-Level Metrics

The Session Level table can then be used to create a Fact Table containing metrics related to each session, identified by the primary key “Session ID”.

Session Fact Table: Source

As the goal of our analysis is clustering at the user level, the metrics need to be aggregated at this level of granularity, filtering for the specific time period chosen. One example of a table aggregated by User ID can be seen below:

User Level Table: Source
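A minimal pandas sketch of that aggregation, assuming a session fact table session_fact with hypothetical columns (user_id, session_id, session_duration, total_events, clicks, orders, revenue), could look like this:

# Aggregate session-level metrics to one row per user
user_level = (
    session_fact.groupby("user_id")
                .agg(sessions=("session_id", "nunique"),
                     total_events=("total_events", "sum"),
                     clicks=("clicks", "sum"),
                     orders=("orders", "sum"),
                     revenue=("revenue", "sum"),
                     avg_session_duration=("session_duration", "mean"))
                .reset_index()
)

# Derived behavior metric
user_level["clicks_per_session"] = user_level["clicks"] / user_level["sessions"]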

The User Level table contains all variables that can represent the most important metrics to understand user behavior, like the number of sessions, number of events, clicks per session, number of orders, etc. Some core metrics that help to correlate behavior and business are:

  • Behavior Metrics: Session Duration, Frequency of Use, Pages Viewed, App Actions (like Clicks), Conversion Rate, Time Spent on Specific Features, Engagement in Specific Features, Cart Abandonment Rate, etc;
  • Business Metrics: Revenue, Retention Rate, Customer Lifecycle, Average Ticket, Number of Orders, Gross Profit Margin, Customer Satisfaction Score (CSAT), etc.

Standardizing the Data

Why is it a good practice to standardize or normalize the data?

Clustering relies on measuring distances between data points, and when the features have varying scales, it can impact distance calculations and, subsequently, the formation of clusters. In the context of this analysis, several variables may exhibit different scales, like Clicks (dozens), Conversion Rate (decimal), or Revenue (thousands).

To address this issue, data scaling methods were employed, in particular the StandardScaler and MinMaxScaler techniques.

StandardScaler standardizes the data to a mean of 0 and a standard deviation of 1. For this transformation, one needs the sklearn.preprocessing library:

from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler
scaler = StandardScaler()

# Applying the transformation to the previous dataset
df_scaled = scaler.fit_transform(df_filtered)
df_scaled

Below is a chart that shows the data before and after standardization to a mean of 0 and a standard deviation of 1. Note the change in the scale of the metric:

Comparison between Original Data and after StandardScaler. Source: Author

The second method, MinMaxScaler, rescales the data to values between 0 and 1. Again, note the change in the scale of the metric:

from sklearn.preprocessing import MinMaxScaler

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Applying the transformation to the previous dataset
df_scaled = scaler.fit_transform(df_filtered)
df_scaled

Comparison between Original Data and after MinMaxScaler. Source: Author

Selecting Variables Using PCA and Correlation

The original dataset had 21 metric columns. Running a clustering with that many metrics can produce very poor clusters, besides adding extra training cost for variables that turn out to be redundant.
In order to reduce the number of variables in the dataset, some steps can be taken:

  • Eliminating variables that are less representative for explaining the variance in the dataset, using Principal Component Analysis (PCA);
  • Eliminating variables that are highly correlated with others, since they can be redundant for defining clusters — as K-Means uses distance to group points, highly correlated variables behave similarly;

Principal Component Analysis (PCA) shows how much of the dataset’s variance each component explains, helping to identify the most relevant metrics. In order to do this, one needs the sklearn.decomposition library and sklearn.preprocessing to scale the data:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Initialize PCA with one component per variable
pca = PCA()

# Apply PCA to the scaled data
df_pca = pca.fit_transform(df_scaled)

# Explained variance ratio for each component
explained_variance_ratio = pca.explained_variance_ratio_

# Cumulative explained variance ratio
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)

For the variables I’ve selected, 9 of them were responsible for explaining 69% of the cumulative variance, which led me from 21 to 9 variables.
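As a hedged illustration of how the PCA output above can support that decision, one can check how many components are needed to reach a chosen variance threshold and inspect the loadings of the original variables on those components (the 0.70 threshold below is illustrative, not the value used in the analysis):

import numpy as np
import pandas as pd

# Number of components needed to reach ~70% of cumulative explained variance
n_components = int(np.argmax(cumulative_variance_ratio >= 0.70)) + 1

# Loadings: contribution of each original variable to the leading components
loadings = pd.DataFrame(pca.components_[:n_components].T,
                        index=df_filtered.columns,
                        columns=[f"PC{i+1}" for i in range(n_components)])

# Variables with consistently low absolute loadings are candidates for removal
print(loadings.abs().max(axis=1).sort_values())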

Correlation is useful for identifying variables that are highly correlated with others, behave similarly, and can therefore be redundant for the analysis. In the example below, there are 4 highly correlated variables:

  • As Session Time and Pages Visited are both behavior variables with a high correlation, one could keep only one of them in the model;
  • Conversion Rate and Revenue, on the other hand, are also “peers”; you could keep one of them and remove the other. Note that it’s safer to exclude variables of distinct types (one from each correlated pair), otherwise an important metric for the model can be lost;

Correlation Heatmap with Example Variables. Source: Author

In my analysis, I could remove 2 more metrics, leading to a final dataset with 7 variables. In the example above, 2 variables could be removed, resulting in 6 for running the clustering.
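For reference, a minimal sketch of how the correlation heatmap and the highly correlated pairs can be obtained, assuming the filtered user-level DataFrame df_filtered and the seaborn library:

import seaborn as sns
import matplotlib.pyplot as plt

corr = df_filtered.corr()

# Heatmap of the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between user-level metrics")
plt.show()

# Pairs with |correlation| above an illustrative 0.8 threshold
# (each pair appears twice because the matrix is symmetric)
high_pairs = corr.abs().stack()
high_pairs = high_pairs[(high_pairs > 0.8) & (high_pairs < 1.0)]
print(high_pairs)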

Training the Model and Evaluating the Clusters

For this analysis, the Elbow Method and the Silhouette Method were used to evaluate the number of clusters k, considering the two types of scaling detailed above.

Elbow Method

To plot the Elbow Method, one needs to calculate the inertia, which is the within-cluster sum of squared distances from each point to its centroid. It’s calculated using the sklearn.cluster library:

# Analyzing the elbow method to choose the number of clusters - calculating the inertia
from sklearn.cluster import KMeans

inertia = []
k_min = 2
k_max = 8

for i in range(k_min, k_max + 1):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(df_filtered_scaled)
    inertia.append(kmeans.inertia_)

When plotting the elbow chart, there isn’t a clear elbow. Calculating the maximum distance between each point of the curve and the red line joining its endpoints, the maximum occurs at position 4, which suggests 4 clusters. When the Elbow Method is not conclusive, other approaches are trying different methods for identifying the number of clusters, or even reviewing the variables being used.

Elbow Method Example Chart. Source: Author
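A hedged sketch of the plot and of the “maximum distance to the line” heuristic described above (the red line joins the first and last inertia points; the axes are not rescaled here, which is a simplification):

import numpy as np
import matplotlib.pyplot as plt

k_values = np.arange(k_min, k_max + 1)
inertia_arr = np.array(inertia)

# Straight line between the first and last points of the curve
p1 = np.array([k_values[0], inertia_arr[0]])
p2 = np.array([k_values[-1], inertia_arr[-1]])
d = p2 - p1

# Perpendicular distance from each (k, inertia) point to that line
distances = np.abs(d[0] * (p1[1] - inertia_arr) - d[1] * (p1[0] - k_values)) / np.linalg.norm(d)
best_k = k_values[np.argmax(distances)]

plt.plot(k_values, inertia_arr, marker="o")
plt.plot([p1[0], p2[0]], [p1[1], p2[1]], color="red")
plt.axvline(best_k, linestyle="--")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.show()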

Silhouette Method

This metric compares how similar each point is to its own cluster versus the other clusters. It is a very computationally demanding method, which means that, depending on the dataset size, you may not be able to run it directly. For big datasets, a possibility is to use Spark’s machine learning functions, which accelerate training by distributing it across multiple machines (a sketch follows the scikit-learn code below).

In the Silhouette Method, values range from -1 to 1: values closer to 1 indicate that the clusters are well separated from each other; a value of 0 means the point lies on the boundary between clusters; negative values indicate a poor clustering configuration.
It’s calculated using the sklearn.metrics library:

# Importing and running the silhouette metric
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Defining the number of clusters to test
silhouette = []
k_min = 2
k_max = 8

# Calculating the silhouette score (with Euclidean distance) for each k
n = [i for i in range(k_min, k_max + 1)]
for i in range(k_min, k_max + 1):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(df_sample_scaled)
    silhouette.append(silhouette_score(df_sample_scaled,
                                       kmeans.labels_,
                                       metric='euclidean'))
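For very large datasets, as mentioned above, a hedged alternative is Spark ML, assuming the scaled data is available as a Spark DataFrame df_spark with a "features" vector column (hypothetical names):

from pyspark.ml.clustering import KMeans as SparkKMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Silhouette evaluator computed in a distributed fashion
evaluator = ClusteringEvaluator(metricName="silhouette",
                                distanceMeasure="squaredEuclidean")

for i in range(k_min, k_max + 1):
    model = SparkKMeans(k=i, seed=42).fit(df_spark)
    predictions = model.transform(df_spark)
    print(i, evaluator.evaluate(predictions))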

As one can see in the example below, the maximum value is 0.8, at k=4 clusters.

Silhouette Method Example Chart. Source: Author

Once the number of clusters k is selected, all that’s left to do is train the model using the standardized data:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Setting the number of clusters
num_clusters = 4

# Apply the K-means algorithm
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
kmeans.fit(df_filtered_scaled)

# Add the cluster column to the original data
df_filtered['cluster'] = kmeans.labels_
df_filtered

You can plot the clustering results using a pair plot chart (the variables selected above are shown on this chart):

Clusters considering the Example Metrics. Source: Author

In the chart above, the clusters are plotted for each pair of metrics used to create them. One can compare how the groups are divided across distinct KPIs, and the distribution of each cluster across the variables can also be observed.
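A minimal sketch of how such a pair plot can be produced, assuming the seaborn library and the df_filtered DataFrame with the cluster column added above:

import seaborn as sns
import matplotlib.pyplot as plt

# One scatter plot per pair of metrics, colored by cluster
sns.pairplot(df_filtered, hue="cluster", palette="tab10", corner=True)
plt.show()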

Creating Data Personas

Data personas are fictitious representations of user groups based on demographic, behavioral, and usage data. These personas are crucial for segmenting the target audience, tailoring marketing strategies, developing products, and enhancing the user experience. They help personalize content, target marketing campaigns, and analyze data more specifically, contributing to informed and effective business decisions.

According to Amplitude, the key points of understanding data personas are:

  • Understanding user engagement by learning which product features are popular and why users interact with the product as they do;
  • Developing user profiles to cater to different groups more effectively;
  • Gathering quantifiable user data to identify trends and behavior;
  • Utilizing data for product design and development decisions to enhance the user experience, retention, and conversion;

Considering the set of variables used in the examples above, the final possible table with the metrics grouped at the cluster level is:

Example of metrics by Cluster. Source: Author
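A minimal sketch of how such a summary can be built from the labeled user-level data of the previous section:

# Average of each metric per cluster, plus the number of users in each cluster
cluster_profile = df_filtered.groupby("cluster").mean().round(2)
cluster_profile["users"] = df_filtered.groupby("cluster").size()
cluster_profile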

Note that one can name the clusters based on the results they have:

  • Casual Users: Users with a low click rate and session time who don’t generate much revenue;
  • Window Shoppers: People with a high click rate and time on the pages but low conversion/revenue;
  • Struggling Engagement: Users with low session time and high bounce rates or low conversions;
  • Loyal Customers: Users with high revenue, a high number of transactions, and reasonable session time;

The User Profiles. Source: Author

Other examples of data personas that you can classify are:

  • High-Value Users: This cluster can group users with high revenue and a high number of transactions.
  • Engaged Explorers: This group may include users with a high click rate and session time, indicating active exploration of the app.
  • Quality Time Spent: Users who spend significant time in the app with a low bounce rate.
  • Conversion Champions: Users with a high conversion rate compared to other metrics.
  • Ad Click Enthusiasts: Users who click frequently and have high conversions (Transactions), showing a strong response to advertising.
  • Cart Abandoners: Users who click and add items to the cart but have low transaction rates, indicating potential issues with the checkout process.
  • Content Explorers: Users who engage deeply with app content (Time_on_Page) and have a low bounce rate, indicating a strong interest in exploring content.

Although the current article focuses specifically on app interaction, the concept of Data Personas can be very useful for classifying personas in many other sectors: Healthcare, to categorize patients based on medical history, treatment preferences, and health conditions, enabling personalized healthcare solutions and treatment plans; Media and Entertainment, to build audiences based on viewing habits, content preferences, and interaction with media platforms, supporting content recommendation and content production; Public Services, applied to citizens to better understand their needs, preferences, and interactions with government services, aiming to improve public services and policies; and many others.

Conclusion

In this journey from exploring app interactions to harnessing the power of machine learning for user segmentation, the data-driven insights offer valuable perspectives for enhancing user engagement and tailoring strategies. Clustering, whose results can be translated into persona representations, is a powerful tool for understanding the specificities of a group of users, or of any group of data one can put together. When working with session data, the size of the datasets can be a challenge for training models, as can the choice of metrics.

As K-Means is very dependent on the selected variables and on the dimensionality of the data, data transformation plays an important role in the process of creating good clusters. Dimensionality reduction, Principal Component Analysis, and correlation analysis are some of the techniques that can be applied to create better clusters and less redundant data. The evaluation of the number of clusters k doesn’t necessarily need to rely on a single method; it’s possible to cross techniques and visualizations to understand what performs well and what makes sense for the business. Bad results need to be avoided, and the previous steps reconsidered when they appear.

After achieving satisfactory cluster results and establishing data personas using the metrics, tailored approaches can be employed for each group. Ultimately, the objective is to gain a comprehensive understanding of the behavior within each cluster and to identify strategies for personalizing the user experience. Additionally, this insight can drive the creation of initiatives aimed at developing new products, features, services, or public policies.
