Hi Glen, thanks for reading and sharing your ideas.

You’re right: one approach is to treat the data as is and simply uncover the natural patterns hidden in it; imbalanced classes are more of a problem for binary classification. At the same time, imbalanced clusters can appear because the dataset itself was unbalanced, or as an indication of outliers. With an imbalanced dataset, the whole cluster distribution might be skewed towards the majority cluster, leaving nothing insightful to say about the minority one(s). Outliers, on the other hand, might not require any special treatment, but it’s worth digging deeper to understand where they come from.

From another point of view, it is hard to make strong claims about minority clusters when their size is not comparable with the rest (e.g., a comparison between clusters of size 15 and 150 should be done carefully). It is also hard to draw conclusions and hypotheses from minority clusters: you can treat them as samples, and the sample size should be adequate for your general population (e.g., if we have 10,000 users, 10 observations in a cluster does not seem satisfactory).
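A quick sanity check along these lines is to look at each cluster's share of the total population before interpreting it. A minimal sketch with pandas (the cluster labels and sizes here are invented for illustration, and the 1% cutoff is an arbitrary assumption, not a rule from the post):

```python
import pandas as pd

# Hypothetical cluster labels for 10,000 users; the sizes are made up
labels = pd.Series([0] * 9_800 + [1] * 150 + [2] * 50)

sizes = labels.value_counts()               # absolute cluster sizes
shares = labels.value_counts(normalize=True)  # share of all users

# Flag clusters too small to generalize from (e.g. under 1% of users)
tiny = sizes[shares < 0.01].index.tolist()
print(sizes.to_dict())
print(tiny)
```

Clusters that end up in `tiny` are candidates for outlier inspection rather than for drawing conclusions about a customer group.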

Lastly, imbalanced clusters are likely to overlap, which makes it hard to see how they differ, especially if you want to plot them.

I found this discussion quite interesting: https://stats.stackexchange.com/questions/223767/class-imbalance-in-clustering.

My wording in the post should have been different: it’s not that you prefer clusters of the same size, but that similarly sized clusters might be easier to work with and draw conclusions from, which suggests your dataset is more likely to be clean, you have meaningful information within clusters, and so on.

As for interpreting the clusters: with numerical data, you would usually look at the mean of each variable in every cluster (by the way, the variables are more or less randomly invented by me, so there is no hidden meaning behind them 🙂). Since we operate with categorical variables, we look for modes, the most frequently appearing value of each variable.

You can see that some sections have dark blue spots (the mode is prominently different from the other values, e.g. area, origin, dish type) and some have no blue spots at all (e.g. day of the week), which means the values are equally distributed and variation is low. This is the first hint. Some variables contribute to finding the differences between clusters (they vary more and have a “prominent” mode); others don’t, and they might as well not add value to the clustering itself. So what you might do is exclude them from the analysis and run it all over again, as they do not add any meaningful information to differentiate between customer groups (to put it boldly, it’s like saying “customers from cluster 1 are humans, customers from cluster 2 are humans too”, which doesn’t tell us anything special about either of them).
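One way to automate that "no blue spots" check is to compare each variable's mode frequency against what a perfectly uniform distribution would give. A rough sketch, with made-up column names and an arbitrary 1.5x threshold I chose for illustration:

```python
import pandas as pd

# Illustrative data: "area" has a clear mode, "day_of_week" is uniform
df = pd.DataFrame({
    "area":        ["north"] * 2 + ["south"] * 8,
    "day_of_week": ["mon", "tue", "wed", "thu", "fri"] * 2,
})

# Share of rows taken by each variable's most frequent value
mode_share = df.apply(lambda s: s.value_counts(normalize=True).iloc[0])

# Under a uniform distribution the mode would cover 1/n_categories of rows
uniform_share = 1 / df.nunique()

# Drop variables whose mode is barely above uniform (low variation)
low_variation = mode_share < 1.5 * uniform_share
keep = df.loc[:, ~low_variation]
print(keep.columns.tolist())
```

Variables that survive the filter are the ones worth keeping when you re-run the clustering.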

From there, you can infer your customers’ differences and make assumptions about their behaviour. For me, it was a prep step for the face-to-face interviews, so I could think about which questions to ask.

Let me know your thoughts!

Data Lab | Growth Hacking & Data Science
