Probabilistic Supervised Clustering with Conditional Random Fields

Predicting group emergence in spatial and non-spatial settings [2018]
machine-learning spatial-statistics

Work in progress.

Abstract: We often encounter data where units are assigned to mutually exclusive categories, e.g.

and we would like to model and explain the underlying clustering process. Often, it is useful to express the (suspected) predictors of the sorting process as inter-unit distances: individuals may form teams based on the distances between their skill sets; citizens may group into political factions based on differences in issue positions; regions may be sorted into states based on demographic and geographic distances.
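To make the distance-based setup concrete, the following is a minimal sketch rather than the method proposed here. It assumes a made-up toy dataset of five units with two attributes each; the names `pairwise_distances` and `co_membership` are hypothetical helpers introduced only for illustration. It turns unit-level attributes into one pairwise distance predictor and turns the observed partition into a pairwise "same group" outcome.

```python
import numpy as np

# Hypothetical toy data: 5 units, 2 attributes each (e.g. skill scores).
X = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [5.0, 5.0],
    [5.0, 6.0],
    [9.0, 0.0],
])
# Observed partition: the group label of each unit (assumed for illustration).
labels = np.array([0, 0, 1, 1, 2])


def pairwise_distances(X):
    """Euclidean distance between every pair of units (n x n matrix)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def co_membership(labels):
    """1 where two units share a group, 0 otherwise (n x n matrix)."""
    return (labels[:, None] == labels[None, :]).astype(int)


D = pairwise_distances(X)
S = co_membership(labels)

# One row per unordered pair (i < j): unit indices, distance, same-group flag.
i, j = np.triu_indices(len(X), k=1)
pairs = np.column_stack([i, j, D[i, j], S[i, j]])
print(pairs)
```

With several predictors, each would contribute its own distance matrix and hence its own column in `pairs`.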

What these situations also have in common is that the number of groups (or clusters) resulting from the sorting process is not predetermined. A company of 100 employees may set up two large teams with 50 members each, 10 teams with 10 members each, and so on. Similarly, a continental landmass may be organized into many small countries, one large country, or anything in between. Explaining the data-generating process (i.e. the clustering mechanism) therefore implies modeling these data such that the number of clusters arises naturally from the measured inter-unit distances.
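As a rough illustration of how a number of clusters can emerge from distances alone, the sketch below links any two units whose distance falls below a cutoff and treats each connected component as a cluster. This is a simple thresholding heuristic swapped in for illustration, not the conditional random field model proposed here; the function `threshold_clusters`, the cutoff values, and the six positions on a line are all assumptions.

```python
import numpy as np


def threshold_clusters(D, cutoff):
    """Label units by connected components of the graph linking pairs with D < cutoff."""
    n = D.shape[0]
    labels = -np.ones(n, dtype=int)  # -1 marks units not yet assigned
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Depth-first search over units reachable through short distances.
        stack = [start]
        labels[start] = current
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(D[u] < cutoff):
                if labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels


# Hypothetical positions of six units on a line.
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 30.0])
D = np.abs(x[:, None] - x[None, :])

# The same distances yield different numbers of clusters as the cutoff changes.
for cutoff in (0.5, 1.5, 9.5, 25.0):
    k = threshold_clusters(D, cutoff).max() + 1
    print(f"cutoff={cutoff}: {k} clusters")
```

Nothing in this sketch fixes the number of clusters in advance: it follows from the distances and a single parameter, which a supervised approach would learn from observed partitions instead of setting by hand.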

In machine learning, training a model that explains a partitioning of units into an a priori unknown number of clusters from a set of distance predictors is known as supervised clustering. However, supervised clustering remains poorly explored, especially in the social sciences. Existing models are largely unknown outside specific domains (e.g. identity uncertainty in NLP, citation-document matching), are difficult to interpret, and lack readily available software implementations. This paper proposes a supervised clustering method that (1) permits interpretation and prediction, (2) is easy to implement, (3) runs fast on small and medium-sized datasets, and (4) is suitable for spatial and non-spatial data.