rainette is an R implementation of the Reinert textual clustering algorithm. This algorithm is already available in software such as Iramuteq or Alceste; the goal of rainette is to provide it as an R package.
This is not a new algorithm: the first articles describing it date back to 1983. Here are some characteristics of the method:
The Reinert method is a divisive hierarchical clustering algorithm whose aim is to maximize the inter-cluster chi-squared distance.
The algorithm is applied to the document-term matrix computed on the corpus. Documents are called uce (elementary context units). If an uce doesn't include enough terms, it can be merged with the following or previous one into an uc (context unit). The resulting matrix is then converted to binary weights, so only the presence or absence of terms is taken into account, not their frequencies.
The aim is to split this matrix into two clusters by maximizing the chi-squared distance between them. As an exhaustive search would be too compute-intensive, the following method is used to get a good approximation:
The uc are ordered according to their coordinates on the first axis of a correspondence analysis of the binary matrix.
The uc are then grouped into two clusters based on this order, and the grouping with the maximum inter-cluster chi-squared distance is kept.
Each uc is in turn assigned to the other cluster. If this new assignment gives a higher inter-cluster chi-squared value, it is kept. The operation is repeated until no new assignment gives a higher chi-squared.
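The reassignment step above can be sketched in a few lines of base R. This is an illustrative toy example only, not rainette's internal code: `inter_chisq`, `m` and `in_c1` are names invented here, and the matrix is a made-up 4-document example.

```r
## Illustrative sketch only: inter-cluster chi-squared for a binary
## document-term matrix `m`, given a logical vector `in_c1` marking
## the documents belonging to the first cluster.
inter_chisq <- function(m, in_c1) {
  tab <- rbind(colSums(m[in_c1, , drop = FALSE]),
               colSums(m[!in_c1, , drop = FALSE]))
  suppressWarnings(unname(chisq.test(tab)$statistic))
}

## Toy binary matrix: 4 documents, 4 terms
m <- matrix(c(1, 1, 0, 0,
              1, 0, 1, 0,
              0, 0, 1, 1,
              0, 1, 1, 1), nrow = 4, byrow = TRUE)
in_c1 <- c(TRUE, TRUE, FALSE, FALSE)

## Reassignment step: move each document to the other cluster in turn
## and keep the move if it increases the inter-cluster chi-squared
best <- inter_chisq(m, in_c1)
repeat {
  improved <- FALSE
  for (i in seq_len(nrow(m))) {
    candidate <- in_c1
    candidate[i] <- !candidate[i]
    if (!any(candidate) || all(candidate)) next  # keep both clusters non-empty
    value <- inter_chisq(m, candidate)
    if (value > best) {
      best <- value
      in_c1 <- candidate
      improved <- TRUE
    }
  }
  if (!improved) break
}
```

The loop terminates because the chi-squared value strictly increases at each kept reassignment and there is a finite number of partitions.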
The Reinert method suggests performing a double clustering to get more robust clusters. This method is also implemented in rainette.
The principle is to run two simple clusterings while varying the minimum uc size. For example, the first one will be run with a minimum size of 10 terms, and the second one with a minimum size of 15.
The two sets of clusters are then "crossed": each cluster of the first clustering is crossed with each cluster of the second one, even if they are not at the same hierarchical level. We then compute the number of uc present in both clusters, and a chi-squared value of association between them.
Only a subset of all these "crossed clusters" is kept: those with different elements, with a minimum number of uc, or with a minimum association value. Then, for a given number of clusters k, the algorithm looks for the optimal partition of crossed clusters, i.e. it keeps the set of crossed clusters with no common elements and with either the highest total number of elements or the highest sum of chi-squared association values.
This optimal partition is then used either as the final clustering (potentially with quite a high proportion of NA), or as a starting point for a k-nearest-neighbors clustering of the non-assigned documents.
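As a toy illustration of the crossing step (again, not rainette's actual code), the number of elements shared by every pair of clusters from two clusterings can be counted with table(), and the association between one pair measured with a chi-squared test on the corresponding 2x2 membership table. The cluster labels below are invented for the example:

```r
## Two hypothetical clusterings of the same 8 segments
clust1 <- c("A", "A", "A", "B", "B", "C", "C", "C")
clust2 <- c("x", "x", "y", "y", "y", "z", "z", "z")

## Each cell counts the segments shared by a pair of crossed clusters
crossed <- table(clust1, clust2)
crossed

## Chi-squared association between, e.g., cluster A and cluster x
suppressWarnings(chisq.test(table(clust1 == "A", clust2 == "x"))$statistic)
```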
As explained before, since it doesn't take term frequencies into account and assigns each document to only one cluster, the Reinert method should be applied to short and "homogeneous" documents. It can work well on tweets or short answers to a specific question, but longer documents must first be split into short textual segments.
The split_segments function does just that, and can be applied to a quanteda corpus. In this article we will apply it to the sample data_corpus_inaugural corpus:
library(quanteda)
library(rainette)

corpus <- split_segments(data_corpus_inaugural, segment_size = 40)
split_segments will split the original texts into smaller chunks, attempting to respect sentences and punctuation when possible. The function takes two arguments:
segment_size: the preferred segment size, in words.
segment_size_window: the "window" in which to look for the best segment split, in words. If NULL, it is set to 0.4 * segment_size.
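As an aside, here is what an explicit window could look like. This variant is purely illustrative (the value 10 and the variable name corpus_w10 are choices made for this example, and the result is not used below):

```r
library(quanteda)
library(rainette)

## Target segments of 40 words, letting the split point be searched
## within a 10-word window around that size (illustrative values)
corpus_w10 <- split_segments(data_corpus_inaugural,
                             segment_size = 40,
                             segment_size_window = 10)
```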
The result of the function is a quanteda corpus, which keeps the original corpus metadata and adds a segment_source variable:
## Corpus consisting of 3,584 documents and 5 docvars.
## 1789-Washington_1 :
## "Fellow-Citizens of the Senate and of the House of Representa..."
##
## 1789-Washington_2 :
## "On the one hand, I was summoned by my Country, whose voice I..."
##
## 1789-Washington_3 :
## "as the asylum of my declining years - a retreat which was re..."
##
## 1789-Washington_4 :
## "On the other hand, the magnitude and difficulty of the trust..."
##
## 1789-Washington_5 :
## "could not but overwhelm with despondence one who (inheriting..."
##
## 1789-Washington_6 :
## "In this conflict of emotions all I dare aver is that it has ..."
##
## [ reached max_ndoc ... 3,578 more documents ]
##   Year  President FirstName Party  segment_source
## 1 1789 Washington    George  none 1789-Washington
## 2 1789 Washington    George  none 1789-Washington
## 3 1789 Washington    George  none 1789-Washington
## 4 1789 Washington    George  none 1789-Washington
## 5 1789 Washington    George  none 1789-Washington
## 6 1789 Washington    George  none 1789-Washington
The next step is to compute the document-feature matrix. As our corpus object is a quanteda corpus, we can tokenize it and then use the dfm function:
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
We then only keep the terms appearing in at least 10 segments by using dfm_trim:
dtm <- dfm_trim(dtm, min_docfreq = 10)
We are now ready to compute a simple Reinert clustering with the rainette function. Its main arguments are:
k: the number of clusters to compute.
min_segment_size: the minimum number of terms in each context unit at startup. If a segment contains fewer terms than this, it will be merged with the following one (if they come from the same source document). The default value is 0, i.e. no merging is done.
min_split_members: if a cluster is smaller than this value, it won't be split further (default: 5).
Here we will compute 5 clusters with a min_segment_size of 15:
res <- rainette(dtm, k = 5, min_segment_size = 15)
To help explore the clustering results, rainette offers an interactive interface, which can be launched with rainette_explor:
rainette_explor(res, dtm, corpus)
The interactive interface should look something like this :
You can change the number of clusters, the displayed statistic, etc., and see the result in real time. By default, the most specific terms of each cluster are displayed with a blue bar; terms with negative keyness are shown with a red bar (if Show negative values has been checked).
The Cluster documents tab allows you to browse the documents of a given cluster. You can filter them by entering a term or a regular expression in the Filter by term field:
In the Summary tab, you can click on the Get R code button to get the R code to reproduce the current plot and to compute cluster membership.
You can also directly use cutree to get the cluster of each document at a given level k:
cluster <- cutree(res, k = 5)
This vector can be used, for example, as a new corpus metadata variable :
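For example, with quanteda's docvars() (one possible way to do it; the variable name cluster is our choice):

```r
## Store each segment's cluster as a new corpus metadata variable
docvars(corpus, "cluster") <- cutree(res, k = 5)
head(docvars(corpus))
```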
##   Year  President FirstName Party  segment_source cluster
## 1 1789 Washington    George  none 1789-Washington       4
## 2 1789 Washington    George  none 1789-Washington       4
## 3 1789 Washington    George  none 1789-Washington       3
## 4 1789 Washington    George  none 1789-Washington       3
## 5 1789 Washington    George  none 1789-Washington       5
## 6 1789 Washington    George  none 1789-Washington       5
Here, the clusters have been assigned to the segments, not to the original documents as a whole. The clusters_by_doc_table function displays, for each original document, the number of segments belonging to each cluster:
clusters_by_doc_table(corpus, clust_var = "cluster")
## # A tibble: 59 × 6
##    doc_id          clust_1 clust_2 clust_3 clust_4 clust_5
##    <chr>             <int>   <int>   <int>   <int>   <int>
##  1 1789-Washington       0       0       4      27       6
##  2 1793-Washington       0       0       0       0       4
##  3 1797-Adams            8       0       5      44       2
##  4 1801-Jefferson       18       0       3      21       2
##  5 1805-Jefferson        2       3       6      43       3
##  6 1809-Madison          0       0       4      17       7
##  7 1813-Madison          0       5       5      20       3
##  8 1817-Monroe           2       0      27      47      12
##  9 1821-Monroe           2       0      37      66      10
## 10 1825-Adams            8       2      13      46       8
## # … with 49 more rows
With prop = TRUE, the same table is displayed with row percentages:
clusters_by_doc_table(corpus, clust_var = "cluster", prop = TRUE)
## # A tibble: 59 × 6
##    doc_id          clust_1 clust_2 clust_3 clust_4 clust_5
##    <chr>             <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 1789-Washington    0       0      10.8    73.0   16.2
##  2 1793-Washington    0       0       0       0    100
##  3 1797-Adams        13.6     0       8.47   74.6    3.39
##  4 1801-Jefferson    40.9     0       6.82   47.7    4.55
##  5 1805-Jefferson     3.51    5.26   10.5    75.4    5.26
##  6 1809-Madison       0       0      14.3    60.7   25
##  7 1813-Madison       0      15.2    15.2    60.6    9.09
##  8 1817-Monroe        2.27    0      30.7    53.4   13.6
##  9 1821-Monroe        1.74    0      32.2    57.4    8.70
## 10 1825-Adams        10.4     2.60   16.9    59.7   10.4
## # … with 49 more rows
The docs_by_cluster_table function displays, for each cluster, the number and proportion of original documents with at least one segment belonging to this cluster:
docs_by_cluster_table(corpus, clust_var = "cluster")
## # A tibble: 5 × 3
##   cluster     n   `%`
##   <chr>   <int> <dbl>
## 1 clust_1    52  88.1
## 2 clust_2    40  67.8
## 3 clust_3    40  67.8
## 4 clust_4    44  74.6
## 5 clust_5    34  57.6
rainette also offers a "double clustering" algorithm, as described above: two simple clusterings are computed with different min_segment_size values, then combined to get a better partition and more robust clusters.
This can be done with the rainette2 function, which can be applied to two already computed simple clusterings. Here, we compute them with min_segment_size values of 10 and 15:
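The two clusterings can be computed with the same rainette call as above, changing only min_segment_size (a sketch consistent with the res1 and res2 objects used below):

```r
res1 <- rainette(dtm, k = 5, min_segment_size = 10)
res2 <- rainette(dtm, k = 5, min_segment_size = 15)
```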
We then use rainette2 to combine them. Its main arguments are max_k, the maximum number of clusters, and min_members, the minimum cluster size:
res <- rainette2(res1, res2, max_k = 5, min_members = 10)
Another way is to call rainette2 directly on our dtm matrix, giving it both min_segment_size1 and min_segment_size2 arguments:
res <- rainette2(dtm, min_segment_size1 = 10, min_segment_size2 = 15, max_k = 5, min_members = 10)
The resulting object is a tibble with, for each level k, the optimal partitions and their characteristics. Another interactive interface is available to explore the results. It is launched with
rainette2_explor(res, dtm, corpus)
The interface is very similar to the previous one, except that instead of a dendrogram there is a single barplot of cluster sizes. Be careful about the number of NA (non-assigned documents), as it can be quite high.
If some segments are not assigned to any cluster, you can use rainette2_complete_groups to assign them to the nearest cluster using a k-nearest-neighbors algorithm (with k = 1):
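A sketch of how this can be done, passing the dfm and the cluster membership vector to rainette2_complete_groups:

```r
## Get cluster memberships (with possible NA) from the rainette2 result,
## then assign the non-clustered segments to their nearest cluster
clusters <- cutree(res, k = 5)
clusters_completed <- rainette2_complete_groups(dtm, clusters)
```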