The package website can be found at https://juba.github.io/rainette/.
Rainette is an R package which implements a variant of the Reinert textual clustering method. This method is available in other software such as Iramuteq (free software) or Alceste (commercial, closed source).
The package can be installed from CRAN with install.packages("rainette").
The development version can be installed from GitHub, for example with remotes::install_github("juba/rainette").
Let’s start with data_corpus_inaugural, an example corpus provided by the excellent quanteda package.
First, we’ll use split_segments to split each text in the corpus into segments of about 40 words (punctuation is taken into account):
corpus <- split_segments(data_corpus_inaugural, segment_size = 40)
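To check what the split produced, a minimal sketch follows. It assumes both packages are installed; the comparison with ndoc and the preview of the first segments are illustrations added here, not part of the original walkthrough.

```r
library(quanteda)
library(rainette)

# Split each inaugural address into segments of about 40 words
corpus <- split_segments(data_corpus_inaugural, segment_size = 40)

# Number of original documents versus number of segments
ndoc(data_corpus_inaugural)
ndoc(corpus)

# Preview the text of the first three segments
head(as.character(corpus), 3)
```

The segment count should be much larger than the document count, since each address is cut into many short pieces.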
Next, we’ll compute a document-term matrix and apply some preprocessing with quanteda functions:
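# Recent quanteda versions no longer accept a corpus in dfm(), so tokenise first
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_wordstem(dtm, language = "english")
dtm <- dfm_trim(dtm, min_termfreq = 3)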
We can then apply a simple clustering on this dtm with the rainette function. We specify the number of clusters (k), the minimum size for a cluster to be split at the next step (min_split_members) and the minimum number of forms in each segment (min_uc_size):
res <- rainette(dtm, k = 6, min_uc_size = 15, min_split_members = 20)
We can use the rainette_explor shiny interface to visualise and explore the different clusterings at each k:
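A minimal sketch of launching the interface, assuming it takes the clustering result and the dtm in the same order as rainette_plot, and that res and dtm were created as in the steps above. The interactive() guard is an addition for this sketch so that the code can also be sourced in a non-interactive script.

```r
library(rainette)

# Assumes `res` and `dtm` come from the earlier steps.
# The shiny app only makes sense in an interactive R session.
if (interactive()) {
  rainette_explor(res, dtm)
}
```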
We can then use the generated R code to reproduce the displayed clustering visualisation plot:
rainette_plot(res, dtm, k = 5, type = "bar", n_terms = 20, free_scales = FALSE, measure = "chi2", show_negative = TRUE, text_size = 10)
Or cut the tree at the chosen k and add a group membership variable to our corpus metadata:
docvars(corpus)$group <- cutree_rainette(res, k = 5)
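Once the membership variable is stored, it can be used like any other document variable. The sketch below assumes corpus already carries the group variable created above; the choice of cluster 1 and the name cluster1 are purely illustrative.

```r
library(quanteda)

# Assumes `corpus` has the `group` docvar added in the previous step.
# Count segments per cluster; useNA also reports any unassigned segments.
table(docvars(corpus)$group, useNA = "ifany")

# Keep only the segments of one cluster (cluster 1 is an arbitrary choice)
cluster1 <- corpus_subset(corpus, group == 1)
head(as.character(cluster1), 3)
```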
In addition to this, you can also perform a double clustering, i.e. two simple clusterings produced with different min_uc_size values, which are then “crossed” to generate more robust clusters. To do this, use rainette2 either on two rainette results:
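For example, a sketch with illustrative parameter values. The names res1, res2 and res_double are hypothetical, and the sketch assumes dtm was built as in the steps above.

```r
library(rainette)

# Two simple clusterings that differ only in min_uc_size
res1 <- rainette(dtm, k = 5, min_uc_size = 10, min_split_members = 10)
res2 <- rainette(dtm, k = 5, min_uc_size = 15, min_split_members = 10)

# Cross the two results to keep the clusters they agree on
res_double <- rainette2(res1, res2, max_k = 5, min_members = 10)
```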
Or directly on a dtm with the uc_size1 and uc_size2 arguments:
rainette2(dtm, max_k = 10, uc_size1 = 10, uc_size2 = 15, min_members = 20)
You can then use rainette2_explor, rainette2_plot and cutree_rainette2 to explore and visualise the results.
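A minimal sketch of working with a double clustering result, assuming dtm and corpus come from the earlier steps. The names res_double and group2 are hypothetical, and k = 5 is an arbitrary choice that only works if a partition of that size was found.

```r
library(quanteda)
library(rainette)

# Store the double clustering result so it can be reused
res_double <- rainette2(dtm, max_k = 10, uc_size1 = 10, uc_size2 = 15, min_members = 20)

# Plot the crossed clustering for the chosen number of clusters
rainette2_plot(res_double, dtm, k = 5)

# Add cluster membership to the corpus metadata and count segments per cluster;
# useNA also reports segments that belong to no crossed cluster
docvars(corpus)$group2 <- cutree_rainette2(res_double, k = 5)
table(docvars(corpus)$group2, useNA = "ifany")
```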
Three vignettes are available: an introduction in English, and an introduction and an algorithm description in French.
This classification method was created by Max Reinert and is described in several articles. Here are two references:
Thanks to Pierre Ratinaud, the author of Iramuteq, for providing it as free and open-source software. Even though the R code has been almost entirely rewritten, it has been an invaluable resource for understanding the algorithms.
Many thanks to Sébastien Rochette for the creation of the hex logo.
Many thanks to Florian Privé for his work on rewriting and optimizing Rcpp code.