
Corpus clustering based on the Reinert method - Double clustering


Usage

rainette2(
  x,
  y = NULL,
  max_k = 5,
  min_segment_size1 = 10,
  min_segment_size2 = 15,
  doc_id = NULL,
  min_members = 10,
  min_chi2 = 3.84,
  parallel = FALSE,
  full = TRUE,
  uc_size1,
  uc_size2,
  ...
)



Arguments

x
either a quanteda dfm object or the result of rainette()

y
if x is a rainette() result, this must be another rainette() result computed from the same dfm but with a different minimum segment size.

max_k
maximum number of clusters to compute

min_segment_size1
if x is a dfm, minimum uc size for the first clustering

min_segment_size2
if x is a dfm, minimum uc size for the second clustering

doc_id
character name of a dfm docvar which identifies source documents.

min_members
minimum number of members for each cluster

min_chi2
minimum chi2 value for each cluster

parallel
if TRUE, use parallel::mclapply to compute partitions (doesn't work on Windows, and uses more RAM)

full
if TRUE, all crossed groups are kept when computing optimal partitions; otherwise only the most mutually associated groups are kept.

uc_size1
deprecated, use min_segment_size1 instead

uc_size2
deprecated, use min_segment_size2 instead

...
if x is a dfm object, parameters passed to rainette() for both simple clusterings


Value

A tibble with one row per optimal partition found for each available value of k, and the following columns:

  • clusters: list of the crossed original clusters used in the partition
  • k: the number of clusters
  • chi2: sum of the chi2 values of the clusters in the partition
  • n: sum of the sizes of the clusters in the partition
  • groups: group membership of each document for this partition (NA if not assigned)


Details

You can pass a quanteda dfm as the x argument; the function then performs two simple clusterings with different minimum uc sizes, and then searches for optimal partitions based on the results of both clusterings.

If both clusterings have already been computed, you can pass them as x and y arguments and the function will only look for optimal partitions.

doc_id must be provided unless the corpus comes from split_segments(), in which case the segment_source docvar is used by default.
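
As a sketch of that case, suppose the corpus was not produced by split_segments() and its documents carry an "author" docvar (a hypothetical name used here purely for illustration); doc_id would then be passed explicitly:

```r
# Sketch only: "author" is a hypothetical docvar identifying source
# documents. With a corpus built by split_segments(), doc_id could be
# omitted and segment_source would be used by default.
res <- rainette2(
  dtm,
  max_k = 4,
  min_segment_size1 = 10,
  min_segment_size2 = 15,
  doc_id = "author"
)
```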

If full = FALSE, computation may be much faster, but the chi2 criterion will be the only one available for best partition detection, and the result may not be optimal.

For more details on optimal partitions search algorithm, please see package vignettes.
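
Since the returned object is a plain tibble with the columns described above, a given partition can be inspected directly; a minimal sketch, assuming res is a rainette2() result containing a k = 3 partition:

```r
# Sketch: inspecting the k = 3 partition of a rainette2() result `res`,
# using only the documented tibble columns (k, chi2, n, groups).
best3 <- res[res$k == 3, ]
best3$chi2                       # sum of the cluster chi2 values
best3$n                          # sum of the cluster sizes
groups <- best3$groups[[1]]      # per-document membership, NA if unassigned
table(groups, useNA = "ifany")
```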


References

  • Reinert M., Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte, Cahiers de l'analyse des données, Volume 8, Numéro 2, 1983.

  • Reinert M., Alceste une méthodologie d'analyse des données textuelles et une application: Aurelia De Gerard De Nerval, Bulletin de Méthodologie Sociologique, Volume 26, Numéro 1, 1990. doi:10.1177/075910639002600103


Examples

# \donttest{
corpus <- data_corpus_inaugural
corpus <- head(corpus, n = 10)
corpus <- split_segments(corpus)
#>   Splitting...
#>   Done.
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 3)

res1 <- rainette(dtm, k = 5, min_segment_size = 10)
#>   Merging segments to comply with min_segment_size...
#>   Clustering...
#>   Done.
res2 <- rainette(dtm, k = 5, min_segment_size = 15)
#>   Merging segments to comply with min_segment_size...
#>   Clustering...
#>   Done.

res <- rainette2(res1, res2, max_k = 4)
#>   Searching for best partitions...
#>   Computing size 2 partitions...
#>   Computing size 3 partitions...
#>   Computing size 4 partitions...
#>   Selecting best partitions...
#>   Done.
# }
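
As a follow-up sketch, group membership for a chosen partition can also be retrieved with cutree(), and rainette2_explor() opens an interactive interface to browse the results; both are rainette functions, but see the package vignettes for exact usage:

```r
# Sketch, assuming `res` and `dtm` from the example above.
# Group membership vector for the 3-cluster partition:
groups <- cutree(res, k = 3)

# Interactive exploration of the double clustering results:
rainette2_explor(res, dtm)
```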