
Split a character string or corpus into segments, taking into account punctuation where possible


split_segments(obj, segment_size = 40, segment_size_window = NULL)

# S3 method for character
split_segments(obj, segment_size = 40, segment_size_window = NULL)

# S3 method for Corpus
split_segments(obj, segment_size = 40, segment_size_window = NULL)

# S3 method for corpus
split_segments(obj, segment_size = 40, segment_size_window = NULL)

# S3 method for tokens
split_segments(obj, segment_size = 40, segment_size_window = NULL)



obj
    A character string, a quanteda or tm corpus object, or a tokens object.


segment_size
    Target segment size, in words.


segment_size_window
    Window around the segment size within which to look for the best splitting point.
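
To make the interplay between the segment size and the window concrete, here is a minimal base R sketch of one way such a split could work. It is an illustration only and not the package's actual algorithm: the function name `split_sketch`, its `window` argument, and the punctuation set are all hypothetical. The idea is to cut at the punctuation mark closest to the target size when one falls inside the window, and to cut at exactly the target size otherwise.

```r
# Hypothetical illustration only; this is NOT the package's implementation.
split_sketch <- function(text, segment_size = 40, window = 4) {
  # Tokenize naively on whitespace
  words <- strsplit(trimws(text), "\\s+")[[1]]
  segments <- character(0)
  while (length(words) > 0) {
    n <- length(words)
    if (n <= segment_size) {
      # Remaining text fits in one segment
      cut <- n
    } else {
      # Candidate cut positions around the target size
      lo <- max(1, segment_size - window)
      hi <- min(n, segment_size + window)
      candidates <- lo:hi
      # Keep candidates whose word ends with punctuation
      ends_punct <- grepl("[.,;:!?]$", words[candidates])
      if (any(ends_punct)) {
        hits <- candidates[ends_punct]
        # Prefer the punctuation mark closest to the target size
        cut <- hits[which.min(abs(hits - segment_size))]
      } else {
        # No punctuation in the window: cut at the target size
        cut <- segment_size
      }
    }
    segments <- c(segments, paste(words[seq_len(cut)], collapse = " "))
    words <- words[-seq_len(cut)]
  }
  segments
}

txt <- "one two three, four five six seven eight. nine ten eleven twelve"
sketch_result <- split_sketch(txt, segment_size = 4, window = 1)
print(sketch_result)
```

With a target of 4 words and a window of 1, the first cut lands after "three," (3 words) and the second after "eight." (5 words), so segment lengths vary around the target in order to respect punctuation.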


If obj is a tm or quanteda corpus object, the result is a quanteda corpus.


# \donttest{
#>   Splitting...
#>   Done.
#> Corpus consisting of 3,584 documents and 5 docvars.
#> 1789-Washington_1 :
#> "Fellow-Citizens of the Senate and of the House of Representa..."
#> 1789-Washington_2 :
#> "On the one hand, I was summoned by my Country, whose voice I..."
#> 1789-Washington_3 :
#> "as the asylum of my declining years - a retreat which was re..."
#> 1789-Washington_4 :
#> "On the other hand, the magnitude and difficulty of the trust..."
#> 1789-Washington_5 :
#> "could not but overwhelm with despondence one who (inheriting..."
#> 1789-Washington_6 :
#> "In this conflict of emotions all I dare aver is that it has ..."
#> [ reached max_ndoc ... 3,578 more documents ]
# }
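
The document names in the output above (`1789-Washington_1`, ...) suggest it was produced by segmenting quanteda's `data_corpus_inaugural`. The following is a hedged sketch of a comparable call, not necessarily the exact example from this page. It assumes that `split_segments()` is exported by the rainette package and that both rainette and quanteda are installed; the subset on `Year` is only there to keep the run short.

```r
# Assumption: split_segments() comes from the rainette package.
# The guard lets the script run even when the packages are not installed.
segmented <- NULL
if (requireNamespace("rainette", quietly = TRUE) &&
    requireNamespace("quanteda", quietly = TRUE)) {
  # Use a small subset of the inaugural speeches to keep the example fast
  small <- quanteda::corpus_subset(quanteda::data_corpus_inaugural, Year < 1800)
  # Split each speech into segments of roughly 40 words
  segmented <- rainette::split_segments(small, segment_size = 40)
  # The result is documented to be a quanteda corpus
  print(class(segmented))
  print(quanteda::ndoc(segmented))
}
```

Each resulting document is one segment, named after its source document with a numeric suffix, as in the printed output above.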