Split a character string or corpus into segments of a given word length, preferring to split at punctuation marks where possible
Usage
split_segments(obj, segment_size = 40, segment_size_window = NULL)
# S3 method for character
split_segments(obj, segment_size = 40, segment_size_window = NULL)
# S3 method for Corpus
split_segments(obj, segment_size = 40, segment_size_window = NULL)
# S3 method for corpus
split_segments(obj, segment_size = 40, segment_size_window = NULL)
# S3 method for tokens
split_segments(obj, segment_size = 40, segment_size_window = NULL)
Arguments
- obj
a character vector, a quanteda corpus or tokens object, or a tm Corpus object
- segment_size
target segment size, in words
- segment_size_window
size of the window (in words) around segment_size in which to look for the best splitting point, such as a punctuation mark
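For instance, a larger segment_size_window gives the splitter more room around the target size to land on a punctuation mark instead of cutting mid-sentence. A minimal sketch, assuming the package providing split_segments() is attached (the text and parameter values are illustrative):

text <- "A first short sentence. Then a second, somewhat longer sentence, which should be split. And a third one to finish."
# aim for 10-word segments, with a 3-word window around the
# target in which to look for a punctuation mark to split on
split_segments(text, segment_size = 10, segment_size_window = 3)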
Examples
# \donttest{
require(quanteda)
split_segments(data_corpus_inaugural)
#> Splitting...
#> Done.
#> Corpus consisting of 3,584 documents and 5 docvars.
#> 1789-Washington_1 :
#> "Fellow-Citizens of the Senate and of the House of Representa..."
#>
#> 1789-Washington_2 :
#> "On the one hand, I was summoned by my Country, whose voice I..."
#>
#> 1789-Washington_3 :
#> "as the asylum of my declining years - a retreat which was re..."
#>
#> 1789-Washington_4 :
#> "On the other hand, the magnitude and difficulty of the trust..."
#>
#> 1789-Washington_5 :
#> "could not but overwhelm with despondence one who (inheriting..."
#>
#> 1789-Washington_6 :
#> "In this conflict of emotions all I dare aver is that it has ..."
#>
#> [ reached max_ndoc ... 3,578 more documents ]
# }
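The segmented corpus keeps the docvars of the source corpus, which is why the output above reports 5 docvars: the 4 carried over from data_corpus_inaugural plus one linking each segment back to its source document. A short follow-up sketch using quanteda's docvars() to inspect them:

seg_corpus <- split_segments(data_corpus_inaugural)
# each segment keeps the inaugural docvars (Year, President, ...)
# plus an identifier of the document it was split from
head(docvars(seg_corpus))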