
Converts a tibble or data frame into multiple document-term matrices (DTMs), one for each value of `min_freq` supplied, or into a single DTM if only one value is given. A DTM is a matrix in which each row is a document and each column is a term. Each value in the matrix is the frequency of that term in that document.
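To make the row/column layout concrete, here is a minimal base R sketch that builds a small DTM by hand. The toy corpus and the variable names (`docs`, `tokens`, `vocab`) are invented for illustration; this is not how make_DTMs is implemented, only a picture of the structure it returns.

```r
# Toy corpus: three short "documents" (hypothetical data, not from SegmentR)
docs <- c(
  doc1 = "cats chase mice",
  doc2 = "mice eat cheese",
  doc3 = "cats eat cheese and cats sleep"
)

# Split each document into lower-case word tokens
tokens <- strsplit(tolower(unname(docs)), "[^a-z]+")

# Vocabulary: every distinct term across the corpus
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per term, cell = term frequency
dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
dimnames(dtm) <- list(names(docs), vocab)

dtm
```

Each row sums to the number of tokens in that document, and a term that appears twice in one document (here "cats" in `doc3`) gets a count of 2 in that cell.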

Usage

make_DTMs(
  df,
  text_var,
  url_var = Permalink,
  min_freq = 10,
  clean_text = FALSE,
  in_parallel = TRUE,
  remove_stops = TRUE
)

Arguments

df

A data frame in which each row is a separate post.

text_var

The variable containing the text which you want to explore.

url_var

The variable containing each post's URL, used to retrieve exemplars in explore_LDAs.

min_freq

The minimum number of times a term must be observed to be considered. Supply a vector of values to create one DTM per value.

clean_text

Whether to clean the text variable.

in_parallel

Whether to run the function in parallel.

remove_stops

Whether to remove English stop words.
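The effect of `min_freq` can be sketched on a toy count matrix. The sketch below assumes the threshold applies to a term's total count across the whole corpus and that documents left with no terms are dropped; both are assumptions for illustration, not a statement of how make_DTMs works internally. The second assumption would be consistent with `n_docs` falling below 100 in the Examples output at higher cutoffs.

```r
# Toy document-term counts (hypothetical): 4 documents x 4 terms
dtm <- matrix(
  c(2, 0, 1, 0,
    1, 1, 0, 0,
    3, 0, 0, 0,
    0, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("coffee", "tea", "milk", "sugar"))
)

min_freq <- 2  # assumed here to mean total occurrences across the corpus

# Keep only terms whose corpus-wide count reaches the threshold
keep_terms <- colSums(dtm) >= min_freq
pruned <- dtm[, keep_terms, drop = FALSE]

# Drop documents that are left with no terms at all
pruned <- pruned[rowSums(pruned) > 0, , drop = FALSE]

dim(pruned)
```

Raising the threshold shrinks the vocabulary first and can then shrink the number of documents, so it is worth comparing several cutoffs before fitting a model.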

Value

A nested tibble in which each row contains a document-term matrix.

Details

This function is the first of three functions which will take a data frame with a text variable and a URL variable and turn it into an explorable Latent Dirichlet Allocation (LDA) topic model.

We recommend cleaning your text variable deliberately and completing all cleaning steps, including stop-word removal, before running make_DTMs. This gives you more control over exactly which cleaning steps are applied than relying on the `clean_text` argument.
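As a rough illustration of cleaning up front, the base R sketch below lower-cases the text, strips URLs and punctuation, removes a custom stop-word list, and tidies whitespace. The data frame `posts`, the helper `clean_message()`, and the stop-word list `my_stops` are all hypothetical; the column names simply mirror those used in the Examples section.

```r
# Hypothetical posts; column names mirror those used in the Examples section
posts <- data.frame(
  Message = c("Check THIS out: https://example.com/a !!",
              "The   best coffee, ever."),
  Permalink = c("https://example.com/post/1", "https://example.com/post/2"),
  stringsAsFactors = FALSE
)

# Illustrative stop-word list; in practice use a fuller list suited to your data
my_stops <- c("the", "this", "out")

clean_message <- function(x, stops) {
  x <- tolower(x)
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)  # drop URLs
  x <- gsub("[^a-z ]+", " ", x, perl = TRUE)       # keep letters and spaces only
  pattern <- paste0("\\b(", paste(stops, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)          # remove stop words
  trimws(gsub(" +", " ", x))                       # collapse repeated spaces
}

posts$Message <- clean_message(posts$Message, my_stops)
posts$Message

# Not run: with cleaning done up front, the built-in steps could be switched off
# make_DTMs(posts, text_var = Message, url_var = Permalink,
#           clean_text = FALSE, remove_stops = FALSE)
```

Doing the cleaning yourself means every transformation is visible and can be adjusted, for example to keep hashtags or domain-specific terms that a generic cleaner might strip.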

Examples


library(SegmentR)
data <- SegmentR::sprinklr_export[1:100, ] # Slice for time and memory savings
make_DTMs(data, text_var = Message, url_var = Permalink, min_freq = 1)
#> removing stopwords
#> Making DTMs
#> # A tibble: 1 × 5
#>   data               freq_cutoff dtm               n_terms n_docs
#>   <list>                   <dbl> <list>              <dbl>  <dbl>
#> 1 <tibble [100 × 3]>           1 <DcmntTrM[,1349]>    1349    100

make_DTMs(data, text_var = Message, url_var = Permalink, min_freq = c(10, 20))
#> removing stopwords
#> Making DTMs
#> # A tibble: 2 × 5
#>   data               freq_cutoff dtm             n_terms n_docs
#>   <list>                   <dbl> <list>            <dbl>  <dbl>
#> 1 <tibble [100 × 3]>          10 <DcmntTrM[,13]>      13     96
#> 2 <tibble [100 × 3]>          20 <DcmntTrM[,6]>        6     90