Converts a tibble or data frame into one or more document-term matrices (DTMs): one DTM for each value supplied to `min_freq`, or a single DTM if only one value is given. A DTM is a matrix where each row is a document and each column is a term. The values in the matrix are the frequency of each term in each document.
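To make the shape of a DTM concrete, here is a minimal base R sketch that builds one by hand from two toy documents. It is not how make_DTMs works internally. The filtering step assumes that `min_freq` is compared against a term's total count across all documents, which this page does not confirm.

```r
# Two toy documents (illustrative data, not from the package).
docs <- c(doc1 = "cats chase mice", doc2 = "mice fear cats and cats")

# Split each document into terms on single spaces.
tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# Count each term per document; rows are documents, columns are terms.
dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
))
colnames(dtm) <- terms

# ASSUMPTION: keep terms whose total count across documents reaches min_freq.
min_freq <- 2
dtm_filtered <- dtm[, colSums(dtm) >= min_freq, drop = FALSE]

print(dtm)
print(dtm_filtered)
```

Raising the cutoff shrinks the vocabulary, which is consistent with `n_terms` falling as `freq_cutoff` rises in the example output on this page.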
Usage
make_DTMs(
df,
text_var,
url_var = Permalink,
min_freq = 10,
clean_text = FALSE,
in_parallel = TRUE,
remove_stops = TRUE
)
Arguments
- df
A data frame where each row is a separate post.
- text_var
The variable containing the text which you want to explore.
- url_var
The name of the variable containing each post's URL, used for exemplars in explore_LDAs.
- min_freq
The minimum number of times a term must be observed to be considered. Supply a vector of values to create one DTM per value.
- clean_text
Whether to clean the text variable.
- in_parallel
Whether to run the function in parallel.
- remove_stops
Whether to remove English stopwords.
Details
This is the first of three functions that together take a data frame with a text variable and a URL variable and turn it into an explorable Latent Dirichlet Allocation (LDA) topic model.
It is generally advisable to clean your text variable deliberately and to complete all cleaning steps, including stopword removal, before running make_DTMs. We advise this approach over relying on the `clean_text` argument because it gives you more control over exactly which cleaning steps are applied.
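As a rough sketch of that workflow, the base R snippet below cleans a toy data frame before handing it to make_DTMs with `clean_text` and `remove_stops` switched off. The helper `clean_message`, the stopword vector `my_stops`, and the toy `posts` data are hypothetical and exist only for illustration; the cleaning steps you actually need will depend on your data.

```r
# Toy data standing in for a real export (hypothetical values).
posts <- data.frame(
  Message = c(
    "Check THIS out: https://example.com/a !!",
    "The best coffee in town, by far."
  ),
  Permalink = c("https://example.com/post/1", "https://example.com/post/2"),
  stringsAsFactors = FALSE
)

# Illustrative stopword list only; use a proper list for real work.
my_stops <- c("the", "in", "by", "this")

# Hypothetical helper: lower-case, drop URLs and non-letters, remove stopwords.
clean_message <- function(x, stops) {
  x <- tolower(x)
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)
  x <- gsub("[^a-z\\s]", " ", x, perl = TRUE)
  pattern <- paste0("\\b(", paste(stops, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

posts$Message <- clean_message(posts$Message, my_stops)
print(posts$Message)

# Cleaning is already done, so turn off the built-in steps.
# Only runs if SegmentR is installed; behaviour on such tiny data is untested.
if (requireNamespace("SegmentR", quietly = TRUE)) {
  dtms <- SegmentR::make_DTMs(
    posts,
    text_var = Message,
    url_var = Permalink,
    min_freq = 1,
    clean_text = FALSE,
    in_parallel = FALSE,
    remove_stops = FALSE
  )
}
```

Keeping the cleaning in your own function means every step is visible and can be adjusted or reordered before any DTM is built.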
Examples
library(SegmentR)
data <- SegmentR::sprinklr_export[1:100, ] # Slice for time and memory savings
make_DTMs(data, text_var = Message, url_var = Permalink, min_freq = 1)
#> removing stopwords
#> Making DTMs
#> # A tibble: 1 × 5
#> data freq_cutoff dtm n_terms n_docs
#> <list> <dbl> <list> <dbl> <dbl>
#> 1 <tibble [100 × 3]> 1 <DcmntTrM[,1349]> 1349 100
make_DTMs(data, text_var = Message, url_var = Permalink, min_freq = c(10, 20))
#> removing stopwords
#> Making DTMs
#> # A tibble: 2 × 5
#> data freq_cutoff dtm n_terms n_docs
#> <list> <dbl> <list> <dbl> <dbl>
#> 1 <tibble [100 × 3]> 10 <DcmntTrM[,13]> 13 96
#> 2 <tibble [100 × 3]> 20 <DcmntTrM[,6]> 6 90