Fit LDA topic models on Document Term Matrices

Fits multiple LDA models, one for each combination of k_opts, iter_opts and the min_freq values selected in make_DTMs. For example, with 2 settings for each of the three, the function fits 2^3 = 8 models. It can take a long time to run on large data sets or with many parameter combinations.
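The way the model count grows can be sketched in base R. This is only an illustration of the arithmetic, not the function's internal code; the setting values and the column names of the grid are assumptions chosen for the example.

```r
# Hypothetical settings, two per parameter (values are illustrative only).
k_opts    <- c(2, 3)        # numbers of topics
iter_opts <- c(1000, 2000)  # numbers of iterations
min_freq  <- c(1, 5)        # frequency cut-offs chosen earlier, in make_DTMs

# One row per combination of settings, i.e. one row per model to fit.
grid <- expand.grid(k = k_opts, iter = iter_opts, freq_cutoff = min_freq)

n_models <- nrow(grid)
n_models
#> [1] 8
```

Adding a third value to any one of the three vectors raises the count to 12, which is why run time climbs quickly as options are added.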

Usage

fit_LDAs(
  dtms,
  k_opts = 2:3,
  iter_opts = 2000,
  in_parallel = TRUE,
  coherence_n = 10
)

Arguments

dtms: A nested tibble in which each row contains a document-term matrix (DTM).
k_opts: A vector of values of k (the number of topics) to fit LDA models with.
iter_opts: A vector of values of iter (the number of iterations) to fit LDA models with.
in_parallel: Whether to fit the LDA models using parallel processing. TRUE disables progress bars.
coherence_n: The n used when calculating topic coherence. The default is 10; sqrt(n_terms) is another commonly used option.

Value

A nested tibble in which each row contains an LDA model.

Details

Higher values for iter_opts increase the function's run time, but should produce more accurate topic models. There are no hard and fast rules for selecting the value of this parameter, so we advise experimentation.

For small(ish) data sets (say < 10k rows), setting in_parallel = TRUE will not deliver much of a speed gain, as there are significant overheads in setting up the parallel processing. For larger data sets the speed gains are usually significant, but they come at the cost of losing the progress bar (at least for now).
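The set-up overhead can be seen in a generic sketch using base R's parallel package. This is not how fit_LDAs is implemented (its parallel back end is not described here); the toy task, the helper name small_job and the choice of 2 workers are assumptions made purely for illustration.

```r
library(parallel)

# A deliberately small job: square 100 numbers.
small_job <- function(x) x^2
inputs <- 1:100

# Sequential run.
seq_time <- system.time(seq_res <- lapply(inputs, small_job))

# Parallel run, counting cluster start-up and shut-down as part of the cost.
par_time <- system.time({
  cl <- makeCluster(2)
  par_res <- parLapply(cl, inputs, small_job)
  stopCluster(cl)
})

# Both approaches return the same values; only the elapsed time differs.
same_result <- identical(seq_res, par_res)
same_result

seq_time["elapsed"]
par_time["elapsed"]
```

On a job this small, the elapsed time of the parallel run is typically dominated by starting and stopping the workers. Parallel processing only pays off once each unit of work is expensive enough to outweigh that fixed cost.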

Examples

library(SegmentR)
data <- SegmentR:::test_data(lda = FALSE)
#> removing stopwords
#> Making DTMs
seg_dtms <- data$dtm

# We select a low value of 5 for iter_opts to speed up the example.
fit_LDAs(seg_dtms, k_opts = 3, iter_opts = 5, coherence_n = 10)
#> making tuning grid
#> setting up LDAs
#> # A tibble: 2 × 11
#>   data     dtm               freq_cutoff n_terms n_docs     k alpha  delta  iter
#>   <list>   <list>                  <dbl>   <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl>
#> 1 <tibble> <DcmntTrM[,1349]>           1    1349    100     3 0.333 0.0333     5
#> 2 <tibble> <DcmntTrM[,54]>             5      54    100     3 0.333 0.0333     5
#> # ℹ 2 more variables: lda <list>, coherence <list>

# Once again, we set a low value of iter_opts for speed.
fit_LDAs(seg_dtms, k_opts = 4:6, iter_opts = 5, coherence_n = 5)
#> making tuning grid
#> setting up LDAs
#> # A tibble: 6 × 11
#>   data     dtm               freq_cutoff n_terms n_docs     k alpha  delta  iter
#>   <list>   <list>                  <dbl>   <dbl>  <dbl> <int> <dbl>  <dbl> <dbl>
#> 1 <tibble> <DcmntTrM[,1349]>           1    1349    100     4 0.25  0.025      5
#> 2 <tibble> <DcmntTrM[,1349]>           1    1349    100     5 0.2   0.02       5
#> 3 <tibble> <DcmntTrM[,1349]>           1    1349    100     6 0.167 0.0167     5
#> 4 <tibble> <DcmntTrM[,54]>             5      54    100     4 0.25  0.025      5
#> 5 <tibble> <DcmntTrM[,54]>             5      54    100     5 0.2   0.02       5
#> 6 <tibble> <DcmntTrM[,54]>             5      54    100     6 0.167 0.0167     5
#> # ℹ 2 more variables: lda <list>, coherence <list>
if (interactive()) {
  # Run in parallel:
  fit_LDAs(seg_dtms, k_opts = 4:8, iter_opts = 5, in_parallel = TRUE)
}