Fits multiple LDA models, one for each combination of k_opts, iter_opts and the min_freq values selected in make_DTMs. For example, with 2 settings for each of the three, the function fits 2^3 = 8 models. It can take a long time to run on large data or with many parameter combinations.
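To see how quickly the number of models grows, the combinations can be enumerated in base R. This is only an illustration of the counting, not the function's internal code; the `freq_cutoffs` values are hypothetical stand-ins for whatever min_freq settings were passed to make_DTMs.

```r
# Candidate settings: two values for each of the three parameters.
k_opts <- 2:3
iter_opts <- c(1000, 2000)
freq_cutoffs <- c(1, 5) # hypothetical min_freq values chosen in make_DTMs

# One row per combination, i.e. one row per model that would be fitted.
grid <- expand.grid(
  k = k_opts,
  iter = iter_opts,
  freq_cutoff = freq_cutoffs
)

n_models <- nrow(grid)
print(n_models)
```

Adding a third value to any one of the vectors raises the count from 8 to 12, so it is worth checking the grid size before starting a long run.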

Usage

fit_LDAs(
  dtms,
  k_opts = 2:3,
  iter_opts = 2000,
  in_parallel = FALSE,
  coherence_n = 10
)

Arguments

dtms

A nested tibble where each row is a dtm

k_opts

A vector of the different values of k to fit an LDA model with

iter_opts

A vector of the different values of iter to fit an LDA model with

in_parallel

Whether to run LDA fitting in parallel using mirai. Requires mirai to be installed and daemons to be set up beforehand (see Details). `TRUE` disables progress bars.

coherence_n

The n used in the coherence calculation. Defaults to 10; sqrt(n_terms) is another common choice.

Value

A nested tibble in which each row contains an LDA model.
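As a rough sketch of working with the result, the example below pulls a single model out of a list column. It uses a base R data frame as a stand-in for the returned tibble, so it runs without SegmentR; the column names `freq_cutoff`, `k` and `lda` are taken from the printed output in the Examples, and the placeholder strings are assumptions standing in for fitted model objects.

```r
# Stand-in for the tibble returned by fit_LDAs(): one row per model.
ldas <- data.frame(
  freq_cutoff = c(1, 1, 5, 5),
  k = c(4L, 5L, 4L, 5L)
)
# Placeholder strings where the real result would hold fitted LDA models.
ldas$lda <- list("model_a", "model_b", "model_c", "model_d")

# Find the row for one parameter combination, then extract its model.
row <- which(ldas$freq_cutoff == 5 & ldas$k == 4L)
chosen <- ldas$lda[[row]]
print(chosen)
```

With a real result the same `[[` indexing on the `lda` column should return the fitted model for that row.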

Details

Higher values for iter_opts will increase function run time, but should produce more accurate topic models. There are no hard and fast rules for selecting the value of this parameter, so we advise experimentation.

## Parallel execution

When `in_parallel = TRUE`, individual LDA models are fitted concurrently using mirai via `purrr::in_parallel()`. This can deliver meaningful speed-ups when fitting many model combinations on larger data sets (say > 10k rows). For smaller jobs the overhead of launching workers may outweigh the gains; experiment to find the break-even point for your data.

SegmentR does **not** start or stop mirai daemons for you. You must call `mirai::daemons()` before running `fit_LDAs(..., in_parallel = TRUE)` and clean up with `mirai::daemons(0)` afterwards. This is deliberate: daemon lifecycle depends on your hardware, the number of models you are fitting, and whether you want to reuse the same workers across multiple calls. Leaving session management to the caller avoids hidden side-effects and gives you full control over resource allocation.

```
# Start four workers before fitting
mirai::daemons(4)
ldas <- fit_LDAs(dtms, k_opts = 4:8, iter_opts = 2000, in_parallel = TRUE)
# Shut the workers down afterwards
mirai::daemons(0)
```

Progress bars are not available during parallel execution.
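Because the caller owns the daemon lifecycle, one possible pattern is a small wrapper that guarantees cleanup even if fitting fails part-way. The sketch below is not part of SegmentR: `fit_with_daemons` and `n_workers` are hypothetical names, and it assumes both mirai and SegmentR are installed when the function is actually called.

```r
# Hypothetical helper (not provided by SegmentR): start workers, fit in
# parallel, and always shut the workers down on exit.
fit_with_daemons <- function(dtms, n_workers = 4, ...) {
  mirai::daemons(n_workers)
  # on.exit() runs even when fit_LDAs() throws an error, so no workers
  # are left running after a failed fit.
  on.exit(mirai::daemons(0), add = TRUE)
  SegmentR::fit_LDAs(dtms, ..., in_parallel = TRUE)
}

# Usage, assuming `dtms` was produced by make_DTMs():
# ldas <- fit_with_daemons(dtms, n_workers = 4, k_opts = 4:8, iter_opts = 2000)
```

If you want to reuse the same workers across several calls, skip the wrapper and manage `mirai::daemons()` yourself, as described above.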

Examples

library(SegmentR)
data <- SegmentR:::test_data(lda = FALSE)
#> Making DTMs
seg_dtms <- data$dtm

#We select a low value of 5 for iter_opts to speed up the example.
fit_LDAs(seg_dtms, k = 3, iter = 5, coherence_n = 10)
#> making tuning grid
#> setting up LDAs
#> # A tibble: 2 × 11
#>   data     dtm               freq_cutoff n_terms n_docs     k alpha  delta  iter
#>   <list>   <list>                  <dbl>   <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl>
#> 1 <tibble> <DcmntTrM[,1300]>           1    1300    100     3 0.333 0.0333     5
#> 2 <tibble> <DcmntTrM[,48]>             5      48     99     3 0.333 0.0333     5
#> # ℹ 2 more variables: lda <list>, coherence <list>

#Once again, we set a low value for speed.
fit_LDAs(seg_dtms, k = 4:6, iter = 5, coherence_n = 5)
#> making tuning grid
#> setting up LDAs
#> # A tibble: 6 × 11
#>   data     dtm               freq_cutoff n_terms n_docs     k alpha  delta  iter
#>   <list>   <list>                  <dbl>   <dbl>  <dbl> <int> <dbl>  <dbl> <dbl>
#> 1 <tibble> <DcmntTrM[,1300]>           1    1300    100     4 0.25  0.025      5
#> 2 <tibble> <DcmntTrM[,1300]>           1    1300    100     5 0.2   0.02       5
#> 3 <tibble> <DcmntTrM[,1300]>           1    1300    100     6 0.167 0.0167     5
#> 4 <tibble> <DcmntTrM[,48]>             5      48     99     4 0.25  0.025      5
#> 5 <tibble> <DcmntTrM[,48]>             5      48     99     5 0.2   0.02       5
#> 6 <tibble> <DcmntTrM[,48]>             5      48     99     6 0.167 0.0167     5
#> # ℹ 2 more variables: lda <list>, coherence <list>
if (interactive()) {
  # Run in parallel; start daemons first:
  mirai::daemons(4)
  fit_LDAs(seg_dtms, k = 4:8, iter = 5, in_parallel = TRUE)
  mirai::daemons(0)
}