Fits multiple LDA models, one for each combination of k_opts, iter_opts and the min_freq values selected in make_DTMs. For example, with 2 settings for each of the three, 2 × 2 × 2 = 8 models are fitted. The function can take a long time to run on large data or with many parameter combinations.
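To get a feel for how quickly the number of models grows, the combinations can be counted before fitting anything. The sketch below uses base R's `expand.grid()` with made-up settings. The values and the name `freq_cutoff` (standing in for the min_freq values chosen in make_DTMs) are illustrative assumptions, and this is not necessarily how `fit_LDAs()` builds its own tuning grid.

```r
# Hypothetical settings, two values for each of the three parameters.
k_opts      <- c(4, 6)
iter_opts   <- c(500, 2000)
freq_cutoff <- c(1, 5)  # assumed stand-in for the min_freq values from make_DTMs()

# One row per combination, i.e. one row per model that would be fitted.
grid <- expand.grid(
  freq_cutoff = freq_cutoff,
  k           = k_opts,
  iter        = iter_opts
)

nrow(grid)
#> [1] 8
```

Adding a third value to any one of the vectors raises the count to 12, so it is worth checking the grid size before starting a long run.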
Arguments
- dtms
A nested tibble where each row is a dtm
- k_opts
A vector of the different values of k to fit an LDA model with
- iter_opts
A vector of the different values of iter to fit an LDA model with
- in_parallel
Whether to run LDA fitting in parallel using mirai. Requires mirai to be installed and daemons to be set up beforehand (see Details). `TRUE` disables progress bars.
- coherence_n
The value of n used in the coherence calculation. The default is 10; sqrt(n_terms) is another reasonable choice.
Details
Higher values for iter_opts will increase function run time, but should produce more accurate topic models. There are no hard and fast rules for selecting the value of this parameter, so we advise experimentation.
## Parallel execution
When `in_parallel = TRUE`, individual LDA models are fitted concurrently using mirai via `purrr::in_parallel()`. This can deliver meaningful speed-ups when fitting many model combinations on larger data sets (say > 10k rows). For smaller jobs the overhead of launching workers may outweigh the gains; experiment to find the break-even point for your data.
SegmentR does **not** start or stop mirai daemons for you. You must call `mirai::daemons()` before running `fit_LDAs(..., in_parallel = TRUE)` and clean up with `mirai::daemons(0)` afterwards. This is deliberate: daemon lifecycle depends on your hardware, the number of models you are fitting, and whether you want to reuse the same workers across multiple calls. Leaving session management to the caller avoids hidden side-effects and gives you full control over resource allocation.
```
mirai::daemons(4)
ldas <- fit_LDAs(dtms, k = 4:8, iter = 2000, in_parallel = TRUE)
mirai::daemons(0)
```
Progress bars are not available during parallel execution.
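Because the right number of daemons depends on your hardware and on how many models you are fitting, one rough way to choose it is to cap the worker count at both the number of models and the number of available cores. The helper below is a hypothetical sketch and is not part of SegmentR or mirai. It only uses `parallel::detectCores()` from base R, and the choice to leave one core free is an assumption, not a recommendation from the package.

```r
# Hypothetical helper (not part of SegmentR or mirai): suggest a daemon count.
suggest_daemons <- function(n_models, reserve = 1L) {
  cores <- parallel::detectCores()
  # detectCores() can return NA on some platforms; fall back to a single core.
  if (is.na(cores)) cores <- 1L
  # Leave `reserve` cores free, never start more workers than there are models,
  # and always return at least one worker.
  max(1L, min(n_models, cores - reserve))
}

n_models  <- 10L  # hypothetical number of models in the tuning grid
n_daemons <- suggest_daemons(n_models)
n_daemons
```

The result could then be passed to `mirai::daemons()` before calling `fit_LDAs(..., in_parallel = TRUE)`, followed by `mirai::daemons(0)` to clean up.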
Examples
library(SegmentR)
data <- SegmentR:::test_data(lda = FALSE)
#> Making DTMs
seg_dtms <- data$dtm
# We select a low value of 5 for iter_opts to speed up the example.
fit_LDAs(seg_dtms, k = 3, iter = 5, coherence_n = 10)
#> making tuning grid
#> setting up LDAs
#> # A tibble: 2 × 11
#> data dtm freq_cutoff n_terms n_docs k alpha delta iter
#> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 <tibble> <DcmntTrM[,1300]> 1 1300 100 3 0.333 0.0333 5
#> 2 <tibble> <DcmntTrM[,48]> 5 48 99 3 0.333 0.0333 5
#> # ℹ 2 more variables: lda <list>, coherence <list>
# Once again, we set a low value for speed.
fit_LDAs(seg_dtms, k = 4:6, iter = 5, coherence_n = 5)
#> making tuning grid
#> setting up LDAs
#> # A tibble: 6 × 11
#> data dtm freq_cutoff n_terms n_docs k alpha delta iter
#> <list> <list> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 <tibble> <DcmntTrM[,1300]> 1 1300 100 4 0.25 0.025 5
#> 2 <tibble> <DcmntTrM[,1300]> 1 1300 100 5 0.2 0.02 5
#> 3 <tibble> <DcmntTrM[,1300]> 1 1300 100 6 0.167 0.0167 5
#> 4 <tibble> <DcmntTrM[,48]> 5 48 99 4 0.25 0.025 5
#> 5 <tibble> <DcmntTrM[,48]> 5 48 99 5 0.2 0.02 5
#> 6 <tibble> <DcmntTrM[,48]> 5 48 99 6 0.167 0.0167 5
#> # ℹ 2 more variables: lda <list>, coherence <list>
if (interactive()) {
  # Run in parallel; start daemons first:
  mirai::daemons(4)
  fit_LDAs(seg_dtms, k = 4:8, iter = 5, in_parallel = TRUE)
  mirai::daemons(0)
}