Fits multiple LDA models, one for each combination of k_opts, iter_opts and the min_freq values selected in make_DTMs. For example, with 2 settings for each of the three, the function fits 2 × 2 × 2 = 8 models. It can take a long time to run on large data or with many parameter combinations.
Arguments
- dtms
A nested tibble where each row is a dtm
- k_opts
A vector of the different values of k to fit an LDA model with
- iter_opts
A vector of the different values of iter to fit an LDA model with
- in_parallel
Whether to fit the LDA models using parallel processing. Setting this to TRUE disables progress bars.
- coherence_n
The n used in the coherence calculation. The default is 10; sqrt(n_terms) is another reasonable option.
Details
Higher values for iter_opts will increase the function's run time, but should produce more accurate topic models. There are no hard and fast rules for selecting the value of this parameter, so we advise experimentation.
For small(ish) data sets (say < 10k rows), setting in_parallel = TRUE will not deliver much of a speed gain, as there are significant overheads in setting up the parallel processing. For larger data sets the speed gains are usually significant, but they come at the cost of losing the progress bar (at least for now).
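Because one model is fitted per combination of settings, the number of fits grows multiplicatively as options are added. The following is a minimal base R sketch of that arithmetic. The option values are hypothetical, and expand.grid is used only for illustration; it is not necessarily how SegmentR builds its tuning grid.

```r
# Hypothetical option values, chosen only for illustration.
k_opts <- c(3, 5)
iter_opts <- c(100, 500)
freq_cutoffs <- c(1, 5)  # stands in for the min_freq values chosen in make_DTMs

# One row per combination. This is a base R illustration of how the
# count grows, not SegmentR's internal tuning grid.
grid <- expand.grid(
  freq_cutoff = freq_cutoffs,
  k = k_opts,
  iter = iter_opts
)

n_models <- nrow(grid)
print(n_models)  # 2 * 2 * 2 = 8
```

Adding a third value to any one of the three vectors would raise the count from 8 to 12, which is worth checking before starting a long run.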
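The set-up overhead mentioned above can be seen with a small experiment. This sketch uses base R's parallel package as a stand-in, on the assumption that any parallel backend pays a similar start-up cost; it does not reproduce how fit_LDAs implements in_parallel. The task and the worker count of 2 are arbitrary choices for illustration.

```r
library(parallel)

# A deliberately cheap task, standing in for a small model fit.
cheap_task <- function(i) sum(sqrt(seq_len(1000)) * i)

inputs <- 1:20

# Sequential run.
seq_time <- system.time(seq_res <- lapply(inputs, cheap_task))[["elapsed"]]

# Parallel run, counting cluster start-up and shutdown as part of the cost.
par_time <- system.time({
  cl <- makeCluster(2)
  par_res <- parLapply(cl, inputs, cheap_task)
  stopCluster(cl)
})[["elapsed"]]

# Both approaches return the same results; only the wall-clock cost differs.
stopifnot(identical(seq_res, par_res))
cat("sequential:", seq_time, "s; parallel:", par_time, "s\n")
```

On a task this small the parallel run is typically slower, because starting the workers dominates the total time. The balance shifts as each individual task becomes more expensive.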
Examples
library(SegmentR)
data <- SegmentR:::test_data(lda = FALSE)
#> removing stopwords
#> Making DTMs
seg_dtms <- data$dtm
# We select a low value of 5 for iter_opts to speed up the example.
fit_LDAs(seg_dtms, k_opts = 3, iter_opts = 5, coherence_n = 10)
#> making tuning grid
#> setting up LDAs
#> # A tibble: 2 × 11
#> data dtm freq_cutoff n_terms n_docs k alpha delta iter
#> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 <tibble> <DcmntTrM[,1349]> 1 1349 100 3 0.333 0.0333 5
#> 2 <tibble> <DcmntTrM[,54]> 5 54 100 3 0.333 0.0333 5
#> # ℹ 2 more variables: lda <list>, coherence <list>
# Once again, we set a low value of iter_opts for speed.
fit_LDAs(seg_dtms, k_opts = 4:6, iter_opts = 5, coherence_n = 5)
#> making tuning grid
#> setting up LDAs
#> # A tibble: 6 × 11
#> data dtm freq_cutoff n_terms n_docs k alpha delta iter
#> <list> <list> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 <tibble> <DcmntTrM[,1349]> 1 1349 100 4 0.25 0.025 5
#> 2 <tibble> <DcmntTrM[,1349]> 1 1349 100 5 0.2 0.02 5
#> 3 <tibble> <DcmntTrM[,1349]> 1 1349 100 6 0.167 0.0167 5
#> 4 <tibble> <DcmntTrM[,54]> 5 54 100 4 0.25 0.025 5
#> 5 <tibble> <DcmntTrM[,54]> 5 54 100 5 0.2 0.02 5
#> 6 <tibble> <DcmntTrM[,54]> 5 54 100 6 0.167 0.0167 5
#> # ℹ 2 more variables: lda <list>, coherence <list>
if (interactive()) {
  # Run in parallel:
  fit_LDAs(seg_dtms, k_opts = 4:8, iter_opts = 5, in_parallel = TRUE)
}