There is no one-size-fits-all value for topic_cutoff. The topic probabilities satisfy the axioms of a probability distribution (each lies between 0 and 1, and they sum to 1 across topics), but they are best read as relative scores rather than exact probabilities.
Arguments
- linked_df
A data frame that has been linked to topic probabilities through topics_link.
- topic_cutoff
The cut-off probability. Defaults to 0.5.
Details
The meaning of a score depends on the number of topics, k. If probability were spread evenly across topics, each topic would receive 1/k. For example, if k = 8, 1/k = 0.125, so a score of 0.250 is twice that baseline. If k = 4, 1/k = 0.250, so the same score of 0.250 is exactly the baseline. Not all 0.250s are equal.
Some domain knowledge and exploration of the data set are needed to settle on an appropriate cut-off.
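The comparison above can be sketched in a few lines of R. This is only an illustration of the 1/k arithmetic: baseline_ratio is a hypothetical helper written for this example and is not part of SegmentR.

```r
# Illustrative only: baseline_ratio() is a hypothetical helper, not a SegmentR function.
# It expresses a topic probability p as a multiple of the uniform baseline 1/k.
baseline_ratio <- function(p, k) {
  stopifnot(k >= 1, all(p >= 0), all(p <= 1))
  p / (1 / k)
}

baseline_ratio(0.25, k = 8)  # twice the baseline when there are 8 topics
baseline_ratio(0.25, k = 4)  # exactly the baseline when there are 4 topics
```

A ratio near 1 suggests the score carries little information about topic membership, which is one way to sanity-check a candidate cut-off against the chosen k.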
Examples
list_data <- SegmentR:::test_data()
#> removing stopwords
#> Making DTMs
#> making tuning grid
#> setting up LDAs
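# Pull the topic probabilities and the matching data from the test object
probabilities <- list_data$explore$probabilities[[1]]
data <- list_data$lda$data[[1]]
# Link the data to the topic probabilities, then classify with a 0.75 cut-off
# (stricter than the 0.5 default)
linked <- topics_link(data, probabilities)
classified <- topics_classify(linked, 0.75)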