An opinionated function which offers some helpful defaults for common text cleaning needs. The function removes most URLs by default.

Usage

clean_text(
  df,
  text_var = message,
  tolower = TRUE,
  remove_hashtags = TRUE,
  remove_mentions = TRUE,
  remove_all_non_ascii = TRUE,
  remove_punctuation = TRUE,
  remove_digits = TRUE,
  in_parallel = FALSE
)

Arguments

df

A tibble or data frame containing the text variable to be cleaned

text_var

The text variable (column) in df that holds the message for each observation and that the user wishes to clean

tolower

Should all text be converted to lower case?

remove_hashtags

Should hashtags be removed?

remove_mentions

Should any user/profile mentions be removed?

remove_all_non_ascii

Should non-ASCII characters be removed? This includes some accented characters (but not Latin ones), characters from non-Latin scripts, emojis, etc.

remove_punctuation

Should punctuation be removed?

remove_digits

Should digits be removed?

in_parallel

Deprecated. Parallelism is no longer used, as sequential processing is faster for typical workloads.
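
To give a feel for what the toggles above control, here is a minimal base R sketch. It is not the function's actual implementation: the helper name sketch_clean and every regular expression in it are assumptions chosen for illustration, and the non-ASCII step is left out.

```r
# Hypothetical helper, NOT part of the package: a rough base R approximation
# of the kind of step each argument switches on or off.
sketch_clean <- function(x,
                         tolower = TRUE,
                         remove_hashtags = TRUE,
                         remove_mentions = TRUE,
                         remove_punctuation = TRUE,
                         remove_digits = TRUE) {
  # URLs are always removed in this sketch (assumed pattern)
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)
  if (remove_hashtags)    x <- gsub("#\\S+", " ", x, perl = TRUE)
  if (remove_mentions)    x <- gsub("@\\S+", " ", x, perl = TRUE)
  if (remove_punctuation) x <- gsub("[[:punct:]]", " ", x, perl = TRUE)
  if (remove_digits)      x <- gsub("[[:digit:]]", " ", x, perl = TRUE)
  if (tolower)            x <- base::tolower(x)
  # Collapse runs of whitespace and trim both ends
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

msg <- "Hello @bob, check https://example.com #wow 123!"
default_result <- sketch_clean(msg)
relaxed_result <- sketch_clean(msg, tolower = FALSE, remove_punctuation = FALSE)
```

In this sketch the order of the steps matters: hashtags and mentions are stripped before punctuation, because removing the # and @ characters first would leave the tag text behind.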

Value

The data frame provided, with a cleaned text variable.

Details

The function removes a row from the data frame if its text is NA once the cleaning steps have been applied, so the output may have fewer rows than the input.
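
Because rows can be dropped, it can be worth comparing row counts before and after cleaning. The following base R sketch mimics that behaviour on a toy data frame; the data and the single cleaning step are assumptions for illustration, not the function's own code.

```r
# Toy data (hypothetical): the second message is missing
df <- data.frame(
  id = 1:3,
  message = c("Hello World", NA, "Bye now"),
  stringsAsFactors = FALSE
)

# A stand-in cleaning step; NA values stay NA
df$message <- tolower(df$message)

# Drop rows whose text is NA after cleaning
cleaned <- df[!is.na(df$message), , drop = FALSE]

# Number of rows lost to cleaning
rows_dropped <- nrow(df) - nrow(cleaned)
```

If the row count matters downstream, for example when joining the cleaned text back to other data, keeping an identifier column such as id makes the dropped rows easy to trace.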

The function tries to remove emojis, non-ASCII characters, symbols, etc. without removing Latin accented letters.
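
One way to get this kind of behaviour is a negated character class that keeps printable ASCII plus a range of accented Latin letters. The sketch below is an assumption about how such a filter could look, not the expression the function actually uses; the range U+00C0 to U+00FF (Latin-1 Supplement letters) is chosen purely for illustration.

```r
# Example strings written with Unicode escapes so the file stays ASCII
x <- c(
  "caf\u00e9 \U0001F600 ok",   # accented e, then an emoji
  "na\u00efve \u4f60\u597d"    # i with diaeresis, then two CJK characters
)

# Keep printable ASCII (U+0020 to U+007E) and U+00C0 to U+00FF; drop the rest.
# Assumed pattern, for illustration only.
kept <- gsub("[^\\x{20}-\\x{7E}\\x{C0}-\\x{FF}]", "", x, perl = TRUE)
```

Note that the kept range also contains the multiplication and division signs (U+00D7 and U+00F7), so a stricter filter would need to exclude them explicitly.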

The remove_emojis argument was replaced with remove_all_non_ascii to better reflect what the original emoji-removal regular expression was doing.

Examples

if (interactive()) {
  # Clean with the default settings
  cleaned_data <- clean_text(
    df = ParseR::sprinklr_export,
    text_var = Message
  )

  # Keep capital letters and punctuation
  cleaned_data <- clean_text(
    df = ParseR::sprinklr_export,
    text_var = Message,
    tolower = FALSE,
    remove_punctuation = FALSE
  )
}