Cleanses text by removing escape characters, non-alphanumeric characters, long words, and excess whitespace, and by converting all letters to lower case.
cleansing_corpus( text, escape_chars = TRUE, nonalphanum = TRUE, longwords = TRUE, whitespace = TRUE, tolower = TRUE )
Argument | Description |
---|---|
text | Character vector of free text to be cleansed. |
escape_chars | If TRUE, removes escape characters for |
nonalphanum | If TRUE, removes non-alphanumeric characters. |
longwords | If TRUE, removes words with more than 35 characters. |
whitespace | If TRUE, removes excess whitespace. |
tolower | If TRUE, converts all letters to lower case. |
A character vector of the cleansed text.
txt <- "It has roots in a piece of classical Latin literature from 45 BC"
cleansing_corpus(txt)
#> [1] "it has roots in a piece of classical latin literature from 45 bc"
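To make the effect of each argument concrete, the documented steps can be approximated in base R. This is only a sketch, not the package's source: the function name `cleanse_sketch`, the regular expressions, and the order of the steps are all assumptions. In particular, "escape characters" is read here as control characters such as tabs and newlines, and a "word" as any run of non-whitespace characters.

```r
# Hypothetical base-R approximation of the documented cleansing steps.
# NOT the implementation of cleansing_corpus(); the patterns are assumptions.
cleanse_sketch <- function(text,
                           escape_chars = TRUE,
                           nonalphanum = TRUE,
                           longwords = TRUE,
                           whitespace = TRUE,
                           tolower = TRUE) {
  # Replace control characters (tab, newline, ...) with a space.
  if (escape_chars) text <- gsub("[[:cntrl:]]", " ", text)
  # Replace anything that is neither alphanumeric nor whitespace with a space.
  if (nonalphanum) text <- gsub("[^[:alnum:][:space:]]", " ", text)
  # Drop runs of more than 35 non-whitespace characters.
  if (longwords) text <- gsub("\\S{36,}", "", text, perl = TRUE)
  # Collapse repeated whitespace and trim both ends.
  if (whitespace) text <- trimws(gsub("[[:space:]]+", " ", text))
  # base:: prefix because the argument shares the function's name.
  if (tolower) text <- base::tolower(text)
  text
}

cleanse_sketch("Hello,\tWORLD!!  ")
#> [1] "hello world"
cleanse_sketch("Hello, World!", tolower = FALSE)
#> [1] "Hello World"
```

Each step is guarded by its own flag, so any one of them can be turned off without affecting the others. The real function may handle edge cases differently.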