Measures the weighted amount of information concerning the specificity of terms in a corpus. Term frequency–inverse document frequency (tf–idf) is one of the most frequently applied weighting schemes in information retrieval systems. The tf–idf weight of a word increases in proportion to the number of times the word appears in a document and is offset by the number of documents in the corpus that contain the word. Variations of tf–idf are often used to estimate a document's relevance to a free-text query.
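
To make the "proportional but offset" idea concrete, here is a minimal base R sketch on a toy corpus. It assumes the simplest variant (raw count times the plain logarithmic idf), which is not necessarily what tf_idf() computes with its defaults; the helper name tfidf_term and the data are hypothetical.

```r
# Toy corpus of three tokenised documents (illustrative data only)
docs <- list(
  d1 = c("data", "analyst", "data"),
  d2 = c("data", "engineer"),
  d3 = c("zoologist")
)

# Hypothetical helper: raw-count tf times plain idf for a single term.
# Assumes the term occurs in at least one document.
tfidf_term <- function(term, docs) {
  tf  <- vapply(docs, function(d) sum(d == term), numeric(1))  # occurrences per document
  idf <- log(length(docs) / sum(tf > 0))                       # offset by document frequency
  tf * idf
}

tfidf_term("data", docs)       # frequent term, present in two of three documents
tfidf_term("zoologist", docs)  # rare term, present in a single document
```

In this toy corpus the single occurrence of "zoologist" in d3 receives a higher weight than the two occurrences of "data" in d1, because "data" is spread across more documents.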

tf_idf(
  corpus,
  stopwords = NULL,
  id_col = "id",
  text_col = "text",
  tf_weight = "double_norm",
  idf_weight = "idf_smooth",
  min_chars = 2,
  norm = TRUE
)

Arguments

corpus

Input data, with an id column and a text column. Can be of type data.frame or data.table.

stopwords

A character vector of stopwords. Stopwords are filtered out before calculating numerical statistics.

id_col

Name of the input data column that holds the document ids.

text_col

Name of the input data column that holds the document texts.

tf_weight

Weighting scheme for term frequency. Choices are raw_count (raw count), double_norm (double normalization at 0.5) and log_norm (log normalization).
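
The three schemes can be sketched in base R as follows. The formulas are the common textbook definitions and are an assumption here; the package's internal implementation may differ in detail. The helper name tf_sketch is hypothetical.

```r
# Hypothetical helper, not part of the package: term-frequency weights
# for the tokens of one document under the three named schemes.
tf_sketch <- function(tokens, tf_weight = c("raw_count", "double_norm", "log_norm")) {
  tf_weight <- match.arg(tf_weight)
  counts <- table(tokens)                              # raw count of each term
  f <- setNames(as.numeric(counts), names(counts))
  switch(tf_weight,
    raw_count   = f,
    double_norm = 0.5 + 0.5 * f / max(f),              # scaled by the most frequent term
    log_norm    = log(1 + f)
  )
}

tokens <- c("data", "analyst", "data", "engineer")
tf_sketch(tokens, "raw_count")
tf_sketch(tokens, "double_norm")
tf_sketch(tokens, "log_norm")
```

Double normalization keeps every weight between 0.5 and 1, which limits the advantage of long documents in which all counts tend to be larger.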

idf_weight

Weighting scheme for inverse document frequency. Choices are idf (inverse document frequency) and idf_smooth (smoothed inverse document frequency).
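
A minimal base R sketch of the two idf variants, assuming the usual definitions log(N / n_t) and log(N / (1 + n_t)) + 1, where N is the number of documents and n_t the number of documents containing term t. Whether the package uses exactly these expressions is an assumption; idf_sketch is a hypothetical helper.

```r
# Hypothetical helper: inverse document frequency for every term
# in a list of tokenised documents.
idf_sketch <- function(docs, idf_weight = c("idf", "idf_smooth")) {
  idf_weight <- match.arg(idf_weight)
  n_docs <- length(docs)
  # Count each term at most once per document
  doc_freq <- table(unlist(lapply(docs, unique)))
  n_t <- setNames(as.numeric(doc_freq), names(doc_freq))
  switch(idf_weight,
    idf        = log(n_docs / n_t),
    idf_smooth = log(n_docs / (1 + n_t)) + 1
  )
}

docs <- list(
  c("data", "analyst", "data"),
  c("data", "engineer"),
  c("zoologist")
)
idf_sketch(docs, "idf")
idf_sketch(docs, "idf_smooth")
```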

min_chars

Words with fewer characters than min_chars are filtered out before calculating numerical statistics.
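
The combined effect of stopwords and min_chars can be illustrated with a short base R sketch. It assumes whitespace tokenisation and an arbitrary order of the two filters; how the package tokenises text is not stated on this page, and filter_tokens is a hypothetical helper.

```r
# Hypothetical helper: split on whitespace, then drop short words and stopwords.
filter_tokens <- function(text, stopwords = NULL, min_chars = 2) {
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens <- tokens[nchar(tokens) >= min_chars]  # drop words shorter than min_chars
  tokens[!tokens %in% stopwords]                # drop stopwords (no-op when NULL)
}

filter_tokens("a keeper of the zoo", stopwords = c("of", "the"))
filter_tokens("a keeper of the zoo")
```

The first call keeps only "keeper" and "zoo"; the second, without stopwords, removes just the one-character word "a".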

norm

Logical value indicating whether document normalization is applied.
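
This page does not state which normalization is applied. A common choice in tf–idf pipelines is to scale each document's weights to unit Euclidean (L2) length, and the sketch below assumes that; it is not confirmed to be what the package does. The helper name l2_normalise is hypothetical.

```r
# Hypothetical helper: scale one document's term weights to unit L2 length.
# ASSUMPTION: L2 normalization; the package may normalize differently.
l2_normalise <- function(weights) {
  len <- sqrt(sum(weights^2))
  if (len == 0) weights else weights / len  # leave all-zero documents unchanged
}

l2_normalise(c(zoologist = 3, zoology = 4))
```

Under this assumption, weights from long and short documents become comparable, because each document's weight vector has the same length.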

Value

A data.table with three columns: class (derived from the given document ids), term and tfIdf.

Examples

library(data.table)
corpus <- copy(occupations_bundle)
invisible(corpus[, text := paste(preferredLabel, altLabels)])
invisible(corpus[, text := cleansing_corpus(text)])
corpus <- corpus[, .(conceptUri, text)]
setnames(corpus, c("id", "text"))
tf_idf(corpus)
#>                                       class            term      tfIdf
#>     1: 1d8f8111-79dd-41dc-aa2a-12f3192dde3c              2d 0.06854643
#>     2: 364d9e9f-b14c-4905-b9e5-5de623dba268              2d 0.07274106
#>     3: 48c8eccb-4bd0-487e-a187-4953cbbe956e              2d 0.08415213
#>     4: 951b9d94-9bf1-4771-877d-3a26b75d9e53              2d 0.08537603
#>     5: eedbf8b1-9432-4f42-ab7f-db5cd3eb2a05             360 0.10033651
#>    ---
#> 34672: ff4c28e2-66b1-4ae7-9053-fba8ec3428be      zoological 0.07271548
#> 34673: eee68e13-8248-43bb-a9fb-959832a6c217       zoologist 0.05222478
#> 34674: ff4c28e2-66b1-4ae7-9053-fba8ec3428be       zoologist 0.07546777
#> 34675: ff4c28e2-66b1-4ae7-9053-fba8ec3428be         zoology 0.12342853
#> 34676: e1fdca51-a0e1-4bea-822f-82a090ece780 zootechnologist 0.14447950
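
Because the result is a regular data.table, it can be queried with the usual data.table idioms. The sketch below picks the highest-weighted term per document; it uses a small hand-built table whose rows are copied from the example output above rather than a fresh call to tf_idf(), and the variable names res and top_terms are hypothetical.

```r
library(data.table)

# Four rows copied from the example output (not a new tf_idf() result)
res <- data.table(
  class = c(rep("ff4c28e2-66b1-4ae7-9053-fba8ec3428be", 3),
            "eee68e13-8248-43bb-a9fb-959832a6c217"),
  term  = c("zoological", "zoologist", "zoology", "zoologist"),
  tfIdf = c(0.07271548, 0.07546777, 0.12342853, 0.05222478)
)

# Sort by descending weight, then keep the first row of each document
top_terms <- res[order(-tfIdf), .SD[1], by = class]
top_terms
```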