Measures the weighted amount of information concerning the specificity of terms in a corpus. Term frequency–inverse document frequency (tf–idf) is one of the most frequently applied weighting schemes in information retrieval systems. The tf–idf is a statistical measure that increases in proportion to the number of times a word appears in a document, but is offset by the number of documents in the corpus that contain the word. Variations of tf–idf are often used to estimate a document's relevance given a free-text query.
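
The idea of a term count offset by document frequency can be illustrated in a few lines of base R. This is only a minimal sketch on a made-up corpus: it uses a raw term count and a plain `log(N / df)` inverse document frequency, which is an assumption and not necessarily what the `"double_norm"` and `"idf_smooth"` defaults compute.

```r
# Toy corpus: three short documents (made up for illustration).
docs <- c(
  d1 = "data engineer builds data pipelines",
  d2 = "software engineer writes code",
  d3 = "zoologist studies animals"
)

# Tokenise on single spaces; strsplit() keeps the document names.
tokens <- strsplit(docs, " ", fixed = TRUE)

# Term frequency: raw count of each term within each document.
term_counts <- lapply(tokens, function(x) {
  counts <- table(x)
  setNames(as.numeric(counts), names(counts))
})

# Document frequency: number of documents that contain each term.
doc_freq <- table(unlist(lapply(tokens, unique)))

# Plain idf = log(N / df). Assumption: textbook form, not the package's scheme.
idf <- setNames(log(length(docs) / as.numeric(doc_freq)), names(doc_freq))

# tf-idf = term count * idf, computed per document.
tfidf <- lapply(term_counts, function(tf) tf * idf[names(tf)])

print(tfidf$d1)
```

In `d1`, "data" occurs twice and in no other document, so it scores higher than "engineer", which occurs once and is shared with `d2`.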
```r
tf_idf(
  corpus,
  stopwords = NULL,
  id_col = "id",
  text_col = "text",
  tf_weight = "double_norm",
  idf_weight = "idf_smooth",
  min_chars = 2,
  norm = TRUE
)
```
Argument | Description |
---|---|
corpus | Input data, with an id column and a text column. Can be of type data.frame or data.table. |
stopwords | A character vector of stopwords. Stopwords are filtered out before calculating numerical statistics. |
id_col | Input data column name with the ids of the documents. |
text_col | Input data column name with the documents. |
tf_weight | Weighting scheme of term frequency. Choices are |
idf_weight | Weighting scheme of inverse document frequency. Choices are |
min_chars | Words with fewer characters than `min_chars` are filtered out. |
norm | Boolean value for document normalization. |
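
When the input columns are not named `id` and `text`, the `id_col` and `text_col` arguments point the function at the right columns. The call below is a hedged sketch: the data frame, its column names and the stopword list are hypothetical, and `tf_idf()` is assumed to be available from the attached package, so the call is guarded and skipped otherwise.

```r
# Hypothetical corpus whose columns are not named "id" and "text".
docs <- data.frame(
  doc_id = c("d1", "d2", "d3"),
  body = c(
    "data engineer builds data pipelines",
    "software engineer writes code",
    "zoologist studies animals"
  ),
  stringsAsFactors = FALSE
)

# Assumption: tf_idf() comes from the attached package. The guard only keeps
# this sketch runnable when the package is not loaded.
if (exists("tf_idf", mode = "function")) {
  weights <- tf_idf(
    docs,
    stopwords = c("the", "of", "and"),  # illustrative stopword list
    id_col = "doc_id",                  # column holding the document ids
    text_col = "body",                  # column holding the document text
    min_chars = 3
  )
  print(weights)
}
```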
A data.table with three columns: `class` (derived from the given document ids), `term` and `tfIdf`.
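
Because the result is a long table with one row per document and term, a common follow-up is to keep only the highest-scoring terms of each document. The sketch below shows one way to do that with data.table; the `res` table is a hypothetical stand-in with the same three columns and made-up scores, not real output.

```r
library(data.table)

# Hypothetical stand-in for the returned table: same columns, made-up scores.
res <- data.table(
  class = c("a", "a", "a", "b", "b"),
  term  = c("engineer", "design", "2d", "zoology", "animal"),
  tfIdf = c(0.10, 0.30, 0.20, 0.40, 0.05)
)

# Sort by descending score, then keep the first two rows of each document.
top_terms <- res[order(-tfIdf), head(.SD, 2), by = class]
print(top_terms)
```

The same expression should work on the actual return value as long as the column names match those described above.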
```r
library(data.table)
corpus <- copy(occupations_bundle)
invisible(corpus[, text := paste(preferredLabel, altLabels)])
invisible(corpus[, text := cleansing_corpus(text)])
corpus <- corpus[ , .(conceptUri, text)]
setnames(corpus, c("id", "text"))
tf_idf(corpus)
#>                                       class            term      tfIdf
#>     1: 1d8f8111-79dd-41dc-aa2a-12f3192dde3c              2d 0.06854643
#>     2: 364d9e9f-b14c-4905-b9e5-5de623dba268              2d 0.07274106
#>     3: 48c8eccb-4bd0-487e-a187-4953cbbe956e              2d 0.08415213
#>     4: 951b9d94-9bf1-4771-877d-3a26b75d9e53              2d 0.08537603
#>     5: eedbf8b1-9432-4f42-ab7f-db5cd3eb2a05             360 0.10033651
#>    ---
#> 34672: ff4c28e2-66b1-4ae7-9053-fba8ec3428be      zoological 0.07271548
#> 34673: eee68e13-8248-43bb-a9fb-959832a6c217       zoologist 0.05222478
#> 34674: ff4c28e2-66b1-4ae7-9053-fba8ec3428be       zoologist 0.07546777
#> 34675: ff4c28e2-66b1-4ae7-9053-fba8ec3428be         zoology 0.12342853
#> 34676: e1fdca51-a0e1-4bea-822f-82a090ece780 zootechnologist 0.14447950
```