
R tm package and Spark/python give different vocabulary size for Document Term Frequency task

I have a csv with a single column, where each row is a text document. All text has been normalized (a rough sketch of this preprocessing is given after the list):

  • all lowercase
  • no punctuation
  • no numbers
  • no more than one whitespace between words
  • no tags (xml, html)
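
For reference, a minimal sketch of how this kind of normalization could be done in R (not part of the original pipeline; raw_docs is a hypothetical character vector holding the raw documents):

normalize <- function(x) {
  x <- gsub("<[^>]+>", " ", x)       # strip xml/html tags
  x <- tolower(x)                    # all lowercase
  x <- gsub("[[:punct:]]", " ", x)   # drop punctuation
  x <- gsub("[[:digit:]]", " ", x)   # drop numbers
  x <- gsub("\\s+", " ", x)          # collapse runs of whitespace
  trimws(x)
}

clean_docs <- normalize(raw_docs)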

I also have an R script which constructs the Document Term Matrix on these documents and does some machine learning analysis. I need to convert this to Spark.

The first step is to produce the Document Term Matrix where, for each term, there is the relative frequency count in the document. The problem is that I am getting a different vocabulary size using R compared to the Spark API or python sklearn (spark and python are consistent in their results).

This is the relevant code for R:

library(RJDBC)
library(Matrix)
library(tm)
library(wordcloud)
library(devtools)
library(lsa)
library(data.table)
library(dplyr)
library(lubridate)

corpus <- read.csv(paste(inputDir, "corpus.csv", sep="/"), stringsAsFactors=FALSE)
DescriptionDocuments<-c(corpus$doc_clean)
DescriptionDocuments <- VCorpus(VectorSource(DescriptionDocuments))
DescriptionDocuments.DTM <- DocumentTermMatrix(DescriptionDocuments, control = list(tolower = FALSE,
                                                                                    stopwords = FALSE,
                                                                                    removeNumbers = FALSE,
                                                                                    removePunctuation = FALSE,
                                                                                    stemming=FALSE))

# VOCABULARY SIZE = 83758
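
(Not part of the original script.) If it helps to double-check, the vocabulary size reported above can be read straight off the tm object, assuming the objects created by the snippet above:

nTerms(DescriptionDocuments.DTM)   # number of distinct terms (83758 here)
dim(DescriptionDocuments.DTM)      # rows = documents, columns = terms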



This is the relevant code in Spark (1.6.0, Scala 2.10):

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer}

var corpus = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "false").load("/path/to/corpus.csv")

// RegexTokenizer splits by default on one or more spaces, which is ok
val rTokenizer = new RegexTokenizer().setInputCol("doc").setOutputCol("words")
val words = rTokenizer.transform(corpus)

val cv = new CountVectorizer().setInputCol("words").setOutputCol("tf")
val cv_model = cv.fit(words)
var dtf = cv_model.transform(words)
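// (not in the original snippet) the fitted vocabulary is exposed as cv_model.vocabulary,
// so its size can be checked with cv_model.vocabulary.length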

// VOCABULARY SIZE = 84290



I've also checked in python sklearn and I got a result consistent with spark:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = pd.read_csv("/path/to/corpus.csv")
docs = corpus.loc[:, "doc"].values

def tokenizer(text):
    return text.split()

cv = CountVectorizer(tokenizer=tokenizer, stop_words=None)
dtf = cv.fit_transform(docs)
print(len(cv.vocabulary_))

# VOCABULARY SIZE = 84290



I don't know the R tm package very well, but it seems to me that it should tokenize on whitespace by default. Does anyone have a hint why I am getting a different vocabulary size?

The reason for the difference is a default option in the creation of a document term matrix. If you check ?termFreq you can find the option wordLengths:

An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.

The default setting of c(3, Inf) removes all words shorter than 3 characters, like "at", "in", "I", etc.

This default is what is causing the difference between tm and spark / python.

See the difference the wordLengths setting makes in the example below.

library(tm)

data("crude")

dtm <- DocumentTermMatrix(crude)
nTerms(dtm)
[1] 1266

dtm2 <- DocumentTermMatrix(crude, control = list(wordLengths = c(1, Inf)))
nTerms(dtm2)
[1] 1305
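
Applied to the DocumentTermMatrix call from the question, the fix should amount to adding wordLengths to the existing control list; a minimal sketch, reusing the variable names from the question:

DescriptionDocuments.DTM <- DocumentTermMatrix(DescriptionDocuments,
                                               control = list(tolower = FALSE,
                                                              stopwords = FALSE,
                                                              removeNumbers = FALSE,
                                                              removePunctuation = FALSE,
                                                              stemming = FALSE,
                                                              wordLengths = c(1, Inf)))

# nTerms(DescriptionDocuments.DTM) should now also count 1- and 2-character terms,
# which should bring the count in line with CountVectorizer's 84290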
