
How can I create a TF-IDF for Text Classification using Spark?

I have a CSV file with the following format:

product_id1,product_title1
product_id2,product_title2
product_id3,product_title3
product_id4,product_title4
product_id5,product_title5
[...]

The product_idX is an integer and the product_titleX is a String, for example:

453478692, Apple iPhone 4 8Go

I'm trying to create the TF-IDF from my file so I can use it for a Naive Bayes Classifier in MLlib.

I am using Spark with Scala so far, following the tutorials I have found on the official page and the Berkeley AmpCamp 3 and 4.

So I'm reading the file:

val file = sc.textFile("offers.csv")

Then I'm mapping it into an RDD[Array[String]]:

val tuples = file.map(line => line.split(",")).cache

and then I'm transforming the arrays into an RDD[(Int, String)] of pairs:

val pairs = tuples.map(line => (line(0).toInt, line(1)))

But I'm stuck here and I don't know how to create the Vector from it to turn it into TF-IDF.

Thanks

To do this myself (using pyspark), I first started by creating two data structures out of the corpus. The first is a key, value structure of

document_id, [token_ids]

The second is an inverted index like

token_id, [document_ids]

I'll call those corpus and inv_index respectively.
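As a rough sketch of how inv_index could be derived from corpus (this assumes corpus is already an RDD of (document_id, [token_ids]) pairs; building corpus itself is discussed further down):

# corpus: (document_id, [token_ids]) -- one record per document
# inv_index: (token_id, [document_ids]) -- one record per distinct token
inv_index = (corpus
             .flatMap(lambda kv: [(token, kv[0]) for token in set(kv[1])])
             .groupByKey()
             .mapValues(list))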

To get tf we need to count the number of occurrences of each token in each document. So

from collections import Counter

def wc_per_row(row):
    # Count how many times each token occurs in one document
    cnt = Counter()
    for word in row:
        cnt[word] += 1
    return list(cnt.items())

tf = corpus.mapValues(wc_per_row)

The df is simply the length of each term's inverted index. From that we can calculate the idf.

df = inv_index.mapValues(len)
num_documents = tf.count()

# At this step you can also apply some filters to make sure to keep
# only terms within a 'good' range of df.
from math import log10
idf = df.map(lambda kv: (kv[0], 1. + log10(num_documents / kv[1]))).collect()

Now we just have to do a join on the term_id:

def calc_tfidf(tf_tuples, idf_tuples):
    # Multiply each term's tf by its idf (idf_tuples is the collected idf list)
    return [(k1, v1 * v2) for (k1, v1) in tf_tuples
            for (k2, v2) in idf_tuples if k1 == k2]

tfidf = tf.mapValues(lambda v: calc_tfidf(v, idf))

This isn't a particularly performant solution, though. Calling collect to bring idf into the driver program so that it's available for the join seems like the wrong thing to do.
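One possible variant that avoids the collect would be to keep the idf as an RDD and do a distributed join on token_id instead; this is only a sketch built on the tf and df RDDs above, and the names idf_rdd, tf_flat and tfidf_joined are just illustrative:

idf_rdd = df.mapValues(lambda v: 1. + log10(num_documents / v))

# Flatten tf into (token_id, (document_id, tf)) records so it can be joined on token_id
tf_flat = tf.flatMap(lambda kv: [(token, (kv[0], cnt)) for token, cnt in kv[1]])

# Join on token_id, multiply tf by idf, then regroup by document_id
tfidf_joined = (tf_flat
                .join(idf_rdd)
                .map(lambda kv: (kv[1][0][0], (kv[0], kv[1][0][1] * kv[1][1])))
                .groupByKey()
                .mapValues(list))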

And of course, it requires first tokenizing and creating a mapping from each unique token in the vocabulary to some token_id.
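A rough sketch of that preprocessing step, assuming an RDD of (product_id, product_title) pairs like the one built in the question, a plain whitespace tokenizer, and illustrative names (titles, token_to_id):

# titles: assumed RDD of (product_id, product_title) pairs
tokenized = titles.mapValues(lambda title: title.lower().split())

# Assign an integer token_id to every distinct token in the vocabulary
token_to_id = dict(
    tokenized.flatMap(lambda kv: kv[1]).distinct().zipWithIndex().collect()
)

# Replace tokens by their ids to obtain the (document_id, [token_ids]) corpus
corpus = tokenized.mapValues(lambda tokens: [token_to_id[t] for t in tokens])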

If anyone can improve on this, I'm very interested.
