
Calculate TF-IDF grouped by column

How can I calculate TF-IDF grouped by a column, rather than over the whole dataframe?

Suppose a dataframe like the one below:

val sample = Seq(
    (1, "A B C D E"),
    (1, "B C D"),
    (1, "B C D E"),
    (2, "B C D F"),
    (2, "A B C"),
    (2, "B C E F G")
  ).toDF("id", "sentences")

In the above sample, IDF should be computed for the sentences with id = 1 using only the first three rows, and likewise for the sentences with id = 2 using only the last three rows. Is this possible with Spark ML's TF-IDF implementation?

Just a rough attempt: you could filter your sequence by id, convert each filtered result to a dataframe and save them in a list, then use a loop to apply TF-IDF to each dataframe in the list.

var filters = List[org.apache.spark.sql.DataFrame]()
val mySeq = Seq((1, "A B C D E"), (1, "B C D"), (1, "B C D E"), (2, "B C D F"), (2, "A B C"), (2, "B C E F G"))
for (i <- List(1, 2)) {
  filters = filters :+ mySeq.filter { case (id, _) => id == i }.toDF("id", "sentences")
}

So, for example, you have:

scala> filters(0).show()
+---+---------+
| id|sentences|
+---+---------+
|  1|A B C D E|
|  1|    B C D|
|  1|  B C D E|
+---+---------+

scala> filters(1).show()
+---+---------+
| id|sentences|
+---+---------+
|  2|  B C D F|
|  2|    A B C|
|  2|B C E F G|
+---+---------+

and you can do your TF-IDF calculation on each dataframe by using a loop or a map.
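The per-group loop could be sketched as below. This is only a sketch: it assumes `filters` is the `List[DataFrame]` built above and that a Spark session with ML libraries is available; the column names and `numFeatures` value are arbitrary choices for illustration.

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.DataFrame

// Shared stages; only IDF is (re-)fit per group so each group's
// document frequencies are computed independently.
val tokenizer = new Tokenizer().setInputCol("sentences").setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

val results: List[DataFrame] = filters.map { df =>
  val featurized = hashingTF.transform(tokenizer.transform(df))
  idf.fit(featurized).transform(featurized) // IDF fit per group, as required
}
```

Each element of `results` then holds the TF-IDF features for one id group.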

You could also use some sort of groupBy, but that operation requires a shuffle, which could hurt performance on a cluster.

You can group the dataframe by id and flatten the corresponding tokenized words prior to the TF-IDF computation. Below is a snippet based on the sample code from the Spark TF-IDF documentation:

val sample = Seq(
  (1, "A B C D E"),
  (1, "B C D"),
  (1, "B C D E"),
  (2, "B C D F"),
  (2, "A B C"),
  (2, "B C E F G")
).toDF("id","sentences")

import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("sentences").setOutputCol("words")
val wordsDF = tokenizer.transform(sample)

def flattenWords = udf( (s: Seq[Seq[String]]) => s.flatMap(identity) )

val groupedDF = wordsDF.groupBy("id").
  agg(flattenWords(collect_list("words")).as("grouped_words"))

val hashingTF = new HashingTF().
  setInputCol("grouped_words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(groupedDF)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)

rescaledData.show
// +---+--------------------+--------------------+--------------------+
// | id|       grouped_words|         rawFeatures|            features|
// +---+--------------------+--------------------+--------------------+
// |  1|[a, b, c, d, e, b...|(20,[1,2,10,14,18...|(20,[1,2,10,14,18...|
// |  2|[b, c, d, f, a, b...|(20,[1,2,8,10,14,...|(20,[1,2,8,10,14,...|
// +---+--------------------+--------------------+--------------------+
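As a side note, on Spark 2.4 or later the `flattenWords` UDF above could likely be replaced by the built-in `flatten` SQL function, which avoids UDF serialization overhead. A sketch, assuming the same `wordsDF` as above:

```scala
import org.apache.spark.sql.functions.{collect_list, flatten}

// Spark 2.4+: built-in flatten collapses Seq[Seq[String]] -> Seq[String]
val groupedDF = wordsDF.groupBy("id")
  .agg(flatten(collect_list("words")).as("grouped_words"))
```

The rest of the pipeline (HashingTF, IDF) stays unchanged.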

