Sentiment analysis in PySpark using a dictionary

Question

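First off, I should say that I'm new to programming. I spent a lot of time reshaping my dataset, but now I'm stuck. The goal is to run sentiment analysis in PySpark over the period 2011-2019.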

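What I want to do is check whether the sentences in Body carry negative or positive sentiment. This data is stored in one data frame. To get an accurate sentiment analysis I will use the Loughran-McDonald sentiment word list, because the text in Body contains some (or many) financial terms. The dictionary, with words and their assigned sentiment, is stored in a second data frame. Each data frame (one with the column 'Body', the other with the LM dictionary) contains thousands of rows (about 80k each).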

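To do the sentiment analysis, I have to go through every row of the first data frame and check whether the sentence stored in its Body column contains any of the words from the second data frame. Since a sentence may contain both negative and positive words, let's say each negative word counts as -1 and each positive word counts as +1. The final result (the sum of the -1 and +1 values of the words found) will be stored in a new column of the first data frame.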

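For example, if the Body of a particular row contains the word abandon, which is marked as negative, the result in the new column should be -1. (In the second df, a number other than 0, here 2009, means the word is assigned to that sentiment column, in this case Negative.) I hope I've described my problem in an understandable way.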

Despite spending a few days looking for a solution, I haven't found an answer that fits my problem :( I'd appreciate any hints.

Current first data frame:

+---+--------------------+--------------------+----+-----+--------+---------+--------+
| Id|        CreationDate|                Body|Year|Month|Day_of_Y|Week_of_Y|Year_adj|
+---+--------------------+--------------------+----+-----+--------+---------+--------+
|  1|2011-08-30 21:12:...|What open source ...|2011|    8|     242|       35|    2011|
|  2|2011-08-30 21:14:...|GPU mining is the...|2011|    8|     242|       35|    2011|
|  8|2011-08-30 21:18:...|I would like to d...|2011|    8|     242|       35|    2011|
|  9|2011-08-30 21:18:...|I didn't get it. ...|2011|    8|     242|       35|    2011|
| 10|2011-08-30 21:19:...|Poclbm: An open s...|2011|    8|     242|       35|    2011|
+---+--------------------+--------------------+----+-----+--------+---------+--------+

Second data frame (Loughran-McDonald dictionary):

+---------+--------+--------+-----------+---------+------------+-----------+-----------+-----+
|     Word|Negative|Positive|Uncertainty|Litigious|Constraining|Superfluous|Interesting|Modal|
+---------+--------+--------+-----------+---------+------------+-----------+-----------+-----+
| aardvark|       0|       0|          0|        0|           0|          0|          0|    0|
| abalones|       0|       0|          0|        0|           0|          0|          0|    0|
|  abandon|    2009|       0|          0|        0|           0|          0|          0|    0|
+---------+--------+--------+-----------+---------+------------+-----------+-----------+-----+

Answer 1

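One way to do this (I'm not sure it's the most performant) is to build an actual Python dictionary from your sentiment dictionary and apply it in a user-defined function (UDF). Given that your sentiment dictionary has about 80k rows, this should be feasible. You can speed things up further by removing the neutral words first.
The code outline looks like this: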

from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType

# Drop neutral words (a non-zero value marks membership in a sentiment column)
filtered_sentiment_df = sentiment_df.filter((f.col("Negative") > 0) | (f.col("Positive") > 0))
# The following assumes that no word is both positive and negative
sentiments = filtered_sentiment_df.select(
    f.lower(f.col("Word")).alias("word"),
    f.when(f.col("Negative") > 0, -1).otherwise(1).alias("sentiment"),
)

# Collect the (small) lookup table to the driver as a plain Python dict
sentiment_dict = {row["word"]: row["sentiment"] for row in sentiments.collect()}

def calculate_sentiment_score(sentence, sentiment_dict):
    # Null bodies score 0; tokens are lower-cased to match the dictionary keys
    if sentence is None:
        return 0
    return sum(sentiment_dict.get(w, 0) for w in sentence.lower().split())

# Declare the return type; without it the UDF returns strings
sentiment_udf = f.udf(lambda x: calculate_sentiment_score(x, sentiment_dict), IntegerType())
bodies_df = bodies_df.withColumn("total_sentiment", sentiment_udf(f.col("Body")))
bodies_df.show()
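
One caveat: a plain split leaves punctuation attached to words, so a token like "abandon." would miss the dictionary entry for abandon. Below is a minimal sketch of a more tolerant scoring function in plain Python, so it can be checked without a Spark session. The regex tokenizer, the name score_sentence, and the tiny sample dictionary are assumptions for illustration; only abandon comes from the question, the other entries are made up.

```python
import re

# Hypothetical sample of the collected lookup: word -> -1 (negative) or +1 (positive).
# Only "abandon" appears in the question; "loss" and "gain" are illustrative entries.
sample_sentiment_dict = {"abandon": -1, "loss": -1, "gain": 1}

# Assumption: a word is a run of ASCII letters, optionally with inner apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)*")

def score_sentence(sentence, sentiment_dict):
    """Sum the +1/-1 values of every dictionary word found in the sentence."""
    if sentence is None:
        return 0
    # Lower-case first so "Gain" and "gain" hit the same dictionary key.
    tokens = TOKEN_PATTERN.findall(sentence.lower())
    return sum(sentiment_dict.get(token, 0) for token in tokens)

print(score_sentence("We had to abandon the plan.", sample_sentiment_dict))         # -1
print(score_sentence("Gain after gain, despite one loss.", sample_sentiment_dict))  # 1
```

If this behaves the way you want, it could be wrapped in f.udf with an integer return type in place of the scoring function above.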
