Adding a PySpark RDD as a new column of a pyspark.sql.dataframe

Question

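I have a pyspark.sql.dataframe in which each row is a news article. I also have an RDD that holds the words contained in each article. I want to add the RDD of words to my articles DataFrame as a column named "words". I tried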

df.withColumn('words', words_rdd )

but I got the error

AssertionError: col should be Column

The DataFrame looks like this

Articles
the cat and dog ran
we went to the park
today it will rain

but I have 3k news articles.

I applied a function to clean the text, for example to remove stop words, and I have an RDD that looks like this:

[[cat, dog, ran],[we, went, park],[today, will, rain]]

I am trying to get my DataFrame to look like this:

Articles                 Words
the cat and dog ran      [cat, dog, ran]
we went to the park      [we, went, park]
today it will rain       [today, will, rain]

Answer 1

Disclaimer:

Spark DataFrames in general have no strictly defined order. Use at your own risk.

Add an index to the existing DataFrame:

from pyspark.sql.types import *

df_index = spark.createDataFrame(
    df.rdd.zipWithIndex(),
    StructType([StructField("data", df.schema), StructField("id", LongType())])
)

Add an index to the RDD and convert it to a DataFrame:

words_df = spark.createDataFrame(
    words_rdd.zipWithIndex(),
    StructType([
        StructField("words", ArrayType(StringType())),
        StructField("id", LongType())
    ])
)

Join and select the required fields:

df_index.join(words_df, "id").select("data.*", "words")

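Warning

There are other solutions that may work in specific cases but give no guarantees regarding performance and/or correctness. These include:

Using monotonically_increasing_id as the join key - incorrect in the general case.
Using the row_number() window function as the join key - unacceptable performance impact, and generally incorrect if no specific ordering is defined.
Using zip on RDDs - works if and only if both structures have the same data distribution, that is, the same number of partitions and the same number of elements in each partition (it should work in this case).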
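
To see why monotonically_increasing_id is unsafe as a join key, here is a plain-Python model (not Spark code) of the id layout described in the Spark documentation: the partition index goes into the upper bits and the record position within the partition into the lower 33 bits. The function name monotonic_ids and the two partition layouts are hypothetical, chosen only to illustrate the point.

```python
def monotonic_ids(partitions):
    """Model the documented id layout: (partition index << 33) + position in partition."""
    return [
        (part_idx << 33) + pos
        for part_idx, part in enumerate(partitions)
        for pos, _ in enumerate(part)
    ]

# Hypothetical layouts: the same three logical rows, split differently across two partitions.
articles_partitions = [
    ["the cat and dog ran", "we went to the park"],
    ["today it will rain"],
]
words_partitions = [
    [["cat", "dog", "ran"]],
    [["we", "went", "park"], ["today", "will", "rain"]],
]

article_ids = monotonic_ids(articles_partitions)
word_ids = monotonic_ids(words_partitions)

print(article_ids)  # [0, 1, 8589934592]
print(word_ids)     # [0, 8589934592, 8589934593]
```

Only the first row receives the same id on both sides, so a join on these ids would drop one article and pair another with the wrong words. zipWithIndex, by contrast, assigns consecutive indices across partitions.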

Note:

In this specific case you shouldn't need an RDD at all. pyspark.ml.feature provides a variety of Transformers that should work well for you.

from pyspark.ml.feature import *
from pyspark.ml import Pipeline

df = spark.createDataFrame(
     ["the cat and dog ran", "we went to the park", "today it will rain"],
         "string"
).toDF("Articles")

Pipeline(stages=[
    RegexTokenizer(inputCol="Articles", outputCol="Tokens"), 
    StopWordsRemover(inputCol="Tokens", outputCol="Words")
]).fit(df).transform(df).show()
# +-------------------+--------------------+---------------+
# |           Articles|              Tokens|          Words|
# +-------------------+--------------------+---------------+
# |the cat and dog ran|[the, cat, and, d...|[cat, dog, ran]|
# |we went to the park|[we, went, to, th...|   [went, park]|
# | today it will rain|[today, it, will,...|  [today, rain]|
# +-------------------+--------------------+---------------+

A list of stop words can be provided through the stopWords parameter of StopWordsRemover, for example:

StopWordsRemover(
    inputCol="Tokens",
    outputCol="Words",
    stopWords=["the", "and", "we", "to", "it"]
)

Answer 2

Why join the RDD to the DataFrame at all? I would rather create a new column directly from "Articles". There are several ways to do it; here are my 5 cents:

from pyspark.sql import Row
from pyspark.sql.context import SQLContext
sqlCtx = SQLContext(sc)    # sc is the sparkcontext

x = [Row(Articles='the cat and dog ran'),Row(Articles='we went to the park'),Row(Articles='today it will rain')]
df = sqlCtx.createDataFrame(x)

# DataFrame has no map() in Spark 2.x and later, so go through the underlying RDD.
df2 = df.rdd.map(lambda x: (x.Articles, x.Articles.split(' '))).toDF(['Articles', 'words'])
df2.show()

You will get the following output:

Articles                 words
the cat and dog ran      [the, cat, and, dog, ran]
we went to the park      [we, went, to, the, park]
today it will rain       [today, it, will, rain]

Let me know if you are trying to achieve something else.

Answer 3

A simple but effective approach is to use a udf. You can:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

df = spark.createDataFrame(
    ["the cat and dog ran", "we went to the park", "today it will rain", None],
    "string"
).toDF("Articles")

# The lambda returns a list of strings, so the udf return type must be an array of strings.
split_words = udf(lambda x: x.split(' ') if x is not None else x, ArrayType(StringType()))
df = df.withColumn('Words', split_words(df['Articles']))

df.show(10,False)
+-------------------+-------------------------+
|Articles           |Words                    |
+-------------------+-------------------------+
|the cat and dog ran|[the, cat, and, dog, ran]|
|we went to the park|[we, went, to, the, park]|
|today it will rain |[today, it, will, rain]  |
|null               |null                     |
+-------------------+-------------------------+

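I added a check for None because it is common to have bad rows in your data. You can easily drop them with dropna, either before or after splitting.

But in my opinion, if you are doing this as a preparation step for text analysis, building a Pipeline is probably in your best interest, as @user9613318 suggests in his answer.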

Answer 4

rdd1 = spark.sparkContext.parallelize([1, 2, 3, 5])
# Apply some transformation to rdd1 (here: flag odd numbers).
rdd2 = rdd1.map(lambda n: bool(n % 2))
# Pair each element of rdd1 with the corresponding element of rdd2.
rdd1.zip(rdd2).collect()
# [(1, True), (2, False), (3, True), (5, True)]

4 answers

Answer 1: score 8, posted 2018-05-11 13:36:34
Answer 2: score 3, posted 2017-02-09 08:32:36
Answer 3: score 2, posted 2018-05-17 09:49:38
Answer 4: score -2, posted 2017-08-03 07:45:32
