
How to convert spark Streaming dataframe column into a Python list

I have a Spark Streaming dataframe like the one below:

+----------------+-------+-----------+-----------------+
|application_name|     id|syntheticid|          Journey|
+----------------+-------+-----------+-----------------+
|            test|     24|   12392234|       Activation|
|            test|     24|   12392234|             LOAD|
+----------------+-------+-----------+-----------------+
  1. How do I convert this to a normal dataframe?

  2. How do I convert the streaming dataframe column to a list? For example, I want to convert the column Journey into the Python list ['Activation', 'Load'].

Any help would be appreciated.

With the limited description that you provided, one option could be to first save the file as CSV with pipe (|) as the delimiter, and then read it back using pandas.read_csv(). Suppose that you have saved the file as "streaming.csv". Then you can get your Python list as:

    import pandas as pd
    # Read the pipe-delimited file back and pull the column out as a list
    df = pd.read_csv("streaming.csv", sep="|")
    journey = df['Journey'].tolist()
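Note that pandas cannot read from the stream directly, so the streaming DataFrame would first have to be written out as pipe-delimited CSV files. A minimal sketch of that first step, assuming your streaming DataFrame is named sdf and using placeholder output and checkpoint paths:

    # Write the streaming DataFrame to pipe-delimited CSV files.
    # "sdf" and both paths are placeholders for illustration.
    query = (sdf.writeStream
        .format("csv")
        .option("sep", "|")
        .option("header", "true")
        .option("path", "/tmp/streaming_csv")
        .option("checkpointLocation", "/tmp/streaming_csv_chk")
        .start())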

Consider selecting this as "working solution" if it solves the purpose.

In terms of your first point, you're not asking the correct question. Since Spark 2.0 the APIs mostly overlap, so a Spark Streaming DataFrame is essentially the same thing as a Spark (SQL) DataFrame, albeit a Spark Streaming DataFrame is unbounded.

Since Spark 2.0, DataFrames and Datasets can represent static, bounded data, as well as streaming, unbounded data.

Therefore, you should be able to perform the majority of your necessary manipulations on your (streaming) DataFrame.
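For instance, column-level operations work on a streaming DataFrame just as they do on a static one. A minimal sketch using the built-in rate source (the source and the derived column here are only for illustration):

    from pyspark.sql import SparkSession, functions as f

    spark = SparkSession.builder.getOrCreate()

    # The built-in "rate" source emits (timestamp, value) rows, which is
    # handy for trying out streaming transformations locally.
    stream_df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

    # Ordinary DataFrame operations apply to the unbounded stream as-is.
    transformed = stream_df.filter(f.col("value") % 2 == 0) \
                           .withColumn("doubled", f.col("value") * 2)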

In terms of your second point, have a look at aggregation functions such as collect_list() and collect_set(). Try this code:

>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import functions as f

>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.sparkContext.parallelize([
...     ["test", "24", "12392234", "Activation"],
...     ["test", "24", "12392234", "Load"]
... ]).toDF(["application_name", "id", "syntheticid", "journey"])

>>> df.show()
+----------------+---+-----------+----------+
|application_name| id|syntheticid|   journey|
+----------------+---+-----------+----------+
|            test| 24|   12392234|Activation|
|            test| 24|   12392234|      Load|
+----------------+---+-----------+----------+


>>> grouped_df = df.groupBy('application_name').agg(f.collect_list('journey').alias('collection'))
>>> grouped_df.show()
+----------------+------------------+
|application_name|        collection|
+----------------+------------------+
|            test|[Activation, Load]|
+----------------+------------------+


>>> python_list = [item for sublist in [row.collection for row in grouped_df.collect()] for item in sublist]

>>> python_list
['Activation', 'Load']
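One caveat: collect() above works because the example DataFrame is static. On an actual streaming DataFrame, collect() is not supported; a common workaround is foreachBatch, which hands each micro-batch to a function as a static DataFrame. A rough sketch (stream_df and the print sink are placeholders):

    # collect() is unsupported on streaming DataFrames, but inside
    # foreachBatch each micro-batch arrives as a static DataFrame,
    # so the groupBy/collect pattern above works per batch.
    def handle_batch(batch_df, batch_id):
        grouped = batch_df.groupBy("application_name") \
                          .agg(f.collect_list("journey").alias("collection"))
        for row in grouped.collect():
            print(batch_id, row.collection)

    # "stream_df" is a placeholder for your streaming DataFrame.
    query = stream_df.writeStream.foreachBatch(handle_batch).start()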
