
How to loop through each row of dataFrame in pyspark

For example:

sqlContext = SQLContext(sc)

sample = sqlContext.sql("select Name, age, city from user")
sample.show()

The above statement prints the entire table on the terminal, but I want to access each row in that table using for or while to perform further calculations.

You simply cannot. DataFrames, like other distributed data structures, are not iterable and can only be accessed through dedicated higher-order functions and/or SQL methods.
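For example, a minimal sketch of staying inside the DataFrame/SQL API (the column names follow the question's user table; the filter and the derived column are made up for illustration):

from pyspark.sql import functions as F

# Transformations in the DataFrame API run in parallel on the cluster,
# so no driver-side Python loop is needed.
adults = (sample
          .filter(F.col("age") >= 18)
          .withColumn("greeting", F.concat(F.lit("Hi "), F.col("Name"))))
adults.show()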

You can of course collect:

for row in df.rdd.collect():
    do_something(row)

or convert to a local iterator with toLocalIterator:

for row in df.rdd.toLocalIterator():
    do_something(row)

and iterate locally as shown above, but it defeats the whole purpose of using Spark.

To "loop" and take advantage of Spark's parallel computation framework, you could define a custom function and use map.要“循环”并利用 Spark 的并行计算框架,您可以定义自定义函数并使用 map。

def customFunction(row):
    return (row.name, row.age, row.city)

sample2 = sample.rdd.map(customFunction)

or

sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))

The custom function would then be applied to every row of the dataframe. Note that sample2 will be an RDD, not a dataframe.
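If you need a DataFrame again afterwards, the mapped RDD can be converted back, for example (a sketch, assuming the three-column tuples produced above and an active SparkSession):

# Turn the RDD of (name, age, city) tuples back into a DataFrame with named columns.
sample2_df = sample2.toDF(["name", "age", "city"])
sample2_df.show()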

Map may be needed if you are going to perform more complex computations. If you just need to add a simple derived column, you can use withColumn, which returns a dataframe.

sample3 = sample.withColumn('age2', sample.age + 2)
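If the per-row logic is more involved than simple arithmetic but you still want a DataFrame back, a user-defined function is another option. A sketch, where the age threshold is just an example:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Wrap arbitrary Python row logic in a UDF so it can be applied as a column expression.
age_group = F.udf(lambda age: "minor" if age < 18 else "adult", StringType())

sample4 = sample.withColumn("age_group", age_group(sample.age))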

Using list comprehensions in Python, you can collect an entire column of values into a list using just two lines:

df = sqlContext.sql("show tables in default")
tableList = [x["tableName"] for x in df.rdd.collect()]

In the above example, we return a list of the tables in the database 'default', but the same approach can be adapted by replacing the query used in sql().

Or, more concisely:

tableList = [x["tableName"] for x in sqlContext.sql("show tables in default").rdd.collect()]

And for your example of three columns, we can create a list of dictionaries, and then iterate through them in a for loop:

sql_text = "select name, age, city from user"
tupleList = [{name:x["name"], age:x["age"], city:x["city"]} 
             for x in sqlContext.sql(sql_text).rdd.collect()]
for row in tupleList:
    print("{} is a {} year old from {}".format(
        row["name"],
        row["age"],
        row["city"]))

Give it a try like this:

result = spark.createDataFrame([('SpeciesId', 'int'), ('SpeciesName', 'string')], ["col_name", "data_type"])
for f in result.collect():
    print(f.col_name)

If you want to do something to each row in a DataFrame object, use map. This will allow you to perform further calculations on each row. It's the equivalent of looping across the entire dataset from 0 to len(dataset)-1.

Note that this will return a PipelinedRDD, not a DataFrame.
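A minimal sketch of that pattern, assuming lowercase name, age and city columns as in the earlier answers:

# map() applies the function to every Row and returns an RDD, not a DataFrame.
def add_one_year(row):
    return (row.name, row.age + 1, row.city)

result_rdd = sample.rdd.map(add_one_year)
print(result_rdd.take(5))  # pull a few rows back to the driver to inspect them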

It might not be the best practice, but you can simply target a specific column using collect(), export it as a list of Rows, and loop through the list.

Assume this is your df:

+----------+----------+-------------------+-----------+-----------+------------------+
|      Date|  New_Date|      New_Timestamp|date_sub_10|date_add_10|time_diff_from_now|
+----------+----------+-------------------+-----------+-----------+------------------+
|2020-09-23|2020-09-23|2020-09-23 00:00:00| 2020-09-13| 2020-10-03|             51148|
|2020-09-24|2020-09-24|2020-09-24 00:00:00| 2020-09-14| 2020-10-04|            -35252|
|2020-01-25|2020-01-25|2020-01-25 00:00:00| 2020-01-15| 2020-02-04|          20963548|
|2020-01-11|2020-01-11|2020-01-11 00:00:00| 2020-01-01| 2020-01-21|          22173148|
+----------+----------+-------------------+-----------+-----------+------------------+

To loop through the rows in the Date column:

rows = df3.select('Date').collect()

final_list = []
for i in rows:
    final_list.append(i[0])

print(final_list)
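The same thing can be written as a single list comprehension, which is equivalent:

# Equivalent one-liner: take the first (and only) field out of each collected Row.
final_list = [row[0] for row in df3.select('Date').collect()]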

In the answer above, the line

tupleList = [{name:x["name"], age:x["age"], city:x["city"]} 

should be

tupleList = [{'name':x["name"], 'age':x["age"], 'city':x["city"]} 

since name, age, and city are not variables here but simply keys of the dictionary.

I'm not sure whether this was impossible at the time of writing, but there are now several ways to iterate through a Spark DataFrame; see the documentation here: https://sparkbyexamples.com/pyspark/pyspark-loop-iterate-through-rows-in-dataframe/
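One pattern usually covered there, for completeness, is DataFrame.foreach, which runs a function on every row on the executors. A sketch, assuming the question's sample DataFrame (note that any print output lands in the executor logs, not on the driver):

# foreach() executes the function on each Row in parallel on the workers.
def handle_row(row):
    print(row)  # e.g. Row(Name='Alice', age=30, city='...')

sample.foreach(handle_row)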
