
Iterating each row of Data Frame using pySpark

I need to iterate over a dataframe using pySpark, just like we can iterate over a set of values using a for loop. Below is the code I have written. The problems with this code are:

  1. I have to use collect, which breaks the parallelism.
  2. I am not able to print any values from the DataFrame inside the function funcRowIter.
  3. I cannot break the loop once the match is found.

I have to do it in pySpark and cannot use pandas for this:

from pyspark.sql.functions import *
from pyspark.sql import HiveContext
from pyspark.sql import functions
from pyspark.sql import DataFrameWriter
from pyspark.sql.readwriter import DataFrameWriter
from pyspark import SparkContext

sc = SparkContext()
hive_context = HiveContext(sc)

tab = hive_context.sql("select * from update_poc.test_table_a")

tab.registerTempTable("tab")
print type(tab)

df = tab.rdd

def funcRowIter(rows):
    print type(rows)
    if(rows.id == "1"):
        return 1

df_1 = df.map(funcRowIter).collect()
print df_1

It seems that your goal is to display a specific row. You could use .filter and then a .collect.

For instance,

row_1 = df.filter(lambda x: x.id == "1").collect()  # df is the RDD (tab.rdd) from the question

However, it won't be efficient to iterate over your dataframe this way.
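If the goal is only to display the matching rows, a filter at the DataFrame level avoids dropping down to the RDD and collecting at all. This is only a minimal sketch of that idea, assuming tab is the DataFrame loaded from the Hive table above:

from pyspark.sql.functions import col

# Filter on the DataFrame itself instead of iterating its rows;
# show() only brings a small sample back to the driver.
tab.filter(col("id") == "1").show()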

Instead of using df_1 = df.map(funcRowIter).collect(), you should try a UDF. Hope this will help.

from pyspark.sql.functions import struct, udf
from pyspark.sql.types import StringType

def funcRowIter(row):
    # row is the struct of all columns built below
    print type(row)
    if row is not None and row.id is not None:
        if row.id == "1":
            return "1"

# The UDF returns a string (or None), so StringType is the declared return type
A = udf(funcRowIter, StringType())
# df here must be the DataFrame (tab), not the RDD from the question
z = df.withColumn("data_id", A(struct([df[x] for x in df.columns])))
z.show()
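Passing struct([df[x] for x in df.columns]) hands the whole row to the UDF as a single struct column, so the function can inspect any field of each row while Spark keeps the work distributed instead of collecting the data to the driver.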

collect() will never be a good option for very big data, i.e. millions of records.
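If the only question is whether a matching row exists at all (so there is nothing left to "break" out of), a sketch using head(1) on the filtered DataFrame, again assuming the tab DataFrame from the question, could look like this:

from pyspark.sql.functions import col

# head(1) returns a list with at most one Row, so nothing close to the
# full result set is brought back to the driver.
match_exists = len(tab.filter(col("id") == "1").head(1)) > 0
print match_exists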
