Iterating each row of Data Frame using pySpark
I need to iterate over a dataframe using pySpark, just like we can iterate over a set of values using a for loop. Below is the code I have written. The problem with this code is in funcRowIter. I have to do it in pySpark and cannot use pandas for this:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hive_context = HiveContext(sc)
tab = hive_context.sql("select * from update_poc.test_table_a")
tab.registerTempTable("tab")
print(type(tab))

df = tab.rdd  # the underlying RDD of Row objects

def funcRowIter(rows):
    # Called once per Row; returns 1 for a match and None otherwise
    print(type(rows))
    if rows.id == "1":
        return 1

df_1 = df.map(funcRowIter).collect()
print(df_1)
It seems that your goal is to display a specific row. You could use .filter and then .collect. For instance,
row_1 = rdd.filter(lambda x: x.id==1).collect()
However, it won't be efficient to try to iterate over your dataframe this way.
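To see why filter fits this goal better than the map in the question: map returns exactly one output per input row, so every non-matching row comes back as None, while filter keeps only the matches. Here is a minimal plain-Python sketch of that difference. It uses collections.namedtuple as a stand-in for Spark's Row, and the sample rows are made up for illustration:

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row; the sample data is made up for illustration.
Row = namedtuple("Row", ["id", "name"])
rows = [Row("1", "a"), Row("2", "b"), Row("3", "c")]

def func_row_iter(row):
    # Same logic as funcRowIter in the question: 1 for a match, None otherwise.
    if row.id == "1":
        return 1

# map produces exactly one output per input row, so misses show up as None.
mapped = list(map(func_row_iter, rows))

# filter keeps only the rows for which the predicate is true.
filtered = list(filter(lambda r: r.id == "1", rows))

print(mapped)
print(filtered)
```

Note that the comparison value has to match the column's type: in Python, "1" == 1 is False, so a predicate like x.id==1 matches nothing if id holds strings.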
Instead of using df_1 = df.map(funcRowIter).collect(), you should try a UDF. Hope this helps.
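from pyspark.sql.functions import struct, udf
from pyspark.sql.types import IntegerType

def funcRowIter(row):
    # Return 1 when the row's id is "1"; otherwise return None (null)
    print(type(row))
    if row is not None and row.id is not None:
        if row.id == "1":
            return 1
    return None

# The declared return type must match what the function returns (an int here)
A = udf(funcRowIter, IntegerType())
# withColumn needs a DataFrame, so use tab here rather than the RDD (tab.rdd)
z = tab.withColumn("data_id", A(struct([tab[x] for x in tab.columns])))
z.show()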
collect() will never be a good option for very big data, i.e. millions of records.
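If some rows really do have to reach the driver, there are gentler options than a full collect(). A rough sketch, assuming df is a DataFrame (not the RDD from the question) on Spark 2.0 or later. The helper names preview_matches and iter_rows are made up for illustration:

```python
def preview_matches(df, wanted_id, max_rows=20):
    # Filter on the cluster first, then bring back at most max_rows rows.
    return df.filter(df["id"] == wanted_id).take(max_rows)

def iter_rows(df):
    # toLocalIterator() fetches one partition at a time, so the driver needs
    # memory for roughly one partition rather than for the whole dataset.
    for row in df.toLocalIterator():
        yield row
```

With the table from the question, usage would look like preview_matches(tab, "1") or for row in iter_rows(tab): ... on the driver.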