
How to iterate through rows of a column of an unknown data-frame in pyspark

I am new to Data Science and I am working on a simple self-project using Google Colab. I took data from a something.csv file, and the file's columns are encrypted with ####, so I don't know the names of the columns.

Here is my attempt to solve it using pyspark:

df = spark.read.csv('something.csv', header=True)
col = df[df.columns[len(df.columns)-1]] #Taking last column of data-frame

Now I want to iterate through the rows of the 'col' column and print the rows that have a number less than 100. I searched other stackoverflow posts but didn't understand how to iterate through a column with no name.

In pyspark, use the .filter method on the dataframe to filter records < 100.

# sample data; the po column is an int
df.show()
#+---+----+---+
#| id|name| po|
#+---+----+---+
#|  1|   2|300|
#|  2|   1| 50|
#+---+----+---+

last_col = df[df.columns[-1]]  # reference the last column without knowing its name

from pyspark.sql.functions import *

df.filter(last_col < 100).show()
#+---+----+---+
#| id|name| po|
#+---+----+---+
#|  2|   1| 50|
#+---+----+---+
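
Because the column name is unknown, you can also refer to it by its name string taken from df.columns; a minimal equivalent sketch using pyspark's col() function (assuming the same df as above):

from pyspark.sql.functions import col

# df.columns[-1] is the name of the last column, whatever it happens to be
df.filter(col(df.columns[-1]) < 100).show()
#+---+----+---+
#| id|name| po|
#+---+----+---+
#|  2|   1| 50|
#+---+----+---+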

UPDATE:

# collect the matching last-column values into a list
lst=df.filter(last_col < 100).select(last_col).rdd.flatMap(lambda x:x)
lst.collect()
#[50]

To get all the values of the matching rows into a flat list:
lst=df.filter(last_col < 100).rdd.flatMap(lambda x:x)
lst.collect()
#[u'2', u'1', 50]
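
Putting the pieces together for the original something.csv case, an end-to-end sketch might look like this (assuming a running SparkSession named spark and that the last column is numeric; inferSchema=True keeps Spark from reading it back as a string):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# inferSchema=True so numeric columns are not read back as strings
df = spark.read.csv('something.csv', header=True, inferSchema=True)

last_col = df[df.columns[-1]]        # last column, name unknown
small = df.filter(last_col < 100)    # rows whose last-column value is below 100

# bring only the matching values back to the driver and print them
for value in small.select(last_col).rdd.flatMap(lambda x: x).collect():
    print(value)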
