How to iterate over the rows of a column in a PySpark DataFrame with unknown column names

Question

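I'm new to data science and am working on a simple personal project in Google Colab. I read data from a something.csv file, and the file's columns are encrypted with ####, so I don't know the column names.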

Here is my attempt to solve it with PySpark:

df = spark.read.csv('something.csv', header=True)
col = df[df.columns[len(df.columns)-1]] #Taking last column of data-frame

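Now I want to iterate over the rows of the "col" column and print the rows whose value is less than 100. I searched other Stack Overflow posts but could not work out how to iterate over a column that has no name.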

Answer 1

In PySpark, use the DataFrame's .filter method to keep only the records whose value is less than 100.

# sample data; the po column is an int
df.show()
#+---+----+---+
#| id|name| po|
#+---+----+---+
#|  1|   2|300|
#|  2|   1| 50|
#+---+----+---+

# select the last column by position, so its name does not need to be known
last_col = df[df.columns[-1]]

# keep only the rows where the last column is below 100
df.filter(last_col < 100).show()
#+---+----+---+
#| id|name| po|
#+---+----+---+
#|  2|   1| 50|
#+---+----+---+

UPDATE:

# collect the values of the last column that are below 100 into a list
lst = df.filter(last_col < 100).select(last_col).rdd.flatMap(lambda x: x)
lst.collect()
#[50]

To flatten every column of the matching rows into a single list:
lst = df.filter(last_col < 100).rdd.flatMap(lambda x: x)
lst.collect()
#[u'2', u'1', 50]
