Filter a PySpark DataFrame by a list of strings

Question

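I am very new to PySpark and hope I can get an answer here. I need an answer that uses the DataFrame API.

My problem is to count the lines in the text file test.txt that contain the word "testA", "testB", or "testC".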

lines=spark.read.text("C:\test.txt")
listStr=["testA","testB","testC"]

lines.filter(lines.isin(listStr)).count()  --> this is showing all the lines in the textfile

PS: A solution that uses a lambda would be even better.

Answer 1

To find all rows that contain one of the strings in the list, you can use the rlike method. For example, given this DataFrame:

+----------+
|     value|
+----------+
|      text|
|text testA|
|text testB|
|text testC|
|      text|
+----------+

from pyspark.sql import functions as F

listStr=["testA","testB","testC"]
lines.filter(F.col('value').rlike('|'.join(listStr))).show()

Output:

+----------+
|     value|
+----------+
|text testA|
|text testB|
|text testC|
+----------+
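
One caveat with the rlike approach: the joined string is interpreted as a regular expression, so a search term containing characters such as . or ( would act as a pattern rather than as literal text. Below is a minimal sketch of one way to guard against that. The helper names build_rlike_pattern and count_matching_lines are hypothetical, and the sketch assumes that the output of Python's re.escape is also accepted by the Java regex engine Spark uses, which should hold for ordinary punctuation.

```python
import re


def build_rlike_pattern(words):
    """Join the search terms into one alternation, escaping regex metacharacters."""
    return "|".join(re.escape(w) for w in words)


def count_matching_lines(lines_df, words):
    """Count rows of lines_df whose 'value' column contains any of the words.

    pyspark is imported lazily so the pattern helper can be used without Spark.
    """
    from pyspark.sql import functions as F

    return lines_df.filter(F.col("value").rlike(build_rlike_pattern(words))).count()


listStr = ["testA", "testB", "testC"]
pattern = build_rlike_pattern(listStr)  # "testA|testB|testC"
```

With the lines DataFrame from the question, count_matching_lines(lines, listStr) would return the requested count.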

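Your solution does not work because the isin method tests whether the cell value is equal to one of the values in the list. You can only call this method on Column objects (in PySpark 3); otherwise you get an AttributeError. It would work with the following DataFrame: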

+-----+
|value|
+-----+
| text|
|testA|
|testB|
|testC|
| text|
+-----+

listStr=["testA","testB","testC"]
lines.filter(F.col('value').isin(*listStr)).show()

Output:

+-----+
|value|
+-----+
|testA|
|testB|
|testC|
+-----+
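
To make the difference concrete, here is a plain-Python analogy that needs no Spark. It is only an illustration of the two matching rules, with made-up sample rows, not PySpark code: isin behaves like an equality test on the whole cell value, while rlike behaves like a substring search.

```python
rows = ["text", "text testA", "testB", "text testC"]
listStr = ["testA", "testB", "testC"]

# isin-style check: the whole cell value must equal one of the list entries.
equal_matches = [r for r in rows if r in listStr]

# rlike-style check: the cell value only needs to contain a list entry.
substring_matches = [r for r in rows if any(s in r for s in listStr)]

print(equal_matches)      # ['testB']
print(substring_matches)  # ['text testA', 'testB', 'text testC']
```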

Answer 2

You can also use like:

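from functools import reduce
from pyspark.sql import functions as F

# OR together one like condition per word, then count the matching rows
lines.filter(
    reduce(lambda a, b: a | b, [F.col("value").like(f"%{word}%") for word in listStr])
).count()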
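
Note that like treats % and _ inside a search term as wildcards. If the terms should be matched literally, Column.contains is one alternative. The following is a sketch of that variation, not the answerer's code. The helper names any_of and count_lines_containing are hypothetical, and the DataFrame is assumed to have a string column named value, as spark.read.text produces.

```python
from functools import reduce
from operator import or_


def any_of(conditions):
    """OR together a non-empty list of conditions (Spark Columns, or anything supporting |)."""
    conditions = list(conditions)
    if not conditions:
        # reduce() without an initial value fails on an empty sequence
        raise ValueError("need at least one condition")
    return reduce(or_, conditions)


def count_lines_containing(lines_df, words):
    """Count rows whose 'value' column contains any word as a literal substring."""
    from pyspark.sql import functions as F  # imported lazily; requires pyspark

    return lines_df.filter(any_of(F.col("value").contains(w) for w in words)).count()
```

Usage would be count_lines_containing(lines, listStr).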

Answer 3

If you want to use a lambda function, you can use the RDD API:

lines.rdd.filter(lambda r: any(s in r[0] for s in listStr)).count()
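
If the lambda should stay inside the DataFrame API rather than dropping down to the RDD, a Python UDF is one option. This is a sketch under assumptions, not part of the original answer: contains_any and count_with_udf are hypothetical names, and the column is assumed to be called value. Built-in column functions such as rlike are generally faster than Python UDFs, because a UDF has to pass each row to a Python worker.

```python
def contains_any(text, words):
    """Return True if text contains any of the words; None (a null cell) never matches."""
    return text is not None and any(w in text for w in words)


def count_with_udf(lines_df, words):
    """Count matching rows using a Python UDF on the DataFrame's 'value' column."""
    from pyspark.sql import functions as F  # imported lazily; requires pyspark
    from pyspark.sql.types import BooleanType

    matches = F.udf(lambda s: contains_any(s, words), BooleanType())
    return lines_df.filter(matches(F.col("value"))).count()
```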

3 solutions

Solution 1
1 2021-02-05 07:38:30

Solution 2
1 2021-02-05 08:34:50

Solution 3
1 Accepted 2021-02-05 08:56:05
