
How to run Regex in Python on a Dataframe in Apache Spark

I'm trying to run a regex in Python on a dataframe in Apache Spark.

The df is:

(screenshot of the dataframe omitted)

The regex is as follows:

import re
m = re.search("[Pp]ython", df)  # df is a Spark DataFrame, not a string
print(m)

I'm getting the following error message:

TypeError: expected string or bytes-like object
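The same TypeError can be reproduced with any non-string second argument; an integer stands in here for the DataFrame object:

```python
import re

# re.search requires a str or bytes-like second argument.
# Passing anything else (here an int, standing in for a DataFrame)
# raises the TypeError reported above.
try:
    re.search("[Pp]ython", 123)
    raised = False
except TypeError:
    raised = True

print(raised)  # True
```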

The following works:

import re
m = re.search("[Pp]ython", 'Python python')
print(m)

But I would like the regex to work on a dataframe.

You can use regexp_extract:

from pyspark.sql import functions as F

data = [["Python"],["python"], ["Scala"], ["PYTHON"]]
schema= ["language"]

df = spark.createDataFrame(data, schema)

df = df.withColumn("extracted", F.regexp_extract("language", "[Pp]ython", 0))
df.show()

Result:

+--------+---------+
|language|extracted|
+--------+---------+
|  Python|   Python|
|  python|   python|
|   Scala|         |
|  PYTHON|         |
+--------+---------+

The definition of re.search is:

re.search(pattern, string, flags=0)

Since the second parameter must be a string, this function cannot work on a Spark dataframe directly. However, most patterns that work with re.search will also work with regexp_extract, so testing a pattern with re.search on sample strings first is a reasonable approach.
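That workflow can be sketched in plain Python: validate the pattern with re.search against strings that mirror the column values before wiring it into regexp_extract. The sample values below are taken from the example dataframe above:

```python
import re

pattern = "[Pp]ython"  # the pattern intended for regexp_extract

# Sample values mirroring the "language" column from the answer.
samples = ["Python", "python", "Scala", "PYTHON"]

# re.search returns a match object or None; take the matched text,
# else "" -- matching what regexp_extract returns for non-matches.
extracted = [m.group(0) if (m := re.search(pattern, s)) else ""
             for s in samples]
print(extracted)  # ['Python', 'python', '', '']
```

If the list matches the "extracted" column in the table above, the pattern should behave the same way inside regexp_extract.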
