How to run Regex in Python on a Dataframe in Apache Spark
I'm trying to run a regex in Python on a dataframe in Apache Spark.
The df is
The regex is as follows:
import re
m = re.search("[Pp]ython", df)
print(m)
I'm getting the following error message:
TypeError: expected string or bytes-like object
The following will work:
import re
m = re.search("[Pp]ython", 'Python python')
print(m)
But I would like the regex to work on a dataframe.
You can use regexp_extract:
from pyspark.sql import functions as F
data = [["Python"],["python"], ["Scala"], ["PYTHON"]]
schema= ["language"]
df = spark.createDataFrame(data, schema)
df = df.withColumn("extracted", F.regexp_extract("language", "[Pp]ython", 0))
# Display the resulting dataframe
df.show()
Result:
+--------+---------+
|language|extracted|
+--------+---------+
| Python| Python|
| python| python|
| Scala| |
| PYTHON| |
+--------+---------+
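If you only need to know whether each row matches the pattern (rather than extracting the matched text), the rlike column method is another option. A minimal sketch, assuming the same spark session and df as above:

from pyspark.sql import functions as F

# rlike returns a boolean column, so it can drive a filter directly
only_python = df.filter(F.col("language").rlike("[Pp]ython"))
only_python.show()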
The definition for re.search is
re.search(pattern, string, flags=0)
The second parameter must be a string, so this function cannot work directly on Spark dataframes. However, (at least most of) the patterns that work with re.search will also work with regexp_extract, so testing a pattern with re.search on plain strings first might be a useful approach.
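For example, a pattern could be checked locally with re.search against a few sample strings before reusing the same pattern string with regexp_extract; the sample values below are purely illustrative:

import re

pattern = "[Pp]ython"

# Quick local sanity check of the pattern on plain Python strings
for s in ["Python", "python", "Scala"]:
    print(s, "->", bool(re.search(pattern, s)))

# Once the pattern behaves as expected, reuse the same string in Spark
df = df.withColumn("extracted", F.regexp_extract("language", pattern, 0))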