I'm trying to run a regex in Python on a DataFrame in Apache Spark.
The df is
The regex is as follows:
import re
m = re.search("[Pp]ython", df)
print(m)
I'm getting the following error message:
TypeError: expected string or bytes-like object
The following works:
import re
m = re.search("[Pp]ython", 'Python python')
print(m)
But I would like the regex to work on a DataFrame.
You can use regexp_extract:
from pyspark.sql import functions as F
data = [["Python"],["python"], ["Scala"], ["PYTHON"]]
schema= ["language"]
df = spark.createDataFrame(data, schema)
df = df.withColumn("extracted", F.regexp_extract("language", "[Pp]ython", 0))
df.show()
Result:
+--------+---------+
|language|extracted|
+--------+---------+
| Python| Python|
| python| python|
| Scala| |
| PYTHON| |
+--------+---------+
The definition for re.search is
re.search(pattern, string, flags=0)
Since the second parameter is a string, this function cannot work on a Spark DataFrame. However, most patterns that work with re.search will also work with regexp_extract, so prototyping a pattern with re.search on sample strings first is a reasonable workflow.
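A sketch of that prototyping step, using plain re on the same sample values before moving the pattern into regexp_extract:

```python
import re

# Try the pattern on plain Python strings first.
pattern = "[Pp]ython"
samples = ["Python", "python", "Scala", "PYTHON"]

# re.search returns a Match object on success, None otherwise.
hits = [s for s in samples if re.search(pattern, s)]
print(hits)  # only the lowercase/capitalized variants match
```

Once the pattern behaves as expected here, the same pattern string can be passed to regexp_extract (or rlike) on the DataFrame column.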