PySpark - Regular expression to find URLs and correct address
I have log files that look more or less like the samples below. I want to clean them up and recover the proper form, i.e. the real link.
Does anyone know how to write a regular expression in Py(Spark) to get the desired output?
1:
https%3A%2F%2Fwww.btv.com%2Fnews%2Ffinland%2Fartikel%2F5174938%2Fzwemmer-zoekactie-julianadorp-kinderen-gered
Desired Output
https://www.btv.com/news/finland/artikel/5174938/zwemmer-zoekactie-julianadorp-kinderen-gered
2:
https%3A%2F%2Fwww.weather.com%2F
Desired Output
https://www.weather.com
3:
https%3A%2F%2Fwww.weather.com%2Ffinland%2Fneerslag%2Fweather%2F3uurs
Desired Output
https://www.weather.com/finland/neerslag/weather/3uurs
I have tried a couple of solutions, but without much understanding of them:
\b\w+\b(?!\/)
from pyspark.sql.functions import regexp_extract, col
regexp_extract(column_name, regex, group_number)
regex('(.)(by)(\s+)(\w+)')
Thanks in advance.
You can use urllib.parse.unquote,
and you will need to wrap it in a UDF to use it with PySpark:
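Before wiring it into Spark, the decoding itself can be sanity-checked in plain Python; this standalone sketch just runs `unquote` over one of the sample URLs from the question:

```python
from urllib.parse import unquote

# One of the percent-encoded URLs from the question
encoded = 'https%3A%2F%2Fwww.weather.com%2Ffinland%2Fneerslag%2Fweather%2F3uurs'

# unquote replaces %XX escapes (%3A -> ':', %2F -> '/') with their characters
decoded = unquote(encoded)
print(decoded)  # → https://www.weather.com/finland/neerslag/weather/3uurs
```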
from urllib.parse import unquote
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

# Sample DataFrame with the percent-encoded URLs from the question
df = spark.createDataFrame([['https%3A%2F%2Fwww.btv.com%2Fnews%2Ffinland%2Fartikel%2F5174938%2Fzwemmer-zoekactie-julianadorp-kinderen-gered'],
                            ['https%3A%2F%2Fwww.weather.com%2F'],
                            ['https%3A%2F%2Fwww.weather.com%2Ffinland%2Fneerslag%2Fweather%2F3uurs']], ['url'])

# Wrap unquote in a UDF so it can be applied to a DataFrame column
urldecode_udf = udf(lambda x: unquote(x), StringType())
df = df.withColumn("decodedurl", urldecode_udf(df.url))
df.select('decodedurl').show(3, False)
Output:
+---------------------------------------------------------------------------------------------+
|decodedurl |
+---------------------------------------------------------------------------------------------+
|https://www.btv.com/news/finland/artikel/5174938/zwemmer-zoekactie-julianadorp-kinderen-gered|
|https://www.weather.com/ |
|https://www.weather.com/finland/neerslag/weather/3uurs |
+---------------------------------------------------------------------------------------------+
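One caveat worth knowing, in case the log URLs ever carry form-encoded query strings: `unquote` leaves `+` untouched, while `unquote_plus` additionally turns `+` into a space. Which one is correct depends on how the logger encoded the data. A minimal sketch (the example URL here is hypothetical, not from the question):

```python
from urllib.parse import unquote, unquote_plus

# Hypothetical log entry with a form-encoded query string
encoded = 'https%3A%2F%2Fexample.com%2Fsearch%3Fq%3Dnew+york'

print(unquote(encoded))       # → https://example.com/search?q=new+york
print(unquote_plus(encoded))  # → https://example.com/search?q=new york
```

As a side note, Spark 3.4+ also ships a built-in `url_decode` column function that avoids the Python UDF entirely; if a newer cluster is available, that may be worth checking.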