Using a regular expression to capture only the numbers followed by specific letters in PySpark

Question

I have a DataFrame that looks like this:

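from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract

# Reuses the active session if one exists (e.g. in a notebook), otherwise creates one.
spark = SparkSession.builder.getOrCreate()

data = [
    ("1", "12345 soda bottle 1500ml"),
    ("2", "6789 beer can 450ml"),
    ("3", "beer with no number before 375ml"),
]
columnname = ["id", "product"]
df = spark.createDataFrame(data=data, schema=columnname)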

+---+--------------------------------+
|id |product                         |
+---+--------------------------------+
|1  |12345 soda bottle 1500ml        |
|2  |6789 beer can 450ml             |
|3  |beer with no number before 375ml|
+---+--------------------------------+

I want the volume in another column, but some of the values contain numbers that I don't need.

So far I have tried this:

df = df.withColumn('volume',regexp_extract(col('product'), '([0-9]{3,5}.*ml)', 1)
              ).withColumn('volume_number',regexp_extract(col('volume'), '^[^m]+', 0))

But the result is not what I expected:

+---+--------------------------------+------------------------+----------------------+
|id |product                         |volume                  |volume_number         |
+---+--------------------------------+------------------------+----------------------+
|1  |12345 soda bottle 1500ml        |12345 soda bottle 1500ml|12345 soda bottle 1500|
|2  |6789 beer can 450ml             |6789 beer can 450ml     |6789 beer can 450     |
|3  |beer with no number before 375ml|375ml                   |375                   |
+---+--------------------------------+------------------------+----------------------+

Desired output:

+---+--------------------------------+------------------------+----------------------+
|id |product                         |volume                  |volume_number         |
+---+--------------------------------+------------------------+----------------------+
|1  |12345 soda bottle 1500ml        |1500ml                  |1500                  |
|2  |6789 beer can 450ml             |450ml                   |450                   |
|3  |beer with no number before 375ml|375ml                   |375                   |
+---+--------------------------------+------------------------+----------------------+

Answer 1

You're almost there. Adding a .* before your pattern gets close to what you're looking for, although the four-digit volume in row 1 comes back as 500ml rather than 1500ml.

df = df.withColumn('volume',regexp_extract(col('product'), '.*([0-9]{3,5}.*ml)', 1))
df = df.withColumn('volume_number',regexp_extract(col('volume'), '^[^m]+', 0))
df.show()

+---+--------------------+------+-------------+
| id|             product|volume|volume_number|
+---+--------------------+------+-------------+
|  1|12345 soda bottle...| 500ml|          500|
|  2| 6789 beer can 450ml| 450ml|          450|
|  3|beer with no numb...| 375ml|          375|
+---+--------------------+------+-------------+

Explanation:解释：

.* greedily matches everything before the capture group and backtracks only as far as it has to. This is why 1500ml is reduced to 500ml: three digits are already enough to satisfy {3,5}.
([0-9]{3,5}.*ml) defines the first capture group: 3 to 5 digits, then any number of characters, followed by the literal ml.
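The backtracking behaviour can be checked without a Spark session. The sketch below uses Python's re module as a stand-in for Spark's Java regex engine, on the assumption that the two behave the same for these simple patterns. The second pattern, with a \b word boundary in front of the group, is one possible adjustment and is not part of the original answer.

```python
import re

products = [
    "12345 soda bottle 1500ml",
    "6789 beer can 450ml",
    "beer with no number before 375ml",
]

# Pattern from the answer: the greedy .* gives back as few characters as
# possible, so the group starts at the last position where 3 digits fit.
greedy = re.compile(r".*([0-9]{3,5}.*ml)")
greedy_volumes = [greedy.search(p).group(1) for p in products]

# Hypothetical adjustment: \b forces the group to start at the beginning
# of a run of word characters, so the leading digit is not dropped.
bounded = re.compile(r".*\b([0-9]{3,5}.*ml)")
bounded_volumes = [bounded.search(p).group(1) for p in products]

print(greedy_volumes)   # ['500ml', '450ml', '375ml']
print(bounded_volumes)  # ['1500ml', '450ml', '375ml']
```

If the adjusted pattern suits your data, it should be usable as the pattern argument of regexp_extract in place of the original string.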

Answer 2

df = df.withColumn('volume',regexp_extract(col('product'), '[0-9]+ml', 0))
df = df.withColumn('volume_number',regexp_extract(col('volume'), '^[0-9]+', 0))
df.show()


+---+--------------------+------+-------------+
| id|             product|volume|volume_number|
+---+--------------------+------+-------------+
|  1|12345 soda bottle...|1500ml|         1500|
|  2| 6789 beer can 450ml| 450ml|          450|
|  3|beer with no numb...| 375ml|          375|
+---+--------------------+------+-------------+

[0-9]+ means that at least one digit will be captured (the + symbol means one or more), followed by the literal 'ml'. If you know the column will always end with the volume, you can use [0-9]+ml$, where $ matches the end of the line.
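As a possible simplification, the digits can be pulled out in a single step by wrapping only [0-9]+ in a capture group and asking for group 1; in PySpark that would be regexp_extract(col('product'), '([0-9]+)ml', 1). The sketch below checks both the plain and the $-anchored pattern with Python's re module, again assuming it matches Spark's regex engine for these patterns. The fourth product string is made up for illustration.

```python
import re

products = [
    "12345 soda bottle 1500ml",
    "6789 beer can 450ml",
    "beer with no number before 375ml",
    "330ml can, pack of 6",  # hypothetical row: volume is not at the end
]

# Group 1 holds only the digits; "ml" must follow immediately.
digits_before_ml = re.compile(r"([0-9]+)ml")
# Anchored variant: the volume must be the last thing in the string.
digits_before_ml_at_end = re.compile(r"([0-9]+)ml$")


def first_group(pattern, text):
    """Return group 1 of the first match, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


volume_numbers = [first_group(digits_before_ml, p) for p in products]
anchored_numbers = [first_group(digits_before_ml_at_end, p) for p in products]

print(volume_numbers)    # ['1500', '450', '375', '330']
print(anchored_numbers)  # ['1500', '450', '375', None]
```

Note that where Python returns None for a non-match, Spark's regexp_extract returns an empty string, so the anchored pattern would leave an empty volume for rows like the last one.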
