Capture only the number that is followed by a specific letter in PySpark with regex
I have a dataframe that looks like this:
from pyspark.sql.functions import col, regexp_extract

# Sample data: an id and a free-text product description ending in a volume
data = [
    ("1", "12345 soda bottle 1500ml"),
    ("2", "6789 beer can 450ml"),
    ("3", "beer with no number before 375ml"),
]
columnname = ["id", "product"]
df = spark.createDataFrame(data=data, schema=columnname)
+---+--------------------------------+
|id |product |
+---+--------------------------------+
|1 |12345 soda bottle 1500ml |
|2 |6789 beer can 450ml |
|3 |beer with no number before 375ml|
+---+--------------------------------+
I want the volume in another column, but some of the values have numbers that I don't need.
So far I tried this:
df = df.withColumn('volume',regexp_extract(col('product'), '([0-9]{3,5}.*ml)', 1)
).withColumn('volume_number',regexp_extract(col('volume'), '^[^m]+', 0))
But the result is not what I expected:
+---+--------------------------------+------------------------+----------------------+
|id |product |volume |volume_number |
+---+--------------------------------+------------------------+----------------------+
|1 |12345 soda bottle 1500ml |12345 soda bottle 1500ml|12345 soda bottle 1500|
|2 |6789 beer can 450ml |6789 beer can 450ml |6789 beer can 450 |
|3 |beer with no number before 375ml|375ml |375 |
+---+--------------------------------+------------------------+----------------------+
Desired output:
+---+--------------------------------+------------------------+----------------------+
|id |product |volume |volume_number |
+---+--------------------------------+------------------------+----------------------+
|1 |12345 soda bottle 1500ml |1500ml |1500 |
|2 |6789 beer can 450ml |450ml |450 |
|3 |beer with no number before 375ml|375ml |375 |
+---+--------------------------------+------------------------+----------------------+
You're almost there. Adding a .* before your pattern stops the match from starting at the leading number:
df = df.withColumn('volume',regexp_extract(col('product'), '.*([0-9]{3,5}.*ml)', 1))
df = df.withColumn('volume_number',regexp_extract(col('volume'), '^[^m]+', 0))
df.show()
+---+--------------------+------+-------------+
| id| product|volume|volume_number|
+---+--------------------+------+-------------+
| 1|12345 soda bottle...| 500ml| 500|
| 2| 6789 beer can 450ml| 450ml| 450|
| 3|beer with no numb...| 375ml| 375|
+---+--------------------+------+-------------+
Explanation:
.* matches everything before the first capture group. Because it is greedy, it consumes as many characters as it can, leaving the group with only the minimum three digits; this is why row 1 shows 500ml rather than 1500ml.
([0-9]{3,5}.*ml) defines the first capture group: a run of 3 to 5 digits, then any number of characters, followed by the literal string ml.
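The 500ml in row 1 of the output can be reproduced outside Spark. This is a minimal sketch, assuming Python's re module behaves like Spark's Java regex engine for these simple patterns; the helper extract and the \b (word boundary) variant are illustrative additions, not part of the original answer:

```python
import re

products = [
    "12345 soda bottle 1500ml",
    "6789 beer can 450ml",
    "beer with no number before 375ml",
]

def extract(pattern, text, group=1):
    """Mimic regexp_extract: return the requested group, or '' if no match."""
    m = re.search(pattern, text)
    return m.group(group) if m else ""

# Greedy .* backtracks from the end of the string and stops as soon as
# three digits followed by "ml" fit, so "1500ml" is cut down to "500ml".
greedy = [extract(r".*([0-9]{3,5}.*ml)", p) for p in products]

# A word boundary before the digits forbids starting in the middle of a number.
bounded = [extract(r".*\b([0-9]{3,5}.*ml)", p) for p in products]

print(greedy)
print(bounded)
```

If the \b variant is used in PySpark, the pattern should be written as a raw string (r'...'), since '\b' in an ordinary Python string literal is a backspace character.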
A simpler alternative is to match only the digits that sit directly before ml:
df = df.withColumn('volume',regexp_extract(col('product'), '[0-9]+ml', 0))
df = df.withColumn('volume_number',regexp_extract(col('volume'), '^[0-9]+', 0))
df.show()
+---+--------------------+------+-------------+
| id| product|volume|volume_number|
+---+--------------------+------+-------------+
| 1|12345 soda bottle...|1500ml| 1500|
| 2| 6789 beer can 450ml| 450ml| 450|
| 3|beer with no numb...| 375ml| 375|
+---+--------------------+------+-------------+
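Both columns can also come from a single pattern by wrapping the digits in a capture group. The sketch below illustrates the idea with Python's re module, assuming it matches Spark's regex behavior for this pattern; volume_parts is a hypothetical helper name used only for illustration:

```python
import re

def volume_parts(product):
    """Return (volume, volume_number) from one match, or ('', '') if absent."""
    m = re.search(r"([0-9]+)ml", product)
    if m is None:
        return ("", "")
    # group(0) is the whole match ("1500ml"); group(1) is just the digits.
    return (m.group(0), m.group(1))

for p in ["12345 soda bottle 1500ml", "6789 beer can 450ml"]:
    print(volume_parts(p))
```

In PySpark the equivalent would presumably be regexp_extract(col('product'), '([0-9]+)ml', 0) for volume and the same call with group index 1 for volume_number, which avoids running a second regex over the volume column.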
[0-9]+ means that at least one digit will be captured (the + symbol means one or more), followed by the literal 'ml'. If you know the column will always end with the volume, you can use [0-9]+ml$ where the $ matches the end of the line.
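To see what the $ anchor changes, here is a small sketch using Python's re module, again assuming it agrees with Spark's regex engine here. The second sample string is hypothetical and does not appear in the question's data:

```python
import re

anchored = re.compile(r"[0-9]+ml$")    # volume must end the string
unanchored = re.compile(r"[0-9]+ml")   # volume may appear anywhere

def first_match(regex, text):
    """Return the first match as a string, or '' when nothing matches."""
    m = regex.search(text)
    return m.group(0) if m else ""

samples = [
    "6789 beer can 450ml",   # volume at the end
    "330ml cola 6 pack",     # hypothetical row with the volume at the start
]

for s in samples:
    print(s, "|", first_match(anchored, s), "|", first_match(unanchored, s))
```

The anchored pattern returns an empty string for the second sample, so it is only a safe choice when every row really does end with the volume.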