Capture only the number that is followed by a specific letter in Pyspark with regex

I have a DataFrame that looks like this:

from pyspark.sql.functions import col, regexp_extract

# `spark` is assumed to be an existing SparkSession (e.g. the pyspark shell or a notebook)
data = [
    ("1", "12345 soda bottle 1500ml"),
    ("2", "6789 beer can 450ml"),
    ("3", "beer with no number before 375ml"),
]
columnname = ["id", "product"]
df = spark.createDataFrame(data=data, schema=columnname)
df.show(truncate=False)
+---+--------------------------------+
|id |product                         |
+---+--------------------------------+
|1  |12345 soda bottle 1500ml        |
|2  |6789 beer can 450ml             |
|3  |beer with no number before 375ml|
+---+--------------------------------+

I want the volume in another column, but some of the values contain numbers that I don't need.

So far I have tried this:

df = df.withColumn('volume',regexp_extract(col('product'), '([0-9]{3,5}.*ml)', 1)
              ).withColumn('volume_number',regexp_extract(col('volume'), '^[^m]+', 0))

But the result is not what I expected:

+---+--------------------------------+------------------------+----------------------+
|id |product                         |volume                  |volume_number         |
+---+--------------------------------+------------------------+----------------------+
|1  |12345 soda bottle 1500ml        |12345 soda bottle 1500ml|12345 soda bottle 1500|
|2  |6789 beer can 450ml             |6789 beer can 450ml     |6789 beer can 450     |
|3  |beer with no number before 375ml|375ml                   |375                   |
+---+--------------------------------+------------------------+----------------------+

Desired output:

+---+--------------------------------+------------------------+----------------------+
|id |product                         |volume                  |volume_number         |
+---+--------------------------------+------------------------+----------------------+
|1  |12345 soda bottle 1500ml        |1500ml                  |1500                  |
|2  |6789 beer can 450ml             |450ml                   |450                   |
|3  |beer with no number before 375ml|375ml                   |375                   |
+---+--------------------------------+------------------------+----------------------+

You're almost there. Adding a .* before your pattern gets you most of the way, although the greedy .* leaves only the last three digits for the first row (500ml rather than 1500ml), as the output shows.

df = df.withColumn('volume',regexp_extract(col('product'), '.*([0-9]{3,5}.*ml)', 1))
df = df.withColumn('volume_number',regexp_extract(col('volume'), '^[^m]+', 0))
df.show()

+---+--------------------+------+-------------+
| id|             product|volume|volume_number|
+---+--------------------+------+-------------+
|  1|12345 soda bottle...| 500ml|          500|
|  2| 6789 beer can 450ml| 450ml|          450|
|  3|beer with no numb...| 375ml|          375|
+---+--------------------+------+-------------+

Explanation:

  • .* matches everything before the capture group. It is greedy, so it consumes as many characters as it can while still letting the rest of the pattern match.
  • ([0-9]{3,5}.*ml) defines the first capture group: 3 to 5 digits, followed by any number of characters, followed by the literal string ml.
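
The greedy behaviour described above can be checked outside Spark. This is a minimal sketch using Python's built-in re module as a stand-in for Spark's Java regex engine; the two should agree for patterns this simple, but that is an assumption. The helper name extract_groups is hypothetical, and the "bounded" pattern is one possible fix that is not part of the original answer.

```python
import re

products = [
    "12345 soda bottle 1500ml",
    "6789 beer can 450ml",
    "beer with no number before 375ml",
]

def extract_groups(pattern, text):
    """Return all capture groups of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.groups() if m else None

# Capture the greedy prefix too, to see how much of the string .* swallows.
prefix, volume = extract_groups(r"(.*)([0-9]{3,5}.*ml)", products[0])
print(repr(prefix), repr(volume))

# Possible fix (an assumption, not from the answer): require a non-digit or the
# start of the string right before the digits, so .* cannot split the number.
bounded = [extract_groups(r".*(?:^|[^0-9])([0-9]{3,5}ml)", p)[0] for p in products]
print(bounded)
```

The prefix ends in the digit 1 because .* gives back only as many characters as [0-9]{3,5} needs, which is three.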

A simpler pattern matches only the digits immediately before ml:

df = df.withColumn('volume',regexp_extract(col('product'), '[0-9]+ml', 0))
df = df.withColumn('volume_number',regexp_extract(col('volume'), '^[0-9]+', 0))
df.show()

+---+--------------------+------+-------------+
| id|             product|volume|volume_number|
+---+--------------------+------+-------------+
|  1|12345 soda bottle...|1500ml|         1500|
|  2| 6789 beer can 450ml| 450ml|          450|
|  3|beer with no numb...| 375ml|          375|
+---+--------------------+------+-------------+

[0-9]+ means that at least one digit will be matched (the + symbol means one or more), followed by the literal ml. If you know the column will always end with the volume, you can use [0-9]+ml$, where $ matches the end of the string.
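
To compare the unanchored and anchored patterns, here is a small sketch that again uses Python's re module in place of Spark's regex engine (an assumption). The helper first_match and the row with the volume in the middle are made up for illustration. The last line shows a capture group pulling out the digits in one step; in PySpark the analogous call would be regexp_extract(col('product'), '([0-9]+)ml', 1).

```python
import re

def first_match(pattern, text, idx=0):
    """Return group idx of the first match, or '' when nothing matches."""
    m = re.search(pattern, text)
    return m.group(idx) if m else ""

ends_with_volume = "12345 soda bottle 1500ml"
volume_in_middle = "soda 1500ml bottle"  # hypothetical row, not in the original data

rows = (ends_with_volume, volume_in_middle)

# Unanchored: finds the volume wherever it appears in the string.
unanchored = [first_match(r"[0-9]+ml", s) for s in rows]

# Anchored with $: matches only when the string ends with the volume.
anchored = [first_match(r"[0-9]+ml$", s) for s in rows]

# Capture group: extract just the digits, without a second extraction step.
number_only = first_match(r"([0-9]+)ml", ends_with_volume, idx=1)

print(unanchored)
print(anchored)
print(number_only)
```

The anchored pattern returns an empty string for the second row, so it is only a safe choice when every product string really does end with the volume.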
