Creating a Regex Pattern to Extract Floats and Integers
I am having trouble writing a pattern-matching function that extracts all the numbers from a data frame column and prints them.
I tried to build a regex pattern after going through the DataCamp tutorial and other questions on Stack Overflow, but I have not been able to come up with one that extracts all the numbers and prints them. Essentially, the EA patterns I created, and the HR patterns with floats such as 1.12, are not returning results.
import re
import pandas as pd
data = ['1EA @ 3217.45;', 'ST - .63HR@165;', 'ST - .5HR@123;', 'ST - 1.08HR@165;', '1EA @ 3217.45;', 'ST - .85HR@165;', 'ST - .85HR@165;', '1EA @ 3217.45;', 'ST - .12HR@165;', 'OT - 1.12HR @ 165;', 'ST - .55HR@123;OT - 0.82HR @ 123;', 'ST - .5HR@165;', 'OT - 0.45HR @ 123;', 'ST - .6HR@123;', 'ST - 1.42HR@123;', '1EA @ 1500;', 'ST - .3HR@123;', 'ST - 1HR@111;OT - 0.25HR @ 111;']
Travel = pd.DataFrame(data, columns=['Rate Breakup Description'])
for a in Travel['Rate Breakup Description']:
    print(re.search('.(\d+)HR | (\d+)EA | (\d+)HR | (\d+)EA', a, re.I|re.M))
My objective is a pattern-matching function that extracts all the numbers, regardless of the different string patterns, and prints them in the order they appear.
You may use
Travel['Result'] = Travel['Rate Breakup Description'].str.findall(r'\d*\.?\d+(?=HR|EA)').apply(', '.join)
The pattern will match
\d* - 0+ digits
\.? - an optional .
\d+ - 1+ digits
(?=HR|EA) - followed with HR or EA (a lookahead, so HR or EA is not part of the match).
The .str.findall method returns all matches it finds in an input string, and .apply(', '.join) joins the results with a comma+space.
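To see what the pattern picks up without building a data frame, here is a minimal sketch that uses the standard-library re.findall in place of .str.findall. The helper name quantities and the shortened sample list are only illustrative; the strings themselves are taken from the question's data.

```python
import re

# Same pattern as in the answer: optional integer part, optional dot,
# at least one digit, then a lookahead requiring HR or EA right after.
PATTERN = re.compile(r'\d*\.?\d+(?=HR|EA)')

def quantities(text):
    """Return every number directly followed by HR or EA, joined with ', '."""
    return ', '.join(PATTERN.findall(text))

samples = [
    '1EA @ 3217.45;',
    'ST - .63HR@165;',
    'OT - 1.12HR @ 165;',
    'ST - .55HR@123;OT - 0.82HR @ 123;',
]
for s in samples:
    print(repr(s), '->', quantities(s))
```

Note that numbers after the @ sign (165, 3217.45 and so on) are skipped, because they are not followed by HR or EA.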
If only a single match is expected in each input, you might use an alternative solution:
Travel['Result'] = Travel['Rate Breakup Description'].str.extract(r'(\d*\.?\d+)(?:HR|EA)', expand=False)
Here, (\d*\.?\d+) is a capturing group because of the parentheses, and this is the part returned by .str.extract, while (?:HR|EA) is a non-capturing group (so it is not returned) matching either HR or EA.
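The single-match variant can be tried the same way with re.search, which, like .str.extract, keeps only the first match. This is a rough sketch: first_quantity is a hypothetical helper name, and it returns None where pandas would give NaN for a row with no match.

```python
import re

# Capturing group for the number, non-capturing group for the unit.
PATTERN = re.compile(r'(\d*\.?\d+)(?:HR|EA)')

def first_quantity(text):
    """Return the first number followed by HR or EA, or None if there is none."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

print(first_quantity('ST - 1.08HR@165;'))
print(first_quantity('ST - 1HR@111;OT - 0.25HR @ 111;'))
```

For rows holding two entries, such as 'ST - 1HR@111;OT - 0.25HR @ 111;', only the first quantity comes back, so the findall version is probably the better fit for the sample data.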