How to extract numbers from sentences with certain conditions in Python?
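Here is an example of my input sentence. I want to extract the numbers from sentences where they are followed by mm or cm. This is the regex I tried.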
sen = 'The study reveals a speculated nodule with pleural tagging at anterior basal segment of LLL, measured 1.9x1.4x2.0 cm in size'
re.findall(r'(\d+) cm',sen)
This gives the output
['0']
Then I tried extracting the numbers without any condition
print(re.findall(r'\d+', sen))
This gives the output
['1', '9', '1', '4', '2', '0']
My expected output is
['1.9x1.4x2.0'] or ['1.9', '1.4', '2.0']
This is not a duplicate, because I am also looking for a way to handle the cm/mm unit together with floating-point numbers.
You could use 3 capturing groups to get the numbers, and use a character class to make sure the measurement ends with cm or mm.
(?<!\S)(\d+\.\d+)x(\d+\.\d+)x(\d+\.\d+) [cm]m(?!\S)
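In parts:
(?<!\S) Negative lookbehind: asserts that what is directly to the left is not a non-whitespace character
(\d+\.\d+)x Capture group 1: match 1+ digits with a decimal part, then match x
(\d+\.\d+)x Capture group 2: same as above
(\d+\.\d+) Capture group 3: match 1+ digits with a decimal part
 [cm]m Match a space followed by cm or mm
(?!\S) Negative lookahead: asserts that what is directly to the right is not a non-whitespace character
For example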
import re
regex = r"(?<!\S)(\d+\.\d+)x(\d+\.\d+)x(\d+\.\d+) [cm]m(?!\S)"
test_str = "The study reveals a speculated nodule with pleural tagging at anterior basal segment of LLL, measured 1.9x1.4x2.0 cm in size"
print(re.findall(regex, test_str))
Output
[('1.9', '1.4', '2.0')]
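If numeric values are needed rather than strings, the captured groups can be converted afterwards. A minimal sketch, assuming the same regex and sentence as above (the name dims is only illustrative):

```python
import re

regex = r"(?<!\S)(\d+\.\d+)x(\d+\.\d+)x(\d+\.\d+) [cm]m(?!\S)"
test_str = "The study reveals a speculated nodule with pleural tagging at anterior basal segment of LLL, measured 1.9x1.4x2.0 cm in size"

# Each match is a tuple of three strings; convert every part to float.
dims = [tuple(float(v) for v in match) for match in re.findall(regex, test_str)]
print(dims)  # [(1.9, 1.4, 2.0)]
```

Note that this pattern only matches exactly three decimal values joined by x, so a measurement such as 12 mm or 3x4 cm would not be picked up.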
To get output that includes the x, you could use
(?<!\S)(\d+\.\d+x\d+\.\d+x\d+\.\d+) [cm]m(?!\S)
Output
['1.9x1.4x2.0']
Edit
To match only the values and allow one or more spaces or tabs between the number and the unit, you could use a positive lookahead:
\d+(?:\.\d+)?(?:(?:x\d+(?:\.\d+)?)*)?(?=[ \t]+[cm]m)
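As a rough illustration of how the lookahead pattern above behaves with re.findall, here is a small sketch. The second and third sentences in samples are made up for this example and are not from the question:

```python
import re

# Pattern from the edit above: a number, optional "x<number>" repeats,
# then a lookahead requiring spaces/tabs followed by cm or mm.
pattern = re.compile(r"\d+(?:\.\d+)?(?:(?:x\d+(?:\.\d+)?)*)?(?=[ \t]+[cm]m)")

# Hypothetical test sentences (only the first comes from the question).
samples = [
    "measured 1.9x1.4x2.0 cm in size",
    "a 12 mm nodule and a 3x4   cm mass",
    "seen in 2019, no measurement",
]

results = [pattern.findall(s) for s in samples]
for r in results:
    print(r)
# ['1.9x1.4x2.0']
# ['12', '3x4']
# []
```

Because the pattern has no capturing groups, re.findall returns the whole match, and the unit itself is not part of the result since it is only checked by the lookahead. The lookahead does not check what follows the unit, so text such as 5 mmol would probably match as well; adding \b after [cm]m is one way to tighten that.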
You can use re.findall with a lookahead:
import re
sen = 'The study reveals a speculated nodule with pleural tagging at anterior basal segment of LLL, measured 1.9x1.4x2.0 cm in size'
result = re.findall(r'[\dx\.]+(?=\scm)', sen)
Output:
['1.9x1.4x2.0']
Try this:
import re
sen = 'The study reveals a speculated nodule with pleural tagging at anterior basal segment of LLL, measured 1.9x1.4x2.0 cm in size'
print(re.findall(r'\d+\.\d+', sen))
Output:
['1.9', '1.4', '2.0']
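A plain \d+\.\d+ search ignores the cm/mm condition from the question, so it would also return decimals that are not measurements. One possible way to combine both requirements is to match the whole measurement together with its unit first and then split on x. This is only a sketch: extract_dimensions is a hypothetical helper name, and allowing zero whitespace before the unit (\s*) is an assumption, not something the question asks for.

```python
import re

# number, optional "x<number>" repeats, optional whitespace, then cm or mm
MEASUREMENT = re.compile(
    r"(?<![\w.])(\d+(?:\.\d+)?(?:x\d+(?:\.\d+)?)*)\s*([cm]m)\b"
)

def extract_dimensions(text):
    """Return a list of (values, unit) pairs for measurements in cm or mm."""
    return [(m.group(1).split("x"), m.group(2)) for m in MEASUREMENT.finditer(text)]

sen = 'The study reveals a speculated nodule with pleural tagging at anterior basal segment of LLL, measured 1.9x1.4x2.0 cm in size'
print(extract_dimensions(sen))
# [(['1.9', '1.4', '2.0'], 'cm')]
print(extract_dimensions("a 5 mm cyst seen in 2019"))
# [(['5'], 'mm')]
```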
Here is another approach:
import re
sen = 'The study reveals a speculated nodule with pleural tagging at anterior basal segment of LLL, measured 1.9x1.4x2.0 cm in size'
output = re.findall(r'\d\.\d', sen)  # escape the dot so it only matches a literal "."
Output:
['1.9', '1.4', '2.0']
import re
sen = '''The study reveals a speculated nodule with pleural tagging at anterior basal
segment of LLL, measured 1.9x1.4x2.0 cm in size'''
print(re.findall(r'[\d.]+', sen))
['1.9', '1.4', '2.0']
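One caveat with a character class of digits and dots: it also matches a lone period, so the result changes as soon as the sentence ends with a full stop. A small sketch, using a shortened, made-up variant of the sentence with a trailing period:

```python
import re

# Hypothetical variant of the sentence, ending with a full stop.
text = "measured 1.9x1.4x2.0 cm in size."

loose = re.findall(r'[\d.]+', text)          # any run of digits and dots
strict = re.findall(r'\d+(?:\.\d+)?', text)  # digits with an optional decimal part

print(loose)   # ['1.9', '1.4', '2.0', '.']
print(strict)  # ['1.9', '1.4', '2.0']
```

Neither pattern checks for the cm or mm unit, so both would still pick up unrelated numbers elsewhere in the text.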