How do I extract the pattern out from huge txt file
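I have a huge text file containing random letters from A to Z, and I want to extract some of the characters. The tricky part is this: given the following input:
AFVAJFLDVAJPQDVAJDSNJKVAJGHD
and the pattern VAJ, I want to extract every match together with everything that follows it, up to the end of the string. I want the following output: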
[ "VAJFLDVAJPQDVAJDSNJKVAJGHD", "VAJPQDVAJDSNJKVAJGHD", "VAJDSNJKVAJGHD", "VAJGHD" ]
You can use str.find() to find the index at which your pattern occurs, and then slice the string accordingly. An implementation might look like this:
def find(inp, what):
    matches = []
    while what in inp:
        idx = inp.find(what)
        matches.append(inp[idx:])
        # remove the previous pattern from the string
        inp = inp[idx+len(what):]
    return matches
You can call it as find("AFVAJFLDVAJPQDVAJDSNJKVAJGHD", "VAJ").
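The find function above re-slices the string on every iteration and resumes after the whole pattern, so it would skip overlapping occurrences (for example "AA" in "AAA"). As a minimal sketch of an alternative, the start argument of str.find() can be used to continue the search from a given index. The name iter_suffixes is hypothetical and not part of the original answer:

```python
def iter_suffixes(text, pattern):
    """Yield text[i:] for every index i at which pattern occurs."""
    start = text.find(pattern)
    while start != -1:
        yield text[start:]
        # Resume one character later so overlapping occurrences are also found.
        start = text.find(pattern, start + 1)


print(list(iter_suffixes("AFVAJFLDVAJPQDVAJDSNJKVAJGHD", "VAJ")))
# ['VAJFLDVAJPQDVAJDSNJKVAJGHD', 'VAJPQDVAJDSNJKVAJGHD', 'VAJDSNJKVAJGHD', 'VAJGHD']
```

Because this is a generator, the suffixes are produced one at a time, which may help when a single line is very long.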
This calls for a regular expression with subgroup matching. ( https://docs.python.org/3.5/library/re.html#match-objects )
My test file, data.txt:
QWEEEFVAJFLDVAJPQDVAJDSNJKVAJGHD
AFVAJFLDVAJPQDVAJDSNJKHFGHERQWFS
ONLY_TWO_VAJsOOVAJ123VAQQWERTY
START_VAJs_with_more_VAJ123VAJ_space_between
AAPVAJRCGVAJJKYVAJJJJJJJJVAJOOOO
AAPVAJRCGVAJJKYVAJJJJJJJJQQQOOOOO
Python code:
import re

pattern = "VAJ"
# Three nested groups: each group starts at the next <pattern>,
# with exactly three arbitrary characters between occurrences.
re_str = pattern + "..." + "(" + pattern + "..." + "(" + pattern + "(.*)))"
regex = re.compile(re_str)
regex_extra = re.compile(pattern + ".*")

with open("data.txt") as data_file:
    for line in data_file:
        line = line.strip()
        match = regex.search(line)
        if match:
            result = list()
            result.append(match.group(0))  # entire regex match
            result.append(match.group(1))  # outer regex parenthesis'ed group
            result.append(match.group(2))  # middle regex parenthesis'ed group
            # Most inner regex parenthesis'ed group contains rest of the line.
            # Use this to find extra pattern.
            the_rest = match.group(3)
            match_extra = regex_extra.search(the_rest)
            if match_extra:  # If one more <pattern> in the rest of the line
                result.append(match_extra.group(0))  # add it to the result list
            # Output
            print(result)
Output:
['VAJFLDVAJPQDVAJDSNJKVAJGHD', 'VAJPQDVAJDSNJKVAJGHD', 'VAJDSNJKVAJGHD', 'VAJGHD']
['VAJFLDVAJPQDVAJDSNJKHFGHERQWFS', 'VAJPQDVAJDSNJKHFGHERQWFS', 'VAJDSNJKHFGHERQWFS']
['VAJRCGVAJJKYVAJJJJJJJJVAJOOOO', 'VAJJKYVAJJJJJJJJVAJOOOO', 'VAJJJJJJJJVAJOOOO', 'VAJOOOO']
['VAJRCGVAJJKYVAJJJJJJJJQQQOOOOO', 'VAJJKYVAJJJJJJJJQQQOOOOO', 'VAJJJJJJJJQQQOOOOO']
The size of the file is not a problem for this code, as long as the longest line fits in memory a few times over.
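Note that the regular expression above only matches lines with at least three occurrences of the pattern, each separated by exactly three characters, which is why two of the test lines produce no output. If that restriction is not wanted, one possible variation is a zero-width lookahead combined with re.finditer(), which reports every position at which the pattern starts. This is a sketch under that assumption, not the original author's method; the name suffixes_per_line is hypothetical:

```python
import re


def suffixes_per_line(lines, pattern):
    """For each line, yield the list of suffixes that start at an occurrence of pattern."""
    # (?=...) is a zero-width lookahead: it matches at every position where the
    # pattern starts without consuming it. re.escape treats the pattern literally.
    regex = re.compile("(?=" + re.escape(pattern) + ")")
    for line in lines:
        line = line.strip()
        hits = [line[m.start():] for m in regex.finditer(line)]
        if hits:
            yield hits


sample = [
    "QWEEEFVAJFLDVAJPQDVAJDSNJKVAJGHD",
    "ONLY_TWO_VAJsOOVAJ123VAQQWERTY",
]
for hits in suffixes_per_line(sample, "VAJ"):
    print(hits)
```

To process a file such as data.txt, an open file object can be passed in place of the sample list, so that the file is still read one line at a time.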