How to extract a pattern from a huge txt file

Question

I have a huge text file consisting of random letters from A to Z, and I want to extract some substrings from it. The tricky part is that, given the following input:

AFVAJFLDVAJPQDVAJDSNJKVAJGHD

and the pattern VAJ, I want to extract, for each match, everything from that match to the end of the string. I want the following output:

[ "VAJFLDVAJPQDVAJDSNJKVAJGHD", "VAJPQDVAJDSNJKVAJGHD", "VAJDSNJKVAJGHD", "VAJGHD" ]

Answer 1

You can use str.find() to find the index where your pattern occurs, and then slice the string accordingly. An implementation could look like this:

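def find(inp, what):
  matches = []
  # index of the first occurrence, or -1 if there is none
  idx = inp.find(what)
  while idx != -1:
    # keep everything from this occurrence to the end of the string
    matches.append(inp[idx:])
    # continue searching after the current occurrence
    idx = inp.find(what, idx + len(what))

  return matches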

You can use it with find("AFVAJFLDVAJPQDVAJDSNJKVAJGHD", "VAJ").
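For a really large input, every stored suffix is a fresh copy of the remaining text, so it may be cheaper to collect only the start offsets and slice on demand. The following is a minimal sketch of that idea, not part of the answer above; the name iter_match_offsets is hypothetical, and unlike find() it advances by one character, so overlapping occurrences are reported too.

```python
def iter_match_offsets(text, pattern):
    """Yield every index at which pattern starts in text, overlaps included."""
    if not pattern:
        # An empty pattern would match at every position; treat it as an error.
        raise ValueError("pattern must not be empty")
    idx = text.find(pattern)
    while idx != -1:
        yield idx
        # Advance by one character so overlapping occurrences are found as well.
        idx = text.find(pattern, idx + 1)


text = "AFVAJFLDVAJPQDVAJDSNJKVAJGHD"
offsets = list(iter_match_offsets(text, "VAJ"))
print(offsets)
# Slice only when a suffix is actually needed.
print([text[i:] for i in offsets])
```

For the input from the question, the offsets are 2, 8, 14 and 22, and slicing at them gives the four strings the question asks for.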

Answer 2

This calls for regular expressions with sub-group matching (https://docs.python.org/3.5/library/re.html#match-objects).

My test file data.txt:

QWEEEFVAJFLDVAJPQDVAJDSNJKVAJGHD
AFVAJFLDVAJPQDVAJDSNJKHFGHERQWFS
ONLY_TWO_VAJsOOVAJ123VAQQWERTY
START_VAJs_with_more_VAJ123VAJ_space_between
AAPVAJRCGVAJJKYVAJJJJJJJJVAJOOOO
AAPVAJRCGVAJJKYVAJJJJJJJJQQQOOOOO

Python code:

import re

pattern = "VAJ"

re_str = pattern + "..." + "(" + pattern + "..." +"(" +  pattern + "(.*)))"
regex = re.compile(re_str)

regex_extra = re.compile(pattern + ".*")

with open("data.txt") as data_file:
    for line in data_file:
        line = line.strip()
        match = regex.search(line)
        if match:
            result = list()
            result.append(match.group(0))   # entire regex match
            result.append(match.group(1))   # outer parenthesized group
            result.append(match.group(2))   # middle parenthesized group

            # The innermost parenthesized group contains the rest of the line.
            # Search it for one more occurrence of the pattern.
            the_rest = match.group(3)
            match_extra = regex_extra.search(the_rest)
            if match_extra:   # one more <pattern> in the rest of the line
                result.append(match_extra.group(0))   # add it to the result list

            print(result)

Output:

['VAJFLDVAJPQDVAJDSNJKVAJGHD', 'VAJPQDVAJDSNJKVAJGHD', 'VAJDSNJKVAJGHD', 'VAJGHD']
['VAJFLDVAJPQDVAJDSNJKHFGHERQWFS', 'VAJPQDVAJDSNJKHFGHERQWFS', 'VAJDSNJKHFGHERQWFS']
['VAJRCGVAJJKYVAJJJJJJJJVAJOOOO', 'VAJJKYVAJJJJJJJJVAJOOOO', 'VAJJJJJJJJVAJOOOO', 'VAJOOOO']
['VAJRCGVAJJKYVAJJJJJJJJQQQOOOOO', 'VAJJKYVAJJJJJJJJQQQOOOOO', 'VAJJJJJJJJQQQOOOOO']

The size of the file is not a problem for this code: it reads one line at a time, so as long as the longest line fits in memory a few times over, it should be OK.
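Note that the regex above expects exactly three characters between the first three occurrences and needs at least three occurrences, which is why the third and fourth lines of data.txt produce no output. If any number of occurrences with arbitrary gaps should be handled, one possible variation is to use re.finditer and slice from each match start. This is only a sketch under that assumption; the helper names suffixes_from_matches and process_file are hypothetical.

```python
import re


def suffixes_from_matches(line, pattern):
    """Return line[m.start():] for each non-overlapping occurrence of pattern."""
    # re.escape makes the pattern literal, in case it contains regex metacharacters.
    regex = re.compile(re.escape(pattern))
    return [line[m.start():] for m in regex.finditer(line)]


def process_file(path, pattern):
    """Yield one list of suffixes per line that contains the pattern."""
    with open(path) as handle:
        for line in handle:   # one line at a time, as in the code above
            result = suffixes_from_matches(line.strip(), pattern)
            if result:
                yield result


# A line with only two occurrences, which the fixed-width regex skips.
print(suffixes_from_matches("ONLY_TWO_VAJsOOVAJ123VAQQWERTY", "VAJ"))
```

Calling process_file("data.txt", "VAJ") would then yield one list per line of the test file, including the two lines that the fixed-width regex skips.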
