
How do I extract a pattern from a huge txt file

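I have a huge text file containing random letters from A to Z, and I want to extract some of the characters. The tricky part is this: given the following input: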

AFVAJFLDVAJPQDVAJDSNJKVAJGHD

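and the pattern VAJ, I want to extract every match together with everything that follows it, up to the end of the string. I want the following output: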

[ "VAJFLDVAJPQDVAJDSNJKVAJGHD", "VAJPQDVAJDSNJKVAJGHD", "VAJDSNJKVAJGHD", "VAJGHD" ]

You can use str.find() to find the index at which your pattern occurs, and then slice the string accordingly. An implementation might look like this:

def find(inp, what):
  matches = []
  while what in inp:
    idx = inp.find(what)
    matches.append(inp[idx:])
    # remove the previous pattern from the string
    inp = inp[idx+len(what):]

  return matches

You can call it as find("AFVAJFLDVAJPQDVAJDSNJKVAJGHD", "VAJ"), which returns the list shown in the question.
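The find() function above re-slices the remaining input on every iteration, which copies the rest of the string each time. A minimal sketch of the same idea that instead passes a start offset to str.find() could look like the following. The name find_suffixes is made up for illustration, and like find() it only reports non-overlapping occurrences.

```python
def find_suffixes(text, pattern):
    """Return every suffix of `text` that begins at an occurrence of `pattern`."""
    if not pattern:
        # An empty pattern would match at every position and never advance.
        raise ValueError("pattern must not be empty")
    suffixes = []
    start = text.find(pattern)
    while start != -1:
        suffixes.append(text[start:])
        # Resume the search just past the current occurrence instead of
        # re-slicing the input string.
        start = text.find(pattern, start + len(pattern))
    return suffixes


print(find_suffixes("AFVAJFLDVAJPQDVAJDSNJKVAJGHD", "VAJ"))
```

Each returned suffix is still a copy of part of the input, so on very long inputs it may be preferable to collect only the start indices and slice on demand.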

This needs a regular expression with subgroup matching (see https://docs.python.org/3.5/library/re.html#match-objects).

My test file, data.txt:

QWEEEFVAJFLDVAJPQDVAJDSNJKVAJGHD
AFVAJFLDVAJPQDVAJDSNJKHFGHERQWFS
ONLY_TWO_VAJsOOVAJ123VAQQWERTY
START_VAJs_with_more_VAJ123VAJ_space_between
AAPVAJRCGVAJJKYVAJJJJJJJJVAJOOOO
AAPVAJRCGVAJJKYVAJJJJJJJJQQQOOOOO

Python code:

import re

pattern = "VAJ"

# Match three occurrences of the pattern, each separated by exactly three
# characters. The groups are nested so that each one captures the rest of
# the line from its own occurrence onwards.
re_str = pattern + "..." + "(" + pattern + "..." + "(" + pattern + "(.*)))"
regex = re.compile(re_str)

# Finds one more occurrence of the pattern in whatever follows the third one.
regex_extra = re.compile(pattern + ".*")

with open("data.txt") as data_file:
    for line in data_file:
        line = line.strip()
        match = regex.search(line)
        if match:
            result = [
                match.group(0),  # entire regex match
                match.group(1),  # outer parenthesized group
                match.group(2),  # middle parenthesized group
            ]

            # The innermost group holds the rest of the line.
            # Search it for one more occurrence of the pattern.
            the_rest = match.group(3)
            match_extra = regex_extra.search(the_rest)
            if match_extra:
                result.append(match_extra.group(0))

            print(result)

Output:

['VAJFLDVAJPQDVAJDSNJKVAJGHD', 'VAJPQDVAJDSNJKVAJGHD', 'VAJDSNJKVAJGHD', 'VAJGHD']
['VAJFLDVAJPQDVAJDSNJKHFGHERQWFS', 'VAJPQDVAJDSNJKHFGHERQWFS', 'VAJDSNJKHFGHERQWFS']
['VAJRCGVAJJKYVAJJJJJJJJVAJOOOO', 'VAJJKYVAJJJJJJJJVAJOOOO', 'VAJJJJJJJJVAJOOOO', 'VAJOOOO']
['VAJRCGVAJJKYVAJJJJJJJJQQQOOOOO', 'VAJJKYVAJJJJJJJJQQQOOOOO', 'VAJJJJJJJJQQQOOOOO']
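Note that only four of the six test lines produce output: the regex above expects exactly three characters between the first three occurrences and picks up at most one further occurrence, so lines that do not fit that shape are skipped. If the spacing is not fixed, one alternative (not part of the original answer) is a capturing group inside a lookahead. The lookahead consumes no characters, so re.findall() reports one suffix per occurrence. The helper name suffixes_from is hypothetical.

```python
import re


def suffixes_from(line, pattern):
    """Return the rest of `line` from each occurrence of `pattern` onwards."""
    # re.escape() makes the pattern match literally even if it contains
    # regex metacharacters. (?=...) is a zero-width lookahead, so the
    # scan moves on one character at a time and no occurrence is skipped.
    regex = re.compile("(?=(" + re.escape(pattern) + ".*))")
    return regex.findall(line)


print(suffixes_from("QWEEEFVAJFLDVAJPQDVAJDSNJKVAJGHD", "VAJ"))
print(suffixes_from("ONLY_TWO_VAJsOOVAJ123VAQQWERTY", "VAJ"))
```

Unlike the code above, this sketch also returns results for lines with fewer than three occurrences, so its output for data.txt would differ from the listing shown.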

The size of the file is not a problem for this code, as long as the longest line fits in memory a few times over.
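If the file turns out to be one enormous line, reading it line by line no longer bounds memory use. One possible workaround, sketched here under the assumption that the byte offsets of the matches are enough (each suffix can then be read from the file on demand), is to memory-map the file and search it with mmap.find(). The function name pattern_offsets is made up for this example, and the pattern has to be given as bytes.

```python
import mmap


def pattern_offsets(path, pattern):
    """Yield the byte offset of each non-overlapping occurrence of `pattern`."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    # Note: mmap cannot map an empty file, so this raises ValueError for one.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(pattern)
            while pos != -1:
                yield pos
                # Continue searching just past the current occurrence.
                pos = mm.find(pattern, pos + len(pattern))
```

For example, list(pattern_offsets("data.txt", b"VAJ")) would collect the offsets of every occurrence in the test file, counted from the start of the file rather than from the start of each line.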
