How to extract two consecutive lines matching patterns in Python

Question

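I am trying to extract lines from test.txt that match two different patterns.
First, I want to extract the line matching >> fbat -v1, and then the corresponding line below the one matching p-value(2-sided).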

This is the code I tried, but it only extracts the lines matching the first pattern.

import re

file = open('test.txt')
for line in file:
    match = re.findall('^>> fbat -v1', line)
    if match:
        print line

I also tried doing this in R, but R does not seem well suited to it. I am not familiar with Python, so could someone help me with this? Thanks in advance.

test.txt:

>> fbat -v1 1:939467:A:G
trait STATUS; offset 0.150; model additive; test bi-allelic; minsize 2; min_freq 0.000; p 1.000; maxcmh 1000

Marker            afreq     fam#       weight     S-E(S)      Var(S)      Z        P
----------------------------------------------------------------------------------------

Weighted FBAT rare variant statistics for the SNPs:

W           Var(W)      Z           p-value(2-sided)
----------------------------------------------------
0.400       0.240       0.816       4.14216178e-01
----------------------------------------------------


>> fbat -v1 1:941298:C:T 1:941301:G:A 1:941310:C:T 1:941324:G:A
trait STATUS; offset 0.150; model additive; test bi-allelic; minsize 2; min_freq 0.000; p 1.000; maxcmh 1000

Marker            afreq     fam#       weight     S-E(S)      Var(S)      Z        P
----------------------------------------------------------------------------------------

Weighted FBAT rare variant statistics for the SNPs:

W           Var(W)      Z           p-value(2-sided)
----------------------------------------------------
0.333       0.444       0.500       6.17075077e-01
----------------------------------------------------

Desired result:

>> fbat -v1 1:939467:A:G 0.400       0.240       0.816       4.14216178e-01
>> fbat -v1 1:941298:C:T 1:941301:G:A 1:941310:C:T 1:941324:G:A 0.333       0.444       0.500       6.17075077e-01

Answer 1

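You can do this with a regular expression that picks the required data out of multiple lines. With only two samples, it is hard to know whether this one will match every case: some of your data may not be as regular as the samples suggest.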

This does not follow the one-line-at-a-time pattern of for line in file: because each of your records spans multiple lines.

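import re

# Read the whole file at once, since each record spans several lines
with open('test.txt') as file:
    data = file.read()
# Group 1: the ">> fbat -v1" line; group 2: the row under the p-value header
rex = re.compile(r"(>> fbat -v1.+?\n).+?p-value\(2-sided\)\n-+\n(.+?)\n-", re.DOTALL)
for header, numbers in rex.findall(data):
    print(header.rstrip(), numbers)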

The output is

>> fbat -v1 1:939467:A:G 0.400       0.240       0.816       4.14216178e-01
>> fbat -v1 1:941298:C:T 1:941301:G:A 1:941310:C:T 1:941324:G:A 0.333       0.444       0.500       6.17075077e-01

I notice in passing that you are working in Python 2. Unless this is a one-off, consider switching to Python 3. You should not be spending time learning Python 2.
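To try the multi-line approach without a file on disk, here is a minimal sketch that runs a variant of the pattern above against an inline sample shortened from the question. The names SAMPLE, REX and extract, and the group names header and numbers, are illustrative assumptions and not part of the original answer.

```python
import re

# Inline sample, shortened from the question's test.txt
SAMPLE = """\
>> fbat -v1 1:939467:A:G
trait STATUS; offset 0.150

W           Var(W)      Z           p-value(2-sided)
----------------------------------------------------
0.400       0.240       0.816       4.14216178e-01
----------------------------------------------------
"""

# Variant of the answer's pattern using named groups (names are illustrative)
REX = re.compile(
    r"(?P<header>>> fbat -v1[^\n]*)\n.+?p-value\(2-sided\)\n-+\n(?P<numbers>[^\n]+)\n-",
    re.DOTALL,
)


def extract(text):
    """Return one 'header numbers' string per matched record."""
    return [
        m.group("header").rstrip() + " " + m.group("numbers")
        for m in REX.finditer(text)
    ]


for row in extract(SAMPLE):
    print(row)
```

Named groups make it easier to see which part of the pattern captures which line. As with the original pattern, this assumes every record has the dashed rule directly under the p-value(2-sided) header.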

Answer 2

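import re

# Read all lines into a list so the next line can be reached by index
with open('test.txt') as file:
    lines = file.readlines()
for idx, line in enumerate(lines):
    match = re.findall(r'^>> fbat -v1', line)
    # Guard idx + 1 so a match on the last line does not raise IndexError
    if match and idx + 1 < len(lines):
        # The parentheses are escaped so they match literally
        match = re.findall(r'p-value\(2-sided\)', lines[idx + 1])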

Of course, you need to take care with the last line: if it matches ^>> fbat -v1, there is no next line to access.
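In the sample data the p-value(2-sided) line is not directly below the >> fbat -v1 line, so an index-based approach has to scan forward. The following is a rough sketch of that idea, assuming the values always sit two lines below the p-value(2-sided) header; the function name pair_headers_with_values is hypothetical.

```python
def pair_headers_with_values(lines):
    """Pair each '>> fbat -v1' line with the values row two lines
    below the next 'p-value(2-sided)' header line."""
    results = []
    for idx, line in enumerate(lines):
        if not line.startswith(">> fbat -v1"):
            continue
        # Scan forward from the header for the p-value column line
        for j in range(idx + 1, len(lines)):
            if lines[j].startswith(">> fbat -v1"):
                break  # next record started; this one has no p-value table
            if lines[j].rstrip().endswith("p-value(2-sided)"):
                # j + 1 is the dashed rule, j + 2 holds the numbers
                if j + 2 < len(lines):
                    results.append(line.rstrip() + " " + lines[j + 2].rstrip())
                break
    return results


sample = [
    ">> fbat -v1 1:939467:A:G\n",
    "trait STATUS; offset 0.150\n",
    "\n",
    "W           Var(W)      Z           p-value(2-sided)\n",
    "----------------------------------------------------\n",
    "0.400       0.240       0.816       4.14216178e-01\n",
    "----------------------------------------------------\n",
]
for row in pair_headers_with_values(sample):
    print(row)
```

With a real file, the list could come from readlines(). Records that lack a p-value table are skipped rather than raising an error.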

Answer 3

If you do not want to use regular expressions, you can use a generator, which reduces RAM usage when you read a lot of data (such as a 10 GB file).

f = open("input.txt")

# Iterate over f lazily; f.readlines() would load the whole file into memory.
# To parse a string instead, use content = iter(string_to_parse.splitlines())
content = (line.rstrip("\n") for line in f)
result = []
try:
    for line in content:
        # We try to find a line that starts with >> fbat -v1
        if line.startswith(">> fbat -v1"):
            result_line = line
            # Jump lines until we find the one that ends with p-value(2-sided)
            while not next(content).endswith("p-value(2-sided)"):
                pass
            # jump one line to ignore the ----------------------------------------------------
            next(content)
            # We add the values line to our result, separated by a space
            result_line += " " + next(content)
            # finally we add our result to a list
            result.append(result_line)
# this will happen if there is a >> fbat -v1 without p-value(2-sided) after
except StopIteration:
    print('Could not find "p-value(2-sided)" after ">> fbat -v1" ')
finally:
    f.close()

# print the result
print("\n".join(result))

I used a file here to hold the data (in case it is a log file).
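The same idea can also be packaged as a generator function, so that each result is produced as soon as it is found instead of being collected in a list. This is only a sketch under the assumption that the values row is always two lines below the p-value(2-sided) header; the name iter_fbat_rows is hypothetical, and io.StringIO stands in for an open file.

```python
import io


def iter_fbat_rows(lines):
    """Lazily yield 'header values' strings from an iterable of lines."""
    header = None      # most recent '>> fbat -v1' line, if any
    countdown = None   # lines left until the values row
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">> fbat -v1"):
            header, countdown = line, None
        elif header is not None and line.endswith("p-value(2-sided)"):
            countdown = 2  # skip the dashed rule, then read the values
        elif countdown is not None:
            countdown -= 1
            if countdown == 0:
                yield header + " " + line
                header, countdown = None, None


text = (
    ">> fbat -v1 1:939467:A:G\n"
    "trait STATUS; offset 0.150\n"
    "\n"
    "W           Var(W)      Z           p-value(2-sided)\n"
    "----------------------------------------------------\n"
    "0.400       0.240       0.816       4.14216178e-01\n"
    "----------------------------------------------------\n"
)

# A file object from open() could be passed in place of io.StringIO
for row in iter_fbat_rows(io.StringIO(text)):
    print(row)
```

Because the function only keeps the current header and a counter, memory use stays constant regardless of file size. A header with no p-value table after it is silently dropped when the next header arrives.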
