Parsing structured but non-tabular text into pandas using regular expressions

Question

I am trying to parse the following data from a text file into pandas:

genome Bacteroidetes_4
reference B650
source carotenoid

genome Desulfovibrio_3
reference B123
source Polyketide
reference B839
source flexirubin

I would like the output to look like this:

genome,reference,source
Bacteroidetes_4,B650,carotenoid
Desulfovibrio_3,B123,Polyketide
Desulfovibrio_3,B839,flexirubin

I have adapted some code from https://www.vipinajayakumar.com/parsing-text-with-python/ (by Vipin Ajayakumar):

import pandas as pd
import re

# write regular expressions
rx_dict = {
    'genome': re.compile(r'genome (?P<genome>.*)\n'),
    'source': re.compile(r'source (?P<source>.*)\n'),
    'reference': re.compile(r'reference (?P<reference>.*)\n'),
}

# line parser
def parse_line(line):
    for key, rx in rx_dict.items():
        match = rx.search(line)
        if match:
            return key, match
    # if there are no matches
    return None, None

I believe the problem is in the file parser, somewhere in the while loop.

def parse_file(filepath):    
    data = []  
    # open the file and read through it line by line
    with open(filepath, 'r') as file_object:        
        line = file_object.readline()        
        while line:           
            # at each line check for a match with a regex
            key, match = parse_line(line)

            # extract from each line
            if key == 'genome':
                genome = match.group('genome')
            if key == 'Source':
                Source = match.group('Source')               
            if key == 'reference':
                Type = match.group('reference')

                while line.strip():
                    row = {
                        'genome': genome,
                        'reference': reference,
                        'Source': Source,
                        }
                    data.append(row)
        data = pd.DataFrame(data)
    return data

if __name__ == '__main__':
    filepath = '/path/file.txt'
    data = parse_file(filepath)
    data.to_csv('output.csv', sep=',', index=False)

When I run this code, it keeps running and never finishes. Any tips on how to correct or solve this would be greatly appreciated.

Answer 1

Have you tried setting a breakpoint inside the while loop and using a debugger to see what is happening?

You can use:

breakpoint()

on Python >= 3.7. For older versions:

import pdb

# your code

# for each part you are
# interested in the while 
# loop:
pdb.set_trace()

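Then run your script with the debugger enabled:

$ python3 -m pdb yourscript.py

Use 'c' to continue to the next breakpoint. See the pdb documentation for more about the available commands.

If you use an IDE, you can also use its integrated debugger, which is less cumbersome.

By the way, the cause is probably that you use "while line" but never read a new line inside the loop. As long as the first line is not an empty string, the condition evaluates to True and the loop never exits. Try iterating over the file with a for loop instead.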

For example:

with open('file.suffix', 'r') as fileobj:
    for line in fileobj:
        # your logic
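
Putting the for loop together with the question's regex dictionary, the whole parser could look roughly like the sketch below. It is one possible layout, not the only one. It assumes that every "source" line closes a record and that the most recent genome and reference apply to it. The names RX, parse_lines and sample are made up for this illustration. Note that the original code would also fail once the loop is fixed: it tests for the key 'Source' while the dictionary key is 'source', and it stores the reference in a variable named Type but later reads reference.

```python
import re

import pandas as pd

# One pattern per field. "$" is used instead of "\n" so that a final line
# without a trailing newline still matches.
RX = {
    "genome": re.compile(r"^genome (?P<genome>.*)$"),
    "reference": re.compile(r"^reference (?P<reference>.*)$"),
    "source": re.compile(r"^source (?P<source>.*)$"),
}


def parse_lines(lines):
    """Build one row per reference/source pair, repeating the current genome."""
    rows = []
    genome = reference = None
    for raw in lines:  # the for loop moves to the next line by itself
        line = raw.strip()
        for key, rx in RX.items():
            match = rx.match(line)
            if not match:
                continue
            value = match.group(key)
            if key == "genome":
                genome = value
            elif key == "reference":
                reference = value
            else:  # a "source" line completes a record (assumption)
                rows.append(
                    {"genome": genome, "reference": reference, "source": value}
                )
            break
    return pd.DataFrame(rows, columns=["genome", "reference", "source"])


def parse_file(filepath):
    """Same parser, reading the lines from a file on disk."""
    with open(filepath, "r") as file_object:
        return parse_lines(file_object)


sample = """genome Bacteroidetes_4
reference B650
source carotenoid

genome Desulfovibrio_3
reference B123
source Polyketide
reference B839
source flexirubin
"""

df = parse_lines(sample.splitlines())
print(df.to_csv(index=False))
```

Calling parse_file('/path/file.txt').to_csv('output.csv', index=False) would then write the CSV the question asks for.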

Answer 2

Your problem is in how the file is read here:

with open(filepath, 'r') as file_object:        
    line = file_object.readline()        
    while line:

The value of line never changes, so the while loop runs forever.

Change it to this:

with open(filepath, 'r') as file_object: 
    lines = file_object.readlines()
    for line in lines:
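
If you would rather keep the while loop, the usual pattern is to read the next line at the end of every iteration. The snippet below is only a small sketch of that pattern. io.StringIO stands in for the opened file so that it runs on its own, and the list seen is a placeholder for the real per-line logic.

```python
import io

# io.StringIO stands in for the opened file (assumption for this sketch).
file_object = io.StringIO(
    "genome Bacteroidetes_4\nreference B650\nsource carotenoid\n"
)

seen = []
line = file_object.readline()
while line:  # readline() returns '' only at end of file
    seen.append(line.strip())  # placeholder for the real per-line logic
    line = file_object.readline()  # advance; without this the loop never ends
```

A blank line in the file is read as '\n', which is still truthy, so the loop does not stop early on empty lines.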

Answer 3

Using only pandas, we can use str.split:

df = pd.read_csv('tmp.txt',sep='|',header=None)
s = df[0].str.split(' ',expand=True)

df_new = s.set_index([0,s.groupby(0).cumcount()]).unstack(0)

print(df_new)

                 1                      
0           genome reference      source
0  Bacteroidetes_4      B650  carotenoid
1  Desulfovibrio_3      B123  Polyketide
2              NaN      B839  flexirubin
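
The output above leaves the genome of the last row as NaN, because cumcount pairs values by their order of occurrence and not by the genome block they belong to. One possible pandas-only variation is sketched below. It forward-fills the genome and starts a new record at every "reference" line. It assumes that each line is a key followed by a single space and a value, and that every record begins with a "reference" line. The names text, kv, rest and result are made up for this illustration.

```python
import pandas as pd

text = """genome Bacteroidetes_4
reference B650
source carotenoid

genome Desulfovibrio_3
reference B123
source Polyketide
reference B839
source flexirubin
"""

# Split every non-empty line into a key and a value at the first space.
lines = pd.Series(text.splitlines())
lines = lines[lines.str.strip() != ""]
kv = lines.str.split(" ", n=1, expand=True)
kv.columns = ["key", "value"]

# Carry the most recent genome down to the lines that follow it.
kv["genome"] = kv["value"].where(kv["key"] == "genome").ffill()

# Each "reference" line starts a new record (assumption).
rest = kv[kv["key"] != "genome"].copy()
rest["record"] = (rest["key"] == "reference").cumsum()

# One row per record, one column per key, plus the genome it belongs to.
wide = rest.pivot(index="record", columns="key", values="value")
wide["genome"] = rest.groupby("record")["genome"].first()
result = wide[["genome", "reference", "source"]].reset_index(drop=True)
result.columns.name = None

print(result.to_csv(index=False))
```

To read from a file instead, the lines could be taken from open(...).read().splitlines() in place of the text string.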

3 answers

Answer 1: 2021-05-12 00:53:29
Answer 2: 2021-05-12 00:54:56 (accepted)
Answer 3: 2021-05-12 01:13:28
