Parsing non-csv text file into dataframe

I have a text file where all columns are merged into a single column, and the 'rows' (records) are separated by two long lines of '-'. It looks like this:

Hash: some_hash_id
Author: some_author
Message: Message about the update


Reviewers: jimbo

Reviewed By: jimbo

Test Plan: Auto-generated

@bypass-lint
Commit Date: 2019-06-30 20:12:38
Modified path: path/to/my/file.php
Modified path: some/other/path/to/my/file.php
Modified path: path/to/other/file.php
-------------------------------------------------------
-------------------------------------------------------
Hash: some_other_hash_id
Author: different_author
Message: Auto generated message



Reviewers: broseph

Reviewed By: broseph

Test Plan: Auto-generated by Sam

@bypass-lint
Commit Date: 2019-06-30 18:09:12
Modified path: my/super/file.php
Modified path: totally/awesome/file.php
Modified path: file/path.json
-------------------------------------------------------
-------------------------------------------------------
Hash: hash_id_4
Author: new_author
Message: Auto DB big update



Reviewers: foo

Reviewed By: foo

Test Plan: Auto-generated by Tom

@bypass-lint
Commit Date: 2019-06-30 11:08:59
Modified path: big/scripts/file.json

The expected output for this example is a dataframe with just 3 rows. Dataframe columns: Hash (str), Author (str), Message (str), Reviewers (str), Reviewed By (str), Test Plan (str), Commit Date (timestamp), Modified path (array(str))

Load the whole file content into a variable named txt.

Then, to generate a DataFrame, it is enough to run a single (although quite complex) instruction:

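import collections
import re
import pandas as pd

# One dict per chunk; a key repeated inside a chunk keeps only its last value.
df = pd.DataFrame([ collections.OrderedDict(
    { m.group('key').strip(): re.sub(r'\n', ' ', m.group('val').strip())
        for m in re.finditer(
            r'^(?P<key>[^:\n]+):\s*(?P<val>.+?(?:\n[^:\n]+)*)$', chunk, re.M)})
    for chunk in re.split(r'(?:\n\-+)+\n', txt) ])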

Read the code starting from the last line. It splits txt into chunks on each run of consecutive lines that contain only - characters.
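As a rough illustration of that splitting step in isolation, here is a minimal sketch. The txt value below is a shortened two-record excerpt in the same layout as the sample, built inline rather than read from a file.

```python
import re

# Two shortened records in the same layout as the sample,
# separated by two dash-only lines.
txt = (
    "Hash: some_hash_id\n"
    "Author: some_author\n"
    "-------------------------------------------------------\n"
    "-------------------------------------------------------\n"
    "Hash: some_other_hash_id\n"
    "Author: different_author\n"
)

# (?:\n-+)+ consumes one or more consecutive dash-only lines (each with the
# line break in front of it); the final \n consumes the break after the last.
chunks = re.split(r'(?:\n-+)+\n', txt)

for chunk in chunks:
    print(repr(chunk))
```

Both separator lines are consumed by a single match, so no empty chunk appears between the two records.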

Then finditer takes over, dividing each chunk into key and val capturing groups.
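To see what the two named groups capture, the same pattern can be run on a single chunk. This is only a sketch: the chunk is a shortened record, and the second Message line is an invented continuation line (it is not in the sample) added to show how a value can span several lines.

```python
import re

chunk = (
    "Hash: some_hash_id\n"
    "Message: Message about the update\n"
    "continued on a second line\n"   # invented continuation line, for illustration
    "\n"
    "Reviewers: jimbo\n"
    "\n"
    "@bypass-lint\n"
    "Commit Date: 2019-06-30 20:12:38"
)

# key: everything before the first ':' on a line.
# val: the rest of that line, plus any directly following lines without a ':'.
pattern = r'^(?P<key>[^:\n]+):\s*(?P<val>.+?(?:\n[^:\n]+)*)$'

pairs = [(m.group('key'), m.group('val'))
         for m in re.finditer(pattern, chunk, re.M)]

for key, val in pairs:
    print(f'{key!r} -> {val!r}')
```

The @bypass-lint line produces no match here because it contains no colon and is separated from the previous value by a blank line. Only the first colon on a line ends the key, so the timestamp stays in one piece.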

The next step is a dictionary comprehension, which strips each key, strips each value and replaces its newlines, and wraps the result in an OrderedDict (this requires import collections).

All these dictionaries are collected by a list comprehension.

The last step is to create a DataFrame from that list.

To avoid multi-line items, newlines in each value (the piece of text after the colon) are replaced with a space (you are free to change this). Note that because each chunk becomes a dictionary, a key that repeats within a chunk (such as Modified path) keeps only its last value.
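Since a dictionary holds one value per key, the question's Modified path (array(str)) column needs the repeated keys to be accumulated instead. One possible variation is sketched below. It is not part of the original one-liner; parse_chunk and PAIR are hypothetical names, the txt value is a shortened excerpt of the sample, and the Commit Date conversion uses pd.to_datetime.

```python
import re
import pandas as pd

PAIR = re.compile(r'^(?P<key>[^:\n]+):\s*(?P<val>.+?(?:\n[^:\n]+)*)$', re.M)

def parse_chunk(chunk):
    """Hypothetical helper: one record -> dict, keeping every 'Modified path'."""
    row = {}
    for m in PAIR.finditer(chunk):
        key = m.group('key').strip()
        val = m.group('val').strip().replace('\n', ' ')
        if key == 'Modified path':
            row.setdefault(key, []).append(val)  # accumulate, do not overwrite
        else:
            row[key] = val
    return row

# Shortened excerpt of the sample file.
txt = (
    "Hash: some_hash_id\n"
    "Author: some_author\n"
    "Commit Date: 2019-06-30 20:12:38\n"
    "Modified path: path/to/my/file.php\n"
    "Modified path: path/to/other/file.php\n"
    "-------------------------------------------------------\n"
    "-------------------------------------------------------\n"
    "Hash: hash_id_4\n"
    "Author: new_author\n"
    "Commit Date: 2019-06-30 11:08:59\n"
    "Modified path: big/scripts/file.json\n"
)

df = pd.DataFrame([parse_chunk(c) for c in re.split(r'(?:\n-+)+\n', txt)])
df['Commit Date'] = pd.to_datetime(df['Commit Date'])  # str -> timestamp
print(df)
```

This trades the single expression for a small function, which makes the special case for Modified path easier to read and to change.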

Here's one implementation. Loop through each line; when a line contains ':', split it as columnname:columnval and add columnname as the key and columnval as the value to a temporary dictionary. Use if statements to detect the special keys Hash (the start of a new row), Modified path (append it to a list) and Commit Date (convert it to a datetime).

import pandas as pd
from datetime import datetime

test_path = '/home/kkawabat/.PyCharmCE2018.1/config/scratches/test.txt'
with open(test_path, 'r') as ofile:
    lines = ofile.readlines()

row_list = []
cur_row_dict = {}
for line in lines:
    # Split on the first ':' only, so values such as timestamps stay intact.
    line_split = line.split(':', 1)
    if len(line_split) == 2:
        colname, colval = line_split[0].strip(), line_split[1].strip()
        if colname == 'Hash':  # assuming Hash is always the first element
            # A new Hash starts a new row: store the previous row first.
            if len(cur_row_dict) != 0:
                row_list.append(cur_row_dict)
                cur_row_dict = {}
            cur_row_dict[colname] = colval
        elif colname == 'Commit Date':
            cur_row_dict[colname] = datetime.strptime(colval, '%Y-%m-%d %H:%M:%S')
        elif colname == 'Modified path':
            # Collect every modified path of the row in one list.
            if colname not in cur_row_dict:
                cur_row_dict[colname] = [colval]
            else:
                cur_row_dict[colname].append(colval)
        else:
            cur_row_dict[colname] = colval
# Store the last row, which has no following Hash line to trigger it.
if cur_row_dict:
    row_list.append(cur_row_dict)

df = pd.DataFrame(row_list)
print(df)
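To try the line-by-line approach without the hard-coded path, the same idea can be wrapped in a function and fed an in-memory file. This is only a sketch: parse_records is a hypothetical name, the sample text is a shortened excerpt of the file from the question, and this version also stores the Hash value in each row.

```python
import io
from datetime import datetime

import pandas as pd

def parse_records(lines):
    """Hypothetical wrapper around the line-by-line logic."""
    rows, cur = [], {}
    for line in lines:
        parts = line.split(':', 1)
        if len(parts) != 2:
            continue  # blank lines, dash separators, '@bypass-lint'
        key, val = parts[0].strip(), parts[1].strip()
        if key == 'Hash' and cur:
            rows.append(cur)  # a new Hash closes the previous record
            cur = {}
        if key == 'Commit Date':
            cur[key] = datetime.strptime(val, '%Y-%m-%d %H:%M:%S')
        elif key == 'Modified path':
            cur.setdefault(key, []).append(val)
        else:
            cur[key] = val
    if cur:
        rows.append(cur)  # the last record has no following Hash line
    return pd.DataFrame(rows)

# Shortened excerpt of the sample, as an in-memory file.
sample = io.StringIO(
    "Hash: some_hash_id\n"
    "Author: some_author\n"
    "\n"
    "@bypass-lint\n"
    "Commit Date: 2019-06-30 20:12:38\n"
    "Modified path: path/to/my/file.php\n"
    "Modified path: some/other/path/to/my/file.php\n"
    "-------------------------------------------------------\n"
    "-------------------------------------------------------\n"
    "Hash: hash_id_4\n"
    "Author: new_author\n"
    "Commit Date: 2019-06-30 11:08:59\n"
    "Modified path: big/scripts/file.json\n"
)

df = parse_records(sample)
print(df)
```

One caveat of splitting on the first colon: a continuation line of a multi-line value that happens to contain a colon would be treated as a new key.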
