
Parsing Dirty Text File with Pandas Header Issue

I am trying to parse a text file created back in '99 that is slightly difficult to deal with. The headers are in the first row and are delimited by '^' (the entire file is ^-delimited). The issue is that stray characters appear to be thrown in (long runs of spaces, for example, appear to separate the headers from the rest of the data points in the file). (Example files are located at https://www.chicagofed.org/applications/bhc/bhc-home ; my example references Q3 1999.)

Issues:

1) There are too many headers to create them manually, and I need to do this for many files that may have new headers as I move forward or backward through the time series.
2) I need to recreate the headers from the file and then remove them so that I don't pollute my entire first row with header duplicates. I realize I could probably slice the dataframe [1:] after the fact and just get rid of it, but that's sloppy and I'm sure there's a better way.
3) Fields not reported by a company appear to show up as "^^^^^^^^^", which is fine, but will pandas automatically populate NaNs in that scenario?
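(On point 3: yes — consecutive delimiters read as empty fields, which pandas fills with NaN. A minimal check against hypothetical inline data, not the actual BHC file:)

```python
import io
import pandas as pd

# Hypothetical sample mimicking the '^'-delimited layout; consecutive '^'
# delimiters produce empty fields, which read_csv parses as NaN.
sample = "ID^NAME^ASSETS^DEPOSITS\n1001^ACME BANK^^\n1002^^500^250\n"
df = pd.read_csv(io.StringIO(sample), sep='^')
print(df.isna().sum().sum())  # → 3 (all empty fields became NaN)
```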

My attempt below is simply trying to isolate the headers, but I'm really stuck on the larger issue of how the text file is structured. Any recommendations or obvious easy tricks I'm missing?

from zipfile import ZipFile
import pandas as pd

def main():
    #Driver

    FILENAME_PREFIX = 'bhcf'
    FILE_TYPE = '.txt'
    field_headers = []

    with ZipFile('reg_data.zip', 'r') as zf:  # avoid shadowing the builtin zip

        with zf.open(FILENAME_PREFIX + '9909' + FILE_TYPE) as qtr_file:
            # nrows=1 reads only the header row instead of the whole file
            headers_df = pd.read_csv(qtr_file, sep='^', header=None, nrows=1)
            headers_array = headers_df.values[0]

            # the handle was consumed by the first read, so rewind it
            qtr_file.seek(0)
            # header= expects row indices, not labels; pass names= instead
            parsed_data = pd.read_csv(qtr_file, sep='^',
                                      names=headers_array, skiprows=1)
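(For what it's worth, the two-pass approach above isn't needed at all: `read_csv` defaults to `header=0`, which takes the first row as column labels and excludes it from the data. A sketch against hypothetical inline data:)

```python
import io
import pandas as pd

# Hypothetical '^'-delimited sample standing in for a bhcfYYMM.txt file
sample = "ID^NAME^ASSETS\n1001^ACME BANK^900\n1002^ZENITH BHC^500\n"
df = pd.read_csv(io.StringIO(sample), sep='^')  # first row becomes the header
print(df.columns.tolist())  # ['ID', 'NAME', 'ASSETS']
print(len(df))              # 2 — no header duplicate left in row 0
```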

I tried this with the file you linked and with one I downloaded, I think from 2015:

import pandas as pd
df = pd.read_csv('bhcf9909.txt',sep='^')
first_headers = df.columns.tolist()
df_more_actual = pd.read_csv('bhcf1506.txt',sep='^')
second_headers = df_more_actual.columns.tolist()
print(df.shape)
print(df_more_actual.shape)
# df_more_actual has more columns than first one
# Normalize column names to avoid duplicate columns
df.columns = df.columns.str.upper()
df_more_actual.columns = df_more_actual.columns.str.upper()
new_df = pd.concat([df, df_more_actual], sort=False)  # DataFrame.append is deprecated
print(new_df.shape)

The final dataframe has the rows of both CSVs and the union of their columns. You can do this for the CSV of each quarter, appending as you go, so that you end up with all of the rows and the union of the columns.
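(The per-quarter loop could be sketched like this, with hypothetical inline data standing in for the `bhcfYYMM.txt` files; `pd.concat` does the row-wise append and column union in one call:)

```python
import io
import pandas as pd

# Hypothetical quarterly extracts; note the second quarter has an extra column
quarters = {
    'bhcf9909.txt': "ID^NAME\n1^A\n2^B\n",
    'bhcf1506.txt': "ID^NAME^ASSETS\n3^C^900\n",
}

frames = []
for name, text in quarters.items():
    df = pd.read_csv(io.StringIO(text), sep='^')
    df.columns = df.columns.str.upper()  # normalize names to avoid duplicates
    frames.append(df)

# All rows from every quarter; columns are the union, missing cells are NaN
combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined.shape)  # → (3, 3)
```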
