How to parse out a dataframe from a .csv file which contains header and detail rows, using Python

I have a file I am trying to extract values from to create a data frame. I tried a regex approach to build lists from the file, but the data format (Header/H and Detail/D rows, shown below) gives me inconsistent row counts when I put the resulting lists into a data frame. I think the issue is that some records have one detail (D) row while others have more than one. Could you suggest another approach? I was thinking of creating a dictionary in which each H row is the key and its D rows are the values, using a for loop of some kind.

The file format is as below:

    H, INV34801, 20200201, 09:18:55, IN, 5
    D, INV34801, 0053, 1.00, IN, 20200201, 09:18:55,
    H, INV34802, 20200201, 10:12:35, IN, 5
    D, INV34802, D22345433DU, -1.00, IN, 20200201, 10:12:35,
    D, INV34802, , 1.00, IN, 20200201, 10:12:35,

This is the code I have been trying:

    import re

    import pandas as pd

    # `contents` is assumed to hold the full text of the file as a single string.

    # First I extract the date on which each sale took place.
    lst1 = re.findall(r'[IN, ]\d\d\d\d\d\d\d\d', contents)
    # Each date shows up twice, so I keep every second match. I confirmed that the
    # date column then has the same number of rows as the invoice number column.
    lst1 = lst1[1::2]
    # Now I extract the invoice numbers.
    lst2 = re.findall(r"INV\w*", contents)
    # Now I extract the product codes.
    lst3 = re.findall(
        r'\s\s\s\s\s\w\w\w\w\w\w\w\w\w\w\w|\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s|\n'
        r'\s\s\s\s\s\s\s\s\s\s\s\s\d\d\d\d|\s\s\s\s\s\s\s\s\s\s\d\d\d\d\d\d',
        contents)
    # Now I extract the quantity sold.
    lst4 = re.findall(r'\s\s\s\s\s\s\d\.\d\d', contents)
    # Then I build the data frame with one column per list.
    df = pd.DataFrame([lst1, lst2, lst3, lst4])
    df = df.transpose()
    df.columns = ['Date', 'Invoice_Number', 'Product_Code', 'Quantity']
    print(df)

The output structure I get is correct, but the quantities and product codes aren't lined up with the correct invoice numbers.

Dataframe below:

           Date  Invoice_Number  Product_Code  Quantity
    0  20200201        INV34801                    1.00
    1  20200201        INV34802                    1.00
    2  20200201        INV34803                    1.00
    3  20200201        INV34804                    1.00
    4  20200201        INV34805                    8.00

I'd appreciate your kind advice.

Try this:

regex = r"[H,D] (?P<invoice_nr>[^,]*)(, (?P<date>[^,]*)[\s\S]*?(?P<quantity>-?\d+\.00), IN)[\s\S]*?(\n|$)"
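A minimal sketch of how this pattern might be applied, using `re.finditer` and the named groups to build the data frame. The `contents` string is a hypothetical stand-in reconstructed from the sample in the question; the real file may use different whitespace, in which case the pattern would need adjusting.

```python
import re

import pandas as pd

# The pattern from the answer, unchanged.
regex = r"[H,D] (?P<invoice_nr>[^,]*)(, (?P<date>[^,]*)[\s\S]*?(?P<quantity>-?\d+\.00), IN)[\s\S]*?(\n|$)"

# Hypothetical stand-in for the file contents, based on the sample in the question.
contents = (
    "H, INV34801, 20200201, 09:18:55, IN, 5\n"
    "D, INV34801, 0053, 1.00, IN, 20200201, 09:18:55,\n"
    "H, INV34802, 20200201, 10:12:35, IN, 5\n"
    "D, INV34802, D22345433DU, -1.00, IN, 20200201, 10:12:35,\n"
    "D, INV34802, , 1.00, IN, 20200201, 10:12:35,\n"
)

# groupdict() returns only the named groups: invoice_nr, date and quantity.
rows = [m.groupdict() for m in re.finditer(regex, contents)]
df = pd.DataFrame(rows)
print(df)
```

On this sample the pattern produces one match per D row. Two things to watch for: the `date` group captures whatever field follows the invoice number, so for a D row that has no H row directly before it the group holds the product code (empty in the sample) rather than a date, and `\d+\.00` only matches quantities that end in `.00`.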

I have made an example for you here. I'm not sure whether it helps, but it should give you some pointers on regex.
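As an alternative to regex, and closer to the for-loop idea in the question, the file could be read line by line while remembering the most recent H row, so that every D row produces exactly one output row. This is only a sketch: it keeps the last header in a variable instead of a dictionary, it assumes the fields are comma-separated as in the sample, and the field positions (date in the third H field, product code and quantity in the third and fourth D fields) are inferred from that sample. The `contents` string is again a hypothetical stand-in for the file.

```python
import pandas as pd

# Hypothetical stand-in for the file contents, based on the sample in the question.
contents = """\
H, INV34801, 20200201, 09:18:55, IN, 5
D, INV34801, 0053, 1.00, IN, 20200201, 09:18:55,
H, INV34802, 20200201, 10:12:35, IN, 5
D, INV34802, D22345433DU, -1.00, IN, 20200201, 10:12:35,
D, INV34802, , 1.00, IN, 20200201, 10:12:35,
"""

records = []
header = None  # fields of the most recent H row
for line in contents.splitlines():
    fields = [field.strip() for field in line.split(",")]
    if fields[0] == "H":
        header = fields
    elif fields[0] == "D" and header is not None:
        # One output row per D row, so invoices with several details stay aligned.
        records.append({
            "Date": header[2],            # assumed position of the date in an H row
            "Invoice_Number": fields[1],
            "Product_Code": fields[2],    # may be empty, as in the last sample row
            "Quantity": float(fields[3]),
        })

df = pd.DataFrame(records)
print(df)
```

Because each D row becomes its own record, the four columns always have the same length, which avoids the misalignment that comes from building each column as a separate list.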
