How to parse a dataframe from a .csv file that contains header and detail rows, using Python

Question

I have a file that I am trying to extract values from in order to create a data frame. I have tried a regex approach to build lists from the file, but the data format (Header/H and Detail/D rows, shown below) gives me inconsistent row counts when I put the resulting lists into a data frame. I think the issue is that some records have one detail (D) row while others have more than one. Could you suggest another approach? I was thinking of creating a dictionary object in which each H row is the key and its D rows are the values, using a for loop of some kind.

The file format is as below:

    H, INV34801, 20200201, 09:18:55, IN, 5
    D, INV34801, 0053, 1.00, IN, 20200201, 09:18:55,
    H, INV34802, 20200201, 10:12:35, IN, 5
    D, INV34802, D22345433DU, -1.00, IN, 20200201, 10:12:35,
    D, INV34802, , 1.00, IN, 20200201, 10:12:35,

This is the code I have been trying:

    import re
    import pandas as pd

    # `contents` is assumed to hold the full text of the file as one string.

    # First I extract the date on which each sale took place.
    lst1 = re.findall(r'[IN, ]\d\d\d\d\d\d\d\d', contents)
    # Each date appears twice, so I keep every second match. I can confirm that the
    # date column then has the same number of rows as the invoice number column.
    lst1 = lst1[1::2]
    # Now I extract the invoice numbers.
    lst2 = re.findall(r"INV\w*", contents)
    # Now I extract the product codes.
    lst3 = re.findall(
        r'\s\s\s\s\s\w\w\w\w\w\w\w\w\w\w\w|\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s|\n'
        r'\s\s\s\s\s\s\s\s\s\s\s\s\d\d\d\d|\s\s\s\s\s\s\s\s\s\s\d\d\d\d\d\d',
        contents)
    # Now I extract the quantity sold.
    lst4 = re.findall(r'\s\s\s\s\s\s\d\.\d\d', contents)
    # Then I build the data frame, one column per list.
    df = pd.DataFrame([lst1, lst2, lst3, lst4])
    df = df.transpose()
    df.columns = ['Date', 'Invoice_Number', 'Product_Code', 'Quantity']
    print(df)

The output structure I get is correct, but the quantities and product codes aren't lined up with the correct invoice numbers.

Dataframe below:

    Date Invoice_Number      Product_Code    Quantity
    0      20200201       INV34801                          1.00
    1      20200201       INV34802                          1.00
    2      20200201       INV34803                          1.00
    3      20200201       INV34804                          1.00
    4      20200201       INV34805                          8.00

I'd appreciate your kind advice.

Answer 1

Try this:

    regex = r"[H,D] (?P<invoice_nr>[^,]*)(, (?P<date>[^,]*)[\s\S]*?(?P<quantity>-?\d+\.00), IN)[\s\S]*?(\n|$)"
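As a rough sketch of how this pattern might be applied, the snippet below runs it with `re.finditer` and collects the named groups into a list of dictionaries. The `sample` string is an assumption: it is the question's data with one H or D row per line. The names `sample` and `rows` are made up for illustration.

```python
import re

# Assumed layout: the question's sample data, one H or D row per line.
sample = (
    "H, INV34801, 20200201, 09:18:55, IN, 5\n"
    "D, INV34801, 0053, 1.00, IN, 20200201, 09:18:55,\n"
    "H, INV34802, 20200201, 10:12:35, IN, 5\n"
    "D, INV34802, D22345433DU, -1.00, IN, 20200201, 10:12:35,\n"
    "D, INV34802, , 1.00, IN, 20200201, 10:12:35,\n"
)

# The pattern from the answer, unchanged.
regex = r"[H,D] (?P<invoice_nr>[^,]*)(, (?P<date>[^,]*)[\s\S]*?(?P<quantity>-?\d+\.00), IN)[\s\S]*?(\n|$)"

# One dictionary per match, keyed by the named groups in the pattern.
rows = [m.groupdict() for m in re.finditer(regex, sample)]

for row in rows:
    print(row)
```

On this sample the pattern yields three matches. The first two each span an H row plus the D row that follows it, so `date` comes from the H row. The third match is a D row on its own, so `date` captures the empty product-code field instead. If pandas is available, `pd.DataFrame(rows)` should turn the list into a data frame. Note that the pattern does not capture the product code.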

I have made an example for you here. I'm not sure if it helps, but it should give you some pointers on regex.
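The loop-based idea from the question can also be sketched without regex, by reading the file row by row and remembering the most recent H row. This is only an illustration under stated assumptions: each H or D row sits on its own line, the third field of an H row is the date, and the third and fourth fields of a D row are the product code and quantity. `SAMPLE` and `parse_records` are hypothetical names, not part of the original post.

```python
import csv
import io

# Assumed layout: the question's sample data, one H or D row per line.
SAMPLE = """\
H, INV34801, 20200201, 09:18:55, IN, 5
D, INV34801, 0053, 1.00, IN, 20200201, 09:18:55,
H, INV34802, 20200201, 10:12:35, IN, 5
D, INV34802, D22345433DU, -1.00, IN, 20200201, 10:12:35,
D, INV34802, , 1.00, IN, 20200201, 10:12:35,
"""


def parse_records(lines):
    """Yield one dict per detail (D) row, using the date of the preceding header (H) row."""
    header_date = None
    # skipinitialspace drops the blank that follows each comma.
    for fields in csv.reader(lines, skipinitialspace=True):
        if not fields:
            continue  # skip empty lines
        kind = fields[0].strip()
        if kind == "H":
            header_date = fields[2].strip()
        elif kind == "D":
            yield {
                "Date": header_date,
                "Invoice_Number": fields[1].strip(),
                "Product_Code": fields[2].strip(),
                "Quantity": float(fields[3]),
            }


records = list(parse_records(io.StringIO(SAMPLE)))
for record in records:
    print(record)
```

Because every D row becomes exactly one record, invoices with several detail rows no longer shift the columns out of step. With pandas installed, `pd.DataFrame(records)` should give a data frame with the `Date`, `Invoice_Number`, `Product_Code` and `Quantity` columns; for a real file, an open file object can be passed to `parse_records` in place of the `io.StringIO` wrapper.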

Solution 1
0 2020-08-07 11:48:59
