How to parse a DataFrame out of a .csv file that contains header and detail rows, using Python
I have a file that I am trying to extract values from to create a data frame. I have tried a regex approach to create lists from the file, but the data format (Header/H and Detail/D rows, shown below) gives me inconsistent row counts when I put the resulting lists into a data frame. I think the issue is that some records have one detail (D) row while others have more than one. Could you suggest another approach? I was thinking of trying to create a dictionary object where each H row would be the key and each D row would be the value, using a for loop of some kind.
The file format is as below:
H, INV34801, 20200201, 09:18:55, IN, 5
D, INV34801, 0053, 1.00, IN, 20200201, 09:18:55,
H, INV34802, 20200201, 10:12:35, IN, 5
D, INV34802, D22345433DU, -1.00, IN, 20200201, 10:12:35,
D, INV34802, , 1.00, IN, 20200201, 10:12:35,
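The dictionary idea described above can be sketched roughly as follows. This is only a minimal illustration, assuming the file really is comma-separated with one H or D record per line, and that the second field (the invoice number) identifies each header uniquely. The names `SAMPLE` and `group_details_by_header` are hypothetical, and the sample text is copied from the rows above rather than read from a file.

```python
import csv
import io

# Sample rows copied from the question; a real script would open the file instead.
SAMPLE = """\
H, INV34801, 20200201, 09:18:55, IN, 5
D, INV34801, 0053, 1.00, IN, 20200201, 09:18:55,
H, INV34802, 20200201, 10:12:35, IN, 5
D, INV34802, D22345433DU, -1.00, IN, 20200201, 10:12:35,
D, INV34802, , 1.00, IN, 20200201, 10:12:35,
"""


def group_details_by_header(lines):
    """Map each invoice number to its header row and its list of detail rows."""
    invoices = {}
    for row in csv.reader(lines, skipinitialspace=True):
        fields = [field.strip() for field in row]
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        record_type, invoice_number = fields[0], fields[1]
        # Assumption: the invoice number in field 2 ties D rows to their H row.
        entry = invoices.setdefault(invoice_number, {"header": None, "details": []})
        if record_type == "H":
            entry["header"] = fields
        elif record_type == "D":
            entry["details"].append(fields)
    return invoices


invoices = group_details_by_header(io.StringIO(SAMPLE))
for number, record in invoices.items():
    print(number, len(record["details"]))
```

Keying on the invoice number rather than on the whole H row keeps the lookup simple, and an invoice with several D rows just gets a longer `details` list.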
This is the code I have been trying:
import pandas as pd
import re
import itertools
# `contents` is assumed to hold the whole file as a single string.
# First, extract the date on which each sale took place.
lst1 = re.findall(r'[IN, ]\d\d\d\d\d\d\d\d', contents)
# Keep every other date to drop the duplicates; I can confirm this works because
# the date column then has the same number of rows as the Invoice Number column.
lst1 = lst1[1::2]
# Now extract the invoice numbers.
lst2 = re.findall(r"INV\w*", contents)
# Now extract the product codes.
lst3 = re.findall(
    r'\s\s\s\s\s\w\w\w\w\w\w\w\w\w\w\w|\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s|\n'
    r'\s\s\s\s\s\s\s\s\s\s\s\s\d\d\d\d|\s\s\s\s\s\s\s\s\s\s\d\d\d\d\d\d',
    contents)
# Now extract the quantity sold.
lst4 = re.findall(r'\s\s\s\s\s\s\d\.\d\d', contents)
# Then build a data frame with one column per list.
df = pd.DataFrame([lst1, lst2, lst3, lst4])
df = df.transpose()
df.columns = ['Date', 'Invoice_Number', 'Product_Code', 'Quantity']
print(df)
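One likely contributor to the misalignment is that the lists are built independently, so they can end up with different lengths, and pandas then pairs values purely by position. The small demonstration below uses made-up lists (hypothetical stand-ins for `lst1` to `lst4`, not values produced by the regexes above) for two invoices with three detail rows between them.

```python
import pandas as pd

# Hypothetical stand-ins: two header-level values but three detail-level values.
dates = ["20200201", "20200201"]
invoice_numbers = ["INV34801", "INV34802"]
quantities = ["1.00", "-1.00", "1.00"]

# Same construction as in the question: one list per row, then transpose.
df = pd.DataFrame([dates, invoice_numbers, quantities]).transpose()
df.columns = ["Date", "Invoice_Number", "Quantity"]
print(df)
```

The shorter lists are padded with missing values, so the third quantity ends up on a row with no date or invoice number, even though it belongs to INV34802.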
The output structure I get is correct, but the quantities and product codes aren't lined up with the correct invoice numbers.
The DataFrame is below:
       Date Invoice_Number Product_Code Quantity
0  20200201       INV34801                  1.00
1  20200201       INV34802                  1.00
2  20200201       INV34803                  1.00
3  20200201       INV34804                  1.00
4  20200201       INV34805                  8.00
I'd appreciate your kind advice.
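Since every D row in the sample already repeats its invoice number and date, one possible alternative to the regex lists is to read the file line by line and build one data frame row per D row. The sketch below assumes the field positions seen in the sample (invoice number, product code, quantity, then date at index 5); `SAMPLE` and `details_to_frame` are hypothetical names, and the sample text stands in for the real file.

```python
import csv
import io

import pandas as pd

# Sample rows copied from the question; a real script would open the file instead.
SAMPLE = """\
H, INV34801, 20200201, 09:18:55, IN, 5
D, INV34801, 0053, 1.00, IN, 20200201, 09:18:55,
H, INV34802, 20200201, 10:12:35, IN, 5
D, INV34802, D22345433DU, -1.00, IN, 20200201, 10:12:35,
D, INV34802, , 1.00, IN, 20200201, 10:12:35,
"""

COLUMNS = ["Date", "Invoice_Number", "Product_Code", "Quantity"]


def details_to_frame(lines):
    """Build one data frame row per D record, ignoring H records."""
    rows = []
    for row in csv.reader(lines, skipinitialspace=True):
        fields = [field.strip() for field in row]
        if len(fields) < 6 or fields[0] != "D":
            continue  # keep only complete detail rows
        rows.append({
            "Date": fields[5],            # assumed position of the sale date
            "Invoice_Number": fields[1],
            "Product_Code": fields[2],    # may be empty, as in the last sample row
            "Quantity": float(fields[3]),
        })
    return pd.DataFrame(rows, columns=COLUMNS)


df = details_to_frame(io.StringIO(SAMPLE))
print(df)
```

Because each row is assembled from a single line of the file, the product code and quantity cannot drift away from their invoice number, however many D rows an invoice has.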