Convert unstructured CSV with filename rows
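I'm working with a system that outputs non-standard CSV files. Line 1 always contains the file name, followed by an attribute of the table on line 2 (which sometimes includes a comma), the table header on line 3, and then a varying number of data rows. After the data rows there are always two blank lines, and then the pattern repeats (the header is always the same throughout the file). Here is a small example: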
Example Report
Geography:Boston, MA
Time,Product,Unit Sales
Week Ending 03-06-22,ITEM DESCRIPTION A,275
Week Ending 03-13-22,ITEM DESCRIPTION A,297
Week Ending 03-20-22,ITEM DESCRIPTION A,261


Example Report
Geography:New York, NY
Time,Product,Unit Sales
Week Ending 03-06-22,ITEM DESCRIPTION A,393
Week Ending 03-13-22,ITEM DESCRIPTION A,477
Week Ending 03-20-22,ITEM DESCRIPTION A,412


Example Report
Geography:Philadelphia, PA
Time,Product,Unit Sales
Week Ending 03-06-22,ITEM DESCRIPTION A,195
Week Ending 03-13-22,ITEM DESCRIPTION A,233
Week Ending 03-20-22,ITEM DESCRIPTION A,198
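Ultimately, I want to discard the file name and the extra header rows and output a standard CSV with the attribute as the first column. The example above should come out like this: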
Geography,Time,Product,Unit Sales
"Boston, MA",Week Ending 03-06-22,ITEM DESCRIPTION A,275
"Boston, MA",Week Ending 03-13-22,ITEM DESCRIPTION A,297
"Boston, MA",Week Ending 03-20-22,ITEM DESCRIPTION A,261
"New York, NY",Week Ending 03-06-22,ITEM DESCRIPTION A,393
"New York, NY",Week Ending 03-13-22,ITEM DESCRIPTION A,477
"New York, NY",Week Ending 03-20-22,ITEM DESCRIPTION A,412
"Philadelphia, PA",Week Ending 03-06-22,ITEM DESCRIPTION A,195
"Philadelphia, PA",Week Ending 03-13-22,ITEM DESCRIPTION A,233
"Philadelphia, PA",Week Ending 03-20-22,ITEM DESCRIPTION A,198
I'm used to manipulating standard CSV files in Python, but this one, with unstructured data mixed in, has me stumped.
A working solution for iterating over a CSV like this:
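def read_strange_csv(filename):
    header_used = False
    with open(filename) as f:
        while True:
            # Line 1 of each block is the file name; it is discarded.
            # A default of None avoids raising StopIteration inside the
            # generator when the file ends with blank lines.
            line_filename = next(f, None)
            if line_filename is None:
                return
            line_attribute = next(f).rstrip()
            geography = line_attribute.split(':', 1)[1]
            line_header = next(f).rstrip()
            if not header_used:
                # Emit the header only once, with the attribute name prepended.
                yield f'Geography,{line_header}'
                header_used = True
            for line in f:
                line = line.rstrip()
                if not line:
                    break  # first blank line ends the data rows
                yield f'"{geography}",{line}'
            try:
                next(f)  # second blank line
            except StopIteration:
                return

for row in read_strange_csv('example.csv'):
    print(row)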
It prints the following lines, which you can save directly to a file if needed:
Geography,Time,Product,Unit Sales
"Boston, MA",Week Ending 03-06-22,ITEM DESCRIPTION A,275
"Boston, MA",Week Ending 03-13-22,ITEM DESCRIPTION A,297
"Boston, MA",Week Ending 03-20-22,ITEM DESCRIPTION A,261
"New York, NY",Week Ending 03-06-22,ITEM DESCRIPTION A,393
"New York, NY",Week Ending 03-13-22,ITEM DESCRIPTION A,477
"New York, NY",Week Ending 03-20-22,ITEM DESCRIPTION A,412
"Philadelphia, PA",Week Ending 03-06-22,ITEM DESCRIPTION A,195
"Philadelphia, PA",Week Ending 03-13-22,ITEM DESCRIPTION A,233
"Philadelphia, PA",Week Ending 03-20-22,ITEM DESCRIPTION A,198
pandas.read_csv has the parameter skip_blank_lines=True by default. I would handle the rest with pandas.
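import pandas as pd

# skipfooter is not supported by the default C engine, so use the Python engine.
# Adjust the skiprows/skipfooter values to your file; they might have errors.
df1 = pd.read_csv('filename', skiprows=2, skipfooter=16, engine='python')
df1['Geography'] = 'Boston, MA'
df2 = pd.read_csv('filename', skiprows=9, skipfooter=8, engine='python')
df2['Geography'] = 'New York, NY'
df3 = pd.read_csv('filename', skiprows=2, skipfooter=1, engine='python')
df3['Geography'] = 'Philadelphia, PA'
# pd.concat expects a list of DataFrames.
df = pd.concat([df1, df2, df3])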
I know this is hard to do for many tables, but it's the best solution I could think of. Good luck solving your problem!
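One way to avoid counting the offsets by hand might be to scan the file once and compute them. This is only a sketch using the standard library; find_tables is a hypothetical helper, and it assumes every block consists of a file name line, an attribute line, a header line, and data rows, with blank lines between blocks.

```python
def find_tables(lines):
    """Return (attribute_value, header_index, n_data_rows) for each block.

    `header_index` is the 0-based line number of the block's column header,
    i.e. the number of lines that precede it in the file.
    """
    tables = []
    i, n = 0, len(lines)
    while i < n:
        if not lines[i].strip():
            i += 1  # skip blank separator lines
            continue
        start = i  # lines[start] is the file name line
        while i < n and lines[i].strip():
            i += 1  # advance to the end of this block
        if i - start >= 3:  # file name + attribute + header at minimum
            value = lines[start + 1].partition(":")[2].strip()
            tables.append((value, start + 2, i - start - 3))
    return tables


sample = (
    "Example Report\n"
    "Geography:Boston, MA\n"
    "Time,Product,Unit Sales\n"
    "Week Ending 03-06-22,ITEM DESCRIPTION A,275\n"
    "Week Ending 03-13-22,ITEM DESCRIPTION A,297\n"
    "\n"
    "\n"
    "Example Report\n"
    "Geography:New York, NY\n"
    "Time,Product,Unit Sales\n"
    "Week Ending 03-06-22,ITEM DESCRIPTION A,393\n"
)
tables = find_tables(sample.splitlines())
print(tables)
# [('Boston, MA', 2, 2), ('New York, NY', 9, 1)]
```

Each tuple could then feed a call such as pd.read_csv(path, skiprows=header_index, nrows=n_data_rows), followed by assigning the Geography column and concatenating the frames. I have not checked how skiprows treats blank lines in every pandas version, so compare the first frame against the file before trusting the rest.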