Convert unstructured CSV with filename rows

I'm working with a system that outputs non-standard CSV files. Row 1 always contains the filename, row 2 contains an attribute for the table (which sometimes includes a comma), row 3 contains the table headers, and a varying number of data rows follow. After the data rows there are always two blank lines, and then the pattern repeats (the headers are always the same within a file). Here is a small example:

Example Report
Geography:Boston, MA
Time,Product,Unit Sales
Week Ending 03-06-22,ITEM DESCRIPTION A,275
Week Ending 03-13-22,ITEM DESCRIPTION A,297
Week Ending 03-20-22,ITEM DESCRIPTION A,261


Example Report
Geography:New York, NY
Time,Product,Unit Sales
Week Ending 03-06-22,ITEM DESCRIPTION A,393
Week Ending 03-13-22,ITEM DESCRIPTION A,477
Week Ending 03-20-22,ITEM DESCRIPTION A,412


Example Report
Geography:Philadelphia, PA
Time,Product,Unit Sales
Week Ending 03-06-22,ITEM DESCRIPTION A,195
Week Ending 03-13-22,ITEM DESCRIPTION A,233
Week Ending 03-20-22,ITEM DESCRIPTION A,198

Ultimately, I want to discard the filename rows and the repeated header rows, and output a standard CSV with the attribute as the first column. This is what the example above should look like:

Geography,Time,Product,Unit Sales
"Boston, MA",Week Ending 03-06-22,ITEM DESCRIPTION A,275
"Boston, MA",Week Ending 03-13-22,ITEM DESCRIPTION A,297
"Boston, MA",Week Ending 03-20-22,ITEM DESCRIPTION A,261
"New York, NY",Week Ending 03-06-22,ITEM DESCRIPTION A,393
"New York, NY",Week Ending 03-13-22,ITEM DESCRIPTION A,477
"New York, NY",Week Ending 03-20-22,ITEM DESCRIPTION A,412
"Philadelphia, PA",Week Ending 03-06-22,ITEM DESCRIPTION A,195
"Philadelphia, PA",Week Ending 03-13-22,ITEM DESCRIPTION A,233
"Philadelphia, PA",Week Ending 03-20-22,ITEM DESCRIPTION A,198

I'm used to manipulating standard CSV files in Python, but the unstructured data mixed into this one is stumping me.

A working solution that iterates over such a CSV:

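def read_strange_csv(filename):
    header_used = False
    with open(filename) as f:
        while True:
            try:
                next(f)  # filename row, discarded
                line_attribute = next(f).rstrip()
                line_header = next(f).rstrip()
            except StopIteration:
                # No further block. Catching this here matters: a StopIteration
                # escaping a generator becomes a RuntimeError (Python 3.7+).
                return
            # Keep everything after the first colon, e.g. "Boston, MA".
            geography = line_attribute.split(':', 1)[1]
            if not header_used:
                # Emit the header only once, for the first block.
                yield f'Geography,{line_header}'
                header_used = True

            for line in f:
                line = line.rstrip()
                if not line:
                    break  # first blank line ends the data rows
                yield f'"{geography}",{line}'

            try:
                next(f)  # second blank line
            except StopIteration:
                return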


for row in read_strange_csv('example.csv'):
    print(row)

It prints the lines below, which you can save directly to a file if needed:

Geography,Time,Product,Unit Sales
"Boston, MA",Week Ending 03-06-22,ITEM DESCRIPTION A,275
"Boston, MA",Week Ending 03-13-22,ITEM DESCRIPTION A,297
"Boston, MA",Week Ending 03-20-22,ITEM DESCRIPTION A,261
"New York, NY",Week Ending 03-06-22,ITEM DESCRIPTION A,393
"New York, NY",Week Ending 03-13-22,ITEM DESCRIPTION A,477
"New York, NY",Week Ending 03-20-22,ITEM DESCRIPTION A,412
"Philadelphia, PA",Week Ending 03-06-22,ITEM DESCRIPTION A,195
"Philadelphia, PA",Week Ending 03-13-22,ITEM DESCRIPTION A,233
"Philadelphia, PA",Week Ending 03-20-22,ITEM DESCRIPTION A,198
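
If the output should go straight to a file, one option is to let the standard csv module handle the quoting instead of building the quoted strings by hand. The following is only a sketch: convert_report is a hypothetical helper name, and it assumes the blank separator lines are truly empty (no stray spaces) and that the file is small enough to read into memory.

```python
import csv
import io


def convert_report(src, dst):
    """Copy the repeating report blocks in src to dst as one standard CSV.

    src and dst are open text-mode file objects (hypothetical helper).
    """
    writer = csv.writer(dst, lineterminator="\n")
    header_written = False
    # Normalise line endings, drop trailing blank lines, then split on the
    # two blank lines that separate blocks (three consecutive newlines).
    text = src.read().replace("\r\n", "\n").strip("\n")
    for block in text.split("\n\n\n"):
        lines = block.split("\n")
        if len(lines) < 3:
            continue  # not a complete block (filename, attribute, header)
        # lines[0] is the filename row, which is discarded.
        name, _, value = lines[1].partition(":")
        rows = csv.reader(lines[2:])
        header = next(rows)
        if not header_written:
            writer.writerow([name, *header])
            header_written = True
        for row in rows:
            if row:
                # csv.writer quotes values such as "Boston, MA" as needed.
                writer.writerow([value.strip(), *row])


# Possible usage with real files (file names are placeholders):
# with open("example.csv", newline="") as src, \
#         open("clean.csv", "w", newline="") as dst:
#     convert_report(src, dst)
```

Because the attribute name is taken from the text before the colon, the first column is named Geography for the example above without hard-coding it.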

pandas.read_csv has a skip_blank_lines parameter, which is True by default. For everything else, I would process the data in pandas.

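import pandas as pd

# Offsets assume the 22-line example above; adjust them for other files.
# skipfooter is not supported by the default C engine, hence engine='python'.
df1 = pd.read_csv('filename', skiprows=2, skipfooter=16, engine='python')
df1['Geography'] = 'Boston, MA'

df2 = pd.read_csv('filename', skiprows=10, skipfooter=8, engine='python')
df2['Geography'] = 'New York, NY'

df3 = pd.read_csv('filename', skiprows=18, engine='python')
df3['Geography'] = 'Philadelphia, PA'

# pd.concat takes a list of DataFrames, not separate positional arguments.
df = pd.concat([df1, df2, df3], ignore_index=True)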

I know this is tedious to do for many tables, but it's the best solution I can think of. Good luck solving your problem!
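
The hard-coded offsets can be avoided by splitting the file into blocks first and handing only the header and data rows of each block to pandas. This is a rough sketch rather than a tested recipe for every file: load_report is a hypothetical helper, and it assumes each block has at least a filename row, an attribute row, and a header row, and that blocks are separated by exactly two empty lines.

```python
import io

import pandas as pd


def load_report(text):
    """Parse every block of the report text into one DataFrame (hypothetical helper)."""
    frames = []
    # Two blank lines between blocks appear as three consecutive newlines.
    for block in text.replace("\r\n", "\n").strip("\n").split("\n\n\n"):
        # Row 1 is the filename (ignored), row 2 the attribute, the rest the table.
        _filename, attribute, table = block.split("\n", 2)
        name, _, value = attribute.partition(":")
        df = pd.read_csv(io.StringIO(table))
        # Put the attribute in front, e.g. Geography = "Boston, MA".
        df.insert(0, name, value.strip())
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


# Possible usage (file names are placeholders):
# with open("example.csv") as f:
#     load_report(f.read()).to_csv("clean.csv", index=False)
```

The attribute row never reaches read_csv, so a comma inside it (as in "Boston, MA") cannot confuse the parser.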
