I'm trying to load a file into Python using pd.read_csv(), but I cannot understand the file's format

This is my very first question on Stack Overflow, so I must beg your patience.

I believe there is something wrong with the format of a CSV file I need to load into Python. I'm using a Jupyter Notebook. The link to the file is here. It is from the World Inequality Database data portal.

I'm pretty sure the delimiter is a semicolon ( sep=";" ) because the bottom half of the data renders neatly when I specify this argument. However, the first half of the text in the file seems to make no sense, and I have no idea how to tell the pd.read_csv() function how to read it. I suspect the first half of the data simply has terrible formatting. I've also tried header=None and sep="|" to no avail.

Any ideas or suggestions would be very helpful. Thank you very much!

This is common with spreadsheets. There may be some commentary, and tables may be inserted all over the place. It looks great to the content creator, but the CSV is a mess. You need to preprocess the CSV to create clean content for your analysis. In this case, it's easy. The content starts at a canned header, and you can split the file there. If that header changes, you'll get an error, and then it's just one more sleepless night figuring out what they've done.

import itertools
import os

# Header line that marks the start of the tabular data in the WID export.
canned_header_line = "Variable Code;country;year;perc;agdpro999i;"\
    "npopul999i;mgdpro999i;inyixx999i;xlceux999i;xlcusx999i;xlcyux999i"

def scrub_WID_file(in_csv_filename, out_csv_filename):
    with open(in_csv_filename) as in_file,\
            open(out_csv_filename, 'w') as out_file:
        # Skip everything before the header, then copy the rest unchanged.
        out_file.writelines(itertools.dropwhile(
            lambda line: line.strip() != canned_header_line,
            in_file))
    # An empty output file means the header was never found.
    if not os.stat(out_csv_filename).st_size:
        raise ValueError("No recognized header in " + in_csv_filename)
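
If you would rather not write a second file, a variation on the same idea is to find the row where the header sits and let pandas skip everything before it. The following is only a rough sketch: the helper name find_header_row, the "Variable Code;" prefix match, and the tiny sample text are all assumptions made up for illustration, not part of the real WID file.

```python
import io

def find_header_row(lines, header_prefix="Variable Code;"):
    """Return the 0-based index of the first line starting with header_prefix.

    `lines` is any iterable of text lines, such as an open file.
    The default prefix is an assumption based on the header shown above.
    """
    for index, line in enumerate(lines):
        if line.strip().startswith(header_prefix):
            return index
    raise ValueError("No line starts with " + repr(header_prefix))

# Made-up sample that mimics commentary followed by the real table.
sample = (
    "Some commentary about the data\n"
    "More notes;with;stray;separators\n"
    "\n"
    "Variable Code;country;year;perc\n"
    "x;FR;2000;p0p100\n"
)

header_row = find_header_row(io.StringIO(sample))
print(header_row)
```

With a real file you would pass an open file object instead of the sample, and then something like pd.read_csv(filename, sep=";", skiprows=header_row) should start reading at the header. Matching on a prefix rather than the full line is a little more forgiving if columns are added later, but it can also hide a format change, so pick whichever failure mode you prefer.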
