
How to read csv files with different numbers of columns using Python

import glob
import pandas as pd

files = glob.glob("Data/*.csv")
df = pd.concat((pd.read_csv(f) for f in files))
print(df)

I get an error that says: "ParserError: Error tokenizing data. C error: Expected 39 fields in line 273, saw 40". Then, as per this question: import csv with different number of columns per row using Pandas , I tried passing in the names of the columns using StringIO and BytesIO, and got errors like "TypeError: initial_value must be str or None, not list" or "TypeError: a bytes-like object is required, not 'list'". I am looking at over 20 csv files.
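For reference, a quick workaround for the ParserError (not mentioned in the question) is pandas' `on_bad_lines` option, which drops rows whose field count does not match the header instead of raising. A minimal sketch with made-up sample data:

```python
import pandas as pd
from io import StringIO

# Hypothetical sample: the second data row has an extra field
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# on_bad_lines="skip" silently drops the offending row
# (pandas >= 1.3; older versions used error_bad_lines=False)
df = pd.read_csv(StringIO(raw), on_bad_lines="skip")
print(df.shape)  # the 4-field row is gone
```

This loses data, so it only fits when the malformed rows are genuinely disposable.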

It looks like you have not tried all the solutions, as the link you shared actually contains an answer: https://stackoverflow.com/a/57824142/8805842 . If you inspect the last row/last column cell in your .csv file, you will see why you get the error.

Solution (copied from the link in your question) with two extra rows added to remove unwanted/empty columns:

    import pandas as pd

    ### Loop the data lines
    with open("storm_data_search_results.csv", 'r') as temp_f:
        # get the number of columns in each line
        col_count = [ len(l.split(",")) for l in temp_f.readlines() ]

    ### Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
    column_names = [i for i in range(0, max(col_count))]

    ### Read csv
    df = pd.read_csv("storm_data_search_results.csv", header=None, delimiter=",", names=column_names)

    # my addition
    df.columns = df.iloc[0]   # create headers from the first row
    df = df.iloc[1:, 0:39]    # drop the header row; keep only the named columns
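Applied across all the files from the question, the same max-column-count trick can be wrapped in a small helper (a sketch; the helper name and glob pattern are illustrative):

```python
import glob
import pandas as pd

def read_ragged_csv(path):
    """Read a csv whose rows have varying field counts."""
    # First pass: find the widest row
    with open(path) as f:
        col_count = max(len(line.split(",")) for line in f)
    # Numeric column names 0..max-1 make pandas pad short rows with NaN
    return pd.read_csv(path, header=None, names=range(col_count))

# df = pd.concat(read_ragged_csv(f) for f in glob.glob("Data/*.csv"))
```

Note that `line.split(",")` also counts commas inside quoted fields, so the count can overshoot; the extra columns then simply come back as all-NaN.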

Update: OMG, be careful... the data they provide in the .csv is actually not structured properly... just scroll all the way down. If you can use any other source, use it, unless you do not need the "comments" and can drop them.

Assuming the problem comes from the multiline text fields, which can easily get messed up, you can remove them using a RegEx: re.subn(r'(".*?")',"_______________",xx,xx.count('"'), re.DOTALL)

Also, assuming the headers are identical in all files, you can process everything as text and then parse once.


import re
import pandas as pd
from io import StringIO

# Read headers from the first file
headers = open(files[0]).read().split('\n', 1)[0].split(',')

# Read all files and remove their header lines
xx = [open(ff).read().split('\n', 1)[1] for ff in files]

# Remove the comments fields
dd = [re.sub(r'(".*?")', "__", x, x.count('"'), re.DOTALL) for x in xx]

# Load as CSV
df = pd.read_csv(StringIO(''.join(dd)), names=headers)
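Put together as a self-contained sketch (the two inlined file contents stand in for real files on disk):

```python
import re
import pandas as pd
from io import StringIO

# Hypothetical contents of two csv files sharing one header
file_texts = [
    'a,b,c\n1,"x\ny",3\n',
    'a,b,c\n4,"z",6\n',
]

# Header from the first file; strip the header line from every file
headers = file_texts[0].split('\n', 1)[0].split(',')
bodies = [t.split('\n', 1)[1] for t in file_texts]

# Collapse quoted (possibly multiline) fields, then parse everything once
cleaned = [re.sub(r'".*?"', "__", b, flags=re.DOTALL) for b in bodies]
df = pd.read_csv(StringIO(''.join(cleaned)), names=headers)
print(df)
```

The quoted multiline field in the first file no longer breaks the row count, so the concatenated text parses as one frame.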

