How to read csv files with different numbers of columns using Python

Question

    import glob
    import pandas as pd

    files = glob.glob("Data/*.csv")
    df = pd.concat(pd.read_csv(f) for f in files)
    print(df)

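I get an error message: "ParserError: Error tokenizing data. C error: Expected 39 fields in line 273, saw 40". Then, following this question: import csv with different number of columns using Pandas, I tried passing the column names in through StringIO and BytesIO, and got errors like "TypeError: initial_value must be str or None, not list" or "TypeError: a bytes-like object is required, not 'list'". I am working with more than 20 csv files.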

Answer 1

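It looks like you have not tried all of the solutions, because the link you shared actually contains the answer: https://stackoverflow.com/a/57824142/8805842 If you check the cells in the last row / last column of your .csv file, you will see why you get the error.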

Solution (a straight copy/paste from the link in your question), plus 2 extra lines to drop the unwanted/empty columns:

    import pandas as pd

    ### Loop over the data lines
    with open("storm_data_search_results.csv", 'r') as temp_f:
        # Number of comma-separated fields in each line
        col_count = [len(l.split(",")) for l in temp_f.readlines()]
    
    ### Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
    column_names = list(range(max(col_count)))
    
    ### Read csv
    df = pd.read_csv("storm_data_search_results.csv", header=None, delimiter=",", names=column_names)
    
    # my addition
    df.columns = df.iloc[0]  # create headers from the first row
    df = df.iloc[1:, 0:39]   # drop that header row from the data; keep only the 39 named columns
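One weakness of counting fields with `l.split(",")` is that a comma inside a quoted field is counted as a separator. A minimal sketch of the same idea using the standard `csv` module, which respects quoting, is below. The helper name `read_ragged_csv` and the sample data are made up for illustration and are not part of the linked answer.

```python
import csv
import os
import tempfile

import pandas as pd


def read_ragged_csv(path):
    """Read a CSV whose rows have differing field counts (hypothetical helper)."""
    # csv.reader honours quoting, so a comma inside "..." is not counted
    # as a separator, unlike a plain str.split(",").
    with open(path, newline="") as f:
        max_cols = max((len(row) for row in csv.reader(f)), default=0)
    # Supplying max_cols names lets pandas pad the shorter rows with NaN
    # instead of raising ParserError on the longer ones.
    return pd.read_csv(path, header=None, names=list(range(max_cols)))


# Made-up sample: a 3-field header, a row with a quoted comma, and a 4-field row.
sample = 'a,b,c\n1,"x, y",3\n4,5,6,7\n'
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write(sample)

df = read_ragged_csv(tmp.name)
os.remove(tmp.name)
print(df)
```

For the 20+ files in the question, something like `pd.concat(read_ragged_csv(f) for f in files)` should work in the same way, provided the files share a column layout.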

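Update: OMG, be careful... the data they provide in the .csv is actually not structured correctly... just scroll down through it. If you can use any other source, use it; otherwise, if you do not need the "comments", you can remove them.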

Answer 2

Assuming the problem comes from the multi-line text fields, which are easy to get wrong, you can remove them with a regular expression: re.subn(r'(".*?")', "_______________", xx, xx.count('"'), re.DOTALL)
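To see what that substitution does, here is a small sketch on a made-up string (the sample text is an assumption, not data from the question). Note that the pattern blanks every double-quoted field, not only the comments.

```python
import re

# Made-up sample: the first quoted field spans two lines and contains a comma.
text = 'id,comment,value\n1,"first line,\nsecond line",10\n2,"ok",20\n'

# The non-greedy ".*?" stops at the nearest closing quote; re.DOTALL lets "."
# match the newline inside the quoted field.
cleaned, n_replaced = re.subn(r'".*?"', "__", text, flags=re.DOTALL)

print(cleaned)
print(n_replaced)
```

After the substitution every data line has exactly three comma-separated fields, so the parser no longer sees a mismatched field count.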

Also, assuming the headers are the same in all files, you can do all of the processing on the text and then parse it once.


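    import re
    from io import StringIO

    import pandas as pd

    # `files` is the list built with glob.glob in the question

    # Read headers from the first file
    headers = open(files[0]).read().split('\n', 1)[0].split(',')

    # Read all files and remove headers
    xx = [open(ff).read().split('\n', 1)[1] for ff in files]

    # Remove the comments fields (every double-quoted field, including multi-line ones)
    dd = [re.sub(r'(".*?")', "__", x, count=x.count('"'), flags=re.DOTALL) for x in xx]

    # Load as CSV
    df = pd.read_csv(StringIO(''.join(dd)), names=headers)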
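The text-level approach can be sketched end to end on throwaway files. The version below is only an illustration under stated assumptions: `load_concatenated` is a hypothetical helper name, the two sample files are made up, and all files are assumed to share the header of the first one. It also closes each file and adds a newline to a file that lacks a trailing one, so that the last row of one file is not glued to the first row of the next.

```python
import os
import re
import tempfile
from io import StringIO

import pandas as pd


def load_concatenated(paths):
    """Hypothetical helper: merge files as text, blank quoted fields, parse once."""
    headers = None
    bodies = []
    for path in paths:
        with open(path) as f:
            header_line, _, body = f.read().partition("\n")
        if headers is None:
            # Assumes every file has the same header as the first one
            headers = header_line.strip().split(",")
        # Guard against a file that does not end with a newline
        if body and not body.endswith("\n"):
            body += "\n"
        # Replace every double-quoted field, including multi-line ones
        bodies.append(re.sub(r'".*?"', "__", body, flags=re.DOTALL))
    return pd.read_csv(StringIO("".join(bodies)), names=headers)


# Two made-up files; the first has a multi-line comment and no trailing newline.
contents = [
    'id,comment,value\n1,"a,\nb",10',
    'id,comment,value\n2,"c",20\n',
]
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = []
    for i, content in enumerate(contents):
        path = os.path.join(tmp_dir, f"part{i}.csv")
        with open(path, "w") as f:
            f.write(content)
        paths.append(path)
    df = load_concatenated(paths)

print(df)
```

Parsing once over the joined text avoids building one DataFrame per file, at the cost of losing the contents of every quoted field.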
