How to read specific columns from multiple CSV files, and skip columns that do not exist in some of the files, using Python Pandas

I have data about online transactions stored in CSV files, one file per day. These files contain over 100 columns, but I only want to extract a few of them (e.g. user_id, event_type, event_time, event_store, sale_amount). The columns included in the files have changed over time, so more recent files contain additional columns that I would also like to extract (e.g. discount_amount). I want to extract only the columns I need, to avoid loading a lot of unnecessary data.

So far, I have tried to use the usecols argument, as in pandas.read_csv("file_name.csv", usecols=col_list), to load only the columns I want. However, not all CSV files contain all of the desired columns, so when one of those files passes through the loop, it fails with an error message saying that the specific column was not found. Is there any way to make Python skip a column that does not exist in a CSV file, rather than raise an error and terminate?

Here is what I have so far:

from io import StringIO

import pandas as pd

# client, bucket_name and files are defined elsewhere
data = []

col_list = ["user_id", "event_type", "event_time", "event_store", "sale_amount", "discount_amount"]

for obj in files:
    csv_obj = client.get_object(Bucket=bucket_name, Key=obj)
    body = csv_obj['Body']
    csv_string = body.read().decode('utf-8')
    temp = pd.read_csv(StringIO(csv_string), usecols=col_list)
    data.append(temp)

# combining all dataframes into one
event_data = pd.concat(data, ignore_index=True)

Thanks for any help given!

You could read only the column names from each CSV file and check them against your desired columns, as follows:

import csv
import pandas as pd

desired_col = ["user_id", "event_type"]  # only two columns selected for illustration

for file_name in csv_files:
    # read only the header row to get the CSV column names
    with open(file_name, newline="") as f:
        csv_cols = next(csv.reader(f))

    # keep only the desired columns that exist in this file
    cols = [col for col in desired_col if col in csv_cols]

    df = pd.read_csv(file_name, usecols=cols)

With this approach, each time you read a new CSV file you first read its column names and then check desired_col against csv_cols, so that only the columns that actually exist in that file are passed to usecols.
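Since the question reads each file into a string rather than from a path, the same idea may be easier to apply with a different technique: pandas.read_csv also accepts a callable for usecols, which is evaluated against each column name found in the header. The sketch below is only an illustration under assumptions: the helper name read_wanted_columns and the two sample CSV strings are hypothetical and stand in for the downloaded file contents.

```python
from io import StringIO

import pandas as pd

col_list = ["user_id", "event_type", "event_time",
            "event_store", "sale_amount", "discount_amount"]


def read_wanted_columns(csv_string, wanted=col_list):
    """Load only the wanted columns that are present in this CSV text.

    Hypothetical helper, not part of the original question or answer.
    """
    wanted_set = set(wanted)
    # A callable usecols is evaluated against each column name in the header.
    # Columns for which it returns True are kept, so a wanted column that is
    # absent from the file is simply never requested.
    return pd.read_csv(StringIO(csv_string),
                       usecols=lambda name: name in wanted_set)


# Hypothetical sample data standing in for an older and a newer daily file
old_file = "user_id,event_type,sale_amount,other\n1,buy,9.5,x\n"
new_file = ("user_id,event_type,sale_amount,discount_amount,other\n"
            "2,buy,20.0,1.5,y\n")

frames = [read_wanted_columns(s) for s in (old_file, new_file)]

# Columns missing from a given file are filled with NaN for that file's rows
event_data = pd.concat(frames, ignore_index=True)
```

With this variant there is no separate pass over the header, and rows from files that lack a column such as discount_amount end up with missing values in that column after pd.concat.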
