Retrieving information for multiple queries across multiple .csv files

Question

I am currently trying to work out a way to retrieve information that is stored across several datasets, each kept in its own .csv file.

Context

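For the purposes of this question, suppose I have 4 datasets: experiment_1.csv, experiment_2.csv, experiment_3.csv, and experiment_4.csv. Each dataset has 20,000+ rows with 80+ columns per row. Each row represents one animal, identified by an ID number, and each column holds some piece of experimental data about that animal. Assume each animal ID number is unique within a dataset, but not across datasets. For example, ID# ABC123 can be found in experiment_1.csv and experiment_2.csv, but not in experiment_3.csv or experiment_4.csv.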

Problem

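Suppose a user wants to retrieve information for about 100 animals by looking up each animal's ID number in every dataset. How would I go about doing that? I am new to programming and want to improve. This is what I have so far.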

import csv


class Animal:
    def __init__(self, id_number, *other_parameters):
        self.animal_id = id_number
        # Maps each dataset to the row found for this animal in it
        self.animal_data = {}

    def store_info(self, csv_row, dataset):
        self.animal_data[dataset] = csv_row

# Main function
# ...
# Assume animal_queries = list of Animal objects
# Assume all_datasets = list of paths to the .csv files

# Iterate through each dataset csv file
for dataset in all_datasets:

    # Make a copy of the list of queries
    animal_queries_copy = animal_queries[:]

    with open(dataset, 'r', newline='') as dataset_file:
        reader = csv.DictReader(dataset_file, delimiter=',')

        # Iterate through each row in the csv file
        for row in reader:

            # Check if the list is not empty
            if animal_queries_copy:

                # Get the current row's animal id number
                row_animal_id = row['ANIMAL ID']

                # Check if the animal id number matches with a query for
                # every animal in the list (iterate over a copy so that
                # removing from the list is safe)
                for animal in animal_queries_copy[:]:

                    if animal.animal_id == row_animal_id:

                        # If a match is found, store the info, remove the
                        # query from the list, and exit iterating through
                        # each query
                        animal.store_info(row, dataset)
                        animal_queries_copy.remove(animal)
                        break

            # If the list is empty, all queries were found for the current
            # dataset, so exit iterating through rows in reader
            else:
                break

Discussion

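Is there a more obvious way to do this? Assume that I want to keep working with the .csv files for now, and will consider converting them to an easier-to-use format such as SQL tables later on (I am an absolute beginner with databases and SQL, so I need to set aside time to learn this).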

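One thing that bothers me is that I have to create multiple copies of animal_queries: one for each dataset, and one for each row in a dataset (in the for loop). Since one row contains only one ID, I can exit the inner loop early once I find a match for an ID from animal_queries. In addition, since that ID has already been found, I no longer need to search for it in the rest of the current dataset, so I remove it from the list. I also need to keep the original list of queries intact to search the remaining datasets, which is why I copy it for each dataset. And because I cannot remove an element from a list while iterating over it in a for loop, I need to create yet another copy. This does not seem optimal to me, and I am wondering whether I am heading in the wrong direction. Any help would be appreciated, thanks!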
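For reference, the copying described above can be avoided by tracking the IDs still missing in a set, which supports constant-time membership tests and removal. The following is only a minimal sketch using the standard csv module: the function name collect_rows and its id_column parameter are hypothetical, and it assumes every file has an 'ANIMAL ID' column as in the code above.

```python
import csv
from collections import defaultdict


def collect_rows(dataset_paths, query_ids, id_column="ANIMAL ID"):
    """Return {animal_id: {dataset_path: row}} for the queried IDs.

    Hypothetical helper: name and parameters are illustrative only.
    """
    wanted = set(query_ids)
    results = defaultdict(dict)
    for path in dataset_paths:
        # IDs not yet found in this dataset; `wanted` itself is never mutated.
        pending = set(wanted)
        with open(path, "r", newline="") as dataset_file:
            for row in csv.DictReader(dataset_file):
                animal_id = row[id_column]
                if animal_id in pending:
                    results[animal_id][path] = row
                    pending.discard(animal_id)
                    if not pending:
                        break  # every query was found in this dataset
    return dict(results)
```

Calling collect_rows(all_datasets, ids) would give one dictionary per animal, keyed by dataset, much like Animal.animal_data. One small set is still rebuilt per dataset, but nothing is copied per row.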

Answer 1

Well, you could speed this up considerably by using the pandas library. Ignoring the class definition for now, you could do the following:

import pandas as pd
file_names = ['ex_1.csv', 'ex_2.csv']
animal_queries = ['foo', 'bar'] #input by user

#create list of data sets
data_sets = [pd.read_csv(_file) for _file in file_names]

#create store of retrieved data
retrieved_data = [d_s[d_s['ANIMAL ID'].isin(animal_queries)] for d_s in data_sets]

#concatenate the data
final_data = pd.concat(retrieved_data)

#export to csv
final_data.to_csv('your_data')

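This simplifies things a lot. The isin method is used to slice each data frame down to the rows whose ANIMAL ID is found in the list animal_queries. By the way, pandas will also help you deal with SQL tables, so it is probably a good route to go down.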
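Since the question also cares about which dataset each row came from, here is a small sketch of the same isin filter that tags every retrieved row with its source file. The in-memory frames, the WEIGHT column, and the DATASET column name are made-up stand-ins for illustration; in practice each frame would come from pd.read_csv as above.

```python
import pandas as pd

# Hypothetical in-memory stand-ins for two of the experiment files.
data_sets = {
    "experiment_1.csv": pd.DataFrame(
        {"ANIMAL ID": ["ABC123", "XYZ999"], "WEIGHT": [10, 12]}
    ),
    "experiment_2.csv": pd.DataFrame(
        {"ANIMAL ID": ["DEF456", "ABC123"], "WEIGHT": [7, 11]}
    ),
}
animal_queries = ["ABC123", "DEF456"]

# Keep only the queried rows and record which file each row came from.
retrieved_data = [
    frame[frame["ANIMAL ID"].isin(animal_queries)].assign(DATASET=name)
    for name, frame in data_sets.items()
]
final_data = pd.concat(retrieved_data, ignore_index=True)
print(final_data)
```

The extra column makes it possible to tell apart rows for an ID such as ABC123 that appears in more than one file.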

Solution 1
0 2015-11-24 00:18:17
