
Getting information for multiple queries across multiple .csv files

I am currently trying to figure out a way to get information that is stored across multiple datasets saved as .csv files.

Context

For the purposes of this question, suppose I have 4 datasets: experiment_1.csv, experiment_2.csv, experiment_3.csv, and experiment_4.csv. Each dataset has 20,000+ rows, with 80+ columns per row. Each row represents an Animal, identified by an ID number, and each column holds a piece of experimental data about that Animal. Assume each row's Animal ID number is unique within a dataset, but not across all datasets. For instance, ID #ABC123 might appear in experiment_1.csv and experiment_2.csv, but not in experiment_3.csv or experiment_4.csv.

Problem

Say a user wants to get info for ~100 Animals by looking up each Animal's ID # across all datasets. How would I go about doing this? I'm relatively new to programming, and I would like to improve. Here's what I have so far.

import csv

class Animal:
    def __init__(self, id_number, *other_parameters):
        self.animal_id = id_number
        self.animal_data = {}

    def store_info(self, csv_row, dataset):
        self.animal_data[dataset] = csv_row

# Main function
# ...
# Assume animal_queries = list of Animal Objects

# Iterate through each dataset csv file
for dataset in all_datasets:

    # Make a copy of the list of queries for this dataset
    animal_queries_copy = animal_queries[:]

    with open(dataset, 'r', newline='') as dataset_file:
        reader = csv.DictReader(dataset_file, delimiter=',')

        # Iterate through each row in the csv file
        for row in reader:

            # Check if the list is not empty
            if animal_queries_copy:

                # Get the current row's animal id number
                row_animal_id = row['ANIMAL ID']

                # Check if the animal id number matches with a query for
                # every animal in the list
                for animal in animal_queries_copy[:]:

                    if animal.animal_id == row_animal_id:

                        # If a match is found, store the info, remove the 
                        # query from the list, and exit iterating through 
                        # each query

                        animal.store_info(row, dataset)
                        animal_queries_copy.remove(animal)
                        break

            # If the list is empty, all queries were found for the current 
            # dataset, so exit iterating through rows in reader
            else:
                break

Discussion

Is there a more obvious approach for this? Assume that I want to use .csv files for now; I will consider converting these .csv files to an easier-to-use format like SQL tables later down the line (I am an absolute beginner at databases and SQL, so I need to spend time learning that first).

The one thing that sticks out to me is that I have to create multiple copies of animal_queries: one for each dataset, and one for each row in a dataset (inside the for loop). Since one row only contains one ID, I can exit the inner loop early once I find a match to an ID from animal_queries. In addition, since that ID has already been found, I no longer need to search for it in the rest of the current dataset, so I remove it from the list; but I need to keep the original list of queries because I still need it to search the remaining datasets. However, I can't remove an element from a list while iterating over it in a for loop, so I need to create yet another copy. This doesn't seem optimal to me, and I'm wondering if I'm approaching this from the wrong direction. Any help would be appreciated, thanks!
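One alternative to the copy-and-remove pattern described above is to index the query objects by ID in a dictionary and keep a set of the IDs still unmatched for the current dataset, so each row needs only one constant-time lookup instead of a scan over the remaining queries. This is only a sketch, reusing the Animal class and the all_datasets, animal_queries, and 'ANIMAL ID' names from the code above:

import csv

# Build the lookup table once: ID string -> Animal object.
queries_by_id = {animal.animal_id: animal for animal in animal_queries}

for dataset in all_datasets:
    # IDs that have not yet been matched in this particular dataset
    remaining_ids = set(queries_by_id)

    with open(dataset, 'r', newline='') as dataset_file:
        reader = csv.DictReader(dataset_file)

        for row in reader:
            # Every query has been matched in this dataset, so skip the rest of the file
            if not remaining_ids:
                break

            row_animal_id = row['ANIMAL ID']
            if row_animal_id in remaining_ids:
                queries_by_id[row_animal_id].store_info(row, dataset)
                remaining_ids.discard(row_animal_id)

This keeps one pass per file, but replaces the inner loop and the list copies with dictionary and set operations.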

Well, you could greatly speed this up by using the pandas library, for one thing. Ignoring the class definition for now, you could do the following:

import pandas as pd

file_names = ['ex_1.csv', 'ex_2.csv']
animal_queries = ['foo', 'bar']  # input by user

# read each dataset into a DataFrame
data_sets = [pd.read_csv(_file) for _file in file_names]

# keep only the rows whose ANIMAL ID appears in the query list
retrieved_data = [d_s[d_s['ANIMAL ID'].isin(animal_queries)] for d_s in data_sets]

# concatenate the per-file results
final_data = pd.concat(retrieved_data)

# export to csv
final_data.to_csv('your_data')

This simplifies things a lot. The isin method slices each data frame down to the rows whose ANIMAL ID is found in the list animal_queries. Incidentally, pandas will also help you cope with SQL tables, so it is probably a good route for you to go down.
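If you also want to keep track of which file each matching row came from, which is what the Animal class in the question stores, one small extension of the same idea (again just a sketch, using the same hypothetical file names and query list) is to tag every row with its source file before concatenating:

import pandas as pd

file_names = ['ex_1.csv', 'ex_2.csv']  # hypothetical file names, as above
animal_queries = ['foo', 'bar']        # hypothetical IDs supplied by the user

retrieved_data = []
for _file in file_names:
    d_s = pd.read_csv(_file)
    # keep only the matching rows and record which dataset they came from
    matches = d_s[d_s['ANIMAL ID'].isin(animal_queries)].copy()
    matches['SOURCE_FILE'] = _file
    retrieved_data.append(matches)

final_data = pd.concat(retrieved_data, ignore_index=True)
final_data.to_csv('your_data.csv', index=False)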
