Getting information for multiple queries across multiple .csv files

I am currently trying to figure out a way to look up information that is stored across multiple datasets saved as .csv files.

Context

For the purposes of this question, suppose I have 4 datasets: experiment_1.csv, experiment_2.csv, experiment_3.csv, and experiment_4.csv. Each dataset has 20,000+ rows, each with 80+ columns. Each row represents an Animal, identified by an ID number, and each column holds various experimental data about that Animal. Assume each Animal's ID number is unique within a dataset, but not across all datasets. For instance, ID #ABC123 can be found in experiment_1.csv and experiment_2.csv, but not in experiment_3.csv or experiment_4.csv.

Problem

Say a user wants to get info for ~100 Animals by looking up each Animal's ID # across all datasets. How would I go about doing this? I'm relatively new to programming, and I would like to improve. Here's what I have so far.

import csv


class Animal:
    def __init__(self, id_number, *other_parameters):
        self.animal_id = id_number
        self.animal_data = {}

    def store_info(self, csv_row, dataset):
        self.animal_data[dataset] = csv_row

# Main function
# ...
# Assume animal_queries = list of Animal Objects

# Iterate through each dataset csv file
for dataset in all_datasets:

    # Make a copy of the list of queries
    animal_queries_copy = animal_queries[:]

    with open(dataset, 'r', newline='') as dataset_file:
        reader = csv.DictReader(dataset_file, delimiter=',')

        # Iterate through each row in the csv file
        for row in reader:

            # Check if the list is not empty
            if animal_queries_copy:

                # Get the current row's animal id number
                row_animal_id = row['ANIMAL ID']

                # Check if the animal id number matches with a query for
                # every animal in the list
                for animal in animal_queries_copy[:]:

                    if animal.animal_id == row_animal_id:

                        # If a match is found, store the info, remove the 
                        # query from the list, and exit iterating through 
                        # each query

                        animal.store_info(row, dataset)
                        animal_queries_copy.remove(animal)
                        break

            # If the list is empty, all queries were found for the current 
            # dataset, so exit iterating through rows in reader
            else:
                break

Discussion

Is there a more straightforward approach to this? Assume that I want to stick with .csv files for now; I will consider converting them to an easier-to-use format such as SQL tables later down the line (I am an absolute beginner at databases and SQL, so I need to spend time learning this).

The one thing that sticks out to me is that I have to make multiple copies of animal_queries: one per dataset, plus another copy every time I loop over the remaining queries for a row. Since one row contains only one ID, I can exit the inner loop early once I find a match from animal_queries. And since that ID has already been found, I no longer need to search for it in the rest of the current dataset, so I remove it from the list; however, I have to keep the original list of queries because I still need it for the remaining datasets. On top of that, I can't safely remove an element from a list while iterating over it in a for loop, so I end up iterating over yet another copy. This doesn't seem optimal to me, and I'm wondering if I'm approaching this from the wrong direction. Any help would be appreciated, thanks!
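
To make the idea concrete, this is roughly the kind of restructuring I have in mind: a rough, untested sketch that assumes the same Animal class, animal_queries list, all_datasets list, and 'ANIMAL ID' column as above. I'm not sure whether this is the right direction either.

import csv

# Rough, untested sketch: index the queries by ID once, then track which IDs
# are still missing from the current file with a set, instead of copying and
# shrinking the query list.
animals_by_id = {animal.animal_id: animal for animal in animal_queries}

for dataset in all_datasets:
    # IDs not yet matched in this particular dataset
    remaining_ids = set(animals_by_id)

    with open(dataset, 'r', newline='') as dataset_file:
        for row in csv.DictReader(dataset_file):
            row_animal_id = row['ANIMAL ID']

            if row_animal_id in remaining_ids:
                animals_by_id[row_animal_id].store_info(row, dataset)
                remaining_ids.remove(row_animal_id)

                # All queries matched for this file, so stop reading rows
                if not remaining_ids:
                    break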

Well, for one thing, you could greatly speed this up by using the pandas library. Ignoring the class definition for now, you could do the following:

import pandas as pd
file_names = ['ex_1.csv', 'ex_2.csv']
animal_queries = ['foo', 'bar'] #input by user

#create list of data sets
data_sets = [pd.read_csv(_file) for _file in file_names]

#create store of retrieved data
retrieved_data = [d_s[d_s['ANIMAL ID'].isin(animal_queries)] for d_s in data_sets]

#concatenate the data
final_data = pd.concat(retrieved_data)

#export to csv
final_data.to_csv('your_data.csv') 

This simplifies things a lot. The isin method slices each data frame down to the rows where ANIMAL ID appears in the list animal_queries. Incidentally, pandas will also help you work with SQL tables, so it is probably a good route for you to go down.
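
If you do go down the SQL route later, a rough sketch of that workflow with pandas and SQLite might look something like the following; the database, table, and file names here are just placeholders:

import sqlite3
import pandas as pd

# Rough sketch: load each CSV into a local SQLite database once,
# then query by ID with plain SQL. Names below are placeholders.
conn = sqlite3.connect('experiments.db')

for name in ['experiment_1.csv', 'experiment_2.csv']:
    table = name.replace('.csv', '')
    pd.read_csv(name).to_sql(table, conn, if_exists='replace', index=False)

# Look up a few IDs in one table; [ANIMAL ID] is bracketed because of the space
ids = ['ABC123', 'DEF456']
placeholders = ','.join('?' for _ in ids)
query = f'SELECT * FROM experiment_1 WHERE [ANIMAL ID] IN ({placeholders})'
matches = pd.read_sql(query, conn, params=ids)

conn.close()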
