Retrieving information for multiple queries across multiple .csv files

Question

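I'm currently trying to figure out a way to retrieve information that is stored across multiple datasets, each kept in its own .csv file.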

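Context

For the purposes of this question, suppose I have 4 datasets: experiment_1.csv, experiment_2.csv, experiment_3.csv and experiment_4.csv. Each dataset has 20,000+ rows with 80+ columns per row. Each row represents an animal, identified by an ID number, and each column holds a piece of experimental data about that animal. Assume that an animal's ID number is unique within each dataset, but not across datasets. For example, ID# ABC123 can be found in experiment_1.csv and experiment_2.csv, but not in experiment_3.csv or experiment_4.csv.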

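Problem

Suppose a user wants to get information on about 100 animals by looking up each animal's ID number in all of the datasets. How would I go about doing this? I'm new to programming and I want to improve. This is what I have so far.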

class Animal:
    def __init__(self, id_number, *other_parameters):
        self.animal_id = id_number
        self.animal_data = {}

    def store_info(self, csv_row, dataset):
        self.animal_data[dataset] = csv_row

import csv

# Main function
# ...
# Assume animal_queries = list of Animal objects

# Iterate through each dataset csv file
for dataset in all_datasets:

    # Make a copy of the list of queries
    animal_queries_copy = animal_queries[:]

    with open(dataset, 'r', newline='') as dataset_file:
        reader = csv.DictReader(dataset_file, delimiter=',')

        # Iterate through each row in the csv file
        for row in reader:

            # Check if the list is not empty
            if animal_queries_copy:

                # Get the current row's animal id number
                row_animal_id = row['ANIMAL ID']

                # Check if the animal id number matches with a query for
                # every animal in the list
                for animal in animal_queries_copy[:]:

                    if animal.animal_id == row_animal_id:

                        # If a match is found, store the info, remove the 
                        # query from the list, and exit iterating through 
                        # each query

                        animal.store_info(row, dataset)
                        animal_queries_copy.remove(animal)
                        break

            # If the list is empty, all queries were found for the current 
            # dataset, so exit iterating through rows in reader
            else:
                break

Discussion

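Is there a more obvious way to do this? Assume I want to work with the .csv files for now, and will later consider converting them to an easier-to-use format such as SQL tables (I'm an absolute beginner with databases and SQL, so I'll need to take time to learn this).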
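As a rough idea of what that later SQL step might look like, here is a minimal sketch using Python's built-in csv, json and sqlite3 modules. The table name animals, its three columns, and the helpers load_datasets and lookup are all made up for illustration; storing each row as JSON is just a shortcut to avoid declaring 80+ columns, not a recommended schema.

```python
import csv
import json
import sqlite3


def load_datasets(conn, dataset_paths, id_column="ANIMAL ID"):
    """Copy every CSV row into a single (hypothetical) 'animals' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS animals ("
        "animal_id TEXT, dataset TEXT, row_json TEXT, "
        "PRIMARY KEY (animal_id, dataset))"
    )
    for path in dataset_paths:
        with open(path, "r", newline="") as dataset_file:
            # One (id, source file, whole row as JSON) tuple per CSV row
            rows = (
                (row[id_column], path, json.dumps(row))
                for row in csv.DictReader(dataset_file)
            )
            conn.executemany(
                "INSERT OR REPLACE INTO animals VALUES (?, ?, ?)", rows
            )
    conn.commit()


def lookup(conn, animal_ids):
    """Return (animal_id, dataset, row_dict) for every stored match."""
    ids = list(animal_ids)
    if not ids:
        return []
    placeholders = ", ".join("?" for _ in ids)
    cursor = conn.execute(
        "SELECT animal_id, dataset, row_json FROM animals "
        f"WHERE animal_id IN ({placeholders}) "
        "ORDER BY animal_id, dataset",
        ids,
    )
    return [(a, d, json.loads(r)) for a, d, r in cursor]
```

With the files loaded once, each batch of queries becomes a single indexed lookup, e.g. lookup(conn, ['ABC123']) after load_datasets(conn, all_datasets), instead of a fresh scan of every file.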

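One thing that bothers me about animal_queries is that I have to create multiple copies of it: one per dataset, and one per row in the dataset (in the for loop). Since a row contains only one ID, I can exit the loop early once I find a query in animal_queries that matches it. Also, since that ID has been found, I no longer need to search for it in the rest of the current dataset, so I remove it from the list; but I also need to keep an original copy of the queries in order to search the remaining datasets. On top of that, I can't remove elements from a list while iterating over it in a for loop, so I need to create yet another copy. This doesn't seem optimal to me, and I'm wondering whether I'm heading in the wrong direction. Any help would be appreciated, thanks!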
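For comparison, here is a minimal sketch of one way to avoid the per-row copy while staying with the csv module: keep the wanted IDs in a set, so each row needs only a single membership test and matched IDs can be discarded without iterating over the collection. The function name collect_rows and the shape of the returned dict are assumptions made for illustration; the 'ANIMAL ID' column name is taken from the code above.

```python
import csv
from collections import defaultdict


def collect_rows(dataset_paths, wanted_ids, id_column="ANIMAL ID"):
    """Return {animal_id: {dataset_path: row}} for every wanted ID found."""
    wanted = set(wanted_ids)
    found = defaultdict(dict)
    for path in dataset_paths:
        # One copy per dataset; no copy is needed per row
        remaining = set(wanted)
        with open(path, "r", newline="") as dataset_file:
            for row in csv.DictReader(dataset_file):
                animal_id = row[id_column]
                if animal_id in remaining:
                    found[animal_id][path] = row
                    remaining.discard(animal_id)
                    if not remaining:
                        # Every query was matched in this file, stop reading
                        break
    return dict(found)
```

The result could then be fed back into the Animal objects, for example by calling animal.store_info(row, dataset) for each entry under that animal's ID.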

Answer 1

Well, you can speed this up considerably by using the pandas library. Ignoring the class definition for now, you could do the following:

import pandas as pd
file_names = ['ex_1.csv', 'ex_2.csv']
animal_queries = ['foo', 'bar'] #input by user

#create list of data sets
data_sets = [pd.read_csv(_file) for _file in file_names]

#create store of retrieved data
retrieved_data = [d_s[d_s['ANIMAL ID'].isin(animal_queries)] for d_s in data_sets]

#concatenate the data
final_data = pd.concat(retrieved_data)

#export to csv
final_data.to_csv('your_data')

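This simplifies things a lot. The isin method is used to slice each data frame down to the rows whose ANIMAL ID is found in the list animal_queries. By the way, pandas will also help you work with SQL tables, so it's probably a good route for you to go down.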
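One thing the concatenated result no longer shows is which file each row came from, which the original Animal class did keep track of. A minimal sketch of one way to retain that, assuming an extra column named dataset is acceptable (the column name and the function name fetch_animals are made up for illustration):

```python
import pandas as pd


def fetch_animals(file_names, animal_queries, id_column="ANIMAL ID"):
    """Filter each CSV by ID and tag the matches with their source file."""
    frames = []
    for file_name in file_names:
        # Read IDs as strings so they compare cleanly with the query list
        data_set = pd.read_csv(file_name, dtype={id_column: str})
        matches = data_set[data_set[id_column].isin(animal_queries)]
        frames.append(matches.assign(dataset=file_name))
    return pd.concat(frames, ignore_index=True)
```

Calling fetch_animals(file_names, animal_queries) then gives one data frame in which an ID such as ABC123 appears once per file it was found in.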

Solution 1
0 2015-11-24 00:18:17
