How do I read a CSV file that has multiple columns with the same or similar names?

Question

I have been given a CSV file that has two issues. It is provided by a third party, so its format is out of my control.

Columns with Similar Names
Columns with the Same Name

Different CSV files will have different similar names.

CSV File A

File Name,Column2[en],Column2[us],isPartOf,isPartOf
file1.tif,English,US English,USA,North America
file2.tif,English,,USA,
file3.tif,,US,,North America

CSV File B

File Name,Column2[fr],Column2[en],isPartOf,isPartOf

Is it possible to use startswith() with csv.DictReader to read multiple columns? Or do I need to build a map of the header row first and map the columns separately before reading the CSV with DictReader?

Is it possible to load the data from both columns that share the same name? I know you can do something with DataFrames in pandas, but I am not allowed to use pandas.

#!/usr/bin/env python3

import csv

with open("./test.csv") as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=',')
    for row in csv_reader:
        print(row["isPartOf"], row["isPartOf"])

I run this using:

$ ./csvReader.py 
North America North America
North America

Answer 1

You could create a class that uses csv.reader to read the first line, uses its column names to work out how to handle duplicate columns, and then yields rows as dictionaries when iterated over. This example groups all columns by name; if multiple columns have the same name, the dictionary holds a tuple containing the values of all those columns.

import csv
import collections

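class DuplicateColumnDictReader:
    """Iterate over CSV rows as dicts, grouping columns that share a name."""

    def __init__(self, iterable, dialect='excel', **kwargs):
        self.reader = csv.reader(iterable, dialect, **kwargs)
        # The first row is the header.
        self.header = next(self.reader)
        # Map each column name to the list of positions where it appears.
        self.columns_grouping = collections.defaultdict(list)

        for index, col_name in enumerate(self.header):
            self.columns_grouping[col_name].append(index)

    def __iter__(self):
        return self

    def __next__(self):
        # StopIteration raised by the underlying reader ends the iteration.
        row = next(self.reader)
        row_dict = dict()
        for col_name, col_indices in self.columns_grouping.items():
            if len(col_indices) == 1:
                # Unique column name: store the plain value.
                row_dict[col_name] = row[col_indices[0]]
            else:
                # Duplicated column name: store all values as a tuple.
                row_dict[col_name] = tuple(row[index] for index in col_indices)
        return row_dict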

Running this with your file A:

import io

csv_str = """File Name,Column2[en],Column2[us],isPartOf,isPartOf
file1.tif,English,US English,USA,North America
file2.tif,English,,USA,
file3.tif,,US,,North America"""

reader = DuplicateColumnDictReader(io.StringIO(csv_str), delimiter=",")
for row in reader:
    print(row["isPartOf"])

Which will print:

('USA', 'North America')
('USA', '')
('', 'North America')
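
For the similarly named columns (Column2[en], Column2[us], and so on), one possible approach is to pick the matching header positions with startswith() before reading the data rows. The following is only a minimal sketch under that assumption: columns_starting_with is a hypothetical helper written for this illustration, not part of the csv module, and the "Column2" prefix is taken from your sample files.

```python
import csv
import io

def columns_starting_with(header, prefix):
    """Return the positions of every header cell that starts with `prefix`.

    Hypothetical helper for illustration; not part of the csv module.
    """
    return [i for i, name in enumerate(header) if name.startswith(prefix)]

csv_str = """File Name,Column2[en],Column2[us],isPartOf,isPartOf
file1.tif,English,US English,USA,North America
file2.tif,English,,USA,
file3.tif,,US,,North America"""

reader = csv.reader(io.StringIO(csv_str))
header = next(reader)

# Positions of all columns whose name begins with "Column2".
column2_indices = columns_starting_with(header, "Column2")

results = []
for row in reader:
    # Keep the full header name as the key so the language tag is preserved.
    results.append({header[i]: row[i] for i in column2_indices})

for item in results:
    print(item)
# {'Column2[en]': 'English', 'Column2[us]': 'US English'}
# {'Column2[en]': 'English', 'Column2[us]': ''}
# {'Column2[en]': '', 'Column2[us]': 'US'}
```

Because the prefix is matched against whatever header the file contains, the same call should also pick up Column2[fr] and Column2[en] in file B without any change.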

Solution 1
0 2022-08-16 16:32:28
