How to read CSV files having multiple columns with same or similar names?

Question

I am given a CSV file, provided by a third party and outside my control, that has two issues:

Columns with Similar Names
Columns with the Same Name

The similar names differ from one CSV file to another.

CSV File A

File Name,Column2[en],Column2[us],isPartOf,isPartOf
file1.tif,English,US English,USA,North America
file2.tif,English,,USA,
file3.tif,,US,,North America

CSV File B

File Name,Column2[fr],Column2[en],isPartOf,isPartOf

Is it possible to use startswith() with csv.DictReader to read multiple columns? Or do I need to build a map of the header row and map the columns separately before reading the CSV with DictReader?

Is it possible to load the data from both columns that share the same name? I know this can be done with DataFrames in pandas, but I am not allowed to use pandas.

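#!/usr/bin/env python3

import csv

# newline="" is what the csv module documentation recommends when opening files
with open("./test.csv", newline="") as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=',')
    for row in csv_reader:
        # Both lookups return the same value: DictReader keeps only the last "isPartOf" column
        print(row["isPartOf"], row["isPartOf"])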

I run this using:

$ ./csvReader.py 
North America North America
North America

Answer 1

You could create a class that uses csv.reader to read the first line, uses the column names found there to decide how to handle duplicate columns, and then yields rows as dictionaries when iterated over. This example groups all columns by name; if several columns share a name, the dictionary holds a tuple containing all of their values:

import csv
import collections

class DuplicateColumnDictReader:
    def __init__(self, iterable, dialect='excel', **kwargs):
        self.reader = csv.reader(iterable, dialect, **kwargs)
        # The first line is the header row
        self.header = next(self.reader)
        # Maps each column name to the list of positions where it appears
        self.columns_grouping = collections.defaultdict(list)

        for index, col_name in enumerate(self.header):
            self.columns_grouping[col_name].append(index)

    def __iter__(self):
        return self

    def __next__(self):
        # StopIteration from the underlying reader ends the iteration
        row = next(self.reader)
        row_dict = dict()
        for col_name, col_indices in self.columns_grouping.items():
            if len(col_indices) == 1:
                # Unique column name: store the plain value
                row_dict[col_name] = row[col_indices[0]]
            else:
                # Duplicated column name: store all values as a tuple
                row_dict[col_name] = tuple(row[index] for index in col_indices)
        return row_dict

Using it with your file A:

import io

csv_str = """File Name,Column2[en],Column2[us],isPartOf,isPartOf
file1.tif,English,US English,USA,North America
file2.tif,English,,USA,
file3.tif,,US,,North America"""

reader = DuplicateColumnDictReader(io.StringIO(csv_str), delimiter=",")
for row in reader:
    print(row["isPartOf"])

Which will print:

('USA', 'North America')
('USA', '')
('', 'North America')
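
The class above only handles columns whose names are identical. For the similar names (Column2[en], Column2[us], Column2[fr]), one possible approach is to parse the header once and group columns by the part before the bracket. The following is only a sketch under an assumption: that similar columns always follow the pattern Base[tag]. The function name read_grouped and the regular expression are made up for this illustration and are not part of the csv module.

```python
import csv
import io
import re
from collections import defaultdict

# Assumed header convention: "Base[tag]", e.g. "Column2[en]".
TAGGED = re.compile(r"^(?P<base>[^\[\]]+)\[(?P<tag>[^\[\]]*)\]$")


def read_grouped(iterable, **kwargs):
    """Yield one dict per row, mapping each base column name to a list.

    Tagged headers such as "Column2[en]" contribute (tag, value) pairs;
    plain headers contribute raw values, so duplicate columns are kept.
    """
    reader = csv.reader(iterable, **kwargs)
    header = next(reader)
    # Parse the header once: (base name, tag or None) per column position.
    parsed = []
    for name in header:
        m = TAGGED.match(name)
        parsed.append((m["base"], m["tag"]) if m else (name, None))
    for row in reader:
        grouped = defaultdict(list)
        for (base, tag), value in zip(parsed, row):
            grouped[base].append(value if tag is None else (tag, value))
        yield dict(grouped)


csv_str = """File Name,Column2[en],Column2[us],isPartOf,isPartOf
file1.tif,English,US English,USA,North America
file2.tif,English,,USA,
file3.tif,,US,,North America"""

rows = list(read_grouped(io.StringIO(csv_str)))
print(rows[0]["Column2"])   # [('en', 'English'), ('us', 'US English')]
print(rows[0]["isPartOf"])  # ['USA', 'North America']
```

Because the tag is read from the header of each file, the same function should work for file B (Column2[fr], Column2[en]) without changes. Every value is a list here, even for unique columns such as File Name, which keeps the calling code uniform at the cost of an extra [0] for single columns.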
