How to read CSV files having multiple columns with the same or similar names?
I am given a CSV, provided by a third party and out of my control, that has two issues: several columns have similar names, and some columns have exactly the same name.
Different CSV files will have different similar names.
CSV File A
File Name,Column2[en],Column2[us],isPartOf,isPartOf
file1.tif,English,US English,USA,North America
file2.tif,English,,USA,
file3.tif,,US,,North America
CSV File B
File Name,Column2[fr],Column2[en],isPartOf,isPartOf
Is it possible, using csv.DictReader, to use startswith() to read multiple columns? Or do I need to create a map of the header row and map the columns separately before reading the CSV with DictReader?
Is it possible to load the data from both columns that have the same name? I know you can do something with dataframes in pandas, but I am not allowed to use Pandas.
#!/bin/env python3
import csv

with open("./test.csv") as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=',')
    for row in csv_reader:
        print(row["isPartOf"], row["isPartOf"])
I run this using:
$ ./csvReader.py
North America North America
North America
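For context, csv.DictReader builds each row as a plain dictionary keyed by the header names, so when a name repeats, the value from the later column overwrites the earlier one. The following is a minimal sketch of that behaviour, assuming the contents of CSV File A are held in a string (the variable names are only illustrative):

```python
import csv
import io

# Sample data copied from CSV File A in the question.
csv_text = (
    "File Name,Column2[en],Column2[us],isPartOf,isPartOf\n"
    "file1.tif,English,US English,USA,North America\n"
    "file2.tif,English,,USA,\n"
    "file3.tif,,US,,North America\n"
)

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# The header list still contains both "isPartOf" entries...
print(reader.fieldnames)

# ...but each row dict has a single "isPartOf" key, holding the value
# from the last column with that name.
print([row["isPartOf"] for row in rows])
```

The first "isPartOf" column (USA) is therefore not reachable through the row dictionaries, which is why the header has to be handled separately.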
You could create a class which uses csv.reader to read the first line, uses its column names to figure out how to handle duplicate columns, and then yields rows as dictionaries when iterated over. This example groups all columns by name, and if multiple columns have the same name, returns a tuple containing all the column values in the dictionary:
import csv
import collections


class DuplicateColumnDictReader:
    def __init__(self, iterable, dialect='excel', **kwargs):
        self.reader = csv.reader(iterable, dialect, **kwargs)
        # The first line is the header row.
        self.header = next(self.reader)
        # Map each column name to the list of positions where it occurs.
        self.columns_grouping = collections.defaultdict(list)
        for index, col_name in enumerate(self.header):
            self.columns_grouping[col_name].append(index)

    def __iter__(self):
        return self

    def __next__(self):
        row = next(self.reader)
        row_dict = dict()
        for col_name, col_indices in self.columns_grouping.items():
            if len(col_indices) == 1:
                # Unique column name: store the plain value.
                row_dict[col_name] = row[col_indices[0]]
            else:
                # Repeated column name: store all values as a tuple.
                row_dict[col_name] = tuple(row[index] for index in col_indices)
        return row_dict
Running this with your file A gives:
import io
csv_str = """File Name,Column2[en],Column2[us],isPartOf,isPartOf
file1.tif,English,US English,USA,North America
file2.tif,English,,USA,
file3.tif,,US,,North America"""
reader = DuplicateColumnDictReader(io.StringIO(csv_str), delimiter=",")
for row in reader:
    print(row["isPartOf"])
Which will print:
('USA', 'North America')
('USA', '')
('', 'North America')
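If you would rather keep using csv.DictReader, and also want to pick up the similar names such as Column2[en] and Column2[us], one alternative is to read the header row yourself, make the repeated names unique, and pass the result as fieldnames. The sketch below is only one possible way to do this: unique_names and read_rows are hypothetical helper names, the "_2" suffix is an arbitrary choice, and it assumes the header does not already contain a name like isPartOf_2.

```python
import csv
import io


def unique_names(header):
    """Return the header with repeats suffixed, e.g. isPartOf, isPartOf_2."""
    seen = {}
    result = []
    for name in header:
        seen[name] = seen.get(name, 0) + 1
        result.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return result


def read_rows(csv_file):
    """Yield one dict per data row, keeping every duplicate column."""
    header = next(csv.reader(csv_file), None)
    if header is None:  # empty file: nothing to yield
        return
    # Passing fieldnames makes DictReader treat the remaining lines as data.
    yield from csv.DictReader(csv_file, fieldnames=unique_names(header))


# Sample data copied from CSV File A in the question.
csv_text = (
    "File Name,Column2[en],Column2[us],isPartOf,isPartOf\n"
    "file1.tif,English,US English,USA,North America\n"
    "file2.tif,English,,USA,\n"
    "file3.tif,,US,,North America\n"
)

rows = list(read_rows(io.StringIO(csv_text)))
for row in rows:
    # Select groups of related columns by prefix, in header order.
    column2 = [value for key, value in row.items() if key.startswith("Column2")]
    part_of = [value for key, value in row.items() if key.startswith("isPartOf")]
    print(row["File Name"], column2, part_of)
```

Because the selection is done with startswith() on the keys, the same loop works for CSV File B, where the similar columns are Column2[fr] and Column2[en].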