Python: Extra comma on csv.DictReader column

I have a read function that reads a CSV file using csv.DictReader. file.csv is comma-separated and the whole file is read. However, in part of my file one column contains extra commas. How can I make sure those commas are treated as part of the column? I cannot alter the CSV file itself.

Text file:

ID,Name,University,Street,ZipCode,Country
12,Jon Snow,U of Winterfell,Winterfell #45,60434,Westeros
13,Steve Rogers,NYU,108, Chelsea St.,23333,United States
20,Peter Parker,Yale,34, Tribeca,32444,United States
34,Tyrion Lannister,U of Casterly Rock,Kings Landing #89, 43543,Westeros

The desired output is this:

{'ID': '12', 'Name': 'Jon Snow', 'University': 'U of Winterfell', 'Street': 'Winterfell #45', 'ZipCode': '60434', 'Country': 'Westeros'}
{'ID': '13', 'Name': 'Steve Rogers', 'University': 'NYU', 'Street': '108, Chelsea St.', 'ZipCode': '23333', 'Country': 'United States'}
{'ID': '20', 'Name': 'Peter Parker', 'University': 'Yale', 'Street': '34, Tribeca', 'ZipCode': '32444', 'Country': 'United States'}
{'ID': '34', 'Name': 'Tyrion Lannister', 'University': 'U of Casterly Rock', 'Street': 'Kings Landing #89', 'ZipCode': '43543', 'Country': 'Westeros'}

As you can see, 'Street' contains an extra comma after the street number in these rows:

13,Steve Rogers,NYU,108, Chelsea St.,23333,United States

20,Peter Parker,Yale,34, Tribeca,32444,United States

Note: Most columns are separated as str,str, but inside the 'Street' column the comma is followed by a space, as in str, str. I hope this makes sense.

The option I looked into is re.split, but I don't know how to apply it when reading the file. I was thinking of something like re.split(r'(?!\s),(?!\s)', x[:-1]). How can I make sure such commas are counted as part of the column? I can't use pandas.

My current output looks like this:

{'ID': '12', 'Name': 'Jon Snow', 'University': 'U of Winterfell', 'Street': 'Winterfell #45', 'ZipCode': '60434', 'Country': 'Westeros'}
{'ID': '13', 'Name': 'Steve Rogers', 'University': 'NYU', 'Street': '108', 'ZipCode': 'Chelsea St.', 'Country': '23333', None: ['United States']}
{'ID': '20', 'Name': 'Peter Parker', 'University': 'Yale', 'Street': '34', 'ZipCode': 'Tribeca', 'Country': '32444', None: ['United States']}
{'ID': '34', 'Name': 'Tyrion Lannister', 'University': 'U of Casterly Rock', 'Street': 'Kings Landing #89', 'ZipCode': '43543', 'Country': 'Westeros'}

This is my read function:

import csv

list = []
with open('file.csv', mode='r') as csv_file:
  csv_reader = csv.DictReader(csv_file, delimiter=",", skipinitialspace=True)

  for col in csv_reader:
    list.append(dict(col))
    print(dict(col))

You can't use the csv module if the file isn't in valid CSV format.

You need to call re.split() on ordinary lines, not on dictionaries.

import re

rows = []
with open('file.csv', mode='r') as csv_file:
    keys = csv_file.readline().strip().split(',')  # read the header line
    for line in csv_file:
        line = line.strip()
        # split only on commas that are not followed by whitespace
        row = re.split(r'(?!\s),(?!\s)', line)
        rows.append(dict(zip(keys, row)))
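To see what that split does with the question's data, here is a minimal sketch that runs it on three of the sample lines held in memory rather than in file.csv. The names HEADER, LINES, SEPARATOR and split_line are only illustrative. The pattern is written as ,(?!\s) because the leading (?!\s) in the original expression looks at the comma itself and so never rejects anything.

```python
import re

# Sample lines copied from the question's file.
HEADER = "ID,Name,University,Street,ZipCode,Country"
LINES = [
    "12,Jon Snow,U of Winterfell,Winterfell #45,60434,Westeros",
    "13,Steve Rogers,NYU,108, Chelsea St.,23333,United States",
    "34,Tyrion Lannister,U of Casterly Rock,Kings Landing #89, 43543,Westeros",
]

# Split on a comma only when the next character is not whitespace.
SEPARATOR = re.compile(r",(?!\s)")


def split_line(line):
    """Split one raw line into fields (illustrative helper)."""
    return SEPARATOR.split(line.strip())


keys = HEADER.split(",")
parsed = [split_line(line) for line in LINES]
records = [dict(zip(keys, fields)) for fields in parsed]

for fields in parsed:
    print(len(fields), fields)
```

The Steve Rogers line comes out with six fields and the street intact. The Tyrion Lannister line, however, has a space after the comma that precedes the zip code, so that comma is not treated as a separator and the line yields only five fields. If the real file contains rows like that, this approach needs an extra rule for them.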

The real fix for this problem is to modify the script that generates the CSV file.

If you can modify that output, you can do one of two things:

  • Use a delimiter other than a comma, such as | or ;, or any character you are confident never appears in the data.
  • Or enclose every column in " so that you can split on the commas that are actual separators.
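Both options can be sketched with the standard csv module, assuming the generating script could be changed. The two input strings below are hypothetical regenerated versions of one row from the question, not the asker's actual file.

```python
import csv
import io

# Hypothetical regenerated file: the field that contains a comma is quoted.
quoted_text = (
    "ID,Name,University,Street,ZipCode,Country\n"
    '13,Steve Rogers,NYU,"108, Chelsea St.",23333,United States\n'
)
quoted_rows = list(csv.DictReader(io.StringIO(quoted_text)))

# Hypothetical regenerated file that uses "|" as the delimiter instead.
piped_text = (
    "ID|Name|University|Street|ZipCode|Country\n"
    "13|Steve Rogers|NYU|108, Chelsea St.|23333|United States\n"
)
piped_rows = list(csv.DictReader(io.StringIO(piped_text), delimiter="|"))

# On the writing side, csv.writer adds the quotes wherever they are needed.
buffer = io.StringIO()
csv.writer(buffer).writerow(
    ["13", "Steve Rogers", "NYU", "108, Chelsea St.", "23333", "United States"]
)
written_line = buffer.getvalue()

print(quoted_rows[0]["Street"])
print(piped_rows[0]["Street"])
print(written_line)
```

With either format DictReader needs no special handling, which is why fixing the generator is the preferable route when it is available.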

If you cannot modify the output, and you are sure that the extra commas appear only in the Street column, use csv.reader instead of DictReader. That way you can pick the columns by index, since their positions are known: for instance, row[0] is ID, row[1] is Name, row[-1] is Country, and row[-2] is ZipCode, so row[2:-2] would give you what you need, I guess. The indexes may need adjusting, but the idea is clear.

Hope that helps.


Edit:

import csv

rows = []
with open('file.csv', mode='r') as csv_file:
  csv_reader = csv.reader(csv_file, delimiter=",", skipinitialspace=True)
  # skip the header row
  next(csv_reader)
  for row in csv_reader:
    # everything between University and ZipCode belongs to the street
    rows.append({"ID": row[0],
                 "Name": row[1],
                 "University": row[2],
                 "Street": ' '.join(row[3:-2]),
                 "Zipcode": row[-2],
                 "Country": row[-1]})
print(rows)

Here is the output (with pprint):

[{'Country': 'Westeros',
  'ID': '12',
  'Name': 'Jon Snow',
  'Street': 'Winterfell #45',
  'University': 'U of Winterfell',
  'Zipcode': '60434'},
 {'Country': 'United States',
  'ID': '13',
  'Name': 'Steve Rogers',
  'Street': '108 Chelsea St.',
  'University': 'NYU',
  'Zipcode': '23333'},
 {'Country': 'United States',
  'ID': '20',
  'Name': 'Peter Parker',
  'Street': '34 Tribeca',
  'University': 'Yale',
  'Zipcode': '32444'},
 {'Country': 'Westeros',
  'ID': '34',
  'Name': 'Tyrion Lannister',
  'Street': 'Kings Landing #89',
  'University': 'U of Casterly Rock',
  'Zipcode': '43543'}]

Second edit: corrected the index used for the street. Regards.
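As a variation on the same idea, the sketch below takes the column names from the header row and rejoins the surplus fields with ", " so that the result matches the desired output in the question, including the comma in the street and the ZipCode key. The helper read_rows and its wide_column parameter are hypothetical names. It assumes that only one column can contain extra commas and that each such comma was followed by a single space, since skipinitialspace=True discards that space.

```python
import csv
import io

# The question's sample data, held in memory instead of file.csv.
SAMPLE = """ID,Name,University,Street,ZipCode,Country
12,Jon Snow,U of Winterfell,Winterfell #45,60434,Westeros
13,Steve Rogers,NYU,108, Chelsea St.,23333,United States
20,Peter Parker,Yale,34, Tribeca,32444,United States
34,Tyrion Lannister,U of Casterly Rock,Kings Landing #89, 43543,Westeros
"""


def read_rows(csv_file, wide_column="Street"):
    """Yield one dict per row, folding surplus fields into wide_column."""
    reader = csv.reader(csv_file, skipinitialspace=True)
    header = next(reader)
    wide = header.index(wide_column)
    trailing = len(header) - wide - 1  # number of columns after the wide one
    for row in reader:
        if not row:
            continue  # skip blank lines
        end = len(row) - trailing
        merged = row[:wide] + [", ".join(row[wide:end])] + row[end:]
        yield dict(zip(header, merged))


records = list(read_rows(io.StringIO(SAMPLE)))
for record in records:
    print(record)
```

To read the real file, pass the object returned by open('file.csv', newline='') in place of the io.StringIO wrapper.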

