How do I handle data from a CSV file that has a comma in one of the fields?
Commas within data in a CSV file
This is the data set download I'm working with, from the WHO official website.
This is my functional code, but some of the records contain a comma. The field is wrapped in quotation marks at the start and end, but I don't know how to keep it as one string before it goes into the parallel arrays. I can't just import csv, because this is for my SQA Higher (age 15-16 for those who aren't from Scotland) computing course and I have to do it manually.
One of the country names has a comma in it:
2020-09-30,BQ,"Bonaire, Sint Eustatius and Saba",AMRO,9,115,0,1
("Bonaire, Sint Eustatius and Saba")
whereas the rest don't, and have no quote marks either:
2020-09-08,AF,Afghanistan,EMRO,96,38494,3,1415
(Afghanistan)
Basically it splits where I don't want it to, and I'm left with 9 fields when there should only be 8.
def get_data():
    # One parallel array per column; the first column holds the date
    date_reported, country_code, country, who_region = [], [], [], []
    new_cases, cumulative_cases, new_deaths, cumulative_deaths = [], [], [], []
    with open("data\\csv\\todays_data.csv") as f:
        next(f)  # skip the header row
        for line in f:
            # Naive split: this also splits on commas inside quoted fields
            field = line.split(",")
            date_reported.append(field[0])
            country_code.append(field[1])
            country.append(field[2])
            who_region.append(field[3])
            new_cases.append(int(field[4]))
            cumulative_cases.append(int(field[5]))
            new_deaths.append(int(field[6]))
            cumulative_deaths.append(int(field[7].strip("\n")))
    print("data successfully read")
    return date_reported, country_code, country, who_region, new_cases, cumulative_cases, new_deaths, cumulative_deaths
This code handles your example only: it assumes at most one quoted field per line.
from itertools import repeat

lines = [
    '2020-09-30,BQ,"Bonaire, Sint Eustatius and Saba",AMRO,9,115,0,1',
    '2020-09-08,AF,Afghanistan,EMRO,96,38494,3,1415',
]
magic = 'a0b1c2d3e4'  # magic string won't be found in your csv file
result = []
for line in lines:
    if '"' in line:
        # Hide the commas inside the quoted field behind the magic string
        temp = line.split('"')
        temp[1] = temp[1].replace(',', magic)
        line = '"'.join(temp)
    # Split on the remaining commas, then turn the magic string back into commas
    result.append(list(map(str.replace, line.split(','), repeat(magic), repeat(','))))
>>> result
[['2020-09-30', 'BQ', '"Bonaire, Sint Eustatius and Saba"', 'AMRO', '9', '115', '0', '1'],
['2020-09-08', 'AF', 'Afghanistan', 'EMRO', '96', '38494', '3', '1415']]
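The same idea can be stretched to lines with more than one quoted field. The following is only a sketch under some assumptions: the function name `split_quoted_line` and the `placeholder` parameter are made up for illustration, the placeholder character is assumed never to occur in the file, and quotes are assumed to appear only in pairs around whole fields. Unlike the code above, it drops the surrounding quote marks from the field.

```python
def split_quoted_line(line, placeholder="\x00"):
    """Split one CSV line on commas, keeping quoted fields in one piece.

    Hypothetical helper: assumes quotes come in pairs and that
    `placeholder` never appears in the data.
    """
    pieces = line.split('"')
    # After splitting on '"', the odd-numbered pieces were inside quotes,
    # so hide their commas behind the placeholder.
    for i in range(1, len(pieces), 2):
        pieces[i] = pieces[i].replace(",", placeholder)
    protected = "".join(pieces)  # rejoining without '"' drops the quote marks
    # Split on the remaining commas, then restore the hidden ones.
    return [field.replace(placeholder, ",") for field in protected.split(",")]


fields = split_quoted_line(
    '2020-09-30,BQ,"Bonaire, Sint Eustatius and Saba",AMRO,9,115,0,1'
)
print(len(fields), fields[2])
```

Each line then gives 8 fields, so the existing `field[0]` to `field[7]` indexing in the question should work unchanged once `line.split(",")` is swapped for a call like this.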
Given that this is an assignment for a computing course, it's likely intended for you to build a simple parser. The pseudo-code would be something like this:
data = []
record = []
is_in_quotes = false
cell = ''
for char in file:
    switch char:
        case ',':
            if is_in_quotes:
                # If we are inside a quoted string, treat the comma as part of the cell's value
                cell.append(char)
            else:
                # Otherwise, it denotes the end of a cell
                record.append(cell)
                cell = ''
        case '"':
            # We assume that
            # - there are only ever two double quotes within a cell
            # - the double quotes only appear right after the leading comma and right before the trailing comma
            # Because of that, we can simply toggle the state
            is_in_quotes = !is_in_quotes
        case '\n':
            # At line end, we first add the last cell's value to the record
            record.append(cell)
            cell = ''
            # Then we add the record to the data set
            data.append(record)
            record = []
        default:
            # Any other character is simply treated as part of the cell's value
            cell.append(char)
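A rough Python version of that pseudo-code might look like the following. It is a sketch rather than a full CSV parser: the function name `parse_csv_text` is made up for illustration, it takes the file contents as one string, and it keeps the same assumption that double quotes only ever wrap a whole field. It also handles a last line that has no trailing newline, which the pseudo-code does not cover.

```python
def parse_csv_text(text):
    """Parse CSV text character by character (hypothetical helper)."""
    data = []             # all records read so far
    record = []           # cells of the current line
    cell = ""             # characters of the current cell
    is_in_quotes = False  # True while between an opening and closing quote

    for char in text:
        if char == '"':
            # Quotes are not stored; they only toggle the state
            is_in_quotes = not is_in_quotes
        elif char == "," and not is_in_quotes:
            # An unquoted comma ends the current cell
            record.append(cell)
            cell = ""
        elif char == "\n":
            # End of line: finish the last cell, then the record
            record.append(cell)
            data.append(record)
            record = []
            cell = ""
        else:
            # Anything else, including a quoted comma, is part of the cell
            cell += char

    # Handle a final line that does not end with a newline
    if cell or record:
        record.append(cell)
        data.append(record)
    return data


sample = (
    '2020-09-30,BQ,"Bonaire, Sint Eustatius and Saba",AMRO,9,115,0,1\n'
    '2020-09-08,AF,Afghanistan,EMRO,96,38494,3,1415\n'
)
rows = parse_csv_text(sample)
print(rows[0][2])
```

With a file, `text` could come from `f.read()`, and each record in the returned list can then be appended to the parallel arrays by index.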