How do I process data from a CSV file where one of the fields contains a comma?

Question

Commas within data in a CSV file

This is the data set I'm working with, downloaded from the official WHO website.

This is my working code, but some of the records contain a comma inside a field. That field is wrapped in quotation marks at the start and end, but I don't know how to keep it as one string before it goes into the parallel arrays. I can't just import csv, because this is for my SQA Higher computing course (age 15-16, for those who aren't from Scotland) and I have to do it manually.

One of the country names has a comma in it:

2020-09-30,BQ,"Bonaire, Sint Eustatius and Saba",AMRO,9,115,0,1

("Bonaire, Sint Eustatius and Saba")

whereas the rest don't, and they have no quote marks either:

2020-09-08,AF,Afghanistan,EMRO,96,38494,3,1415

(Afghanistan)

Basically, the line splits where I don't want it to, and I'm left with 9 fields when there should only be 8.

def get_data():

    # One parallel array per column; field[0] is the report date
    date_reported, country_code, country, who_region, new_cases, cumulative_cases, new_deaths, cumulative_deaths = [], [], [], [], [], [], [], []
    with open(r"data\csv\todays_data.csv") as f:
        next(f)  # skip the header row
        for line in f:
            # Splitting on every comma also splits inside quoted fields, which is the problem
            field = line.split(",")
            date_reported.append(field[0])
            country_code.append(field[1])
            country.append(field[2])
            who_region.append(field[3])
            new_cases.append(int(field[4]))
            cumulative_cases.append(int(field[5]))
            new_deaths.append(int(field[6]))
            cumulative_deaths.append(int(field[7].strip("\n")))
    print("data successfully read")
    return date_reported, country_code, country, who_region, new_cases, cumulative_cases, new_deaths, cumulative_deaths

Answer 1

This code handles your example only: it assumes each line contains at most one quoted field.

from itertools import repeat

lines = [
    '2020-09-30,BQ,"Bonaire, Sint Eustatius and Saba",AMRO,9,115,0,1',
    '2020-09-08,AF,Afghanistan,EMRO,96,38494,3,1415',
]

magic = 'a0b1c2d3e4'  # magic string won't be found in your csv file
result = []
for line in lines:
    if '"' in line:
        temp = line.split('"')
        temp[1] = temp[1].replace(',', magic)
        line = '"'.join(temp)
    result.append(list(map(str.replace, line.split(','), repeat(magic), repeat(','))))

>>> result
[['2020-09-30', 'BQ', '"Bonaire, Sint Eustatius and Saba"', 'AMRO', '9', '115', '0', '1'],
 ['2020-09-08', 'AF', 'Afghanistan', 'EMRO', '96', '38494', '3', '1415']]
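Note that the country name in the result above still carries its surrounding quote marks. As a rough sketch of how the same trick could feed the parallel arrays from the question, the version below rejoins the pieces without the quotes. `split_line` and `MAGIC` are names made up for this illustration, and it still assumes at most one quoted field per line and that the magic string never occurs in the file.

```python
MAGIC = "a0b1c2d3e4"  # assumed never to occur in the CSV file


def split_line(line):
    """Split one CSV line that has at most one quoted field."""
    line = line.rstrip("\n")
    if '"' in line:
        parts = line.split('"')
        # parts[1] is the text that sat between the two quote marks:
        # hide its commas so the later split(",") leaves it alone.
        parts[1] = parts[1].replace(",", MAGIC)
        # Rejoin without the quote marks so they do not end up in the data.
        line = "".join(parts)
    # Split on the remaining commas, then restore the hidden ones.
    return [field.replace(MAGIC, ",") for field in line.split(",")]


lines = [
    '2020-09-30,BQ,"Bonaire, Sint Eustatius and Saba",AMRO,9,115,0,1\n',
    '2020-09-08,AF,Afghanistan,EMRO,96,38494,3,1415\n',
]

# Two of the parallel arrays from the question, filled the same way.
country, cumulative_cases = [], []
for line in lines:
    fields = split_line(line)
    country.append(fields[2])
    cumulative_cases.append(int(fields[5]))

print(country)
print(cumulative_cases)
```

In `get_data()` this would replace the `line.split(",")` call; the rest of the loop can stay as it is.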

Answer 2

Given that this is an assignment for a computing course, it's likely that you're intended to build a simple parser. The pseudocode would be something like this:

data = []
record = []
is_in_quotes = false
cell = ''

# The state above is set up once, before the loop, so it survives from one character to the next
for char in file:
  switch char:
    case ',':
      if is_in_quotes:
        # If we are inside a quoted string, treat the comma as part of the cell's value
        cell.append(char)
      else:
        # Otherwise, it denotes the end of a cell
        record.append(cell)
        cell = ''
    case '"':
      # We assume that
      # - there are only ever two double quotes within a cell
      # - the double quotes only appear right after the leading comma and right before the trailing comma
      # Because of that, we can simply toggle the state
      is_in_quotes = !is_in_quotes
    case '\n':
      # At line end, we first add the last cell's value to the record
      record.append(cell)
      cell = ''
      # Then we add the record to the data set
      data.append(record)
      record = []
    default:
      # Any other character is simply treated as part of the cell's value
      cell.append(char)
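For reference, here is one way the pseudocode might look as runnable Python. It is only a sketch under the same assumptions (quotes wrap whole fields and are never escaped inside one); `parse_csv` is a name chosen for this example, and the final check for a missing trailing newline is an addition that the pseudocode does not have.

```python
def parse_csv(text):
    """Parse CSV text one character at a time into a list of records."""
    data = []
    record = []
    cell = ""
    is_in_quotes = False

    for char in text:
        if char == '"':
            # Entering or leaving a quoted field; the quote itself is not stored.
            is_in_quotes = not is_in_quotes
        elif char == "," and not is_in_quotes:
            # An unquoted comma ends the current cell.
            record.append(cell)
            cell = ""
        elif char == "\n":
            # A newline ends the last cell and the whole record.
            record.append(cell)
            data.append(record)
            record = []
            cell = ""
        else:
            # Everything else, including commas inside quotes, belongs to the cell.
            cell += char

    # Addition to the pseudocode: keep a final line that has no trailing newline.
    if cell or record:
        record.append(cell)
        data.append(record)
    return data


sample = (
    '2020-09-30,BQ,"Bonaire, Sint Eustatius and Saba",AMRO,9,115,0,1\n'
    "2020-09-08,AF,Afghanistan,EMRO,96,38494,3,1415\n"
)

for row in parse_csv(sample):
    print(len(row), row[2])
```

To use it on the real file, pass it the text read from the file with the header row skipped first, then append each record's fields to the parallel arrays.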
