简体   繁体   中英

How to handle double quotes inside field values with csv module?

I'm trying to parse CSV files from an external system which I have no control of.

  • comma is used as a separator
  • when cell contains comma then it's wrapped in quotes and all other quotes are escaped with another quote character.
  • (my problem) when cell was not wrapped in quotes then all quote characters are escaped with another quote nonetheless.

Example CSV:

qw""erty,"a""b""c""d,ef""""g"

Should be parsed as:

[['qw"erty', 'a"b"c"d,ef""g']]

However, I think that Python's csv module does not expect quote characters to be escaped when cell was not wrapped in quote chars in the first place. csv.reader(my_file) (with default doublequote=True ) returns:

['qw""erty', 'a"b"c"d,ef""g']

Is there any way to parse this with python csv module ?

Following on @JackManey comment where he suggested to replace all instances of '""' inside of double quotes with '\\\\"' .

Recognizing if we are currently inside of double quoted cells turned out to be unnecessary and we can replace all instances of '""' with '\\\\"' . Python documentation says :

On reading, the escapechar removes any special meaning from the following character

However this would still break in the case where original cell already contains escape characters, example: 'qw\\\\\\\\""erty' producing [['qw\\\\"erty']] . So we have to escape the escape characters before parsing too.

Final solution:

with open(file_path, 'rb') as f:
  content = f.read().replace('\\', '\\\\').replace('""', '\\"')
  reader = csv.reader(StringIO(content), doublequote=False, escapechar='\\')
  return [row for row in reader]

就像@JackManey建议的那样,在读取文件后,您可以将单引号替换为双引号。

my_file_onequote = [col.replace('""', '"') for col in row for row in my_file]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM