How to handle double quotes inside field values with csv module?

I'm trying to parse CSV files from an external system that I have no control over.

  • A comma is used as the separator.
  • When a cell contains a comma, it is wrapped in quotes and all other quotes inside it are escaped with another quote character.
  • (My problem) When a cell is not wrapped in quotes, all quote characters inside it are escaped with another quote character nonetheless.

Example CSV:

qw""erty,"a""b""c""d,ef""""g"

Should be parsed as:

[['qw"erty', 'a"b"c"d,ef""g']]

However, I think that Python's csv module does not expect quote characters to be escaped when the cell was not wrapped in quote characters in the first place. csv.reader(my_file) (with the default doublequote=True) returns:

['qw""erty', 'a"b"c"d,ef""g']

Is there any way to parse this with Python's csv module?
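
For reference, the behaviour described above can be reproduced with a minimal sketch. It uses an in-memory string in place of the real file; the `sample` and `rows` names are only for illustration.

```python
import csv
from io import StringIO

# The example line from the question, held in memory instead of a file.
sample = 'qw""erty,"a""b""c""d,ef""""g"\n'

# Default dialect: doublequote=True, no escapechar.
rows = list(csv.reader(StringIO(sample)))
print(rows)  # [['qw""erty', 'a"b"c"d,ef""g']]
```

The doubled quotes are collapsed only in the second field, which was wrapped in quotes; the unquoted first field comes back unchanged.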

This follows up on @JackManey's comment, which suggested replacing all instances of '""' inside double quotes with '\\"'.

Recognizing whether we are currently inside a double-quoted cell turned out to be unnecessary: we can simply replace all instances of '""' with '\\"'. The Python documentation says:

On reading, the escapechar removes any special meaning from the following character.

However, this would still break when the original cell already contains escape characters, for example 'qw\\\\""erty' producing [['qw\\"erty']]. So we have to escape the escape characters before parsing, too.

Final solution:
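
As a small illustration of that sentence, the sketch below feeds the reader a line in which the quotes are already backslash-escaped. The `escaped` and `escaped_rows` names are made up for the example.

```python
import csv
from io import StringIO

# One unquoted field and one quoted field, each containing an escaped quote.
escaped = r'qw\"erty,"a\"b"'

reader = csv.reader(StringIO(escaped), doublequote=False, escapechar='\\')
escaped_rows = list(reader)
print(escaped_rows)  # [['qw"erty', 'a"b']]
```

The escaped quote is read as a literal character in both the unquoted and the quoted field, which is why the replacement does not need to know whether it is inside quotes.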
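
The difference can be seen in a short sketch that parses the same raw text twice, once with only the quote replacement and once with the backslashes escaped first. The `parse` helper and variable names are hypothetical, introduced just for this comparison.

```python
import csv
from io import StringIO

def parse(text):
    # Hypothetical helper: read text with backslash as the escape character.
    return list(csv.reader(StringIO(text), doublequote=False, escapechar='\\'))

raw = r'qw\\""erty'  # the cell really contains two backslashes

# Replacing only the doubled quotes loses one of the backslashes.
naive_rows = parse(raw.replace('""', '\\"'))

# Escaping the backslashes first keeps both of them.
safe_rows = parse(raw.replace('\\', '\\\\').replace('""', '\\"'))

print(naive_rows)  # [['qw\\"erty']]   -> the text qw\"erty
print(safe_rows)   # [['qw\\\\"erty']] -> the text qw\\"erty
```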

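import csv
from io import StringIO
def parse_csv(file_path):
    with open(file_path, newline='') as f:
        # Escape existing backslashes first, then turn each "" into \"
        content = f.read().replace('\\', '\\\\').replace('""', '\\"')
    reader = csv.reader(StringIO(content), doublequote=False, escapechar='\\')
    return list(reader)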
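
To check the approach against the example from the question without touching the file system, here is a minimal sketch of the same two replacements applied to an in-memory string. The function name `parse_doubled_quotes` is an assumption made for this example, not part of the original answer.

```python
import csv
from io import StringIO

def parse_doubled_quotes(text):
    """Parse CSV text in which every quote is doubled, quoted cell or not."""
    # Order matters: escape backslashes before introducing new ones.
    content = text.replace('\\', '\\\\').replace('""', '\\"')
    reader = csv.reader(StringIO(content), doublequote=False, escapechar='\\')
    return list(reader)

result = parse_doubled_quotes('qw""erty,"a""b""c""d,ef""""g"\n')
print(result)  # [['qw"erty', 'a"b"c"d,ef""g']]

# Edge case worth checking against real data: an empty quoted cell is
# also written as "", so the replacement turns it into a literal quote.
edge = parse_doubled_quotes('a,"",b\n')
print(edge)  # [['a', '"', 'b']]
```

If the external system can emit empty quoted cells, the blanket replacement would need to be refined for that case.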

As @JackManey suggested, after reading the file you can replace each doubled double quote with a single one.

my_file_onequote = [[col.replace('""', '"') for col in row] for row in my_file]
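
A quick sketch of this post-processing applied to the example line from the question, assuming the rows come from a default csv.reader (the `sample`, `parsed` and `cleaned` names are illustrative):

```python
import csv
from io import StringIO

sample = 'qw""erty,"a""b""c""d,ef""""g"\n'

# Parse with the default dialect, then collapse doubled quotes per cell.
parsed = list(csv.reader(StringIO(sample)))
cleaned = [[col.replace('""', '"') for col in row] for row in parsed]
print(cleaned)  # [['qw"erty', 'a"b"c"d,ef"g']]
```

The first cell comes out as wanted, but the second cell was already unescaped by the reader, so its legitimate `""` is collapsed as well and the result differs from the expected 'a"b"c"d,ef""g'.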
