
Pyspark reading csv delimiter not parsed for some data

csv_df = spark.read.option("header", "true") \
              .csv(path, sep='┐')

A small portion of the data cannot be parsed correctly and ends up entirely in the first column, in the format "str┐str┐str┐str┐str┐str┐str┐str", with the other columns null. The number of delimiters in these rows is the same as in the rows that were parsed correctly, and as far as I can tell there is nothing else special about the rows that failed to parse. Any idea what might be causing this and how to fix it?

An example that failed parsing:

FUNDACAO ESTATAL SAUDE DA FAMILIA FESF┐VIP┐BR┐Creative Cloud All Apps┐PAID┐SMB┐1┐1┐2┐2022-07-29

I usually don't like to write answers that aren't complete, but I'll go through the steps I've taken so far to debug and offer a possible solution.

I created a one-row .csv file with '┐' as the delimiting character (plus a header row):

Header1┐Header2┐Header3┐Header4┐Header5┐Header6┐Header7┐Header8┐Header9┐Header10
FUNDACAO ESTATAL SAUDE DA FAMILIA FESF┐VIP┐BR┐Creative Cloud All Apps┐PAID┐SMB┐1┐1┐2┐2022-07-29

And when I run the line:

csv_df = spark.read.option("header", "true").csv(path, sep = '┐') 

The dataframe loads correctly:

+--------------------+-------+-------+--------------------+-------+-------+-------+-------+-------+----------+
|             Header1|Header2|Header3|             Header4|Header5|Header6|Header7|Header8|Header9|  Header10|
+--------------------+-------+-------+--------------------+-------+-------+-------+-------+-------+----------+
|FUNDACAO ESTATAL ...|    VIP|     BR|Creative Cloud Al...|   PAID|    SMB|      1|      1|      2|2022-07-29|
+--------------------+-------+-------+--------------------+-------+-------+-------+-------+-------+----------+

However, if I put quotation marks around the first non-header row, then this will escape all of the '┐' delimiter symbols in that row, so they won't be parsed.

Header1┐Header2┐Header3┐Header4┐Header5┐Header6┐Header7┐Header8┐Header9┐Header10
"FUNDACAO ESTATAL SAUDE DA FAMILIA FESF┐VIP┐BR┐Creative Cloud All Apps┐PAID┐SMB┐1┐1┐2┐2022-07-29"

This will lead to the behavior you observed when you try to load the csv file:

+--------------------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
|             Header1|Header2|Header3|Header4|Header5|Header6|Header7|Header8|Header9|Header10|
+--------------------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
|FUNDACAO ESTATAL ...|   null|   null|   null|   null|   null|   null|   null|   null|    null|
+--------------------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
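The difference between the two runs can be reproduced without Spark at all. Python's standard-library csv module follows the same default quoting convention as Spark's CSV reader (a field wrapped in double quotes may contain the delimiter literally), so this minimal sketch shows why the quoted row collapses into a single field:

```python
import csv
import io

# The same row, with and without surrounding quotation marks.
unquoted = 'FUNDACAO ESTATAL SAUDE DA FAMILIA FESF┐VIP┐BR┐Creative Cloud All Apps┐PAID┐SMB┐1┐1┐2┐2022-07-29'
quoted = '"' + unquoted + '"'

def parse(line):
    # Default quoting rules: a double-quoted field may contain the
    # delimiter literally, so it is not split there.
    return next(csv.reader(io.StringIO(line), delimiter='┐'))

print(len(parse(unquoted)))  # 10 fields, one per '┐'
print(len(parse(quoted)))    # 1 field: the quotes escape every delimiter
```

The unquoted line yields ten fields; the quoted line yields exactly one, which is what Spark then loads into the first column while nulling the rest.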

Therefore, I think your csv file most likely has quotation marks around that row, or there are one or more stray characters near that row in the file that are causing the problem. In other words, the issue is probably with the csv file itself rather than with the pyspark csv parser.
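If stray quotation marks do turn out to be the culprit and you cannot clean the file, one possible workaround is to disable quote handling entirely when reading: in Spark that means passing an empty string for the quote option, e.g. `spark.read.option("header", "true").option("quote", "").csv(path, sep='┐')`. The same idea, sketched with the stdlib csv module so it runs standalone (note the caveat that the literal quote characters then remain inside the first and last field values):

```python
import csv
import io

quoted = '"FUNDACAO ESTATAL SAUDE DA FAMILIA FESF┐VIP┐BR┐Creative Cloud All Apps┐PAID┐SMB┐1┐1┐2┐2022-07-29"'

# QUOTE_NONE treats quotation marks as ordinary characters, so every '┐'
# splits a field again; the stray quotes stay attached to the field text.
row = next(csv.reader(io.StringIO(quoted), delimiter='┐', quoting=csv.QUOTE_NONE))
print(len(row))  # 10 fields again
print(row[0])    # first field still carries the leading '"'
```

This recovers the column structure at the cost of having to strip the leftover quote characters afterwards, so it is only worth doing if none of your fields legitimately contain the delimiter inside quotes.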
