How to read csv file with additional comma in quotes using pyspark?
To read a field with comma and quotes in csv where comma is delimiter - pyspark
I have a record like this in the input csv file:
"2017-11-01","2017-10-29","2017-11-04","4532491","","","","Natural States: "The Environmental Imagination" in Maine, Oregon, and the Nation","1000","Richard W. Judd"
When I read this csv in pyspark, the field "Natural States: "The Environmental Imagination" in Maine, Oregon, and the Nation"
gets split into separate columns.
>>> df = spark.read.csv('file.csv')
>>> df.show(truncate=False)
+----------+----------+----------+----------+----+----+----+---------------------------------------------------------+-------+----------------+----+---------------+
|_c0 |_c1 |_c2 |_c3 |_c4 |_c5 |_c6 |_c7 |_c8 |_c9 |_c10|_c11 |
+----------+----------+----------+----------+----+----+----+---------------------------------------------------------+-------+----------------+----+---------------+
|2017-11-01|2017-10-29|2017-11-04| 4532491 |null|null|null|Natural States: "The Environmental Imagination" in Maine | Oregon| and the Nation |1000|Richard W. Judd|
+----------+----------+----------+----------+----+----+----+---------------------------------------------------------+-------+----------------+----+---------------+
Is there any workaround other than changing the delimiter in the input file? We cannot modify the input file.
You can read the file with sparkContext, split each line on the multi-character delimiter "," (quote-comma-quote), and then convert the rdd to a dataframe as shown below.
rdd = sc.textFile("file.csv")

def replaceFunc(line):
    # Split on the quote-comma-quote sequence so commas inside quoted
    # fields are preserved, then strip the remaining double quotes.
    result = []
    for word in line.split("\",\""):
        result.append(word.replace("\"", ""))
    return result

rdd.map(replaceFunc).toDF().show(1, False)
You should get the following output:
+----------+----------+----------+-------+---+---+---+------------------------------------------------------------------------------+----+---------------+
|_1 |_2 |_3 |_4 |_5 |_6 |_7 |_8 |_9 |_10 |
+----------+----------+----------+-------+---+---+---+------------------------------------------------------------------------------+----+---------------+
|2017-11-01|2017-10-29|2017-11-04|4532491| | | |Natural States: The Environmental Imagination in Maine, Oregon, and the Nation|1000|Richard W. Judd|
+----------+----------+----------+-------+---+---+---+------------------------------------------------------------------------------+----+---------------+
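If you want named columns instead of the default _1 ... _10, you can pass a list of names to toDF. A minimal sketch, assuming hypothetical column names since the file has no header row:

# Hypothetical column names -- the source file does not provide a header.
columns = ["start_date", "prev_date", "next_date", "id",
           "col5", "col6", "col7", "title", "amount", "author"]

df = rdd.map(replaceFunc).toDF(columns)
df.select("title", "author").show(truncate=False)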
这可能会与sep='","'
类似,例如:
spark.read.csv('file.csv', sep='","')