Reading a CSV field that contains commas and quotes when comma is the delimiter - pyspark

Question

I have the following record in my input CSV file:

"2017-11-01","2017-10-29","2017-11-04","4532491","","","","Natural States: "The Environmental Imagination" in Maine, Oregon, and the Nation","1000","Richard W. Judd"

When I read this CSV in pyspark, the field "Natural States: "The Environmental Imagination" in Maine, Oregon, and the Nation" is split into several columns.

>>> df = spark.read.csv('file.csv')
>>> df.show(truncate=False)
+----------+----------+----------+----------+----+----+----+---------------------------------------------------------+-------+----------------+----+---------------+
|_c0       |_c1       |_c2       |_c3       |_c4 |_c5 |_c6 |_c7                                                      |_c8    |_c9             |_c10|_c11           |
+----------+----------+----------+----------+----+----+----+---------------------------------------------------------+-------+----------------+----+---------------+
|2017-11-01|2017-10-29|2017-11-04| 4532491  |null|null|null|Natural States: "The Environmental Imagination" in Maine | Oregon| and the Nation |1000|Richard W. Judd|
+----------+----------+----------+----------+----+----+----+---------------------------------------------------------+-------+----------------+----+---------------+

Is there any workaround other than changing the delimiter? We can't modify the input file.
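To see that the split comes from the unescaped inner quotes rather than from anything specific to the file, the record can be fed to a generic CSV parser outside Spark. The sketch below uses Python's standard `csv` module purely as a stand-in; it is not Spark's parser and the two may differ in details, but it also finds more fields than the 10 that were intended.

```python
import csv
import io

# The record from the question, verbatim.
line = (
    '"2017-11-01","2017-10-29","2017-11-04","4532491","","","",'
    '"Natural States: "The Environmental Imagination" in Maine, Oregon, and the Nation",'
    '"1000","Richard W. Judd"'
)

# Parse with the default dialect: comma delimiter, double-quote quoting.
row = next(csv.reader(io.StringIO(line)))

# The inner quote ends the quoted section early, so the commas inside
# the title are treated as delimiters.
print(len(row))
for field in row:
    print(repr(field))
```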

Answer 1

You can use sparkContext to read the file as text, split each line on the multi-character sequence "," and then convert the RDD to a DataFrame, as below:

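# sc is the SparkContext (spark.sparkContext outside the pyspark shell)
rdd = sc.textFile("file.csv")

def replaceFunc(line):
    # Split on the three-character sequence "," and then
    # strip every remaining double quote from each field.
    return [field.replace('"', '') for field in line.split('","')]

rdd.map(replaceFunc).toDF().show(1, False)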

You should get the following output:

+----------+----------+----------+-------+---+---+---+------------------------------------------------------------------------------+----+---------------+
|_1        |_2        |_3        |_4     |_5 |_6 |_7 |_8                                                                            |_9  |_10            |
+----------+----------+----------+-------+---+---+---+------------------------------------------------------------------------------+----+---------------+
|2017-11-01|2017-10-29|2017-11-04|4532491|   |   |   |Natural States: The Environmental Imagination in Maine, Oregon, and the Nation|1000|Richard W. Judd|
+----------+----------+----------+-------+---+---+---+------------------------------------------------------------------------------+----+---------------+
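Note that the output above has lost the quotes around The Environmental Imagination, because every double quote is removed. If the inner quotes should be kept, a variant that strips only the outermost quote on each side might look like the sketch below. The function name `split_quoted_fields` is hypothetical, and the approach assumes every field is quoted and that no field contains the sequence `","` itself.

```python
def split_quoted_fields(line):
    """Split a line whose fields are all wrapped in double quotes.

    Only the opening quote of the first field and the closing quote of
    the last field are dropped; quotes inside a field are preserved.
    """
    line = line.rstrip("\r\n")
    if line.startswith('"'):
        line = line[1:]
    if line.endswith('"'):
        line = line[:-1]
    # The three-character sequence "," separates adjacent quoted fields.
    return line.split('","')


# The record from the question, used as a quick check.
sample = (
    '"2017-11-01","2017-10-29","2017-11-04","4532491","","","",'
    '"Natural States: "The Environmental Imagination" in Maine, Oregon, and the Nation",'
    '"1000","Richard W. Judd"'
)
fields = split_quoted_fields(sample)
print(len(fields))
print(fields[7])
```

In Spark this could be mapped over the RDD in place of replaceFunc, for example rdd.map(split_quoted_fields).toDF().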

Answer 2

This might work with something like sep='","', for example:

spark.read.csv('file.csv', sep='","')

2 answers

solution1 2 ACCEPTED 2018-01-25 04:45:23
solution2 0 2018-01-25 04:19:21
