Reading csv file enclosed in double quote but with newline
I have a csv with newlines inside the columns. Here is my sample:
"A","B","C"
1,"This is csv with
newline","This is another column"
"This is newline
and another line","apple","cat"
I can read the file in Spark, but the newlines inside a column are treated as separate rows.
How can I parse this as a csv so that the text inside double quotes, including its newlines, stays in one field?
I read the file with Apache Spark (I also tried the Apache CSV plugin):
alarms = sc.textFile("D:\Dataset\oneday\oneday.csv")
This gives me the RDD:
**example.take(5)**
[u'A,B,C', u'1,"This is csv with ', u'newline",This is another column', u'"This is newline', u'and another line",apple,cat']
Spark version: 1.4
The csv module from the standard Python library works out of the box:
>>> txt = '''"A","B","C"
1,"This is csv with
newline","This is another column"
"This is newline
and another line","apple","cat"'''
>>> import csv
>>> import io
>>> with io.BytesIO(txt) as fd:
...     rd = csv.reader(fd)
...     for row in rd:
...         print row
['A', 'B', 'C']
['1', 'This is csv with \nnewline', 'This is another column']
['This is newline\nand another line', 'apple', 'cat']
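Note that `io.BytesIO(txt)` only works on Python 2, where `txt` is a byte string; on Python 3 the `csv` module expects a text stream, so you would wrap the string in `io.StringIO` instead (a small sketch of the same parse, not from the original answer):

```python
import csv
import io

txt = '''"A","B","C"
1,"This is csv with
newline","This is another column"
"This is newline
and another line","apple","cat"'''

# On Python 3, csv.reader consumes text, so wrap the string in StringIO.
with io.StringIO(txt) as fd:
    rows = list(csv.reader(fd))

for row in rows:
    print(row)
# ['A', 'B', 'C']
# ['1', 'This is csv with\nnewline', 'This is another column']
# ['This is newline\nand another line', 'apple', 'cat']
```

The embedded newlines stay inside their fields because the reader tracks quoting across physical lines.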
You can use this together with binaryFiles (at a significant performance penalty compared to textFile):
>>> (sc.binaryFiles(path)
...     .values()
...     .flatMap(lambda x: csv.reader(io.BytesIO(x))))
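On Python 3, `binaryFiles` yields `bytes` values, which must be decoded before `csv` can parse them. A hypothetical helper (the name `parse_csv_bytes` is my own, not from the answer) that could be passed to `flatMap`:

```python
import csv
import io

def parse_csv_bytes(data, encoding="utf-8"):
    # Decode the raw bytes of one whole file and parse it in one go,
    # so quoted newlines are kept inside their fields.
    return list(csv.reader(io.StringIO(data.decode(encoding))))

raw = b'"A","B","C"\n1,"two\nlines","x"'
rows = parse_csv_bytes(raw)
print(rows)
# [['A', 'B', 'C'], ['1', 'two\nlines', 'x']]
```

In Spark this would be used as `sc.binaryFiles(path).values().flatMap(parse_csv_bytes)`.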
You don't need to import anything. The solution below creates a second file for demonstration purposes only; you can use the modified lines directly without writing them anywhere.
with open(r'C:\Users\evkouni\Desktop\test_in.csv', 'r') as fin:
    with open(r'C:\Users\evkouni\Desktop\test_out.csv', 'w') as fout:
        cont = fin.readlines()
        for line in cont[:-1]:
            if line.count('"') % 2 == 1 and '"\n' not in line:
                line = line.replace('\n', '')
            fout.write(line)
#DEMO
#test_in.csv
#------------
#"A";"B";"C"
#1;"This is csv with
#newline";"This is another column"
#"This is newline
#test_out.csv
#------------
#"A";"B";"C"
#1;"This is csv with newline";"This is another column"
#"This is newline
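The quote-counting trick above merges only one continuation line at a time and skips the file's last line (`cont[:-1]`). A generalized sketch (my own variant, not the answer's code) that buffers lines until the double quotes balance, so records spanning any number of lines are joined:

```python
def join_quoted_lines(lines):
    # Accumulate physical lines until the number of double quotes seen
    # so far is even, i.e. every opened quoted field has been closed.
    buf = ""
    for line in lines:
        buf += line
        if buf.count('"') % 2 == 0:
            yield buf
            buf = ""
    if buf:  # unterminated quote at EOF: emit what we have as-is
        yield buf

raw = ['"A";"B";"C"\n',
       '1;"This is csv with\n',
       'newline";"This is another column"\n']
print(list(join_quoted_lines(raw)))
# ['"A";"B";"C"\n', '1;"This is csv with\nnewline";"This is another column"\n']
```

Each yielded string is one logical record; the embedded newline is preserved inside the quoted field rather than splitting the record.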
Let me know if anything is unclear.
If you want to create a dataframe from a csv that has newlines inside double-quoted fields without reinventing the wheel, use the spark-csv and commons-csv libraries:
from pyspark.sql import SQLContext
df = sqlContext.load(header="true",source="com.databricks.spark.csv", path = "hdfs://analytics.com.np:8020/hdp/badcsv.csv")