PySpark - Read CSV and skip own header

Question

I can't skip my own header in a CSV file while reading it with PySpark's read.csv.
The CSV file looks like this:

°°°°°°°°°°°°°°°°°°°°°°°°
°      My Header       °
°    Important Data    °
°        Data          °
°°°°°°°°°°°°°°°°°°°°°°°°

MYROW;SECONDROW;THIRDROW
290;6848;66484
96849684;68463;63848
84646;6484;98718

I can't figure out how to skip all of those first lines, or the first 'n' lines.
I tried something like:

    df_read = spark.read.csv('MyCSV-File.csv', sep=';') \
        .rdd.zipWithIndex() \
        .filter(lambda x: x[1] > 6) \
        .map(lambda x: x[0]) \
        .toDF('MYROW','SECONDROW','THIRDROW')

Is there any possibility to skip these lines, and in particular, how fast will it be? The data could be several GB. Thanks.

Answer 1

You can add a filter that drops the first lines:

    .filter(lambda line: not line.startswith("°"))
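A minimal sketch of how that filter could be wired into a full read, assuming the file is first loaded as plain text lines and that every banner line starts with "°". The helper names `is_data_line` and `read_csv_without_banner` are hypothetical, and `spark` is assumed to be an existing SparkSession:

```python
BANNER_MARKER = "°"  # assumption: every banner line starts with this character


def is_data_line(line, marker=BANNER_MARKER):
    """Return True for lines that are neither blank nor part of the banner."""
    stripped = line.strip()
    return bool(stripped) and not stripped.startswith(marker)


def read_csv_without_banner(spark, path, sep=";"):
    """Read `path` as text, drop banner and blank lines, then parse as CSV.

    `spark` is an existing SparkSession (hypothetical helper, not a Spark API).
    """
    lines = spark.sparkContext.textFile(path).filter(is_data_line)
    # DataFrameReader.csv also accepts an RDD of strings, so the
    # remaining lines can be parsed with the usual CSV options.
    return spark.read.csv(lines, sep=sep, header=True)
```

Calling `read_csv_without_banner(spark, 'MyCSV-File.csv')` should then yield the columns MYROW, SECONDROW and THIRDROW. The filter is applied line by line within each partition, so it avoids the extra Spark job that `zipWithIndex` triggers on a multi-partition RDD.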

Another option is to mark those lines as comments:

    .option("comment", "°")
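As a rough sketch, the comment option might be combined with the separator and header options like this. The helper names `banner_skipping_options` and `read_csv_with_comment` are hypothetical, `spark` is assumed to be an existing SparkSession, and it is assumed that Spark drops comment lines before it picks the header row:

```python
def banner_skipping_options(marker="°", sep=";"):
    """Build CSV reader options that treat banner lines as comments."""
    # The CSV 'comment' option takes a single character.
    if len(marker) != 1:
        raise ValueError("the CSV 'comment' option takes a single character")
    return {"comment": marker, "sep": sep, "header": "true"}


def read_csv_with_comment(spark, path, marker="°", sep=";"):
    """Read `path` with lines starting with `marker` skipped as comments.

    `spark` is an existing SparkSession (hypothetical helper, not a Spark API).
    """
    return spark.read.options(**banner_skipping_options(marker, sep)).csv(path)
```

This keeps everything inside the DataFrame reader, so no round trip through an RDD is needed. It only helps when all lines to skip share the same leading character, and it does not cover skipping an arbitrary number of 'n' lines.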
