How to exclude the first line with spark.read method?
I am using Databricks with PySpark. I would like to read this file (with the spark.read method), excluding the first and last lines:
<X> example </X>
ID
11111
22222
<X> example </X>
How do I exclude the first row (or multiple rows, if that is the case)? I have tried using:
df = spark.read \
    .options(header='false') \
    .csv("myfile.txt")
I was unsuccessful. Alternatively, I tried adding a '#' character to the beginning of the file (so that spark.read can interpret that line as a comment and ignore it), but I am dealing with very large files and would like to avoid rewriting them or any unnecessary steps that lengthen the process:
line = "#"
with open("myfile.txt", "r+") as file:
    file_data = file.read()
    file.seek(0, 0)
    file.write(line + file_data)
Have you tried using the option ("mode", "DROPMALFORMED")? This option will drop bad rows automatically. If the rows are still being read, then try to filter them out. Refer to this article as well: https://medium.com/@11amitvishwas/how-to-handle-bad-records-corrupt-records-in-apache-spark-392f2991cbb5