
How to exclude the first line with spark.read method?

I am using Databricks with PySpark. I would like to read this file (with the spark.read method), excluding the first and last lines:

<X> example </X>
ID
11111   
22222    
<X> example </X>

How do I exclude the first row (or multiple rows, if that is the case)? I have tried using:

  df = spark.read \
    .options(header='false') \
    .csv("myfile.txt")

I was unsuccessful. Alternatively, I could add a '#' character to the beginning of the file (so that spark.read can interpret that line as a comment and ignore it), but I am dealing with very large files and would like to avoid rewriting them or any unnecessary steps that lengthen the process:

  line = """#"""
  with open("myfile.txt", 'r+') as file: 
    file_data = file.read() 
    file.seek(0, 0) 
    file.write(line + file_data) 
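
For reference, after such a rewrite the read could skip the prefixed line via the CSV reader's comment option; a minimal sketch, assuming the same myfile.txt:

  # The "comment" option makes the CSV reader skip any line that starts
  # with the given character, so the '#'-prefixed first line is ignored.
  df = (spark.read
        .option("comment", "#")
        .csv("myfile.txt"))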
    

Have you tried using the option ("mode", "DROPMALFORMED")? This option will drop bad rows automatically. If the rows are still being read, then try to filter them. Refer to this article as well: https://medium.com/@11amitvishwas/how-to-handle-bad-records-corrupt-records-in-apache-spark-392f2991cbb5
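
A minimal sketch of both suggestions, assuming the sample file above is "myfile.txt" and that it holds a single integer ID column (the schema is an assumption, not from the question):

  from pyspark.sql.types import StructType, StructField, IntegerType
  from pyspark.sql.functions import col, trim

  # With an explicit schema, the <X> marker lines and the literal "ID"
  # header fail to parse as integers, so DROPMALFORMED drops those rows.
  schema = StructType([StructField("ID", IntegerType(), True)])

  df = (spark.read
        .schema(schema)
        .option("mode", "DROPMALFORMED")
        .option("ignoreTrailingWhiteSpace", "true")  # sample rows have trailing spaces
        .csv("myfile.txt"))

  # Fallback if bad rows still get through: read as plain text and
  # filter out the marker and header lines before any casting.
  raw = spark.read.text("myfile.txt")
  clean = (raw.filter(~col("value").startswith("<X>"))
              .filter(trim(col("value")) != "ID"))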
