How to exclude the first line with spark.read method?
I am using Databricks with PySpark. I would like to read this file (with the spark.read method), excluding the first and last lines:
<X> example </X>
ID
11111
22222
<X> example </X>
How do I exclude the first row (or multiple rows, if that is the case)? I have tried using:
df = spark.read \
    .options(header='false') \
    .csv("myfile.txt")
I was unsuccessful. Alternatively, I tried adding a '#' character to the beginning of the file (so that spark.read can interpret that line as a comment and ignore it), but I am dealing with very large files and would like to avoid rewriting them or any unnecessary steps that lengthen the process:
line = "#"
with open("myfile.txt", "r+") as file:
    file_data = file.read()
    file.seek(0, 0)
    file.write(line + file_data)
Have you tried using the option ("mode", "DROPMALFORMED")? This option will drop bad rows automatically. If the rows are still being read, then try to filter them out. Refer to this article as well: https://medium.com/@11amitvishwas/how-to-handle-bad-records-corrupt-records-in-apache-spark-392f2991cbb5