How do I get the next lines in a file whenever the word ERROR appears in a line, in PySpark?
I have a log file in which I need to check each line. Whenever the word "ERROR" appears in a line, I need to take the next two lines after that line. I have to do this in PySpark.
For example, input log file:
line 1
line 2
line...ERROR... 3
line 4
line 5
line 6
Output will be:
line 4
line 5
I have created an RDD from the log file and am using map() to traverse each line, but I can't work out the exact approach.
Thanks in advance.
What about something like:
# open your file; adjust the path to your log file
with open("log.txt") as f:
    lines = f.readlines()

for i, line in enumerate(lines):
    if "ERROR" in line:
        print(lines[i + 1])
        print(lines[i + 2])
        # exit or do whatever else you need here
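Since the question mentions working with an RDD, here is a minimal sketch of the same idea in PySpark, using zipWithIndex to attach a position to each line (the file path "log.txt" and the variable names are assumptions):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# pair each line with its position in the file
indexed = sc.textFile("log.txt").zipWithIndex()  # (line, index)

# collect the positions of all lines containing "ERROR"
# (assumes the number of error lines is small enough to fit on the driver)
error_idx = set(indexed.filter(lambda x: "ERROR" in x[0]).map(lambda x: x[1]).collect())

# keep a line if it sits one or two positions after an error line
result = indexed.filter(lambda x: (x[1] - 1) in error_idx or (x[1] - 2) in error_idx)
print(result.map(lambda x: x[0]).collect())  # ['line 4', 'line 5'] for the sample input

Collecting the error positions to the driver keeps the code simple; if errors can be very numerous, a join on the shifted index would scale better.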
Here is a method using window functions:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# set up DF
df = sc.parallelize([["line1"], ["line2"], ["line3..ERROR"], ["line4"], ["line5"]]).toDF(['col'])

# create an indicator that marks a boundary at each error line
win1 = Window.orderBy('col')
df = df.withColumn('hit_error', F.expr("case when col like '%ERROR%' then 1 else 0 end"))
df = df.withColumn('cum_error', F.sum('hit_error').over(win1))

# now number the lines within each error group
win2 = Window.partitionBy('cum_error').orderBy('col')
df = df.withColumn('rownum', F.row_number().over(win2))

# the lines we want are rows 2 and 3 of each group (the error line itself is row 1)
df.filter("cum_error > 0 and rownum in (2, 3)").select("col").show(10)