How do I get the next lines in a file whenever the word ERROR appears in a line in PySpark?

I have a log file in which I need to check each line. Whenever the word "ERROR" appears in a line, I need to take the two lines that follow it. I have to do this in PySpark.

For example, the input log file:

line 1

line 2

line...ERROR... 3

line 4

line 5

line 6

The output will be:

line 4

line 5

I have created an RDD from the log file and am using map() to traverse each line, but I have not worked out exactly how to do it.

Thanks in advance.

What about something like:

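# open your file as f ("app.log" is a placeholder path)
with open("app.log") as f:
    lines = f.readlines()
for i, line in enumerate(lines):
    if "ERROR" in line:
        # print the two lines that follow the match (or do whatever you need here)
        print("".join(lines[i + 1:i + 3]), end="")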
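
Note that readlines() loads the whole file into memory on one machine. For a large log, the same check can be done in a single pass with a generator. This is a minimal sketch in plain Python, not PySpark; `following_lines`, `n` and `marker` are hypothetical names chosen for illustration.

```python
def following_lines(lines, n=2, marker="ERROR"):
    """Yield the n lines that follow each line containing marker."""
    remaining = 0  # how many upcoming lines still have to be emitted
    for line in lines:
        if remaining > 0:
            yield line
            remaining -= 1
        if marker in line:
            # (re)start the countdown after every matching line
            remaining = n


sample = ["line 1", "line 2", "line...ERROR... 3", "line 4", "line 5", "line 6"]
print(list(following_lines(sample)))  # ['line 4', 'line 5']
```

Because it accepts any iterable, an open file object can be passed in directly instead of a list.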

Here is a method using windowing functions:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# set up DF (assumes an existing SparkContext `sc`, as in the pyspark shell)
df = sc.parallelize([["line1"], ["line2"], ["line3..ERROR"], ["line4"], ["line5"]]).toDF(['col'])

# create an indicator that marks a boundary between consecutive errors
# note: ordering by 'col' matches file order only because the sample lines sort that way
win1 = Window.orderBy('col')
df = df.withColumn('hit_error', F.expr("case when col like '%ERROR%' then 1 else 0 end"))
df = df.withColumn('cum_error', F.sum('hit_error').over(win1))

# now count the lines between each error occurrence
win2 = Window.partitionBy('cum_error').orderBy('col')
df = df.withColumn('rownum', F.row_number().over(win2))

# the lines we want are rows 2,3
df.filter("cum_error>0 and rownum in (2,3)").select("col").show(10)
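
The windows above are ordered by the text of `col`, which reproduces file order only for data like this sample. A possible alternative, sketched under the assumption that the log is already an RDD of lines (as in the question): `zipWithIndex()` attaches each line's position, and a join keeps the positions that follow an ERROR line. The names `lines_after_error`, `n`, `marker` and `demo` are hypothetical, not part of PySpark.

```python
def lines_after_error(rdd, n=2, marker="ERROR"):
    """Return an RDD holding the n lines that follow each line containing marker."""
    # (line, idx) -> (idx, line), so the position becomes the key
    indexed = rdd.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
    # for every matching line, emit the positions of the n lines after it
    wanted = (indexed
              .filter(lambda kv: marker in kv[1])
              .flatMap(lambda kv: [(kv[0] + k, None) for k in range(1, n + 1)])
              .distinct())
    # keep only the requested positions and restore file order
    return indexed.join(wanted).sortByKey().map(lambda kv: kv[1][0])


def demo():
    # imported here so that the helper above does not depend on a Spark install
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("next-lines").getOrCreate()
    rdd = spark.sparkContext.parallelize(
        ["line 1", "line 2", "line...ERROR... 3", "line 4", "line 5", "line 6"])
    print(lines_after_error(rdd).collect())  # ['line 4', 'line 5']
    spark.stop()
```

Call `demo()` to try it locally. Unlike the window version, this sketch also returns a line that itself contains ERROR when it falls within the two lines after an earlier error.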

