How do I get the next lines in a file whenever the word ERROR appears in a line, in PySpark?
I have a log file in which I need to check each line. Whenever the word "ERROR" appears in a line, I need to take the next two lines after that line. I have to do this in PySpark.
For example, input log file:
line 1
line 2
line...ERROR... 3
line 4
line 5
line 6
Output will be:
line 4
line 5
I have created an RDD from the log file and am using map() to traverse each line, but I can't work out the exact approach.
Thanks in advance.
What about something like:
# open your file; adjust the path to your log file
with open("log.txt") as f:
    lines = f.readlines()

for i, line in enumerate(lines):
    if "ERROR" in line:
        print(lines[i + 1])
        print(lines[i + 2])
        # exit or do whatever else you need here
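Since the question mentions working with an RDD, here is a minimal sketch of the same idea in PySpark, using zipWithIndex to attach a position to each line (the file path "log.txt" and the variable names are assumptions):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# pair each line with its position in the file
indexed = sc.textFile("log.txt").zipWithIndex()  # (line, index)

# collect the positions of all lines containing "ERROR"
# (assumes the number of error lines is small enough to fit on the driver)
error_idx = set(indexed.filter(lambda x: "ERROR" in x[0]).map(lambda x: x[1]).collect())

# keep a line if it sits one or two positions after an error line
result = indexed.filter(lambda x: (x[1] - 1) in error_idx or (x[1] - 2) in error_idx)
print(result.map(lambda x: x[0]).collect())  # ['line 4', 'line 5'] for the sample input

Collecting the error positions to the driver keeps the code simple; if errors can be very numerous, a join on the shifted index would scale better.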
Here is a method using window functions:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# set up DF
df = sc.parallelize([["line1"], ["line2"], ["line3..ERROR"], ["line4"], ["line5"]]).toDF(['col'])

# create an indicator that marks a boundary at each error line
win1 = Window.orderBy('col')
df = df.withColumn('hit_error', F.expr("case when col like '%ERROR%' then 1 else 0 end"))
df = df.withColumn('cum_error', F.sum('hit_error').over(win1))

# now number the lines within each error group
win2 = Window.partitionBy('cum_error').orderBy('col')
df = df.withColumn('rownum', F.row_number().over(win2))

# the lines we want are rows 2 and 3 of each group (the error line itself is row 1)
df.filter("cum_error > 0 and rownum in (2, 3)").select("col").show(10)