
Is it possible to select specific regex capture groups to create columns in Pandas?

I'm currently trying to write a program in Python that reads a syslog and runs a load of regexes over it, extracting the data into columns in a dataframe that I can then run analysis on.

Here is a sample of the syslog file:

Feb  1 00:00:02 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=192.150.249.87 DST=11.11.11.84 LEN=40 TOS=0x00 PREC=0x00 TTL=110 ID=12973 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb  1 00:00:02 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=24.17.237.70 DST=11.11.11.95 LEN=40 TOS=0x00 PREC=0x00 TTL=113 ID=27095 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb  1 00:00:07 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=192.150.249.87 DST=11.11.11.85 LEN=40 TOS=0x00 PREC=0x00 TTL=110 ID=13801 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb  1 00:00:17 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=192.150.249.87 DST=11.11.11.87 LEN=40 TOS=0x00 PREC=0x00 TTL=110 ID=15432 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb  1 00:00:24 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=24.17.237.70 DST=11.11.11.100 LEN=40 TOS=0x00 PREC=0x00 TTL=113 ID=31168 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb  1 00:00:27 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=192.150.249.87 DST=11.11.11.89 LEN=40 TOS=0x00 PREC=0x00 TTL=110 ID=17292 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb  1 00:00:31 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=211.168.230.94 DST=11.11.11.70 LEN=48 TOS=0x00 PREC=0x00 TTL=110 ID=7204 DF PROTO=TCP SPT=1208 DPT=135 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb  1 00:00:31 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=211.168.230.94 DST=11.11.11.72 LEN=48 TOS=0x00 PREC=0x00 TTL=110 ID=7206 DF PROTO=TCP SPT=1210 DPT=135 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb  1 00:00:31 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=211.168.230.94 DST=11.11.11.64 LEN=48 TOS=0x00 PREC=0x00 TTL=110 ID=7198 DF PROTO=TCP SPT=1202 DPT=135 WINDOW=16384 RES=0x00 SYN URGP=0  

So far I have achieved this in PySpark, loading the data files with:

raw_data_files = glob.glob('*.log')
base_df = spark.read.text(raw_data_files)

and extracting the data into columns like this:

ts_pattern = r'((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(\d+)\s+((0[0-9]|1[0-9]|2[0-3])(\:)(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9])(\:)(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9])))'
mth_pattern = r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d+\s+'
dy_pattern = r'\s+(\d+)\s+'
tme_pattern = r'(([01]?\d|2[0-3]|24(?=:00?:00?$)):([0-5]\d):([0-5]\d))'
src_pattern = r'((\sSRC.*?=)([0-9a-f\.]*))'
dst_pattern = r'((\sDST.*?=)([0-9a-f\.]*))'
dvc_pattern = r'((([01]?\d|2[0-3]|24(?=:00?:00?$)):([0-5]\d):([0-5]\d))\s)(\S+)'
svc_pattern = r'((([01]?\d|2[0-3]|24(?=:00?:00?$)):([0-5]\d):([0-5]\d))\s)(\S+)\s((?:(?!:).)*)'
ufw_pattern = r'(\bUFW\s\b)([^]\s]+)'
sts_pattern = r'((?<=stats:\s).*)'
in_pattern = r'((\bIN=\b)([^\s]+))'
physin_pattern = r'((\bPHYSIN=\b)([^\s]+))'
out_pattern = r'((\bOUT=\b)([^\s]+))'
physout_pattern = r'((\bPHYSOUT=\b)([^\s]+))'
mac_pattern = r'((\bMAC=\b)([^\s]+))'
len_pattern = r'((\bLEN=\b)([^\s]+))'
tos_pattern = r'((\bTOS=\b)([^\s]+))'
prec_pattern = r'((\bPREC=\b)([^\s]+))'
ttl_pattern = r'((\bTTL=\b)([^\s]+))'
id_pattern = r'((\bID=\b)([^\s]+))'
flag_pattern = r'\s(CE|DF|MF)\s'
prtc_pattern = r'((\bPROTO=\b)([^\s]+))'
spt_pattern = r'((\bSPT=\b)([^\s]+))'
dpt_pattern = r'((\bDPT=\b)([^\s]+))'
recseq_pattern = r'((\bSEQ=\b)([^\s]+))'
ackseq_pattern = r'((\bACK=\b)([^\s]+))'
win_pattern = r'((\bWINDOW=\b)([^\s]+))'
res_pattern = r'((\bRES=\b)([^\s]+))'
synurgp_pattern = r'((\bSYN\sURGP=\b)([^\s]+))'
status_pattern = r'((\bkernel:\s\b)(((eth|br|dev|Ker).*)))'

# Combined regex
logs_df = base_df.select(regexp_extract('value', mth_pattern, 1).alias('month'),
                         regexp_extract('value', dy_pattern, 1).alias('day'),
                         regexp_extract('value', tme_pattern, 1).alias('time'),
                         regexp_extract('value', src_pattern, 3).alias('source'),
                         regexp_extract('value', dst_pattern, 3).alias('destination'),
                         regexp_extract('value', dvc_pattern, 6).alias('device'),
                         regexp_extract('value', svc_pattern, 7).alias('service'),
                         regexp_extract('value', ufw_pattern, 2).alias('ufw rule'),
                         regexp_extract('value', sts_pattern, 1).alias('stats event'),
                         regexp_extract('value', status_pattern, 3).alias('status event'),
                         regexp_extract('value', in_pattern, 3).alias('input'),
                         regexp_extract('value', physin_pattern, 3).alias('physinput'),
                         regexp_extract('value', out_pattern, 3).alias('output'),
                         regexp_extract('value', physout_pattern, 3).alias('physout'),
                         regexp_extract('value', mac_pattern, 3).alias('mac address'),
                         regexp_extract('value', len_pattern, 3).cast('integer').alias('length'),
                         regexp_extract('value', tos_pattern, 3).alias('typeofservice'),
                         regexp_extract('value', prec_pattern, 3).alias('precedence'),
                         regexp_extract('value', ttl_pattern, 3).cast('integer').alias('timetolive'),
                         regexp_extract('value', id_pattern, 3).cast('integer').alias('id'),
                         regexp_extract('value', flag_pattern, 1).alias('flag'),
                         regexp_extract('value', prtc_pattern, 3).alias('protocol'),
                         regexp_extract('value', spt_pattern, 3).cast('integer').alias('sourceport'),
                         regexp_extract('value', dpt_pattern, 3).cast('integer').alias('destinationport'),
                         regexp_extract('value', recseq_pattern, 3).cast('integer').alias('recseqnumber'),
                         regexp_extract('value', ackseq_pattern, 3).cast('integer').alias('ackseqnumber'),
                         regexp_extract('value', win_pattern, 3).cast('integer').alias('window'),
                         regexp_extract('value', res_pattern, 3).alias('reserved'),
                         regexp_extract('value', synurgp_pattern, 3).cast('integer').alias('synurgp'))

Printing the dataframe:

+-----+---+--------+--------------+------------+------+-------+--------+-----------+------------+-----+---------+------+-------+-----------+------+-------------+----------+----------+-----+----+--------+----------+---------------+------------+------------+------+--------+-------+
|month|day|    time|        source| destination|device|service|ufw rule|stats event|status event|input|physinput|output|physout|mac address|length|typeofservice|precedence|timetolive|   id|flag|protocol|sourceport|destinationport|recseqnumber|ackseqnumber|window|reserved|synurgp|
+-----+---+--------+--------------+------------+------+-------+--------+-----------+------------+-----+---------+------+-------+-----------+------+-------------+----------+----------+-----+----+--------+----------+---------------+------------+------------+------+--------+-------+
|  Feb|  1|00:00:02|192.150.249.87| 11.11.11.84|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    40|         0x00|      0x00|       110|12973|    |     TCP|       220|           6129|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:02|  24.17.237.70| 11.11.11.95|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    40|         0x00|      0x00|       113|27095|    |     TCP|       220|           6129|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:07|192.150.249.87| 11.11.11.85|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    40|         0x00|      0x00|       110|13801|    |     TCP|       220|           6129|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:17|192.150.249.87| 11.11.11.87|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    40|         0x00|      0x00|       110|15432|    |     TCP|       220|           6129|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:24|  24.17.237.70|11.11.11.100|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    40|         0x00|      0x00|       113|31168|    |     TCP|       220|           6129|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:27|192.150.249.87| 11.11.11.89|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    40|         0x00|      0x00|       110|17292|    |     TCP|       220|           6129|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:31|211.168.230.94| 11.11.11.70|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7204|  DF|     TCP|      1208|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:31|211.168.230.94| 11.11.11.72|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7206|  DF|     TCP|      1210|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:31|211.168.230.94| 11.11.11.64|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7198|  DF|     TCP|      1202|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.69|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7203|  DF|     TCP|      1207|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.73|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7207|  DF|     TCP|      1211|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.75|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7209|  DF|     TCP|      1213|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.80|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7214|  DF|     TCP|      1218|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.67|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7201|  DF|     TCP|      1205|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.71|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7205|  DF|     TCP|      1209|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|192.150.249.87| 11.11.11.90|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    40|         0x00|      0x00|       110|18107|    |     TCP|       220|           6129|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.70|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7235|  DF|     TCP|      1208|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.72|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7237|  DF|     TCP|      1210|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.69|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7234|  DF|     TCP|      1207|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.80|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7233|  DF|     TCP|      1218|            135|        null|        null| 16384|    0x00|      0|
+-----+---+--------+--------------+------------+------+-------+--------+-----------+------------+-----+---------+------+-------+-----------+------+-------------+----------+----------+-----+----+--------+----------+---------------+------------+------------+------+--------+-------+

This approach works great and I'm happy with the results. However, I'd like to do the same without Apache Spark if possible. It's a great tool, but it feels like overkill for this use case.

I have tried to get a similar result with Pandas by reading the file as fixed-width (read_fwf), which isn't ideal because the log format causes it to be split into odd columns. That wouldn't be a problem if I could extract the data from the dataframe using specific regex groups, as in PySpark, but I can't find any way to do that in Pandas. Nor can I find a way to compile all the regexes into one command and create the columns that way using the specified groups.

I have tried merging all the columns of the Pandas dataframe into one and then using str.extract to pull the data out into new columns. But that approach only worked when my regex had a single capture group, and there's no way to do that and still get the results I need.

I have also tried combining all my regexes into a single pattern and extracting the data that way, like this:

import re
import pandas as pd
import glob
import numpy as np

# define regex patterns
mth = r'(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d+\s+'
dy = r'\s+(?P<day>\d+)\s+'
tme = r'(?P<time>([01]?\d|2[0-3]|24(?=:00?:00?$)):([0-5]\d):([0-5]\d))'
src = r'\sSRC.*?=?(?P<source>[0-9a-f\.]*)'
dst = r'\sDST.*?=(?P<destination>[0-9a-f\.]*)'

# import all files 
data = pd.read_fwf('/home/fred/Documents/PythonProjects/LogFileReader/SotM30-anton.log', header=None)
all_re = re.compile('({0}{1}{2}{3}{4})'.format(mth, dy, tme, src, dst))
# parse to dataframe
df = pd.DataFrame(data)

# format to single column
df = df.rename(columns={0 : 'c1', 1 : 'c2', 2 : 'c3', 3 : 'c4', 4 : 'c5'})
cols = ['c1', 'c2', 'c3', 'c4', 'c5']
df['combined'] = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
print(df)

Printing the result (df):

         c1  c2        c3      c4       c5                                                  5                       combined
0       Feb   1  00:00:02  bridge  kernel:  INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU...  Feb 1 00:00:02 bridge kernel:
1       Feb   1  00:00:02  bridge  kernel:  INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU...  Feb 1 00:00:02 bridge kernel:
2       Feb   1  00:00:07  bridge  kernel:  INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU...  Feb 1 00:00:07 bridge kernel:
3       Feb   1  00:00:17  bridge  kernel:  INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU...  Feb 1 00:00:17 bridge kernel:
4       Feb   1  00:00:24  bridge  kernel:  INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU...  Feb 1 00:00:24 bridge kernel:
...     ...  ..       ...     ...      ...                                                ...                            ...
307519  Feb   7  14:35:06  bridge  kernel:  OUTG CONN TCP: IN=br0 PHYSIN=eth1 OUT=br0 PHYS...  Feb 7 14:35:06 bridge kernel:
307520  Feb   7  14:36:12  bridge  kernel:  INBOUND UDP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU...  Feb 7 14:36:12 bridge kernel:
307521  Feb   7  14:39:04  bridge  kernel:  INBOUND UDP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU...  Feb 7 14:39:04 bridge kernel:
307522  Feb   7  14:39:11  bridge  kernel:  INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU...  Feb 7 14:39:11 bridge kernel:
307523  Feb   7  14:40:06  bridge  kernel:  OUTG CONN TCP: IN=br0 PHYSIN=eth1 OUT=br0 PHYS...  Feb 7 14:40:06 bridge kernel:
new_df = df[['combined']].copy()
x = new_df['combined'].str.extract(all_re, expand=True)
print(x)

But this just results in a dataframe full of NaN (the result of print(x)):

          0 month  day time    4    5    6 source destination
0       NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
1       NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
2       NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
3       NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
4       NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
...     ...   ...  ...  ...  ...  ...  ...    ...         ...
307519  NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
307520  NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
307521  NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
307522  NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
307523  NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN

It does seem to be picking up the capture group names and using them as column headers, but since it isn't picking up any values, something has gone wrong.

To summarise: I want to use regular expressions to extract data from a syslog file and parse the results into columns in a dataframe so that I can run analysis on the data. I can do this with Apache Spark, but I'd rather not; I'd like to use Pandas, but I don't know how to specify the regex capture groups and compile them together the way I can in Apache Spark. Any help would be greatly appreciated.

The group dictionary from the standard re library might be useful to you: groupdict

Once you have the dictionaries of matches, you should be able to iterate over them to build a table.
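A minimal sketch of that approach, using two of the sample lines from the question: run a pattern with named groups over each line, collect one groupdict() per match, and feed the list of dicts to the DataFrame constructor. The pattern below is deliberately trimmed to a few fields for illustration; extend it with the rest as needed.

```python
import re
import pandas as pd

# Two of the sample lines from the question
lines = [
    "Feb  1 00:00:02 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 "
    "PHYSOUT=eth1 SRC=192.150.249.87 DST=11.11.11.84 LEN=40 TOS=0x00 PREC=0x00 "
    "TTL=110 ID=12973 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0",
    "Feb  1 00:00:31 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 "
    "PHYSOUT=eth1 SRC=211.168.230.94 DST=11.11.11.70 LEN=48 TOS=0x00 PREC=0x00 "
    "TTL=110 ID=7204 DF PROTO=TCP SPT=1208 DPT=135 WINDOW=16384 RES=0x00 SYN URGP=0",
]

# One combined pattern, trimmed to a few named groups for illustration
pattern = re.compile(
    r'(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+'
    r'(?P<day>\d+)\s+'
    r'(?P<time>\d{2}:\d{2}:\d{2})\s+'
    r'(?P<device>\S+).*?'
    r'SRC=(?P<source>[0-9a-f.:]+)\s+'
    r'DST=(?P<destination>[0-9a-f.:]+)'
)

# groupdict() yields {group_name: value} for each matching line;
# a list of those dicts feeds straight into the DataFrame constructor
rows = [m.groupdict() for line in lines if (m := pattern.search(line))]
df = pd.DataFrame(rows)
print(df)
```

Lines that don't match the pattern are simply skipped, which is handy when the syslog mixes TCP, UDP and status lines with different field sets.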

Also take a look at the tokenizer example in the re docs: Writing a Tokenizer

Hope this helps!
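An alternative that maps most directly onto the Spark version: pandas' Series.str.extract returns a DataFrame with one column per capture group when the pattern contains several, and named groups become the column names, so a single combined pattern can replace all the per-field regexp_extract calls. A minimal sketch on shortened sample lines (same field layout as the question's syslog):

```python
import pandas as pd

# Shortened sample lines with the same field layout as the question's syslog
s = pd.Series([
    "Feb  1 00:00:02 bridge kernel: INBOUND TCP: SRC=192.150.249.87 "
    "DST=11.11.11.84 PROTO=TCP SPT=220 DPT=6129",
    "Feb  1 00:00:31 bridge kernel: INBOUND TCP: SRC=211.168.230.94 "
    "DST=11.11.11.70 PROTO=TCP SPT=1208 DPT=135",
])

# One pattern, several named groups: str.extract returns a DataFrame
# with one column per capture group, named after the groups
df = s.str.extract(
    r'SRC=(?P<source>\S+)\s+DST=(?P<destination>\S+)'
    r'.*?SPT=(?P<sourceport>\d+)\s+DPT=(?P<destinationport>\d+)'
)

# equivalent of Spark's .cast('integer')
df['sourceport'] = df['sourceport'].astype(int)
print(df)
```

Rows where the pattern finds no match come back as NaN rather than raising, so the combined column trick from the question isn't needed at all: read the raw lines into a single-column frame and extract directly.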
