
Tokenize a sentence into words using regex

I'm using this code to predict different fields with a CRF model, and it actually works pretty well, except for the regex!

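import re
import numpy as np

# word2index, index2tag, MAX_SENTENCE and ner_model are defined elsewhere (not shown here)
sentence='[\'Viktoria Jan 1 12:13:16 google 10.0.0.0 port 448 ssh2 Jan 5 02:17:14 nginx 10.0.0.0 05/Jan/2019:02:17:14 +0100 GET /test/bin/test/test.exe HTTP/X.X " Mozilla/X.0 [en] (XX, U; Test-TT 0.0.0) " \']'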
rx = re.compile(r'\b(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\w+)\W+(\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+\[([^][\s]+)\s+([+\d]+)]\s+"([A-Z]+)\s+(\S+)\s+(\S+)"\s+(\d+)\s+(\d+)\s+\S+\s+"([^"]*)"')

m = rx.search(sentence)

if m:
  sentence=list(m.groups())  # groups() returns a tuple; the padding below needs a list
else:
  sentence=sentence.split(' ')

padded_sentence = sentence + [word2index["--PADDING--"]] * (MAX_SENTENCE - len(sentence))
padded_sentence = [word2index.get(w, 0) for w in padded_sentence]

pred = ner_model.predict(np.array([padded_sentence]))
pred = np.argmax(pred, axis=-1)

retval = ""
for w, p in zip(sentence, pred[0]):
  retval = retval + "{:15}: {:5}".format(w, index2tag[p])+ "\n"

print(retval)

As output I get the following:

['Viktoria     : USER 
Jan            : MONTH
1              : DAY  
12:13:16       : Time 
google         : HOSTNAME 
10.0.0.0       : SRC_IP
port           : O    
448            : SRC_PORT
ssh2           : O    
Jan            : MONTH
5              : DAY  
02:17:14       : Time 
nginx          : O    
10.0.0.0       : SRC_IP
05/Jan/2019:02:17:14: TIMESTAMP
+0100          : O    
GET            : METHOD
/test/bin/test/test.exe : PATH 
HTTP/X.X       : HTTP_VERSION  
"              : O    
Mozilla/X.0    : O    
[en]           : O    
(XX ,          : O    
U;             : O    
Test-TT        : O    
0.0.0)         : O    
"              : O    
']             : O  

What I actually want is to see the user agent as a single word in the output, something like this:

['Viktoria     : USER 
Jan            : MONTH
1              : DAY  
12:13:16       : Time 
google         : HOSTNAME
10.0.0.0       : SRC_IP
port           : O    
448            : SRC_PORT
ssh2           : O    
Jan            : MONTH
5              : DAY  
02:17:14       : Time 
nginx          : O    
10.0.0.0       : SRC_IP
05/Jan/2019:02:17:14: TIMESTAMP
+0100          : O    
GET            : METHOD
/test/bin/test/test.exe : PATH 
HTTP/X.X       : HTTP_VERSION
Mozilla/X.0 [en] (XX, U; Test-TT 0.0.0) : USER_AGENT    
']             : O  

I'm not sure, but I think the problem is in the regex I'm using. Any idea how to solve that? Thank you in advance.
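One thing worth checking first: the output in the question looks like plain whitespace splitting, which suggests that rx.search() returned None and the else branch ran. The pattern expects the timestamp in square brackets, a quoted request line, and numeric status and size fields, and the sample line does not contain them in that form. A minimal check, reusing the sentence and pattern from the question unchanged:

```python
import re

sentence = '[\'Viktoria Jan 1 12:13:16 google 10.0.0.0 port 448 ssh2 Jan 5 02:17:14 nginx 10.0.0.0 05/Jan/2019:02:17:14 +0100 GET /test/bin/test/test.exe HTTP/X.X " Mozilla/X.0 [en] (XX, U; Test-TT 0.0.0) " \']'

# The pattern from the question, copied as-is
rx = re.compile(r'\b(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\w+)\W+(\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+\[([^][\s]+)\s+([+\d]+)]\s+"([A-Z]+)\s+(\S+)\s+(\S+)"\s+(\d+)\s+(\d+)\s+\S+\s+"([^"]*)"')

m = rx.search(sentence)

# No match: the sample has no "[timestamp +zone]" brackets and no quoted request line,
# so the question's code falls back to sentence.split(' ')
print(m)  # None
```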

This sample program:

import re

sentence='[\'Viktoria Jan 1 12:13:16 google 10.0.0.0 port 448 ssh2 Jan 5 02:17:14 nginx 10.0.0.0 05/Jan/2019:02:17:14 +0100 GET /test/bin/test/test.exe HTTP/X.X " Mozilla/X.0 [en] (XX, U; Test-TT 0.0.0) " \']'

pattern = r'\b(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s"([^"]+)'

m = re.compile(pattern).search(sentence)

if m:
  print(m.groups())

produces this output, which seems to be what you're asking for:

('Viktoria', 'Jan', '1', '12:13:16', 'google', '10.0.0.0', 'port', '448', 'ssh2', 'Jan', '5', '02:17:14', 'nginx', '10.0.0.0', '05/Jan/2019:02:17:14', '+0100', 'GET', '/test/bin/test/test.exe', 'HTTP/X.X', ' Mozilla/X.0 [en] (XX, U; Test-TT 0.0.0) ')
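Note that the captured user agent still carries the spaces that sit inside the quotes, and that m.groups() is a tuple, while the padding step in the question concatenates with a list. A minimal sketch of one way to handle both, assuming the same 19 space-separated fields before the quoted user agent; the names FIELD_COUNT and tokenize are made up for illustration:

```python
import re

# Hypothetical names for illustration; only the pattern itself comes from the answer.
FIELD_COUNT = 19  # whitespace-separated fields that precede the quoted user agent

# Same pattern as in the sample program, built by repetition instead of typed out
PATTERN = re.compile(r'\b' + r'\s'.join([r'(\S+)'] * FIELD_COUNT) + r'\s"([^"]+)')

def tokenize(sentence):
    """Return a list of tokens, keeping the quoted user agent as one token."""
    m = PATTERN.search(sentence)
    if m:
        # groups() is a tuple: convert to a list and trim the spaces inside the quotes
        return [g.strip() for g in m.groups()]
    # Fall back to plain splitting, as in the question
    return sentence.split(' ')

sentence = '[\'Viktoria Jan 1 12:13:16 google 10.0.0.0 port 448 ssh2 Jan 5 02:17:14 nginx 10.0.0.0 05/Jan/2019:02:17:14 +0100 GET /test/bin/test/test.exe HTTP/X.X " Mozilla/X.0 [en] (XX, U; Test-TT 0.0.0) " \']'

tokens = tokenize(sentence)
print(len(tokens))   # 20
print(tokens[-1])    # Mozilla/X.0 [en] (XX, U; Test-TT 0.0.0)
```

The returned list can then be padded the same way as in the question, since list + list concatenation works where tuple + list does not.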

Further format checking can be added as needed in place of the various "(\S+)" terms.
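As a rough sketch of what such format checking could look like, some of the "(\S+)" terms can be swapped for stricter sub-patterns. The month, day, time, IP and method pieces below are borrowed from the regex in the question; the variable names and the choice of which fields to tighten are assumptions for illustration:

```python
import re

# Sub-patterns borrowed from the question's regex; the names are illustrative
ANY = r'\S+'
MONTH = r'\w{3}'
DAY = r'\d{1,2}'
TIME = r'\d{2}:\d{2}:\d{2}'
IP = r'\d{1,3}(?:\.\d{1,3}){3}'
PORT = r'\d+'
METHOD = r'[A-Z]+'

# One entry per field, in the order they appear in the sample line
fields = [
    ANY, MONTH, DAY, TIME, ANY, IP, ANY, PORT, ANY,   # Viktoria ... ssh2
    MONTH, DAY, TIME, ANY, IP, ANY, ANY,              # Jan ... +0100
    METHOD, ANY, ANY,                                 # GET, path, HTTP version
]

# Each field becomes a capture group; the quoted user agent is the last group
pattern = re.compile(
    r'\b' + r'\s'.join('(' + f + ')' for f in fields) + r'\s"([^"]+)'
)

sentence = '[\'Viktoria Jan 1 12:13:16 google 10.0.0.0 port 448 ssh2 Jan 5 02:17:14 nginx 10.0.0.0 05/Jan/2019:02:17:14 +0100 GET /test/bin/test/test.exe HTTP/X.X " Mozilla/X.0 [en] (XX, U; Test-TT 0.0.0) " \']'

m = pattern.search(sentence)
if m:
    print(m.groups())

# A line whose IP fields are malformed no longer matches
bad = sentence.replace('10.0.0.0', 'bad-ip')
print(pattern.search(bad))  # None
```

The trade-off is that a stricter pattern rejects lines that deviate from the expected layout, so a fallback such as the split(' ') branch in the question is still needed.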
