Tokenize a sentence into words using regex
I'm using this code to predict different fields with a CRF model, and it actually works pretty well, except for the regex!
import re
import numpy as np

sentence='[\'Viktoria Jan 1 12:13:16 google 10.0.0.0 port 448 ssh2 Jan 5 02:17:14 nginx 10.0.0.0 05/Jan/2019:02:17:14 +0100 GET /test/bin/test/test.exe HTTP/X.X " Mozilla/X.0 [en] (XX, U; Test-TT 0.0.0) " \']'
rx = re.compile(r'\b(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\w+)\W+(\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+\[([^][\s]+)\s+([+\d]+)]\s+"([A-Z]+)\s+(\S+)\s+(\S+)"\s+(\d+)\s+(\d+)\s+\S+\s+"([^"]*)"')
m = rx.search(sentence)
if m:
    # groups() returns a tuple; convert it so the list concatenation below works
    sentence = list(m.groups())
else:
    sentence = sentence.split(' ')
padded_sentence = sentence + [word2index["--PADDING--"]] * (MAX_SENTENCE - len(sentence))
padded_sentence = [word2index.get(w, 0) for w in padded_sentence]
pred = ner_model.predict(np.array([padded_sentence]))
pred = np.argmax(pred, axis=-1)
retval = ""
for w, p in zip(sentence, pred[0]):
    retval = retval + "{:15}: {:5}".format(w, index2tag[p]) + "\n"
print(retval)
As output I get the following:
['Viktoria : USER
Jan : MONTH
1 : DAY
12:13:16 : Time
google : HOSTNAME
10.0.0.0 : SRC_IP
port : O
448 : SRC_PORT
ssh2 : O
Jan : MONTH
5 : DAY
02:17:14 : Time
nginx : O
10.0.0. : SRC_IP
05/Jan/2019:02:17:14: TIMESTAMP
+0100 : O
GET : METHOD
/test/bin/test/test.exe : PATH
HTTP/X.X : HTTP_VERSION
" : O
Mozilla/X.0 : O
[en] : O
(XX , : O
U; : O
Test-TT : O
0.0.0) : O
" : O
'] : O
What I actually want is to see the user agent as a single word in the output, something like this:
['Viktoria : USER
Jan : MONTH
1 : DAY
12:13:16 : Time
google : HOSTNAME
10.0.0.0 : SRC_IP
port : O
448 : SRC_PORT
ssh2 : O
Jan : MONTH
5 : DAY
02:17:14 : Time
nginx : O
10.0.0. : SRC_IP
05/Jan/2019:02:17:14: TIMESTAMP
+0100 : O
GET : METHOD
/test/bin/test/test.exe : PATH
HTTP/X.X : HTTP_VERSION
Mozilla/X.0 [en] (XX, U; Test-TT 0.0.0) : USER_AGENT
'] : O
I'm not sure, but I think the problem is in the regex I'm using. Any idea how to solve this? Thank you in advance.
This sample program:
import re
sentence='[\'Viktoria Jan 1 12:13:16 google 10.0.0.0 port 448 ssh2 Jan 5 02:17:14 nginx 10.0.0.0 05/Jan/2019:02:17:14 +0100 GET /test/bin/test/test.exe HTTP/X.X " Mozilla/X.0 [en] (XX, U; Test-TT 0.0.0) " \']'
pattern = r'\b(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s"([^"]+)'
m = re.compile(pattern).search(sentence)
if m:
    print(m.groups())
produces this output, which seems to be what you're asking for:
('Viktoria', 'Jan', '1', '12:13:16', 'google', '10.0.0.0', 'port', '448', 'ssh2', 'Jan', '5', '02:17:14', 'nginx', '10.0.0.0', '05/Jan/2019:02:17:14', '+0100', 'GET', '/test/bin/test/test.exe', 'HTTP/X.X', ' Mozilla/X.0 [en] (XX, U; Test-TT 0.0.0) ')
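Two details of that result matter if the groups are fed back into the padding code from the question: m.groups() returns a tuple, which cannot be concatenated with a list, and the last group keeps the spaces that sit inside the quotes. Below is a minimal sketch of one way to normalize it, assuming the same sample line. The name tokens is mine, and the join over 19 copies of (\S+) is only a shorter way of writing the same pattern.

```python
import re

sentence = '[\'Viktoria Jan 1 12:13:16 google 10.0.0.0 port 448 ssh2 Jan 5 02:17:14 nginx 10.0.0.0 05/Jan/2019:02:17:14 +0100 GET /test/bin/test/test.exe HTTP/X.X " Mozilla/X.0 [en] (XX, U; Test-TT 0.0.0) " \']'

# Same pattern as above: 19 whitespace-separated fields, then the quoted user agent.
pattern = r'\b' + r'\s'.join([r'(\S+)'] * 19) + r'\s"([^"]+)'

m = re.search(pattern, sentence)
if m:
    # Convert the tuple to a list and trim the spaces inside the quotes.
    tokens = [g.strip() for g in m.groups()]
else:
    # Fallback from the question: plain split on single spaces.
    tokens = sentence.split(' ')

print(tokens[-1])
```

With this, the user agent arrives as one list element without the padding spaces, and the list can be extended with the padding entries as in the question.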
Further format checking can be added as needed by replacing the individual "(\S+)" terms with stricter patterns.
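As a rough illustration of that idea, here is a sketch that swaps some of the (\S+) terms for stricter ones. The field patterns and the names (MONTH, DAY, TIME, IP, WORD, fields, rx) are my own assumptions, fitted only to the one sample line in the question. Treat them as a starting point, not as a validated log grammar.

```python
import re

sentence = '[\'Viktoria Jan 1 12:13:16 google 10.0.0.0 port 448 ssh2 Jan 5 02:17:14 nginx 10.0.0.0 05/Jan/2019:02:17:14 +0100 GET /test/bin/test/test.exe HTTP/X.X " Mozilla/X.0 [en] (XX, U; Test-TT 0.0.0) " \']'

# Hypothetical building blocks, guessed from the sample line.
MONTH = r'[A-Z][a-z]{2}'
DAY = r'\d{1,2}'
TIME = r'\d{2}:\d{2}:\d{2}'
IP = r'\d{1,3}(?:\.\d{1,3}){3}'
WORD = r'\S+'

fields = [
    WORD,                 # user
    MONTH, DAY, TIME,     # first date
    WORD,                 # hostname
    IP,                   # source IP
    WORD,                 # the word "port"
    r'\d+',               # source port
    WORD,                 # protocol, e.g. ssh2
    MONTH, DAY, TIME,     # second date
    WORD,                 # service, e.g. nginx
    IP,                   # second IP
    r'\d{2}/[A-Z][a-z]{2}/\d{4}:\d{2}:\d{2}:\d{2}',  # timestamp
    r'[+-]\d{4}',         # UTC offset
    r'[A-Z]+',            # HTTP method
    WORD,                 # path
    r'HTTP/\S+',          # HTTP version
]

# One capture group per field, separated by a single whitespace character,
# followed by the quoted user agent as the final group.
rx = re.compile(r'\b' + r'\s'.join('({})'.format(f) for f in fields) + r'\s"([^"]+)"')

good = rx.search(sentence)
# A line with a malformed IP address no longer matches at all.
bad = rx.search(sentence.replace('10.0.0.0', '10.0.0'))

print(good is not None, bad is None)
```

The trade-off is that a stricter pattern rejects any line that deviates from the expected layout, so the split(' ') fallback from the question will be taken more often.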