I want to extract information from different sentences, so I'm using NLTK to split each sentence into words. I'm using this code:
words = []
for sentence in sentences:
    words.append(nltk.word_tokenize(sentence))
words
It works pretty well, but I want something a little different. For example, I have this sentence: '[\'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/xx" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\']'
I want "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"
to be one word rather than being split into several separate words.
UPDATE: I want something like this:
[
'Jan',
'31',
'19:28:14',
'nginx',
'10.0.0.0',
'31/Jan/2019:19:28:14',
'+0100',
'POST',
'/test/itf/',
'HTTP/x.x',
'404',
'146',
'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']
Any idea how to make this possible? Thank you in advance.
You can use the re module
and parse the log line (which is not a natural-language sentence) with a regex:
import re
import nltk  # used only for the fallback tokenizer below
sentences = ['[\'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\']']
rx = re.compile(r'\b(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\w+)\W+(\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+\[([^][\s]+)\s+([+\d]+)]\s+"([A-Z]+)\s+(\S+)\s+(\S+)"\s+(\d+)\s+(\d+)\s+\S+\s+"([^"]*)"')
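words = []
for sent in sentences:
    m = rx.search(sent)
    if m:
        # one list of captured fields per matching log line
        words.append(list(m.groups()))
    else:
        # fall back to NLTK for lines that do not match the log format
        words.append(nltk.word_tokenize(sent))
print(words)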
The output will look like
[['Jan', '31', '19:28:14', 'nginx', '10.0.0.0', '31/Jan/2019:19:28:14', '+0100', 'POST', '/test/itf/', 'HTTP/x.x', '404', '146', 'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']]
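The long pattern above is hard to read at a glance. As a rough sketch (not part of the original answer), the same regex can be written with re.VERBOSE and named groups, so each field is labeled. The group names (month, day, user_agent, and so on) are my own choice for illustration, and the sample line is assumed to be the bare log line without the extra list/quote wrapper:

```python
import re

# Sample log line from the question, assumed here without the "['...']" wrapper.
line = ('Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] '
        '"POST /test/itf/ HTTP/x.x" 404 146 "-" '
        '"Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"')

# Same pattern as above, split over several lines; group names are illustrative.
LOG_RX = re.compile(r'''
    \b(?P<month>\w{3})\s+
    (?P<day>\d{1,2})\s+
    (?P<time>\d{2}:\d{2}:\d{2})\s+
    (?P<program>\w+)\W+
    (?P<ip>\d{1,3}(?:\.\d{1,3}){3})
    (?:\s+\S+){2}\s+               # skip the two dash placeholders
    \[(?P<timestamp>[^][\s]+)\s+
    (?P<tz>[+\d]+)]\s+
    "(?P<method>[A-Z]+)\s+
    (?P<path>\S+)\s+
    (?P<protocol>\S+)"\s+
    (?P<status>\d+)\s+
    (?P<size>\d+)\s+
    \S+\s+                         # skip the quoted referer field
    "(?P<user_agent>[^"]*)"
''', re.VERBOSE)

m = LOG_RX.search(line)
fields = m.groupdict() if m else {}
print(fields.get('user_agent'))
```

With named groups you can look fields up by name (fields['status']) instead of by position, which tends to be less fragile if the pattern is extended later.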
First you need to choose between " and ' as the string delimiter, because mixing both is unusual and can cause strange behavior. After that it is just string handling:
s='"[\"Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\"]" i want "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"'
words = s.split(' ') # split the string on spaces
# ['"["Jan', '31', '19:28:14', 'nginx:', '10.0.0.0', '-', '-', '[31/Jan/2019:19:28:14', '+0100]', '"POST', '/test/itf/', 'HTTP/x.x"', '404', '146', '"-"', '"Mozilla/5.2', '[en]', '(X11,', 'U;', 'OpenVAS-XX', '9.2.7)""]"', 'i', 'want', '"Mozilla/5.2', '[en]', '(X11,', 'U;', 'OpenVAS-XX', '9.2.7)"']
# then access your data list
words[0] # '"["Jan'
words[1] # '31'
words[2] # '19:28:14'
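Note that a plain split(' ') still breaks the quoted user-agent into several pieces. If the goal is to split on spaces while keeping double-quoted parts together, the standard library's shlex.split is one option. This is a different technique from the answer above, sketched here on the assumption that the input is the bare log line (no outer list/quote wrapper) and contains no stray apostrophes or backslashes:

```python
import shlex

# Bare log line, assumed to be free of the "['...']" wrapper from the question.
line = ('Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] '
        '"POST /test/itf/ HTTP/x.x" 404 146 "-" '
        '"Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"')

# shlex.split splits on whitespace but treats each "..." section as one token
# and drops the surrounding quotes.
tokens = shlex.split(line)

print(tokens[-1])  # the user-agent, kept as a single token
print(tokens[9])   # the whole quoted request line, also a single token
```

Unlike the desired output in the question, this keeps the request ("POST /test/itf/ HTTP/x.x") as one token and leaves the brackets, the colon after nginx, and the dash fields in place, so some extra cleanup would still be needed.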
You could do that using partition()
with a space delimiter, and keep partitioning the string until you get the result you want. Below is the solution. Note, though, that this solution is tied strictly to the string format you provided. It may not be the best approach, but it will give you the desired output. Consider regular expressions for a more elegant solution.
s = '[\'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\']'
x = s.partition(" ")
s_list = []
s_list.append(x[0].replace("'", '').replace('[', ''))
x = x[2].partition(" ")
s_list.append(x[0])
x = x[2].partition(" ")
s_list.append(x[0])
x = x[2].partition(" ")
s_list.append(x[0].replace(":", ''))
x = x[2].partition(" ")
s_list.append(x[0])
x = x[2].partition(" ")
x = x[2].partition(" ")
x = x[2].partition(" ")
s_list.append(x[0].replace('[', ''))
x = x[2].partition(" ")
s_list.append(x[0].replace(']', ''))
x = x[2].partition(" ")
s_list.append(x[0].replace('"', ''))  # drop the opening quote from "POST
x = x[2].partition(" ")
s_list.append(x[0])
x = x[2].partition(" ")
s_list.append(x[0].replace('"', ''))
x = x[2].partition(" ")
s_list.append(x[0])
x = x[2].partition(" ")
s_list.append(x[0])
x = x[2].partition(" ")
# strip only the surrounding quotes and bracket, so the "]" in "[en]" survives
s_list.append(x[2].strip('"\']'))
print(s_list)
Output:
['Jan', '31', '19:28:14', 'nginx', '10.0.0.0', '31/Jan/2019:19:28:14', '+0100', 'POST', '/test/itf/', 'HTTP/x.x', '404', '146', 'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']