Python find n-sized window around phrase within string
I have a string, for example 'i cant sleep what should i do', as well as a phrase that is contained in the string, 'cant sleep'. What I am trying to accomplish is to get an n-sized window around the phrase, even if there aren't n words on either side. So in this case, with a window size of 2 (2 words on either side of the phrase), I would want 'i cant sleep what should'.
This is my current solution, which attempts to find a window of size 2. However, it fails when the number of words to the left or right of the phrase is less than 2. I would also like to be able to use different window sizes.
import re
sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
sentence_words = re.findall(r'\w+', sentence)
phrase_words = re.findall(r'\w+', phrase)
left = sentence_words.index(phrase_words[0])
right = sentence_words.index(phrase_words[-1])
# left-2 is negative when fewer than 2 words precede the phrase,
# so this slice comes back empty for the example sentence
print(sentence_words[left-2:right+3])
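The failure on the left side comes from how slicing treats a negative start index: it counts from the end of the list instead of stopping at the beginning. Below is a minimal sketch of the pitfall and one possible fix, clamping the start with `max`. The names `n`, `start`, `broken` and `fixed` are illustrative and not part of the original attempt.

```python
import re

sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
n = 2  # window size (illustrative name)

sentence_words = re.findall(r'\w+', sentence)
phrase_words = re.findall(r'\w+', phrase)

# Same lookup as the attempt above: positions of the first and last phrase word.
left = sentence_words.index(phrase_words[0])
right = sentence_words.index(phrase_words[-1])

# left - n is -1 here, which a slice reads as "the last element",
# so the start lands after the end and the result is empty.
broken = sentence_words[left - n:right + n + 1]

# Clamping the start at 0 keeps the window inside the list. An end index
# past the end of the list is already harmless in a slice.
start = max(0, left - n)
fixed = sentence_words[start:right + n + 1]

print(broken)
print(' '.join(fixed))
```

Only the left edge needs clamping; the right edge is truncated by the slice itself.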
You can use the partition method for a non-regex solution:
>>> s='i cant sleep what should i do'
>>> p='cant sleep'
>>> lh, _, rh = s.partition(p)
Then use a slice to get up to two words:
>>> n=2
>>> ' '.join(lh.split()[:n]), p, ' '.join(rh.split()[:n])
('i', 'cant sleep', 'what should')
Your exact output:
>>> ' '.join(lh.split()[:n]+[p]+rh.split()[:n])
'i cant sleep what should'
You would want to check whether p is in s, or whether the partition succeeds, of course.
As pointed out in the comments, the slice on lh should use a negative index to take the last n words (thanks Mathias Ettinger):
>>> s='w1 w2 w3 w4 w5 w6 w7 w8 w9'
>>> p='w4 w5'
>>> n=2
>>> lh, _, rh = s.partition(p)
>>> ' '.join(lh.split()[-n:]+[p]+rh.split()[:n])
'w2 w3 w4 w5 w6 w7'
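The partition steps above, together with the membership check and the negative slice, can be collected into one function. This is only a sketch under stated assumptions: the name `window_around` is hypothetical, a missing phrase returns `None`, and a window size of 0 is handled explicitly because `words[-0:]` would return the whole list rather than nothing.

```python
def window_around(s, p, n):
    """Return p with up to n words of context on each side, or None if p is not in s."""
    lh, sep, rh = s.partition(p)
    if not sep:
        # partition returns an empty separator when p does not occur in s
        return None
    left = lh.split()
    right = rh.split()
    # max(0, ...) avoids left[-0:], which would keep every word when n == 0,
    # and also covers the case where fewer than n words precede the phrase
    before = left[max(0, len(left) - n):]
    after = right[:n]
    return ' '.join(before + [p] + after)


print(window_around('w1 w2 w3 w4 w5 w6 w7 w8 w9', 'w4 w5', 2))
print(window_around('i cant sleep what should i do', 'cant sleep', 2))
print(window_around('i cant sleep what should i do', 'wide awake', 2))
```

Because partition matches raw substrings, a phrase such as 'can' would also match inside 'cant'. Approaches that compare lists of whole words do not have that problem.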
If you define words as entities separated by spaces, you can split your sentences and use regular Python slicing:
def get_window(sentence, phrase, window_size):
    sentence = sentence.split()
    phrase = phrase.split()
    words = len(phrase)
    for i,word in enumerate(sentence):
        if word == phrase[0] and sentence[i:i+words] == phrase:
            start = max(0, i-window_size)
            return ' '.join(sentence[start:i+words+window_size])

sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
print(get_window(sentence, phrase, 2))
You can also turn it into a generator by changing return to yield, which lets you generate all windows if phrase matches several times in sentence:
>>> list(gen_window('I dont need it, I need to get rid of it', 'need', 2))
['I dont need it, I', 'it, I need to get']
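For reference, a generator version along those lines might look like the sketch below. The name `gen_window` is taken from the call above; the body is an assumption based on the whole-word matching just described, with `return` replaced by `yield`.

```python
def gen_window(sentence, phrase, window_size):
    """Yield one window for every occurrence of phrase in sentence."""
    words = sentence.split()
    target = phrase.split()
    size = len(target)
    for i in range(len(words) - size + 1):
        if words[i:i + size] == target:
            # clamp the left edge; the slice truncates the right edge itself
            start = max(0, i - window_size)
            yield ' '.join(words[start:i + size + window_size])


print(list(gen_window('I dont need it, I need to get rid of it', 'need', 2)))
```

Since splitting on whitespace leaves punctuation attached to words, 'it,' counts as a single word in the windows.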
import re

def contains_sublist(lst, sublst):
    n = len(sublst)
    for i in range(len(lst)-n+1):
        if sublst == lst[i:i+n]:
            a = max(0, i-2)  # clamp the left edge at the start of the list
            b = min(i+n+2, len(lst))
            return ' '.join(lst[a:b])

sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
sentence_words = re.findall(r'\w+', sentence)
phrase_words = re.findall(r'\w+', phrase)
print(contains_sublist(sentence_words, phrase_words))
You can split words using built-in string methods, so re shouldn't be necessary. If you want to use varying window sizes, wrap the logic in a function like so:
def get_word_window(sentence, phrase, w_left=0, w_right=0):
    w_lst = sentence.split()
    p_lst = phrase.split()
    for i,word in enumerate(w_lst):
        if word == p_lst[0] and \
           w_lst[i:i+len(p_lst)] == p_lst:
            left = max(0, i-w_left)
            right = min(len(w_lst), i+w_right+len(p_lst))
            return w_lst[left:right]
Then you can get the new phrase like so:
>>> sentence='i cant sleep what should i do'
>>> phrase='cant sleep'
>>> ' '.join(get_word_window(sentence,phrase,2,2))
'i cant sleep what should'