How to split a string based on a reference list and on words at the same time efficiently in Python?
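I have a string and a reference list of elements. I want to split the string into a list of elements while taking the reference list into account: wherever an entry of the reference list occurs in the sentence it should be kept as one element, and everything else should be split into words. For example,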
reference_list = ['10', '2 to 3', '1 and 1/2', '1/2', '1/22', ... ... etc]
my_list = "this happened at 10 o'clock and now after 2 to 3 hours has gone..meet 1 and 1/2 hours later. Visit us on 1/22 or 2/12/2012... etc."
The output should look like:
out = ["this", "happened", "at", "10", "o'clock", .... "2 to 3", "hours", ... ... "1 and 1/2", "hours", ... "1/22", "or", "2/12/2012", ... ]
I would appreciate any help. Thanks in advance.
Update:
I tried this:
import re

# Alternation of the reference entries, falling back to a plain word.
reg = r'\b(%s|\w+)\b' % '|'.join(reference_list)
print(reg)
result = []
for e in re.finditer(reg, my_list):
    result.append(e.group())
print(result)
It does not work.
Suppose we have the following data:
reference_list = ['10', '2', '1', '2 to 3', '1/2', '1 and 1/2',
                  '1/22', '2 to 3 to 4']
my_list = ("this happened at 10 o'clock and now after 2 to 3 "
           "to 4 hours has gone we've decided to meet on-time "
           "1 and 1/2 hours later. Visit us on 1/22 or 2/12/2012")
(I have written the string this way so that it can be read without horizontal scrolling.)
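The key is to first sort reference_list to create a list new_list such that if new_list[j] is contained in new_list[i], then i < j (though the converse is generally not true). In Ruby, this can be done as follows.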
new_list = reference_list.sort { |a,b| a.include?(b) ? -1 : 1 }
#=> ["1/22", "1 and 1/2", "1/2", "2 to 3 to 4", "10", "1",
# "2 to 3", "2"]
I assume the Python code would be similar.
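A rough Python counterpart of the sorting step might look like the sketch below. It is not a literal translation of the Ruby comparator: as an assumption of this sketch, it sorts by descending length instead. A string can never be contained in a shorter one, so a longest-first order satisfies the same containment property, although the resulting order differs from the Ruby output above.

```python
reference_list = ['10', '2', '1', '2 to 3', '1/2', '1 and 1/2',
                  '1/22', '2 to 3 to 4']

# Longest first: a string can only contain strings that are no longer than
# itself, so every entry comes before any other entry it contains.
new_list = sorted(reference_list, key=len, reverse=True)

print(new_list)
# ['2 to 3 to 4', '1 and 1/2', '2 to 3', '1/22', '1/2', '10', '2', '1']
```

Sorting by length is only one ordering that works; any order in which containing entries precede contained ones would do.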
Next, we programmatically construct a regular expression from new_list. Again, this can be done in Ruby as follows, and I assume the Python code would be similar:
/\b(?:#{new_list.join('|')}|[\w'-]+)\b/
#=> /\b(?:1\/22|1 and 1\/2|1\/2|2 to 3 to 4|10|1|2 to 3|2|[\w'-]+)\b/
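In Python, the same construction could be sketched with str.join and re.compile. The use of re.escape is an addition of this sketch (the Ruby version interpolates the entries unescaped); it only matters if an entry contains regex metacharacters. The hard-coded new_list below is assumed to be already ordered longest first.

```python
import re

# Assumed to be ordered so that longer entries precede the entries they contain.
new_list = ['2 to 3 to 4', '1 and 1/2', '2 to 3', '1/22', '1/2', '10', '2', '1']

# Escape each entry so that it is matched literally, then join with '|'.
alternatives = '|'.join(re.escape(item) for item in new_list)

# The last alternative is the fallback: a word that may contain
# apostrophes and hyphens.
pattern = re.compile(r"\b(?:%s|[\w'-]+)\b" % alternatives)

print(pattern.pattern)
```

The group is non-capturing, so re.findall later returns whole matches rather than group contents.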
If this regular expression is used with re.findall, we obtain the following result:
["this", "happened", "at", "10", "o'clock", "and", "now", "after",
"2 to 3 to 4", "hours", "has", "gone", "we've", "decided", "to",
"meet", "on-time", "1 and 1/2", "hours", "later", "Visit", "us",
"on", "1/22", "or", "2", "12", "2012"]
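Before any match has been made, and again after each match, findall attempts to match '1/22' at the current position in the string. If that fails, it attempts to match '1 and 1\/2', and so on. Finally, if every alternative but the last has failed, it attempts to match the catch-all [\w'-]+. I arbitrarily included an apostrophe (so that "o'clock" is matched) and a hyphen (so that "on-time" is matched). Note that every match must be preceded and followed by a word boundary (\b).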
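Putting the steps together, a minimal end-to-end sketch in Python might look as follows. The helper name split_with_references is hypothetical, and the longest-first sort and the re.escape call are assumptions of this sketch rather than part of the Ruby code; with this data it yields the same tokens as the result listed above.

```python
import re

def split_with_references(text, references):
    """Split text into words, keeping reference phrases in one piece.

    Hypothetical helper; assumes references is a non-empty list of strings.
    """
    # Longest first, so an entry is tried before anything it contains.
    ordered = sorted(references, key=len, reverse=True)
    alternatives = '|'.join(re.escape(item) for item in ordered)
    # Fall back to a plain word (apostrophes and hyphens allowed).
    pattern = re.compile(r"\b(?:%s|[\w'-]+)\b" % alternatives)
    return pattern.findall(text)

reference_list = ['10', '2', '1', '2 to 3', '1/2', '1 and 1/2',
                  '1/22', '2 to 3 to 4']
my_list = ("this happened at 10 o'clock and now after 2 to 3 "
           "to 4 hours has gone we've decided to meet on-time "
           "1 and 1/2 hours later. Visit us on 1/22 or 2/12/2012")

result = split_with_references(my_list, reference_list)
print(result)
```

Note that the date 2/12/2012 still comes out as three separate tokens, because '/' is not part of the fallback character class.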
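Note that although the text '2 to 3 to 4' can be matched by any of 2 to 3 to 4, 2 to 3 and 2, the order of the elements in the alternation ensures that the first of these is the one that matches.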
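The effect of the ordering can be checked with a small Python experiment. The helper find_tokens and the shortened sample text are made up for this illustration; only the order of the alternatives differs between the two calls.

```python
import re

text = "after 2 to 3 to 4 hours"

def find_tokens(alternatives):
    # Hypothetical helper: builds the alternation in the order given.
    joined = '|'.join(re.escape(item) for item in alternatives)
    return re.findall(r"\b(?:%s|[\w'-]+)\b" % joined, text)

# Shortest alternative first: '2' wins and the phrase is broken up.
shortest_first = find_tokens(['2', '2 to 3', '2 to 3 to 4'])
print(shortest_first)  # ['after', '2', 'to', '3', 'to', '4', 'hours']

# Longest alternative first: the whole phrase is kept together.
longest_first = find_tokens(['2 to 3 to 4', '2 to 3', '2'])
print(longest_first)   # ['after', '2 to 3 to 4', 'hours']
```

This works because regex alternation takes the first alternative that leads to a successful match at the current position, not the longest one.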