Fully tokenize sentence including punctuation, contractions and hyphenated words
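I want to fully tokenize a sentence: "The element with the longest half-life isn't Uranium-234" said the professor.
I want this output:
['"', 'The', 'element', 'with', 'the', 'longest', 'half-life', "isn't", 'Uranium-234', '"', 'said', 'the', 'professor', '.']
Here all punctuation is split off, but contractions such as "isn't" stay a single token. Hyphenated words are also treated as one token, which is what I want.
Currently I am tokenizing it like this: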
import re
s = "\"The element with the longest half-life isn't Uranium-234\" said the professor."
p = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
p.findall(s)
This gives me the output:
['"', 'The', 'element', 'with', 'the', 'longest', 'half', '-', 'life', "isn't", 'Uranium', '-', '234', '"', 'said', 'the', 'professor', '.']
With this pattern, hyphenated words are not kept together as a single token.
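Use a ['-] character class so that either an apostrophe or a hyphen can join two word parts, and you forgot about the underscore: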
\w+(?:['-]\w+)?|[^\w\s]|_
See proof.
Explanation
--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    ['-]                     any character of: ''', '-'
--------------------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )?                       end of grouping
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  [^\w\s]                  any character except: word characters
                           (a-z, A-Z, 0-9, _), whitespace (\n, \r,
                           \t, \f, and " ")
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  _                        '_'
--------------------------------------------------------------------------------
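The breakdown above can also be kept next to the pattern itself. The following is a minimal sketch using `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments; the name `TOKEN_RE` is an assumption made for illustration and is not part of the original answer.

```python
import re

# Same pattern as in the answer, written in verbose mode so that each
# part carries its own comment. TOKEN_RE is a hypothetical name.
TOKEN_RE = re.compile(
    r"""
    \w+            # one or more word characters (letters, digits, _)
    (?:            # optional non-capturing group:
        ['-]       #   an apostrophe or a hyphen
        \w+        #   followed by one or more word characters
    )?
    |              # OR
    [^\w\s]        # any single character that is not a word character or whitespace
    |              # OR
    _              # a literal underscore (\w already matches "_", so this
                   # branch is kept only to mirror the answer's pattern)
    """,
    re.VERBOSE,
)

print(TOKEN_RE.findall("half-life isn't"))
```

Because `\w+` is tried first at every position, a standalone underscore is already picked up by the first branch, so the final `_` alternative should not change the result.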
import re
regex = r"\w+(?:['-]\w+)?|[^\w\s]|_"
test_str = "\"The element with the longest half-life is Uranium-234\" said the professor."
print(re.findall(regex, test_str))
Results: ['"', 'The', 'element', 'with', 'the', 'longest', 'half-life', 'is', 'Uranium-234', '"', 'said', 'the', 'professor', '.']
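The test string above uses "is" rather than "isn't", so it does not exercise the contraction case. The sketch below runs the same pattern on the sentence from the question, and also shows a limit of the optional group: `(?: ... )?` allows only one apostrophe or hyphen joiner per token. The `multi_joiner` variant, which swaps `?` for `*`, is an assumption added for illustration and is not part of the original answer.

```python
import re

# Pattern from the answer: at most one apostrophe/hyphen joiner per token.
single_joiner = re.compile(r"\w+(?:['-]\w+)?|[^\w\s]|_")

# Hypothetical variant (not from the answer): "*" allows any number of joiners.
multi_joiner = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]|_")

sentence = "\"The element with the longest half-life isn't Uranium-234\" said the professor."
print(single_joiner.findall(sentence))

phrase = "mother-in-law"
print(single_joiner.findall(phrase))  # splits after the first joined pair
print(multi_joiner.findall(phrase))   # keeps the whole word together
```

If words with several hyphens are expected in the input, the `*` variant is probably the safer choice; for the sentence in the question both patterns give the same tokens.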