
Fully tokenize sentence including punctuation, contractions and hyphenated words

I want to fully tokenize this sentence: "The element with the longest half-life isn't Uranium-234" said the professor.

I want this output:

['"', 'The', 'element', 'with', 'the', 'longest', 'half-life', "isn't", 'Uranium-234', '"', 'said', 'the', 'professor', '.']

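Here every punctuation mark is a separate token, but a contraction such as "isn't" stays a single token. Hyphenated words are also kept as one token, which is what I want.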

Currently I am tokenizing it with this:

import re
s = '"The element with the longest half-life isn\'t Uranium-234" said the professor.'
p = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
p.findall(s)

This gives me the output:

['"', 'The', 'element', 'with', 'the', 'longest', 'half', '-', 'life', "isn't", 'Uranium', '-', '234', '"', 'said', 'the', 'professor', '.']

With this pattern, I cannot get a hyphenated word as a single token.

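Use the ['-] character class so that a hyphen can join two word parts just as an apostrophe does. The pattern also adds an explicit alternative for the underscore (strictly redundant, since \w already matches _):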

\w+(?:['-]\w+)?|[^\w\s]|_

Explanation

--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    ['-]                     any character of: ''', '-'
--------------------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )?                       end of grouping
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  [^\w\s]                  any character except: word characters (a-
                           z, A-Z, 0-9, _), whitespace (\n, \r, \t,
                           \f, and " ")
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  _                        '_'
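
The breakdown above can also be written directly into the pattern. The following is only a sketch, assuming Python's re.VERBOSE flag, which ignores whitespace outside character classes and allows # comments; the name TOKEN is made up for this illustration and is not part of the original answer.

```python
import re

# Same pattern as in the answer, spread over several lines with re.VERBOSE.
# TOKEN is a hypothetical name chosen for this sketch.
TOKEN = re.compile(r"""
    \w+               # one or more word characters (letters, digits, underscore)
    (?: ['-] \w+ )?   # optional: one apostrophe or hyphen, then more word characters
  | [^\w\s]           # or: a single character that is neither word character nor whitespace
  | _                 # or: a literal underscore (already covered by \w)
""", re.VERBOSE)

print(TOKEN.findall("half-life isn't here."))
```

The commented form behaves the same as the one-line pattern; it is just easier to check against the table.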

Python code

import re
regex = r"\w+(?:['-]\w+)?|[^\w\s]|_"
test_str = "\"The element with the longest half-life is Uranium-234\" said the professor."
print(re.findall(regex, test_str))

Result: ['"', 'The', 'element', 'with', 'the', 'longest', 'half-life', 'is', 'Uranium-234', '"', 'said', 'the', 'professor', '.']
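
One limitation worth knowing: the optional group (?:['-]\w+)? allows only one apostrophe or hyphen per token, so a compound with two hyphens is split. The sketch below compares the answer's pattern with a hypothetical variant that uses * instead of ?; the names ONE_JOINT and MANY_JOINTS and the sample sentence are assumptions for illustration, and the |_ alternative is left out because \w already matches underscores.

```python
import re

# Pattern from the answer: at most one apostrophe/hyphen joint per token.
ONE_JOINT = re.compile(r"\w+(?:['-]\w+)?|[^\w\s]")

# Hypothetical variant: "*" lets the joint repeat any number of times.
MANY_JOINTS = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

text = "My mother-in-law isn't here."

print(ONE_JOINT.findall(text))
# ['My', 'mother-in', '-', 'law', "isn't", 'here', '.']

print(MANY_JOINTS.findall(text))
# ['My', 'mother-in-law', "isn't", 'here', '.']
```

Whether the * variant is preferable depends on the text; for the sentence in the question both patterns give the same tokens.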
