Search for a specific part of speech (e.g. nouns) and print them together with the preceding word

Question

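I have a text made up of a series of basic sentences, such as "she is a doctor", "he is a good person", and so on. I am trying to write a program that returns only the nouns and the preceding pronouns (e.g. she, he, it). I need them printed in pairs, such as (she, doctor) or (he, person). I am using SpaCy because it will also let me work with similar texts in French and German.

This is the closest thing to what I need that I have found elsewhere on this site. What I have been trying so far is to generate a list of the nouns in the text, then search the text for the nouns in the list and print each noun together with the word 3 positions before it (since that is the pattern in most of the sentences, and "most" is good enough for my purposes). This is what I have for creating the list: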

def spacy_tag(text):
    text_open = codecs.open(text, encoding='latin1').read()
    parsed_text = nlp_en(text_open)
    tokens = list([(token, token.tag_) for token in parsed_text])
    list1 = []
    for token, token.tag_ in tokens:
        if token.tag_ == 'NN':
            list1.append(token)
    return(list1)

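However, when I try to do anything with it, I get an error message. I tried using enumerate, but I could not get that to work either. This is my current code for searching the text for the words in the list (I have not yet started adding the part that should print the word a few places earlier, because I am still stuck on the search part):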

def spacy_search(text, list):
    text_open = codecs.open(text, encoding='latin1').read()
    for word in text_open:
        if word in list:
            print(word)

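The error I get is on line 4, "if word in list:", and it says "TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got str)".

Is there a more efficient way to print PRP, NN pairs using SpaCy? Alternatively, how can I modify my code so that it searches the text for the nouns in the list? (It does not need to be a particularly elegant solution; it just needs to produce a result.)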

Answer 1

You are taking the wrong approach:

First, append all the token attributes in the sentence:

tokenized = []
for token in doc:
    # one tuple per token; index 2 is pos_, index 3 is tag_, index 8 is head
    tokenized.append((token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
                      token.shape_, token.is_alpha, token.is_stop, token.head,
                      token.left_edge, token.right_edge, token.ent_type_))

Write a function that takes a token and returns its head, after checking that the token's pos == 'NOUN' and tag == 'NN':

def get_head(token):
    # token is one of the tuples built above
    if token[2] == 'NOUN' and token[3] == 'NN':
        return token[8]
    return None

Now, if the returned head is a PRON, you have found what you are looking for; if not, send the head token to the function again.

You can see a sample run below:

sentences=["she is a doctor", "he is a good person"]

('she', 'she', 'PRON', 'PRP', 'nsubj', 'xxx', True, True, is, she, she, '')
('is', 'be', 'AUX', 'VBZ', 'ROOT', 'xx', True, True, is, she, doctor, '')
('a', 'a', 'DET', 'DT', 'det', 'x', True, True, doctor, a, a, '')
('doctor', 'doctor', 'NOUN', 'NN', 'attr', 'xxxx', True, False, is, a, doctor, '')

So the first call returns is, the second call returns she, and then you stop.

Likewise:

('he', 'he', 'PRON', 'PRP', 'nsubj', 'xx', True, True, is, he, he, '')
('is', 'be', 'AUX', 'VBZ', 'ROOT', 'xx', True, True, is, he, person, '')
('a', 'a', 'DET', 'DT', 'det', 'x', True, True, person, a, a, '')
('good', 'good', 'ADJ', 'JJ', 'amod', 'xxxx', True, False, person, good, good, '')
('person', 'person', 'NOUN', 'NN', 'attr', 'xxxx', True, False, is, a, person, '')

So the first call returns is, the second call returns he, and then you stop.
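
One caveat about the walk-through above: in the printed tuples, the head of `is` is `is` itself (it is the ROOT), while `she` and `he` list `is` as their head. The pronoun is therefore a child of the verb rather than its head, so the lookup probably has to go up one step and then down. Below is a minimal sketch of that idea, assuming spaCy v3. The `Doc` is built by hand from the annotations printed above so that no model is needed, and `pronoun_noun_pairs` is a hypothetical helper name, not part of spaCy.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-built parse of "she is a doctor", copied from the tuples above.
# With a loaded pipeline you would instead write: doc = nlp("she is a doctor")
doc = Doc(
    Vocab(),
    words=["she", "is", "a", "doctor"],
    pos=["PRON", "AUX", "DET", "NOUN"],
    tags=["PRP", "VBZ", "DT", "NN"],
    heads=[1, 1, 3, 1],  # absolute index of each token's head
    deps=["nsubj", "ROOT", "det", "attr"],
)


def pronoun_noun_pairs(doc):
    """Pair each NN noun with a pronoun subject attached to the noun's head."""
    pairs = []
    for token in doc:
        if token.pos_ == "NOUN" and token.tag_ == "NN":
            # token.head is the verb; the subject is among the verb's children.
            for child in token.head.children:
                if child.pos_ == "PRON" and child.dep_ == "nsubj":
                    pairs.append((child.text, token.text))
    return pairs


print(pronoun_noun_pairs(doc))
```

Because it follows the parse rather than a fixed offset, the same helper should also cope with "he is a good person", where an adjective sits between the verb and the noun.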

Answer 2

Here is a clean way to implement your intended approach.

# put your nouns of interest here
NOUN_LIST = ["doctor", ...]

def find_stuff(text):
    doc = nlp(text)
    if len(doc) < 4: return None # too short
    
    for tok in doc[3:]:
        if tok.pos_ == "NOUN" and tok.text in NOUN_LIST and doc[tok.i-3].pos_ == "PRON":
            return (doc[tok.i-3].text, tok.text)
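
As background for the error in the question: the noun list there holds spaCy `Token` objects, while `for word in text_open` walks a plain string one character at a time, so `word in list` ends up comparing a `str` with a `Token`. Comparing text with text, as the function above does, avoids that. The following is a minimal sketch of the search step under that assumption; `search_nouns` and `noun_texts` are illustrative names, and the `Doc` is built by hand (with assumed POS values) so the snippet runs without a model.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-built stand-in for nlp("she is a doctor"); POS values are assumed.
doc = Doc(
    Vocab(),
    words=["she", "is", "a", "doctor"],
    pos=["PRON", "AUX", "DET", "NOUN"],
)

# Store the nouns as plain strings rather than as Token objects.
noun_texts = {tok.text for tok in doc if tok.pos_ == "NOUN"}


def search_nouns(doc, noun_texts, offset=3):
    """Return (earlier word, noun) for each listed noun, looking `offset` tokens back."""
    pairs = []
    for tok in doc:  # iterate over tokens, not over the characters of a string
        if tok.text in noun_texts and tok.i >= offset:
            pairs.append((doc[tok.i - offset].text, tok.text))
    return pairs


print(search_nouns(doc, noun_texts))
```

Unlike `find_stuff`, this sketch collects every hit instead of returning at the first one, which may be closer to what a multi-sentence text needs.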

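As the other answer mentioned, your approach here is wrong. What you want is the subject and the object (or predicate nominative) of the sentence. You should use the DependencyMatcher for that. Here is an example: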

from spacy.matcher import DependencyMatcher
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("she is a good person")

pattern = [
  # anchor token: verb, usually "is"
  {
    "RIGHT_ID": "verb",
    "RIGHT_ATTRS": {"POS": "AUX"}
  },
  # verb -> pronoun
  {
    "LEFT_ID": "verb",
    "REL_OP": ">",
    "RIGHT_ID": "pronoun",
    "RIGHT_ATTRS": {"DEP": "nsubj", "POS": "PRON"}
  },
  # predicate nominatives have "attr" relation
  {
    "LEFT_ID": "verb",
    "REL_OP": ">",
    "RIGHT_ID": "target",
    "RIGHT_ATTRS": {"DEP": "attr", "POS": "NOUN"}
  }
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("PREDNOM", [pattern])
matches = matcher(doc)

for match_id, (verb, pron, target) in matches:
    print(doc[pron], doc[verb], doc[target])
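
Since the question asks for (pronoun, noun) pairs, the matcher above could be wrapped in a small helper. This is only a sketch, assuming spaCy v3: `extract_pairs` is a hypothetical name, and the `Doc` is built by hand from the parse of "he is a good person" shown in the other answer so that it runs without downloading a model.

```python
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()

# Hand-built parse; with a pipeline you would write: doc = nlp("he is a good person")
doc = Doc(
    vocab,
    words=["he", "is", "a", "good", "person"],
    pos=["PRON", "AUX", "DET", "ADJ", "NOUN"],
    heads=[1, 1, 4, 4, 1],  # absolute index of each token's head
    deps=["nsubj", "ROOT", "det", "amod", "attr"],
)

# Same pattern as above, written compactly.
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "AUX"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "pronoun",
     "RIGHT_ATTRS": {"DEP": "nsubj", "POS": "PRON"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "target",
     "RIGHT_ATTRS": {"DEP": "attr", "POS": "NOUN"}},
]

matcher = DependencyMatcher(vocab)  # must share the vocab of the docs it matches
matcher.add("PREDNOM", [pattern])


def extract_pairs(doc):
    """Return a (pronoun, noun) tuple for every match of the pattern."""
    pairs = []
    for match_id, token_ids in matcher(doc):
        # token_ids follow the order of the pattern: verb, pronoun, target
        verb, pron, target = token_ids
        pairs.append((doc[pron].text, doc[target].text))
    return pairs


print(extract_pairs(doc))
```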

You can use displaCy to inspect the dependency relations. You can learn more about them in Jurafsky and Martin's book.
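
As a rough illustration of the displaCy suggestion, the sketch below renders a dependency tree to SVG markup outside a notebook. It assumes spaCy v3, and the `Doc` is again built by hand (from the parse printed in the other answer) purely so the snippet runs without a model.

```python
from spacy import displacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-built parse of "she is a doctor"; normally this comes from nlp(...).
doc = Doc(
    Vocab(),
    words=["she", "is", "a", "doctor"],
    pos=["PRON", "AUX", "DET", "NOUN"],
    heads=[1, 1, 3, 1],  # absolute index of each token's head
    deps=["nsubj", "ROOT", "det", "attr"],
)

# With jupyter=False, render() returns the markup as a string
# instead of displaying it inline.
svg = displacy.render(doc, style="dep", jupyter=False)
print(svg[:80])
```

The returned string can be written to a `.svg` file and opened in a browser to see which arcs (here `nsubj` and `attr`) connect the pronoun and the noun to the verb.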

Answer 1
0 2022-01-02 16:46:12

Answer 2
0 2022-01-03 05:37:40