Extracting quotations with textacy

Question

I am attempting to extract quotations and quotation attributions (i.e., the speaker) from text, but I am getting errors. Here is the setup:

import textacy
import pandas as pd
import spacy

data = [
    "\"Hello, nice to meet you,\" said world 1",
    "\"Hello, nice to meet you,\" said world 2",
]

df = pd.DataFrame(data, columns=['text'])

nlp = spacy.load('en_core_web_sm')

doc = df['text'].apply(nlp)

Here is the desired output:

[DQTriple(speaker=[world 1], cue=[said], content="Hello, nice to meet you,")]
[DQTriple(speaker=[world 2], cue=[said], content="Hello, nice to meet you,")]

Here is the first attempt at extraction:

print(list(textacy.extract.triples.direct_quotations(doc) for records in doc))

Which gives the following output:

[<generator object direct_quotations at 0x7f82edf58ac0>, <generator object direct_quotations at 0x7f82edf58190>]

Here is the second attempt at extraction:

print(list(textacy.extract.triples.direct_quotations(doc)))

Which gives the following error:

AttributeError: 'Series' object has no attribute 'lang_'

Answer 1

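In your first attempt, the generator expression passes the whole Series `doc` to `direct_quotations` once per row and never consumes the resulting generators, which is why generator objects are printed instead of triples. In your second attempt, the function receives a pandas Series rather than a single spaCy `Doc`, and a Series has no `lang_` attribute.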

Here is an example of what you could do with a single document:

import spacy
import textacy

text = """ "Hello, nice to meet you," said world 1"""

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

print(list(textacy.extract.triples.direct_quotations(doc)))
# will print: [DQTriple(speaker=[world], cue=[said], content="Hello, nice to meet you,")]
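
To cover the multi-row case from the question, the same idea can be applied to each text separately. The following is only a sketch: `extract_quotes` and `demo` are hypothetical helper names introduced here for illustration, not part of textacy or spaCy. It assumes that `direct_quotations` expects one spaCy `Doc` per call and returns a generator that has to be materialized with `list`, as the single-document example above suggests.

```python
def extract_quotes(texts, nlp, extractor):
    """Parse each text on its own and materialize the extractor's generator.

    texts: any iterable of strings (a list, or a pandas Series such as df['text'])
    nlp: a callable that turns one string into one document
    extractor: a callable that takes one document and yields results
    """
    return [list(extractor(nlp(text))) for text in texts]


def demo():
    # Assumption: spaCy, textacy and the en_core_web_sm model are installed.
    import spacy
    from textacy.extract.triples import direct_quotations

    nlp = spacy.load("en_core_web_sm")
    texts = [
        '"Hello, nice to meet you," said world 1',
        '"Hello, nice to meet you," said world 2',
    ]
    # One list of quotation triples per input text.
    for triples in extract_quotes(texts, nlp, direct_quotations):
        print(triples)
```

Calling `demo()` prints one list per input text. With the DataFrame from the question, something like `df['quotes'] = extract_quotes(df['text'], nlp, textacy.extract.triples.direct_quotations)` should store the per-row results in a new column, since the helper returns one list entry per row.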

Answer 1, posted 2022-06-17 09:32:42
