
How can I untokenize a replaced spacy.tokens.token.Token?

I am trying to replace the location name in a string with a randomly chosen city from the list below, then take the newly formed string and append it to a file. I tried using spaCy for this. I can easily detect the cities and replace the token, but I am stuck on joining the tokens back together to get the new line.

from pprint import pprint
import spacy
import random

list = ['Delhi','Mumbai','Bangalore','Agra','Jaipur','Noida','Lucknow','Bombay','Jaipur','Indore','Chandigarh','Guwahati','Ghaziabad','Faridabad',
        'Pune','Chennai','kolkata','Hyderabad','Goa']

nlp = spacy.load('en_core_web_sm')

sentence = '''Can You deliver pizza to London.'''

entities = nlp(sentence)

pprint([(X, X.ent_iob_, X.ent_type_) for X in entities])
newstr=""
for X in entities:
    newstr += X
    if  X.ent_type_=='GPE' and X.ent_iob_=='B':
        X = random.choice(list)
        print(X)
        #print(type(X))
    elif X.ent_type_=='GPE' and X.ent_iob_=='I':
        X= ' '



pprint(newstr)

I am getting the following error:

 Traceback (most recent call last):
  File "C:\Users\shahi\PycharmProjects\pythonscrappingproject\main.py", line 17, in <module>
    newstr += X
TypeError: can only concatenate str (not "spacy.tokens.token.Token") to str

When I comment out the line newstr += X, the script runs fine.

First, do not use the built-in name list as a variable name, because that shadows the built-in type. Use l, for example:

l = ['Delhi','Mumbai','Bangalore','Agra','Jaipur','Noida','Lucknow','Bombay','Jaipur','Indore','Chandigarh','Guwahati','Ghaziabad','Faridabad',
        'Pune','Chennai','kolkata','Hyderabad','Goa']

Then, you can use

for X in entities:
    if  X.ent_type_=='GPE' and X.ent_iob_=='B':
        newstr += random.choice(l) + X.whitespace_
    else:
        newstr += X.text + X.whitespace_

where X.text is the actual token text and X.whitespace_ is the whitespace that follows that token in the original character sequence. Appending both for every token reproduces the original spacing exactly.
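One case the loop above does not cover is a location that spans several tokens (for example "New York"): only the B token is replaced, so the I token would still be appended. The sketch below is one possible way to handle that, not the answerer's method. It is written against any object exposing text, whitespace_, ent_type_ and ent_iob_, so a spaCy Doc can be passed in directly; the Tok namedtuple and the replace_gpe function are hypothetical names introduced here so the example runs without downloading a model.

```python
import random
from collections import namedtuple

# Stand-in for spacy.tokens.Token carrying only the four attributes used below.
# (Hypothetical helper for illustration; a real spaCy Doc can be passed instead.)
Tok = namedtuple("Tok", "text whitespace_ ent_type_ ent_iob_")


def replace_gpe(tokens, cities, rng=random):
    """Rebuild the text, swapping each GPE entity for a randomly chosen city."""
    parts = []
    for tok in tokens:
        if tok.ent_type_ == "GPE" and tok.ent_iob_ == "B":
            # First token of a location: emit the replacement, then the
            # whitespace that followed the original token.
            parts.append(rng.choice(cities))
            parts.append(tok.whitespace_)
        elif tok.ent_type_ == "GPE" and tok.ent_iob_ == "I" and parts:
            # Continuation of the same location: drop its text and keep only
            # the whitespace after the last token of the entity.
            parts[-1] = tok.whitespace_
        else:
            parts.append(tok.text + tok.whitespace_)
    return "".join(parts)


tokens = [
    Tok("Deliver", " ", "", "O"),
    Tok("pizza", " ", "", "O"),
    Tok("to", " ", "", "O"),
    Tok("New", " ", "GPE", "B"),
    Tok("York", "", "GPE", "I"),
    Tok(".", "", "", "O"),
]

result = replace_gpe(tokens, ["Pune"])
print(result)  # Deliver pizza to Pune.
```

With a real pipeline, replace_gpe(nlp(sentence), l) should behave the same way, assuming the model tags the location as GPE.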

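Try converting the spacy.tokens.token.Token to str by writing newstr += str(X). Note that str(X) gives only the token text, so the whitespace between tokens is not included.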
