Customized tag and lemmas for URLs using spaCy
Consider the sentence:
msg = 'I got this URL https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293 freed'
Next, I process the sentence using out-of-the-box spaCy for English:
import spacy
nlp = spacy.load('en')
doc = nlp(msg)
Let's review the output of:
[(t, t.lemma_, t.pos_, t.tag_, t.dep_) for t in doc]
[(I, '-PRON-', 'PRON', 'PRP', 'nsubj'),
(got, 'get', 'VERB', 'VBD', 'ROOT'),
(this, 'this', 'DET', 'DT', 'det'),
(URL, 'url', 'NOUN', 'NN', 'compound'),
(https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293,
'https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293',
'NOUN',
'NN',
'nsubj'),
(freed, 'free', 'VERB', 'VBN', 'ccomp')]
I would like to improve the handling of the URL piece. In particular, I want to:
Set its lemma to stackoverflow.com
Set its tag to URL
How can I do this using spaCy? I want to use a regex (as suggested here) to decide whether a string is a URL and to get its domain. So far, I have failed to find a way to do it.
EDIT: I guess a custom component is what I need. However, it seems there is no way to pass a regex-based (or any other) callable as the patterns.
You can specify the URL regex using a customized tokenizer, e.g. adapted from https://spacy.io/usage/linguistic-features#native-tokenizers:
import regex as re
from spacy.tokenizer import Tokenizer
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')
def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=simple_url_re.match)
nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)
msg = 'I got this URL https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293 freed'
for i, token in enumerate(nlp(msg)):
    print(i, ':\t', token)
[out]:
0 : I
1 : got
2 : this
3 : URL
4 : https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293
5 : freed
You can check whether a token looks like a URL, e.g.:
for i, token in enumerate(nlp(msg)):
    print(token.like_url, ':\t', token.lemma_)
[out]:
False : -PRON-
False : get
False : this
False : url
True : https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293
False : free
doc = nlp(msg)
for i, token in enumerate(doc):
    if token.like_url:
        token.tag_ = 'URL'
print([token.tag_ for token in doc])
[out]:
['PRP', 'VBD', 'DT', 'NN', 'URL', 'VBN']
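Since the question asks for a regex-based decision rather than spaCy's built-in like_url, here is a minimal sketch of such a check using only the standard library. The pattern URL_RE and the helper name is_url are illustrative assumptions, not part of spaCy, and the pattern is deliberately simple (http/https only, host must contain a dot).

```python
import re

# Hypothetical pattern: an http(s) scheme, a host containing at least one dot,
# then an optional path, query or fragment with no whitespace.
URL_RE = re.compile(r'^https?://[^\s/?#]+\.[^\s/?#]+(?:[/?#]\S*)?$', re.IGNORECASE)


def is_url(text):
    """Return True if the whole string looks like an http(s) URL."""
    return URL_RE.match(text) is not None
```

In the loop above, the condition token.like_url could then be replaced with is_url(token.text).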
Using the regex https://regex101.com/r/KfjQ1G/1:
doc = nlp(msg)
for i, token in enumerate(doc):
    if re.match(r'(?:http[s]:\/\/)stackoverflow.com.*', token.lemma_):
        token.lemma_ = 'stackoverflow.com'
print([token.lemma_ for token in doc])
[out]:
['-PRON-', 'get', 'this', 'url', 'stackoverflow.com', 'free']