
How to change the format of training data for custom NER model retraining using spaCy?

I am working on a problem where the text data is in a document file and the resulting 5 tags are in a csv file. So to train a spaCy NER model, we have to tag the data something like:

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

But my data is in a csv file like:

[image]

I wrote a function which searches for the first occurrence of each column value in the text and records its length. Something like:

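import pandas as pd
import textract

# df, dir_files, ent_names and preprocess() are defined elsewhere
train_data = []
for i, index in enumerate(df.index.tolist()):
    row_data = df.iloc[i, :].values.tolist()
    entities = {"entities": []}
    for file in dir_files:
        # match the document to its CSV row by file name (without extension)
        if file.split('.')[0] == row_data[0]:
            text = preprocess(textract.process("./Training_data/" + file))

            for j, entry in enumerate(row_data[1:]):
                # skip null values and non-string cells
                if not pd.isna(entry) and isinstance(entry, str):
                    # str.find returns -1 when the value is not in the text;
                    # the second element here is a length, not an end offset
                    entities['entities'].append(
                        (text.find(entry.strip()), len(entry), ent_names[j]))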

and the result is:

{'entities': [(-1, 7, 'Aggrement Value'),
  (-1, 10, 'Aggrement Start Date'),
  (-1, 10, 'Aggrement End Date'),
  (-1, 4, 'Renewal Notice (Days)'),
  (124, 22, 'Party One'),
  (540, 45, 'Party Two')]}
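Note that the TRAIN_DATA format shown earlier uses (start, end, label) character offsets, whereas the output above stores (start, length, label) and keeps -1 for values that were not found. A minimal sketch of how the lookup could produce offsets in the expected shape is given below; the helper name find_entity_spans is hypothetical and not part of spaCy, and it only handles exact substring matches.

```python
def find_entity_spans(text, values, labels):
    """Return (start, end, label) tuples for values found verbatim in text.

    Hypothetical helper for illustration: values that are not strings
    (e.g. NaN cells) or that do not occur in the text are skipped.
    """
    spans = []
    for value, label in zip(values, labels):
        if not isinstance(value, str):
            continue  # skip NaN / non-string cells
        needle = value.strip()
        start = text.find(needle)  # -1 when the value is absent
        if needle and start != -1:
            # end offset is exclusive, like Python slicing
            spans.append((start, start + len(needle), label))
    return spans


text = "Who is Shaka Khan?"
print(find_entity_spans(text, ["Shaka Khan", "12.08.2018"], ["PERSON", "DATE"]))
# [(7, 17, 'PERSON')]
```

Dropping unmatched values instead of emitting -1 keeps invalid spans out of the training data, at the cost of leaving those entities unlabeled for that document.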

It gives me decent results for the strings, but I have a huge problem with dates, which are in the format 12.08.2018, and prices, which are in the format 6000.00. I can't compare them directly, so I have to change the price with str(int(price)) and then match. That will work, BUT the date in the text is never in the format given in the CSV. It's something like 1stDAY OF SEPTEMBER 2018 TWO THOUSAND EIGHTEEN. How am I supposed to tag that one?
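For the price, one option is to compare numeric values instead of strings, so that 6000.00 in the CSV can match 6000 or 6,000 in the document. The following is only a sketch under assumptions: the helper name find_amount_span is hypothetical, amounts are assumed to use "," as a thousands separator and "." as the decimal point, and the first number with an equal value is taken as the match.

```python
import re

# digits, optional ",ddd" groups, optional decimal part
AMOUNT_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def find_amount_span(text, csv_amount, label):
    """Return (start, end, label) for the first number in text whose
    numeric value equals csv_amount, or None if there is no such number."""
    target = float(csv_amount)
    for match in AMOUNT_RE.finditer(text):
        if float(match.group().replace(",", "")) == target:
            return (match.start(), match.end(), label)
    return None


text = "a monthly rent of Rs. 6,000/- payable in advance"
span = find_amount_span(text, "6000.00", "Aggrement Value")
print(text[span[0]:span[1]])
# 6,000
```

Any unrelated number with the same value (a year, a clause number) would also match, so the result is worth checking by hand before training on it.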

I tried using spaCy's built-in NER to figure this out, but it is not giving me good results.

import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(preprocess(text))
displacy.render(doc, style='ent', jupyter=True)

It gives me something like:

[image]

How can I tag my data? Without proper tagging of the dates it's all futile, as the model will never learn to extract the dates no matter what. Is there a regular expression for this? I also saw that NLTK POS-based queries to extract NER give us something like:

[image]

If I understand your problem correctly, you need to robustly parse dates which are given mostly in a fully textual form.

To normalize such dates you can try one of the dateutil, maya, pendulum or arrow libraries. A good demonstration of what these libraries can do can be found here.

In case your problem is also to tag the dates in the text, it is a bit more nuanced, and you would have to train such a model if spaCy / NLTK is not suitable for your purpose. You can theoretically also implement a number of regular expressions, but this is error prone and would take months to complete. AFAIK, there is no robust implementation available.
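To give an idea of what the regular-expression route looks like, here is a minimal sketch that covers only one pattern, the "1stDAY OF SEPTEMBER 2018" style from the question. It assumes the CSV dates are day-first (DD.MM.YYYY); the name find_date_span and the pattern itself are illustrative, not taken from any library, and real documents would need many more variants, which is exactly why this approach is error prone.

```python
import re
from datetime import date, datetime

MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# day, optional ordinal suffix, optional "day of", month name, 4-digit year
DATE_RE = re.compile(
    r"(?<!\d)(?P<day>\d{1,2})\s*(?:st|nd|rd|th)?\s*(?:day\s+of\s+)?"
    r"(?P<month>" + "|".join(MONTHS) + r")\s*,?\s*(?P<year>\d{4})(?!\d)",
    re.IGNORECASE,
)


def find_date_span(text, csv_date, label):
    """Return (start, end, label) for the first textual date in text that
    denotes the same day as csv_date (assumed DD.MM.YYYY), else None."""
    target = datetime.strptime(csv_date, "%d.%m.%Y").date()
    for match in DATE_RE.finditer(text):
        try:
            found = date(int(match["year"]),
                         MONTHS[match["month"].lower()],
                         int(match["day"]))
        except ValueError:
            continue  # e.g. "31 February 2018" is not a real date
        if found == target:
            return (match.start(), match.end(), label)
    return None


text = "made on this 1stDAY OF SEPTEMBER 2018 TWO THOUSAND EIGHTEEN"
span = find_date_span(text, "01.09.2018", "Aggrement Start Date")
print(text[span[0]:span[1]])
# 1stDAY OF SEPTEMBER 2018
```

Comparing parsed dates rather than strings means the CSV value only has to denote the same day as the text, not look like it.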
