How to proceed after annotating text data for ML?

I am currently working on a project in which I want to classify some text. For that, I first had to annotate the text data. I did so using a web tool and now have the corresponding JSON files (containing the annotations) and the plain .txt files (containing the raw text). I now want to train different classifiers on the data and eventually predict the desired outcome.

However, I am struggling with where to start. I haven't really found what I'm looking for on the internet, which is why I'm asking here.

How should I proceed with the JSON and .txt files? As far as I understand, I'd have to somehow convert this information to a .csv that holds the labels and the text, with "none" for text that has not been annotated. So I guess I would use the .txt files to merge them with the annotation files, so that I can detect whether a sentence (or word) has a label or not. Then I could load the .csv data into the model.

Could someone give me a hint on where to start or how I should proceed now? Everything I've found so far covers the case where the data is already converted and ready to preprocess, but I am struggling with what to do with the results of the annotation process.

My JSON looks something like this:

{"annotatable": {"parts": ["s1p1"]},
 "anncomplete": true,
 "sources": [],
 "metas": {},
 "entities": [
   {"classId": "e_1",
    "part": "s1p1",
    "offsets": [{"start": 11, "text": "This is the text"}],
    "coordinates": [],
    "confidence": {"state": "pre-added", "who": ["user:1"], "prob": 1},
    "fields": {"f_4": {"value": "3",
                       "confidence": {"state": "pre-added", "who": ["user:1"], "prob": 1}}},
    "normalizations": {}}],
 "relations": []}

Each text is given a classId (e_1 in this case) and a field value (f_4 with the value 3 in this case). I'd need to extract this step by step: first the entity with the corresponding text (adding "none" where nothing has been annotated), and in a second step the field information with the corresponding text. The corresponding .txt file simply looks like this: This is the text

I have all the .json files in one folder and all the .txt files in another.

So, let's assume you have a JSON file where the labels are indexed by the corresponding line in your raw txt file:

{
  "0": "politics",
  "1": "sports",
  "2": "weather"
}

And a txt file with the correspondingly indexed raw text:

0 The American government has launched ... today.
1 FC Barcelona has won ... the country.
2 The forecast looks ... okay.

First, you need to connect the examples with their labels before you go on to featurize the text and build a machine learning model. If your examples are aligned by index, ID, or any other identifying information, as in my example, you could do:

import json

with open('labels.json') as json_file:
    labels = json.load(json_file)
    # This results in a Python dictionary where you can look up a label given an index.
    # Note that JSON object keys are loaded as strings, e.g. labels['0'].

with open('raw.txt') as txt_file:
    raw_texts = txt_file.read().splitlines()
    # This results in a list (without trailing newlines) where you can retrieve the raw text by index: raw_texts[index].

Now that you can match your raw text to your labels, you may want to put them in a single DataFrame for ease of use (assuming, for now, that they are ordered the same way):

import pandas as pd

data = pd.DataFrame(
    {'label': list(labels.values()),  # list() makes the dict view a plain sequence
     'text': raw_texts
    })

#    label      text
# 0  politics   Sentence_1
# 1  sports     Sentence_2
# 2  weather    Sentence_3
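
For the per-file annotation format shown in the question, the same kind of table could be built roughly as follows. This is only a sketch under assumptions: the helper names annotation_rows and load_folders are hypothetical, the key names (entities, classId, offsets, fields, value) are taken from the JSON in the question, and the pairing of 'a.json' with 'a.txt' by file name is a guess about how the two folders relate. As a simplification, a file with no entities at all becomes a single "none" row; marking unannotated spans inside an annotated file would need the offsets as well.

```python
import json
from pathlib import Path


def annotation_rows(annotation, raw_text):
    """Flatten one annotation dict (plus its raw text) into a list of row dicts."""
    rows = []
    for entity in annotation.get("entities", []):
        # Second step from the question: collect field values, e.g. {"f_4": "3"}.
        fields = {name: info.get("value")
                  for name, info in entity.get("fields", {}).items()}
        # First step from the question: one row per annotated text span.
        for offset in entity.get("offsets", []):
            rows.append({"text": offset["text"],
                         "class_id": entity["classId"],
                         "fields": fields})
    if not rows:
        # Simplification: a file without any entity becomes a single "none" row.
        rows.append({"text": raw_text.strip(), "class_id": "none", "fields": {}})
    return rows


def load_folders(json_dir, txt_dir):
    """Pair files by name stem (assumption: 'a.json' belongs to 'a.txt')."""
    rows = []
    for json_path in sorted(Path(json_dir).glob("*.json")):
        txt_path = Path(txt_dir) / (json_path.stem + ".txt")
        annotation = json.loads(json_path.read_text(encoding="utf-8"))
        rows.extend(annotation_rows(annotation, txt_path.read_text(encoding="utf-8")))
    return rows


# Trimmed version of the JSON from the question, used as an in-memory example.
example = {"entities": [{"classId": "e_1",
                         "offsets": [{"start": 11, "text": "This is the text"}],
                         "fields": {"f_4": {"value": "3"}}}]}
rows = annotation_rows(example, "This is the text")
print(rows)
```

The resulting list of dicts can be handed to pd.DataFrame(rows) and, from there, written out with DataFrame.to_csv if a .csv file is wanted.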

Now, you can use different machine learning libraries, but the one I would recommend for starters is definitely scikit-learn. It provides a good explanation of how to convert your raw text strings into features usable for machine learning:

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#extracting-features-from-text-files
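
A minimal sketch of that feature-extraction step, assuming scikit-learn is installed. The three sentences are made-up placeholders, not data from the question, and TfidfVectorizer is used here as a shortcut that combines the counting and TF-IDF weighting steps the tutorial walks through separately.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder texts (invented for illustration).
texts = [
    "the government passed a new law",
    "the team won the match",
    "rain is expected tomorrow",
]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and returns a sparse matrix
# with one row per text and one column per vocabulary token.
features = vectorizer.fit_transform(texts)

print(features.shape)
print(sorted(vectorizer.vocabulary_))
```

With a DataFrame like the one above, the 'text' column would take the place of the texts list.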

And afterwards, how to train a classifier using these features:

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#training-a-classifier
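
As a rough illustration of the training step, the sketch below chains a vectorizer and a naive Bayes classifier (the classifier family the tutorial starts with) into one pipeline. The sentences and labels are invented toy data, far too small for a real model, and the choice of MultinomialNB is only one option among the classifiers scikit-learn offers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (invented for illustration).
train_texts = [
    "the government passed a new law",
    "parliament debated the budget",
    "the team won the match",
    "the striker scored two goals",
    "rain is expected tomorrow",
    "sunny skies and a light wind",
]
train_labels = ["politics", "politics", "sports", "sports", "weather", "weather"]

# The pipeline turns raw strings into TF-IDF features, then fits the classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# predict() accepts raw strings because the vectorizer is part of the pipeline.
predicted = model.predict(["the team lost the match"])
print(predicted[0])
```

Keeping the vectorizer inside the pipeline means new text is transformed with the same vocabulary that was learned during training.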

The DataFrame shown above should be a good starting point for trying out these scikit-learn techniques.
