
KeyError:"Corrections" while parsing text using GingerIt in python on text data in pandas

##!pip install gingerit

from gingerit.gingerit import GingerIt
jd = []
for txt in list(data['Job Description']):
   jd.append(GingerIt().parse(txt)['result'])
data['jd'] = jd

I want to correct the spelling and grammatical mistakes in a text column of a pandas DataFrame that has about 3,000 rows. Each row contains 4-5 sentences. I used GingerIt() from gingerit.gingerit, but I get the following error:

KeyError                                  Traceback (most recent call last)
<ipython-input-25-ea5c757d88d2> in <module>()
     5           jd = []
     6           for txt in list(datajd['Job Description']):
---->7           jd.append(GingerIt().parse(txt)['result'])


/usr/local/lib/python3.7/dist-packages/gingerit/gingerit.py in parse(self, text, verify)
      26         )
      27         data = request.json()
 ---> 28         return self._process_data(text, data)
      29 
      30     @staticmethod

 /usr/local/lib/python3.7/dist-packages/gingerit/gingerit.py in _process_data(self, text, data)
      38         corrections = []
      39 
 ---> 40         for suggestion in reversed(data["Corrections"]):
      41             start = suggestion["From"]
      42             end = suggestion["To"]

 KeyError: 'Corrections'

Ginger has a paid Premium service based on API keys, so the free version that GingerIt uses cannot handle text longer than 300 characters. For longer input, the response presumably contains no "Corrections" entry, which is what raises the KeyError in _process_data.
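As a minimal sketch of guarding against this, assuming the 300-character limit stated above: the helper below leaves a text unchanged when it is too long or when the response lacks the "Corrections" key. The names safe_correct, parse_fn and MAX_CHARS are hypothetical and not part of gingerit; the parser is passed in as a callable so the sketch can be tried without network access.

```python
MAX_CHARS = 300  # assumed limit of the free service, as described above


def safe_correct(text, parse_fn, max_chars=MAX_CHARS):
    """Return the corrected text, or the original text if it cannot be checked.

    parse_fn is any callable that takes a string and returns a dict with a
    'result' key, for example GingerIt().parse.
    """
    if len(text) >= max_chars:
        # Too long for the free service: leave the text unchanged.
        return text
    try:
        return parse_fn(text)['result']
    except KeyError:
        # The response had no "Corrections" key: leave the text unchanged.
        return text
```

With gingerit installed, this could be called as safe_correct(txt, GingerIt().parse). Texts that are skipped this way are not checked at all, so it only avoids the exception; it does not replace splitting the text into shorter pieces.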

You can use any sentence segmenter you like; one option is pysbd, the Pragmatic Sentence Boundary Disambiguation module (install it with pip install pysbd). Then run the sentences shorter than 300 characters through Ginger and join the results.

If your text can contain longer sentences and you still want to process them, make sure you split those sentences further. Here, I suggest a regex like [^;:\n•]+[;,:\n•]?\s* that subsegments on ;, :, a newline and a bullet point, but you can add any other characters you need.
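To see what this regex does on its own, here is a small sketch using only the standard library; the sample strings are made up for illustration.

```python
import re

subsegment_re = r'[^;:\n•]+[;,:\n•]?\s*'

# Each match is a run of text plus the delimiter and whitespace that follow it.
text = "Responsibilities: design APIs; write tests; review code"
parts = re.findall(subsegment_re, text)
print(parts)
# ['Responsibilities: ', 'design APIs; ', 'write tests; ', 'review code']

# A delimiter that directly follows another delimiter matches nothing.
text2 = "first item\n•second item"
parts2 = re.findall(subsegment_re, text2)
print(parts2)
# ['first item\n', 'second item']
```

In the first case, joining the parts with "".join(parts) gives back the original string. In the second, re.findall skips the bullet because the pattern cannot match at that position, so a bullet that directly follows a newline is dropped from the output.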

from gingerit.gingerit import GingerIt # pip install gingerit
import pandas as pd
import pysbd, re # pip install pysbd

file  = r'test.csv'

segmentor = pysbd.Segmenter(language="en", clean=False)
data = pd.read_csv(file)

subsegment_re = r'[^;:\n•]+[;,:\n•]?\s*'

def runGinger(par):
    fixed = []
    for sentence in segmentor.segment(par):
        if len(sentence) < 300:
            # Short enough for the free Ginger service
            fixed.append(GingerIt().parse(sentence)['result'])
        else:
            # Too long: split further on ; : newline and bullet point
            subsegments = re.findall(subsegment_re, sentence)
            if len(subsegments) == 1 or any(len(v) >= 300 for v in subsegments):
                # No grammar check possible: keep the sentence unchanged
                # print(f'Skipped: {sentence}')
                fixed.append(sentence)
            else:
                res = []
                for s in subsegments:
                    res.append(GingerIt().parse(s)['result'])
                fixed.append("".join(res))
    return " ".join(fixed)

data['jd'] = data['Job Description'].apply(runGinger)
