
KeyError:"Corrections" while parsing text using GingerIt in python on text data in pandas


##!pip install gingerit

from gingerit.gingerit import GingerIt
jd = []
for txt in list(data['Job Description']):
   jd.append(GingerIt().parse(txt)['result'])
data['jd'] = jd

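I want to correct spelling and grammar mistakes in a text feature/column of a pandas DataFrame that has ~3000 rows. Each row contains 4-5 sentences. So I used GingerIt() from gingerit.gingerit, but I get an error.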

KeyError                                  Traceback (most recent call last)
<ipython-input-25-ea5c757d88d2> in <module>()
     5           jd = []
     6           for txt in list(datajd['Job Description']):
---->7           jd.append(GingerIt().parse(txt)['result'])


/usr/local/lib/python3.7/dist-packages/gingerit/gingerit.py in parse(self, text, verify)
      26         )
      27         data = request.json()
 ---> 28         return self._process_data(text, data)
      29 
      30     @staticmethod

 /usr/local/lib/python3.7/dist-packages/gingerit/gingerit.py in _process_data(self, text, data)
      38         corrections = []
      39 
 ---> 40         for suggestion in reversed(data["Corrections"]):
      41             start = suggestion["From"]
      42             end = suggestion["To"]

 KeyError: 'Corrections'

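GingerIt has a paid premium service that is based on an API key, so the free version cannot handle sentences longer than 300 characters. For such input the JSON response apparently contains no "Corrections" key, which is what _process_data trips over in the traceback above.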
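To keep one over-long row from aborting the whole loop, the call can be wrapped defensively. The following is only a minimal sketch: `safe_correct`, `parse_fn`, and `MAX_CHARS` are hypothetical names introduced here, and the 300-character limit is the figure reported in this answer, not one checked against Ginger's documentation. The parser is passed in as a function so the helper does not depend on gingerit itself.

```python
from typing import Callable, Dict

# Limit reported in this answer (an assumption, not an official figure).
MAX_CHARS = 300


def safe_correct(text: str,
                 parse_fn: Callable[[str], Dict],
                 max_chars: int = MAX_CHARS) -> str:
    """Return the corrected text, or the original text if it cannot be checked."""
    if len(text) >= max_chars:
        # Too long for the free service: leave the text unchanged.
        return text
    try:
        return parse_fn(text)["result"]
    except KeyError:
        # The parser or its response lacked an expected key
        # (e.g. "Corrections"); fall back to the original text.
        return text
```

With gingerit installed, this would be called as `safe_correct(txt, GingerIt().parse)` inside the loop from the question.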

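You can use a sentence splitter of your choice; here, the pysbd (Pragmatic Sentence Boundary Disambiguation) module is used (install it with pip install pysbd). Then run the sentences that are shorter than 300 characters through Ginger and join the results.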

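If you may have very long sentences that you still want to process, make sure you split them further. Here, I suggest a regex like [^;:\n•]+[;,:\n•]?\s* to split at ;, :, newlines, and bullet points, but you can add more characters as needed.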
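As a quick illustration of how that pattern behaves, here is a small sketch using only the standard library. The sample sentence and the names `SUBSEGMENT_RE` and `split_subsegments` are made up for this example.

```python
import re

# Same pattern as suggested above: a run of non-delimiter characters,
# an optional trailing delimiter, then any trailing whitespace.
SUBSEGMENT_RE = re.compile(r'[^;:\n•]+[;,:\n•]?\s*')


def split_subsegments(sentence):
    """Split a sentence into chunks that end at ';', ':', a newline or a bullet."""
    return SUBSEGMENT_RE.findall(sentence)


text = "Design APIs; write tests: review code"
parts = split_subsegments(text)
print(parts)
```

Because each chunk keeps its trailing delimiter and whitespace, joining the chunks with an empty string reproduces this input. Note that a delimiter with no text in front of it (a leading bullet, or two delimiters in a row) is not captured by the pattern, so it would be lost when the chunks are joined.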

from gingerit.gingerit import GingerIt # pip install gingerit
import pandas as pd
import pysbd, re # pip install pysbd

file  = r'test.csv'

segmentor = pysbd.Segmenter(language="en", clean=False)
data = pd.read_csv(file)

subsegment_re = r'[^;:\n•]+[;,:\n•]?\s*'

def runGinger(par):
    fixed = []
    for sentence in segmentor.segment(par):
        if len(sentence) < 300:
            fixed.append(GingerIt().parse(sentence)['result'])
        else:
            subsegments = re.findall(subsegment_re, sentence)
            if len(subsegments) == 1 or any(len(v) >= 300 for v in subsegments):
                # No grammar check possible: the sentence could not be split
                # into pieces that are all shorter than 300 characters.
                # print(f'Skipped: {sentence}')
                fixed.append(sentence)
            else:
                res = []
                for s in subsegments:
                    res.append(GingerIt().parse(s)['result'])
                fixed.append("".join(res))
    return " ".join(fixed)

data['jd'] = data['Job Description'].apply(runGinger)

