
KeyError: 'Corrections' while parsing text with GingerIt in Python on a text column in pandas

# !pip install gingerit

from gingerit.gingerit import GingerIt
jd = []
for txt in list(data['Job Description']):
    jd.append(GingerIt().parse(txt)['result'])
data['jd'] = jd

I want to correct the spelling and grammatical mistakes in a text column of a pandas DataFrame that has ~3000 rows. Each row contains 4-5 sentences. I used GingerIt() from gingerit.gingerit, and I am getting the following error:

KeyError                                  Traceback (most recent call last)
<ipython-input-25-ea5c757d88d2> in <module>()
     5           jd = []
     6           for txt in list(datajd['Job Description']):
---->7           jd.append(GingerIt().parse(txt)['result'])


/usr/local/lib/python3.7/dist-packages/gingerit/gingerit.py in parse(self, text, verify)
      26         )
      27         data = request.json()
 ---> 28         return self._process_data(text, data)
      29 
      30     @staticmethod

 /usr/local/lib/python3.7/dist-packages/gingerit/gingerit.py in _process_data(self, text, data)
      38         corrections = []
      39 
 ---> 40         for suggestion in reversed(data["Corrections"]):
      41             start = suggestion["From"]
      42             end = suggestion["To"]

 KeyError: 'Corrections'

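Ginger has a paid Premium service based on API keys, and the free version that GingerIt calls cannot deal with texts of more than 300 chars. For longer input, the JSON response apparently comes back without the "Corrections" key, which is why _process_data raises the KeyError.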
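As a minimal sketch of guarding against this failure, the wrapper below returns the text unchanged when it is too long or when the response lacks the expected keys. The names safe_parse, MAX_CHARS and FakeParser are hypothetical and not part of gingerit; FakeParser only stands in for GingerIt so the control flow can be run offline, and the 300-char limit is the figure quoted above, not one taken from Ginger's documentation.

```python
MAX_CHARS = 300  # limit quoted in this answer; an assumption, not a documented constant


def safe_parse(parser, text, max_chars=MAX_CHARS):
    """Return the corrected text, or the original text when no check is possible.

    `parser` is any object with a GingerIt-style parse(text) method that
    returns a dict containing a 'result' key.
    """
    if len(text) >= max_chars:
        return text  # too long for the free service: leave unchanged
    try:
        return parser.parse(text)["result"]
    except KeyError:
        # The response had no usable payload (e.g. a missing 'Corrections' key).
        return text


class FakeParser:
    """Hypothetical stand-in for GingerIt, used only to show the control flow."""

    def parse(self, text):
        if "boom" in text:
            raise KeyError("Corrections")  # mimics the failure in the traceback
        return {"result": text.replace("teh", "the")}


print(safe_parse(FakeParser(), "teh job"))   # the job
print(safe_parse(FakeParser(), "boom teh"))  # boom teh
```

With the real library, safe_parse(GingerIt(), txt) would be called in the same way, so a single bad row no longer aborts the whole loop.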

You can use a sentence segmenter of your choice; here, I use pysbd, the Pragmatic Sentence Boundary Disambiguation module (install it with pip install pysbd). Then, run the sentences shorter than 300 chars through Ginger and join the results.

If your text can contain longer sentences and you still want to handle them, subsegment those sentences further. Here, I suggest a regex like [^;:\n•]+[;,:\n•]?\s* that subsegments on a semicolon, a colon, a newline and a bullet point, but you may add any other chars you need.
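The subsegmenting pattern can be tried on its own before wiring it into the pipeline. The sample strings below are made up for illustration; only the regex comes from this answer.

```python
import re

subsegment_re = r'[^;:\n•]+[;,:\n•]?\s*'

# Each piece keeps its trailing delimiter and whitespace,
# so joining the pieces restores the input here.
sample = "Skills: Python; SQL; Docker"
parts = re.findall(subsegment_re, sample)
print(parts)                     # ['Skills: ', 'Python; ', 'SQL; ', 'Docker']
print("".join(parts) == sample)  # True

# A delimiter with no text before it (e.g. a leading bullet) is not captured.
bulleted = "• Docker"
print(re.findall(subsegment_re, bulleted))  # [' Docker']
```

Because a leading or repeated delimiter is dropped, joining the pieces does not always reproduce the original text exactly. That may or may not matter for your data.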

from gingerit.gingerit import GingerIt # pip install gingerit
import pandas as pd
import pysbd, re # pip install pysbd

file = r'test.csv'

segmentor = pysbd.Segmenter(language="en", clean=False)
data = pd.read_csv(file)

subsegment_re = r'[^;:\n•]+[;,:\n•]?\s*'

def runGinger(par):
    fixed = []
    for sentence in segmentor.segment(par):
        if len(sentence) < 300:
            fixed.append(GingerIt().parse(sentence)['result'])
        else:
            subsegments = re.findall(subsegment_re, sentence)
            # Skip the sentence if it cannot be split or if any piece is still too long
            if len(subsegments) == 1 or any(len(v) >= 300 for v in subsegments):
                # print(f'Skipped: {sentence}')  # No grammar check possible
                fixed.append(sentence)
            else:
                res = []
                for s in subsegments:
                    res.append(GingerIt().parse(s)['result'])
                fixed.append("".join(res))
    return " ".join(fixed)

data['jd'] = data['Job Description'].apply(runGinger)
