##!pip install gingerit
from gingerit.gingerit import GingerIt
jd = []
for txt in list(data['Job Description']):
    jd.append(GingerIt().parse(txt)['result'])
data['jd'] = jd
I want to correct the spelling and grammatical mistakes in a text column of a pandas DataFrame that has ~3000 rows. Each row contains 4-5 statements. I used GingerIt() from gingerit.gingerit, but I am getting the following error:
KeyError                                  Traceback (most recent call last)
<ipython-input-25-ea5c757d88d2> in <module>()
      5 jd = []
      6 for txt in list(datajd['Job Description']):
----> 7     jd.append(GingerIt().parse(txt)['result'])

/usr/local/lib/python3.7/dist-packages/gingerit/gingerit.py in parse(self, text, verify)
     26 )
     27 data = request.json()
---> 28 return self._process_data(text, data)
     29
     30 @staticmethod

/usr/local/lib/python3.7/dist-packages/gingerit/gingerit.py in _process_data(self, text, data)
     38 corrections = []
     39
---> 40 for suggestion in reversed(data["Corrections"]):
     41     start = suggestion["From"]
     42     end = suggestion["To"]

KeyError: 'Corrections'
GingerIt has a paid Premium service based on API keys, so the free version cannot handle sentences longer than 300 chars.
You can use any sentence segmenter you like, for example the pysbd (Pragmatic Sentence Boundary Disambiguation) module (install it with pip install pysbd). Then run the sentences shorter than 300 chars through Ginger and join the results.
If you can have longer sentences and you still want to handle them, make sure you subsegment them further. Here, I suggest a regex like [^;:\n•]+[;,:\n•]?\s* that subsegments on ;, :, a newline and a bullet point, but you may add any other chars you need.
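To see how that pattern splits text, here is a small illustration that uses only the standard re module. The sample strings are made up for the demo and are not taken from the question's data.

```python
import re

# Same pattern as suggested above
subsegment_re = r'[^;:\n•]+[;,:\n•]?\s*'

# Made-up sample text (an assumption for illustration only)
text = "Requirements: Python; SQL, Excel; good communication"
parts = re.findall(subsegment_re, text)
print(parts)
# → ['Requirements: ', 'Python; ', 'SQL, Excel; ', 'good communication']

# Each piece keeps its trailing delimiter and whitespace,
# so joining the pieces restores this particular string.
print("".join(parts) == text)
# → True

# Caveat: a delimiter that directly follows another one is not captured.
bulleted = "Skills:\n• Python"
print(re.findall(subsegment_re, bulleted))
# → ['Skills:\n', ' Python']  (the bullet itself is dropped)
```

Note that the comma does not split anything here, because the first character class still consumes it. If you need to split on commas too, add `,` to the negated class.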
from gingerit.gingerit import GingerIt  # pip install gingerit
import pandas as pd
import pysbd, re  # pip install pysbd

file = r'test.csv'
segmentor = pysbd.Segmenter(language="en", clean=False)
data = pd.read_csv(file)
subsegment_re = r'[^;:\n•]+[;,:\n•]?\s*'

def runGinger(par):
    fixed = []
    for sentence in segmentor.segment(par):
        if len(sentence) < 300:
            fixed.append(GingerIt().parse(sentence)['result'])
        else:
            # Too long for the free version: try to split the sentence further
            subsegments = re.findall(subsegment_re, sentence)
            # Give up if nothing was split or a piece is still too long
            if len(subsegments) == 1 or any(len(v) >= 300 for v in subsegments):
                # print(f'Skipped: {sentence}')  # No grammar check possible
                fixed.append(sentence)
            else:
                res = []
                for s in subsegments:
                    res.append(GingerIt().parse(s)['result'])
                fixed.append("".join(res))
    return " ".join(fixed)

data['jd'] = data['Job Description'].apply(runGinger)
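If you would rather not have a single problematic row abort the whole run, one option is a small defensive wrapper around the parse call. This is only a sketch: parse_or_keep and MAX_CHARS are hypothetical names introduced here, and the wrapper takes the parse function as an argument, so it makes no assumptions about GingerIt beyond the ['result'] key and the KeyError seen in the traceback.

```python
from typing import Callable

MAX_CHARS = 300  # hypothetical constant: the limit mentioned above


def parse_or_keep(parse: Callable[[str], dict], text: str) -> str:
    """Return the corrected text, or the input unchanged if it cannot be checked."""
    if len(text) >= MAX_CHARS:
        # Too long for the free version: skip the request entirely
        return text
    try:
        return parse(text)["result"]
    except KeyError:
        # A key such as 'Corrections' or 'result' was missing
        return text
```

Assuming the setup from the code above, it could be called as parse_or_keep(GingerIt().parse, sentence) in place of GingerIt().parse(sentence)['result'].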