KeyError: "Corrections" while parsing text using GingerIt in Python on text data in pandas
# !pip install gingerit
from gingerit.gingerit import GingerIt

jd = []
for txt in list(data['Job Description']):
    jd.append(GingerIt().parse(txt)['result'])
data['jd'] = jd
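I want to correct spelling and grammar errors in a text feature/column of a pandas DataFrame that has ~3000 rows. Each row contains 4-5 sentences. I therefore used GingerIt() from gingerit.gingerit, but I get the following error: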
KeyError                                  Traceback (most recent call last)
<ipython-input-25-ea5c757d88d2> in <module>()
      5 jd = []
      6 for txt in list(datajd['Job Description']):
----> 7     jd.append(GingerIt().parse(txt)['result'])

/usr/local/lib/python3.7/dist-packages/gingerit/gingerit.py in parse(self, text, verify)
     26         )
     27         data = request.json()
---> 28         return self._process_data(text, data)
     29
     30     @staticmethod

/usr/local/lib/python3.7/dist-packages/gingerit/gingerit.py in _process_data(self, text, data)
     38         corrections = []
     39
---> 40         for suggestion in reversed(data["Corrections"]):
     41             start = suggestion["From"]
     42             end = suggestion["To"]

KeyError: 'Corrections'
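GingerIt has a paid premium service based on an API key, so the free version cannot handle sentences longer than 300 characters. For longer input the response apparently lacks the "Corrections" key, which is what raises the KeyError shown in the traceback.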
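As a minimal sketch of guarding against this failure, the helper below skips text at or above the 300-character limit and falls back to the original text if a KeyError is still raised. The name safe_parse and its parse argument are assumptions made for illustration; they are not part of gingerit. Any callable that returns a dict with a 'result' key would work, for example GingerIt().parse.

```python
MAX_CHARS = 300  # limit quoted above for the free service


def safe_parse(parse, text, max_chars=MAX_CHARS):
    """Return the corrected text, or the input unchanged if it cannot be checked.

    `parse` is a hypothetical stand-in for any callable that takes a string
    and returns a dict with a 'result' key (e.g. GingerIt().parse).
    """
    # Non-string cells (such as NaN) and over-long text are passed through.
    if not isinstance(text, str) or len(text) >= max_chars:
        return text
    try:
        return parse(text)['result']
    except KeyError:
        # The response had no usable payload, so keep the original text.
        return text
```

With this in place, the loop from the question could call safe_parse(GingerIt().parse, txt) instead of indexing the result directly, so one problematic row would no longer abort the whole run.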
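You can use a sentence splitter of your choice; here, you can use the pysbd (Pragmatic Sentence Boundary Disambiguation) module (install it with pip install pysbd). Then run the sentences that are shorter than 300 characters through Ginger and join the results.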
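If you may have very long sentences and still want to process them, make sure you subdivide those sentences further. Here, I suggest a regex like [^;:\n•]+[;,:\n•]?\s* that splits at ;, :, newlines, and bullet characters, but you can add more characters as needed.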
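To see what this pattern does, here is a small sketch that applies it to a made-up string (the sample text is an assumption, not taken from the question's data). Each piece keeps its trailing delimiter and whitespace, so the pieces can be joined back together afterwards.

```python
import re

subsegment_re = r'[^;:\n•]+[;,:\n•]?\s*'

# Hypothetical sample text with a semicolon, a colon, and a newline
text = "Design APIs; write tests: unit and integration\nReview code"

parts = re.findall(subsegment_re, text)
for part in parts:
    print(repr(part))

# For this sample, joining the pieces reproduces the input
print("".join(parts) == text)
```

Note that when two delimiters follow each other directly (for example a newline followed by a bullet), only the first one is captured, so joining the pieces may not reproduce such input exactly.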
from gingerit.gingerit import GingerIt  # pip install gingerit
import pandas as pd
import pysbd, re  # pip install pysbd

file = r'test.csv'
segmentor = pysbd.Segmenter(language="en", clean=False)
data = pd.read_csv(file)
subsegment_re = r'[^;:\n•]+[;,:\n•]?\s*'

def runGinger(par):
    fixed = []
    for sentence in segmentor.segment(par):
        if len(sentence) < 300:
            fixed.append(GingerIt().parse(sentence)['result'])
        else:
            # Sentence is too long for the free service: try smaller pieces
            subsegments = re.findall(subsegment_re, sentence)
            if len(subsegments) == 1 or any(len(v) >= 300 for v in subsegments):
                # No grammar check possible: keep the sentence unchanged
                # print(f'Skipped: {sentence}')
                fixed.append(sentence)
            else:
                res = []
                for s in subsegments:
                    res.append(GingerIt().parse(s)['result'])
                fixed.append("".join(res))
    return " ".join(fixed)

data['jd'] = data['Job Description'].apply(runGinger)