[英]Spell Checker in Pandas
I'm trying to implement Peter Norvig's spell checker in a pandas class with words pulled from a SQL database.我正在尝试使用从 SQL 数据库中提取的单词在 Pandas 类中实现Peter Norvig 的拼写检查器。 The data contains user queries which often contains a number of spelling errors, and I'm hoping this class will return the most likely query (spelt correctly).
数据包含用户查询,这些查询通常包含许多拼写错误,我希望这个类将返回最可能的查询(拼写正确)。
The class is initialized with a database query that returns a pandas dataframe.该类使用返回 Pandas 数据帧的数据库查询进行初始化。 For example:
例如:
query count
0 foo bar 1864
1 super foo 73
2 bar of foos 1629
3 crazy foos 940
Most of the below is pulled directly from Peter's work, but the modifications I've made to the class don't seem to work correctly.下面的大部分内容都是直接从 Peter 的作品中提取的,但我对课程所做的修改似乎无法正常工作。 My guess is that it has something to do with removing the Counter functionality (
WORDS = Counter(words(open('big.txt').read()))
) but I'm unsure the best way to get this same functionality from a dataframe.我的猜测是它与删除计数器功能(
WORDS = Counter(words(open('big.txt').read()))
)有关,但我不确定从一个数据框。
Current class below:以下当前班级:
class _SpellCheckClient(object):
"""Wraps functionality to check the spelling of a query."""
def __init__(self, team, table, dremel_connection):
self.df = database_connection.ExecuteQuery(
'SELECT query, COUNT(query) AS count FROM table GROUP BY 1;'
def expected_word(self, word):
"""Most probable spelling correction for word."""
return max(self._candidates(word), key=self._probability)
def _probability(self, query):
"""Probability of a given word within a query."""
query_count = self.df.loc[self.df['query'] == query]['count'].values
return query_count / self.df['count'].sum()
def _candidates(self, word):
"""Generate possible spelling corrections for word."""
return (self._known([word])
or self._known(self._one_edits_from_word(word))
or self._known(self._two_edits_from_word(word))
or [word])
def _known(self, query):
"""The subset of `words` that appear in the dictionary of WORDS."""
# return set(w for w in query if w in WORDS)
return set(w for w in query if w in self.df['query'].value_counts)
def _one_edits_from_word(self, word):
"""All edits that are one edit away from `word`."""
splits = [(word[:i], word[i:]) for i in xrange(len(word) + 1)]
deletes = [left + right[1:] for left, right in splits if right]
transposes = [left + right[1] + right[0] + right[2:]
for left, right in splits
if len(right) > 1]
replaces = [left + center + right[1:]
for left, right in splits
if right for center in LETTERS]
inserts = [left + center + right
for left, right in splits
for center in LETTERS]
return set(deletes + transposes + replaces + inserts)
def _two_edits_from_word(self, word):
"""All edits that are two edits away from `word`."""
return (e2 for e1 in self._one_edits_from_word(word)
for e2 in self._one_edits_from_word(e1))
Thanks in advance!提前致谢!
For anyone looking for an answer to this, below is what worked for me:对于任何寻找答案的人来说,以下是对我有用的:
def _words(df):
"""Returns the total count of each word within a dataframe."""
return df['query'].str.get_dummies(sep=' ').T.dot(df['count'])
class _SpellCheckClient(object):
"""Wraps functionality to check the spelling of a query."""
def __init__(self, team, table, database_connection):
self.df = database_connection
self.words = _words(self.df)
def expected_word(self, query):
"""Most probable spelling correction for word."""
return max(self._candidates(query), key=self._probability)
def _probability(self, query):
"""Probability of a given word within a query."""
return self.words.pipe(lambda x: x / x.sum()).get(query, 0.0)
def _candidates(self, query):
"""Generate possible spelling corrections for word."""
return (self._known(self._one_edits_from_query(query))
or self._known(self._two_edits_from_query(query))
or [query])
def _known(self, query):
"""The subset of `query` that appear in the search console database."""
return set(w for w in query if self.words.get(w))
def _one_edits_from_query(self, query):
"""All edits that are one edit away from `query`."""
splits = [(query[:i], query[i:]) for i in xrange(len(query) + 1)]
deletes = [left + right[1:] for left, right in splits if right]
transposes = [left + right[1] + right[0] + right[2:]
for left, right in splits
if len(right) > 1]
replaces = [left + center + right[1:]
for left, right in splits
if right for center in LETTERS]
inserts = [left + center + right
for left, right in splits
for center in LETTERS]
return set(deletes + transposes + replaces + inserts)
def _two_edits_from_query(self, query):
"""All edits that are two edits away from `query`."""
return (e2 for e1 in self._one_edits_from_query(query)
for e2 in self._one_edits_from_query(e1))
import pandas as pd
from spellchecker import SpellChecker
df = pd.Series(['Customir','Tast','Hlp'])
spell = SpellChecker(distance=1)
def Correct(x):
return spell.correction(x)
df = df.apply(Correct)
df
0 customer
1 last
2 help
dtype: object
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.