Apply a function to every word of every row in a pandas DataFrame column

Question

I have a sample dataframe as follows:

import pandas as pd

df = pd.DataFrame({
'notes': pd.Series(['This speling is incorrect', 'Corrector should correct korrecter']), 
'name': pd.Series(['Walter White', 'Walter White']), 
})

  name                notes
0  Walter White     This speling is incorrect
1  Walter White     Corrector should correct korrecter

I want to adapt the spell checker by Peter Norvig, available here. I would then like to apply this function to every row by going over every word in the row. How can this be done in pandas?

I would like the output to be:

    name                notes
0  Walter White     This spelling is incorrect
1  Walter White     Corrector should correct corrector

I'd appreciate any input. Thanks!

Answer 1

You can try this solution with str.split, but I think performance can be a problem on a big df:

import pandas as pd
import numpy as np

df = pd.DataFrame({
'notes': pd.Series(['This speling is incorrect', 'Corrector should correct korrecter one']), 
'name': pd.Series(['Walter White', 'Walter White']), 
})
print(df)
           name                                   notes
0  Walter White               This speling is incorrect
1  Walter White  Corrector should correct korrecter one    

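#simulate function correct
def correct(x):
    return x + '888'

#split column notes into one column per word and apply correct
#note: DataFrame.apply passes each column to correct as a whole Series; that
#works here because + is vectorized, but a word-level function needs Series.map
df1 = df.notes.str.split(expand=True).apply(correct)
print(df1)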
              0           1           2             3       4
0       This888  speling888       is888  incorrect888     NaN
1  Corrector888   should888  correct888  korrecter888  one888

#replace NaN with '' and concatenate all words together
df['notes'] = df1.fillna('').apply(lambda row: ' '.join(row), axis=1)
print(df)
           name                                              notes
0  Walter White             This888 speling888 is888 incorrect888 
1  Walter White  Corrector888 should888 correct888 korrecter888...
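
If the intermediate one-column-per-word frame is not needed, the same idea can be written with Series.apply and a plain split/join. This is only a minimal sketch: the correct function below is a hypothetical lookup table standing in for a real spell checker, and correct_sentence is a helper name made up for the example.

```python
import pandas as pd

def correct(word):
    # Stand-in for a real spell checker; this lookup table is purely illustrative.
    fixes = {'speling': 'spelling', 'korrecter': 'corrector'}
    return fixes.get(word, word)

def correct_sentence(text):
    # Split on whitespace, correct each word, and join the words with single spaces.
    return ' '.join(correct(word) for word in text.split())

df = pd.DataFrame({
    'name': ['Walter White', 'Walter White'],
    'notes': ['This speling is incorrect', 'Corrector should correct korrecter'],
})

# Series.apply calls correct_sentence once per row of the column.
df['notes'] = df['notes'].apply(correct_sentence)
print(df['notes'].tolist())
# ['This spelling is incorrect', 'Corrector should correct corrector']
```

Because the words are joined per row, there are no NaN cells to fill and no trailing spaces in the result.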

Answer 2

I used the code from the link you posted to make this work. Treat it as a starting point.

import re, collections
import pandas as pd

# This code comes from the link you have posted
def words(text): return re.findall('[a-z]+', text.lower()) 

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

# file() only exists in Python 2; open() works in both Python 2 and 3
NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

# This is your code
df = pd.DataFrame({
'notes': pd.Series(['speling', 'korrecter']), 
'name': pd.Series(['Walter White', 'Walter White']), 
})

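# Spellchecking can be optimized, of course, and not hardcoded
# Note: this treats each cell as a single word
# (DataFrame.set_value was removed in pandas 1.0, so assign through .at instead)
for i, row in df.iterrows():
    df.at[i, 'notes'] = correct(row['notes'])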
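
One possible way to extend the loop above to whole sentences, and to avoid correcting the same word repeatedly on a large frame, is to correct each distinct word once and reuse the result. The sketch below is an assumption-laden illustration, not part of the original answer: correct is a hypothetical lookup standing in for the Norvig function (which expects lowercase words), and WORD, build_cache and fix_text are names invented for the example.

```python
import re
import pandas as pd

def correct(word):
    # Hypothetical stand-in for the spell checker's correct(); expects lowercase input.
    fixes = {'speling': 'spelling', 'korrecter': 'corrector'}
    return fixes.get(word, word)

# Matches runs of ASCII letters, so punctuation and spacing are left in place.
WORD = re.compile(r'[A-Za-z]+')

def build_cache(notes):
    # Collect the distinct lowercase words in the column and correct each one once.
    unique_words = {w.lower() for text in notes for w in WORD.findall(text)}
    return {w: correct(w) for w in unique_words}

def fix_text(text, cache):
    def replace(match):
        word = match.group(0)
        fixed = cache.get(word.lower(), word)
        if fixed == word.lower():
            return word  # already correct: keep the original casing
        # Restore a leading capital if the misspelled word had one.
        return fixed.capitalize() if word[0].isupper() else fixed
    return WORD.sub(replace, text)

df = pd.DataFrame({
    'name': ['Walter White', 'Walter White'],
    'notes': ['This speling is incorrect.', 'Korrecter should correct korrecter!'],
})

cache = build_cache(df['notes'])
df['notes'] = df['notes'].apply(lambda text: fix_text(text, cache))
print(df['notes'].tolist())
# ['This spelling is incorrect.', 'Corrector should correct corrector!']
```

The cache means the expensive correction runs once per distinct word rather than once per occurrence, which matters when the real correct has to generate and test many candidate edits.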

Answer 1: 1, accepted, 2016-03-03 08:24:21
Answer 2: 0, 2016-03-03 08:24:13
