Matching singular and plural words with Pandas

Question

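This question extends my earlier question, "Python Pandas matching multiple phrases". Although I found a solution to that problem, some typical singular/plural issues have come up.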

import numpy as np
import pandas as pd

ingredients=pd.Series(["vanilla extract","walnut","oat","egg","almond","strawberry"])

df=pd.DataFrame(["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])

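I simply need to match the phrases in the ingredients Series against the phrases in the DataFrame. As pseudocode:

If an ingredient (singular or plural) is found in a DataFrame phrase, return the ingredient. Otherwise, return false.

This was achieved with the answer given to that question: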

df.columns = ['val']
V = df.val.str.lower().values.astype(str)
K = ingredients.values.astype(str)
df['existence'] = list(map(''.join, np.where(np.char.count(V, K[...,np.newaxis]),K[...,np.newaxis], '').T))
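To make the broadcasting in the last line easier to follow, here is a smaller sketch of the same technique on three phrases and two ingredients. The shortened arrays and the intermediate names `counts`, `picked` and `existence` are illustrative assumptions, not part of the original code.

```python
import numpy as np

V = np.array(["2 eggs", "4 cups rolled oats", "sdfgsfgsf"])  # phrases
K = np.array(["oat", "egg"])                                 # ingredients

# K[..., np.newaxis] has shape (2, 1), so it broadcasts against V (shape (3,))
# into a (2, 3) table: one row per ingredient, one column per phrase.
counts = np.char.count(V, K[..., np.newaxis])

# Keep the ingredient name wherever it occurs at least once, '' elsewhere.
picked = np.where(counts, K[..., np.newaxis], "")

# Transpose to one row per phrase and join the (mostly empty) cells.
existence = list(map("".join, picked.T))
print(existence)
```

Because the check is a plain substring count, a phrase is matched only if it contains the ingredient string exactly as written.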

I also applied the following to fill the empty cells with NaN, so that the data can be filtered easily.

df.loc[df.existence=='', 'existence'] = np.nan

The result is as follows:

print(df)
                                                 val        existence
0                        1 teaspoons vanilla extract  vanilla extract
1                                             2 eggs              egg
2                             3 cups chopped walnuts           walnut
3                                 4 cups rolled oats              oat
4  1 (10.75 ounce) can Campbell's Condensed Cream...              NaN
5    6 ounces smoke-flavored almonds, finely chopped           almond
6                                          sdfgsfgsf              NaN
7                                        fsfgsgsfgfg              NaN
8                               2 small strawberries              NaN

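This has worked correctly so far, but only when the plural is formed by simply appending a suffix, as in almond => almonds or apple => apples. When something like strawberry => strawberries appears, the code reports NaN.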
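The difference can be seen with plain substring checks. This is only a small illustration; the helper name `contains` is hypothetical and not part of the code above.

```python
def contains(phrase, ingredient):
    """Substring test on the lowercased phrase, mirroring the count-based check."""
    return ingredient in phrase.lower()

# A regular plural still contains the singular form as a substring.
almond_found = contains("6 ounces smoke-flavored almonds, finely chopped", "almond")

# "strawberries" does not contain "strawberry": the "y" became "ies".
strawberry_found = contains("2 small strawberries", "strawberry")

print(almond_found, strawberry_found)  # True False
```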

I want to improve my code to detect such cases. I would like to change my ingredients Series into a DataFrame as follows.

#ingredients

#inputwords       #outputword

vanilla extract    vanilla extract 
walnut             walnut
walnuts            walnut
oat                oat
oats               oat
egg                egg
eggs               egg
almond             almond
almonds            almond
strawberry         strawberry
strawberries       strawberry
cherry             cherry
cherries           cherry

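So my logic is: whenever a word from #inputwords appears in a phrase, I want to return the corresponding word from #outputword in the other cell. In other words, when "strawberry" or "strawberries" appears in the phrase, the code simply puts the word "strawberry" next to it. My final result would then be: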

                                                 val        existence
0                        1 teaspoons vanilla extract  vanilla extract
1                                             2 eggs              egg
2                             3 cups chopped walnuts           walnut
3                                 4 cups rolled oats              oat
4  1 (10.75 ounce) can Campbell's Condensed Cream...              NaN
5    6 ounces smoke-flavored almonds, finely chopped           almond
6                                          sdfgsfgsf              NaN
7                                        fsfgsgsfgfg              NaN
8                               2 small strawberries       strawberry

I can't find a way to incorporate this functionality into my existing code or to write new code for it. Can anyone help me with this?

Answer 1

Consider using a stemmer :) http://www.nltk.org/howto/stem.html

Taken straight from their page:

    >>> from nltk.stem.snowball import SnowballStemmer
    >>> stemmer = SnowballStemmer("english")
    >>> stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
    >>> print(stemmer.stem("having"))
    have
    >>> print(stemmer2.stem("having"))
    having

Refactor your code to stem all the words in each sentence before matching them against the ingredient list.
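A minimal sketch of that idea, assuming NLTK is installed: both the phrase and each ingredient are split into words and stemmed, and an ingredient counts as found when its stemmed words appear consecutively in the phrase. The helper names `stem_tokens` and `find_ingredient` are hypothetical, and the shortened phrase list is only for illustration.

```python
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def stem_tokens(text):
    """Lowercase the text, keep alphabetic words only, and stem each word."""
    return [stemmer.stem(word) for word in re.findall(r"[a-z]+", text.lower())]

def find_ingredient(phrase, ingredients):
    """Return the first ingredient whose stems occur consecutively in the phrase, else None."""
    phrase_stems = stem_tokens(phrase)
    for ingredient in ingredients:
        target = stem_tokens(ingredient)
        n = len(target)
        # Slide a window of n stems over the phrase and compare.
        for i in range(len(phrase_stems) - n + 1):
            if phrase_stems[i:i + n] == target:
                return ingredient
    return None

ingredients = ["vanilla extract", "walnut", "oat", "egg", "almond", "strawberry"]
phrases = ["1 teaspoons vanilla extract", "2 eggs", "2 small strawberries", "sdfgsfgsf"]
matches = [find_ingredient(p, ingredients) for p in phrases]
print(matches)
```

Since "strawberry" and "strawberries" reduce to the same stem, the comparison no longer depends on how the plural is spelled. Applied to the DataFrame, the helper could be passed to `apply` on the phrase column.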

nltk is a great tool for exactly what you're asking!

Cheers

Answer 2

import numpy as np
import pandas as pd

# your data frame
df = pd.DataFrame(data = ["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])

# Here you create the mapping from each input word (index) to its output word (data)
mapping = pd.Series(index = ['vanilla extract' , 'walnut','walnuts','oat','oats','egg','eggs','almond','almonds','strawberry','strawberries','cherry','cherries'] , 
          data = ['vanilla extract' , 'walnut','walnut','oat','oat','egg','egg','almond','almond','strawberry','strawberry','cherry','cherry'])

# create a function that checks whether any input word occurs in a given phrase
def get_match(row):
    match = np.nan
    # Series.items() replaces iterkv(), which was removed from pandas
    for key , value in mapping.items():
        if key in row[0]:
            match = value
    return match

# apply this function on each row and store the result
df['existence'] = df.apply(get_match, axis = 1)
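A possible vectorized variant of the same lookup, sketched under the assumption that matching on whole words is acceptable: the input words are combined into one regular expression, `Series.str.extract` pulls out the first input word found in each phrase, and `Series.map` translates it to its output word. The plain `dict` mapping, the `val` column name and the shortened data are choices made for this illustration.

```python
import re
import pandas as pd

# Input word -> output word, as in the question's lookup table (shortened here).
mapping = {
    "vanilla extract": "vanilla extract",
    "egg": "egg", "eggs": "egg",
    "almond": "almond", "almonds": "almond",
    "strawberry": "strawberry", "strawberries": "strawberry",
}

# Longer keys first, so a longer input word is tried before any shorter word it contains.
keys = sorted(mapping, key=len, reverse=True)
pattern = r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b"

df = pd.DataFrame({"val": [
    "1 teaspoons vanilla extract",
    "2 eggs",
    "6 ounces smoke-flavored almonds, finely chopped",
    "sdfgsfgsf",
    "2 small strawberries",
]})

# Extract the first matching input word, then translate it to its output word.
found = df["val"].str.lower().str.extract(pattern, expand=False)
df["existence"] = found.map(mapping)
print(df)
```

Rows with no matching input word end up as NaN, so no separate step is needed to fill empty cells.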

Answer 1
1 2015-09-13 06:26:57

Answer 2
0 Accepted 2015-09-13 07:12:50
