
Singular and Plural words matching with Pandas

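This question is an extension of my previous question, "Python Pandas matching multiple phrases". Although I found a way to solve that problem, a typical singular/plural issue has come up.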

ingredients=pd.Series(["vanilla extract","walnut","oat","egg","almond","strawberry"])

df=pd.DataFrame(["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])

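I just need to match the phrases in the ingredients Series against the phrases in the DataFrame. As pseudocode:

If an ingredient (singular or plural) is found in a phrase in the DataFrame, return the ingredient. Otherwise, return false.

This is achieved by the answer given below: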

df.columns = ['val']
V = df.val.str.lower().values.astype(str)
K = ingredients.values.astype(str)
# list(...) is needed on Python 3, where map returns an iterator
df['existence'] = list(map(''.join, np.where(np.char.count(V, K[...,np.newaxis]),K[...,np.newaxis], '').T))
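
To make the broadcasting in that last line easier to follow, here is a minimal sketch on a reduced example. The three phrases and two ingredients are a shortened sample chosen for illustration, and the names `counts`, `hits` and `existence` are introduced here only for readability.

```python
import numpy as np

V = np.array(["2 eggs", "3 cups chopped walnuts", "sdfgsfgsf"])  # phrases
K = np.array(["walnut", "egg"])                                   # ingredients

# K[:, np.newaxis] has shape (2, 1); broadcasting it against V (shape (3,))
# yields a (2, 3) matrix: one row per ingredient, one column per phrase.
counts = np.char.count(V, K[:, np.newaxis])

# Keep the ingredient name where the substring count is non-zero, '' elsewhere,
# then transpose so that each row corresponds to one phrase.
hits = np.where(counts, K[:, np.newaxis], "").T

# Joining each row collapses it to the matched ingredient (or an empty string).
existence = ["".join(row) for row in hits]
print(existence)
```

Because the test is a plain substring count, any plural that does not contain its singular as a substring will end up as an empty string here.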

I also applied the following to fill the empty cells with NaN, so that the data can be filtered easily.

# .ix has been removed from pandas; .loc does the same label-based assignment
df.loc[df.existence=='', 'existence'] = np.nan

The result is as follows:

print(df)
                                                 val        existence
0                        1 teaspoons vanilla extract  vanilla extract
1                                             2 eggs              egg
2                             3 cups chopped walnuts           walnut
3                                 4 cups rolled oats              oat
4  1 (10.75 ounce) can Campbell's Condensed Cream...             NaN    
5    6 ounces smoke-flavored almonds, finely chopped           almond
6                                          sdfgsfgsf              NaN  
7                                        fsfgsgsfgfg              NaN
8                               2 small strawberries              NaN

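This has been accurate so far, but it breaks when the singular-to-plural mapping is not of the simple form almond => almonds or apple => apples. When something like strawberry => strawberries appears, the code returns NaN.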
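
The failure comes down to plain substring matching: a plural formed by appending "s" still contains the singular, but "strawberries" does not contain "strawberry". A quick check in plain Python (no pandas needed; the `pairs` list is just an illustrative sample) shows the difference:

```python
# Substring containment is effectively what the np.char.count approach tests.
pairs = [("almond", "almonds"), ("apple", "apples"), ("strawberry", "strawberries")]

# For each (singular, plural) pair, check whether the singular occurs inside the plural.
contained = {singular: singular in plural for singular, plural in pairs}

for singular, found in contained.items():
    print(singular, found)
```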

To improve my code so that it detects such cases, I would like to change my ingredients Series into a DataFrame like the following:

#ingredients

#inputwords       #outputword

vanilla extract    vanilla extract 
walnut             walnut
walnuts            walnut
oat                oat
oats               oat
egg                egg
eggs               egg
almond             almond
almonds            almond
strawberry         strawberry
strawberries       strawberry
cherry             cherry
cherries           cherry

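So my logic is: whenever a word from #inputwords appears in a phrase, I want to return the corresponding word from the other column (#outputword). In other words, when "strawberry" or "strawberries" appears in a phrase, the code simply puts "strawberry" next to it. My final result would then be: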

                                                 val        existence
0                        1 teaspoons vanilla extract  vanilla extract
1                                             2 eggs              egg
2                             3 cups chopped walnuts           walnut
3                                 4 cups rolled oats              oat
4  1 (10.75 ounce) can Campbell's Condensed Cream...             NaN    
5    6 ounces smoke-flavored almonds, finely chopped           almond
6                                          sdfgsfgsf              NaN  
7                                        fsfgsgsfgfg              NaN
8                               2 small strawberries       strawberry

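I cannot find a way to incorporate this functionality into my existing code, or to write new code that does it. Can anyone help me with this?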

Consider using a stemmer :) http://www.nltk.org/howto/stem.html

Taken straight from their page:

    >>> from nltk.stem.snowball import SnowballStemmer
    >>> stemmer = SnowballStemmer("english")
    >>> stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
    >>> print(stemmer.stem("having"))
    have
    >>> print(stemmer2.stem("having"))
    having

Refactor your code to stem all the words in each sentence before matching them against the ingredients list.
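
A minimal sketch of that idea, assuming NLTK and pandas are installed. The helper names `stem_text` and `find_ingredient`, the regex tokenizer, and the shortened phrase list are illustrative choices, not part of NLTK or of the original code. Both the ingredients and the phrases are stemmed, so singular and plural forms are compared on their common stem.

```python
import re

import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

ingredients = ["vanilla extract", "walnut", "oat", "egg", "almond", "strawberry"]
phrases = pd.Series([
    "1 teaspoons vanilla extract",
    "2 eggs",
    "6 ounces smoke-flavored almonds, finely chopped",
    "sdfgsfgsf",
    "2 small strawberries",
])


def stem_text(text):
    """Lowercase the text, split it into alphabetic tokens, and stem each token."""
    return [stemmer.stem(token) for token in re.findall(r"[a-z]+", text.lower())]


# Stem each ingredient once; a multi-word ingredient becomes a list of stems.
stemmed_ingredients = {name: stem_text(name) for name in ingredients}


def find_ingredient(phrase):
    """Return the first ingredient whose stems occur consecutively in the phrase, else None."""
    stems = stem_text(phrase)
    for name, target in stemmed_ingredients.items():
        n = len(target)
        if any(stems[i:i + n] == target for i in range(len(stems) - n + 1)):
            return name
    return None


existence = phrases.apply(find_ingredient)
print(existence)
```

Comparing stems token by token also avoids accidental substring hits, since an ingredient only matches whole words in the phrase.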

nltk is a great tool for exactly what you're asking for!

Cheers

import numpy as np
import pandas as pd

# your data frame
df = pd.DataFrame(data = ["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])

# Here you create the mapping: each word form (index) points to the ingredient to return (data)
mapping = pd.Series(index = ['vanilla extract' , 'walnut','walnuts','oat','oats','egg','eggs','almond','almonds','strawberry','strawberries','cherry','cherries'] , 
          data = ['vanilla extract' , 'walnut','walnut','oat','oat','egg','egg','almond','almond','strawberry','strawberry','cherry','cherry'])

# create a function that checks whether any mapped word form occurs in a given phrase
def get_match(row):
    match = np.nan
    # Series.items() replaces iterkv(), which has been removed from pandas
    for key , value in mapping.items():
        if key in row[0]:
            match = value
    return match

# apply this function on each row and store the result in a new column
df['existence'] = df.apply(get_match, axis = 1)
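
For larger data, the same lookup-table idea can be sketched without a row-wise Python loop, using a single regular expression built from the word forms. This is an alternative under stated assumptions, not the answer's own method: the `pattern`, `matched` and `existence` names are introduced here, the mapping is a plain dict, and only the first matching word form in each phrase is used.

```python
import re

import pandas as pd

phrases = pd.Series([
    "1 teaspoons vanilla extract", "2 eggs", "3 cups chopped walnuts",
    "4 cups rolled oats",
    "1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup",
    "6 ounces smoke-flavored almonds, finely chopped",
    "sdfgsfgsf", "fsfgsgsfgfg", "2 small strawberries",
])

# Lookup table: word form as it may appear in a phrase -> ingredient to return.
mapping = {
    "vanilla extract": "vanilla extract",
    "walnut": "walnut", "walnuts": "walnut",
    "oat": "oat", "oats": "oat",
    "egg": "egg", "eggs": "egg",
    "almond": "almond", "almonds": "almond",
    "strawberry": "strawberry", "strawberries": "strawberry",
    "cherry": "cherry", "cherries": "cherry",
}

# One alternation of all word forms; \b restricts matches to whole words,
# so "oat" would not match inside a longer word such as "coats".
pattern = r"\b(" + "|".join(re.escape(form) for form in mapping) + r")\b"

# Extract the first word form found in each lowercased phrase (NaN if none),
# then translate it to its ingredient through the lookup table.
matched = phrases.str.lower().str.extract(pattern, expand=False)
existence = matched.map(mapping)
print(existence)
```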

