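Singular and Plural words matching with Pandas
This question is an extension of my previous question, "Python Pandas matching multiple phrases". Although I found a way to solve that problem, some typical singular/plural issues have come up.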
import numpy as np
import pandas as pd
ingredients = pd.Series(["vanilla extract","walnut","oat","egg","almond","strawberry"])
df = pd.DataFrame(["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])
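I simply need to match the phrases in the ingredients Series against the phrases in the DataFrame. As pseudocode:
If an ingredient (singular or plural) is found in a DataFrame phrase, return the ingredient. Otherwise, return False.
This was achieved with the following code, taken from the answer I was given: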
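df.columns = ['val']
V = df.val.str.lower().values.astype(str)  # phrases, lowercased
K = ingredients.values.astype(str)  # ingredient names
# count every ingredient in every phrase, keep the names that occur, and join them per row
df['existence'] = list(map(''.join, np.where(np.char.count(V, K[..., np.newaxis]), K[..., np.newaxis], '').T))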
I also applied the following to fill the empty cells with NaN, so that the data can be filtered easily.
df.loc[df.existence == '', 'existence'] = np.nan
The result is as follows:
print(df)
   val                                                existence
0  1 teaspoons vanilla extract                        vanilla extract
1  2 eggs                                             egg
2  3 cups chopped walnuts                             walnut
3  4 cups rolled oats                                 oat
4  1 (10.75 ounce) can Campbell's Condensed Cream...  NaN
5  6 ounces smoke-flavored almonds, finely chopped    almond
6  sdfgsfgsf                                          NaN
7  fsfgsgsfgfg                                        NaN
8  2 small strawberries                               NaN
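This has been correct all along, but only while the plural is formed by simply appending a letter to the singular, as in almond => almonds or apple => apples. When something like strawberry => strawberries appears, the code does not find the ingredient and the result is NaN.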
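To make the failure concrete, here is a minimal sketch of the same substring test. It uses Python's `in` operator instead of `np.char.count` purely for brevity, and the helper name `substring_hits` is made up for illustration.

```python
# Two of the phrases and ingredients from the question
phrases = ["6 ounces smoke-flavored almonds, finely chopped", "2 small strawberries"]
ingredients = ["almond", "strawberry"]

def substring_hits(phrase, ingredients):
    """Return the ingredients that occur verbatim inside the lowercased phrase."""
    lowered = phrase.lower()
    return [ing for ing in ingredients if ing in lowered]

results = {phrase: substring_hits(phrase, ingredients) for phrase in phrases}
for phrase, hits in results.items():
    print(phrase, "->", hits)
```

"almond" is a prefix of "almonds", so it is found. "strawberries" does not contain the string "strawberry" (the y becomes ie), so nothing is found for the second phrase.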
I want to improve my code so that it detects such cases. I would like to change my ingredients Series into a DataFrame like the following:
#ingredients
#inputwords        #outputword
vanilla extract    vanilla extract
walnut             walnut
walnuts            walnut
oat                oat
oats               oat
egg                egg
eggs               egg
almond             almond
almonds            almond
strawberry         strawberry
strawberries       strawberry
cherry             cherry
cherries           cherry
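So my logic is: whenever a word from #inputwords appears in a phrase, I want to return the corresponding #outputword in the other cell. In other words, when either "strawberry" or "strawberries" appears in the phrase, the code should put the word "strawberry" next to it. My final result would then be: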
   val                                                existence
0  1 teaspoons vanilla extract                        vanilla extract
1  2 eggs                                             egg
2  3 cups chopped walnuts                             walnut
3  4 cups rolled oats                                 oat
4  1 (10.75 ounce) can Campbell's Condensed Cream...  NaN
5  6 ounces smoke-flavored almonds, finely chopped    almond
6  sdfgsfgsf                                          NaN
7  fsfgsgsfgfg                                        NaN
8  2 small strawberries                               strawberry
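I could not find a way to incorporate this functionality into my existing code, or to write new code that does it. Can anyone help me with this?
Consider using a stemmer :) http://www.nltk.org/howto/stem.html
Taken directly from their page: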
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
>>> print(stemmer.stem("having"))
have
>>> print(stemmer2.stem("having"))
having
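Refactor your code to stem all the words in a sentence before matching them against the ingredient list.
nltk is an awesome tool for exactly what you are asking!
Cheers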
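The stem-then-match idea could be sketched roughly as below. This is an illustration under assumptions, not the answerer's code: the function names are made up, and `toy_stem` is a hypothetical stand-in that only handles "-ies" and "-s" plurals so that the example runs without nltk installed. In practice `SnowballStemmer("english").stem` from the snippet above could be passed as the `stem` argument instead.

```python
import re

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer; handles only -ies and -s plurals."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def stem_tokens(text, stem):
    """Lowercase the text, split it into alphabetic tokens, and stem each token."""
    return [stem(token) for token in re.findall(r"[a-z]+", text.lower())]

def find_ingredient(phrase, ingredients, stem=toy_stem):
    """Return the first ingredient whose stemmed tokens occur consecutively in the phrase, else None."""
    words = stem_tokens(phrase, stem)
    for ingredient in ingredients:
        target = stem_tokens(ingredient, stem)
        n = len(target)
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            return ingredient
    return None

ingredients = ["vanilla extract", "walnut", "oat", "egg", "almond", "strawberry"]
print(find_ingredient("2 small strawberries", ingredients))
print(find_ingredient("sdfgsfgsf", ingredients))
```

Both the phrase and the ingredient names go through the same stemmer, so the match does not depend on the stem being a dictionary word. With the question's DataFrame, something like `df['val'].map(lambda p: find_ingredient(p, ingredients))` could fill the existence column.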
import numpy as np
import pandas as pd
# your data frame
df = pd.DataFrame(data = ["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])
# Here you create the mapping from each input word to its output word
mapping = pd.Series(index = ['vanilla extract' , 'walnut','walnuts','oat','oats','egg','eggs','almond','almonds','strawberry','strawberries','cherry','cherries'] ,
                    data = ['vanilla extract' , 'walnut','walnut','oat','oat','egg','egg','almond','almond','strawberry','strawberry','cherry','cherry'])
# create a function that checks whether any input word occurs in a given phrase
def get_match(row):
    match = np.nan
    # iterate over (input word, output word) pairs; the last pair that matches wins
    for key, value in mapping.items():
        if key in row[0]:
            match = value
    return match
# apply this function on each row
df.apply(get_match, axis = 1)
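One caveat with the loop above is that `key in row[0]` is a plain substring test, so "oat" would also be found inside "coat". As a possible variation (a sketch only, not part of the original answer), the same mapping can be turned into a single regular expression with word boundaries. The dictionary is shortened here, and the names `pattern` and `lookup` are made up for illustration.

```python
import re

# Input word -> output word, a shortened version of the mapping above
mapping = {
    "vanilla extract": "vanilla extract",
    "egg": "egg", "eggs": "egg",
    "oat": "oat", "oats": "oat",
    "strawberry": "strawberry", "strawberries": "strawberry",
}

# Longest keys first, so a longer form is always tried before a shorter one
keys = sorted(mapping, key=len, reverse=True)
pattern = re.compile(r"\b(" + "|".join(re.escape(key) for key in keys) + r")\b")

def lookup(phrase):
    """Return the output word for the first input word found in the phrase, else None."""
    found = pattern.search(phrase.lower())
    return mapping[found.group(1)] if found else None

print(lookup("2 small strawberries"))
print(lookup("1 coat of paint"))
```

Applied to the DataFrame from the answer, this could be used as `df[0].map(lookup)`.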