
Replace whole string if it contains substring in pandas dataframe

I have a sample dataset.

import pandas as pd

raw_data = {
    'categories': ['sweet beverage', 'salty snacks', 'beverage,sweet', 'fruit juice,beverage,', 'salty crackers'],
    'product_name': ['coca-cola', 'salted pistachios', 'fruit juice', 'lemon tea', 'roasted peanuts']}
df_a = pd.DataFrame(raw_data)

I need to iterate through the rows of the 'categories' column and check whether each value contains a particular string, in this case 'beverage'. If it does, I want to update the category to just 'beverage'. This link is the closest I found on Stack Overflow, but it doesn't tell me how to go through the whole dataset.

Replace whole string if it contains substring in pandas

Here's my sample code.

for index,row in df.iterrows():
    if row.str.contains('beverage', na=False):
        df.loc[index,'categories_en'] = 'Beverages' 
    elif row.str.contains('salty',na=False):
        df.loc[index,'categories_en'] = 'Salty Snack'
    # ....<and other conditions>

How can I achieve this? Thanks all!

Create the following dicts, then use replace with regex=True followed by map:

Yourdict2={1:'Beverages',2:'salty'}
Yourdict1={'beverage':1,'salty':2}
df_a.categories.replace(Yourdict1,regex=True).map(Yourdict2)
Out[275]: 
0    Beverages
1        salty
2    Beverages
3    Beverages
4        salty
Name: categories, dtype: object
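
A note on why the answer above goes through integer codes: as far as I can tell, when regex=True and the replacement value is a string, replace substitutes only the matched part of each value, whereas a non-string replacement swaps out the whole value. Below is a rough sketch of an alternative that skips the intermediate codes by using patterns that span the entire string. The `^.*` / `.*$` wrappers and the 'Salty Snack' label are assumptions of mine, not part of the answer.

```python
import pandas as pd

categories = pd.Series(
    ['sweet beverage', 'salty snacks', 'beverage,sweet',
     'fruit juice,beverage,', 'salty crackers'],
    name='categories',
)

# A string replacement only swaps the matched substring.
partial = categories.replace({'beverage': 'Beverages'}, regex=True)

# Patterns that cover the whole value replace the entire string.
whole = categories.replace(
    {r'^.*beverage.*$': 'Beverages', r'^.*salty.*$': 'Salty Snack'},
    regex=True,
)

print(partial.tolist())
print(whole.tolist())
```

The first result still carries the surrounding words ('sweet Beverages'), which is the behaviour the integer detour avoids.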

You can use boolean indexing with str.contains:

df_a.loc[df_a.categories.str.contains('beverage'), 'categories'] = 'beverage'


    categories      product_name
0   beverage        coca-cola
1   salty snacks    salted pistachios
2   beverage        fruit juice
3   beverage        lemon tea
4   salty crackers  roasted peanuts
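
If there are several keywords, one option is to build one boolean mask per keyword and combine them with numpy.select instead of writing one .loc line per condition. This is a swapped-in technique, not what the answer above does. The sixth row and the 'categories_en' target column are assumptions for illustration; the labels are taken from the question's loop.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({
    'categories': ['sweet beverage', 'salty snacks', 'beverage,sweet',
                   'fruit juice,beverage,', 'salty crackers',
                   'fresh fruit'],  # made-up row that matches no keyword
    'product_name': ['coca-cola', 'salted pistachios', 'fruit juice',
                     'lemon tea', 'roasted peanuts', 'apple'],
})

# One mask per keyword; na=False treats missing values as "no match".
conditions = [
    df_a['categories'].str.contains('beverage', case=False, na=False),
    df_a['categories'].str.contains('salty', case=False, na=False),
]
labels = ['Beverages', 'Salty Snack']

# The first matching condition wins; unmatched rows keep their original value.
df_a['categories_en'] = np.select(conditions, labels, default=df_a['categories'])

print(df_a[['categories', 'categories_en']])
```

Because np.select picks the first condition that is true, the order of the list decides which label a row gets when it matches more than one keyword.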

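Use the __contains__() method of Python's string class:

for a in df_a["categories"]:
    # __contains__ is what the `in` operator calls, so this equals: "beverage" in a
    if a.__contains__("beverage"):
        # Assign the result back rather than relying on inplace=True on a column selection
        df_a["categories"] = df_a["categories"].replace(a, "beverage")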

Maybe you can try something like this:

def selector(x):
    if 'beverage' in x:
        return 'Beverages'
    if 'salty' in x:
        return 'Salty snack'

df_a['categories_en'] = df_a['categories'].apply(selector)
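
Note that selector above returns None for a category that matches neither keyword, and `in` raises a TypeError on a missing value such as NaN. Here is a minimal sketch of a more forgiving variant, assuming you want unmatched and missing values passed through unchanged. The names KEYWORD_TO_LABEL and selector_with_fallback are hypothetical, as is the sample data.

```python
import pandas as pd

# Hypothetical keyword-to-label table; the first matching keyword wins.
KEYWORD_TO_LABEL = {'beverage': 'Beverages', 'salty': 'Salty Snack'}

def selector_with_fallback(value):
    # Pass missing or non-string values through untouched.
    if not isinstance(value, str):
        return value
    lowered = value.lower()
    for keyword, label in KEYWORD_TO_LABEL.items():
        if keyword in lowered:
            return label
    # No keyword matched: keep the original category.
    return value

categories = pd.Series(['sweet beverage', 'Salty crackers', 'fresh fruit', None])
labelled = categories.apply(selector_with_fallback)
print(labelled.tolist())
```

Keeping the keywords in a dict means new categories can be added without touching the function body.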

Use apply to generate a new categories column, then assign it to the categories_en column of the dataframe.

def map_categories(cat: str) -> str:
    if cat.find("beverage") != -1:
        return "beverage"
    else:
        # Leave values without the substring unchanged
        return cat
new_col = df_a['categories'].apply(map_categories)
df_a['categories_en'] = new_col

Thanks for all the various solutions to my question. Based on all your inputs, I have come up with this solution, which works.

def transformCat(df):
    # Name the target column: without it, .loc overwrites every column of the matching rows
    df.loc[df.categories_en.str.lower().str.contains('beers|largers|wines|rotwein|biere',na=False), 'categories_en'] = 'Alcoholic,Beverages'
    df.loc[df.categories_en.str.lower().str.contains('cheese',na=False), 'categories_en'] = 'Dairies,Cheeses'
    df.loc[df.categories_en.str.lower().str.contains('yogurts',na=False), 'categories_en'] = 'Dairies,Yogurts'
    df.loc[df.categories_en.str.lower().str.contains(r'sauce.*ketchup|ketchup.*sauce',na=False), 'categories_en'] = 'Sauces,Ketchups'
    return df

Would appreciate any inputs. Thanks all!
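
One possible refinement of the solution above, offered as a sketch rather than a definitive version: keep the patterns in a list of (pattern, label) pairs and loop over them, matching against the original lowercased text so that a label written by an earlier rule cannot trigger a later one. The names RULES and transform_categories and the sample rows are hypothetical; the patterns and labels are copied from the post.

```python
import pandas as pd

# (regex pattern, replacement label) pairs copied from the post above.
RULES = [
    (r'beers|largers|wines|rotwein|biere', 'Alcoholic,Beverages'),
    (r'cheese', 'Dairies,Cheeses'),
    (r'yogurts', 'Dairies,Yogurts'),
    (r'sauce.*ketchup|ketchup.*sauce', 'Sauces,Ketchups'),
]

def transform_categories(df, column='categories_en'):
    # Match against the untouched, lowercased text of the input column.
    original = df[column].str.lower()
    result = df.copy()  # leave the caller's frame unmodified
    for pattern, label in RULES:
        mask = original.str.contains(pattern, na=False)
        result.loc[mask, column] = label
    return result

sample = pd.DataFrame({'categories_en': [
    'Craft Beers', 'Goat cheese', 'Greek Yogurts',
    'Tomato sauce with ketchup', 'Bread', None,
]})
out = transform_categories(sample)
print(out)
```

When a row matches more than one rule, the later rule in the list overwrites the earlier one, so the list order encodes priority.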
