
Extracting a string with the help of a function

I have a clickstream DataFrame with about 4 million rows. Two of its many columns are URL and Domain. I have a dictionary that I want to use as a condition. For example: if the Domain is equal to amazon.de and the URL contains the keyword pillow, then the new column should have the value pillow, and so on.

dictionary_keywords = {"amazon.de": "pillow", "rewe.de": "apple"}

ID   Domain                  URL
1    amazon.de               www.amazon.de/ssssssss/exapmle/pillow
2    rewe.de                 www.rewe.de/apple

The expected output should be the new column:

ID   Domain                  URL                                    New_Col
1    amazon.de               www.amazon.de/ssssssss/exapmle/pillow  pillow
2    rewe.de                 www.rewe.de/apple                      apple

I can use the .str.contains method manually, but I need to define a function that takes the dictionary key and value as the condition.

Something like this: df[(df['Domain'] == 'amazon.de') & df['URL'].str.contains('pillow')]

But I am not sure. I am new to this.

The way I prefer to solve this kind of problem is to use df.apply() row by row (axis=1) with a custom function that handles the logic.

import pandas as pd

dictionary_keywords = {"amazon.de": "Pillow", "rewe.de": "Apple"}
df = pd.DataFrame({
    'Domain':['amazon.de','rewe.de'],
    'URL':['www.amazon.de/ssssssss/exapmle/pillow', 'www.rewe.de/apple']
})

def f(row):
    # Look up the keyword for this row's Domain; None if the domain is not in the dictionary.
    keyword = dictionary_keywords.get(row['Domain'])
    url = row['URL']
    # Case-insensitive check that the keyword occurs in the URL (non-string URLs are skipped).
    if keyword is not None and isinstance(url, str) and keyword.lower() in url.lower():
        return keyword
    return None  # or False, or np.nan

df['New_Col'] = df.apply(f, axis=1)

Output:

print(df)

      Domain                                    URL New_Col
0  amazon.de  www.amazon.de/ssssssss/exapmle/pillow  Pillow
1    rewe.de                      www.rewe.de/apple   Apple
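
With about 4 million rows, a row-wise apply may be slow. As an alternative, here is a minimal sketch that loops over the dictionary instead of over the rows and builds one boolean mask per entry. The two extra sample rows (a rewe.de URL without the keyword, and ebay.de, which is not in the dictionary) are made up for illustration, and I am assuming the match should be case-insensitive and literal (no regex).

```python
import pandas as pd

dictionary_keywords = {"amazon.de": "pillow", "rewe.de": "apple"}

# Sample data; the last two rows are hypothetical non-matching cases.
df = pd.DataFrame({
    "Domain": ["amazon.de", "rewe.de", "rewe.de", "ebay.de"],
    "URL": [
        "www.amazon.de/ssssssss/exapmle/pillow",
        "www.rewe.de/apple",
        "www.rewe.de/banana",
        "www.ebay.de/pillow",
    ],
})

# Start with an all-missing object column, then fill in the matches.
df["New_Col"] = pd.Series([None] * len(df), index=df.index, dtype=object)

for domain, keyword in dictionary_keywords.items():
    # Rows where the Domain matches AND the URL contains the keyword.
    # regex=False treats the keyword as a literal string; na=False treats missing URLs as no match.
    mask = (df["Domain"] == domain) & df["URL"].str.contains(
        keyword, case=False, regex=False, na=False
    )
    df.loc[mask, "New_Col"] = keyword

print(df)
```

This does one vectorized pass over the DataFrame per dictionary entry, so it should scale with the size of the dictionary rather than with a Python-level call per row. Rows whose Domain is not in the dictionary, or whose URL lacks the keyword, keep a missing value in New_Col.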


 