I have a clickstream data frame with about 4 million rows. It has many columns, two of which are URL and Domain. I have a dictionary that I want to use as a condition. For example: if the Domain is equal to amazon.de and the URL contains the keyword pillow, then the new column should have the value pillow. And so on.
dictionary_keywords = {"amazon.de": "pillow", "rewe.de": "apple"}
ID  Domain     URL
1   amazon.de  www.amazon.de/ssssssss/exapmle/pillow
2   rewe.de    www.rewe.de/apple
The expected output should be the new column:
ID  Domain     URL                                    New_Col
1   amazon.de  www.amazon.de/ssssssss/exapmle/pillow  pillow
2   rewe.de    www.rewe.de/apple                      apple
I can manually use the .str.contains method, but I need to define a function that takes the dictionary key and value as the condition.
Something like this: df[(df['Domain'] == 'amazon.de') & (df['URL'].str.contains('pillow'))]
But I am not sure. I am new to this.
The way I prefer to solve this kind of problem is by using df.apply() by row (axis=1) with a custom function to deal with the logic.
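import pandas as pd

dictionary_keywords = {"amazon.de": "Pillow", "rewe.de": "Apple"}

df = pd.DataFrame({
    'Domain': ['amazon.de', 'rewe.de'],
    'URL': ['www.amazon.de/ssssssss/exapmle/pillow', 'www.rewe.de/apple']
})

def f(row):
    # dictionary_keywords is only read here, so no global statement is needed
    try:
        url = row['URL'].lower()
        # Take the host part of the URL and drop a leading "www." prefix.
        # str.strip('www.') would remove any leading/trailing 'w' and '.'
        # characters, not the prefix, so it would mangle a host like "web.de".
        domain = url.split('/')[0]
        if domain.startswith('www.'):
            domain = domain[len('www.'):]
        # Case-insensitive keyword check against the whole URL
        if dictionary_keywords[domain].lower() in url:
            return dictionary_keywords[domain]
    except Exception as e:
        # Domains missing from the dictionary (KeyError) or non-string URLs end up here
        print(row.name, e)
    return None  # or False, or np.nan

df['New_Col'] = df.apply(f, axis=1)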
Output:
print(df)
      Domain                                    URL New_Col
0  amazon.de  www.amazon.de/ssssssss/exapmle/pillow  Pillow
1    rewe.de                      www.rewe.de/apple   Apple
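
With about 4 million rows, a row-wise apply may turn out to be slow, so a vectorized variant could be worth trying. The following is only a sketch under some assumptions: the Domain column already holds the bare domain (as in the question's table), the keywords are plain substrings rather than regular expressions, and each row matches at most one dictionary entry. The helper name add_keyword_column and the third sample row are made up for illustration. It loops over the dictionary once per (domain, keyword) pair and fills the new column through a boolean mask, which is close to the condition sketched in the question.

```python
import pandas as pd

dictionary_keywords = {"amazon.de": "pillow", "rewe.de": "apple"}

df = pd.DataFrame({
    "Domain": ["amazon.de", "rewe.de", "amazon.de"],
    "URL": [
        "www.amazon.de/ssssssss/exapmle/pillow",
        "www.rewe.de/apple",
        "www.amazon.de/books",  # hypothetical row with no keyword match
    ],
})

def add_keyword_column(frame, keywords, new_col="New_Col"):
    """Return a copy of frame with new_col set for each matching (domain, keyword) pair."""
    out = frame.copy()
    out[new_col] = None  # rows that match nothing keep None
    url_lower = out["URL"].str.lower()  # lower-case once, reuse for every keyword
    for domain, keyword in keywords.items():
        # Both conditions must hold: exact domain match and keyword inside the URL.
        mask = (out["Domain"] == domain) & url_lower.str.contains(
            keyword.lower(), regex=False, na=False
        )
        out.loc[mask, new_col] = keyword
    return out

result = add_keyword_column(df, dictionary_keywords)
print(result)
```

The loop runs once per dictionary entry rather than once per row, so the work per iteration is done by pandas on whole columns. If a domain should map to several keywords, the dictionary values would need to become lists and the loop adjusted accordingly.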