I'm using Python 2.7 and want to create a new column based on whether each cell contains one of the values from a list.
Here's an example of data:
| query |
-----------------
| handbag woman |
| shoe man |
| t-shirt baby |
| watch unisex |
| dress |
I have a list of values that I want to check:
gender_list=['woman', 'man', 'baby', 'unisex']
The result that I expect:
| query | gender
-----------------------
| handbag | woman
| shoe | man
| t-shirt | baby
| watch | unisex
| dress | None
Here's what I have already tried:
for gender in gender_list:
    df['gender']=df['query'].map(lambda x : gender if (x.find(gender) != -1) else None)
    df['query']=df['query'].map(lambda x : x.replace(gender, '').strip() if (x.find(gender) != -1) else x)
First, in pandas it is best to avoid loops, because they are slow (apply is also a loop under the hood), and to prefer vectorized solutions.
Use str.extract and str.replace with a regex built by joining all values with |, and add word boundaries (\b) for an exact match:
gender_list=['woman', 'man', 'baby', 'unisex']
# use this pattern instead if an exact match is not important
#pat = '|'.join(gender_list)
pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)
print (pat)
\bwoman\b|\bman\b|\bbaby\b|\bunisex\b
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '').str.strip()
print (df)
query gender
0 handbag woman
1 shoe man
2 t-shirt baby
3 watch unisex
4 dress NaN
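To see what the two pandas calls do to each cell, here is a minimal per-string sketch using only the standard `re` module. The helper name `split_gender` is hypothetical and not part of the answer; `re.search` stands in for `str.extract` and `re.sub` for `str.replace`. On recent pandas versions `str.replace` may need `regex=True` for the pattern to be treated as a regex, so check the documentation for your version.

```python
import re

gender_list = ['woman', 'man', 'baby', 'unisex']
# Same pattern as above: each value wrapped in word boundaries, joined by |
pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)

def split_gender(query):
    """Return (query without the gender word, gender or None)."""
    match = re.search(pat, query)
    gender = match.group(0) if match else None
    cleaned = re.sub(pat, '', query).strip()
    return cleaned, gender

for q in ['handbag woman', 'shoe man', 'dress']:
    print(split_gender(q))
```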
Differences between the two patterns:
print (df)
query
0 handbag woman
1 shoe many <- man changed to many
2 t-shirt baby
3 watch unisex
4 dress
gender_list=['woman', 'man', 'baby', 'unisex']
pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '').str.strip()
print (df)
query gender
0 handbag woman
1 shoe many NaN <- many is not extracted
2 t-shirt baby
3 watch unisex
4 dress NaN
gender_list=['woman', 'man', 'baby', 'unisex']
pat = '|'.join(gender_list)
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '').str.strip()
print (df)
query gender
0 handbag woman
1 shoe y man <- y is left over from many
2 t-shirt baby
3 watch unisex
4 dress NaN
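The same difference can be reproduced on a single string with the standard `re` module. This is only a sketch of the per-cell behaviour, not pandas itself, and the variable names are illustrative.

```python
import re

gender_list = ['woman', 'man', 'baby', 'unisex']
pat_exact = '|'.join(r"\b{}\b".format(x) for x in gender_list)
pat_loose = '|'.join(gender_list)

query = 'shoe many'

# With word boundaries, 'man' inside 'many' is not a match
exact_match = re.search(pat_exact, query)
exact_cleaned = re.sub(pat_exact, '', query).strip()

# Without word boundaries, 'man' is cut out of 'many' and 'y' is left behind
loose_match = re.search(pat_loose, query)
loose_cleaned = re.sub(pat_loose, '', query).strip()

print(exact_match, repr(exact_cleaned))
print(loose_match.group(0), repr(loose_cleaned))
```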
Timings:
df = pd.DataFrame({'query': ['handbag woman', 'shoe man', 't-shirt baby', 'watch unisex', 'dress', 'manpower']})
print (df)
df = pd.concat([df] * 10000, ignore_index=True)
In [299]: %%timeit
...: pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)
...: df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
...: df['query'] = df['query'].str.replace(pat, '').str.strip()
...:
...:
1 loop, best of 3: 143 ms per loop
In [300]: %%timeit
...: gender_set = set(gender_list)
...:
...: def gender_sep(row):
...:     lst = row['query'].split(' ')
...:     gender = next(iter(gender_set & set(lst)), None)
...:     return (' '.join(lst), None) if not gender else \
...:            (' '.join(i for i in lst if i != gender), gender)
...:
...: df['query'], df['gender'] = list(zip(*df.apply(gender_sep, axis=1)))
...:
1 loop, best of 3: 933 ms per loop
EDIT:
For a more general solution, the values need to be escaped with re.escape so that regex special characters are matched literally:
import re
gender_list=['woman', 'man', 'baby', 'girl & boy']
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in gender_list)
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '').str.strip()
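As a small illustration of why escaping matters, the sketch below uses a made-up value, `mum+baby`, that contains the regex metacharacter `+`. The value and variable names are assumptions for the example only and do not come from the question.

```python
import re

# 'mum+baby' is a made-up value containing the regex metacharacter '+'
value = 'mum+baby'
query = 'bag mum+baby'

raw_pat = r"\b{}\b".format(value)
safe_pat = r"\b{}\b".format(re.escape(value))

raw_match = re.search(raw_pat, query)    # '+' is read as "one or more m"
safe_match = re.search(safe_pat, query)  # '+' is matched literally

print(raw_match)
print(safe_match.group(0))
```

Note that `\b` only matches next to a word character, so a value that starts or ends with punctuation would need a different boundary check.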
This is one way. It's not the most efficient, but it is readable and easy to adapt / maintain.
import pandas as pd
df = pd.DataFrame({'query': ['handbag woman', 'shoe man', 't-shirt baby', 'watch unisex', 'dress', 'manpower']})
gender_list = ['woman', 'man', 'baby', 'unisex']
gender_set = set(gender_list)
def gender_sep(row):
    lst = row['query'].split(' ')
    gender = next(iter(gender_set & set(lst)), None)
    return (' '.join(lst), None) if not gender else \
           (' '.join(i for i in lst if i != gender), gender)

df['query'], df['gender'] = list(zip(*df.apply(gender_sep, axis=1)))
# query gender
# 0 handbag woman
# 1 shoe man
# 2 t-shirt baby
# 3 watch unisex
# 4 dress None
# 5 manpower None
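If several gender words can appear in one query, the set intersection above picks one of them in no guaranteed order. A possible variation, sketched here as a plain function (the name `split_first_gender` is hypothetical), keeps the first matching token instead. It could be applied with a list comprehension over `df['query']`.

```python
gender_set = {'woman', 'man', 'baby', 'unisex'}

def split_first_gender(query):
    """Return (query without the first gender token, that token or None)."""
    tokens = query.split(' ')
    gender = next((t for t in tokens if t in gender_set), None)
    if gender is None:
        return query, None
    tokens.remove(gender)  # drops only the first occurrence
    return ' '.join(tokens), gender

print(split_first_gender('watch unisex'))
print(split_first_gender('manpower'))
```

Unlike `gender_sep`, this sketch removes only the first occurrence of the matched word, which may or may not be what you want.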