Pandas: checking if a value exists in each cell of a DataFrame
I am using Python 2.7 and want to create a column based on whether any value from a list appears in each cell.
Here is a sample of the data:
| query |
-----------------
| handbag woman |
| shoe man |
| t-shirt baby |
| watch unisex |
| dress |
I have a list of values to check for:
gender_list=['woman', 'man', 'baby', 'unisex']
The result I expect:
| query | gender
-----------------------
| handbag | woman
| shoe | man
| t-shirt | baby
| watch | unisex
| dress | None
This is what I have tried so far:
for gender in gender_list:
    df['gender'] = df['query'].map(lambda x: gender if (x.find(gender) != -1) else None)
    df['query'] = df['query'].map(lambda x: x.replace(gender, '').strip() if (x.find(gender) != -1) else x)
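First, in pandas it is best to avoid loops because they are slow (apply also loops under the hood); use a vectorized solution instead.
Use extract and replace with a regex: join all values with | and wrap each one in word boundaries (\b) for an exact match: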
gender_list=['woman', 'man', 'baby', 'unisex']
#exact match is not important
#pat = '|'.join(gender_list)
pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)
print (pat)
\bwoman\b|\bman\b|\bbaby\b|\bunisex\b
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()
print (df)
     query  gender
0  handbag   woman
1     shoe     man
2  t-shirt    baby
3    watch  unisex
4    dress     NaN
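For convenience, here is a self-contained sketch of the same approach that also builds the sample DataFrame from the question. It passes regex=True explicitly, on the assumption that it may run on pandas 2.0 or later, where Series.str.replace treats the pattern as a literal string by default.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'query': ['handbag woman', 'shoe man', 't-shirt baby',
                             'watch unisex', 'dress']})
gender_list = ['woman', 'man', 'baby', 'unisex']

# One alternation, with a word boundary on each side of every value.
pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)

# Extract the first matching value, then remove it from the query text.
df['gender'] = df['query'].str.extract('(' + pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()

print(df)
```

Rows without a match get a missing value in gender and keep their query text unchanged.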
The difference between the two patterns shows up on data like this:
print (df)
           query
0  handbag woman
1      shoe many    <- man changed to many
2   t-shirt baby
3   watch unisex
4          dress
gender_list=['woman', 'man', 'baby', 'unisex']
pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()
print (df)
       query  gender
0    handbag   woman
1  shoe many     NaN    <- many is not extracted
2    t-shirt    baby
3      watch  unisex
4      dress     NaN
gender_list=['woman', 'man', 'baby', 'unisex']
pat = '|'.join(gender_list)
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()
print (df)
     query  gender
0  handbag   woman
1   shoe y     man    <- y is left over from many
2  t-shirt    baby
3    watch  unisex
4    dress     NaN
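The effect of the word boundaries can also be checked without pandas. The sketch below uses only the standard re module; the names exact and loose are illustrative and do not come from the answer.

```python
import re

gender_list = ['woman', 'man', 'baby', 'unisex']

# Pattern with word boundaries: matches whole words only.
exact = re.compile('|'.join(r"\b{}\b".format(x) for x in gender_list))
# Plain alternation: also matches inside longer words.
loose = re.compile('|'.join(gender_list))

for text in ['shoe man', 'shoe many', 'manpower']:
    e = exact.search(text)
    l = loose.search(text)
    print(text, '| exact:', e.group(0) if e else None,
          '| loose:', l.group(0) if l else None)
```

With the plain alternation, man is found inside many and manpower; with word boundaries it is found only when it stands alone.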
Timings:
df = pd.DataFrame({'query': ['handbag woman', 'shoe man', 't-shirt baby', 'watch unisex', 'dress', 'manpower']})
print (df)
df = pd.concat([df] * 10000, ignore_index=True)
In [299]: %%timeit
     ...: pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)
     ...: df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
     ...: df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()
     ...:
1 loop, best of 3: 143 ms per loop
In [300]: %%timeit
     ...: gender_set = set(gender_list)
     ...:
     ...: def gender_sep(row):
     ...:     lst = row['query'].split(' ')
     ...:     gender = next(iter(gender_set & set(lst)), None)
     ...:     return (' '.join(lst), None) if not gender else \
     ...:         (' '.join(i for i in lst if i != gender), gender)
     ...:
     ...: df['query'], df['gender'] = list(zip(*df.apply(gender_sep, axis=1)))
     ...:
1 loop, best of 3: 933 ms per loop
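The same comparison can be reproduced outside IPython with the standard timeit module. This is only a rough sketch: the wrapper functions vectorized and row_wise are hypothetical names, the frame is smaller than in the timings above, and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit
import pandas as pd

gender_list = ['woman', 'man', 'baby', 'unisex']
base = pd.DataFrame({'query': ['handbag woman', 'shoe man', 't-shirt baby',
                               'watch unisex', 'dress', 'manpower']})
big = pd.concat([base] * 200, ignore_index=True)

def vectorized(df):
    # Regex-based approach: extract, then replace.
    df = df.copy()
    pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)
    df['gender'] = df['query'].str.extract('(' + pat + ')', expand=False)
    df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()
    return df

def row_wise(df):
    # Row-by-row approach using apply and set intersection.
    df = df.copy()
    gender_set = set(gender_list)

    def gender_sep(row):
        lst = row['query'].split(' ')
        gender = next(iter(gender_set & set(lst)), None)
        if not gender:
            return ' '.join(lst), None
        return ' '.join(i for i in lst if i != gender), gender

    df['query'], df['gender'] = list(zip(*df.apply(gender_sep, axis=1)))
    return df

# Each function works on its own copy, so repeated runs see the same input.
t_vec = timeit.timeit(lambda: vectorized(big), number=3)
t_row = timeit.timeit(lambda: row_wise(big), number=3)
print('vectorized: {:.4f}s, row-wise: {:.4f}s'.format(t_vec, t_row))
```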
Edit:
For a more general solution, the values need to be escaped with re.escape so that regex metacharacters in them are matched literally:
import re
gender_list=['woman', 'man', 'baby', 'girl & boy']
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in gender_list)
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()
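To see why escaping matters, consider a value that contains a regex metacharacter. The value 'girl + boy' and the sample text below are hypothetical, chosen only for illustration.

```python
import re

value = 'girl + boy'          # hypothetical value containing '+'
text = 'socks girl + boy'     # hypothetical query text

# Without escaping, ' +' is read as "one or more spaces".
raw = re.compile(r"\b{}\b".format(value))
# With re.escape, every character is matched literally.
escaped = re.compile(r"\b{}\b".format(re.escape(value)))

print(raw.search(text))       # no match
print(escaped.search(text))   # matches the literal value
```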
Here is one way. It is not the most efficient, but it is readable and easy to adapt and maintain.
import pandas as pd
df = pd.DataFrame({'query': ['handbag woman', 'shoe man', 't-shirt baby', 'watch unisex', 'dress', 'manpower']})
gender_list = ['woman', 'man', 'baby', 'unisex']
gender_set = set(gender_list)
def gender_sep(row):
    lst = row['query'].split(' ')
    gender = next(iter(gender_set & set(lst)), None)
    return (' '.join(lst), None) if not gender else \
        (' '.join(i for i in lst if i != gender), gender)
df['query'], df['gender'] = list(zip(*df.apply(gender_sep, axis=1)))
# query gender
# 0 handbag woman
# 1 shoe man
# 2 t-shirt baby
# 3 watch unisex
# 4 dress None
# 5 manpower None