Pandas: checking if a value exists in each cell of a DataFrame
I'm using Python 2.7 and want to create a column based on whether any value from a list appears in each cell.
Here's an example of the data:
| query |
-----------------
| handbag woman |
| shoe man |
| t-shirt baby |
| watch unisex |
| dress |
I have a list of values that I want to check:
gender_list=['woman', 'man', 'baby', 'unisex']
The result that I expect:
| query | gender
-----------------------
| handbag | woman
| shoe | man
| t-shirt | baby
| watch | unisex
| dress | None
Here's what I have already tried:
for gender in gender_list:
    df['gender'] = df['query'].map(lambda x: gender if (x.find(gender) != -1) else None)
    df['query'] = df['query'].map(lambda x: x.replace(gender, '').strip() if (x.find(gender) != -1) else x)
First, in pandas it is best to avoid loops because they are slow (apply is also a loop under the hood); prefer vectorized solutions.
Use extract and replace with a regex that joins all values with |, and add word boundaries (\b) for an exact match:
gender_list=['woman', 'man', 'baby', 'unisex']
# if an exact match is not important, the plain join is enough:
# pat = '|'.join(gender_list)
pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)
print (pat)
\bwoman\b|\bman\b|\bbaby\b|\bunisex\b
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()
print (df)
query gender
0 handbag woman
1 shoe man
2 t-shirt baby
3 watch unisex
4 dress NaN
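For readers on a recent pandas release, the steps above can be put together as a self-contained sketch. It assumes pandas 2.x behaviour, where Series.str.replace treats the pattern as a literal string unless regex=True is passed. The single non-capturing group wrapped in one pair of word boundaries is a small variation on the pattern above, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'query': ['handbag woman', 'shoe man', 't-shirt baby',
                             'watch unisex', 'dress']})
gender_list = ['woman', 'man', 'baby', 'unisex']

# One alternation, wrapped once in word boundaries (variation on the answer's pattern).
pat = r'\b(?:{})\b'.format('|'.join(gender_list))

# The capture group tells extract what to return; rows without a match get NaN.
df['gender'] = df['query'].str.extract('(' + pat + ')', expand=False)

# regex=True is needed on pandas 2.x, where the default is a literal replacement.
df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()

print(df)
```

The final str.strip() only removes the leading or trailing space left behind by the removed word.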
Differences between the two patterns when a row contains many instead of man:
print (df)
query
0 handbag woman
1 shoe many <- man changed to many
2 t-shirt baby
3 watch unisex
4 dress
gender_list=['woman', 'man', 'baby', 'unisex']
pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()
print (df)
query gender
0 handbag woman
1 shoe many NaN <- many is not extracted
2 t-shirt baby
3 watch unisex
4 dress NaN
gender_list=['woman', 'man', 'baby', 'unisex']
pat = '|'.join(gender_list)
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()
print (df)
query gender
0 handbag woman
1 shoe y man <- the y from many is left behind
2 t-shirt baby
3 watch unisex
4 dress NaN
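The same difference can be checked without pandas. The following is a minimal sketch using only the standard re module; the helper name first_match is hypothetical and introduced here for illustration.

```python
import re

gender_list = ['woman', 'man', 'baby', 'unisex']

# Plain alternation: matches the words anywhere, even inside longer words.
plain = re.compile('|'.join(gender_list))
# Each word wrapped in \b: matches only whole words.
bounded = re.compile('|'.join(r'\b{}\b'.format(x) for x in gender_list))

def first_match(pattern, text):
    """Return the first matched substring, or None (hypothetical helper)."""
    m = pattern.search(text)
    return m.group(0) if m else None

for text in ['shoe man', 'shoe many', 'manpower']:
    print(text, '|', first_match(plain, text), '|', first_match(bounded, text))
```

With the plain pattern, man is found inside many and manpower; with word boundaries it is found only when it stands alone.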
Timings:
df = pd.DataFrame({'query': ['handbag woman', 'shoe man', 't-shirt baby', 'watch unisex', 'dress', 'manpower']})
print (df)
df = pd.concat([df] * 10000, ignore_index=True)
In [299]: %%timeit
...: pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)
...: df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
...: df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()
...:
...:
1 loop, best of 3: 143 ms per loop
In [300]: %%timeit
...: gender_set = set(gender_list)
...:
...: def gender_sep(row):
...:     lst = row['query'].split(' ')
...:     gender = next(iter(gender_set & set(lst)), None)
...:     return (' '.join(lst), None) if not gender else \
...:            (' '.join(i for i in lst if i != gender), gender)
...:
...: df['query'], df['gender'] = list(zip(*df.apply(gender_sep, axis=1)))
...:
1 loop, best of 3: 933 ms per loop
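Outside IPython, a comparable measurement can be sketched with the standard timeit module. This is only a rough sketch: the wrapper names with_regex and with_apply are hypothetical, the frame is replicated 1000 times rather than 10000 to keep the run short, and the numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

gender_list = ['woman', 'man', 'baby', 'unisex']
gender_set = set(gender_list)
pat = '|'.join(r'\b{}\b'.format(x) for x in gender_list)

base = pd.DataFrame({'query': ['handbag woman', 'shoe man', 't-shirt baby',
                               'watch unisex', 'dress', 'manpower']})
big = pd.concat([base] * 1000, ignore_index=True)  # assumption: smaller than above

def with_regex(df):
    # Vectorized string methods; works on a copy so each run starts fresh.
    out = df.copy()
    out['gender'] = out['query'].str.extract('(' + pat + ')', expand=False)
    out['query'] = out['query'].str.replace(pat, '', regex=True).str.strip()
    return out

def gender_sep(row):
    lst = row['query'].split(' ')
    gender = next(iter(gender_set & set(lst)), None)
    if not gender:
        return ' '.join(lst), None
    return ' '.join(i for i in lst if i != gender), gender

def with_apply(df):
    # Row-wise apply, then unzip the (query, gender) tuples into two columns.
    out = df.copy()
    queries, genders = zip(*out.apply(gender_sep, axis=1))
    out['query'] = list(queries)
    out['gender'] = list(genders)
    return out

t_regex = min(timeit.repeat(lambda: with_regex(big), number=1, repeat=3))
t_apply = min(timeit.repeat(lambda: with_apply(big), number=1, repeat=3))
print('regex: {:.1f} ms, apply: {:.1f} ms'.format(t_regex * 1e3, t_apply * 1e3))
```

Taking the minimum of several repeats mirrors the "best of 3" figure that %%timeit reports.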
EDIT:
For a more general solution, escape the list values with re.escape so that regex metacharacters in them are matched literally:
import re
gender_list=['woman', 'man', 'baby', 'girl & boy']
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in gender_list)
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()
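To see why escaping matters, here is a small sketch with the standard re module. The entry 'wo.man' is a hypothetical list value, chosen only because it contains the metacharacter '.'; it does not come from the question.

```python
import re

# 'wo.man' is a hypothetical entry used only to show the effect of '.'
terms = ['girl & boy', 'wo.man']

raw = re.compile('|'.join(r'\b{}\b'.format(t) for t in terms))
escaped = re.compile('|'.join(r'\b{}\b'.format(re.escape(t)) for t in terms))

# Unescaped, '.' matches any character, so 'wo-man' is accepted by mistake.
print(bool(raw.search('shirt wo-man')))
# Escaped, '.' matches only a literal dot, so 'wo-man' is rejected.
print(bool(escaped.search('shirt wo-man')))
# Escaping does not stop ordinary values from matching.
print(escaped.search('toy girl & boy').group(0))
```

Note that \b only behaves as expected when a value starts and ends with a word character, which is the case for every value used here.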
This is one way. It's not the most efficient, but it is readable and easy to adapt and maintain.
import pandas as pd
df = pd.DataFrame({'query': ['handbag woman', 'shoe man', 't-shirt baby', 'watch unisex', 'dress', 'manpower']})
gender_list = ['woman', 'man', 'baby', 'unisex']
gender_set = set(gender_list)
def gender_sep(row):
    lst = row['query'].split(' ')
    gender = next(iter(gender_set & set(lst)), None)
    return (' '.join(lst), None) if not gender else \
           (' '.join(i for i in lst if i != gender), gender)
df['query'], df['gender'] = list(zip(*df.apply(gender_sep, axis=1)))
# query gender
# 0 handbag woman
# 1 shoe man
# 2 t-shirt baby
# 3 watch unisex
# 4 dress None
# 5 manpower None