I would like to search for substrings in each row of a dataframe column. I read somewhere that searching is faster if the column is converted into sets. I tried the approaches suggested in "How to convert list into set in pandas?", but I get unexpected output. My dataframe looks like this:
    R_id  Badges
0  7LBCS  New Reviewer - 1 Review
1  8FKME  New Reviewer - 1 Review; New Photographer - 1 Photo; Reviewer - 3 Reviews;
When I use the following approaches:
df['Badges'] = df.apply(lambda row: set(row['Badges']), axis=1)
OR
df['Badges'] = df['Badges'].apply(set)
the output that I get for each row in the dataframe above is a set with unique characters of the string in the row. I am not able to replicate the exact output, because for some reason, as soon as the output is generated, the Spyder IDE crashes. But the output for the first row looks something like:
{'1', '-', 'N', 'e', 'w', 'R', 'v', 'i', 'r'}
What could be going wrong here in the conversion to sets?
You have to split each string into a list before you call set, otherwise set iterates over the individual characters:
In [11]: df.Badges.str.split(r'\s*;\s*').apply(set)
Out[11]:
0 {New Reviewer - 1 Review}
1 {Reviewer - 3 Reviews, , New Photographer - 1 ...
Name: Badges, dtype: object
To throw away the empty strings I might tweak it as follows:
In [12]: df.Badges.str.split(r'\s*;\s*').apply(lambda bs: set(b for b in bs if b))
Out[12]:
0 {New Reviewer - 1 Review}
1 {Reviewer - 3 Reviews, New Photographer - 1 Ph...
Name: Badges, dtype: object
or you could strip the ';' first (if the trailing ';' is the only source of empty strings):
In [13]: df.Badges.str.strip(';').str.split(r'\s*;\s*').apply(set)
Out[13]:
0 {New Reviewer - 1 Review}
1 {Reviewer - 3 Reviews, New Photographer - 1 Ph...
Name: Badges, dtype: object
The latter might be slightly more efficient...
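For reference, here is a minimal self-contained sketch of the difference between calling set on the raw string and on the split list. It assumes a two-row frame rebuilt from the question; the names `char_set`, `badge_sets` and `has_reviewer` are illustrative only and do not come from the original posts.

```python
import pandas as pd

# Sample frame rebuilt from the question (assumed to match the real data).
df = pd.DataFrame({
    "R_id": ["7LBCS", "8FKME"],
    "Badges": [
        "New Reviewer - 1 Review",
        "New Reviewer - 1 Review; New Photographer - 1 Photo; Reviewer - 3 Reviews;",
    ],
})

# set() on a string iterates character by character,
# which is what produced the unexpected output in the question.
char_set = set(df.loc[0, "Badges"])

# Strip the trailing ';', split on ';' with optional surrounding whitespace,
# then build one set of whole badge strings per row.
badge_sets = (
    df["Badges"]
    .str.strip(";")
    .str.split(r"\s*;\s*")
    .apply(set)
)

# Exact-match membership test per row (hypothetical lookup value).
has_reviewer = badge_sets.apply(lambda s: "Reviewer - 3 Reviews" in s)
print(has_reviewer.tolist())
```

Note that a set lookup only matches a whole badge string. If the goal is a true substring search, each element of the set would still have to be scanned, so the set mainly helps when the exact badge text is known.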
Your data is not in a format that makes it easy to work with. I'd recommend an extension of Andy's code that results in each entry getting its own row, so you can then filter your data much more efficiently.
Start with str.split, and then extract key-value pairs using str.extract.
df = df.set_index('R_id')\
       .Badges.str.split(r'\s*;\s*', expand=True)\
       .stack().reset_index(level=1, drop=True)\
       .str.extract(r'(?P<Name>[^-]+?)\s*-\s*(?P<Val>\d+)', expand=True)\
       .dropna()
print(df)
                   Name Val
R_id
7LBCS      New Reviewer   1
8FKME      New Reviewer   1
8FKME  New Photographer   1
8FKME          Reviewer   3
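Once each badge has its own row, filtering becomes ordinary boolean indexing. The sketch below rebuilds the printed frame by hand and assumes the names carry no surrounding whitespace; `long_df`, `reviewers` and `ids` are hypothetical names used only for illustration.

```python
import pandas as pd

# Long-format frame as printed above, rebuilt by hand for the example.
long_df = pd.DataFrame(
    {
        "Name": ["New Reviewer", "New Reviewer", "New Photographer", "Reviewer"],
        "Val": ["1", "1", "1", "3"],
    },
    index=pd.Index(["7LBCS", "8FKME", "8FKME", "8FKME"], name="R_id"),
)

# str.extract returns strings, so convert the counts before comparing numerically.
long_df["Val"] = long_df["Val"].astype(int)

# Rows whose badge name contains a literal substring (vectorised).
reviewers = long_df[long_df["Name"].str.contains("Reviewer", regex=False)]

# IDs that hold an exact badge name with at least a given count.
mask = (long_df["Name"] == "Reviewer") & (long_df["Val"] >= 3)
ids = long_df[mask].index.unique().tolist()
print(ids)
```

Here str.contains covers the substring case, while the equality test covers exact badge names.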