
Conversion of list to sets in pandas dataframe

I would like to search for substrings in each row of a dataframe column. I read somewhere that searching is faster if the column is converted into sets. I am trying the approaches suggested in "How to convert list into set in pandas?", but I get unexpected output. My dataframe looks like this:

      R_id        Badges
0    7LBCS        New Reviewer - 1 Review
1    8FKME        New Reviewer - 1 Review; New Photographer - 1 Photo; Reviewer - 3 Reviews;

When I use the following approaches:

df['Badges'] = df.apply(lambda row: set(row['Badges']), axis=1)

OR

df['Badges'] = df['Badges'].apply(set)

the output that I get for each row in the dataframe above is a set with unique characters of the string in the row. I am not able to replicate the exact output, because for some reason, as soon as the output is generated, the Spyder IDE crashes. But the output for the first row looks something like:

{'1', '-', 'N', 'e', 'w', 'R', 'v', 'i', 'r'}

What could be going wrong here in the conversion to sets?

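Calling set on a string iterates over its characters, which is why you get a set of unique characters. You have to split each string into a list of badges before you use set: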

In [11]: df.Badges.str.split(r'\s*;\s*').apply(set)
Out[11]:
0                            {New Reviewer - 1 Review}
1    {Reviewer - 3 Reviews, , New Photographer - 1 ...
Name: Badges, dtype: object
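To make the difference concrete, here is a minimal sketch that rebuilds the two sample rows from the question and compares set on the raw string with set on the split list. The variable names chars and badges are only for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "R_id": ["7LBCS", "8FKME"],
    "Badges": [
        "New Reviewer - 1 Review",
        "New Reviewer - 1 Review; New Photographer - 1 Photo; Reviewer - 3 Reviews;",
    ],
})

# set() iterates over its argument; iterating a str yields single characters.
chars = set(df.loc[0, "Badges"])

# Splitting first gives a list of badge strings, so set() keeps whole badges.
# The trailing ';' in the second row still produces one empty string.
badges = df["Badges"].str.split(r"\s*;\s*").apply(set)

print(chars)
print(badges.tolist())
```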

To throw away the empties I might tweak it as follows:

In [12]: df.Badges.str.split(r'\s*;\s*').apply(lambda bs: set(b for b in bs if b))
Out[12]:
0                            {New Reviewer - 1 Review}
1    {Reviewer - 3 Reviews, New Photographer - 1 Ph...
Name: Badges, dtype: object

or you could strip the ';' first (if that's the only place empty comes from):

In [13]: df.Badges.str.strip(';').str.split(r'\s*;\s*').apply(set)
Out[13]:
0                            {New Reviewer - 1 Review}
1    {Reviewer - 3 Reviews, New Photographer - 1 Ph...
Name: Badges, dtype: object

The latter might be slightly more efficient, since it avoids running a Python-level filter over every row.
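Once each row holds a set, searching it is a membership test. Keep in mind that a set only answers "is this exact badge present?"; it does not do substring matching, which still has to run on the original strings. A rough sketch, assuming the sample data from the question (wanted is a hypothetical search term):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "R_id": ["7LBCS", "8FKME"],
    "Badges": [
        "New Reviewer - 1 Review",
        "New Reviewer - 1 Review; New Photographer - 1 Photo; Reviewer - 3 Reviews;",
    ],
})

# Build the per-row sets with the strip-then-split variant.
badge_sets = df["Badges"].str.strip(";").str.split(r"\s*;\s*").apply(set)

wanted = "Reviewer - 3 Reviews"  # hypothetical search term

# Exact membership: True only where a whole badge equals `wanted`.
has_badge = badge_sets.apply(lambda s: wanted in s)

# Substring search works on the original strings and needs no sets.
mentions_photo = df["Badges"].str.contains("Photo", regex=False)

print(df.loc[has_badge, "R_id"].tolist())
print(df.loc[mentions_photo, "R_id"].tolist())
```

Note that "New Reviewer - 1 Review" contains the text "Reviewer" but is not matched by the membership test, because the set compares whole badges.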

Your data is not in a format that makes it easy to work with. I'd recommend an extension of Andy's code that results in each entry getting its own row, so you can then filter your data much more efficiently.

Start with str.split, and then extract key-value pairs using str.extract.

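# drop=True discards the split-column level; the pattern anchors on the hyphen
# so that multi-digit counts such as "10" are captured whole.
df = df.set_index('R_id')\
       .Badges.str.split(r'\s*;\s*', expand=True)\
       .stack().reset_index(level=1, drop=True)\
       .str.extract(r'(?P<Name>[^-]+?)\s*-\s*(?P<Val>\d+)', expand=True)\
       .dropna()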

print(df)
                    Name Val
R_id                        
7LBCS      New Reviewer    1
8FKME      New Reviewer    1
8FKME  New Photographer    1
8FKME          Reviewer    3
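With one badge per row, filtering becomes an ordinary boolean mask. The sketch below is self-contained and rebuilds the long table from the sample data; the names long_df and reviewers are illustrative, and the pattern anchored on the hyphen is an assumption about the badge format ("Name - count ..."). Val is extracted as text, so it is converted to int before any numeric comparison.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "R_id": ["7LBCS", "8FKME"],
    "Badges": [
        "New Reviewer - 1 Review",
        "New Reviewer - 1 Review; New Photographer - 1 Photo; Reviewer - 3 Reviews;",
    ],
})

long_df = (
    df.set_index("R_id")["Badges"]
      .str.split(r"\s*;\s*", expand=True)   # one column per badge
      .stack()                              # one row per badge
      .reset_index(level=1, drop=True)      # keep only R_id as the index
      .str.extract(r"(?P<Name>[^-]+?)\s*-\s*(?P<Val>\d+)", expand=True)
      .dropna()                             # drops the empty trailing entries
)

# str.extract returns strings; convert the count for numeric filtering.
long_df["Val"] = long_df["Val"].astype(int)

# Users holding the "Reviewer" badge with at least 3 reviews.
reviewers = long_df[(long_df["Name"] == "Reviewer") & (long_df["Val"] >= 3)]
print(reviewers)
```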
