
Conversion of list to sets in pandas dataframe

I would like to search for substrings in each row of a dataframe column. I read somewhere that searching is faster if the column is converted into a set. I am trying to use the approaches suggested here: How to convert list into set in pandas? But I get some unexpected output. My dataframe looks like this:

      R_id        Badges
0    7LBCS        New Reviewer - 1 Review
1    8FKME        New Reviewer - 1 Review; New Photographer - 1 Photo; Reviewer - 3 Reviews;

When I use either of the following approaches:

df['Badges'] = df.apply(lambda row: set(row['Badges']), axis=1)

OR

df['Badges'] = df['Badges'].apply(set)

the output that I get for each row in the dataframe above is a set of the unique characters of the string in that row. I cannot reproduce the exact output because, for some reason, the Spyder IDE crashes as soon as the output is generated. But the output for the first row looks something like:

{'1', '-', 'N', 'e', 'w', 'R', 'v', 'i', 'r'}

What could be going wrong here in the conversion to sets?

You have to split the string before you call set on it:

In [11]: df.Badges.str.split(r'\s*;\s*').apply(set)
Out[11]:
0                            {New Reviewer - 1 Review}
1    {Reviewer - 3 Reviews, , New Photographer - 1 ...
Name: Badges, dtype: object
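
As for why the original attempt produced single characters: set() iterates over whatever it is given, and iterating over a string yields its characters one by one. A minimal sketch in plain Python (the variable names are only for illustration):

```python
badges = "New Reviewer - 1 Review; New Photographer - 1 Photo"

# set() iterates over its argument; iterating a str yields single characters,
# so this is a set of unique characters, not a set of badges.
chars = set(badges)

# Splitting on ';' first gives a list of badge strings,
# so set() then keeps each badge whole.
whole = {part.strip() for part in badges.split(";")}

print(sorted(chars)[:5])
print(whole)
```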

To throw away the empty strings I might tweak it as follows:

In [12]: df.Badges.str.split(r'\s*;\s*').apply(lambda bs: set(b for b in bs if b))
Out[12]:
0                            {New Reviewer - 1 Review}
1    {Reviewer - 3 Reviews, New Photographer - 1 Ph...
Name: Badges, dtype: object

or you could strip the ';' first (if that's the only place the empty strings come from):

In [13]: df.Badges.str.strip(';').str.split(r'\s*;\s*').apply(set)
Out[13]:
0                            {New Reviewer - 1 Review}
1    {Reviewer - 3 Reviews, New Photographer - 1 Ph...
Name: Badges, dtype: object

The latter might be slightly more efficient, since it avoids a Python-level filter on every row.
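
For reference, here is a self-contained sketch that rebuilds the two sample rows from the question and then queries the resulting sets. The helper to_badge_set is a hypothetical name, and it uses plain Python splitting rather than the regex split above. Note that membership in a set is an exact match on the whole badge, not a substring search, so this only helps if you look up complete badge strings.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "R_id": ["7LBCS", "8FKME"],
    "Badges": [
        "New Reviewer - 1 Review",
        "New Reviewer - 1 Review; New Photographer - 1 Photo; Reviewer - 3 Reviews;",
    ],
})

def to_badge_set(text):
    """Split on ';', trim whitespace, and drop empty pieces (hypothetical helper)."""
    return {part.strip() for part in text.split(";") if part.strip()}

badge_sets = df["Badges"].apply(to_badge_set)

# Exact-match lookup: True only where the full badge string is in the set.
has_reviewer_3 = badge_sets.apply(lambda s: "Reviewer - 3 Reviews" in s)

print(badge_sets.tolist())
print(has_reviewer_3.tolist())
```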

Your data is not in a format that makes it easy to work with. I'd recommend an extension of Andy's code that gives each badge its own row, so you can then filter your data much more efficiently.

Start with str.split, and then extract key-value pairs using str.extract.

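# Split on ';', stack so each badge gets its own row, then extract the badge
# name (text before the '-') and the whole number that follows it.
df = df.set_index('R_id')\
       .Badges.str.split(r'\s*;\s*', expand=True)\
       .stack().reset_index(level=1, drop=True)\
       .str.extract(r'(?P<Name>[^-]+?)\s*-\s*(?P<Val>\d+)', expand=True)\
       .dropna()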

print(df)
                    Name Val
R_id                        
7LBCS      New Reviewer    1
8FKME      New Reviewer    1
8FKME  New Photographer    1
8FKME          Reviewer    3
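
Once the data is in this long format, filtering becomes ordinary boolean indexing. The sketch below builds the frame shown above by hand (assuming the names carry no stray whitespace) and converts Val to integers; long_df and the other variable names are only for illustration.

```python
import pandas as pd

# Long-format frame as printed above, typed in by hand for a self-contained example.
long_df = pd.DataFrame(
    {
        "Name": ["New Reviewer", "New Reviewer", "New Photographer", "Reviewer"],
        "Val": ["1", "1", "1", "3"],
    },
    index=pd.Index(["7LBCS", "8FKME", "8FKME", "8FKME"], name="R_id"),
)

# str.extract returns strings, so convert the counts before comparing numerically.
long_df["Val"] = long_df["Val"].astype(int)

# Exact match on the badge name.
reviewers = long_df[long_df["Name"] == "Reviewer"]

# Substring search on the badge name (literal text, not a regex).
any_reviewer = long_df[long_df["Name"].str.contains("Reviewer", regex=False)]
ids_with_reviewer_badge = any_reviewer.index.unique().tolist()

# Numeric filter on the count.
at_least_three = long_df[long_df["Val"] >= 3]

print(reviewers)
print(ids_with_reviewer_badge)
print(at_least_three)
```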
