
Assign values to DF rows based on string portions

I have a dataframe with one column (these are the header names of another dataframe). I am trying to assign weightings to them based on the strings contained in the rows. They all have long names (class- and subclass-like) separated by underscores, for example: email_Trading Only, readership_unique_client, roadshow_NDR_Con_Call_Meetings, forum_meeting.

I would like to assign weights to these based on the strings that occur before, between, or after the underscores.

I was thinking about creating a dictionary of sorts, but I am not sure how to loop through all the rows properly. Pseudocode:

for i in rows:
    if i contains 'email'  # before first underscore
        then 0.5  # assigned to corresponding row in new column of DF

Sample data and output (based on the first string portion before the underscore):

                                TITLES   WEIGHTS     
2                        emp_full_name     0
3                      emp_office_code     0
4              emp_country_office_code     0
..                                 ...
171   forum_presentation_Platinum Plus     0.5
172  forum_presentation_Private Client     0.5
173          forum_presentation_Silver     0.5

See the user guide on how to test for strings that contain a pattern.

You can solve it with something like:

df['WEIGHTS'] = df.TITLES.str.contains('email') * 0.5

Or create the column and then update it:

df['WEIGHTS'] = 0.0  # float default, so assigning 0.5 later does not change the column dtype
df.loc[df.TITLES.str.contains('email'), 'WEIGHTS'] = 0.5
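
To try both variants end to end, here is a minimal sketch on a tiny frame. The titles are taken from the question, but the frame itself is made up for illustration, and `forum` is used as the keyword only so that the result lines up with the sample output.

```python
import pandas as pd

# Hypothetical sample frame built from titles that appear in the question.
df = pd.DataFrame({
    "TITLES": [
        "emp_full_name",
        "emp_office_code",
        "forum_presentation_Silver",
        "email_Trading Only",
    ]
})

# Variant 1: multiply the boolean mask by the weight (True -> 0.5, False -> 0.0).
weights_mul = df["TITLES"].str.contains("forum") * 0.5

# Variant 2: start from a float default, then overwrite the matching rows.
df["WEIGHTS"] = 0.0
df.loc[df["TITLES"].str.contains("forum"), "WEIGHTS"] = 0.5

print(df)
```

Both variants should give the same column here; the second is easier to extend when several keywords need different weights.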

Update

.str.contains treats its pattern as a regular expression by default, so you can match several alternatives at once, like

df.loc[df.TITLES.str.contains('email|forum'), 'WEIGHTS'] = 0.5
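
Note that `str.contains` matches anywhere in the string, so `email` would also hit a title with that word in the middle. If only the portion before the first underscore should count, one option is to anchor the pattern at the start. A minimal sketch; the titles below, in particular `client_email_count`, are made up for illustration:

```python
import pandas as pd

titles = pd.Series([
    "email_Trading Only",
    "forum_meeting",
    "client_email_count",  # hypothetical title with 'email' in the middle
    "emp_full_name",
])

# ^ anchors the match at the start of the string; (?:...) is a non-capturing
# group, which avoids the pandas warning about match groups in str.contains.
is_prefix = titles.str.contains(r"^(?:email|forum)_")

weights = is_prefix * 0.5
print(weights.tolist())
```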

You can also get the first part of each string (the text before the first underscore) with

label = df.TITLES.str.split('_').str[0]

Then use a mapper with Series.replace, but you would need to include every possible prefix, because labels that are not in the mapper are left unchanged

df['WEIGHTS'] = label.replace({'email': 0.5, 'forum': 0.2})  # ... plus the remaining prefixes
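
If listing every prefix is impractical, `Series.map` is a possible alternative to `replace`: labels missing from the dictionary become NaN, which can then be filled with a default. A minimal sketch, assuming a default weight of 0 for unlisted prefixes; the dictionary name `prefix_weights` is hypothetical and the weights simply reuse the 0.5 and 0.2 from above:

```python
import pandas as pd

df = pd.DataFrame({
    "TITLES": [
        "email_Trading Only",
        "readership_unique_client",
        "roadshow_NDR_Con_Call_Meetings",
        "forum_meeting",
    ]
})

# Illustrative weights only; extend with whatever prefixes matter.
prefix_weights = {"email": 0.5, "forum": 0.2}

# Text before the first underscore.
label = df["TITLES"].str.split("_").str[0]

# Prefixes not in the dictionary map to NaN, then fall back to 0.0.
df["WEIGHTS"] = label.map(prefix_weights).fillna(0.0)
print(df)
```

This keeps the dictionary limited to the prefixes that actually carry a weight.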

