Assign values to DF rows based on string portions
I have a dataframe with one column (these are the header names of another dataframe). I am trying to assign weightings to these based on strings contained in the rows. They all have long names (class and subclass like) separated by underscores, for example: email_Trading Only, readership_unique_client, roadshow_NDR_Con_Call_Meetings, forum_meeting.
I would like to assign weights to these based on the strings that occur before, in between, or after the underscores.
I was thinking about creating a dictionary of sorts, but I am not sure how to loop over all the rows properly. Pseudocode here:
for i in rows:
    if i contains 'email'   # before first underscore
        then 0.5            # assigned to corresponding row in new column of DF
Sample data and output (based on the first string portion before the underscore):
TITLES WEIGHTS
2 emp_full_name 0
3 emp_office_code 0
4 emp_country_office_code 0
.. ...
171 forum_presentation_Platinum Plus 0.5
172 forum_presentation_Private Client 0.5
173 forum_presentation_Silver 0.5
See the user guide on how to test for strings that contain a pattern.
You can solve it with something like:
df['WEIGHTS'] = df.TITLES.str.contains('email') * 0.5
Or create the column and then update it:
df['WEIGHTS'] = 0.0  # float, so 0.5 can be assigned without a dtype change
df.loc[df.TITLES.str.contains('email'), 'WEIGHTS'] = 0.5
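To see both variants side by side, here is a minimal, self-contained sketch. The three titles are made-up sample rows modelled on the names in the question, and WEIGHTS_2 is a hypothetical second column added only so the two results can be compared.

```python
import pandas as pd

# Hypothetical sample titles, modelled on the names in the question.
df = pd.DataFrame({"TITLES": [
    "emp_full_name",
    "email_Trading Only",
    "forum_presentation_Silver",
]})

# Option 1: multiply the boolean mask by the weight (False -> 0.0, True -> 0.5).
df["WEIGHTS"] = df["TITLES"].str.contains("email") * 0.5

# Option 2: start from a float column, then overwrite only the matching rows.
df["WEIGHTS_2"] = 0.0
df.loc[df["TITLES"].str.contains("email"), "WEIGHTS_2"] = 0.5

print(df)
```

Both columns should end up identical. The second form is probably easier to extend when several patterns each get their own weight.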
Update
The .str accessor methods such as contains work with regex by default, so you can include alternative patterns like:
df.loc[df.TITLES.str.contains('email|forum'), 'WEIGHTS'] = 0.5
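Note that an unanchored pattern matches anywhere in the title. If the weight should depend only on the portion before the first underscore, one possible approach is to anchor the regex at the start of the string. The sketch below assumes that requirement; the title emp_email_address is a hypothetical row included only to show the difference.

```python
import pandas as pd

df = pd.DataFrame({"TITLES": [
    "email_Trading Only",
    "forum_meeting",
    "emp_email_address",  # hypothetical: 'email' is not the first portion
]})

# ^ anchors the match at the start of the string; (?:...) groups the
# alternatives without capturing, which avoids the pandas warning about
# match groups in str.contains.
starts_with = df["TITLES"].str.contains(r"^(?:email|forum)_")

df["WEIGHTS"] = 0.0
df.loc[starts_with, "WEIGHTS"] = 0.5

print(df)
```

With this pattern the third row keeps a weight of 0, whereas a plain 'email|forum' would have matched it.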
You can also get the first part of the strings with:
label = df.TITLES.str.split('_').str[0]  # split on underscores, keep the first portion
Then use a mapper with series.replace, but you would need to include all possible prefixes:
df['WEIGHTS'] = label.replace({'email': 0.5, 'forum': 0.2 ...})
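With replace, any prefix missing from the mapper stays in the column as a string. A possible alternative, sketched here as an assumption rather than as part of the answer above, is Series.map followed by fillna, so unlisted prefixes fall back to 0. The weights dictionary and the last title are hypothetical. The second half looks at every underscore-separated portion, in case a weight should also apply to strings between or after the underscores.

```python
import pandas as pd

df = pd.DataFrame({"TITLES": [
    "emp_full_name",
    "email_Trading Only",
    "forum_presentation_Silver",
    "roadshow_forum_Meetings",  # hypothetical: 'forum' in the middle
]})

# Hypothetical weights; anything not listed falls back to 0.
weights = {"email": 0.5, "forum": 0.2}

# First portion only: split once on the first underscore.
label = df["TITLES"].str.split("_", n=1).str[0]
# map() yields NaN for unlisted prefixes; fillna() turns those into 0.
df["WEIGHTS"] = label.map(weights).fillna(0.0)

# Any portion: one row per token, look up each token, then keep the
# largest weight found for each original row.
tokens = df["TITLES"].str.split("_").explode()
df["WEIGHTS_ANY"] = tokens.map(weights).fillna(0.0).groupby(level=0).max()

print(df)
```

Taking the maximum per row is just one way to combine several matching portions; a sum or a first-match rule would work equally well, depending on what the weights are meant to express.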