Fill empty Pandas column based on condition on substring
I have a dataset with a Job_Title column, and I added a Categories column that I want to use to categorize my job titles. For example, every job title that contains the word 'Analytics' should be categorized as Data. The label Data should then appear in the Categories column.
I have created a dictionary whose keys are the words I want to find in the Job_Title column and whose values are the labels I want to write to the Categories column.
# Creating a new dictionary with the new categories
cat_type_dic = {}
with open("categories.txt") as cat_type_file:
    for line in cat_type_file:
        # Each line has the form "key;value"; strip the trailing newline first
        key, value = line.strip().split(";")
        cat_type_dic[key] = value
print(cat_type_dic)
Then I tried to write a loop based on a condition. Basically, if a key of the dictionary is a substring of the Job_Title value, fill the Categories column with the corresponding value. This is what I tried:
for i in range(len(df)):
    if df.loc["Job_Title"].str.contains(cat_type_dic[i]):
        df["Categories"] = df["Categories"].str.replace(cat_type_dic[i], cat_type_dic.get(i))
Of course, it's not working. I think I am not accessing the key and value correctly. Any clue?
This is the error message that I am getting:
TypeError                                 Traceback (most recent call last) in
      1 for i in range(len(df)):
----> 2     if df.iloc["Job_Title"].str.contains(cat_type_dic[i]):
      3         df["Categories"] = df["Categories"].str.replace(cat_type_dic[i], cat_type_dic.get(i))

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    929
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932
    933     def _is_scalar_access(self, key: tuple):

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1561             key = item_from_zerodim(key)
   1562             if not is_integer(key):
-> 1563                 raise TypeError("Cannot index by location index with a non-integer key")
   1564
   1565             # validate the location

TypeError: Cannot index by location index with a non-integer key
Thanks a lot!
Does the following code give you what you need?
import pandas as pd

df = pd.DataFrame()
df['Job_Title'] = ['Business Analyst', 'Data Scientist', 'Server Analyst']
cat_type_dic = {'Business': ['CatB1', 'CatB2'], 'Scientist': ['CatS1', 'CatS2', 'CatS3']}
list_keys = list(cat_type_dic.keys())

def label_extracter(x):
    list_matched_keys = list(filter(lambda y: y in x['Job_Title'], list_keys))
    category_label = ' '.join([' '.join(cat_type_dic[key]) for key in list_matched_keys])
    return category_label

df['Categories'] = df.apply(lambda x: label_extracter(x), axis=1)
print(df)
          Job_Title         Categories
0  Business Analyst        CatB1 CatB2
1    Data Scientist  CatS1 CatS2 CatS3
2    Server Analyst
apply helps when a loop is necessary. For each row, label_extracter checks whether Job_Title contains any key of the dictionary defined earlier. I converted the keys to a list to make that check easier. label_extracter then gets the values assigned to each matched key, which are in list format. Each list is converted to str by putting ' ' (white space) between its values with ' '.join. That gives a list of strings, one per matched key, so the outer ' '.join converts it into a single string.
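As for why the original loop fails: .iloc only accepts integer positions, which is exactly what the TypeError says, and cat_type_dic[i] looks up the integer i as a key in a dictionary that is keyed by strings. If each keyword maps to a single label, as a "key;value" file would give, a vectorized alternative is to loop over the dictionary items instead of the rows. The following is only a sketch under that assumption; the labels 'Management' and 'Data' are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Job_Title": ["Business Analyst", "Data Scientist", "Server Analyst"]})
df["Categories"] = ""

# Hypothetical flat mapping: one label per keyword (illustrative values only).
cat_type_dic = {"Business": "Management", "Scientist": "Data"}

for keyword, label in cat_type_dic.items():
    # Boolean mask: True where Job_Title contains the keyword as a literal substring.
    mask = df["Job_Title"].str.contains(keyword, regex=False)
    # Write the label only into the rows that matched.
    df.loc[mask, "Categories"] = label

print(df)
```

Note that if a job title matches several keywords, the label written last wins in this version, whereas the apply-based code above concatenates all matches.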