Fill empty Pandas column based on condition on substring

I have a dataset with a Job_Title column, and I added a Categories column that I want to use to categorize the job titles. For example, every job title that contains the word 'Analytics' should be categorized as Data, and the label Data should then appear in the Categories column.

(Image: Dataset 1)

I created a dictionary whose keys are the words I want to find in the Job_Title column and whose values are the labels I want to write to the Categories column.

# Create a dictionary that maps each keyword to its category
cat_type_dic = {}
with open("categories.txt") as cat_type_file:
    for line in cat_type_file:
        # Each line is expected to look like "keyword;category"
        key, value = line.strip().split(";")
        cat_type_dic[key] = value

print(cat_type_dic)
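
One detail worth checking when building the dictionary: a line read from a file keeps its trailing newline, so the value needs to be stripped or the labels will end in "\n". The sketch below illustrates this with an in-memory stand-in for categories.txt; the file contents (Analytics;Data and so on) are assumed for illustration, since the real file is not shown.

```python
import io

# Hypothetical stand-in for categories.txt: one "keyword;category" pair per line.
sample_file = io.StringIO("Analytics;Data\nEngineer;Engineering\n")

cat_type_dic = {}
for line in sample_file:
    line = line.strip()          # drop the trailing newline
    if not line:
        continue                 # skip blank lines
    key, value = line.split(";", 1)
    cat_type_dic[key.strip()] = value.strip()

print(cat_type_dic)
```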

Then I tried to write a loop with a condition: if a dictionary key is a substring of the value in the Job_Title column, fill the Categories column with the corresponding value. This is what I tried:

for i in range(len(df)):
   if df.iloc["Job_Title"].str.contains(cat_type_dic[i]):
      df["Categories"] = df["Categories"].str.replace(cat_type_dic[i], cat_type_dic.get(i))

Of course, it does not work. I think I am not accessing the keys and values correctly. Any clue?

This is the error message I am getting:

TypeError                                 Traceback (most recent call last)
      1 for i in range(len(df)):
----> 2    if df.iloc["Job_Title"].str.contains(cat_type_dic[i]):
      3       df["Categories"] = df["Categories"].str.replace(cat_type_dic[i], cat_type_dic.get(i))

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    929
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932
    933     def _is_scalar_access(self, key: tuple):

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1561             key = item_from_zerodim(key)
   1562             if not is_integer(key):
-> 1563                 raise TypeError("Cannot index by location index with a non-integer key")
   1564
   1565             # validate the location

TypeError: Cannot index by location index with a non-integer key

Thanks a lot!
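
A note on the traceback: df.iloc indexes by integer position, so passing the string "Job_Title" raises this TypeError; a column is selected with df["Job_Title"] instead. Separately, cat_type_dic[i] with an integer i looks up a key that does not exist, because the dictionary keys are words. A minimal sketch of the intended loop, iterating over the dictionary's key/value pairs and using a boolean mask, might look like this. The sample titles and dictionary entries are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data; the real df and cat_type_dic come from the question.
df = pd.DataFrame({"Job_Title": ["Analytics Manager", "Software Engineer", "Chef"]})
df["Categories"] = ""
cat_type_dic = {"Analytics": "Data", "Engineer": "Engineering"}

# Iterate over the dictionary's key/value pairs, not over row numbers.
for key, value in cat_type_dic.items():
    # True for rows whose Job_Title contains the key as a literal substring.
    mask = df["Job_Title"].str.contains(key, regex=False)
    # Write the category only into the matching rows.
    df.loc[mask, "Categories"] = value

print(df)
```

With this approach, a title that matches several keys keeps the value of the last matching key, so the order of the dictionary matters.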

Does the following code give you what you need?

import pandas as pd

df = pd.DataFrame()
df['Job_Title'] = ['Business Analyst', 'Data Scientist', 'Server Analyst']

cat_type_dic = {'Business': ['CatB1', 'CatB2'], 'Scientist': ['CatS1', 'CatS2', 'CatS3']}

list_keys = list(cat_type_dic.keys())

def label_extracter(x):
    list_matched_keys = list(filter(lambda y: y in x['Job_Title'], list_keys))
    category_label = ' '.join([' '.join(cat_type_dic[key]) for key in list_matched_keys])
    return category_label

df['Categories'] = df.apply(lambda x: label_extracter(x), axis=1)

print(df)

          Job_Title         Categories
0  Business Analyst        CatB1 CatB2
1    Data Scientist  CatS1 CatS2 CatS3
2    Server Analyst                   
EDIT: Explanation added. @SofyPond
  • apply helps when a loop over the rows is needed.
  • I defined a function that checks whether Job_Title contains any key of the dictionary defined earlier. I converted the keys to a list first to make the check easier.
  • (list_label was renamed to category_label, since it is no longer a list.) In label_extracter, category_label collects the values assigned to each matched key, which are stored as lists. Each list is converted to a str by joining its values with ' ' (a space). When list_matched_keys is not empty, the inner ' '.join calls produce a list of strings, one per matched key, so the outer ' '.join combines them into a single string.
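
The steps described above can be traced on a single title. This is only an illustrative sketch: the title 'Business Scientist' is made up so that both dictionary keys match, and the intermediate variable names are not part of the original answer.

```python
cat_type_dic = {'Business': ['CatB1', 'CatB2'], 'Scientist': ['CatS1', 'CatS2', 'CatS3']}
title = 'Business Scientist'  # hypothetical title that matches both keys

# Step 1: keep the keys that occur as substrings of the title.
matched_keys = [key for key in cat_type_dic if key in title]

# Step 2 (inner join): turn each key's list of labels into one string.
inner = [' '.join(cat_type_dic[key]) for key in matched_keys]

# Step 3 (outer join): combine the per-key strings into the final label.
category_label = ' '.join(inner)

print(matched_keys)    # ['Business', 'Scientist']
print(inner)           # ['CatB1 CatB2', 'CatS1 CatS2 CatS3']
print(category_label)  # CatB1 CatB2 CatS1 CatS2 CatS3
```

If no key matches, matched_keys is empty and category_label is the empty string, which is why 'Server Analyst' gets a blank category in the output above.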
