Efficient way to replace values in a column if part of those values are in predefined lists in pandas

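So I've actually solved this problem already, but the way I did it may not be the most efficient.

For one column in my database, Industry, I want to replace values. If a value contains "tech", "technology", or a similar word, I want to replace that value with the word "Technology".

I followed the basic algorithm below using apply. It loops through a predefined list (e.g. science), checks whether any of its entries occurs in the current Industry cell, and replaces the cell if so.

It then does the same for the next list. I only have two lists so far, but I'll probably have over a dozen once I'm done.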

def industry_convert(row):
    
    science = ["research", "science", "scientific", "scientist", "academia", "education", "academic"]
    tech = ["technology", "tech", "software"]

    for v in science:
        if v.lower() in row.Industry.lower():
            row.Industry = "Research, Science, & Education"
            
    for v in tech:
        if v.lower() in row.Industry.lower():
            row.Industry = "Technology"
            
    return row

df = df.apply(industry_convert, axis = 1)

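I'm just wondering whether this is the best way to do it, or whether there is a more Pythonic / pandas way?

Edit:

Here is what some of the Industry column looks like: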

Industry
Research Scientist
Science: Education
Tech
Technical Assistance
Technology
Medical
Hospitality

And here is what it looks like after applying the code:

Industry            
Research, Science, & Education
Research, Science, & Education
Technology
Technology
Technology
Medical
Hospitality
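
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample column and applies the same substring rule to the Industry column alone rather than to whole rows. The helper name convert_industry is hypothetical (it is not from the post above), and the sketch assumes every cell is a string.

```python
import pandas as pd

# Sample data copied from the Industry column shown above.
df = pd.DataFrame({
    "Industry": [
        "Research Scientist",
        "Science: Education",
        "Tech",
        "Technical Assistance",
        "Technology",
        "Medical",
        "Hospitality",
    ]
})

science = ["research", "science", "scientific", "scientist",
           "academia", "education", "academic"]
tech = ["technology", "tech", "software"]


def convert_industry(value):
    """Hypothetical helper: return the category label for one cell."""
    lowered = value.lower()
    # Case-insensitive substring match against each keyword list.
    if any(word in lowered for word in science):
        return "Research, Science, & Education"
    if any(word in lowered for word in tech):
        return "Technology"
    # No keyword matched: keep the original value.
    return value


# Series.map calls the helper once per cell of the column.
converted = df["Industry"].map(convert_industry)
print(converted.tolist())
```

This is still a Python-level loop over the cells, so it mainly serves as a reference result to compare faster approaches against.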

Tell me if this works. I updated the for loops in your function:

# Lowercase the keyword lists once, outside the function,
# so this is not repeated for every row.
science = [w.lower() for w in ["research", "science", "scientific", "scientist", "academia", "education", "academic"]]
tech = [w.lower() for w in ["technology", "tech", "software"]]

def industry_convert(row):
    industry = row.Industry.lower()
    # Substring match: does any keyword occur in this cell?
    if any(word in industry for word in science):
        row.Industry = "Research, Science, & Education"
    elif any(word in industry for word in tech):
        row.Industry = "Technology"
    return row

df = df.apply(industry_convert, axis=1)

I lowercase the lists only once, so they are not recomputed for every row, which saves some work. Hope it works, happy coding ^-^

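Personally, I would use str.contains with .loc to assign the new values.

This will be many times faster than looping over the rows and checking each one individually. (Row-by-row looping is an anti-pattern as far as the pandas API is concerned.)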

science = ["research", "science", "scientific", "scientist", "academia", "education", "academic"]
tech = ["technology", "tech", "software"]

df.loc[df['Industry'].str.contains(f"{'|'.join(science)}",regex=True,case=False),
                         'industry_new'] = "Research, Science, & Education"

df.loc[df['Industry'].str.contains(f"{'|'.join(tech)}",regex=True,case=False),
                         'industry_new'] = "Technology"


df['industry_new'] = df['industry_new'].fillna(df['Industry'])  

print(df)

               Industry                    industry_new
0    Research Scientist  Research, Science, & Education
1    Science: Education  Research, Science, & Education
2                  Tech                      Technology
3  Technical Assistance                      Technology
4            Technology                      Technology
5               Medical                         Medical
6           Hospitality                     Hospitality
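
Since the question mentions ending up with a dozen or so lists, the same str.contains idea could be driven from a dictionary instead of one .loc statement per list. The following is only a sketch under some assumptions: the categories dictionary and its ordering are hypothetical, keywords are escaped with re.escape so they are matched literally, and the first matching label wins (in the two-statement version above, a later assignment overwrites an earlier one).

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Industry": [
        "Research Scientist",
        "Science: Education",
        "Tech",
        "Technical Assistance",
        "Technology",
        "Medical",
        "Hospitality",
    ]
})

# Hypothetical mapping of label -> keywords; extend it with more lists as needed.
# Dict order is the priority order: earlier labels win when several match.
categories = {
    "Research, Science, & Education": [
        "research", "science", "scientific", "scientist",
        "academia", "education", "academic",
    ],
    "Technology": ["technology", "tech", "software"],
}

result = df["Industry"].copy()                    # default: keep the original value
assigned = pd.Series(False, index=df.index)       # rows that already got a label

for label, keywords in categories.items():
    # Build one alternation pattern per category, escaping regex metacharacters.
    pattern = "|".join(re.escape(word) for word in keywords)
    matches = df["Industry"].str.contains(pattern, case=False, regex=True, na=False)
    mask = matches & ~assigned                    # skip rows labelled by an earlier category
    result[mask] = label
    assigned |= mask

df["industry_new"] = result
print(df)
```

Each category costs one vectorized pass over the column, so the Python-level loop runs once per list rather than once per row.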
