Map str.contains across a pandas DataFrame

Question

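Python beginner here. I'm looking to create a dictionary that maps strings to associated values. I have a DataFrame and would like to create a new column that labels each row with a value x whenever a string matches.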

import pandas as pd
df = pd.DataFrame({'comp':['dell notebook', 'dell notebook S3', 'dell notepad', 'apple ipad', 'apple ipad2', 'acer chromebook', 'acer chromebookx', 'mac air', 'mac pro', 'lenovo x4'],
              'price':range(10)})

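For example, I would like to take the df above, create a new column df['company'], and set it from a mapping of strings.

I was thinking of doing something like this: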

product_map = {'dell':'Dell Inc.',
               'apple':'Apple Inc.',
               'acer': 'Acer Inc.',
               'mac': 'Apple Inc.',
               'lenovo': 'Dell Inc.'}

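Then I would like to iterate through it to check the df.comp column, see whether each entry contains one of these strings, and set the df.company column to the corresponding value in product_map.

I'm not sure how to do this correctly, though.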

Answer 1

There are many ways to do this. One way is:

def like_function(x):
    group = "unknown"
    for key in product_map:
        if key in x:
            group = product_map[key]
            break
    return group

df['company'] = df.comp.apply(like_function)

Answer 2

A vectorized solution, inspired by MaxU's solution to a similar problem.

x = df.comp.str.split(expand=True)
df['company'] = None
df['company'] = df['company'].fillna(x[x.isin(product_map.keys())]\
                                     .ffill(axis=1).bfill(axis=1).iloc[:, 0])
df['company'] = df['company'].replace(product_map)
print(df)
#               comp  price     company
#0     dell notebook      0   Dell Inc.
#1  dell notebook S3      1   Dell Inc.
#2      dell notepad      2   Dell Inc.
#3        apple ipad      3  Apple Inc.
#4       apple ipad2      4  Apple Inc.
#5   acer chromebook      5   Acer Inc.
#6  acer chromebookx      6   Acer Inc.
#7           mac air      7  Apple Inc.
#8           mac pro      8  Apple Inc.
#9         lenovo x4      9   Dell Inc.

Answer 3

Here is a fun way, especially if you are learning Python. You can subclass dict and override __getitem__ to look up partial strings.

class dict_partial(dict):
    def __getitem__(self, value):
        for k in self.keys():
            if k in value:
                return self.get(k)
        else:
            return self.get(None)

product_map = dict_partial({'dell':'Dell Inc.', 'apple':'Apple Inc.',
                            'acer': 'Acer Inc.', 'mac': 'Apple Inc.',
                            'lenovo': 'Dell Inc.'})

df['company'] = df['comp'].apply(lambda x: product_map[x])

#                comp  price     company
# 0     dell notebook      0   Dell Inc.
# 1  dell notebook S3      1   Dell Inc.
# 2      dell notepad      2   Dell Inc.
# 3        apple ipad      3  Apple Inc.
# 4       apple ipad2      4  Apple Inc.
# 5   acer chromebook      5   Acer Inc.
# 6  acer chromebookx      6   Acer Inc.
# 7           mac air      7  Apple Inc.
# 8           mac pro      8  Apple Inc.
# 9         lenovo x4      9   Dell Inc.

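My only annoyance with this approach is that subclassing dict and overriding the [] syntax does not also override dict.get. If it did, we could get rid of the lambda and use df['comp'].map(product_map.get). There doesn't seem to be an obvious solution.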
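One possible workaround, offered as a sketch and not as part of the original answer, is to override get explicitly so that it delegates to the partial lookup. The class name DictPartial and the sample rows below are hypothetical; the unmatched case returns None, mirroring the behaviour of the class above.

```python
import pandas as pd

class DictPartial(dict):
    """dict whose lookups succeed when a key is a substring of the query."""

    def __getitem__(self, value):
        for k in self.keys():
            if k in value:
                # Call the base-class lookup to avoid recursing into get().
                return dict.__getitem__(self, k)
        return None  # no key is contained in value

    def get(self, value, default=None):
        # Route .get() through the partial lookup as well.
        result = self[value]
        return default if result is None else result

product_map = DictPartial({'dell': 'Dell Inc.', 'apple': 'Apple Inc.',
                           'acer': 'Acer Inc.', 'mac': 'Apple Inc.',
                           'lenovo': 'Dell Inc.'})

# Hypothetical sample data; the last row matches no key.
df = pd.DataFrame({'comp': ['dell notebook', 'mac air', 'unknown brand']})

# With get overridden, no lambda is needed.
df['company'] = df['comp'].map(product_map.get)
```

Passing the bound method product_map.get makes .map() treat it as a plain callable, so each string goes through the substring lookup.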

Answer 4

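As far as I know, pandas does not ship with a "substring mapping" method. The .map() method does not support substrings, and the .str.contains() method only tests a single pattern (a regular expression by default) and returns booleans, which does not scale well to many keys.

You can get the desired result by writing a simple function. You can then use .apply() with a lambda function to generate the desired "company" column. As a bonus, this keeps the code readable and the function reusable. Hope that helps.

This should give you the desired "company" column: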

import numpy as np

def map_substring(s, dict_map):
    # Return the value for the first key that occurs as a substring of s.
    for key in dict_map:
        if key in s:
            return dict_map[key]
    return np.nan

df['company'] = df['comp'].apply(lambda x: map_substring(x, product_map))
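For comparison, a regex-based route can be sketched with Series.str.extract followed by .map(). This is an alternative technique and not the answer's own method; the sample rows are hypothetical, and it assumes the keys are plain strings that are escaped before being joined into one pattern.

```python
import re
import pandas as pd

product_map = {'dell': 'Dell Inc.', 'apple': 'Apple Inc.',
               'acer': 'Acer Inc.', 'mac': 'Apple Inc.',
               'lenovo': 'Dell Inc.'}

# Hypothetical sample data; the last row matches no key.
df = pd.DataFrame({'comp': ['dell notebook', 'apple ipad2',
                            'mac pro', 'hp pavilion']})

# One alternation pattern built from the keys; re.escape guards against
# keys that contain regex metacharacters.
pattern = '(' + '|'.join(re.escape(k) for k in product_map) + ')'

# str.extract pulls out the first matching key per row (NaN if none),
# and .map then looks that key up in the plain dict.
df['company'] = df['comp'].str.extract(pattern, expand=False).map(product_map)
```

Note that the regex picks the match that starts earliest in the string, whereas the loop in map_substring returns the first key in dictionary order, so the two can differ when a string contains more than one key.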

Problem description

4 solutions

Solution 1
6 Accepted 2018-02-02 20:45:34

Solution 2
2 2018-02-02 21:02:55

Solution 3
1 2018-02-02 21:36:32

Solution 4
0 2019-11-25 17:50:06