Pandas categorize a dataframe based on another dataframe with substrings

I'm trying to learn pandas and Python in order to move some tasks from Excel to pandas/Python. I have a large CSV file from my bank with over 10000 records, and I want to categorize the records based on their description. For that I have a large mapping file with keywords. In Excel I used VLOOKUP, and I'm trying to reproduce that solution in pandas/Python.

I can read the CSV into a dataframe dfMain. One text column in dfMain, called Description, is the input I want to categorize using a mapping file loaded as dfMap.

Simplified, dfMain looks something like this:

Datum       Bedrag  Description     
2020-01-01  -166.47 een cirkel voor je uit
2020-01-02  -171.79 even een borreling
2020-01-02  -16.52  stilte zacht geluid
2020-01-02  -62.88  een steentje in het water
2020-01-02  -30.32  gooi jij je zorgen weg
2020-01-02  -45.99  dan ben je laf weet je dat
2020-01-02  -322.44 je klaagt ook altijd over pech
2020-01-03  -4.80   jij kan niet ophouden zorgen
2020-01-07  5.00    de wereld te besnauwen

Simplified, dfMap looks like this:

    sleutel     code
0   borreling   A1
1   zorgen      B2
2   steentje    C2
3   een         C1

dfMap contains keywords ('sleutel') and a category code ('code').

When a 'sleutel' is a substring of 'Description' in dfMain, a new column called 'category' in dfMain should get the value of the matching code. I'm aware that multiple keywords can apply to some descriptions, but the first match wins; in other words, the number of rows in dfMain must stay the same.

The resulting dataframe must then look like this:

Out[34]:

Datum        Bedrag Description                     category        
2020-01-01  -166.47 een cirkel voor je uit          C1
2020-01-02  -171.79 even een borreling              A1
2020-01-02  -16.52  stilte zacht geluid             NaN
2020-01-02  -62.88  een steentje in het water       C2
2020-01-02  -30.32  gooi jij je zorgen weg          B2
2020-01-02  -45.99  dan ben je laf weet je dat      NaN
2020-01-02  -322.44 je klaagt ook altijd over pech  NaN
2020-01-03  -4.80   jij kan niet ophouden zorgen    B2
2020-01-07  5.00    de wereld te besnauwen          NaN
 

I tried a lot of things with join but can't get it to work.

An efficient solution is to use a regex with extract and then map the result:

regex = '(%s)' % dfMap['sleutel'].str.cat(sep='|')

dfMain['category'] = (
 dfMain['Description']
   .str.extract(regex, expand=False)
   .map(dfMap.set_index('sleutel')['code'])
 )

Output:

        Datum  Bedrag                     Description category
0  2020-01-01 -166.47          een cirkel voor je uit       C1
1  2020-01-02 -171.79              even een borreling       C1
2  2020-01-02  -16.52             stilte zacht geluid      NaN
3  2020-01-02  -62.88       een steentje in het water       C1
4  2020-01-02  -30.32          gooi jij je zorgen weg       B2
5  2020-01-02  -45.99      dan ben je laf weet je dat      NaN
6  2020-01-02 -322.44  je klaagt ook altijd over pech      NaN
7  2020-01-03   -4.80    jij kan niet ophouden zorgen       B2
8  2020-01-07    5.00          de wereld te besnauwen      NaN

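The generated regex ends up as '(borreling|zorgen|steentje|een)'. Note that extract returns the leftmost match in each description, so rows 1 and 3 get C1 (from 'een') rather than the A1 and C2 shown in the question's expected output.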
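If the real mapping file contains keywords with regex metacharacters (for example '.' or '+'), the pattern above would treat them as regex syntax. A minimal sketch of the same approach with each keyword escaped via re.escape, run on a shortened three-row sample of the question's data (the shortened sample and the variable names pattern and matched are assumptions made for this illustration):

```python
import re

import pandas as pd

# Shortened sample of the question's data (illustration only)
dfMain = pd.DataFrame({
    "Description": [
        "een cirkel voor je uit",
        "even een borreling",
        "stilte zacht geluid",
    ]
})
dfMap = pd.DataFrame({
    "sleutel": ["borreling", "zorgen", "steentje", "een"],
    "code": ["A1", "B2", "C2", "C1"],
})

# Escape every keyword so it is matched literally, then join with '|'
pattern = "(" + "|".join(re.escape(k) for k in dfMap["sleutel"]) + ")"

# extract returns the leftmost keyword found in each description (NaN if none)
matched = dfMain["Description"].str.extract(pattern, expand=False)

# Translate the matched keyword into its category code
dfMain["category"] = matched.map(dfMap.set_index("sleutel")["code"])

print(dfMain)
```

As with the unescaped version, "even een borreling" is categorized through 'een', because that keyword occurs further to the left in the text than 'borreling'.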

Try this:

import pandas as pd

# prepare the data

Datum = ['2020-01-01', '2020-01-02', '2020-01-02', '2020-01-02', '2020-01-02', '2020-01-02', '2020-01-02', '2020-01-03', '2020-01-07']
Bedrag = [-166.47, -171.79, -16.52, -62.88, -30.32, -45.99, -322.44, -4.80, 5.00]
Description = ["een cirkel voor je uit", "even een borreling", "stilte zacht geluid", "een steentje in het water",
               "gooi jij je zorgen weg", "dan ben je laf weet je dat", "je klaagt ook altijd over pech", "jij kan niet ophouden zorgen", "de wereld te besnauwen"]

dfMain = pd.DataFrame(Datum, columns=['Datum'])
dfMain['Bedrag'] = Bedrag
dfMain['Description'] = Description

sleutel = ["borreling", "zorgen", "steentje", "een"]
code = ["A1", "B2", "C2", "C1"]

dfMap = pd.DataFrame(sleutel, columns=['sleutel'])
dfMap['code'] = code
print(dfMap)
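# solution

# keyword -> code lookup; dict order follows dfMap, so earlier keywords win
map_code = dict(zip(dfMap['sleutel'], dfMap['code']))


def extract_codes(description):
    # return the code of the first keyword found in the description
    for item in map_code:
        if item in description:
            return map_code[item]
    # no keyword matched: return a real missing value, not the string "NaN"
    return float('nan')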

dfMain['category'] = dfMain['Description'].apply(extract_codes)

print(dfMain)
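With a large mapping file, the same "first keyword in dfMap wins" rule can also be applied one keyword at a time using vectorized substring tests instead of a Python function per row. The following is only a sketch under that assumption, not a benchmarked replacement; the helper variable category is introduced for this illustration. Keywords are processed in reverse dfMap order so that earlier keywords overwrite later ones:

```python
import pandas as pd

dfMain = pd.DataFrame({
    "Description": [
        "een cirkel voor je uit",
        "even een borreling",
        "stilte zacht geluid",
        "een steentje in het water",
        "gooi jij je zorgen weg",
        "dan ben je laf weet je dat",
        "je klaagt ook altijd over pech",
        "jij kan niet ophouden zorgen",
        "de wereld te besnauwen",
    ]
})
dfMap = pd.DataFrame({
    "sleutel": ["borreling", "zorgen", "steentje", "een"],
    "code": ["A1", "B2", "C2", "C1"],
})

# Start with every row uncategorized (NaN)
category = pd.Series(float("nan"), index=dfMain.index, dtype="object")

# Walk the mapping from last to first, so the first keyword is written last
for sleutel, code in zip(dfMap["sleutel"][::-1], dfMap["code"][::-1]):
    # Plain substring test (regex=False), so no escaping is needed
    mask = dfMain["Description"].str.contains(sleutel, regex=False)
    category.loc[mask] = code

dfMain["category"] = category
print(dfMain)
```

On the sample data this gives A1 for "even een borreling" and C2 for "een steentje in het water", matching the expected output in the question.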
