Filter dataframe for a specific string

Question

I am working with a dataframe of that kind.

    num person                  cash
0   1   personone               29
1   1   persontwo               81
2   1   personone:              17
3   1   personone               75
4   1   personone and persontwo 62
5   1   personone's friend      55

I am using the following code to filter a dataframe based on vector of strings.

people = ["personone", "persontwo"] 
pattern = r"\b{}\b".format("|".join(people))

df[df["person"].str.match(pattern)]

The problem is that the code produces a bit more than I need. When I check for df["person"].unique() it gives me an array of:

array(["personone", "persontwo", "personone:", "personone ", "personone and persontwo", "personone's friend"], dtype=object)

Even though I used the word boundaries \b .

The outcome I would like to achieve is: combine personone , personone: and personone (the last one with the space at the end). And produce the output of:

    num person    cash
0   1   personone 121
1   1   persontwo 81

Ie combine the three variants of personone and disregard its all other appearances. The cash for personone is the sum of 29+17+75=121.

Answer 1

One option is to match with optional \W instead of \b and force begin and end string:

people = ["personone", "persontwo"] 
pattern = r"^\W?({})\W?$".format("|".join(people))

s = df["person"].str.extract(pattern,expand=False)

df[s.notna()].groupby(['num',s])['cash'].sum()

Output:

num  person   
1    personone    121
     persontwo     81
Name: cash, dtype: int64

Answer 2

what you are asking is not entirely clear. For instance, why are you dropping the case personone and persontwo ?

Anyway, one way to proceed is to create a flag variable that flags the good observations, such as

df['flag'] = df.person.str.contains('personone:?$')

and then you simply sum

df.loc[df.flag == True, 'cash'].sum()

Answer 3

You can use ^ and $ to limit the matches:

>>> people = ["personone", "persontwo"]
>>> patt = fr"^({'|'.join(people)}).?$"
>>> (
      df.groupby(df.person.str.extract(patt, expand=False))
        .agg(cash=('cash', 'sum'), num=('num', 'first'))
        .reset_index().reindex(df.columns, axis=1)
    )
   num     person  cash
0    1  personone   121
1    1  persontwo    81

Filter dataframe for a specific string

Question

3 answers

solution1
1 ACCPTED 2021-03-02 21:51:03

solution2
0 2021-03-02 21:52:33

solution3
0 2021-03-02 22:10:12

Filter dataframe for a specific string

Question

3 answers

solution1 1 ACCPTED 2021-03-02 21:51:03

solution2 0 2021-03-02 21:52:33

solution3 0 2021-03-02 22:10:12

solution1
1 ACCPTED 2021-03-02 21:51:03

solution2
0 2021-03-02 21:52:33

solution3
0 2021-03-02 22:10:12