pandas number of times a string occurs in one column based on another column

I have a very large DataFrame of cars, like this:

                                Text                               Terms
0       Car's model porche year in data                [tech, window, tech]
1  we’re simply making fossil fuel cars       [brakes, engine, Italy, nice]
2          Year of cars Ferrari to make    [Detroit, window, seats, engine]
3     reading the specs of Ferrari file  [tech, window, engine, v8, window]
4     likelihood Porche in the car list                 [from, wheel, tech]

And these definitions:

term_list = ['tech', 'engine', 'window']
cap_list = ['Ferrari', 'porche']
term_cap_dict = {'Ferrari': ['engine', 'window'], 'Porche': ['tech']}

I want a resulting DataFrame that counts how many times each term in term_list occurs in the 'Terms' column, counting an occurrence only when the 'Text' column of the same row contains the corresponding key from term_cap_dict. For example, the conditional count of the term 'tech' given Porche is 3, because only the rows whose 'Text' contains 'Porche' are counted, even though 'tech' appears 4 times in total. If the count is 0 or the conditional text is absent, the conditional count defaults to 0. The desired output:

    Terms      Cap  ConditionalCount
0  engine  Ferrari                 2
1  engine   porche                 0
2    tech  Ferrari                 0
3    tech   porche                 3
4  window  Ferrari                 3
5  window   porche                 1

Here is what I have so far (it computes only TotalCount, not the conditional count):

from collections import Counter
from itertools import chain, product

import pandas as pd

# Normalize the dictionary keys and values to lowercase.
term_cap_dict = {k.lower(): list(map(str.lower, v)) for k, v in term_cap_dict.items()}
# Count every term across all rows of the 'Terms' column.
terms_counter = Counter(chain.from_iterable(df['Terms']))
terms_series = pd.Series(terms_counter)
terms_df = pd.DataFrame({'Term': terms_series.index, 'TotalCount': terms_series.values})
df1 = terms_df[terms_df['Term'].isin(term_list)]
# Build every (term, capability) pair and attach the total counts.
product_terms = product(term_list, cap_list)
df_cp = pd.DataFrame(product_terms, columns=['Terms', 'Capability'])
dff = df_cp.set_index('Terms').combine_first(df1.set_index('Term')).reset_index()
dff.rename(columns={'index': 'Terms'}, inplace=True)

which gives TotalCount:

    Terms Capability  TotalCount
0  engine    Ferrari         3.0
1  engine     porche         3.0
2    tech    Ferrari         4.0
3    tech     porche         4.0
4  window    Ferrari         4.0
5  window     porche         4.0

From this point onwards, I do not know how to compute ConditionalCount. Any suggestion is appreciated.

df.to_dict()

{'Text': {0: "Car's model porche year in data",
  1: 'we’re simply making fossil fuel cars',
  2: 'Year of cars Ferrari to make',
  3: 'reading the specs of Ferrari file',
  4: 'likelihood Porche in the car list'},
 'Terms': {0: ['tech', 'window', 'tech'],
  1: ['brakes', 'engine', 'Italy', 'nice'],
  2: ['Detroit', 'window', 'seats', 'engine'],
  3: ['tech', 'window', 'engine', 'v8', 'window'],
  4: ['from', 'wheel', 'tech']}}

Update (counting only the term/capability pairs listed in term_cap_dict):

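import pandas as pd

# One row per (Text, term) pair.
df1 = df.explode(column='Terms')

# Tag each row with the capability found in its text (case-sensitive match).
regcap = '|'.join(cap_list)
df1['Cap'] = df1['Text'].str.extract(f'({regcap})')
# Allowed (Cap, Terms) pairs taken from term_cap_dict, lowercased.
filter_df = pd.concat([pd.DataFrame({'Cap': cap, 'Terms': terms}) for cap, terms in term_cap_dict.items()])
filter_df = filter_df.apply(lambda x: x.str.lower())

df1 = df1.apply(lambda x: x.str.lower())
# The inner merge keeps only rows whose (Cap, Terms) pair is allowed.
df1_filt = df1.merge(filter_df)
# Reindex on every (term, capability) pair so that missing pairs show up as 0.
idx = pd.MultiIndex.from_product([term_list, list(map(str.lower, cap_list))], names=['Term','Cap'])
df_out = df1_filt[df1_filt['Terms'].isin(term_list)].groupby(['Terms','Cap']).count()\
                                       .rename(columns= {'Text':'Count'})\
                                       .reindex(idx, fill_value=0).reset_index()
print(df_out)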

Output:

     Term      Cap  Count
0    tech  ferrari      0
1    tech   porche      2
2  engine  ferrari      2
3  engine   porche      0
4  window  ferrari      3
5  window   porche      0
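
A note on the output above: the regular expression built from cap_list is case-sensitive, so the row whose text spells 'Porche' with a capital P is not matched, which is presumably why tech/porche comes out as 2 rather than the 3 the question asks for. Below is a minimal sketch of a case-insensitive variant. It assumes the sample data from the question's df.to_dict() output, with the first column named 'Text'. The names long_df, allowed, matched and cond_counts are illustrative and not part of the original answer.

```python
import re

import pandas as pd

# Sample data assumed from the question's df.to_dict() output.
df = pd.DataFrame({
    'Text': ["Car's model porche year in data",
             'we’re simply making fossil fuel cars',
             'Year of cars Ferrari to make',
             'reading the specs of Ferrari file',
             'likelihood Porche in the car list'],
    'Terms': [['tech', 'window', 'tech'],
              ['brakes', 'engine', 'Italy', 'nice'],
              ['Detroit', 'window', 'seats', 'engine'],
              ['tech', 'window', 'engine', 'v8', 'window'],
              ['from', 'wheel', 'tech']],
})
term_list = ['tech', 'engine', 'window']
cap_list = ['Ferrari', 'porche']
term_cap_dict = {'Ferrari': ['engine', 'window'], 'Porche': ['tech']}

# Lowercase the capabilities and build one capture group from them.
caps = [c.lower() for c in cap_list]
pattern = '(' + '|'.join(map(re.escape, caps)) + ')'

# One row per (Text, term) pair; lowercase BEFORE extracting the capability
# so that 'Porche' and 'porche' are treated the same.
long_df = df.explode('Terms').reset_index(drop=True)
long_df['Terms'] = long_df['Terms'].str.lower()
long_df['Cap'] = long_df['Text'].str.lower().str.extract(pattern, expand=False)

# Allowed (Cap, Terms) pairs from term_cap_dict, lowercased.
allowed = pd.DataFrame(
    [(cap.lower(), term.lower())
     for cap, terms in term_cap_dict.items()
     for term in terms],
    columns=['Cap', 'Terms'],
)

# Keep only allowed pairs, count them, and fill missing pairs with 0.
matched = long_df.merge(allowed, on=['Cap', 'Terms'])
idx = pd.MultiIndex.from_product([term_list, caps], names=['Terms', 'Cap'])
cond_counts = (matched.groupby(['Terms', 'Cap']).size()
               .reindex(idx, fill_value=0)
               .reset_index(name='ConditionalCount'))
print(cond_counts)
```

On this sample data the sketch should report 3 for tech/porche, 3 for window/ferrari, 2 for engine/ferrari, and 0 for the remaining pairs. window/porche stays at 0 because term_cap_dict does not list 'window' under Porche.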

IIUC, try this:

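# One row per (Text, term) pair.
df1 = df.explode(column='Terms')

# Tag each row with the capability found in its text (case-sensitive match).
regcap = '|'.join(cap_list)
df1['Cap'] = df1['Text'].str.extract(f'({regcap})')

# Count every (term, capability) pair, without restricting to term_cap_dict,
# and reindex so that missing pairs show up as 0.
idx = pd.MultiIndex.from_product([term_list, cap_list], names=['Term','Cap'])
df_out = df1[df1['Terms'].isin(term_list)].groupby(['Terms','Cap']).count()\
                                          .rename(columns= {'Text':'Count'})\
                                          .reindex(idx, fill_value=0).reset_index()
print(df_out)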

Output:

     Term      Cap  Count
0    tech  Ferrari      1
1    tech   porche      2
2  engine  Ferrari      2
3  engine   porche      0
4  window  Ferrari      3
5  window   porche      1
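
As a sanity check on the counts above, the same tally can be reproduced without pandas. This is only a sketch under the assumption that the sample data matches the question's df.to_dict() output. The names rows, cap_pattern, wanted_terms and pair_counts are illustrative. Like the answer's regex, the match is case-sensitive, so the 'Porche' row is skipped.

```python
import re
from collections import Counter

# (Text, Terms) pairs assumed from the question's df.to_dict() output.
rows = [
    ("Car's model porche year in data", ['tech', 'window', 'tech']),
    ('we’re simply making fossil fuel cars', ['brakes', 'engine', 'Italy', 'nice']),
    ('Year of cars Ferrari to make', ['Detroit', 'window', 'seats', 'engine']),
    ('reading the specs of Ferrari file', ['tech', 'window', 'engine', 'v8', 'window']),
    ('likelihood Porche in the car list', ['from', 'wheel', 'tech']),
]
term_list = ['tech', 'engine', 'window']
cap_list = ['Ferrari', 'porche']

# Same alternation the answer builds, escaped for safety.
cap_pattern = re.compile('|'.join(map(re.escape, cap_list)))
wanted_terms = set(term_list)

pair_counts = Counter()
for text, terms in rows:
    match = cap_pattern.search(text)
    if match is None:
        continue  # no capability mentioned in this row's text
    cap = match.group(0)
    for term in terms:
        if term in wanted_terms:
            pair_counts[(term, cap)] += 1

# A Counter returns 0 for pairs that never occurred.
for term in term_list:
    for cap in cap_list:
        print(term, cap, pair_counts[(term, cap)])
```

The printed counts should line up with the table above, for instance 1 for tech/Ferrari and 1 for window/porche.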
