
Extract the data from the data-frame using pandas

I have the following data-frame.

PredictedFeature  Document_IDs                                   did  avg
2000.0            [160, 384, 3, 217, 324, 11, 232, 41, 377, 48]  11   0.6
2664.0            [160, 384, 3, 217, 324, 294, 13, 11]           13   0.9

The real dataframe has many more rows like these. The did column holds a single document ID for each row.

The Document_IDs column holds an array of IDs. I want to check whether each did value (for example 11) is present in the Document_IDs column, and how many times.

The final output would look like this:

 did   avg  present    
   11   0.6    2
   13   0.9    1

Here 2 means that document ID 11 appears 2 times in the Document_IDs column.

I am totally new to this, so any help would be great.

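ast.literal_eval first converts the string form of each list into a real list. You can then extract the Document_IDs column with DataFrame.pop, flatten its values with chain.from_iterable, and count the matches for each did with a generator expression inside Series.map: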

import ast
from itertools import chain

df['Document_IDs'] = df['Document_IDs'].fillna('[]').apply(ast.literal_eval)

s = list(chain.from_iterable(df.pop('Document_IDs')))

df['pres'] = df['did'].map(lambda x: sum(y == x for y in s))
print (df)
   PredictedFeature  did  avg  pres
0            2000.0   11  0.6     2
1            2664.0   13  0.9     1

Or use collections.Counter to count every ID once and map the counts onto did:

import ast
from itertools import chain
from collections import Counter

df['Document_IDs'] = df['Document_IDs'].fillna('[]').apply(ast.literal_eval)

df['pres'] = df['did'].map(Counter(chain.from_iterable(df.pop('Document_IDs'))))
print (df)
   PredictedFeature  did  avg  pres
0            2000.0   11  0.6     2
1            2664.0   13  0.9     1
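
For reference, here is a minimal end-to-end sketch of the Counter approach that can be run as-is. It rebuilds the two rows from the question and assumes that Document_IDs is stored as strings (which is why ast.literal_eval is needed); if your column already holds lists, skip the parsing step. The variable names parsed and counts are only illustrative.

```python
import ast
from collections import Counter
from itertools import chain

import pandas as pd

# Sample rows copied from the question; Document_IDs is assumed to be stored as strings.
df = pd.DataFrame({
    "PredictedFeature": [2000.0, 2664.0],
    "Document_IDs": [
        "[160, 384, 3, 217, 324, 11, 232, 41, 377, 48]",
        "[160, 384, 3, 217, 324, 294, 13, 11]",
    ],
    "did": [11, 13],
    "avg": [0.6, 0.9],
})

# Parse each string into a real Python list.
parsed = df["Document_IDs"].apply(ast.literal_eval)

# Count every ID across all rows, then look up the count of each did.
counts = Counter(chain.from_iterable(parsed))
df["pres"] = df["did"].map(counts)

print(df[["did", "avg", "pres"]])
```

Because the IDs are counted once up front, each did lookup is a single dictionary access rather than a scan over the flattened list.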

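EDIT: If some values cannot be parsed by literal_eval, use a custom function that returns an empty list for them: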

from ast import literal_eval

def literal_eval_cust(x):
    try:
        return literal_eval(x)
    except Exception:
        return []


df['Document_IDs'] = df['Document_IDs'].apply(literal_eval_cust)
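
As a quick check of how this fallback behaves, the sketch below applies the same function to a small hypothetical Series (the values are made up for illustration) that mixes a valid list string, a missing value and a truncated string. Anything that literal_eval rejects becomes an empty list.

```python
from ast import literal_eval

import pandas as pd


def literal_eval_cust(x):
    # Return the parsed list, or an empty list if the value cannot be parsed.
    try:
        return literal_eval(x)
    except Exception:
        return []


# Hypothetical column: a valid list string, a missing value, and a truncated string.
raw = pd.Series(["[160, 384, 11]", float("nan"), "[13, 11"])
parsed = raw.apply(literal_eval_cust)

print(parsed.tolist())  # [[160, 384, 11], [], []]
```

Note that catching Exception also hides genuinely unexpected values, so it may be worth inspecting the rows that ended up empty.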

Solution using Counter and map

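import collections

# Summing a column of lists concatenates them into one flat list
c = collections.Counter(df.Document_IDs.sum())
# Look up the count of each did in the Counter
df['Present'] = df.did.map(c)

df[['did', 'avg', 'Present']]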

Out[584]:
   did  avg  Present
0  11   0.6  2
1  13   0.9  1
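
Summing a column of lists builds the flat list by repeated concatenation, which may get slow on a large frame. A possible alternative, sketched here under the assumption that Document_IDs already holds real lists and that your pandas version provides Series.explode (0.25 or later, as far as I know), is to explode the column and count the values. The name counts is only illustrative.

```python
import pandas as pd

# Sample rows from the question, with Document_IDs assumed to be real lists.
df = pd.DataFrame({
    "Document_IDs": [
        [160, 384, 3, 217, 324, 11, 232, 41, 377, 48],
        [160, 384, 3, 217, 324, 294, 13, 11],
    ],
    "did": [11, 13],
    "avg": [0.6, 0.9],
})

# One row per ID, then count how often each ID occurs in the whole column.
counts = df["Document_IDs"].explode().value_counts().to_dict()

# IDs that never occur get 0 instead of NaN.
df["Present"] = df["did"].map(lambda d: counts.get(d, 0))

print(df[["did", "avg", "Present"]])
```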

If you want a pandas-native solution that counts did within the Document_IDs list of the same row, try this:

df['pres'] = df.apply(lambda x: list(x['Document_IDs']).count(x['did']), axis=1)

I have not tested for calculation speed.

You can also count instances of an item in a list.

For example mylist.count(item)

So I would create a function to apply this to the rows:

def get_id(row):
    # Count how often this row's did occurs in this row's Document_IDs list
    res = row['Document_IDs'].count(row['did'])
    return res

Then apply it, creating a new result column.

df['result'] = df.apply(get_id, axis=1)

Although I'm sure somebody will come along with a faster version:)
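
One thing to watch: a row-wise list.count only looks at the list on the same row, while the output requested in the question counts did across the whole Document_IDs column. The sketch below runs both on the question's two rows, assuming the column already holds real lists, to show that the results can differ. The names row_wise, totals and column_wide are only illustrative.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Sample rows from the question, with Document_IDs assumed to be real lists.
df = pd.DataFrame({
    "Document_IDs": [
        [160, 384, 3, 217, 324, 11, 232, 41, 377, 48],
        [160, 384, 3, 217, 324, 294, 13, 11],
    ],
    "did": [11, 13],
})

# Row-wise: count did only inside the list on the same row.
row_wise = df.apply(lambda row: row["Document_IDs"].count(row["did"]), axis=1)

# Column-wide: count did across the lists of every row.
totals = Counter(chain.from_iterable(df["Document_IDs"]))
column_wide = df["did"].map(totals)

print(row_wise.tolist())     # [1, 1]
print(column_wide.tolist())  # [2, 1]
```

ID 11 appears once in each row, so the row-wise count is 1 while the column-wide count is 2. Which one you want depends on whether "present" refers to the row or to the whole column.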

Given the following input:

import pandas as pd

df = pd.DataFrame([[[3,4,5,6,3,3,5,4], 3, 0.6], [[1,4,7,8,4,5,1], 4, 0.8]], columns=['Document_IDs','did','avg'])

In one line:

df['Present'] = df.apply(lambda row: row.Document_IDs.count(row.did), axis=1)

If you want to print the results that interest you:

print(df[['did', 'avg', 'Present']])

   did  avg  Present
0    3  0.6        3
1    4  0.8        2


 