Extracting data from a DataFrame using pandas

Question

I have the following DataFrame.

PredictedFeature    Document_IDs                                     did  avg
2000.0              [160, 384, 3, 217, 324, 11, 232, 41, 377, 48]    11   0.6
2664.0              [160, 384, 3, 217, 324, 294, 13, 11]             13   0.9

The real DataFrame has many more rows like these. The did column holds a document ID.

There is one more column, Document_IDs, which holds a list of IDs. I want to check whether the ID from did (for example 11) is present in the Document_IDs column, and count how many times it occurs.

The final output would look like this:

 did   avg  present    
   11   0.6    2
   13   0.9    1

Here 2 means that document ID 11 appears 2 times in the Document_IDs column.

I am totally new to this, so any help would be great.

Answer 1

You can extract the Document_IDs column with DataFrame.pop, flatten its values with chain.from_iterable, and then count the matches for each did by summing over a generator inside map:

import ast
from itertools import chain

df['Document_IDs'] = df['Document_IDs'].fillna('[]').apply(ast.literal_eval)

s = list(chain.from_iterable(df.pop('Document_IDs')))

df['pres'] = df['did'].map(lambda x: sum(y == x for y in s))
print (df)
   PredictedFeature  did  avg  pres
0            2000.0   11  0.6     2
1            2664.0   13  0.9     1

Or:

import ast
from itertools import chain
from collections import Counter

df['Document_IDs'] = df['Document_IDs'].fillna('[]').apply(ast.literal_eval)

df['pres'] = df['did'].map(Counter(chain.from_iterable(df.pop('Document_IDs'))))
print (df)
   PredictedFeature  did  avg  pres
0            2000.0   11  0.6     2
1            2664.0   13  0.9     1

EDIT: If some values cannot be parsed by literal_eval, fall back to an empty list:

from ast import literal_eval

def literal_eval_cust(x):
    try:
        return literal_eval(x)
    except Exception:
        return []


df['Document_IDs'] = df['Document_IDs'].apply(literal_eval_cust)
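
For reference, the steps above can be put together as one self-contained sketch. It assumes that Document_IDs is stored as strings such as "[160, 384]" (which is why ast.literal_eval is needed); the sample frame is rebuilt from the question, and the helper name parse_ids is made up for illustration.

```python
import ast
from collections import Counter
from itertools import chain

import pandas as pd

# Sample data modelled on the question; storing the lists as strings is an assumption.
df = pd.DataFrame({
    "PredictedFeature": [2000.0, 2664.0],
    "Document_IDs": [
        "[160, 384, 3, 217, 324, 11, 232, 41, 377, 48]",
        "[160, 384, 3, 217, 324, 294, 13, 11]",
    ],
    "did": [11, 13],
    "avg": [0.6, 0.9],
})


def parse_ids(value):
    """Parse a string like "[1, 2]" into a list; return [] for anything else."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


df["Document_IDs"] = df["Document_IDs"].apply(parse_ids)

# Count every ID across the whole column, then look up each row's did.
counts = Counter(chain.from_iterable(df["Document_IDs"]))
df["pres"] = df["did"].map(counts)

print(df[["did", "avg", "pres"]])
```

Because Counter returns 0 for missing keys, a did that never occurs in Document_IDs should come out as 0 rather than NaN.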

Answer 2

Solution using Counter and map

import collections
c = collections.Counter(df.Document_IDs.sum())    
df['Present'] = df.did.map(c)

df[['did', 'avg', 'Present']]

Out[584]:
   did  avg  Present
0  11   0.6  2
1  13   0.9  1
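
One caveat: calling sum() on a column of lists concatenates them one by one, which can become slow when there are many rows. A possible alternative, not part of the answer above, is Series.explode followed by value_counts. The sketch below assumes pandas 0.25 or later and that Document_IDs already holds real Python lists; the sample frame is rebuilt from the question.

```python
import pandas as pd

# Sample data modelled on the question; Document_IDs holds real lists here.
df = pd.DataFrame({
    "Document_IDs": [
        [160, 384, 3, 217, 324, 11, 232, 41, 377, 48],
        [160, 384, 3, 217, 324, 294, 13, 11],
    ],
    "did": [11, 13],
    "avg": [0.6, 0.9],
})

# One row per list element; empty lists become NaN, so drop those first.
ids = df["Document_IDs"].explode().dropna().astype("int64")

# Occurrences of every ID across the whole column.
counts = ids.value_counts()

# IDs that never occur map to NaN, so fill with 0 before casting to int.
df["Present"] = df["did"].map(counts).fillna(0).astype(int)

print(df[["did", "avg", "Present"]])
```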

Answer 3

If you want to use a pandas native solution, try this:

df['pres'] = df.apply(lambda x: list(x['Document_IDs']).count(x['did']), axis=1)

I have not tested the calculation speed.
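
Note that a row-wise count only looks at the list in the same row, whereas the expected output in the question counts across the whole column. The sketch below compares the two on the question's sample data, assuming Document_IDs already holds Python lists; the variable names per_row and column_wide are just for illustration.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Sample data modelled on the question.
df = pd.DataFrame({
    "Document_IDs": [
        [160, 384, 3, 217, 324, 11, 232, 41, 377, 48],
        [160, 384, 3, 217, 324, 294, 13, 11],
    ],
    "did": [11, 13],
    "avg": [0.6, 0.9],
})

# Row-wise: count did only within the same row's list.
per_row = df.apply(lambda row: list(row["Document_IDs"]).count(row["did"]), axis=1)

# Column-wide: count did across every list in the column.
column_wide = df["did"].map(Counter(chain.from_iterable(df["Document_IDs"])))

print(per_row.tolist())      # [1, 1]
print(column_wide.tolist())  # [2, 1]
```

Which of the two is wanted depends on the data; only the column-wide count reproduces the 2 shown for document ID 11 in the question.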

Answer 4

You can also count the occurrences of an item in a list.

For example, mylist.count(item).
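
A quick illustration of list.count on one of the lists from the question (the variable name ids is arbitrary):

```python
ids = [160, 384, 3, 217, 324, 294, 13, 11]

print(ids.count(13))  # 1
print(ids.count(99))  # 0, because 99 is not in the list
```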

So I would create a function that applies this to each row:

def get_id(row):
    # Count how often this row's did appears in its own Document_IDs list
    res = row['Document_IDs'].count(row['did'])
    return res

Then apply it, creating a new result column.

df['result'] = df.apply(get_id,axis=1)

Although I'm sure somebody will come along with a faster version :)

Answer 5

Given the following input:

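import pandas as pd

# The avg column is included because the printed result below selects it
df = pd.DataFrame([[[3,4,5,6,3,3,5,4], 3, 0.6], [[1,4,7,8,4,5,1], 4, 0.8]], columns=['Document_IDs','did','avg'])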

In one line:

df['Present'] = df.apply(lambda row: row.Document_IDs.count(row.did), axis=1)

If you want to print the results that interest you:

print(df[['did', 'avg', 'Present']])

   did  avg  Present
0    3  0.6        3
1    4  0.8        2
