Multiplying pandas DataFrames as a mask

Question

I have 2 pandas DataFrames, as shown below:

occurrences

doc    0    1    2    ...    1809(=n)
  0    0    0    1    ...       1
  1    0    0    1    ...       0
  2    0    0    1    ...       0
  ..  ..    ..   ..   ...       .
  m   ......................... 0

dictionary

id    term
 0     foo
 1     bar
 2     lorem
 ..    ..
 n     ipsum

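What I want to do is, for each row of "occurrences", retrieve the terms whose cell value is 1, looked up by id (i.e. the column header in the first DataFrame). In my example, for the first row of occurrences, I would get: ['lorem','ipsum']

Thanks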

Answer 1

Here is one idea using np.where:

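import numpy as np
import pandas as pd

occurrences = pd.DataFrame([[0, 0, 1, 1], [0, 1, 0, 1], [1, 0, 1, 0]])
dictionary = pd.DataFrame({'term': ['foo', 'bar', 'lorem', 'ipsum']})

# idx[0] holds the row positions and idx[1] the column positions of the non-zero cells
idx = np.where(occurrences)
(pd.Series(dictionary.values[idx[1]].ravel())
   .groupby(idx[0]).agg(list)
)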

Output:

0    [lorem, ipsum]
1      [bar, ipsum]
2      [foo, lorem]
dtype: object
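
To see why this works, it can help to look at the intermediate values. Called with a single argument, np.where returns one index array per dimension for the non-zero cells, in row-major order. The following is a minimal sketch that reuses the sample data above and names the two index arrays `rows` and `cols` (names chosen here for illustration):

```python
import numpy as np
import pandas as pd

occurrences = pd.DataFrame([[0, 0, 1, 1], [0, 1, 0, 1], [1, 0, 1, 0]])
dictionary = pd.DataFrame({'term': ['foo', 'bar', 'lorem', 'ipsum']})

# Row and column positions of every non-zero cell
rows, cols = np.where(occurrences.to_numpy())
# rows -> [0 0 1 1 2 2]
# cols -> [2 3 1 3 0 2]

# Look up the term at each column position
terms = dictionary['term'].to_numpy()[cols]

# Collect the looked-up terms by the row they came from
result = pd.Series(terms).groupby(rows).agg(list)
```

Note that a row containing no 1 at all contributes nothing to `rows`, so it would simply be missing from the grouped result rather than appearing as an empty list.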

Answer 2

After some attempts, I got it working this way (maybe not so elegant..)

def get_vocabulary(occurrences, dictionary):
    vocabulary = []  # one list of terms per document
    for index, row in occurrences.iterrows():
        # iterate on each row == each document
        doc = row.values.tolist() # convert row to list
        ngrams = []
        for i in range(len(doc)): # for each element
            if doc[i] != 0:
                # column 1 of dictionary holds the term; i is the positional index
                ngrams.append(dictionary.iloc[i, 1])
        vocabulary.append(ngrams)
    return vocabulary

The output for a single document looks like:

['scheduling', 'distributed', 'deadline', .... , 'rate monotonic scheduling algorithm']
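
The inner loop over cells can also be written as a boolean mask per row. This is only a sketch under stated assumptions: the dictionary is assumed to have a `term` column whose positional order matches the column order of the occurrence matrix, `terms_per_document` is a hypothetical helper name, and the data is the small sample from Answer 1 rather than the asker's real data.

```python
import pandas as pd

def terms_per_document(occurrences, dictionary):
    """Return one list of terms per row of the occurrence matrix."""
    # Assumption: position i in the 'term' column matches column i of occurrences
    terms = dictionary['term'].to_numpy()
    # row != 0 is a boolean mask that selects the terms present in that document
    return [terms[row != 0].tolist() for row in occurrences.to_numpy()]

occurrences = pd.DataFrame([[0, 0, 1, 1], [0, 1, 0, 1], [1, 0, 1, 0]])
dictionary = pd.DataFrame({'id': [0, 1, 2, 3],
                           'term': ['foo', 'bar', 'lorem', 'ipsum']})

vocabulary = terms_per_document(occurrences, dictionary)
```

With this approach a document with no matching terms yields an empty list, so the result always has one entry per row.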

Answer 1
1 2020-02-26 20:14:02

Answer 2
0 Accepted 2020-02-26 22:31:14
