How to extract rows with non-zero column values?

Question

Given a TSV file like this:

doc_id/query_id 1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99  100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
1000001 0   0   0   1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
1000002 0   0   0   0   0   0   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

The first row is the header row, with doc_id/query_id as the first column header, followed by the 150 integers in [1, 150].

Each value row is made up of an ID in the first column and zeroes or ones in the other columns.

The goal is to extract pairs of the IDs and the names of the columns where the value is non-zero. For the two rows of data above, that means 1000001 paired with columns 4 and 9, and 1000002 paired with columns 7 and 8.

There are 800,000 rows in the data, so I'd rather avoid pandas and use sframe. I've tried:

import turicreate as tc
from tqdm import tqdm

df = tc.SFrame('data.tsv')

with open('ground_truth.non-zeros.tsv', 'w') as fout:
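    for i in tqdm(range(len(df))):
        row = df[i]  # fetch the row once instead of once per column
        for j in range(1, 151):
            # Column headers are the strings '1'..'150'
            if row[str(j)]:
                # Write the (ID, column name) pair to the output file
                print(row['doc_id/query_id'], j, sep='\t', file=fout)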

Is there a simpler way to extract the non-zero values and the row IDs?

Pandas solutions or other dataframe solutions are appreciated too! Please state any limitations you know of =)

Answer 1

Here's a pandas approach using stack and query:

(df.set_index('doc_id/query_id')
   .stack()
   .to_frame('tmp')
   .query('tmp == 1')
   .index
   .values)

array([(1000001, '4'), (1000001, '9'), (1000002, '7'), (1000002, '8')],
      dtype=object)
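For anyone who wants to try this without the real file, here is a minimal self-contained sketch of the same stack idea. The SAMPLE_TSV string and the nonzero_pairs helper are made up for illustration (a 9-column miniature of the two rows in the question), and the filter uses a boolean mask rather than to_frame plus query, which should select the same entries.

```python
import io

import pandas as pd

# Hypothetical miniature of the question's file: 2 rows, 9 query columns.
SAMPLE_TSV = (
    "doc_id/query_id\t1\t2\t3\t4\t5\t6\t7\t8\t9\n"
    "1000001\t0\t0\t0\t1\t0\t0\t0\t0\t1\n"
    "1000002\t0\t0\t0\t0\t0\t0\t1\t1\t0\n"
)


def nonzero_pairs(df, id_column="doc_id/query_id"):
    """Return (id, column name) tuples for every non-zero cell."""
    # stack() turns the wide frame into one long Series indexed by
    # (id, column name); keeping the non-zero entries leaves the pairs.
    stacked = df.set_index(id_column).stack()
    return stacked[stacked != 0].index.tolist()


df = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t")
pairs = nonzero_pairs(df)
print(pairs)
```

Note that the column names come back as strings, because read_csv keeps header labels as text.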

This is an elegance-first, performance-later approach.

You can also start from numpy if you want maximum performance.

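import numpy as np

# filename is the path to the TSV file; column 0 holds the IDs, columns 1..150 the 0/1 flags.
arr = np.loadtxt(filename, skiprows=1, usecols=np.r_[1:151], dtype=int)
index = np.loadtxt(filename, skiprows=1, usecols=[0], dtype=int)

# Row and column positions of the non-zero entries
r, c = np.where(arr)
# Column position k corresponds to header k + 1
np.column_stack([index[r], c+1])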

array([[1000001,       4],
       [1000001,       9],
       [1000002,       7],
       [1000002,       8]])
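The snippet above parses the file twice, once for the flags and once for the IDs. A possible variant, sketched here under the assumption that the IDs and flags all fit in one integer array, reads everything in a single np.loadtxt call and slices afterwards. SAMPLE_TSV and nonzero_pairs are illustrative names, not part of the original answer, and the code still relies on the headers being the consecutive integers 1..N.

```python
import io

import numpy as np

# Hypothetical miniature of the question's file: 2 rows, 9 query columns.
SAMPLE_TSV = (
    "doc_id/query_id\t1\t2\t3\t4\t5\t6\t7\t8\t9\n"
    "1000001\t0\t0\t0\t1\t0\t0\t0\t0\t1\n"
    "1000002\t0\t0\t0\t0\t0\t0\t1\t1\t0\n"
)


def nonzero_pairs(source):
    """Return an (n, 2) array of [id, column number] for non-zero cells."""
    # One pass over the file; ndmin=2 keeps a single data row 2-D.
    data = np.loadtxt(source, skiprows=1, dtype=int, ndmin=2)
    ids, flags = data[:, 0], data[:, 1:]
    rows, cols = np.nonzero(flags)
    # Assumes the headers are 1..N, so column offset k maps to k + 1.
    return np.column_stack([ids[rows], cols + 1])


result = nonzero_pairs(io.StringIO(SAMPLE_TSV))
print(result)
```

With a real file, passing the path instead of the StringIO object should work the same way.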

Answer 2

Here is one way based on numpy, which I think should slightly speed up the whole process:

t,v=np.where(df.iloc[:,1:]==1)
list(zip(df['doc_id/query_id'].iloc[t],df.columns[v+1]))
Out[135]: [(1000001, '4'), (1000001, '9'), (1000002, '7'), (1000002, '8')]
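Since the question mentions 800,000 rows, the same np.where idea could also be applied chunk by chunk so that only part of the file is held as a DataFrame at any time. This is only a sketch: iter_nonzero_pairs, SAMPLE_TSV, and the default chunk size are assumptions for illustration, and the tiny chunksize=1 below is there just to exercise the chunking on two rows.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical miniature of the question's file: 2 rows, 9 query columns.
SAMPLE_TSV = (
    "doc_id/query_id\t1\t2\t3\t4\t5\t6\t7\t8\t9\n"
    "1000001\t0\t0\t0\t1\t0\t0\t0\t0\t1\n"
    "1000002\t0\t0\t0\t0\t0\t0\t1\t1\t0\n"
)


def iter_nonzero_pairs(source, chunksize=100_000):
    """Yield (id, column name) pairs, reading the TSV in chunks."""
    for chunk in pd.read_csv(source, sep="\t", chunksize=chunksize):
        ids = chunk.iloc[:, 0].to_numpy()
        flags = chunk.iloc[:, 1:].to_numpy()
        names = chunk.columns[1:].to_numpy()
        # Positions of the non-zero cells within this chunk only.
        rows, cols = np.nonzero(flags)
        yield from zip(ids[rows].tolist(), names[cols].tolist())


pairs = list(iter_nonzero_pairs(io.StringIO(SAMPLE_TSV), chunksize=1))
print(pairs)
```

A larger chunk size means fewer pandas calls but more memory per chunk, so the right value depends on the machine.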

Answer 3

A non-pandas answer: you could just iterate over your file and grab the columns where necessary:

results = []

with open('yourfile.csv') as fh:
    headers = next(fh).split()
    for line in fh:
        _id, *line = line.split()
        non_zero = [{_id: header} for header, val in zip(headers[1:], line) if val!="0"]
        results.extend(non_zero)

# Where you now have the option to throw it into whatever data structure you want
results

[{'1000001': '4'}, {'1000001': '9'}, {'1000002': '7'}, {'1000002': '8'}]

This way you don't load the entire file into memory and only grab what you need, though you do pay for the list.extend operation.
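If even the results list is too large to keep around, one option is to turn the loop into a generator and write each pair out as it is found. The sketch below assumes whitespace-separated input like the sample in the question; iter_nonzero_pairs, write_pairs, and SAMPLE_TSV are hypothetical names introduced here.

```python
import io

# Hypothetical miniature of the question's file: 2 rows, 9 query columns.
SAMPLE_TSV = (
    "doc_id/query_id\t1\t2\t3\t4\t5\t6\t7\t8\t9\n"
    "1000001\t0\t0\t0\t1\t0\t0\t0\t0\t1\n"
    "1000002\t0\t0\t0\t0\t0\t0\t1\t1\t0\n"
)


def iter_nonzero_pairs(lines):
    """Yield (id, header) string pairs lazily, one input line at a time."""
    lines = iter(lines)
    # The default "" makes an empty input yield nothing instead of raising.
    headers = next(lines, "").split()[1:]
    for line in lines:
        doc_id, *values = line.split()
        for header, value in zip(headers, values):
            if value != "0":
                yield doc_id, header


def write_pairs(lines, out):
    """Write one 'id<TAB>header' line per pair; return how many were written."""
    count = 0
    for doc_id, header in iter_nonzero_pairs(lines):
        out.write(f"{doc_id}\t{header}\n")
        count += 1
    return count


out = io.StringIO()
written = write_pairs(io.StringIO(SAMPLE_TSV), out)
print(written)
print(out.getvalue(), end="")
```

For real data, the two StringIO objects would be replaced by file handles opened on the input and output paths.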

3 answers

Answer 1 2 2019-06-04 00:56:31

Answer 2 2 2019-06-04 01:01:57

Answer 3 1 2019-06-04 01:09:57
