How to extract rows with non-zero column values?

Given a TSV file like this:

doc_id/query_id 1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99  100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
1000001 0   0   0   1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
1000002 0   0   0   0   0   0   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

The first row is the header row, with doc_id/query_id as the first column header, followed by the 150 integers from 1 to 150 as the remaining column headers.

Each value row is made up of an ID in the first column and zeros or ones in the other columns.

The goal is to extract pairs of the IDs and the names of the columns where the value is non-zero. For example, given the two rows of data above, the desired output is:

1000001 4
1000001 9
1000002 7
1000002 8

There are 800,000 rows in the data, so I'll avoid pandas and use sframe. I've tried:

import turicreate as tc
from tqdm import tqdm

df = tc.SFrame('data.tsv')

with open('ground_truth.non-zeros.tsv', 'w') as fout:
    for i in tqdm(range(len(df))):
        for j in range(1,151):
            if df[i][str(j)]:
                print(df[i]['doc_id/query_id'], j, sep='\t', file=fout)

Is there a simpler way to extract the non-zero values and the row IDs?

Pandas solutions or other dataframe solutions are appreciated too! Please state the limitations, if any are known =)

Here's a pandaic approach using stack and query:

(df.set_index('doc_id/query_id')
   .stack()
   .to_frame('tmp')
   .query('tmp == 1')
   .index
   .values)

array([(1000001, '4'), (1000001, '9'), (1000002, '7'), (1000002, '8')],
      dtype=object)

This is an elegance-first, performance-later approach.
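For reference, the same stack idea can be run end to end on a miniature of the file. This is only a minimal sketch: the sample string is a made-up 10-column version of the data above (the real file has 150 columns), and a boolean mask stands in for the to_frame/query step.

```python
import io

import pandas as pd

# Assumption: a shortened, tab-separated stand-in for the real file.
sample = (
    "doc_id/query_id\t1\t2\t3\t4\t5\t6\t7\t8\t9\t10\n"
    "1000001\t0\t0\t0\t1\t0\t0\t0\t0\t1\t0\n"
    "1000002\t0\t0\t0\t0\t0\t0\t1\t1\t0\t0\n"
)

df = pd.read_csv(io.StringIO(sample), sep="\t")

# One row per (doc_id, column name) cell, then keep the non-zero cells.
stacked = df.set_index("doc_id/query_id").stack()
nonzero = stacked[stacked != 0]

# The MultiIndex of the surviving cells holds the wanted pairs.
pairs = nonzero.index.tolist()
print(pairs)
```

Note that stack materialises one entry per cell before filtering, so on the full data that is roughly 800,000 x 150 entries in memory at once.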


You can also start with numpy; this is for maximum performance.

import numpy as np

arr = np.loadtxt(filename, skiprows=1, usecols=np.r_[1:151], dtype=int)
index = np.loadtxt(filename, skiprows=1, usecols=[0], dtype=int)

r, c = np.where(arr)
np.column_stack([index[r], c+1])

array([[1000001,       4],
       [1000001,       9],
       [1000002,       7],
       [1000002,       8]])
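Since every column in the file is an integer, the two loadtxt calls could arguably be folded into a single read. Below is a minimal sketch under that assumption, run on a made-up 10-column miniature of the data. Adding 1 to the column position only recovers the column name because the headers are the consecutive integers starting at 1. Whether this is actually faster on the 800,000-row file is not measured here.

```python
import io

import numpy as np

# Assumption: a shortened, tab-separated stand-in for the real file.
sample = (
    "doc_id/query_id\t1\t2\t3\t4\t5\t6\t7\t8\t9\t10\n"
    "1000001\t0\t0\t0\t1\t0\t0\t0\t0\t1\t0\n"
    "1000002\t0\t0\t0\t0\t0\t0\t1\t1\t0\t0\n"
)

# Read everything once; loadtxt splits on any whitespace by default.
data = np.loadtxt(io.StringIO(sample), skiprows=1, dtype=int)
ids, flags = data[:, 0], data[:, 1:]

# Row and column positions of the non-zero cells, in row-major order.
rows, cols = np.nonzero(flags)
pairs = np.column_stack([ids[rows], cols + 1])
print(pairs)
```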

Here is one way based on numpy, which I think should slightly speed up the whole process:

t, v = np.where(df.iloc[:, 1:] == 1)
list(zip(df['doc_id/query_id'].iloc[t], df.columns[v + 1]))

[(1000001, '4'), (1000001, '9'), (1000002, '7'), (1000002, '8')]
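If holding all 800,000 rows in one DataFrame is the concern, the same np.where idea can be applied chunk by chunk through the chunksize parameter of read_csv. This is a sketch only: nonzero_pairs is a hypothetical helper name, the chunk size is an arbitrary illustrative value, and the sample is a made-up 10-column miniature of the data.

```python
import io

import numpy as np
import pandas as pd


def nonzero_pairs(source, chunksize=100_000):
    """Yield (doc_id, column_name) pairs, reading the TSV in chunks."""
    for chunk in pd.read_csv(source, sep="\t", chunksize=chunksize):
        ids = chunk.iloc[:, 0].to_numpy()
        values = chunk.iloc[:, 1:].to_numpy()
        # Positions of the non-zero cells within this chunk only.
        rows, cols = np.nonzero(values)
        names = chunk.columns[1:][cols]
        yield from zip(ids[rows].tolist(), names.tolist())


# Assumption: a shortened, tab-separated stand-in for the real file.
sample = (
    "doc_id/query_id\t1\t2\t3\t4\t5\t6\t7\t8\t9\t10\n"
    "1000001\t0\t0\t0\t1\t0\t0\t0\t0\t1\t0\n"
    "1000002\t0\t0\t0\t0\t0\t0\t1\t1\t0\t0\n"
)

# chunksize=1 forces two chunks here, just to exercise the loop.
result = list(nonzero_pairs(io.StringIO(sample), chunksize=1))
print(result)
```

Only one chunk of rows is held as a DataFrame at any time, at the cost of some per-chunk overhead.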

A non-pandas answer: you could just iterate over your file and grab the columns where necessary:

results = []

with open('yourfile.csv') as fh:
    headers = next(fh).split()
    for line in fh:
        _id, *line = line.split()
        non_zero = [{_id: header} for header, val in zip(headers[1:], line) if val!="0"]
        results.extend(non_zero)

# Where you now have the option to throw it into whatever data structure you want
results

[{'1000001': '4'}, {'1000001': '9'}, {'1000002': '7'}, {'1000002': '8'}]

This way you don't load the entire file into memory; you only grab what you need, though you do pay for the list.extend operation.
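If the final list isn't needed, the pairs could instead be written out as they are found, so nothing accumulates in memory. Here is a sketch of that variation: iter_nonzero is a hypothetical helper name, and io.StringIO objects stand in for the real input and output files.

```python
import io


def iter_nonzero(lines):
    """Yield (doc_id, column_name) for every non-zero cell, line by line."""
    lines = iter(lines)
    headers = next(lines).split()[1:]  # drop the doc_id/query_id label
    for line in lines:
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing newline
        doc_id, *values = line.split()
        for header, value in zip(headers, values):
            if value != "0":
                yield doc_id, header


# Assumption: a shortened, tab-separated stand-in for the real file.
sample = io.StringIO(
    "doc_id/query_id\t1\t2\t3\t4\t5\t6\t7\t8\t9\t10\n"
    "1000001\t0\t0\t0\t1\t0\t0\t0\t0\t1\t0\n"
    "1000002\t0\t0\t0\t0\t0\t0\t1\t1\t0\t0\n"
)

out = io.StringIO()  # stand-in for open('ground_truth.non-zeros.tsv', 'w')
for doc_id, header in iter_nonzero(sample):
    out.write(f"{doc_id}\t{header}\n")

print(out.getvalue(), end="")
```

As in the loop above, the IDs and column names stay strings here because nothing is parsed to int.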

