Python: Parse a CSV column as row and column indices of an np.array?

Question

My CSV file is formatted as such:

Id,Prediction
r1_c1,1
r3_c1,3
...

When I read the CSV file as such:

df = pd.read_csv('data/data_train.csv', delimiter=',')

I can get a matrix of size (N,1), where N is the number of values present in my input CSV file. Note that there are some missing values in my input CSV file, so I cannot do a simple np.reshape.

Is there a function or procedure within pandas or np that fills a matrix A such that A[i][j] = v_ij, where v_ij is the value whose associated 'Id' equals ri_cj?

One could evidently do it with a for loop, but when the input CSV file is rather large, one would want to leverage the parallelism/vectorization implemented in numpy, for example. I couldn't describe my problem with keywords, so apologies if I failed to find the associated documentation.

Answer 1

First, when reading the CSV file with pd.read_csv, read the Prediction column with a pre-defined dtype (np.int32 in this example) to make it more efficient.
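As a minimal sketch of that first step: the dtype can be passed to pd.read_csv as a per-column mapping. The in-memory csv_text below is only a stand-in for 'data/data_train.csv' so the snippet runs on its own.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for 'data/data_train.csv' (assumption: same two-column layout).
csv_text = "Id,Prediction\nr1_c1,1\nr3_c1,3\n"

# Declare the Prediction dtype up front instead of letting pandas infer int64.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Prediction": np.int32})

print(df.dtypes)
```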

Then run the following code:

# Parse Id column string values as integer pairs of output array indices.
indices = np.empty((len(df), 2), np.int32)
for i, id_string in enumerate(df.Id):
    id_parts = id_string.split('_')
    indices[i, :] = int(id_parts[0][1:]), int(id_parts[1][1:])

# Infer shape of output array (OPTIONAL: used if output shape is not given).
n, m = np.max(indices, axis=0) + 1

# Create predictions output array with the same dtype.
out = np.zeros((n, m), df.dtypes.Prediction)

# Assign Prediction column values.
out[indices[:, 0], indices[:, 1]] = df.Prediction

Small test data:

      Id  Prediction
0  r0_c2           5
1  r1_c1           8
2  r2_c0           7

Output (without specifying the output shape (n, m)):

[[0 0 5]
 [0 8 0]
 [7 0 0]]
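Note that cells with no entry in the CSV come out as 0 here, which cannot be told apart from a real prediction of 0. If that matters, one option (my own assumption, not something the question asks for) is a float array pre-filled with NaN. The rows, cols and values arrays below are hypothetical stand-ins for the parsed indices and the Prediction column.

```python
import numpy as np

# Hypothetical parsed indices and values for the small test data.
rows = np.array([0, 1, 2])
cols = np.array([2, 1, 0])
values = np.array([5, 8, 7], dtype=np.int32)

# NaN needs a float dtype, so the integer predictions are upcast.
out = np.full((3, 3), np.nan, dtype=np.float64)
out[rows, cols] = values

# Boolean mask of the cells that had no entry in the CSV.
missing = np.isnan(out)
print(out)
```

The trade-off is memory: float64 takes twice the space of int32 per cell.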

I also timed it with the %%timeit magic on 900000 rows of data with an output shape of (1000, 1000); here are the results:

1 loop, best of 3: 1.61 s per loop
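Since the question asks about vectorization, here is a sketch of an alternative that replaces the Python loop with pandas' Series.str.extract. I have not timed it, so treat any speed-up as an assumption to verify on your own data. It also assumes every Id matches the pattern r<number>_c<number>.

```python
import numpy as np
import pandas as pd

# Same small test data as above.
df = pd.DataFrame({
    "Id": ["r0_c2", "r1_c1", "r2_c0"],
    "Prediction": np.array([5, 8, 7], dtype=np.int32),
})

# Pull both numbers out of every Id in one call; the two capture
# groups become columns 0 (row index) and 1 (column index).
rc = df["Id"].str.extract(r"^r(\d+)_c(\d+)$").astype(np.int64)
rows = rc[0].to_numpy()
cols = rc[1].to_numpy()
# If your Ids are 1-based (e.g. r1_c1 is the first cell), subtract 1 here.

# Infer the output shape from the largest indices, then scatter the values.
n, m = rows.max() + 1, cols.max() + 1
out = np.zeros((n, m), dtype=df["Prediction"].dtype)
out[rows, cols] = df["Prediction"].to_numpy()

print(out)
```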

I hope this answers your question, good luck!
