Python: Parse a CSV column as a row and column index for a np.array?

My CSV file is formatted as follows:

Id,Prediction
r1_c1,1
r3_c1,3
...

When I read the CSV file like this:

import pandas as pd

df = pd.read_csv('data/data_train.csv', delimiter=',')

I get a matrix of size (N,1), where N is the number of values present in my input CSV file. Note that some values are missing from the input CSV file, so I cannot simply use np.reshape.

Is there a fancy function or procedure in pandas or np that fills a matrix A such that A[i][j] = v_ij, where v_ij is the value whose associated 'Id' is ri_cj?

One could evidently do it with a for loop, but when the input CSV file is rather large, one would want to leverage the parallelism/vectorization implemented in numpy, for example. I couldn't describe my problem with keywords, so apologies if I failed to find the relevant documentation.

First, when reading the CSV file with pd.read_csv, read the Prediction column with a pre-defined dtype (np.int32 in this example) to make it more efficient.
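As a minimal sketch of that step, the `dtype` argument of `pd.read_csv` accepts a mapping from column name to type. The in-memory CSV text below is an assumption made only to keep the example self-contained; the question reads from 'data/data_train.csv' instead.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for 'data/data_train.csv' so the sketch runs on its own.
csv_text = "Id,Prediction\nr0_c2,5\nr1_c1,8\nr2_c0,7\n"

# Read the Prediction column directly as 32-bit integers.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Prediction": np.int32})
print(df.dtypes)
```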

Then run the following code:

import numpy as np

# Parse Id column string values as integer pairs of output array indices.
indices = np.empty((len(df), 2), np.int32)
for i, id_string in enumerate(df.Id):
    id_parts = id_string.split('_')
    indices[i, :] = int(id_parts[0][1:]), int(id_parts[1][1:])

# Infer shape of output array (OPTIONAL: used if output shape is not given).
n, m = np.max(indices, axis=0) + 1

# Create predictions output array with the same dtype.
out = np.zeros((n, m), df.dtypes.Prediction)

# Assign Prediction column values.
out[indices[:, 0], indices[:, 1]] = df.Prediction
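The Python-level loop over df.Id could possibly be replaced by pandas string methods. The following is only a sketch under assumptions, not part of the original answer: the sample frame is hypothetical, and the regular expression assumes every Id has exactly the form r<number>_c<number>. Whether this is actually faster than the loop would need to be measured on real data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame mirroring the small test data in this answer.
df = pd.DataFrame({
    "Id": ["r0_c2", "r1_c1", "r2_c0"],
    "Prediction": np.array([5, 8, 7], dtype=np.int32),
})

# Extract the row and column numbers from every Id in one call.
parts = df["Id"].str.extract(r"r(\d+)_c(\d+)").astype(np.int64)
rows = parts[0].to_numpy()
cols = parts[1].to_numpy()

# Infer the output shape from the largest indices (0-based Ids assumed here).
out = np.zeros((rows.max() + 1, cols.max() + 1), dtype=df["Prediction"].dtype)

# Assign all Prediction values at once with integer array indexing.
out[rows, cols] = df["Prediction"].to_numpy()
print(out)
```

If the Ids are 1-based, as r1_c1 in the question may suggest, subtracting 1 from rows and cols before indexing would avoid an unused first row and column.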

Small test data:

      Id  Prediction
0  r0_c2           5
1  r1_c1           8
2  r2_c0           7

Output (without specifying the output shape (n, m)):

[[0 0 5]
 [0 8 0]
 [7 0 0]]
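In the output above, cells with no matching Id are filled with 0, so a missing value cannot be told apart from a real prediction of 0. Since the question mentions missing values, one possible variation (a sketch, not the original answer's code) is to start from NaN instead, at the cost of switching to a floating-point dtype. The index pairs and values below are hypothetical and match the small test data.

```python
import numpy as np

# Hypothetical index pairs and values matching the small test data.
indices = np.array([[0, 2], [1, 1], [2, 0]])
values = np.array([5, 8, 7], dtype=np.int32)

# Infer the output shape from the largest row and column indices.
n, m = indices.max(axis=0) + 1

# Start from NaN so that cells with no Id in the CSV stay marked as missing.
# NaN requires a floating-point dtype, so the integer values are upcast.
out = np.full((n, m), np.nan, dtype=np.float64)
out[indices[:, 0], indices[:, 1]] = values
print(out)
```

Missing cells can then be located with np.isnan(out).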

I've also timed it using the %%timeit magic on 900000 rows of data with an output shape of (1000, 1000); here are the results:

1 loop, best of 3: 1.61 s per loop

I hope this answers your question, good luck!
