Python: Parse a CSV column as a row and column index for a np.array?
My CSV file is formatted as follows:
Id,Prediction
r1_c1,1
r3_c1,3
...
When I read the CSV file like this:
df = pd.read_csv('data/data_train.csv', delimiter=',')
I get a matrix of size (N, 1), where N is the number of values present in my input CSV file. Note that some values are missing from the input CSV file, so I cannot do a simple np.reshape.
Is there a function or procedure in pandas or np that fills a matrix A such that A[i][j] = v_ij, where v_ij is the value whose associated 'Id' equals ri_cj?
One could obviously do it with a for loop, but when the input CSV file is rather large, one would want to leverage the parallelism/vectorization implemented in numpy, for example. I couldn't describe my problem with keywords, so apologies if I failed to find the relevant documentation.
First, while reading the CSV file with pd.read_csv, read the Prediction column with a pre-defined dtype (np.int32 in this example), just to make it more efficient.
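As a minimal sketch of that first step: the dtype can be passed to pd.read_csv as a per-column mapping. The in-memory csv_text below is a hypothetical stand-in for data/data_train.csv, and the sketch assumes the Prediction column itself contains no empty cells (an integer dtype cannot hold NaN).

```python
import io

import numpy as np
import pandas as pd

# Hypothetical in-memory stand-in for 'data/data_train.csv'.
csv_text = "Id,Prediction\nr0_c2,5\nr1_c1,8\nr2_c0,7\n"

# Declare the dtype up front so pandas does not fall back to its default int64.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Prediction": np.int32})

print(df.dtypes)
```

With a real file, the io.StringIO(...) argument would simply be replaced by the file path.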
Then run the following code:
import numpy as np

# Parse Id column string values as integer pairs of output array indices.
indices = np.empty((len(df), 2), np.int32)
for i, id_string in enumerate(df.Id):
    # 'r3_c1' -> ['r3', 'c1'] -> (3, 1)
    id_parts = id_string.split('_')
    indices[i, :] = int(id_parts[0][1:]), int(id_parts[1][1:])
# Infer shape of output array (OPTIONAL: used if output shape is not given).
n, m = np.max(indices, axis=0) + 1
# Create predictions output array with the same dtype.
out = np.zeros((n, m), df.dtypes.Prediction)
# Assign Prediction column values.
out[indices[:, 0], indices[:, 1]] = df.Prediction
Small test data:
      Id  Prediction
0  r0_c2           5
1  r1_c1           8
2  r2_c0           7
Output (without specifying the output shape (n, m)):
[[0 0 5]
 [0 8 0]
 [7 0 0]]
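Note that in this output a cell whose Id is missing from the CSV looks the same as a cell whose prediction is genuinely 0. If that distinction matters, one possible variation (not part of the code above) is to start from a NaN-filled float array instead of zeros. The indices and values arrays below are hard-coded to mirror the small test data.

```python
import numpy as np

# Index pairs and values mirroring the small test data.
indices = np.array([[0, 2], [1, 1], [2, 0]])
values = np.array([5, 8, 7])

# Start from NaN so that cells with no Id in the CSV stay recognisable.
# np.full with np.nan yields a float64 array, so the integer dtype is given up.
out = np.full((3, 3), np.nan)
out[indices[:, 0], indices[:, 1]] = values

# Boolean mask of the cells that were actually present in the input.
present = ~np.isnan(out)
print(out)
print(present.sum())
```

The trade-off is that the array becomes floating point; whether that is acceptable depends on how the matrix is used afterwards.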
I also timed it with the %%timeit magic on 900000 rows of data with an output shape of (1000, 1000); here are the results:
1 loop, best of 3: 1.61 s per loop
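Since the question asks about avoiding an explicit Python loop, here is a sketch of an alternative way to parse the Id column, using the pandas string method Series.str.extract with a regular expression. It is a different technique from the loop above and I have not timed it, so treat any speed-up as an assumption to verify on your own data. The small DataFrame is built inline to mirror the test data.

```python
import numpy as np
import pandas as pd

# Inline copy of the small test data.
df = pd.DataFrame({
    "Id": ["r0_c2", "r1_c1", "r2_c0"],
    "Prediction": np.array([5, 8, 7], dtype=np.int32),
})

# Extract the row and column numbers from every Id in a single call.
# The two capture groups become two columns, which are then cast to integers.
indices = df["Id"].str.extract(r"r(\d+)_c(\d+)").astype(np.int64).to_numpy()
rows, cols = indices[:, 0], indices[:, 1]

# Infer the output shape from the largest indices, as in the loop version.
n, m = indices.max(axis=0) + 1
out = np.zeros((n, m), dtype=df["Prediction"].dtype)
out[rows, cols] = df["Prediction"].to_numpy()
print(out)
```

If the Ids are 1-based, as in the r1_c1 sample from the question, subtracting 1 from rows and cols (and dropping the + 1 when inferring the shape) would avoid an unused first row and column.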
I hope this answers your question, good luck!