Parsing colon-separated sparse data with pandas and numpy
I would like to parse a data file in the format col_index:value with pandas/numpy. For example:
0:23 3:41
1:31 2:65
would correspond to this matrix:
[[23  0  0 41]
 [ 0 31 65  0]]
It seems like a pretty common way to represent sparse data in a file, but I can't find an easy way to parse it without some sort of iteration after calling read_csv.
I found out recently that this is in fact the svm-light format, and you may be able to read a dataset like this using an svm loader such as:
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html
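As a rough sketch of what using that loader could look like: the svm-light format normally starts every line with a target value before the index:value pairs, and the sample above has none, so this example prepends a dummy label of 0 to each line (an assumption made only for illustration). It also passes zero_based=True because the sample's column indices start at 0, and it relies on the indices within each line being in ascending order, as they are in the sample.

```python
import io

from sklearn.datasets import load_svmlight_file

raw = "0:23 3:41\n1:31 2:65\n"

# Assumption: the file has no target column, so add a dummy "0" label per line.
labelled = "".join("0 " + line + "\n" for line in raw.splitlines())

# A file-like object must be opened in binary mode, hence BytesIO.
X, y = load_svmlight_file(io.BytesIO(labelled.encode("ascii")), zero_based=True)

# X is a scipy sparse matrix; y holds the dummy labels and can be ignored.
print(X.toarray())
```

For a file on disk, the same idea applies after rewriting the file with a label column; the values come back as floats rather than integers.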
So, is parsing the file line by line an option? For example:
from scipy.sparse import coo_matrix

rows, cols, values = [], [], []
with open('sparse.txt') as f:
    for i, line in enumerate(f):
        # Each whitespace-separated cell is "col_index:value"
        for cell in line.split():
            col, value = cell.split(':')
            rows.append(i)
            cols.append(int(col))
            values.append(int(value))
# The shape is inferred from the largest row and column index
matrix = coo_matrix((values, (rows, cols)))
print(matrix.todense())
Or do you need a faster one-step implementation? I'm not sure whether that is possible.
Edit #1: You can avoid the inner loop by splitting each line in one step with a regular expression, which leads to the following alternative implementation:
import re
from scipy.sparse import coo_matrix

rows, cols, values = [], [], []
with open('sparse.txt') as f:
    for i, line in enumerate(f):
        # Split on ':' and whitespace at once: [col, value, col, value, ...]
        numbers = [int(tok) for tok in re.split(r'[:\s]+', line.strip()) if tok]
        # extend (not append) keeps the lists flat even if lines differ in length
        rows.extend([i] * (len(numbers) // 2))
        cols.extend(numbers[::2])
        values.extend(numbers[1::2])
matrix = coo_matrix((values, (rows, cols)))
print(matrix.todense())
Edit #2: I found an even shorter solution without an explicit loop:
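from functools import partial
from scipy.sparse import coo_matrix, vstack

def parseLine(line, n_cols):
    # After ':' is replaced by ' ', a line reads "col value col value ..."
    nums = list(map(int, line.split()))
    col_idx, vals = nums[0::2], nums[1::2]
    return coo_matrix((vals, ([0] * len(col_idx), col_idx)), shape=(1, n_cols))

with open('sparse.txt') as f:
    lines = f.read().replace(':', ' ').splitlines()
# The matrix width is the largest column index plus one
n_cols = max(map(int, ' '.join(lines).split()[::2])) + 1
M = vstack(list(map(partial(parseLine, n_cols=n_cols), lines)))
print(M.todense())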
The loop is hidden within the map commands that act on lines. I think there is no solution without loops at all, since most built-in functions use them and many string-parsing methods like re.finditer yield iterators only.
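For comparison, here is a minimal sketch that pushes most of the per-cell work into numpy instead of Python-level loops. It is not loop-free either: one pass over the lines is still needed to count the pairs on each line, and numpy loops internally. The helper name parse_sparse_text is hypothetical, and the sketch assumes every token is a well-formed integer col:value pair.

```python
import numpy as np
from scipy.sparse import coo_matrix


def parse_sparse_text(text):
    """Build a COO matrix from lines of 'col:value' pairs (hypothetical helper)."""
    lines = text.splitlines()
    # The number of col:value pairs per line gives the row index of each entry.
    counts = np.array([line.count(':') for line in lines])
    rows = np.repeat(np.arange(len(lines)), counts)
    # Flatten every token in the file at once: [col, value, col, value, ...]
    flat = np.array(text.replace(':', ' ').split()).astype(np.int64)
    cols, values = flat[0::2], flat[1::2]
    return coo_matrix((values, (rows, cols)),
                      shape=(len(lines), int(cols.max()) + 1))


matrix = parse_sparse_text("0:23 3:41\n1:31 2:65\n")
print(matrix.toarray())
```

To use it on a file, pass the result of f.read() as text. Whether this is actually faster than the line-by-line versions depends on the file size and would need to be measured.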