

Python: handling a large set of data. Scipy or Rpy? And how?

In my Python environment, the Rpy and Scipy packages are already installed.

The problem I want to tackle is this:

1) A huge set of financial data is stored in a text file; it is too large to load into Excel.

2) I need to sum certain fields and get the totals.

3) I need to show the top 10 rows based on those totals.

Which package (Scipy or Rpy) is best suited for this task?

If so, could you provide some pointers (e.g. documentation or online examples) that would help me implement a solution?

Speed is a concern. Ideally, Scipy or Rpy should be able to handle large files even when the files are too large to fit into memory.

Neither Rpy nor Scipy is necessary, although numpy may make it a bit easier. This problem seems ideally suited to a line-by-line parser. Simply open the file, read a row into a string, scan the row into an array (see numpy.fromstring), update your running sums, and move on to the next line.
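A minimal sketch of that streaming approach, using only the standard library (the comma delimiter and the column indices are assumptions about the file format):

```python
import csv

def sum_columns(lines, columns):
    """Stream over rows, keeping only the running totals in memory."""
    totals = {c: 0.0 for c in columns}
    for row in csv.reader(lines):
        for c in columns:
            totals[c] += float(row[c])
    return totals

# Usage with a few sample rows; a real run would pass an open file object,
# which is iterated line by line without loading the whole file:
sample = ["AAPL,100,2.5", "MSFT,200,3.5", "AAPL,50,1.0"]
print(sum_columns(sample, [1, 2]))  # {1: 350.0, 2: 7.0}
```

Because only the totals dictionary is kept, memory use stays constant no matter how large the file is.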

Python's File I/O doesn't have bad performance, so you can just use the file module directly. Python的File I / O没有糟糕的性能,所以你可以直接使用file模块。 You can see what functions are available in it by typing help (file) in the interactive interpreter. 您可以通过在交互式解释器中键入help (file)来查看其中可用的功能。 Creating a file is part of the core language functionality and doesn't require you to import file . 创建文件是核心语言功能的一部分,不需要您import file

Something like: 就像是:

f = open(r"C:\BigScaryFinancialData.txt", "r")  # raw string so the backslash isn't an escape
for line in f:  # iterate lazily; f.readlines() would load the whole file into memory
    # line is a string
    # do whatever you want to do on a per-line basis here, for example:
    print len(line)
f.close()

Disclaimer: this is a Python 2 answer. I'm not 100% sure it works in Python 3.

I'll leave it to you to figure out how to show the top 10 rows and find the row sums. This can be done with simple program logic and shouldn't require any special libraries. Of course, if the rows have some kind of complicated formatting that makes it difficult to parse out the values, you might want to use a parsing module, re for example (type help(re) into the interactive interpreter).
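For instance, a small sketch of pulling numeric fields out of an irregularly formatted row with re (the row layout here is invented for illustration):

```python
import re

# A hypothetical messy row: the numbers we want are mixed with labels.
row = "ACME Corp  qty: 120  price: 4.75  total: 570.00"

# Extract every decimal number in the line, signed or not.
values = [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", row)]
print(values)  # [120.0, 4.75, 570.0]
```

Once the values are floats, summing and ranking them is plain program logic.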

How huge is your data; is it larger than your PC's memory? If it can be loaded into memory, you can use numpy.loadtxt() to load the text data into a numpy array. For example:

import numpy as np
with open("data.csv", "rb") as f:
    title = f.readline()  # skip this line if your data has a title line
    data = np.loadtxt(f, delimiter=",")  # if your data is separated by ","
    print np.sum(data, axis=0)  # sum along axis 0 to get the total of every column

As @gsk3 noted, bigmemory is a great package for this, along with the packages biganalytics and bigtabulate (there are more, but these are worth checking out). There's also ff, though that isn't as easy to use.

Common to both R and Python is support for HDF5 (see the ncdf4 or NetCDF4 packages in R), which makes it very speedy and easy to access massive data sets on disk. Personally, I primarily use bigmemory, though that's R-specific. As HDF5 is available in Python and is very, very fast, it's probably going to be your best bet in Python.
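A rough sketch of the HDF5 route in Python via the h5py package (the file name, dataset name, sample data, and chunk size are all assumptions for illustration): the column totals are accumulated one slice at a time, so the full array never has to fit in memory.

```python
import numpy as np
import h5py

# Write a small sample dataset to disk (stand-in for the huge financial file).
with h5py.File("finance.h5", "w") as f:
    f.create_dataset("data", data=np.arange(12.0).reshape(4, 3))

# Read back and reduce in row chunks; only one chunk is in memory at a time.
with h5py.File("finance.h5", "r") as f:
    dset = f["data"]
    totals = np.zeros(dset.shape[1])
    for start in range(0, dset.shape[0], 2):  # chunk of 2 rows per iteration
        totals += dset[start:start + 2].sum(axis=0)

print(totals)  # column sums: [18. 22. 26.]
```

The same chunked-reduction pattern scales to datasets far larger than RAM, since HDF5 slicing reads only the requested rows from disk.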

I don't know anything about Rpy. I do know that SciPy is used for serious number-crunching on truly large data sets, so it should work for your problem.

As zephyr noted, you may not need either one; if you just need to keep some running sums, you can probably do it in plain Python. If it is a CSV file or another common format, check whether there is a Python module that will parse it for you, then write a loop that sums the appropriate values.

I'm not sure how to get the top ten rows. Can you gather them on the fly as you go, or do you need to compute the sums first and then choose the rows? To gather them on the fly, you might use a dictionary to keep track of the current 10 best rows, with the keys storing the metric you used to rank them (making it easy to find and toss out a row when another row supersedes it). If you need to find the rows after the computation is done, load all the data into a numpy array, or else make a second pass through the file to pull out the ten rows.
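The on-the-fly idea can be sketched with the standard library's heapq, which keeps a bounded "current best" set in one pass (the comma delimiter and scoring column are assumptions):

```python
import heapq

def top_rows(lines, key_col, n=10):
    """Keep the n rows with the largest value in key_col: one pass, O(n) memory."""
    heap = []  # min-heap of (score, row); the weakest of the current best sits at heap[0]
    for line in lines:
        score = float(line.split(",")[key_col])
        if len(heap) < n:
            heapq.heappush(heap, (score, line))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, line))  # toss out the superseded row
    return [row for _, row in sorted(heap, reverse=True)]

sample = ["a,5", "b,1", "c,9", "d,7"]
print(top_rows(sample, 1, n=2))  # ['c,9', 'd,7']
```

Like the running-sum loop, this never holds more than the n best rows in memory, so it works on files of any size.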
