
Fastest way to extract only certain fields from comma separated string in Python

Say I have a string containing data from a DB or spreadsheet in comma separated format.

For example:

data = "hello,how,are,you,232.3354,good morning"

Assume that there are maybe 200 fields in these "records".

I am interested in looking at just certain fields of this record. What is the fastest way in Python to get at them?

The simplest way would be something like:

fields = data.split(",")
result = [fields[4], fields[12], fields[123]]

Is there a faster way to do this, making use of the fact that:

  1. You only need to allocate a list with 3 elements and 3 string objects for the result.
  2. You can stop scanning the data string once you reach field 123 (see the sketch after this list).
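
One built-in way to exploit point 2 is the optional maxsplit argument of str.split: splitting stops after the given number of splits, so the rest of the string is never scanned for commas. A minimal sketch, using the indices above:

# Split at most 124 times: scanning stops after field 123, and
# fields[124] (if present) holds the unsplit remainder of the record.
fields = data.split(",", 124)
result = [fields[4], fields[12], fields[123]]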

I have tried to write some code using repeated calls to find to skip past commas, but if the last field is too far down the string this becomes slower than the basic split solution.
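
For reference, here is a minimal sketch of that repeated-find approach (not the code actually tried; the helper extract_fields and its structure are illustrative):

def extract_fields(data, wanted):
    # Walk the string with str.find, collecting only the wanted field
    # indices; scanning stops once the last wanted field is reached.
    # Assumes every wanted index actually exists in the record.
    result = []
    start = 0
    field = 0
    for target in sorted(wanted):
        while field < target:  # skip over fields we do not need
            start = data.find(",", start) + 1
            field += 1
        end = data.find(",", start)
        result.append(data[start:] if end == -1 else data[start:end])
    return result

extract_fields(data, [0, 4])  # -> ['hello', '232.3354']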

I am processing several million records, so any speedup would be welcome.

You're not going to do much better than loading everything into memory and then dropping the parts that you don't need. My recommendation is compression and a better library.

As it happens I have a couple of reasonably sized CSVs lying around (this one is 500k lines).

> import gzip
> import pandas as pd
> %timeit pd.read_csv(gzip.open('file.csv.gz'))
1 loops, best of 3: 545 ms per loop

Dropping down to just the columns you need is also pretty fast; I'm not sure what the major cost is. (csv below is the DataFrame returned by pd.read_csv above.)

> %timeit csv[['col1', 'col2']]
100 loops, best of 3: 5.5 ms per loop
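
If only a few columns are needed, read_csv can also skip the rest at parse time via its usecols argument. A sketch, assuming the same file and the column names used above:

# Parse only the listed columns; pandas ignores the others entirely.
df = pd.read_csv(gzip.open('file.csv.gz'), usecols=['col1', 'col2'])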

If result can be a tuple instead of a list, you might gain a bit of a speedup (if you're doing multiple calls) using operator.itemgetter:

from operator import itemgetter

# itemgetter builds a reusable callable that pulls out the given indices.
indexer = itemgetter(4, 12, 123)
result = indexer(data.split(','))  # returns a tuple, not a list

You'd need to use timeit to actually see whether you get a speedup or not, though.
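
For example, a quick comparison with the standard timeit module (the 200-field sample record below is fabricated for illustration):

import timeit

setup = """
from operator import itemgetter
data = ",".join(str(i) for i in range(200))  # fake 200-field record
indexer = itemgetter(4, 12, 123)
"""

# Plain indexing into the split list vs. one itemgetter call.
print(timeit.timeit("f = data.split(','); (f[4], f[12], f[123])", setup=setup))
print(timeit.timeit("indexer(data.split(','))", setup=setup))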
