Fastest way to extract only certain fields from comma separated string in Python
Say I have a string containing data from a DB or spreadsheet in comma-separated format.
For example:
data = "hello,how,are,you,232.3354,good morning"
Assume that there are maybe 200 fields in these "records".
I am interested in looking at just certain fields of this record. What is the fastest way in Python to get at them?
The simplest way would be something like:
fields = data.split(",")
result = [fields[4], fields[12], fields[123]]
Is there a faster way to do this, making use of the fact that:
I have tried to write some code using repeated calls to find to skip past commas, but if the last field is too far down the string this becomes slower than the basic split solution.
I am processing several million records, so any speedup would be welcome.
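For reference, the find-based approach described in the question might look roughly like the sketch below. The function name extract_fields and the indices used are hypothetical, and it assumes the wanted indices are given in ascending order. The question does not show the original code, so this is only one plausible reading of it.

```python
def extract_fields(record, wanted, sep=","):
    """Return the fields at the ascending indices in `wanted` without a full split."""
    result = []
    pos = 0      # offset where the current field starts
    index = 0    # index of the current field
    for target in wanted:
        # Skip past separators until the current field is the target field.
        while index < target:
            pos = record.find(sep, pos)
            if pos == -1:
                raise IndexError(f"record has no field {target}")
            pos += 1
            index += 1
        # The field ends at the next separator, or at the end of the string.
        end = record.find(sep, pos)
        if end == -1:
            end = len(record)
        result.append(record[pos:end])
    return result


data = "hello,how,are,you,232.3354,good morning"
print(extract_fields(data, [1, 4, 5]))
```

Each find call runs in C, but the loop around it is interpreted Python, which is consistent with the observation that this gets slower than a single split once the last wanted field is far into the record.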
You're not going to do too much better than loading everything into memory and then dropping the parts that you don't need. My recommendation is compression and a better library.
As it happens, I have a couple of reasonably sized CSVs lying around (this one is 500k lines).
> import gzip
> import pandas as pd
> %timeit pd.read_csv(gzip.open('file.csv.gz'))
1 loops, best of 3: 545 ms per loop
Dropping the columns is also pretty fast; I'm not sure what the major cost is.
> %timeit csv[['col1', 'col2']]
100 loops, best of 3: 5.5 ms per loop
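If pandas is not available, the same idea (read the compressed file, keep only the columns of interest) can be sketched with the standard library alone. This is not what was timed above. The helper name read_columns, the file name, and the sample rows are all made up for illustration.

```python
import csv
import gzip
import os
import tempfile
from operator import itemgetter


def read_columns(path, columns):
    """Yield a tuple of the requested columns for each row of a gzipped CSV.

    Assumes at least two column indices; with a single index,
    itemgetter returns a bare value instead of a tuple.
    """
    pick = itemgetter(*columns)
    # "rt" decompresses to text; newline="" is what the csv module expects.
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle):
            yield pick(row)


# Build a tiny compressed file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.csv.gz")  # hypothetical file name
    with gzip.open(path, "wt", newline="") as handle:
        csv.writer(handle).writerows([["a", "b", "c", "d"], ["1", "2", "3", "4"]])
    rows = list(read_columns(path, [0, 2]))

print(rows)
```

With pandas itself, read_csv also accepts a usecols argument that restricts parsing to the selected columns, which may be worth timing against reading everything and dropping columns afterwards.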
If result can be a tuple instead of a list, you might gain a bit of a speedup (if you're doing multiple calls) using operator.itemgetter:
from operator import itemgetter
indexer = itemgetter(4,12,123)
result = indexer(data.split(','))
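A possible refinement, not part of the answer above, is to pass a maxsplit argument to split so that nothing after the last wanted field is split at all. The sketch below uses small hypothetical indices so it runs against the example string. Whether it helps in practice depends on how far into the record the last wanted field sits.

```python
from operator import itemgetter

wanted = (0, 2)                  # hypothetical indices, chosen to fit the short example
indexer = itemgetter(*wanted)
# max(wanted) + 1 splits isolate every field up to the last wanted one;
# the unsplit remainder of the record ends up in one trailing element.
limit = max(wanted) + 1


def pick(record):
    return indexer(record.split(",", limit))


data = "hello,how,are,you,232.3354,good morning"
print(data.split(",", limit))
print(pick(data))
```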
You'd need to timeit to actually see if you get a speedup or not though.
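A minimal way to run that comparison is sketched below. It assumes a synthetic 200-field record standing in for real data, and the function names are made up. The numbers will vary by machine and Python version, so no result is claimed here.

```python
import timeit
from operator import itemgetter

# Synthetic record with 200 fields: "0,1,2,...,199" (assumption, not real data).
record = ",".join(str(i) for i in range(200))
indexer = itemgetter(4, 12, 123)


def with_list(rec):
    fields = rec.split(",")
    return [fields[4], fields[12], fields[123]]


def with_itemgetter(rec):
    return indexer(rec.split(","))


for func in (with_list, with_itemgetter):
    seconds = timeit.timeit(lambda: func(record), number=20_000)
    print(f"{func.__name__}: {seconds:.3f} s for 20,000 calls")
```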