Most efficient way to iterate through list of lists
I'm currently collecting data from Quandl, and it is saved as a list of lists. The list looks something like this (price data):
['2', 1L, datetime.date(1998, 1, 2), datetime.datetime(2016, 9, 26, 1, 35, 3, 830563), datetime.datetime(2016, 9, 26, 1, 35, 3, 830563), '82.1900', '83.6200', '81.7500', '83.5000', '28.5183', 1286500.0]
This is typically 1 of about 5000 lists, and every once in a while Quandl will spit back some NaN values that don't like being saved into the database.
['2', 1L, datetime.date(1998, 1, 2), datetime.datetime(2016, 9, 26, 1, 35, 3, 830563), datetime.datetime(2016, 9, 26, 1, 35, 3, 830563), 'nan', 'nan', 'nan', 'nan', 'nan', 0]
What would be the most efficient way of iterating through the list of lists to change 'nan' values into zeros?
I know I could do something like this, but it seems rather inefficient. This operation will need to be performed on 11 different values * 5000 different dates * 500 companies:
def screen_data(data):
    new_data = []
    for d in data:
        new_list = []
        for x in d:
            new_value = x
            if math.isnan(x):
                new_value = 0
            new_list.append(new_value)
        new_data.append(new_list)
    return new_data
I would be interested in any solution that could reduce the time. I know DataFrames might work, but I'm not sure how they would solve the NaN issue.
Or if there is a way to include NaN values in an SQLServer5.6 database along with floats, changing the database is also a viable option.
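One note on the database option: most SQL databases can represent missing values as NULL, and Python database drivers map `None` to NULL. A minimal sketch of that idea, assuming rows shaped like the sample data above (the helper name `nan_to_null` is mine, not from the question):

```python
def nan_to_null(row):
    # Convert 'nan' strings (any casing) to None so the database
    # driver stores SQL NULL; leave every other value untouched.
    return [None if str(x).lower() == 'nan' else x for x in row]

row = ['2', 1, 'nan', '83.6200', 0]
print(nan_to_null(row))  # ['2', 1, None, '83.6200', 0]
```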
Don't create a new list - rather, edit the old list in-place:
import math
def screenData(L):
    for subl in L:
        for i, n in enumerate(subl):
            if math.isnan(n): subl[i] = 0
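A quick usage sketch of that in-place approach, assuming the values are already floats (the sample rows in the question hold strings, which would need converting first):

```python
import math

data = [[1.0, float('nan'), 3.0], [float('nan'), 0.5, 2.0]]

# Replace NaN values in-place; no new lists are allocated.
for subl in data:
    for i, n in enumerate(subl):
        if math.isnan(n):
            subl[i] = 0

print(data)  # [[1.0, 0, 3.0], [0, 0.5, 2.0]]
```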
The only way I can think of to make this faster would be with multiprocessing.
I haven't timed it, but have you tried using a nested list comprehension with conditional expressions? For example:
import datetime

data = [
    ['2', 1, datetime.date(1998, 1, 2),
     datetime.datetime(2016, 9, 26, 1, 35, 3, 830563),
     datetime.datetime(2016, 9, 26, 1, 35, 3, 830563),
     '82.1900', '83.6200', '81.7500', '83.5000',
     '28.5183', 1286500.0],
    ['2', 1, datetime.date(1998, 1, 2),
     datetime.datetime(2016, 9, 26, 1, 35, 3, 830563),
     datetime.datetime(2016, 9, 26, 1, 35, 3, 830563),
     'nan', 'nan', 'nan', 'nan', 'nan', 0],
]
new_data = [[y if str(y).lower() != 'nan' else 0 for y in x] for x in data]
print(new_data)
I did not use math.isnan(y) because you have to be sure that y is a float number or you'll get an error. That is much more difficult to guarantee when almost everything has a string representation. But I still made sure to do a lower-case comparison to 'nan' (with .lower()), since 'NaN' or 'Nan' are legal ways to express "Not a Number".
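The float/string distinction this answer raises can also be wrapped in a small helper; this is a sketch of that idea, not part of the original answer:

```python
import math

def is_nan(x):
    # Float NaN is caught by math.isnan; string spellings such as
    # 'nan', 'NaN', 'Nan' are caught by a case-insensitive comparison.
    if isinstance(x, float):
        return math.isnan(x)
    return isinstance(x, str) and x.lower() == 'nan'

print(is_nan(float('nan')), is_nan('NaN'), is_nan('83.5'))  # True True False
```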
How about this:
import math

def clean_nan(data_list, value=0):
    for i, x in enumerate(data_list):
        if math.isnan(x):
            data_list[i] = value
    return data_list
(The return is optional, as the modification is made in-place, but it is needed if used with map or similar, assuming of course that data_list is indeed a list or similar container.)
How you get your data and how you work with it will determine how to use this. For instance, if you do something like:
for data in (my database/Quandl/whatever):
    #do stuff with data
you can change it to
for data in (my database/Quandl/whatever):
    clean_nan(data)
    #do stuff with data
or use map (or, in Python 2, imap):
for data in map(clean_nan, (my database/Quandl/whatever)):
    #do stuff with data
That way you get to work with each row as soon as it arrives from the database/Quandl/whatever, granted that the source also works as a generator, that is, it doesn't hand you the whole dataset all at once; if it does, try to change it to a generator if possible. In either case you get to start processing your data as soon as possible.
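A concrete sketch of that map-based pipeline, assuming rows arrive as lists of floats; the `fetch_rows` generator here is stand-in data, not a real Quandl call:

```python
import math

def clean_nan(data_list, value=0):
    # Same helper as in the answer above; returning the list
    # lets it compose with map.
    for i, x in enumerate(data_list):
        if math.isnan(x):
            data_list[i] = value
    return data_list

def fetch_rows():
    # Stand-in generator; in practice rows would stream from
    # the database/Quandl.
    yield [82.19, float('nan'), 83.5]
    yield [float('nan'), 28.51, 0.0]

for data in map(clean_nan, fetch_rows()):
    print(data)
# [82.19, 0, 83.5]
# [0, 28.51, 0.0]
```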