
Python: faster way of counting occurrences in numpy arrays (large dataset)

I am new to Python. I have a numpy.array of size 66049x1 (66049 rows and 1 column). The values are floats sorted from smallest to largest, and some of them are repeated.

I need to determine the frequency of occurrences of each value (the number of times a given value is equalled but not surpassed, i.e. X <= x in statistical terms), in order to later plot the sample cumulative distribution function.

The code I am currently using is as follows, but it is extremely slow, as it has to loop 66049x66049 = 4362470401 times. Is there any way to speed up this piece of code? Would using dictionaries help in any way? Unfortunately I cannot change the size of the arrays I am working with.

+++Function header+++
...
...
directoryPath=raw_input('Directory path for native csv file: ')
csvfile = numpy.genfromtxt(directoryPath, delimiter=",")
x=csvfile[:,2]
x1=numpy.delete(x, 0, 0)
x2=numpy.zeros((x1.shape[0]))
x2=sorted(x1)
x3=numpy.around(x2, decimals=3)
count=numpy.zeros(len(x3))

#Iterates over the x3 array to find the number of occurrences of each value
for i in range(len(x3)):
    temp=x3[i]
    for j in range(len(x3)):
       if (temp<=x3[j]):
           count[j]=count[j]+1

#Creates a 2D array with (value, occurrences)
x4=numpy.zeros((len(x3), 2))
for i in range(len(x3)):
    x4[i,0]=x3[i]
    x4[i,1]=numpy.around((count[i]/x1.shape[0]),decimals=3)
...
...
+++Function continues+++

You should use np.where and then count the length of the obtained vector of indices:

indices = np.where(x3 <= value)
count = len(indices[0])

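To get this count for every value at once instead of calling np.where once per value, one possible sketch (my assumption, not part of the answer above) uses np.searchsorted, which relies on x3 already being sorted as in the question. The small x3 array here is made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for the question's sorted, rounded x3 array.
x3 = np.array([0.1, 0.2, 0.2, 0.5, 0.5, 0.5, 0.9])

# Single value: same result as len(np.where(x3 <= value)[0]).
value = 0.5
count_single = np.count_nonzero(x3 <= value)

# All values at once: for sorted input, side="right" returns the index just
# past the last element equal to each query, i.e. the number of elements <= it.
counts = np.searchsorted(x3, x3, side="right")

# Relative cumulative frequency (the sample CDF evaluated at each data point).
ecdf = counts / x3.size
```

This does a binary search per value rather than a full pass over the array, so it avoids the 66049x66049 comparisons of the nested loops.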
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt

arr = np.random.randint(0, 100, (100000,1))

df = pd.DataFrame(arr)

cnt = Counter(df[0])

df_p = pd.DataFrame(cnt, index=['data'])

df_p.T.plot(kind='hist')

plt.show()

That whole script took a very short time to execute (~2 s) for a (100,000x1) array. I didn't time it precisely, but if you provide the time yours took, we can compare.


I used Counter from collections to count the number of occurrences; my experience with it has always been great (timewise). I converted it into a DataFrame to plot and used T to transpose.

Your data does replicate a bit, but you can try and refine it some more. As it is, it's pretty fast.

Edit

Create the CDF using cumsum()

import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt

arr = np.random.randint(0, 100, (100000,1))

df = pd.DataFrame(arr)

cnt = Counter(df[0])

df_p = pd.DataFrame(cnt, index=['data']).T


df_p['cumu'] = df_p['data'].cumsum()

df_p['cumu'].plot(kind='line')

plt.show()
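
Note that cumsum() accumulates in row order, so the rows have to be in increasing value order for the result to be a CDF. A minimal NumPy-only sketch of the same idea, assuming np.unique is an acceptable substitute for Counter (it returns the distinct values already sorted); the input array is hypothetical:

```python
import numpy as np

# Hypothetical data standing in for the question's rounded values.
x = np.array([0.3, 0.1, 0.3, 0.2, 0.1, 0.3])

# Distinct values in ascending order, plus how often each occurs.
values, counts = np.unique(x, return_counts=True)

# Running total of counts = number of samples <= each distinct value.
cumulative = np.cumsum(counts)

# Divide by the sample size to get the cumulative relative frequency.
cdf = cumulative / x.size
```

The values and cdf arrays can then be plotted against each other, for example with plt.step(values, cdf, where='post').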


Edit 2

For a scatter() plot you must specify (x, y) explicitly. Also, df_p['cumu'] is a Series, not a DataFrame.

To properly display a scatter plot you'll need the following:

import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt

arr = np.random.randint(0, 100, (100000,1))

df = pd.DataFrame(arr)

cnt = Counter(df[0])

df_p = pd.DataFrame(cnt, index=['data']).T


df_p['cumu'] = df_p['data'].cumsum()

df_p.plot(kind='scatter', x='data', y='cumu')

plt.show()


If efficiency counts, you can use the numpy function bincount, which needs integers:

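import numpy as np
# 66049 random floats in one column, rounded to 3 decimals like the question's data
a = np.random.rand(66049).reshape((66049, 1)).round(3)
# bincount needs non-negative integers; rint rounds to nearest instead of truncating after scaling
z = np.bincount(np.rint(1000 * a[:, 0]).astype(np.int64))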

It takes about 1 ms.

Regards.
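
The counts from bincount are per value, not cumulative. A sketch of the extra step toward the CDF, assuming non-negative values rounded to 3 decimals (the sample array and the np.rint rounding are my choices for illustration):

```python
import numpy as np

# Hypothetical values already rounded to 3 decimals, as in the question.
a = np.array([0.002, 0.000, 0.002, 0.005, 0.002])

# Map each value to an integer bin; rint rounds to the nearest integer so a
# product that lands just below a whole number is not truncated downwards.
bins = np.rint(1000 * a).astype(np.int64)

# z[k] = number of samples whose scaled value equals k.
z = np.bincount(bins)

# Running total gives the number of samples <= k / 1000.
cumulative = np.cumsum(z)
cdf = cumulative / a.size
```

Here index k of cdf corresponds to the value k / 1000, so bins with no samples simply repeat the previous cumulative value.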
