
Avoid "Memory Error" when dealing with large arrays

I sometimes get a `MemoryError`: sometimes the code runs through fine, and sometimes the error pops up, specifically when trying to subtract a large array from one. I have tried many ways to do this subtraction; is there any way to avoid the error? And could other parts of my code also raise it occasionally?

Here is my code:

def home(request):
    if request.method=="POST":
        img = UploadForm(request.POST, request.FILES)
        no_clus = int(request.POST.get('num_clusters', 10))
        if img.is_valid():

            paramFile =io.TextIOWrapper(request.FILES['pic'].file)
            portfolio1 = csv.DictReader(paramFile)

            users = []

            users = [row["BASE_NAME"] for row in portfolio1]
            print(len(users))

            my_list = users
            vectorizer = CountVectorizer()
            dtm = vectorizer.fit_transform(my_list)

            lsa = TruncatedSVD(n_components=100)
            dtm_lsa = lsa.fit_transform(dtm)
            dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)
            dist1 = (1- np.asarray(numpy.asmatrix(dtm_lsa) * numpy.asmatrix(dtm_lsa).T))
            # print(1-similarity)
            k = len(my_list)
         #   dist1 = (1- similarity)
            # dist1=similarity
            # dist1.astype(float)
            #print(dist1)
            # print(cosine_similarity(tfidf_matrix[3:4], tfidf_matrix))
            # float dist = 1 - similarity;
            data2 = np.asarray(dist1)
            arr_3d = data2.reshape((1, k, k))
            # arr_3d= 1- arr_3d
            #print(arr_3d)

            no_cluster = number_cluster(len(my_list))
            print(no_cluster)
            for i in range(len(arr_3d)):
                # print (i+1910)
                # km = AgglomerativeClustering(n_clusters=no_clus, linkage='ward').fit(arr_3d[i])
                km = AgglomerativeClustering(n_clusters=no_cluster, linkage='average').fit(arr_3d[i])
                # km = AgglomerativeClustering(n_clusters=no_clus, linkage='complete').fit(arr_3d[i])
                # km = MeanShift()
                # km = KMeans(n_clusters=no_clus, init='k-means++')
                # km = MeanShift()
                #  km = km.fit(arr_3d[i])
                # print km
                labels = km.labels_

            csvfile = settings.MEDIA_ROOT +'\\'+ 'images\\export.csv'

            csv_input = pd.read_csv(csvfile, encoding='latin-1')
            csv_input['cluster_ID'] = labels
            csv_input['BASE_NAME'] = my_list
            csv_input.to_csv(settings.MEDIA_ROOT +'/'+ 'output.csv', index=False)
            clus_groups = list()
            for j in range(no_cluster):
                # print(" cluster no %i:%s" % (j, [my_list[i] for i, x in enumerate(labels) if x == j]))
                list_of_ints = ([my_list[i] for i, x in enumerate(labels) if x == j])
                clus_groups.append('  '.join(list_of_ints))
            vectorizer = CountVectorizer()
            dtm = vectorizer.fit_transform(my_list)

            lsa = TruncatedSVD(n_components=100)
            dtm_lsa = lsa.fit_transform(dtm)
            dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)
            dist1 = (1 - np.asarray(numpy.asmatrix(dtm_lsa) * numpy.asmatrix(dtm_lsa).T))
           # similarity = np.asarray(numpy.asmatrix(dtm_lsa) * numpy.asmatrix(dtm_lsa).T)
            k = len(my_list)
          #  dist1 = 1 - similarity

            data2 = np.asarray(dist1)
            arr_3d = data2.reshape((1, k, k))
            # arr_3d= 1- arr_3d

            #no_clus = 5
           # no_clus=get_name(request)
            for i in range(len(arr_3d)):
                # print (i+1910)
                # km = AgglomerativeClustering(n_clusters=no_clus, linkage='ward').fit(arr_3d[i])
                # km = AgglomerativeClustering(n_clusters=no_clus, linkage='average').fit(arr_3d[i])
                # km = AgglomerativeClustering(n_clusters=no_clus, linkage='complete').fit(arr_3d[i])
                km = KMeans(n_clusters=no_clus, init='k-means++')
                km = km.fit(arr_3d[i])
                # print km
                labels2 = km.labels_
                # error = km.inertia_
                print(labels2)

            labels = labels.tolist()
            labels2 = labels2.tolist()
            # new=list()


            csv_input = pd.read_csv(settings.MEDIA_ROOT +'/'+ 'output.csv',encoding='latin-1')
            labels1 = csv_input['cluster_ID']
            new_list = []
            for k in labels1:
                new_list.append(labels2[k])  # lookup the value in list2 at the index given by list1

            print(new_list)
            print(len(new_list))
            csv_input = pd.read_csv(settings.MEDIA_ROOT +'/'+ 'output.csv',encoding='latin-1')
            csv_input['cluster_ID'] = labels
            csv_input['BASE_NAME'] = my_list
            csv_input['User_Map'] = new_list
            csv_input.to_csv(settings.MEDIA_ROOT + '/' + 'output1.csv', index=False)
            #filename= settings.MEDIA_ROOT +'/'+ 'output.csv'
            send_file(request)
           # my_list = portfolio
            #save_file('output1.csv')
          #  csv(request)
          #  return HttpResponseRedirect(reverse('labels'))
            return render(request, 'new.html', {'labels': labels})
    else:
        img=UploadForm()
    images=Upload.objects.all()
    return render(request,'new.html',{'form':img,'images':images})

The error occurs on the line `dist1 = (1- np.asarray(numpy.asmatrix(dtm_lsa) * numpy.asmatrix(dtm_lsa).T))`. I also tried creating a new array of ones with the same size and then subtracting, but that did not help. How should I modify this to prevent the error? Note that the user interface that runs this code may be operated on any PC!

Not sure, but on the incriminated line you call `numpy.asmatrix(dtm_lsa)`, which is a function call that allocates memory.

You are doing it twice, so it allocates twice as much memory (the copies are eventually garbage-collected, but in some cases that happens too late).

(Not patronizing you at all: this is a common trap with mathematical formulae, but such formulae must be adapted when programmed into a computer.)

I would suggest replacing that line with these lines:

temp_matrix = numpy.asmatrix(dtm_lsa)
product = temp_matrix * temp_matrix.T
# maybe call the garbage collector at this point: gc.collect()
dist1 = 1 - np.asarray(product)

That way you get 1) less copy/paste and 2) fewer large matrix allocations packed into a single line.
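Beyond splitting the line, the temporaries can be avoided altogether. This is a minimal sketch (not the answerer's code, and the array shape is hypothetical): it skips `numpy.asmatrix` entirely, uses `float32` to halve the footprint, and performs the subtraction in place so no extra `(n, n)` buffer is allocated.

```python
import numpy as np

# Hypothetical stand-in for dtm_lsa: (n_docs, 100) after TruncatedSVD.
rng = np.random.default_rng(0)
dtm_lsa = rng.standard_normal((1000, 100)).astype(np.float32)  # float32 halves memory

# One matmul allocates a single (n_docs, n_docs) result, with no matrix-class copies.
dist1 = dtm_lsa @ dtm_lsa.T
np.subtract(1.0, dist1, out=dist1)  # in place: reuses the same buffer for 1 - X @ X.T

print(dist1.shape)  # (1000, 1000)
```

Whether this fits in memory still depends on `n_docs`: the distance matrix itself is `n_docs * n_docs * 4` bytes in `float32`, so for very large inputs chunked computation would be needed regardless.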
