
A fast way to count non-empty regions

I am writing some code that chooses n random hyperplanes in 5 dimensions that go through the origin. It then samples no_points points uniformly at random on the unit sphere and counts how many of the regions created by the hyperplanes have at least one point in them. This is relatively simple to do using the following Python code.

import numpy as np

def points_on_sphere(dim, N, norm=np.random.normal):
    """
    Uniformly random points on the unit sphere in `dim` dimensions.
    https://en.wikipedia.org/wiki/N-sphere#Generating_random_points
    """
    normal_deviates = norm(size=(N, dim))
    # Normalize each row (each point); the sum must run over the coordinate axis.
    radius = np.sqrt((normal_deviates ** 2).sum(axis=1, keepdims=True))
    points = normal_deviates / radius
    return points

n = 100
d = 5
hpoints = points_on_sphere(d, n)
for no_points in range(0, 10000000, 100000):
    test_points = points_on_sphere(d, no_points)
    # The next two lines count how many of the regions created by the
    # hyperplanes contain at least one of the test points.
    signs = np.sign(np.inner(test_points, hpoints))
    print(no_points, len(set(map(tuple, signs))))

Unfortunately, my naive method of counting how many distinct regions contain at least one point is slow. Overall the method takes O(no_points * n * d) time, and in practice it is too slow and too RAM-hungry once no_points reaches about 1,000,000; it already uses 4 GB of RAM at no_points = 900,000.

Can this be done more efficiently, so that no_points can get all the way to 10,000,000 (ideally ten times that) fairly quickly while using less than 4 GB of RAM?

Storing how each test point classifies with respect to each hyperplane is a lot of data. I would suggest an implicit radix sort on the point labels, e.g.,

import numpy as np


d = 5
n = 100
N = 100000
is_boundary = np.zeros(N, dtype=bool)   # True where a point starts a new region
tpoints = np.random.normal(size=(N, d))
tperm = np.arange(N)                    # permutation grouping same-region points
for i in range(n):
    hpoint = np.random.normal(size=d)
    # Refine each existing region by which side of the new hyperplane
    # its points fall on.
    region = np.cumsum(is_boundary) * 2 + (np.inner(hpoint, tpoints) < 0.0)[tperm]
    region_order = np.argsort(region)
    is_boundary[1:] = np.diff(region[region_order]) != 0
    tperm = tperm[region_order]
# Boundaries separate consecutive regions, so the region count is one more.
print(np.sum(is_boundary) + 1)

This code keeps a permutation of the test points ( tperm ) such that all points in the same region are consecutive. is_boundary indicates whether each point is in a different region from the previous one in permutation order. For each successive hyperplane, we partition each of the existing regions and effectively discard the empty regions, to avoid storing up to 2^100 of them.
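As a quick sanity check (my own addition, not part of the original answer), the radix-sort count can be compared against the naive distinct-sign-tuple count on a small instance. The two should agree, remembering that boundaries separate regions, so the region count is the boundary count plus one:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, N = 3, 6, 500
hpoints = rng.normal(size=(n, d))
tpoints = rng.normal(size=(N, d))

# Naive count: one sign tuple per point; distinct tuples = non-empty regions.
naive = len(set(map(tuple, np.sign(np.inner(tpoints, hpoints)))))

# Implicit radix sort over the same points, one hyperplane at a time.
is_boundary = np.zeros(N, dtype=bool)
tperm = np.arange(N)
for i in range(n):
    region = np.cumsum(is_boundary) * 2 + (np.inner(hpoints[i], tpoints) < 0.0)[tperm]
    region_order = np.argsort(region)
    is_boundary[1:] = np.diff(region[region_order]) != 0
    tperm = tperm[region_order]
radix = int(np.sum(is_boundary)) + 1  # boundaries + 1 = regions

print(naive, radix)
```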

Actually, since you have so many points and so few hyperplanes, it makes more sense not to store the points at all. The following code packs each 100-bit region signature into two doubles (50 bits per double, so every signature is represented exactly) and streams the test points in batches.

import numpy as np


d = 5
hpoints = np.random.normal(size=(100, d))
# Encode the 100-bit signature as two doubles (50 bits each); sums of
# distinct powers of two below 2**50 are exactly representable in float64.
bits = np.zeros((2, 100))
bits[0, :50] = 2.0 ** np.arange(50)
bits[1, 50:] = 2.0 ** np.arange(50)
N = 100000
uniques = set()
for i in range(0, N, 1000):
    tpoints = np.random.normal(size=(1000, d))
    signatures = np.inner(np.inner(tpoints, hpoints) < 0.0, bits)
    uniques.update(map(tuple, signatures))
print(len(uniques))
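As a variant (my own sketch, not part of the original answer), NumPy's np.packbits can pack each boolean signature row into ceil(100/8) = 13 bytes, giving an exact, compact set key without any floating-point encoding:

```python
import numpy as np

d = 5
n = 100
N = 100000
rng = np.random.default_rng(0)
hpoints = rng.normal(size=(n, d))

uniques = set()
for start in range(0, N, 1000):
    tpoints = rng.normal(size=(1000, d))
    side = np.inner(tpoints, hpoints) < 0.0   # (1000, n) boolean signatures
    packed = np.packbits(side, axis=1)        # (1000, 13) uint8, zero-padded
    uniques.update(row.tobytes() for row in packed)
print(len(uniques))
```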
