I have an input 2D histogram that I want to do 2-fold cross-validation with. The problem is that I don't know how to extract two mutually exclusive random samples of the data from a histogram. If the data were instead two lists holding the coordinates of each data point, that would be easy - shuffle both lists with the same permutation, and split them in half.
So for a list I would do this:
list1 = [1,2,3,3,5,6,1];
list2 = [1,3,6,6,5,2,1];
idx = randperm(length(list1)); % ie. idx = [4 3 1 5 6 2 7]
shlist1 = list1(idx); % shlist1 = [3,3,1,5,6,2,1]
shlist2 = list2(idx); % shlist2 = [6,6,1,5,2,3,1]
slist1 = shlist1(1:3); % slist1 = [3,3,1]
elist1 = shlist1(4:end); % elist1 = [5,6,2,1]
slist2 = shlist2(1:3); % slist2 = [6,6,1]
elist2 = shlist2(4:end); % elist2 = [5,2,3,1]
But if this same data was presented to me as a histogram
hist = [2 0 0 0 0 0]
[0 0 0 0 0 1]
[0 1 0 0 0 0]
[0 0 0 0 0 0]
[0 0 0 0 1 0]
[0 0 2 0 0 0]
I want the result to be something like this
hist1 = [0 0 0 0 0 0]
[0 0 0 0 0 1]
[0 1 0 0 0 0]
[0 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 1 0 0 0]
hist2 = [2 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 0 0 1 0]
[0 0 1 0 0 0]
so that different halves of the data are randomly and equally assigned to two new histograms.
Would this be equivalent to taking a random integer height of each bin hist(i,j), and adding that to the equivalent bin in hist1(i,j), and the difference to hist2(i,j)?
% hist as shown above
hist1 = zeros(6);
hist2 = zeros(6);
for i = 1:numel(hist)
randNum = rand;
hist1(i) = round(hist(i)*randNum);
hist2(i) = hist(i) - hist1(i);
end
And if that is equivalent, is there a better way/built-in way of doing it?
My actual histogram is 300x300 bins, and contains about 6,000,000 data points, and it needs to be fast.
Thanks for any help :)
EDIT: The suggested bit of code I made is not equivalent to taking a random sample of positional points from a list, as it does not maintain the overall probability density function of the data. Halving the histograms should be fine for my 6,000,000 points, but I was hoping for a method that would still work for few points.
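You can use rand or randi to generate the two histograms. The first method is more efficient because it is vectorized. The second draws each bin's count uniformly from 0 to h(k), whereas round(rand*h(k)) makes the extreme values 0 and h(k) half as likely as the values in between.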
h = [[2 0 0 0 0 0]
[0 0 0 0 0 1]
[0 1 0 0 0 0]
[0 0 0 0 0 0]
[0 0 0 0 1 0]
[0 0 2 0 0 0]];
%using rand
h1 = round(rand(size(h)).*h);
h2 = h - h1;
%using randi
h1 = zeros(size(h));
for k = 1:numel(h)
h1(k) = randi([0 h(k)]);
end
h2 = h - h1;
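Neither variant above reproduces the list-shuffling split from the question, because each bin is split independently and the two halves need not contain the same number of points. A minimal sketch of an exact split follows, assuming the counts are non-negative integers and that a vector with one entry per data point fits in memory (about 6,000,000 entries for the histogram described in the question). The variable names binOfPoint, perm, and firstHalf are illustrative, and repelem requires R2015a or later.

```matlab
% Sketch: exact random 2-fold split of a count histogram.
h = [2 0 0 0 0 0
     0 0 0 0 0 1
     0 1 0 0 0 0
     0 0 0 0 0 0
     0 0 0 0 1 0
     0 0 2 0 0 0];

% One linear bin index per data point: a bin with count 2 appears twice.
binOfPoint = repelem((1:numel(h)).', h(:));

% Shuffle the points and keep the first floor(n/2) of them for h1.
n = numel(binOfPoint);
perm = randperm(n);
firstHalf = binOfPoint(perm(1:floor(n/2)));

% Re-bin the first half; the remaining points form the second histogram.
h1 = reshape(accumarray(firstHalf, 1, [numel(h) 1]), size(h));
h2 = h - h1;
```

This is the histogram equivalent of shuffling the coordinate lists and cutting them in half, so it should behave the same way for a handful of points as for millions.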
Suppose H is your 2D histogram. The following code extracts a single random index with a probability proportional to the count at that index - which I think is what you want.
cc = cumsum(H(:)); % cumulative counts over the linear bin indices
m = cc(end); % total number of data points
ix = find(cc > m*rand, 1); % first bin whose cumulative count exceeds the draw
To extract multiple samples, you need to write your own find function (preferably a binary search for efficiency) that extracts n samples in one call. This will give you a vector of indices (call it ix_vec) chosen with probability proportional to the histogram count at each index.
Then if we denote by X the numerical values corresponding to each location in the histogram, your random sample is:
R1 = X(ix_vec);
Repeat for the second random sample set.
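As a rough sketch of drawing many indices in one call without a hand-written search, the built-in discretize (R2015a or later) can stand in for the binary search. The values of H, n, and X here are made-up examples, and nz and edges are illustrative names. Note that this samples with replacement, so two sets drawn this way are not guaranteed to be mutually exclusive.

```matlab
% Sketch: draw n bin indices with probability proportional to the counts in H.
H = [2 0 0; 0 1 0; 0 0 3];   % example histogram (assumed data)
n = 1000;                    % number of samples to draw (assumed)

nz = find(H(:) > 0);          % skip empty bins so the edges strictly increase
edges = [0; cumsum(H(nz))];   % non-empty bin k covers [edges(k), edges(k+1))
r = rand(n, 1) * edges(end);  % uniform draws between 0 and the total count
ix_vec = nz(discretize(r, edges));  % linear indices into H

X = reshape(1:numel(H), size(H));   % placeholder for the value at each location
R1 = X(ix_vec);
```

Dropping the empty bins before building the edges keeps the edge vector strictly increasing and guarantees that a bin with zero count is never returned.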