
How to calculate the distance (similarity) between two continuous random samples of different lengths using Python?

I want to calculate similarity or distance between two sample sets.
Each set indicates game play times of a user.
For example, suppose there are two users: the first user (X1) played five times and the other (X2) played four times, as follows.

X1={1,2,3,1,2}
X2={1,2,3,4}

I want to calculate similarity or distance between X1 and X2 using python. How can I calculate it?

Note 1. the order is not important.
I mean, {1,2,3,4} and {4,1,2,3} should be considered as the same set.

Note 2. The elements (i.e., 1, 2, 3, 4) are not fixed values; the play time is a continuous variable.

You could use the two-sample Kolmogorov-Smirnov test, available in SciPy as scipy.stats.ks_2samp: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html

It should work for samples of different size.

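In Python, e.g.:

import numpy as np
import scipy.stats as st

# x and y are drawn from the same distribution; z is drawn from a shifted one
x = np.random.normal(0, 1, 1000)
y = np.random.normal(0, 1, 1000)
z = np.random.normal(1.1, 0.9, 1000)

print(st.ks_2samp(x, y))  # same distribution: D is typically small
print(st.ks_2samp(x, z))  # different distributions: D is typically large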

It returns the D statistic (as well as a p-value), which is the maximum absolute difference (supremum) between the empirical CDFs of the two samples. This is your distance: it lies between 0 and 1, and it depends only on the values in each sample, not on their order.
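To make the definition of D concrete, here is a minimal sketch that applies it to the two samples from the question. The helper name ks_distance is an illustrative assumption, not part of SciPy. It computes the largest gap between the two empirical CDFs by hand and compares the result with scipy.stats.ks_2samp, including a shuffled copy of X2 to check that order does not matter.

```python
import numpy as np
from scipy import stats


def ks_distance(a, b):
    """Hypothetical helper: max absolute gap between the empirical CDFs of a and b."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # The gap between two step functions can only change at observed values,
    # so it is enough to evaluate both empirical CDFs on the pooled sample.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Samples from the question: different lengths, order irrelevant.
x1 = np.array([1, 2, 3, 1, 2], dtype=float)
x2 = np.array([1, 2, 3, 4], dtype=float)
x2_shuffled = np.array([4, 1, 2, 3], dtype=float)

d_manual = ks_distance(x1, x2)
d_scipy = stats.ks_2samp(x1, x2).statistic
d_shuffled = stats.ks_2samp(x1, x2_shuffled).statistic

print(d_manual, d_scipy, d_shuffled)
```

For these samples all three values should come out at about 0.3: the largest gap is at play time 2, where the empirical CDF of X1 is 0.8 and that of X2 is 0.5. With samples this small the p-value carries little information, so only the statistic is used here as the distance.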


 