How to calculate distance (similarity) between two continuous random samples with different lengths using Python?

Question

I want to calculate similarity or distance between two sample sets.
Each set indicates game play times of a user.
For example, suppose there are two users: the first user (X1) played five times and the other (X2) played four times, as follows.

X1={1,2,3,1,2}
X2={1,2,3,4}

I want to calculate similarity or distance between X1 and X2 using Python. How can I calculate it?

Note 1. The order is not important.
I mean, {1,2,3,4} and {4,1,2,3} should be considered the same set.

Note 2. The elements (i.e., 1, 2, 3, 4) are not fixed values. I mean, the play time is a continuous variable.

Answer 1

Well, you could use the two-sample Kolmogorov-Smirnov test, available in SciPy as scipy.stats.ks_2samp: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html

It should work for samples of different sizes.

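In Python, e.g.:

import numpy as np
import scipy.stats as st

# x and y are drawn from the same distribution; z is shifted and slightly narrower
x = np.random.normal(0, 1, 1000)
y = np.random.normal(0, 1, 1000)
z = np.random.normal(1.1, 0.9, 1000)

print(st.ks_2samp(x, y))  # same distribution, so a small D is expected
print(st.ks_2samp(x, z))  # different distributions, so a larger D is expected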

It returns the D statistic (as well as a p-value), which is the maximum absolute distance (supremum) between the empirical CDFs of the two samples. This is your distance. D lies between 0 (identical empirical CDFs) and 1.
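
To make the definition of D concrete, here is a minimal sketch that computes the maximum gap between the two empirical CDFs by hand, using the X1 and X2 from the question. The helper name ks_distance is made up for this illustration and is not part of SciPy; it uses only the standard library, and its result should agree with the statistic reported by st.ks_2samp for the same inputs.

```python
from bisect import bisect_right

def ks_distance(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    # The empirical CDFs only jump at observed values, so checking those is enough.
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )

x1 = [1, 2, 3, 1, 2]   # five play times from the question
x2 = [1, 2, 3, 4]      # four play times from the question

d = ks_distance(x1, x2)
d_reordered = ks_distance(x1, [4, 1, 2, 3])  # same values as x2, different order
print(round(d, 3), round(d_reordered, 3))    # 0.3 0.3
```

Because both samples are sorted before comparison, the order of the play times does not matter (Note 1), and the two samples may have different lengths since each CDF is normalized by its own sample size.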

solution1
2 ACCEPTED 2021-05-14 14:47:06
