I want to calculate similarity or distance between two sample sets.
Each set contains the game play times of one user.
For example, suppose there are two users: the first user (X1) played five times and the other (X2) played four times, as follows.
X1={1,2,3,1,2}
X2={1,2,3,4}
I want to calculate similarity or distance between X1 and X2 using Python. How can I calculate it?
Note 1. The order is not important.
I mean, {1,2,3,4} and {4,1,2,3} should be considered the same set.
Note 2. The elements (i.e., 1, 2, 3, 4) are not fixed values. I mean, the play time is a continuous variable.
Well, you could use the two-sample Kolmogorov-Smirnov test, available in SciPy as scipy.stats.ks_2samp: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html
It works for samples of different sizes.
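In Python, for example:
import numpy as np
import scipy.stats as st
# x and y come from the same distribution; z comes from a different one
x = np.random.normal(0, 1, 1000)
y = np.random.normal(0, 1, 1000)
z = np.random.normal(1.1, 0.9, 1000)
print(st.ks_2samp(x, y))  # a small D is expected here
print(st.ks_2samp(x, z))  # a larger D is expected here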
It returns the D statistic (along with a p-value), which is the maximum absolute distance (supremum) between the empirical CDFs of the two samples. D lies between 0 and 1, and 0 means the two empirical distributions are identical. This is your distance.
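To make this concrete for the data in the question, here is a rough sketch that computes D by hand from the two empirical CDFs and compares it with what ks_2samp reports. The helper name ks_distance is made up for illustration and is not part of SciPy. The sketch also checks that reordering a sample does not change the result.

```python
import numpy as np
from scipy import stats


def ks_distance(a, b):
    """Largest absolute gap between the empirical CDFs of a and b.

    Hypothetical helper for illustration only; not a SciPy function.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # The gap can only change at observed values, so check just those points.
    points = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, points, side="right") / a.size
    cdf_b = np.searchsorted(b, points, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


x1 = [1, 2, 3, 1, 2]
x2 = [1, 2, 3, 4]

d_manual = ks_distance(x1, x2)
d_scipy = stats.ks_2samp(x1, x2).statistic
# Same values as x2 in a different order: the distance should not change.
d_shuffled = stats.ks_2samp(x1, [4, 1, 2, 3]).statistic

print(d_manual, d_scipy, d_shuffled)
```

For these two samples the largest gap occurs at the value 2, where the empirical CDFs are 4/5 and 2/4, so D comes out at about 0.3. With samples this small the p-value carries little weight, so I would use D only as a distance.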