
How to calculate percentage properly

I have three dataframes that each have a "City" column. Each dataframe contains a different set of city names.

I want to find the percentage of city names that match between each pair of dataframes.

For this purpose I used set() and got three sets:

set1 = set(df1['City'])
set2 = set(df2['City'])
set3 = set(df3['City'])

But how should I find the percentage? I used these expressions, but I'm not sure I did everything right:

(len(set1) - len(set2))/len(set1)*100
(len(set1) - len(set3))/len(set1)*100
(len(set2) - len(set3))/len(set2)*100

Is this approach correct?

You probably want this:

percentage = ( len(set1.intersection(set2)) / len(set1.union(set2)) )*100

which gives you the number of elements common to set1 and set2 as a percentage of all distinct elements in either set.

This is also known as the Jaccard index, a measure of similarity between sets.
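As a minimal sketch, the same formula can be applied to all three pairs at once. The city names below are made up and stand in for the sets built from the dataframes, and `jaccard_percentage` is a hypothetical helper name, not a pandas or standard-library function.

```python
from itertools import combinations

def jaccard_percentage(a, b):
    """Return the shared elements as a percentage of all distinct elements in a and b."""
    union = a | b
    if not union:  # both sets empty: treat similarity as 0 to avoid dividing by zero
        return 0.0
    return len(a & b) / len(union) * 100

# Made-up data standing in for set(df1['City']), set(df2['City']), set(df3['City'])
city_sets = {
    "set1": {"Berlin", "Paris", "Rome", "Madrid"},
    "set2": {"Paris", "Rome", "Vienna"},
    "set3": {"Rome", "Oslo"},
}

# Compute the Jaccard percentage for every pair of sets
pairwise = {
    (name_a, name_b): jaccard_percentage(city_sets[name_a], city_sets[name_b])
    for name_a, name_b in combinations(city_sets, 2)
}

for (name_a, name_b), pct in pairwise.items():
    print(f"{name_a} vs {name_b}: {pct:.1f}%")
```

Note that this measure is symmetric: comparing set1 with set2 gives the same value as comparing set2 with set1.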

From the purely mathematical side of things: I assume that you want to find the percentage of cities that match between set1 & set2, set1 & set3, and set2 & set3, respectively.

To calculate this percentage, you need the number of matches and the size of the set you are comparing against.

Then the percentage can be calculated as follows:

Percentage match 1 & 2 = [(number of matches between 1 & 2)/(length of the set)]*100
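The formula above does not say which set supplies the denominator. One possible reading, sketched here as an assumption, is to divide by the size of one chosen reference set. The city names are made up and `match_percentage` is a hypothetical helper name.

```python
def match_percentage(reference, other):
    """Percentage of elements in `reference` that also appear in `other`."""
    if not reference:  # empty reference set: avoid dividing by zero
        return 0.0
    return len(reference & other) / len(reference) * 100

# Made-up data standing in for set(df1['City']) and set(df2['City'])
set1 = {"Berlin", "Paris", "Rome", "Madrid"}
set2 = {"Paris", "Rome", "Vienna"}

pct_of_set1 = match_percentage(set1, set2)  # 2 of the 4 cities in set1 are in set2
pct_of_set2 = match_percentage(set2, set1)  # 2 of the 3 cities in set2 are in set1

print(f"{pct_of_set1:.1f}% of set1 is in set2")
print(f"{pct_of_set2:.1f}% of set2 is in set1")
```

Under this reading the result depends on which set is the reference, so the two directions generally differ.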

For the code side of things: I agree with Sparkofska.


 