inner join not working in pandas dataframes

I have the following 2 pandas dataframes:

                      city Population
0            New York City   20153634
1              Los Angeles   13310447
2   San Francisco Bay Area    6657982
3                  Chicago    9512999
4        Dallas–Fort Worth    7233323
5         Washington, D.C.    6131977
6             Philadelphia    6070500
7                   Boston    4794447
8   Minneapolis–Saint Paul    3551036
9                   Denver    2853077
10   Miami–Fort Lauderdale    6066387
11                 Phoenix    4661537
12                 Detroit    4297617
13                 Toronto    5928040
14                 Houston    6772470
15                 Atlanta    5789700
16          Tampa Bay Area    3032171
17              Pittsburgh    2342299
18               Cleveland    2055612
19                 Seattle    3798902
20              Cincinnati    2165139
21             Kansas City    2104509
22               St. Louis    2807002
23               Baltimore    2798886
24               Charlotte    2474314
25            Indianapolis    2004230
26               Nashville    1865298
27               Milwaukee    1572482
28             New Orleans    1268883
29                 Buffalo    1132804
30                Montreal    4098927
31               Vancouver    2463431
32                 Orlando    2441257
33                Portland    2424955
34                Columbus    2041520
35                 Calgary    1392609
36                  Ottawa    1323783
37                Edmonton    1321426
38          Salt Lake City    1186187
39                Winnipeg     778489
40               San Diego    3317749
41             San Antonio    2429609
42              Sacramento    2296418
43               Las Vegas    2155664
44            Jacksonville    1478212
45           Oklahoma City    1373211
46                 Memphis    1342842
47                 Raleigh    1302946
48               Green Bay     318236
49                Hamilton     747545
50                  Regina     236481


            

                      city  W/L Ratio
0                   Boston   2.500000
1                  Buffalo   0.555556
2                  Calgary   1.057143
3                  Chicago   0.846154
4                 Columbus   1.500000
5        Dallas–Fort Worth   1.312500
6                   Denver   1.433333
7                  Detroit   0.769231
8                 Edmonton   0.900000
9                Las Vegas   2.125000
10             Los Angeles   1.655862
11   Miami–Fort Lauderdale   1.466667
12  Minneapolis-Saint Paul   1.730769
13                Montreal   0.725000
14               Nashville   2.944444
15                New York   1.517241
16           New York City   0.908870
17                  Ottawa   0.651163
18            Philadelphia   1.615385
19                 Phoenix   0.707317
20              Pittsburgh   1.620690
21                 Raleigh   1.028571
22  San Francisco Bay Area   1.666667
23               St. Louis   1.375000
24               Tampa Bay   2.347826
25                 Toronto   1.884615
26               Vancouver   0.775000
27        Washington, D.C.   1.884615
28                Winnipeg   2.600000

I join them like this:

result = pd.merge(df, nhl_df, on="city")

The result should have 28 rows, but I get only 24.

One of the missing rows is, for example, Miami-Fort Lauderdale.

I have double-checked both dataframes and there are NO typographical errors. So why isn't that row in the resulting dataframe?

                      city Population  W/L Ratio
0            New York City   20153634   0.908870
1              Los Angeles   13310447   1.655862
2   San Francisco Bay Area    6657982   1.666667
3                  Chicago    9512999   0.846154
4        Dallas–Fort Worth    7233323   1.312500
5         Washington, D.C.    6131977   1.884615
6             Philadelphia    6070500   1.615385
7                   Boston    4794447   2.500000
8                   Denver    2853077   1.433333
9                  Phoenix    4661537   0.707317
10                 Detroit    4297617   0.769231
11                 Toronto    5928040   1.884615
12              Pittsburgh    2342299   1.620690
13               St. Louis    2807002   1.375000
14               Nashville    1865298   2.944444
15                 Buffalo    1132804   0.555556
16                Montreal    4098927   0.725000
17               Vancouver    2463431   0.775000
18                Columbus    2041520   1.500000
19                 Calgary    1392609   1.057143
20                  Ottawa    1323783   0.651163
21                Edmonton    1321426   0.900000
22                Winnipeg     778489   2.600000
23               Las Vegas    2155664   2.125000
24                 Raleigh    1302946   1.028571

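You can check whether two strings really contain the same characters by looking at the integer code of each character with ord. Here df1 and df2 stand for the two dataframes from the question. The dash differs: it has code 150 in one dataframe and code 8211 (an en dash) in the other, which is why the values do not match: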

a = df1.loc[10, 'city']
print (a)
Miami–Fort Lauderdale

print ([ord(x) for x in a])
[77, 105, 97, 109, 105, 150, 70, 111, 114, 116, 32, 76, 97, 117, 100, 101, 114, 100, 97, 108, 101]


b = df2.loc[11, 'city']
print (b)
Miami–Fort Lauderdale

print ([ord(x) for x in b])
[77, 105, 97, 109, 105, 8211, 70, 111, 114, 116, 32, 76, 97, 117, 100, 101, 114, 100, 97, 108, 101]
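To find every key that fails to match, rather than checking one row at a time, an outer merge with indicator=True may help. The sketch below uses two tiny stand-in dataframes (illustrative, not the full data from the question) and prints the unmatched city names with ascii() so that look-alike characters show up as escape sequences.

```python
import pandas as pd

# Tiny stand-ins for the two dataframes; only two rows each, for illustration.
df1 = pd.DataFrame({"city": ["Boston", "Miami\x96Fort Lauderdale"],
                    "Population": [4794447, 6066387]})
df2 = pd.DataFrame({"city": ["Boston", "Miami\u2013Fort Lauderdale"],
                    "W/L Ratio": [2.5, 1.466667]})

# An outer merge keeps every key; the _merge column records which side it came from.
check = pd.merge(df1, df2, on="city", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]

# ascii() escapes every non-ASCII character, exposing '\x96' versus '\u2013'.
for city, side in zip(unmatched["city"], unmatched["_merge"]):
    print(ascii(city), side)
```

Rows marked left_only or right_only are the keys that exist in only one dataframe, so near-duplicates such as the two Miami spellings appear next to each other in the output.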

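To fix it, replace one dash character with the other so that both columns use the same one. The two characters look identical on screen, so writing them as escape sequences is safer than copying them from the printed values:

# '\u2013' is the dash from b (code 8211), '\x96' is the dash from a (code 150)
df2['city'] = df2['city'].replace('\u2013', '\x96', regex=True)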
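A more general option, sketched here under the assumption that a plain hyphen is acceptable in the city names, is to map several dash-like characters to "-" in both dataframes before merging. The helper name normalize_city and the contents of DASH_TABLE are hypothetical choices for this example, and the dataframes are small stand-ins for the real ones.

```python
import pandas as pd

# Dash-like code points mapped to a plain hyphen (an illustrative, not exhaustive, list).
DASH_TABLE = {0x96: "-", 0x2013: "-", 0x2014: "-", 0x2212: "-"}

def normalize_city(s: pd.Series) -> pd.Series:
    """Replace dash variants with '-' and trim surrounding whitespace."""
    return s.str.translate(DASH_TABLE).str.strip()

# Small stand-ins for the two dataframes.
df1 = pd.DataFrame({"city": ["Boston", "Miami\x96Fort Lauderdale"],
                    "Population": [4794447, 6066387]})
df2 = pd.DataFrame({"city": ["Boston ", "Miami\u2013Fort Lauderdale"],
                    "W/L Ratio": [2.5, 1.466667]})

# Normalize the key column on both sides, then do the inner join.
df1["city"] = normalize_city(df1["city"])
df2["city"] = normalize_city(df2["city"])
result = pd.merge(df1, df2, on="city")
print(result)
```

With the data in the question this would also reconcile "Minneapolis–Saint Paul" with "Minneapolis-Saint Paul", which differ only in the dash. It would not match names that genuinely differ, such as "Tampa Bay Area" and "Tampa Bay".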

