
Python pandas concatenate: join="inner" works on toy data, not on real data

I'm working on topic modeling data where I have one data frame with a small selection of topics and their scores for each document or author (called "scores"), and another data frame with the top three words for all 250 topics (called "words").

I'm trying to combine the two data frames in a way to have an extra column in "scores", in which the top three words from "words" appear for each of the topics included in "scores". This is useful for visualizing the data as a heatmap, as seaborn or pyplot will pick up the labels automatically from such a dataframe.

I have tried a wide variety of merge and concat commands, but do not get the desired result. The strange thing is that the command that seems most logical, according to my understanding of the relevant documentation and the examples there (i.e. use concat on the two dataframes with axis=1 and join="inner"), works on toy data but does not work on my real data.

Here is my toy data with the code I used to generate it and to do the merge:

import pandas as pd

## Defining the two data frames
scores = pd.DataFrame({'author1': ['1.00', '1.50'],
                    'author2': ['2.75', '1.20'],
                    'author3': ['0.55', '1.25'],
                    'author4': ['0.95', '1.3']},
                     index=[1, 3])                     

words = pd.DataFrame({'words': ['cadavre','fenêtre','musique','mariage']},
                     index=[0, 1, 2, 3])

## Inspecting the two dataframes
print("\n==scores==\n", scores)
print("\n==words==\n", words)

## Merging the dataframes
merged = pd.concat([scores, words], axis=1, join="inner")

## Check the result
print("\n==merged==\n", merged)

And this is the output, as expected:

==scores==
   author1 author2 author3 author4
1    1.00    2.75    0.55    0.95
3    1.50    1.20    1.25     1.3

==words==
      words
0  cadavre
1  fenêtre
2  musique
3  mariage

==merged==
   author1 author2 author3 author4    words
1    1.00    2.75    0.55    0.95  fenêtre
3    1.50    1.20    1.25     1.3  mariage

This is exactly what I would like to accomplish with my real data. And although the two dataframes seem no different from the test data, I get an empty dataframe as the result of the merge.

Here is a small example from my real data:

someScores (complete table):

      blanche  policier
108  0.003028  0.017494
71   0.002997  0.016956
115  0.029324  0.016127
187  0.004867  0.017631
122  0.002948  0.015118

firstWords (first 6 rows only; the index goes to 249, and all index entries in "someScores" have an equivalent in "firstWords"):

                               topicwords
0              château-pays-intendant (0)
1                 esclave-palais-race (1)
2                  linge-voisin-chose (2)
3          question-messieurs-réponse (3)
4        prince-princesse-monseigneur (4)
5               arbre-branche-feuille (5)

My merge command:

dataToPlot = pd.concat([someScores, firstWords], axis=1, join="inner")

And the resulting data frame (empty)!

Empty DataFrame
Columns: [blanche, policier, topicwords]
Index: []

I have tried many variants, like using merge instead, or creating extra columns replicating the indexes and then merging on those with left_on and right_on, but then I either get the same result or I just get NaN in the "topicwords" column.

Any hints and help would be greatly appreciated!

An inner join returns only the rows whose index label is present in both dataframes. The row indices shown for someScores (108, 71, 115, 187, 122) and firstWords (0, 1, 2, 3, 4, 5) have no value in common, so the result is an empty dataframe.

Either set these indices correctly or specify different criteria for joining.
You can confirm the problem by checking for common values in both indexes:

someScores.index.intersection(firstWords.index)
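
Here is a minimal sketch of that check, using made-up values rather than the real data. It also illustrates one possible cause that the printed tables cannot rule out. This is an assumption, not something the question confirms: index labels that print identically but have different dtypes, for example strings in one frame and integers in the other. The names some_scores, first_words and the "topic-N" labels are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-ins for the real frames. Assumption: some_scores ended up
# with string labels, while first_words has integer labels 0..249.
some_scores = pd.DataFrame(
    {"blanche": [0.003028, 0.002997], "policier": [0.017494, 0.016956]},
    index=["108", "71"],
)
first_words = pd.DataFrame(
    {"topicwords": [f"topic-{i}" for i in range(250)]},
    index=range(250),
)

# The labels look alike when printed, but "108" (str) is not 108 (int),
# so the two indexes share no label.
common = some_scores.index.intersection(first_words.index)
print(len(common))
print(some_scores.index.dtype, first_words.index.dtype)

# Assumed fix for this scenario: convert the string labels to integers,
# then concatenate exactly as in the question.
some_scores.index = some_scores.index.astype(int)
data_to_plot = pd.concat([some_scores, first_words], axis=1, join="inner")
print(data_to_plot)
```

If the two dtypes printed above already match in your data, this scenario does not apply and the mismatch lies in the label values themselves.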

For other joining strategies, refer to the pandas documentation.
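
As a rough illustration of two such alternatives, the sketch below reuses a trimmed version of the toy frames from the question (with numeric scores instead of strings, which is my own simplification). It shows DataFrame.join, which defaults to a left join on the index, and DataFrame.merge with left_index/right_index. The indicator=True variant is only a suggestion for spotting rows that found no partner.

```python
import pandas as pd

# Trimmed toy data modelled on the question (scores stored as floats here).
scores = pd.DataFrame(
    {"author1": [1.00, 1.50], "author2": [2.75, 1.20]},
    index=[1, 3],
)
words = pd.DataFrame(
    {"words": ["cadavre", "fenêtre", "musique", "mariage"]},
    index=[0, 1, 2, 3],
)

# Left join on the index: every row of scores is kept, and rows without a
# matching label in words would get NaN instead of disappearing.
joined = scores.join(words)

# Index-on-index inner merge, equivalent in intent to the concat call.
merged = scores.merge(words, left_index=True, right_index=True, how="inner")

# indicator=True adds a "_merge" column showing where each row came from,
# which helps to see which labels failed to match.
check = scores.merge(
    words, left_index=True, right_index=True, how="left", indicator=True
)

print(joined)
print(merged)
print(check["_merge"].value_counts())
```

A left join is often easier to debug than an inner join, because unmatched rows stay visible as NaN rather than silently dropping out.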
