
Pandas merge columns with different names

I am trying to merge two spreadsheets with the pandas merge function. I want to combine the columns ID & id, TrackName & name, ArtistName & artists, Danceability & danceability, etc. from the 2018 and 2019 spreadsheets.

Here is the code that I tried to use when merging:

pd.merge(df, df2, left_on=  ['TrackName', 'ArtistName','ID'],
            right_on= ['name', 'artists','id'])

However, I always get an error saying that I can't merge on int64 and object columns. I'm not sure how to use concat to merge these columns together, so could someone help me out?

Also, even when I use merge on only the object columns and not the ID, like this:

pd.merge(df, df2, left_on=  ['TrackName', 'ArtistName'],
            right_on= ['name', 'artists'])

it still doesn't work and the columns don't merge properly. I'm not sure what I am doing wrong. I'd really appreciate some help if possible!

Here are the spreadsheets: link

pandas.merge() is a function designed to produce SQL-style joins between tables on key columns, much like joins on primary and foreign keys in SQL databases. See Difference Between Primary and Foreign Key.

The problem here is that the key columns you are merging on have different dtypes (use df.dtypes to see the types of all columns in your DataFrames). merge matches rows by comparing the key values of the left DataFrame with those of the right DataFrame. It refuses to compare an int64 column with an object column holding strings, so it raises an error.
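To make the dtype mismatch concrete, here is a minimal sketch that reproduces the error. The two small DataFrames and their values are invented for illustration and only mimic the column names from the question: an integer ID on one side and a string id on the other.

```python
import pandas as pd

# Toy data (made up): integer IDs on the left, string IDs on the right.
left = pd.DataFrame({"ID": [1, 2], "TrackName": ["Song A", "Song B"]})
right = pd.DataFrame({"id": ["1", "3"], "name": ["Song A", "Song C"]})

# Inspect the dtypes of both frames before merging.
print(left.dtypes)
print(right.dtypes)

# Merging an integer key against a string key raises ValueError.
try:
    pd.merge(left, right, left_on="ID", right_on="id")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Checking df.dtypes on both sides before calling merge is usually the quickest way to spot this kind of problem.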

merge is also available as a pd.DataFrame method, in which case the DataFrame you call it on acts as the left side of the join. See documentation pd.DataFrame.merge

The error message also recommends using pandas.concat, since it sees that the dtypes are different and assumes you may simply want to stack the two DataFrames. That can be preferable if no existing records share the same TrackName and Artist (for example). In that case you would join them with a concat, because neither DataFrame adds information about a record in the other.
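As a rough sketch of the concat route, assuming the two years contain different songs: rename the 2019 columns to the 2018 names, then stack the rows. The data below is invented, and the year column is a hypothetical addition so you can tell the rows apart afterwards.

```python
import pandas as pd

# Toy data (made up) using the column names from the question.
df_2018 = pd.DataFrame({"name": ["Song A"], "artists": ["Artist X"]})
df_2019 = pd.DataFrame({"TrackName": ["Song B"], "ArtistName": ["Artist Y"]})

# Give the 2019 columns the same names as the 2018 ones.
aligned_2019 = df_2019.rename(columns={"TrackName": "name", "ArtistName": "artists"})

# Stack the rows; "year" is a hypothetical helper column, not in the original data.
stacked = pd.concat(
    [df_2018.assign(year=2018), aligned_2019.assign(year=2019)],
    ignore_index=True,
)
print(stacked)
```

Columns that exist in only one of the frames are kept and filled with NaN for the rows that lack them.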

My recommendation is: rename the columns in the 2019 DataFrame to match those in the 2018 DataFrame wherever they refer to the same attribute (you can use pd.DataFrame.rename). Then change the dtype of the columns you want to merge on and make sure they are the same. Finally, try an outer join with the merge function, using the song name, for example. You will see whether there are matches, or whether the two DataFrames have no records in common.
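The steps above could look roughly like this. It is only a sketch on invented data: the values are made up, and whether id and ID really describe the same attribute in the actual spreadsheets is an assumption you would need to check. The indicator=True argument adds a _merge column that shows which rows matched.

```python
import pandas as pd

# Toy data (made up): "id" holds strings in 2018, "ID" holds integers in 2019.
df_2018 = pd.DataFrame(
    {"id": ["1", "2"], "name": ["Song A", "Song B"], "artists": ["X", "Y"]}
)
df_2019 = pd.DataFrame(
    {"ID": [1, 3], "TrackName": ["Song A", "Song C"], "ArtistName": ["X", "Z"]}
)

# Step 1: rename the 2019 columns to the 2018 names.
df_2019 = df_2019.rename(
    columns={"ID": "id", "TrackName": "name", "ArtistName": "artists"}
)

# Step 2: make the dtypes agree (here both id columns become strings).
df_2018["id"] = df_2018["id"].astype(str)
df_2019["id"] = df_2019["id"].astype(str)

# Step 3: outer join on song name and artist; non-key columns with the
# same name (id) get a suffix, and _merge records where each row came from.
merged = pd.merge(
    df_2018,
    df_2019,
    on=["name", "artists"],
    how="outer",
    suffixes=("_2018", "_2019"),
    indicator=True,
)
print(merged)
```

Rows marked both in _merge appear in both years, while left_only and right_only rows appear in just one. If every row is left_only or right_only, concat is probably the better tool.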

You are not able to merge on ID because ID is of object dtype in one table and int64 in the other:

df_2018.dtypes

id                   object
name                 object
artists              object

df_2019.dtypes

ID                 int64
TrackName         object
ArtistName        object

I then tried merging the two tables on 'name' and 'artists', and that worked. Here is the code:

new_df = pd.merge(df_2018, df_2019, left_on=['name','artists'], right_on = ['TrackName','ArtistName'])

new_df.columns

Index(['id', 'name', 'artists', 'danceability', 'energy', 'key', 'loudness',
       'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'duration_ms', 'time_signature', 'ID', 'TrackName',
       'ArtistName', 'Genre', 'BeatsPerMinute', 'Energy', 'Danceability',
       'LoudnessdB', 'Liveness', 'Valence', 'Length', 'Acousticness',
       'Speechiness', 'Popularity'],
      dtype='object')

I could get all the columns as desired. Let me know if you are still facing any issues, and do share the columns that are causing a problem.
