
Pandas copy values from another dataframe into my dataframe

I have 2 dataframes: in df_mentions I have URLs, and in media I have information about some journals. I need to keep df_mentions up to date using the information contained in media.

import pandas as pd

Mentions=['https://www.lemonde.fr/football/article/2019/07/08/coupe-du-monde-feminine-2109-au-sein-de-chaque-equipe-j-ai-vu-de-grandes-joueuses_5486741_1616938.html','https://www.telegraph.co.uk/world-cup/2019/06/12/womens-world-cup-2019-groups-complete-guide-teams-players-rankings/','https://www.washingtonpost.com/sports/dcunited/us-womens-world-cup-champs-arrive-home-ahead-of-parade/2019/07/08/48df1a84-a1e3-11e9-a767-d7ab84aef3e9_story.html?utm_term=.8f474bba8a1a']
Date=['08/07/2019','08/07/2019','08/07/2019']
Source=['','','']
Country=['','','']
Foundation=['','','']
Is_in_media=['','','']
df_mentions=pd.DataFrame()
df_mentions['Mentions']=Mentions
df_mentions['Date']=Date
df_mentions['Source']=Source
df_mentions['Country']=Country
df_mentions['Foundation']=Foundation
df_mentions['Is_in_media']=Is_in_media

Source=['New York times','Lemonde','Washington Post']
Link=['https://www.nytimes.com/','https://www.lemonde.fr/','https://www.washingtonpost.com/']
Country=['USA','France','USA']
Foundation=['1851','1944','1877']
media=pd.DataFrame()
media['Source']=Source
media['Link']=Link
media['Country']=Country
media['Foundation']=Foundation
media

They look like this (but with close to 1,000 rows per day):

(screenshot of df_mentions)

(screenshot of media)

I need to check whether the source of each link is contained in media, extract the matching data from it to fill df_mentions, and get the following result:

Expected result: (screenshot)

What I have done is:

for index in range(0, len(media)):
    for index2 in range(0, len(df_mentions)):
        if str(media['Link'][index]) in str(df_mentions['Mentions'][index2]):
            # .loc avoids the chained assignment df['col'][i] = ...,
            # which pandas warns about and may silently discard
            df_mentions.loc[index2, 'Source'] = media['Source'][index]
            df_mentions.loc[index2, 'Country'] = media['Country'][index]
            df_mentions.loc[index2, 'Foundation'] = media['Foundation'][index]
            df_mentions.loc[index2, 'Is_in_media'] = 'Yes'
        else:
            df_mentions.loc[index2, 'Is_in_media'] = 'No'
df_mentions

But it only runs once in my notebook, and it gives me errors if I close the notebook and rerun it. I am using pandas 0.24.0. Is there a better way to do this that works every time?

Thanks in advance! Any help will be appreciated!

One thing you can do is extract the base URL from df_mentions and use it as the key for a merge.

Starting data (with the empty columns removed from df_mentions):

print(df_mentions)
                                            Mentions        Date
0  https://www.lemonde.fr/football/article/2019/0...  08/07/2019
1  https://www.telegraph.co.uk/world-cup/2019/06/...  08/07/2019
2  https://www.washingtonpost.com/sports/dcunited...  08/07/2019

print(media)
            Source                             Link Country Foundation
0   New York times         https://www.nytimes.com/     USA       1851
1          Lemonde          https://www.lemonde.fr/  France       1944
2  Washington Post  https://www.washingtonpost.com/     USA       1877

Create a new column containing the base URL:

df_mentions['url'] = df_mentions['Mentions'].str.extract(r'(http[s]?:\/\/.+?\/)')

   Mentions                                   Date        url
0  https://www.lemonde.fr/football/articl...  08/07/2019  https://www.lemonde.fr/
1  https://www.telegraph.co.uk/world-cup/...  08/07/2019  https://www.telegraph.co.uk/
2  https://www.washingtonpost.com/sports/...  08/07/2019  https://www.washingtonpost.com/
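As an alternative to the regex, the standard library's urllib.parse can derive the same base URL; a minimal sketch, assuming the same `Mentions` data as above:

```python
from urllib.parse import urlsplit

import pandas as pd

# Two of the sample URLs from the question.
mentions = pd.Series([
    'https://www.lemonde.fr/football/article/2019/07/08/'
    'coupe-du-monde-feminine-2109-au-sein-de-chaque-equipe'
    '-j-ai-vu-de-grandes-joueuses_5486741_1616938.html',
    'https://www.telegraph.co.uk/world-cup/2019/06/12/'
    'womens-world-cup-2019-groups-complete-guide-teams-players-rankings/',
])

def base_url(u):
    # Keep only scheme and host, with a trailing slash to match media['Link'].
    parts = urlsplit(u)
    return f'{parts.scheme}://{parts.netloc}/'

urls = mentions.map(base_url)
print(urls.tolist())
# → ['https://www.lemonde.fr/', 'https://www.telegraph.co.uk/']
```

A parser is a bit more robust than the regex if a URL ever lacks a trailing path, though for well-formed links both give the same key.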

Use that new column as the key when merging:

df_mentions.merge(media,
                  left_on='url',
                  right_on='Link',
                  how='left').drop(columns=['url', 'Link'])

   Mentions                                Date        Source           Country Foundation
0  https://www.lemonde.fr/football/art...  08/07/2019  Lemonde          France  1944     
1  https://www.telegraph.co.uk/world-c...  08/07/2019  NaN              NaN     NaN      
2  https://www.washingtonpost.com/spor...  08/07/2019  Washington Post  USA     1877 
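To also fill the Is_in_media column from the expected output, `merge` can be called with `indicator=True`, which adds a `_merge` column saying whether each row found a match; a sketch using minimal frames with the same key columns as above:

```python
import pandas as pd

# Just the merge keys, named as in the answer.
df_mentions = pd.DataFrame({
    'url': ['https://www.lemonde.fr/',
            'https://www.telegraph.co.uk/',
            'https://www.washingtonpost.com/']})
media = pd.DataFrame({
    'Link': ['https://www.nytimes.com/',
             'https://www.lemonde.fr/',
             'https://www.washingtonpost.com/'],
    'Source': ['New York times', 'Lemonde', 'Washington Post']})

merged = df_mentions.merge(media, left_on='url', right_on='Link',
                           how='left', indicator=True)
# _merge is 'both' for matched rows and 'left_only' for unmatched ones.
merged['Is_in_media'] = merged['_merge'].astype(str).map(
    {'both': 'Yes', 'left_only': 'No'})
merged = merged.drop(columns=['_merge', 'Link'])
print(merged['Is_in_media'].tolist())
# → ['Yes', 'No', 'Yes']
```

This replaces the else branch of the original loop, and unlike the loop it cannot overwrite an earlier 'Yes' with a later 'No'.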
