
Pandas Column of Lists to Separate Rows

I've got a dataframe that contains analysed news articles, with each row referencing an article and columns holding some information about that article (e.g. tone). One column of that df contains a list of FIPS country codes of the locations that were mentioned in that article.

I want to "extract" these country codes so that I get a dataframe in which each mentioned location has its own row, along with the other columns of the original row in which that location was referenced. There will be multiple rows with the same information but different locations, as the same article may mention multiple locations.

I tried something like this, but iterrows() is notoriously slow. Is there a faster, more efficient way to do this? Thanks a lot.

  • 'events' is the column that contains the locations
  • 'event_cols' are the columns from the original df that I want to retain in the new df
  • 'df_events' is the new data frame
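import pandas as pd

df_events = pd.DataFrame()  # accumulates one row per (article, location)
for i, row in df.iterrows():
    for location in row['events']:
        try:
            # one-row frame holding the columns to retain
            df_storage = pd.DataFrame(row[event_cols]).T
            df_storage['loc'] = location
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            df_events = pd.concat([df_events, df_storage])
        except ValueError:
            continue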

I would group the DataFrame with groupby() on the columns you want to retain, expand the lists with a combination of apply and a lambda function, and then reset the index and drop the level column that this creates to clean up the resulting DataFrame.

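# Group by the columns to retain; each group must identify exactly one article,
# because x.values[0] expands only the first list in the group.
df_events = df.groupby(['event_col1', 'event_col2', 'event_col3'])['events']\
                 .apply(lambda x: pd.DataFrame(x.values[0]))\
                 .reset_index().drop('level_3', axis=1)  # level_3 is the leftover inner index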
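To see what the snippet above produces, here is a minimal, self-contained sketch on a made-up two-article frame. The column names article_id and tone and the sample values are assumptions for illustration only. Note that with two grouping columns the leftover index column is named level_2 rather than level_3, and that the expanded values land in a column labelled 0, which is renamed to 'loc' here.

```python
import pandas as pd

# Hypothetical sample data: two articles, each with a list of country codes.
df = pd.DataFrame({
    "article_id": [1, 2],
    "tone": [0.5, -1.2],
    "events": [["US", "UK"], ["FR"]],
})

# Columns to retain (assumed names, not from the original post).
event_cols = ["article_id", "tone"]

df_events = (
    df.groupby(event_cols)["events"]
    # Each group holds one article, so x.values[0] is that article's list.
    .apply(lambda x: pd.DataFrame(x.values[0]))
    .reset_index()
    # The inner index of each small frame becomes level_<number of group keys>.
    .drop(f"level_{len(event_cols)}", axis=1)
    # The expanded values sit in a column labelled with the integer 0.
    .rename(columns={0: "loc"})
)

print(df_events)
```

Each location ends up on its own row, with the article's other columns repeated alongside it.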

In general, I always try to find a way to use apply() before most other methods, because it is often much faster than iterating over each row.
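If your pandas version is 0.25 or later, DataFrame.explode may be a more direct option. This is a different technique from the groupby/apply approach described in this answer, not a variant of it. The sketch below uses made-up data and assumed column names (article_id, tone); only 'events', 'event_cols', 'df_events' and 'loc' come from the question.

```python
import pandas as pd

# Hypothetical sample data; the third article mentions no locations.
df = pd.DataFrame({
    "article_id": [1, 2, 3],
    "tone": [0.5, -1.2, 0.0],
    "events": [["US", "UK"], ["FR"], []],
})

# Columns to retain (assumed names, not from the original post).
event_cols = ["article_id", "tone"]

df_events = (
    df[event_cols + ["events"]]
    .explode("events")                  # one row per list element
    .dropna(subset=["events"])          # empty lists explode to NaN; drop them
    .rename(columns={"events": "loc"})  # match the 'loc' column from the question
    .reset_index(drop=True)             # explode repeats the original index
)

print(df_events)
```

Unlike the groupby version, this does not require the retained columns to identify each article uniquely, since every row is expanded on its own.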

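Notice: Technical posts on this site are published under the CC BY-SA 4.0 license. If you repost this content, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.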

 