
How do I combine multiple rows of a CSV that share data into one row using Pandas?

I have downloaded the ASCAP database, giving me a CSV that is too large for Excel to handle. I'm able to chunk the CSV to open parts of it, but the data isn't very helpful in its default format. Each song title has 3+ rows associated with it:

The first row includes the % share that ASCAP has in that song. The rows after that include a character code (ROLE_TYPE) that indicates whether that row contains the writer or the performer of that song. The first column of each row contains the song title.

This structure makes the data confusing: the rows that list the % share have blank cells in the NAME column, because those rows have no Writer/Performer associated with them.

What I would like to do is transform this data from 3+ rows per song to 1 row per song containing all the relevant data.

So instead of:

TITLE, ROLE_TYPE, NAME, SHARES, NOTE

I would like to change the data to:

TITLE, WRITER, PERFORMER, SHARES, NOTE

Here is a sample of the data:

TITLE,ROLE_TYPE,NAME,SHARES,NOTE
SCORE MORE,ASCAP,Total Current ASCAP Share,100,
SCORE MORE,W,SMITH ANTONIO RENARD,,
SCORE MORE,P,SMITH SHOW PUBLISHING,,
PEOPLE KNO,ASCAP,Total Current ASCAP Share,100,
PEOPLE KNO,W,SMITH ANTONIO RENARD,,
PEOPLE KNO,P,SMITH SHOW PUBLISHING,,
FEEDBACK,ASCAP,Total Current ASCAP Share,100,
FEEDBACK,W,SMITH ANTONIO RENARD,,

I would like the data to look like:

TITLE,WRITER,PERFORMER,SHARES,NOTE
SCORE MORE,SMITH ANTONIO RENARD,SMITH SHOW PUBLISHING,100,
PEOPLE KNO,SMITH ANTONIO RENARD,SMITH SHOW PUBLISHING,100,
FEEDBACK,SMITH ANTONIO RENARD,SMITH SHOW PUBLISHING,100,

I'm using Python/pandas to work with the data. I am able to use groupby('TITLE') to group rows with matching titles.

import pandas as pd

data = pd.read_csv("COMMA_ASCAP_TEXT.txt", low_memory=False)

title_grouped = data.groupby('TITLE')

for TITLE,group in title_grouped:
  print(TITLE)
  print(group)

I was able to group the rows of each song with groupby('TITLE'), and the output I get seems close to what I want:

SCORE MORE
   TITLE          ROLE_TYPE  NAME                        SHARES    NOTE
0  SCORE MORE     ASCAP      Total Current ASCAP Share   100.0     NaN
1  SCORE MORE         W      SMITH ANTONIO RENARD        NaN       NaN
2  SCORE MORE         P      SMITH SHOW PUBLISHING       NaN       NaN 

What do I need to do to take this group and produce a single row in a CSV file with all the data related to each song?

I would recommend:

  • Decompose the data by ROLE_TYPE
  • Prepare the data for the merge (rename columns and drop unnecessary columns)
  • Merge everything back into one DataFrame

When no key is specified, merge joins on the columns that have the same name in both DataFrames being merged (only TITLE in this case).
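To illustrate the default key selection, here is a minimal sketch with two tiny hypothetical DataFrames built from rows in the question's sample. With no `on` argument, `merge` joins on the columns both frames share, which here is only TITLE. Passing `on="TITLE"` explicitly should give the same result and makes the intent clearer.

```python
import pandas as pd

# Hypothetical miniature frames built from the sample rows in the question
shares = pd.DataFrame({"TITLE": ["SCORE MORE", "FEEDBACK"], "SHARES": [100, 100]})
writers = pd.DataFrame({
    "TITLE": ["SCORE MORE", "FEEDBACK"],
    "WRITER": ["SMITH ANTONIO RENARD", "SMITH ANTONIO RENARD"],
})

# No `on` given: pandas joins on the shared column(s), here just TITLE
implicit = shares.merge(writers, how="left")

# Naming the key explicitly is equivalent and guards against accidental matches
explicit = shares.merge(writers, how="left", on="TITLE")

print(implicit)
```

If the frames ever shared a second column name, the implicit form would join on both columns, so the explicit key is probably the safer habit.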

The code below seems to work nicely :)

import pandas as pd

data = pd.read_csv("data2.csv", sep=",")

# Create 3 individual DataFrames for different roles
data_ascap = data[data["ROLE_TYPE"] == "ASCAP"].copy()
data_writer = data[data["ROLE_TYPE"] == "W"].copy()
data_performer = data[data["ROLE_TYPE"] == "P"].copy()

# Remove unnecessary columns for ASCAP role
data_ascap.drop(["ROLE_TYPE", "NAME"], axis=1, inplace=True)

# Rename columns and remove unnecessary columns for WRITER role
data_writer.rename(columns={"NAME": "WRITER"}, inplace=True)
data_writer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)

# Rename columns and remove unnecessary columns for PERFORMER role
data_performer.rename(columns={"NAME": "PERFORMER"}, inplace=True)
data_performer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)

# Merge all together
result = data_ascap.merge(data_writer, how="left")
result = result.merge(data_performer, how="left")

# Print result
print(result)
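
The question mentions 3+ rows per song, so a title may list several writers or performers. In that case the merges above would produce one row per writer/performer combination. A possible variation, sketched here as an assumption rather than as part of the original answer, joins the names for each role into one string first and then writes the result to a CSV file. The `"; "` separator, the `collapse_roles` helper name, the extra `DOE JANE` row, and the output filename are all hypothetical choices made for illustration.

```python
import io

import pandas as pd

# Sample rows from the question, plus one hypothetical second writer
# (DOE JANE) added only to show the multi-name case
sample = io.StringIO(
    "TITLE,ROLE_TYPE,NAME,SHARES,NOTE\n"
    "SCORE MORE,ASCAP,Total Current ASCAP Share,100,\n"
    "SCORE MORE,W,SMITH ANTONIO RENARD,,\n"
    "SCORE MORE,W,DOE JANE,,\n"
    "SCORE MORE,P,SMITH SHOW PUBLISHING,,\n"
    "FEEDBACK,ASCAP,Total Current ASCAP Share,100,\n"
    "FEEDBACK,W,SMITH ANTONIO RENARD,,\n"
)
data = pd.read_csv(sample)


def collapse_roles(df, sep="; "):
    """Return one row per TITLE with WRITER, PERFORMER, SHARES and NOTE."""
    # Share and note information lives on the ASCAP rows
    shares = df.loc[df["ROLE_TYPE"] == "ASCAP", ["TITLE", "SHARES", "NOTE"]]

    # Join all names for each (title, role) pair, then spread roles into columns
    names = (
        df[df["ROLE_TYPE"].isin(["W", "P"])]
        .dropna(subset=["NAME"])
        .groupby(["TITLE", "ROLE_TYPE"], sort=False)["NAME"]
        .agg(sep.join)
        .unstack("ROLE_TYPE")
        .reindex(columns=["W", "P"])  # keep both columns even if a role is absent
        .rename(columns={"W": "WRITER", "P": "PERFORMER"})
        .reset_index()
    )

    merged = shares.merge(names, how="left", on="TITLE")
    return merged[["TITLE", "WRITER", "PERFORMER", "SHARES", "NOTE"]]


result = collapse_roles(data)
result.to_csv("ascap_one_row_per_song.csv", index=False)  # hypothetical filename
print(result)
```

Like the merge approach, this sketch treats TITLE as a unique key, so different songs that happen to share a title would be combined into one row.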

