How to iterate a pandas df to make another pandas df?

Hi, I have a DataFrame df that has headers like this:

DATE    COL1    COL2   ...    COL10
date1    a       b      
...     ...     ...            ...

and so on        

Basically each row is just a date followed by a bunch of columns on the same row that may or may not contain some text.

From this I want to create a new DataFrame df2 that has one row for each non-blank 'cell' in the original DataFrame, consisting of the date and the text from that cell. From the above example we would get

df2=

DATE    COL
date1    a
date1    b

In pseudocode, what I want to achieve is:

df2 = blank df
for row in df:
    for column in row:
        if cell is not empty:
            append to df2 a row consisting of the date for that row and the value in that cell

So far I have

import pandas as pd
df = pd.read_csv("data2.csv")

output_df = pd.DataFrame(columns=['Date', 'Col'])

Basically I have read in the df and created the new df to begin populating.

Now I am stuck. Some investigation has told me I should not use iterrows(), as it is inefficient and bad practice, and I have 300k+ rows in df.

Any suggestions on how I can do this, please?

Use df.melt:

data = [{'date': f'date{j}', **{f"col{i}": val for i, val in enumerate('abc')}} for j in range(5)]

df = pd.DataFrame(data)

    date col0 col1 col2
0  date0    a    b    c
1  date1    a    b    c
2  date2    a    b    c
3  date3    a    b    c
4  date4    a    b    c


df2 = df.melt(
    id_vars=['date'], 
    value_vars=df.filter(like='col').columns, 
    value_name='Col'
)[['date', 'Col']]


# to get the ordering the way you want; a stable sort keeps the
# col0, col1, col2 order within each date
df2 = df2.sort_values(by='date', kind='stable').reset_index(drop=True)
     date Col
0   date0   a
1   date0   b
2   date0   c
3   date1   a
4   date1   b
5   date1   c
6   date2   a
7   date2   b
8   date2   c
9   date3   a
10  date3   b
11  date3   c
12  date4   a
13  date4   b
14  date4   c

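Then you can filter out any null or empty values from Col. Note that apply(bool) alone would not work here, because bool(NaN) is True and missing cells would be kept:

df2 = df2[df2['Col'].notna() & (df2['Col'] != '')]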
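
As a rough end-to-end sketch of this answer on data that actually contains blanks: the sample frame below is made up for illustration, and it assumes blanks may show up either as NaN (which is what read_csv produces for empty fields by default) or as empty strings.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one blank stored as NaN, one as an empty string.
df = pd.DataFrame({
    "date": ["date0", "date1"],
    "col0": ["a", np.nan],
    "col1": ["b", "c"],
    "col2": ["", "d"],
})

# Wide to long: one row per (date, cell), then keep only the two columns we need.
long_df = df.melt(id_vars=["date"], value_name="Col")[["date", "Col"]]

# Keep cells that are neither NaN nor an empty string.
mask = long_df["Col"].notna() & (long_df["Col"] != "")

# Stable sort so the original column order is kept within each date.
df2 = long_df[mask].sort_values(by="date", kind="stable").reset_index(drop=True)
print(df2)
```

This should leave four rows: a and b for date0, c and d for date1.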

You need to turn the blank cells into NA, i.e.

import numpy as np

df[df == ''] = np.nan

df.melt('DATE').dropna()
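
A minimal runnable sketch of the same idea, assuming the blanks are empty strings. The sample data and the output column name COL are illustrative and not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one blank cell stored as an empty string.
df = pd.DataFrame({
    "DATE": ["date1", "date2"],
    "COL1": ["a", ""],
    "COL2": ["b", "c"],
})

# Turn empty strings into NaN without modifying the original frame.
cleaned = df.replace("", np.nan)

# Melt to long form, drop the rows whose cell was blank, keep DATE and the value.
df2 = (
    cleaned.melt(id_vars="DATE", value_name="COL")
    .dropna(subset=["COL"])[["DATE", "COL"]]
    .reset_index(drop=True)
)
print(df2)
```

If some cells contain only whitespace, something like df.replace(r"^\s*$", np.nan, regex=True) may be a better fit for the first step, though that depends on how the CSV was produced.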

You can join the multiple columns into one list

s = df.filter(like='COL').apply(lambda row: row[row.notna()].tolist(), axis=1)

Then explode on that list

df_ = pd.DataFrame({'DATE':df['DATE'], 'COL': s})
df_ = df_.explode('COL')
print(df_)

    DATE COL
0  date1   a
0  date1   b
1  date2   c
1  date2   d
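
For completeness, here is a self-contained sketch of the list-then-explode approach. The input frame is an assumption chosen to match the printed output, since the answer does not show its input.

```python
import numpy as np
import pandas as pd

# Hypothetical input: blanks are NaN, as read_csv would produce for empty fields.
df = pd.DataFrame({
    "DATE": ["date1", "date2"],
    "COL1": ["a", np.nan],
    "COL2": ["b", "c"],
    "COL3": [np.nan, "d"],
})

# One list of non-null values per row.
s = df.filter(like="COL").apply(lambda row: row[row.notna()].tolist(), axis=1)

# explode gives each list element its own row and repeats the original index.
# A row whose list is empty would become a single NaN row, so a dropna on COL
# may be needed if some dates have no text at all.
df_ = pd.DataFrame({"DATE": df["DATE"], "COL": s}).explode("COL")
print(df_)
```

Note that apply with axis=1 calls a Python function once per row, so on 300k+ rows it is likely to be slower than the melt-based approaches.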
