Return first row with non-null value; if all null, then return first appearance (Python pandas)
I have a pandas DataFrame containing the following data. The data is sorted by session_id, datetime (ascending):
df = df.sort_values(['datetime','session_id'],ascending=True)
| session_id | source | datetime |
| --- | --- | --- |
| 1 | facebook | 2021-01-23 11:26:34.166000 |
| 1 | twitter | 2021-01-23 11:26:35.202000 |
| 2 | NULL/NAN | 2021-01-23 11:05:10.001000 |
| 2 | twitter | 2021-01-23 11:05:17.289000 |
| 3 | NULL/NAN | 2021-01-23 13:12:32.914000 |
| 3 | NULL/NAN | 2021-01-23 13:12:40.883000 |
My desired result should be the row from each `session_id` with the first non-null value in the `source` column, and if all values are null, then the first appearance (case id = 3):
| session_id | source | datetime |
| --- | --- | --- |
| 1 | facebook | 2021-01-23 11:26:34.166000 |
| 2 | twitter | 2021-01-23 11:05:17.289000 |
| 3 | NULL/NAN | 2021-01-23 13:12:32.914000 |
The functions `first_valid_index` and `first` give me, to some extent, the results I want.

The `first_valid_index` approach:
| session_id | source | datetime |
| --- | --- | --- |
| 1 | facebook | 2021-01-23 11:26:34.166000 |
| 2 | twitter | 2021-01-23 11:05:17.289000 |
x = df.groupby(by="session_id")['om_source'].transform(pd.Series.first_valid_index)
newdf = df[df.index == x]
The `first` approach:
It returns the first non-null value, but for each column separately, which is not what I am looking for:
| session_id | source | datetime |
| --- | --- | --- |
| 1 | facebook | 2021-01-23 11:26:34.166000 |
| 2 | twitter | 2021-01-23 11:05:10.001000 |
| 3 | NULL/NAN | 2021-01-23 13:12:32.914000 |
newdf = df.groupby(by="session_id").first()
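The per-column behaviour can be seen in a minimal sketch (a made-up two-column frame, not the question's data): `first()` picks the first non-null value of each column independently, so the output row can mix values taken from different input rows.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "g": [1, 1],
    "a": [np.nan, "x"],   # first non-null in 'a' is in the second row
    "b": ["p", "q"],      # first non-null in 'b' is in the first row
})

# first() works column by column: 'a' comes from row 1, 'b' from row 0,
# so the resulting row never existed as a whole in the input.
print(df.groupby("g").first())
```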
I tried to do something like this, but unfortunately it did not work:
df.groupby(by="session_id")['om_source']
.transform(first if ( pd.Series.first_valid_index is None ) else pd.Series.first_valid_index)
Do you have any suggestions? (I am new to pandas, and I am still trying to understand the 'logic' behind it.)

Thanks in advance for your time.
You can create a 'helper' column like this, then sort and drop_duplicates:
df.assign(sorthelp=df['source'] == 'NULL/NAN')\
.sort_values(['sorthelp','datetime','session_id'])\
.drop_duplicates('session_id')
Output:
session_id source datetime sorthelp
3 2 twitter 2021-01-23 11:05:17.289000 False
0 1 facebook 2021-01-23 11:26:34.166000 False
4 3 NULL/NAN 2021-01-23 13:12:32.914000 True
and you can drop the helper column afterwards:
print(df.assign(sorthelp=df['source'] == 'NULL/NAN')
.sort_values(['sorthelp','datetime','session_id'])
.drop_duplicates('session_id')
.drop('sorthelp', axis=1))
Output:
session_id source datetime
3 2 twitter 2021-01-23 11:05:17.289000
0 1 facebook 2021-01-23 11:26:34.166000
4 3 NULL/NAN 2021-01-23 13:12:32.914000
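If `source` holds real missing values (`NaN`, as in the output further below) rather than the literal string `'NULL/NAN'`, the same trick works with `isna()` as the helper — a sketch under that assumption:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "session_id": [1, 1, 2, 2, 3, 3],
    "source": ["facebook", "twitter", np.nan, "twitter", np.nan, np.nan],
    "datetime": pd.to_datetime([
        "2021-01-23 11:26:34.166000",
        "2021-01-23 11:26:35.202000",
        "2021-01-23 11:05:10.001000",
        "2021-01-23 11:05:17.289000",
        "2021-01-23 13:12:32.914000",
        "2021-01-23 13:12:40.883000",
    ]),
})

# Rows with a non-null source sort first within each session (False < True),
# so drop_duplicates keeps the earliest non-null row, or the earliest row
# overall when every source in the session is NaN.
result = (df.assign(sorthelp=df["source"].isna())
            .sort_values(["sorthelp", "datetime", "session_id"])
            .drop_duplicates("session_id")
            .drop("sorthelp", axis=1)
            .sort_values("session_id"))
print(result)
```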
If your time is already sorted, you can do:
print(
df.iloc[
df.groupby("session_id")["source"].apply(
lambda x: x.first_valid_index()
if not x.first_valid_index() is None
else x.index[0]
)
]
)
Prints:
session_id source datetime
0 1 facebook 2021-01-23 11:26:34.166000
3 2 twitter 2021-01-23 11:05:17.289000
4 3 NaN 2021-01-23 13:12:32.914000
Or using the `:=` operator (Python 3.8+):
print(
df.iloc[
df.groupby("session_id")["source"].apply(
lambda x: fi
if not (fi := x.first_valid_index()) is None
else x.index[0]
)
]
)