I have a pandas DataFrame containing the following data, sorted by session_id, then datetime (ascending):
df = df.sort_values(['session_id', 'datetime'], ascending=True)
session_id | source | datetime |
---|---|---|
1 | facebook | 2021-01-23 11:26:34.166000 |
1 | facebook | 2021-01-23 11:26:35.202000 |
2 | NULL/NAN | 2021-01-23 11:05:10.001000 |
2 | twitter | 2021-01-23 11:05:17.289000 |
3 | NULL/NAN | 2021-01-23 13:12:32.914000 |
3 | NULL/NAN | 2021-01-23 13:12:40.883000 |
My desired result is one row per session_id: the first row whose source is non-null, and if all sources are null, the first row overall (the session_id = 3 case):
session_id | source | datetime |
---|---|---|
1 | facebook | 2021-01-23 11:26:34.166000 |
2 | twitter | 2021-01-23 11:05:17.289000 |
3 | NULL/NAN | 2021-01-23 13:12:32.914000 |
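For concreteness, a sample frame like the one above can be built as follows (a sketch assuming the nulls are real NaN values; the source strings 'facebook' and 'twitter' are taken from the answers' printed outputs below):

```python
import numpy as np
import pandas as pd

# Sample data: session 2 has one null source, session 3 is all-null
df = pd.DataFrame({
    "session_id": [1, 1, 2, 2, 3, 3],
    "source": ["facebook", "facebook", np.nan, "twitter", np.nan, np.nan],
    "datetime": pd.to_datetime([
        "2021-01-23 11:26:34.166000", "2021-01-23 11:26:35.202000",
        "2021-01-23 11:05:10.001000", "2021-01-23 11:05:17.289000",
        "2021-01-23 13:12:32.914000", "2021-01-23 13:12:40.883000",
    ]),
})
df = df.sort_values(["session_id", "datetime"], ascending=True)
```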
The functions first_valid_index
and first
each get me only part of the result I want.
The first_valid_index approach drops sessions whose source is entirely null (session_id = 3 is missing):
session_id | source | datetime |
---|---|---|
1 | facebook | 2021-01-23 11:26:34.166000 |
2 | twitter | 2021-01-23 11:05:17.289000 |
x = df.groupby(by="session_id")['source'].transform(pd.Series.first_valid_index)
newdf = df[df.index == x]
The first
approach returns the first non-null value ++but for each column separately++, which mixes values from different rows and is not what I am looking for:
session_id | source | datetime |
---|---|---|
1 | facebook | 2021-01-23 11:26:34.166000 |
2 | twitter | 2021-01-23 11:05:10.001000 |
3 | NULL/NAN | 2021-01-23 13:12:32.914000 |
newdf = df.groupby(by="session_id").first()
I tried to do something like this, but unfortunately it did not work:
df.groupby(by="session_id")['source']
.transform(first if (pd.Series.first_valid_index is None) else pd.Series.first_valid_index)
Do you have any suggestions? ( I am new to pandas, I am still trying to understand the 'logic' behind it )
Thanks in advance for your time.
You can create a 'helper' column like this, then sort and drop_duplicates:
df.assign(sorthelp=df['source'] == 'NULL/NAN')\
.sort_values(['sorthelp','datetime','session_id'])\
.drop_duplicates('session_id')
Output:
session_id source datetime sorthelp
3 2 twitter 2021-01-23 11:05:17.289000 False
0 1 facebook 2021-01-23 11:26:34.166000 False
4 3 NULL/NAN 2021-01-23 13:12:32.914000 True
and you can drop the helper column afterwards
print(df.assign(sorthelp=df['source'] == 'NULL/NAN')
.sort_values(['sorthelp','datetime','session_id'])
.drop_duplicates('session_id')
.drop('sorthelp', axis=1))
Output:
session_id source datetime
3 2 twitter 2021-01-23 11:05:17.289000
0 1 facebook 2021-01-23 11:26:34.166000
4 3 NULL/NAN 2021-01-23 13:12:32.914000
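If the nulls in source are real NaN values rather than the literal string 'NULL/NAN', the same trick works with isna() as the helper (a sketch assuming the NaN-based sample data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "session_id": [1, 1, 2, 2, 3, 3],
    "source": ["facebook", "facebook", np.nan, "twitter", np.nan, np.nan],
    "datetime": pd.to_datetime([
        "2021-01-23 11:26:34.166000", "2021-01-23 11:26:35.202000",
        "2021-01-23 11:05:10.001000", "2021-01-23 11:05:17.289000",
        "2021-01-23 13:12:32.914000", "2021-01-23 13:12:40.883000",
    ]),
})

# Null sources sort last, so the first surviving row per session is the
# earliest non-null one (or the earliest row overall if all are null).
out = (df.assign(sorthelp=df["source"].isna())
         .sort_values(["sorthelp", "datetime", "session_id"])
         .drop_duplicates("session_id")
         .drop("sorthelp", axis=1))
```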
If your data is already sorted by time, you can do:
print(
df.iloc[
df.groupby("session_id")["source"].apply(
            lambda x: x.first_valid_index()
            if x.first_valid_index() is not None
            else x.index[0]
)
]
)
Prints:
session_id source datetime
0 1 facebook 2021-01-23 11:26:34.166000
3 2 twitter 2021-01-23 11:05:17.289000
4 3 NaN 2021-01-23 13:12:32.914000
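Note that apply here returns index labels, so df.iloc only works because the example index is the default RangeIndex, where labels equal positions. With an arbitrary index, df.loc is the safer choice (same logic, sketched against the assumed sample data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "session_id": [1, 1, 2, 2, 3, 3],
    "source": ["facebook", "facebook", np.nan, "twitter", np.nan, np.nan],
    "datetime": pd.to_datetime([
        "2021-01-23 11:26:34.166000", "2021-01-23 11:26:35.202000",
        "2021-01-23 11:05:10.001000", "2021-01-23 11:05:17.289000",
        "2021-01-23 13:12:32.914000", "2021-01-23 13:12:40.883000",
    ]),
})

# One index label per session: first non-null source, else first row
idx = df.groupby("session_id")["source"].apply(
    lambda x: x.first_valid_index()
    if x.first_valid_index() is not None
    else x.index[0]
)
out = df.loc[idx]  # .loc selects by label, robust to a non-default index
```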
Or using the :=
operator (Python 3.8+), to avoid calling first_valid_index twice:
print(
df.iloc[
df.groupby("session_id")["source"].apply(
            lambda x: fi
            if (fi := x.first_valid_index()) is not None
            else x.index[0]
)
]
)
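The same fallback logic can also be expressed without apply: take the first non-null row per session, then top up with the first row of any session that had no non-null source at all (a sketch, again assuming the data is pre-sorted by session_id and datetime):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "session_id": [1, 1, 2, 2, 3, 3],
    "source": ["facebook", "facebook", np.nan, "twitter", np.nan, np.nan],
    "datetime": pd.to_datetime([
        "2021-01-23 11:26:34.166000", "2021-01-23 11:26:35.202000",
        "2021-01-23 11:05:10.001000", "2021-01-23 11:05:17.289000",
        "2021-01-23 13:12:32.914000", "2021-01-23 13:12:40.883000",
    ]),
})

# First row per session with a non-null source
nonnull = df.dropna(subset=["source"]).drop_duplicates("session_id")
# First row per session, used as the fallback for all-null sessions
fallback = df.drop_duplicates("session_id")
missing = ~fallback["session_id"].isin(nonnull["session_id"])
out = pd.concat([nonnull, fallback[missing]]).sort_values("session_id")
```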