I have a DataFrame with a DatetimeIndex, a column I want to group by, and a column containing sets of integers:
import pandas as pd
df = pd.DataFrame([['2018-01-01', 1, {1, 2, 3}],
                   ['2018-01-02', 1, {3}],
                   ['2018-01-03', 1, {3, 4, 5}],
                   ['2018-01-04', 1, {5, 6}],
                   ['2018-01-01', 2, {7}],
                   ['2018-01-02', 2, {8}],
                   ['2018-01-03', 2, {9}],
                   ['2018-01-04', 2, {10}]],
                  columns=['timestamp', 'group', 'ids'])
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)
            group        ids
timestamp
2018-01-01      1  {1, 2, 3}
2018-01-02      1        {3}
2018-01-03      1  {3, 4, 5}
2018-01-04      1     {5, 6}
2018-01-01      2        {7}
2018-01-02      2        {8}
2018-01-03      2        {9}
2018-01-04      2       {10}
Within each group I want to construct a rolling set union over the last X days. Assuming X=3, the result should be:
            group              ids
timestamp
2018-01-01      1        {1, 2, 3}
2018-01-02      1        {1, 2, 3}
2018-01-03      1  {1, 2, 3, 4, 5}
2018-01-04      1     {3, 4, 5, 6}
2018-01-01      2              {7}
2018-01-02      2           {7, 8}
2018-01-03      2        {7, 8, 9}
2018-01-04      2       {8, 9, 10}
From the answer to my previous question I got a good idea of how to do this without the grouping, and so far I have come up with this solution:
grouped = df.groupby('group')
new_df = pd.DataFrame()
for name, group in grouped:
    group['ids'] = [
        set.union(*group['ids'].to_frame().iloc(axis=1)[max(0, i-2): i+1,0])
        for i in range(len(group.index))
    ]
    new_df = new_df.append(group)
This gives the correct result, but it looks quite clumsy and also raises the following warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The documentation at the provided link does not really seem to fit my exact situation, though. (At least I can't make sense of it, in this context.)
My question: How can I improve this code to be clean, performant, and not throw the warning message?
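As mentioned in the docs, don't use pd.DataFrame.append in a loop; doing so will be expensive, because every call copies the data accumulated so far. (DataFrame.append was also deprecated in pandas 1.4 and removed in pandas 2.0.) Instead, collect the pieces in a list and feed that list to pd.concat.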
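The collect-then-concatenate pattern can be sketched in isolation as follows. This is a minimal illustration rather than part of the original answer; the toy frames and the names pieces and combined are made up for the example.

```python
import pandas as pd

# Hypothetical toy pieces standing in for the per-group results.
pieces = []
for key in ['a', 'b', 'c']:
    pieces.append(pd.DataFrame({'key': [key, key], 'value': [1, 2]}))

# A single concatenation at the end copies the data once, instead of
# re-copying the growing frame on every loop iteration.
combined = pd.concat(pieces)
```

Note that pd.concat keeps each piece's original index by default, which is what you want when the index is a timestamp.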
You can avoid SettingWithCopyWarning by creating copies of the data within your list, i.e. avoid chained indexing by using assign + iloc in a list comprehension:
L = [group.assign(ids=[set.union(*group.iloc[max(0, i-2): i+1, -1])
                       for i in range(len(group.index))])
     for _, group in df.groupby('group')]
res = pd.concat(L)
print(res)
            group              ids
timestamp
2018-01-01      1        {1, 2, 3}
2018-01-02      1        {1, 2, 3}
2018-01-03      1  {1, 2, 3, 4, 5}
2018-01-04      1     {3, 4, 5, 6}
2018-01-01      2              {7}
2018-01-02      2           {8, 7}
2018-01-03      2        {8, 9, 7}
2018-01-04      2       {8, 9, 10}
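Note that the code above windows over the last three rows of each group, which coincides with "the last 3 days" only because every group has exactly one row per day. If days can be missing, a window based on the timestamps themselves may be closer to what the question asks for. The following is a rough sketch under that assumption: rolling_union_by_time is a hypothetical helper name, and the window is assumed to be the half-open interval (t - X days, t].

```python
import pandas as pd

df = pd.DataFrame(
    [['2018-01-01', 1, {1, 2, 3}],
     ['2018-01-02', 1, {3}],
     ['2018-01-03', 1, {3, 4, 5}],
     ['2018-01-04', 1, {5, 6}],
     ['2018-01-01', 2, {7}],
     ['2018-01-02', 2, {8}],
     ['2018-01-03', 2, {9}],
     ['2018-01-04', 2, {10}]],
    columns=['timestamp', 'group', 'ids'])
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.set_index('timestamp')


def rolling_union_by_time(group, days):
    """Hypothetical helper: for each row at time t, union the sets of
    all rows in the group whose timestamp lies in (t - days, t]."""
    window = pd.Timedelta(days=days)
    times = group.index
    unions = []
    for t in times:
        # Boolean mask selecting the rows inside the time window.
        mask = (times > t - window) & (times <= t)
        unions.append(set().union(*group.loc[mask, 'ids']))
    # assign returns a new DataFrame, so the group itself is not modified.
    return group.assign(ids=unions)


res_time = pd.concat(
    [rolling_union_by_time(g, days=3) for _, g in df.groupby('group')])
```

On the sample data this produces the same sets as the row-based version. The inner loop compares every row against every other row in its group, so it is quadratic in the group size and is meant as an illustration rather than a tuned implementation.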