I have a dataframe with 102377 rows, which looks like the following:
Queries
term timestamp
...
queryA 2018-09-27 18:26:47
queryB 2018-09-27 18:26:52
2547 2018-09-27 18:26:58
queryX 2018-09-28 14:29:49
queryP 2018-09-28 14:30:00
2157 2018-09-28 14:30:01
queryA 2018-09-29 10:14:15
queryY 2018-09-29 10:14:19
queryX 2018-09-30 12:20:40
queryP 2018-09-30 12:22:00
queryA 2018-09-30 12:22:01
queryU 2018-09-30 12:26:08
13324 2018-09-30 12:30:00
...
I want to create itemsets of terms from the dataframe. I would like to proceed as follows: starting at the last term, if a term is numeric, a new itemset should be created. If the preceding term is not numeric and the time delta is under 10 minutes, that term belongs in the itemset. In the end it should look like this:
itemsets
index terms
0 13324; queryU; queryA; queryP; queryX
1 2157; queryP; queryX
2 2547; queryB; queryA
We can try Series.str.isnumeric and DataFrame.pivot_table:
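# Reverse the rows so that each numeric term comes first in its group
df2 = df[::-1]
# The cumulative count of numeric terms labels the groups; join the terms of each group
new_df = df2.pivot_table(index=df2['term'].str.isnumeric().cumsum(),
                         values='term',
                         aggfunc='; '.join)
print(new_df)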
term
term
1 13324; queryU; queryA; queryP; queryX; queryY;...
2 2157; queryP; queryX
3 2547; queryB; queryA
We can also use
pd.to_numeric(df2['term'], errors='coerce').notna().cumsum()
instead of Series.str.isnumeric.
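Note that the grouping above only looks at whether a term is numeric, so it does not apply the 10-minute rule from the question (queryY ends up in the first itemset). A minimal sketch of one way to add that rule follows. It assumes the time delta is measured between consecutive rows, and that once a gap of 10 minutes or more appears, all earlier terms are dropped until the next numeric term; the question does not state this explicitly. The sample frame and the column names itemset and broken are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question (only the rows shown there)
df = pd.DataFrame({
    "term": ["queryA", "queryB", "2547", "queryX", "queryP", "2157",
             "queryA", "queryY", "queryX", "queryP", "queryA", "queryU",
             "13324"],
    "timestamp": pd.to_datetime([
        "2018-09-27 18:26:47", "2018-09-27 18:26:52", "2018-09-27 18:26:58",
        "2018-09-28 14:29:49", "2018-09-28 14:30:00", "2018-09-28 14:30:01",
        "2018-09-29 10:14:15", "2018-09-29 10:14:19", "2018-09-30 12:20:40",
        "2018-09-30 12:22:00", "2018-09-30 12:22:01", "2018-09-30 12:26:08",
        "2018-09-30 12:30:00"]),
})

# Walk from the last row to the first, as described in the question
df2 = df[::-1].reset_index(drop=True)
is_num = df2["term"].str.isnumeric()
group = is_num.cumsum()  # each numeric term starts a new group

# Time between a row and the row that follows it chronologically
# (which is the previous row in the reversed frame)
gap = df2["timestamp"].shift() - df2["timestamp"]

# Assumption: a gap of 10 minutes or more breaks the chain for the rest of the group
too_far = (gap >= pd.Timedelta(minutes=10)) & ~is_num
broken = too_far.astype(int).groupby(group).cummax().astype(bool)

df2 = df2.assign(itemset=group, broken=broken)
kept = df2[(df2["itemset"] > 0) & ~df2["broken"]]

itemsets = (kept.groupby("itemset")["term"]
                .agg("; ".join)
                .reset_index(drop=True)
                .rename("terms"))
print(itemsets)
```

On the sample rows this drops queryY and the queryA from 2018-09-29, since they are more than 10 minutes away from the next term and no numeric term follows them within the window.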