Creating itemsets from a pandas DataFrame with Python

I have a dataframe with 102377 rows, which looks like the following:

Queries
        term                     timestamp
        ...
        queryA                   2018-09-27 18:26:47
        queryB                   2018-09-27 18:26:52
        2547                     2018-09-27 18:26:58
        queryX                   2018-09-28 14:29:49
        queryP                   2018-09-28 14:30:00
        2157                     2018-09-28 14:30:01
        queryA                   2018-09-29 10:14:15
        queryY                   2018-09-29 10:14:19
        queryX                   2018-09-30 12:20:40
        queryP                   2018-09-30 12:22:00
        queryA                   2018-09-30 12:22:01
        queryU                   2018-09-30 12:26:08
        13324                    2018-09-30 12:30:00
        ...

I want to create itemsets of terms from the dataframe. I would like to proceed as follows: starting at the last term and moving backwards, a numeric term starts a new itemset. If the preceding term is not numeric and the time difference is under 10 minutes, that term belongs to the current itemset. In the end it should look like this:

itemsets
 index    terms
 0        13324; queryU; queryA; queryP; queryX
 1        2157; queryP; queryX
 2        2547; queryB; queryA
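
The procedure described above can be sketched as a plain backward loop. This is only an illustration, not a tuned solution: the helper name build_itemsets and the max_gap parameter are hypothetical, the sample rows are the ones shown in the question, and "timedelta under 10 minutes" is assumed to mean the gap between two consecutive rows.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "term": ["queryA", "queryB", "2547", "queryX", "queryP", "2157",
             "queryA", "queryY", "queryX", "queryP", "queryA", "queryU",
             "13324"],
    "timestamp": pd.to_datetime([
        "2018-09-27 18:26:47", "2018-09-27 18:26:52", "2018-09-27 18:26:58",
        "2018-09-28 14:29:49", "2018-09-28 14:30:00", "2018-09-28 14:30:01",
        "2018-09-29 10:14:15", "2018-09-29 10:14:19", "2018-09-30 12:20:40",
        "2018-09-30 12:22:00", "2018-09-30 12:22:01", "2018-09-30 12:26:08",
        "2018-09-30 12:30:00",
    ]),
})


def build_itemsets(frame, max_gap=pd.Timedelta(minutes=10)):
    """Walk the rows from last to first and collect itemsets (hypothetical helper)."""
    itemsets = []
    current = None    # itemset currently being filled, or None if none is open
    prev_time = None  # timestamp of the row visited just before this one
    for term, ts in zip(frame["term"].iloc[::-1], frame["timestamp"].iloc[::-1]):
        term = str(term)
        if term.isnumeric():
            # A numeric term always opens a new itemset.
            current = [term]
            itemsets.append(current)
        elif current is not None and prev_time - ts < max_gap:
            # Non-numeric term close enough in time: add it to the open itemset.
            current.append(term)
        else:
            # Gap too large (or no itemset open): close the current itemset.
            current = None
        prev_time = ts
    return pd.DataFrame({"terms": ["; ".join(items) for items in itemsets]})


result = build_itemsets(df)
print(result)
```

On the sample rows this yields the three itemsets listed above; queryY and the queryA before it are skipped because they are more than 10 minutes away from the next later row.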

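We can try Series.str.isnumeric with DataFrame.pivot_table. Note that this groups on numeric terms only and does not apply the 10-minute rule, which is why queryY still appears in the first itemset of the output: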

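# Reverse the rows so that we walk from the last term to the first.
df2 = df[::-1]
# Every numeric term bumps the running count, so the cumulative sum
# labels each numeric term and the non-numeric terms that follow it.
new_df = df2.pivot_table(index=df2['term'].str.isnumeric().cumsum(),
                         values='term',
                         aggfunc='; '.join)
print(new_df)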

                                                   term
term                                                   
1     13324; queryU; queryA; queryP; queryX; queryY;...
2                                  2157; queryP; queryX
3                                  2547; queryB; queryA
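
To also honour the 10-minute condition from the question, the same cumulative-sum grouping can be combined with a per-row time gap. The following is a minimal sketch under assumptions, not part of the original answer: the gap is taken between consecutive rows, rows after the first too-large gap are dropped from that itemset, and the variable names (rev, too_far, broken, keep) are purely illustrative.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "term": ["queryA", "queryB", "2547", "queryX", "queryP", "2157",
             "queryA", "queryY", "queryX", "queryP", "queryA", "queryU",
             "13324"],
    "timestamp": pd.to_datetime([
        "2018-09-27 18:26:47", "2018-09-27 18:26:52", "2018-09-27 18:26:58",
        "2018-09-28 14:29:49", "2018-09-28 14:30:00", "2018-09-28 14:30:01",
        "2018-09-29 10:14:15", "2018-09-29 10:14:19", "2018-09-30 12:20:40",
        "2018-09-30 12:22:00", "2018-09-30 12:22:01", "2018-09-30 12:26:08",
        "2018-09-30 12:30:00",
    ]),
})

# Reverse the rows so that the last term comes first.
rev = df.iloc[::-1].reset_index(drop=True)

is_num = rev["term"].str.isnumeric()
# Label each numeric term and the rows after it with a running group id.
group = is_num.cumsum().rename("itemset")

# Time difference to the previously visited (i.e. later) row.
gap = rev["timestamp"].shift() - rev["timestamp"]
too_far = ~is_num & (gap >= pd.Timedelta(minutes=10))

# Once one gap inside a group is too large, discard the rest of that group.
broken = too_far.astype(int).groupby(group).cummax().astype(bool)

# Also discard rows that come before the first numeric term (group id 0).
keep = (group > 0) & ~broken

itemsets = (rev[keep]
            .groupby(group[keep])["term"]
            .agg("; ".join)
            .reset_index(drop=True))
print(itemsets)
```

With the sample rows, queryY and the queryA before it are excluded, so the first itemset ends at queryX as requested in the question.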

Instead of Series.str.isnumeric, we can also use:

 pd.to_numeric(df2['term'], errors='coerce').notna().cumsum()
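
The two checks are not interchangeable for every input, so it may be worth comparing them on a few strings before choosing one. The small comparison below uses made-up sample values ("12.5" and "-3" do not occur in the question's data):

```python
import pandas as pd

# Illustrative values only; "12.5" and "-3" are not from the question.
terms = pd.Series(["2547", "queryA", "12.5", "-3"])

# True only when every character of the string is a numeric character.
by_isnumeric = terms.str.isnumeric()

# True whenever the string can be parsed as a number, including signs
# and decimal points.
by_to_numeric = pd.to_numeric(terms, errors="coerce").notna()

print(pd.DataFrame({"term": terms,
                    "str.isnumeric": by_isnumeric,
                    "to_numeric": by_to_numeric}))
```

If the numeric terms are always plain non-negative integers, as in the question, both variants should give the same grouping; they differ only for values such as decimals or negative numbers.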


 