
How to create sequences out of a dataframe and put them in an array of arrays or a list?

For the input of:

df = pd.DataFrame(np.array([[1,  "A"],[2, "A"],[3, "B"],[4, "C"],[5, "D" ],[6, "A" ],[7, "B" ],[8, "A" ],[9, "C" ],[10, "D" ],[11,"A" ],
                           [12,  "A"],[13, "B"],[14, "B"],[15, "D" ],[16, "A" ],[17, "B" ],[18, "A" ],[19, "C" ],[20, "D" ],[21,"A" ],
                           [22,  "A"],[23, "A"],[24, "C"],[25, "D" ],[26, "A" ],[27, "C" ],[28, "A" ],[29, "C" ],[30, "D" ] ]),
                            columns=['No.',  'Value'])

I get the output of:

    No. Value
0   1   A
1   2   A
2   3   B
3   4   C
4   5   D
5   6   A
6   7   B
7   8   A
8   9   C
9   10  D
10  11  A
11  12  A
12  13  B
13  14  B
14  15  D
15  16  A
16  17  B
17  18  A
18  19  C
19  20  D
20  21  A
21  22  A
22  23  A
23  24  C
24  25  D
25  26  A
26  27  C
27  28  A
28  29  C
29  30  D

Now I want to create sequences from the data. A sequence is a run of rows up to and including the next row whose value is "D". For example, the first sequence consists of rows No. 1 through No. 5 (inclusive), the second of rows No. 6 through No. 10 (inclusive), and so on.

After that I want to encode the values as numbers: A -> 1, B -> 2, C -> 3, D -> 4. If, within a sequence, an A is followed by one or more further A's, they are collapsed into a single 1. The same applies to the other values.

First sequence = A,A,B,C,D. For that I want something like [1,2,3,4].
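
The collapsing rule for a single sequence can be sketched with itertools.groupby, which groups runs of consecutive equal items. The names `codes`, `first_sequence` and `collapsed` are illustrative only and do not come from the question:

```python
from itertools import groupby

# Mapping from the question: A -> 1, B -> 2, C -> 3, D -> 4
codes = {"A": 1, "B": 2, "C": 3, "D": 4}

# First sequence from the example data (rows No. 1 to No. 5)
first_sequence = ["A", "A", "B", "C", "D"]

# groupby() yields one (key, run) pair per run of consecutive equal items,
# so keeping only the keys collapses repeats such as A, A into a single A
collapsed = [codes[key] for key, _ in groupby(first_sequence)]
print(collapsed)  # [1, 2, 3, 4]
```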

For the whole output I want something like this:

result = list([[1,2,3,4],[1,2,1,3,4],[1,2,4],[1,2,1,3,4],[1,3,4],[1,3,1,3,4]])

Output:

[[1, 2, 3, 4],
 [1, 2, 1, 3, 4],
 [1, 2, 4],
 [1, 2, 1, 3, 4],
 [1, 3, 4],
 [1, 3, 1, 3, 4]]

Here I'm using cumsum() to give all rows in the same sequence a "Sequence ID". The ID goes up by 1 on the row that follows each "D", so the "D" itself stays in the sequence it closes.

Then the DataFrame's groupby() groups the rows by sequence, and each group is turned into a list that is passed through itertools.groupby so that runs of consecutive equal values are collapsed into one, like this:

import pandas as pd
import numpy as np
from itertools import groupby
from pprint import pprint

df = pd.DataFrame(np.array([[1,  "A"],[2, "A"],[3, "B"],[4, "C"],[5, "D" ],[6, "A" ],[7, "B" ],[8, "A" ],[9, "C" ],[10, "D" ],[11,"A" ],
                           [12,  "A"],[13, "B"],[14, "B"],[15, "D" ],[16, "A" ],[17, "B" ],[18, "A" ],[19, "C" ],[20, "D" ],[21,"A" ],
                           [22,  "A"],[23, "A"],[24, "C"],[25, "D" ],[26, "A" ],[27, "C" ],[28, "A" ],[29, "C" ],[30, "D" ] ]),
                            columns=['No.',  'Value'])

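# Encode the letters as integers
df["NumVal"] = df["Value"].map({"A":1,"B":2,"C":3,"D":4})
# A row starts a new sequence when the previous row's Value is "D"
df["SequenceID"] = (df["Value"].shift(1) == "D").cumsum()

# Per sequence: keep one entry for each run of consecutive equal values
result = [[key for key, _ in groupby(g["NumVal"].tolist())] for _, g in df.groupby("SequenceID")]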

pprint(result)

Output:

[[1, 2, 3, 4],
 [1, 2, 1, 3, 4],
 [1, 2, 4],
 [1, 2, 1, 3, 4],
 [1, 3, 4],
 [1, 3, 1, 3, 4]]
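
To see what the shift/cumsum step does on its own, here is a minimal sketch on a shortened column. The Series `values` and the names `starts_new` and `sequence_id` are made up for illustration and are not part of the answer above:

```python
import pandas as pd

# A shortened Value column containing two sequences, each ending in "D"
values = pd.Series(["A", "A", "B", "C", "D", "A", "B", "D"])

# True on the row right after a "D", i.e. on the first row of a new sequence
starts_new = values.shift(1) == "D"

# The running total of those flags labels each row with its sequence number
sequence_id = starts_new.cumsum()
print(sequence_id.tolist())  # [0, 0, 0, 0, 0, 1, 1, 1]
```

The first row compares a missing value against "D", which is False, so the first sequence gets ID 0.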

Try:

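from itertools import groupby

# Encode the letters as integers; tolist() gives plain Python ints
values = df['Value'].map({'A': 1, 'B': 2, 'C': 3, 'D': 4}).tolist()
# Positions just past each "D" (code 4) mark the end of a sequence
idx_list = [idx + 1 for idx, val in enumerate(values) if val == 4]
# Add the end of the data as a boundary when the last row is not a "D"
ends = idx_list + ([len(values)] if not idx_list or idx_list[-1] != len(values) else [])
chunks = [values[i:j] for i, j in zip([0] + idx_list, ends)]
# Collapse consecutive duplicates within each sequence
result = [[key for key, _ in groupby(chunk)] for chunk in chunks]
print(result)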

[[1, 2, 3, 4], 
 [1, 2, 1, 3, 4], 
 [1, 2, 4], 
 [1, 2, 1, 3, 4], 
 [1, 3, 4], 
 [1, 3, 1, 3, 4]]
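
For comparison, the same splitting and collapsing can also be done in a single pass over plain Python lists, without pandas. This is only a sketch: the function name `split_and_collapse` and its `terminator` parameter are hypothetical and do not appear in either answer:

```python
def split_and_collapse(letters, codes, terminator="D"):
    """Split letters into sequences ending at terminator, collapsing repeats."""
    result, current = [], []
    for letter in letters:
        code = codes[letter]
        # Append only when the code differs from the previous one in this sequence
        if not current or current[-1] != code:
            current.append(code)
        # The terminator closes the current sequence
        if letter == terminator:
            result.append(current)
            current = []
    # Keep a trailing sequence that never reached the terminator
    if current:
        result.append(current)
    return result

codes = {"A": 1, "B": 2, "C": 3, "D": 4}
letters = list("AABCDABACD")  # rows No. 1 to No. 10 of the example data
print(split_and_collapse(letters, codes))  # [[1, 2, 3, 4], [1, 2, 1, 3, 4]]
```

With the DataFrame from the question, the input could be passed as `df["Value"].tolist()`.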
