I would like to find a pandas solution for the following problem (the dataframe is very long in reality, therefore performance really is an important topic):
I have an input dataframe df and need to build a new dataframe dfNew, where I need to derive the output in column 'rs' from the values of the other columns.
The required logic is as follows: t always increases steadily from 0 to its maximum value and then starts again at 0. Between t = 0 and the next upcoming pt = 'X' (inclusive), the value of column td should be taken for the result column rs; otherwise the value of column md should be taken for rs. What would a pandas-based solution to derive rs from the other columns look like?
import pandas as pd

td = ['td0', 'td1', 'td2', 'td3', 'td4', 'td5', 'td6', 'td7', 'td8', 'td9', 'td10', 'td11', 'td12']
md = ['md0', 'md1', 'md2', 'md3', 'md4', 'md5', 'md6', 'md7', 'md8', 'md9', 'md10', 'md11', 'md12']
t = [0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 0, 1, 2]
pt = ['n', 'n', 'X', 'n', 'n', 'n', 'n', 'X', 'n', 'n', 'n', 'X', 'n']
df = pd.DataFrame({'td': td, 'md': md, 't': t, 'pt': pt}, columns=['td', 'md', 't', 'pt'])
df
      td    md  t pt
0    td0   md0  0  n
1    td1   md1  1  n
2    td2   md2  2  X
3    td3   md3  3  n
4    td4   md4  0  n
5    td5   md5  1  n
6    td6   md6  2  n
7    td7   md7  3  X
8    td8   md8  4  n
9    td9   md9  5  n
10  td10  md10  0  n
11  td11  md11  1  X
12  td12  md12  2  n
dfNew
      td    md  t pt    rs
0    td0   md0  0  n   td0
1    td1   md1  1  n   td1
2    td2   md2  2  X   td2
3    td3   md3  3  n   md3
4    td4   md4  0  n   td4
5    td5   md5  1  n   td5
6    td6   md6  2  n   td6
7    td7   md7  3  X   td7
8    td8   md8  4  n   md8
9    td9   md9  5  n   md9
10  td10  md10  0  n  td10
11  td11  md11  1  X  td11
12  td12  md12  2  n  md12
Here's my take with groupby and cumsum:
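import numpy as np

# df.t.eq(0).cumsum() marks the range of t
# similarly x.shift().eq('X').cumsum() marks the X range
# group_keys=False keeps pt_range on the same index as df
pt_range = (df.groupby(df.t.eq(0).cumsum(), group_keys=False)
              .pt.apply(lambda x: x.shift().eq('X').cumsum()))
# a non-zero count means an X has already been passed in this range
df['rs'] = np.where(pt_range, df.md, df.td)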
Output:
+-----+-------+-------+----+-----+------+
| | td | md | t | pt | rs |
+-----+-------+-------+----+-----+------+
| 0 | td0 | md0 | 0 | n | td0 |
| 1 | td1 | md1 | 1 | n | td1 |
| 2 | td2 | md2 | 2 | X | td2 |
| 3 | td3 | md3 | 3 | n | md3 |
| 4 | td4 | md4 | 0 | n | td4 |
| 5 | td5 | md5 | 1 | n | td5 |
| 6 | td6 | md6 | 2 | n | td6 |
| 7 | td7 | md7 | 3 | X | td7 |
| 8 | td8 | md8 | 4 | n | md8 |
| 9 | td9 | md9 | 5 | n | md9 |
| 10 | td10 | md10 | 0 | n | td10 |
| 11 | td11 | md11 | 1 | X | td11 |
| 12 | td12 | md12 | 2 | n | md12 |
+-----+-------+-------+----+-----+------+
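Since performance matters here, the same idea can also be sketched without the Python-level lambda in apply. This is only a variant under the assumptions of the question (a new range starts whenever t is 0, and rows strictly after an X in a range take md); the names block, x_seen and after_x are made up for illustration.

```python
import numpy as np
import pandas as pd

td = [f"td{i}" for i in range(13)]
md = [f"md{i}" for i in range(13)]
t = [0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 0, 1, 2]
pt = ['n', 'n', 'X', 'n', 'n', 'n', 'n', 'X', 'n', 'n', 'n', 'X', 'n']
df = pd.DataFrame({'td': td, 'md': md, 't': t, 'pt': pt})

# label each range of t: the counter goes up every time t restarts at 0
block = df['t'].eq(0).cumsum()

# running count of X rows inside each range (the current row included)
x_seen = df['pt'].eq('X').astype(int).groupby(block).cumsum()

# shift by one row inside each range so the X row itself still counts as "before"
after_x = x_seen.groupby(block).shift(fill_value=0).gt(0)

# rows after an X take md, everything else takes td
df['rs'] = np.where(after_x, df['md'], df['td'])
print(df)
```

Both groupby steps here use built-in cumsum and shift, so no Python function is called per group. Whether that is actually faster on the real data would need to be timed.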
I have built an algorithm to break the series after each X, but I am not sure how efficient it will be.
# store pt to list
pt_list = df.pt.tolist()
# iterate through the list to get the index of each n after each X
md_map = {}
for idx, item in enumerate(pt_list):
    if item == "X" and idx != df.index.max():
        key = idx + 1
        value = "md"
        md_map[key] = value
# map it with data frame
df["td_md"] = df.index.map(md_map)
# fill the na with td
df["td_md"] = df.td_md.fillna("td")
# create rs column from index and td_md
df["rs"] = df.td_md + df.index.astype(str)
I did not think about each and every condition, but you would have to build something like that.
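One condition the loop above does not cover: it flags only the single row directly after each X, so on the sample data row 9 would come out as td9 rather than md9. It also builds rs from the index number instead of reading the td and md columns. A possible way to handle both, as a rough sketch only, is to carry a flag forward until t restarts at 0. The helper name use_md_flags is hypothetical.

```python
import numpy as np
import pandas as pd


def use_md_flags(t_values, pt_values):
    """Return one bool per row: True where md should be used (illustrative helper)."""
    flags = []
    seen_x = False
    for t_val, pt_val in zip(t_values, pt_values):
        if t_val == 0:
            # a new range of t starts, so forget any earlier X
            seen_x = False
        # the X row itself is still td, so record the flag before updating it
        flags.append(seen_x)
        if pt_val == 'X':
            seen_x = True
    return flags


td = [f"td{i}" for i in range(13)]
md = [f"md{i}" for i in range(13)]
t = [0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 0, 1, 2]
pt = ['n', 'n', 'X', 'n', 'n', 'n', 'n', 'X', 'n', 'n', 'n', 'X', 'n']
df = pd.DataFrame({'td': td, 'md': md, 't': t, 'pt': pt})

# take the values from the md / td columns rather than rebuilding them from the index
flags = use_md_flags(df['t'].tolist(), df['pt'].tolist())
df['rs'] = np.where(flags, df['md'], df['td'])
print(df)
```

This is still a plain Python loop over every row, so on a very long dataframe it is likely to be slower than a groupby-based approach.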