Transposing columns into rows to create event log data set
Could you please help me transpose columns into rows to create an event log series?
I want to create an event log data set out of the following columns.
My table looks like the following:
ID1 ID2 Event1 Event1_activity Event2 Event2_activity Event3 Event3_activity
10001A 6456 05.09.2019 12:32 Event1_Description 09.09.2019 12:40 Event2_Description 10.09.2019 12:40 Event3_Description
10001A 6456 05.09.2019 12:32 Event1_Description 09.09.2019 12:40 Event2_Description 10.09.2019 12:40 Event3_Description
20001B 8793 03.09.2019 09:45 Event1_Description 10.09.2019 12:25 Event2_Description 11.09.2019 12:25 Event3_Description
20001B 9017 03.09.2019 09:49 Event1_Description 10.09.2019 12:25 Event2_Description 11.09.2019 12:25 Event3_Description
20001B 5454 04.09.2019 12:42 Event1_Description 10.09.2019 12:25 Event2_Description 11.09.2019 12:25 Event3_Description
Based on ID1 and ID2, I want to create a series of event logs from the respective event and activity columns.
Basically, my event log table should look like the following:
ID Event Activity
6456-10001A 05.09.2019 12:32 Event1_Description
6456-10001A 09.09.2019 12:40 Event2_Description
6456-10001A 10.09.2019 12:40 Event3_Description
6456-10001A 05.09.2019 12:32 Event1_Description
6456-10001A 09.09.2019 12:40 Event2_Description
6456-10001A 10.09.2019 12:40 Event3_Description
8793-20001B 03.09.2019 09:45 Event1_Description
8793-20001B 10.09.2019 12:25 Event2_Description
8793-20001B 04.09.2019 09:45 Event3_Description
9017-20001B 03.09.2019 09:49 Event1_Description
9017-20001B 10.09.2019 12:25 Event2_Description
9017-20001B 04.09.2019 09:49 Event3_Description
5454-20001B 04.09.2019 12:42 Event1_Description
5454-20001B 10.09.2019 12:25 Event2_Description
5454-20001B 05.09.2019 12:42 Event3_Description
Any suggestions would be highly appreciated!
You can create the new ID, then concatenate the dataframe subsets and sort by ID:
df['ID'] = df['ID2'].astype(str) + '-' + df['ID1']
n_events = 3
pd.concat([df[['ID', f'Event{i}', f'Event{i}_activity']]
             .rename(columns={f'Event{i}': 'Event', f'Event{i}_activity': 'Activity'})
           for i in range(1, n_events + 1)]
          ).sort_values(by='ID').reset_index(drop=True)
ID Event Activity
0 5454-20001B 04.09.2019 12:42 Event1_Description
1 5454-20001B 10.09.2019 12:25 Event2_Description
2 5454-20001B 11.09.2019 12:25 Event3_Description
3 6456-10001A 05.09.2019 12:32 Event1_Description
4 6456-10001A 05.09.2019 12:32 Event1_Description
5 6456-10001A 09.09.2019 12:40 Event2_Description
6 6456-10001A 09.09.2019 12:40 Event2_Description
7 6456-10001A 10.09.2019 12:40 Event3_Description
8 6456-10001A 10.09.2019 12:40 Event3_Description
9 8793-20001B 03.09.2019 09:45 Event1_Description
10 8793-20001B 10.09.2019 12:25 Event2_Description
11 8793-20001B 11.09.2019 12:25 Event3_Description
12 9017-20001B 03.09.2019 09:49 Event1_Description
13 9017-20001B 10.09.2019 12:25 Event2_Description
14 9017-20001B 11.09.2019 12:25 Event3_Description
If you have to retain the original order of ID, then you have to do it differently.
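One way to keep that original order (a sketch, not part of the answer above, shown with a reduced two-event sample): tag each long row with its source row's position before concatenating, then sort on that position instead of on ID.

```python
import pandas as pd

# reduced sample with the same column layout as the question
df = pd.DataFrame({
    "ID1": ["10001A", "20001B"],
    "ID2": [6456, 8793],
    "Event1": ["05.09.2019 12:32", "03.09.2019 09:45"],
    "Event1_activity": ["Event1_Description", "Event1_Description"],
    "Event2": ["09.09.2019 12:40", "10.09.2019 12:25"],
    "Event2_activity": ["Event2_Description", "Event2_Description"],
})

df["ID"] = df["ID2"].astype(str) + "-" + df["ID1"]
n_events = 2

# keep the original row position and the event number so that each source
# row's events stay together, in event order, after the concat
long = pd.concat(
    [df[["ID", f"Event{i}", f"Event{i}_activity"]]
       .rename(columns={f"Event{i}": "Event", f"Event{i}_activity": "Activity"})
       .assign(order=df.index, event_no=i)
     for i in range(1, n_events + 1)]
)
long = (long.sort_values(["order", "event_no"])
            .drop(columns=["order", "event_no"])
            .reset_index(drop=True))
print(long)
```

Sorting on the saved position first and the event number second reproduces the question's desired row order regardless of how the IDs compare as strings.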
Using melt. Dynamic: more columns (>3) will still work.
df = pd.read_csv(io.StringIO("""ID1 ID2 Event1 Event1_activity Event2 Event2_activity Event3 Event3_activity
10001A 6456 05.09.2019 12:32 Event1_Description 09.09.2019 12:40 Event2_Description 10.09.2019 12:40 Event3_Description
10001A 6456 05.09.2019 12:32 Event1_Description 09.09.2019 12:40 Event2_Description 10.09.2019 12:40 Event3_Description
20001B 8793 03.09.2019 09:45 Event1_Description 10.09.2019 12:25 Event2_Description 11.09.2019 12:25 Event3_Description
20001B 9017 03.09.2019 09:49 Event1_Description 10.09.2019 12:25 Event2_Description 11.09.2019 12:25 Event3_Description
20001B 5454 04.09.2019 12:42 Event1_Description 10.09.2019 12:25 Event2_Description 11.09.2019 12:25 Event3_Description"""
), sep=r"\s\s+", engine="python")
# prepare ID column as concatenation
df = df.assign(ID=lambda dfa: dfa["ID1"].astype(str)+"-"+dfa["ID2"].astype(str)).drop(columns=["ID1","ID2"])
# melt out both sets of columns for Event and Activity then merge
# NB reset_index() to ensure merge key works. Plus only want ID on LHS dataframe
df2 = pd.merge(
pd.melt(df, id_vars=["ID"],
value_vars=[c for c in df.columns if "Event" in c and "activity" not in c],
value_name="Event").drop(columns="variable").reset_index(),
pd.melt(df, id_vars=["ID"],
value_vars=[c for c in df.columns if "activity" in c],
value_name="Activity").drop(columns=["variable","ID"]).reset_index(),
on="index"
).drop(columns="index").sort_values(["ID","Event"])
ID Event Activity
10001A-6456 05.09.2019 12:32 Event1_Description
10001A-6456 05.09.2019 12:32 Event1_Description
10001A-6456 09.09.2019 12:40 Event2_Description
10001A-6456 09.09.2019 12:40 Event2_Description
10001A-6456 10.09.2019 12:40 Event3_Description
10001A-6456 10.09.2019 12:40 Event3_Description
20001B-5454 04.09.2019 12:42 Event1_Description
20001B-5454 10.09.2019 12:25 Event2_Description
20001B-5454 11.09.2019 12:25 Event3_Description
20001B-8793 03.09.2019 09:45 Event1_Description
20001B-8793 10.09.2019 12:25 Event2_Description
20001B-8793 11.09.2019 12:25 Event3_Description
20001B-9017 03.09.2019 09:49 Event1_Description
20001B-9017 10.09.2019 12:25 Event2_Description
20001B-9017 11.09.2019 12:25 Event3_Description
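Note that the final `sort_values(["ID","Event"])` sorts Event as a string. With day.month.year dates that is only correct by coincidence when all dates fall in the same month; a safer sketch (format string assumed from the sample data) parses the column with pd.to_datetime first:

```python
import pandas as pd

# day.month.year strings: lexicographic sort puts "01.10" before "03.09"
events = pd.Series(["10.09.2019 12:25", "03.09.2019 09:45", "01.10.2019 08:00"])

as_strings = events.sort_values().tolist()
print(as_strings)  # the October date sorts first even though it is latest

# parsing fixes the ordering; the format is an assumption from the sample
parsed = pd.to_datetime(events, format="%d.%m.%Y %H:%M")
print(parsed.sort_values().tolist())
```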
Use wide_to_long after creating the ID column and swapping column names like Event1_activity to activity_Event1:
df['ID'] = df.pop("ID1").astype(str) + "-" + df.pop("ID2").astype(str)
df.columns = [f'{x[1]}_{x[0]}' if len(x) == 2 else f'{"".join(x)}'
for x in df.columns.str.split('_')]
df = (pd.wide_to_long(df.reset_index(),
stubnames=['Event','activity_Event'],
i=['index','ID'],
j='tmp')
.reset_index(level=1).reset_index(drop=True))
print(df)
ID Event activity_Event
0 10001A-6456 05.09.2019 12:32 Event1_Description
1 10001A-6456 09.09.2019 12:40 Event2_Description
2 10001A-6456 10.09.2019 12:40 Event3_Description
3 10001A-6456 05.09.2019 12:32 Event1_Description
4 10001A-6456 09.09.2019 12:40 Event2_Description
5 10001A-6456 10.09.2019 12:40 Event3_Description
6 20001B-8793 03.09.2019 09:45 Event1_Description
7 20001B-8793 10.09.2019 12:25 Event2_Description
8 20001B-8793 11.09.2019 12:25 Event3_Description
9 20001B-9017 03.09.2019 09:49 Event1_Description
10 20001B-9017 10.09.2019 12:25 Event2_Description
11 20001B-9017 11.09.2019 12:25 Event3_Description
12 20001B-5454 04.09.2019 12:42 Event1_Description
13 20001B-5454 10.09.2019 12:25 Event2_Description
14 20001B-5454 11.09.2019 12:25 Event3_Description
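To match the column headers requested in the question, the leftover activity_Event label can be renamed afterwards; a minimal sketch with a one-row stand-in for the wide_to_long result:

```python
import pandas as pd

# stand-in for the wide_to_long output above
df = pd.DataFrame({
    "ID": ["10001A-6456"],
    "Event": ["05.09.2019 12:32"],
    "activity_Event": ["Event1_Description"],
})

# restore the requested header
df = df.rename(columns={"activity_Event": "Activity"})
print(df.columns.tolist())  # ['ID', 'Event', 'Activity']
```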
The reshaping process can be abstracted with the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from GitHub:
Your columns have a pattern: some end with numbers, while the rest end with activity. We can use a regular expression inside the pivot_longer function to get your results:
# install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
( # combine `ID1` and `ID2` into a single column
df.assign(ID=df.ID2.astype(str).str.cat(df.ID1, sep="-"))
.drop(columns=["ID1", "ID2"])
.pivot_longer(
index="ID",
names_to=("Event", "Activity"),
names_pattern=(r"\d$", "activity$"),
sort_by_appearance=True,
)
)
ID Event Activity
0 6456-10001A 05.09.2019 12:32 Event1_Description
1 6456-10001A 09.09.2019 12:40 Event2_Description
2 6456-10001A 10.09.2019 12:40 Event3_Description
3 6456-10001A 05.09.2019 12:32 Event1_Description
4 6456-10001A 09.09.2019 12:40 Event2_Description
5 6456-10001A 10.09.2019 12:40 Event3_Description
6 8793-20001B 03.09.2019 09:45 Event1_Description
7 8793-20001B 10.09.2019 12:25 Event2_Description
8 8793-20001B 11.09.2019 12:25 Event3_Description
9 9017-20001B 03.09.2019 09:49 Event1_Description
10 9017-20001B 10.09.2019 12:25 Event2_Description
11 9017-20001B 11.09.2019 12:25 Event3_Description
12 5454-20001B 04.09.2019 12:42 Event1_Description
13 5454-20001B 10.09.2019 12:25 Event2_Description
14 5454-20001B 11.09.2019 12:25 Event3_Description
The names_pattern ("\d$", "activity$") looks for the columns that end with a number and with activity, and assigns them to the respective column names in names_to ("Event", "Activity").
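If installing a development version is not an option, plain pandas offers pd.lreshape, which folds named groups of wide columns into long ones (a sketch with the same column layout, reduced to two events; note it does not emit a within-row event-order key):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["6456-10001A", "8793-20001B"],
    "Event1": ["05.09.2019 12:32", "03.09.2019 09:45"],
    "Event1_activity": ["Event1_Description"] * 2,
    "Event2": ["09.09.2019 12:40", "10.09.2019 12:25"],
    "Event2_activity": ["Event2_Description"] * 2,
})

# collect the two groups of wide columns by their naming pattern
event_cols = sorted(c for c in df.columns
                    if c.startswith("Event") and not c.endswith("activity"))
activity_cols = sorted(c for c in df.columns if c.endswith("activity"))

# each group's columns are stacked into one long column; ID is carried along
long = pd.lreshape(df, {"Event": event_cols, "Activity": activity_cols})
print(long)
```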