Merge two columns into one within the same data frame in pandas/python
I have a question about merging two columns into one in the same dataframe (start_end), while also removing the null values. I intend to merge 'Start station' and 'End station' into 'station', and keep 'Duration' aligned with the new 'station' column. I have tried pd.merge, pd.concat, and pd.append, but I cannot work it out.
The start_end dataframe:
Duration End station Start station
14 1407 NaN 14th & V St NW
19 509 NaN 21st & I St NW
20 638 15th & P St NW. NaN
27 1532 NaN Massachusetts Ave & Dupont Circle NW
28 759 NaN Adams Mill & Columbia Rd NW
Expected output:
Duration stations
14 1407 14th & V St NW
19 509 21st & I St NW
20 638 15th & P St NW
27 1532 Massachusetts Ave & Dupont Circle NW
28 759 Adams Mill & Columbia Rd NW
Code I have so far:
#start_end is the dataframe, 'start station', 'end station', 'duration'
start_end = pd.concat([df_start, df_end])
This is what I attempted:
station = pd.merge([start_end['Start station'],start_end['End station']])
fillna
If the NaN values are truly nulls:
df.assign(**{
'Start station': df['Start station'].fillna(df['End station'])})
Duration End station Start station
14 1407 NaN 14th & V St NW
19 509 NaN 21st & I St NW
20 638 15th & P St NW. 15th & P St NW.
27 1532 NaN Massachusetts Ave & Dupont Circle NW
28 759 NaN Adams Mill & Columbia Rd NW
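To get all the way to the expected output, the filled column can be kept under a new name and the two originals dropped. A minimal sketch, assuming a small sample frame that mirrors the question's data (the frame and the name `result` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame mirroring the question's data.
df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638],
        "End station": [np.nan, np.nan, "15th & P St NW"],
        "Start station": ["14th & V St NW", "21st & I St NW", np.nan],
    },
    index=[14, 19, 20],
)

# Fill the gaps in 'Start station' from 'End station',
# then drop the two source columns so only the merged one remains.
result = df.assign(
    station=df["Start station"].fillna(df["End station"])
).drop(columns=["Start station", "End station"])

print(result)
```

This relies on each row having a value in at least one of the two columns; where both are set, the 'Start station' value wins.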
mask
If the NaN values are strings:
df.assign(**{
'Start station': df['Start station'].mask(
lambda x: x == 'NaN', df['End station'])})
Duration End station Start station
14 1407 NaN 14th & V St NW
19 509 NaN 21st & I St NW
20 638 15th & P St NW. 15th & P St NW.
27 1532 NaN Massachusetts Ave & Dupont Circle NW
28 759 NaN Adams Mill & Columbia Rd NW
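The string case can be checked on a small frame where the missing entries are the literal text 'NaN'. A sketch under that assumption (the sample data is made up, and a boolean mask is used here in place of the lambda, which should be equivalent):

```python
import pandas as pd

# Hypothetical frame where the gaps are the literal string 'NaN', not real nulls.
df = pd.DataFrame(
    {
        "Duration": [1407, 638],
        "End station": ["NaN", "15th & P St NW"],
        "Start station": ["14th & V St NW", "NaN"],
    }
)

# Wherever 'Start station' holds the placeholder string, take 'End station' instead.
station = df["Start station"].mask(df["Start station"] == "NaN", df["End station"])

print(station.tolist())
```

Note that fillna would leave such a frame unchanged, because a 'NaN' string is not a missing value as far as pandas is concerned.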
>>> df
Duration End station Start station
0 1407 NaN 14th & V St NW
1 509 NaN 21st & I St NW
2 638 15th & P St NW. NaN
3 1532 NaN Massachusetts Ave & Dupont Circle NW
4 759 NaN Adams Mill & Columbia Rd NW
Give the two columns the same name
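>>> # regex=True is needed on pandas 2.0+, where str.replace treats the pattern literally by default
>>> df.columns = df.columns.str.replace('.*?station', 'station', regex=True)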
>>> df
Duration station station
0 1407 NaN 14th & V St NW
1 509 NaN 21st & I St NW
2 638 15th & P St NW. NaN
3 1532 NaN Massachusetts Ave & Dupont Circle NW
4 759 NaN Adams Mill & Columbia Rd NW
Stack then unstack.
>>> s = df.stack()
>>> s
0 Duration 1407
station 14th & V St NW
1 Duration 509
station 21st & I St NW
2 Duration 638
station 15th & P St NW.
3 Duration 1532
station Massachusetts Ave & Dupont Circle NW
4 Duration 759
station Adams Mill & Columbia Rd NW
dtype: object
>>> df = s.unstack()
>>> df
Duration station
0 1407 14th & V St NW
1 509 21st & I St NW
2 638 15th & P St NW.
3 1532 Massachusetts Ave & Dupont Circle NW
4 759 Adams Mill & Columbia Rd NW
>>>
This is how I think this works:
.stack creates a Series with a MultiIndex and takes care of the null values for you. It aligns the second level on the column names, and because the column names are the same there is only one; unstacking just produces a single column.
That's really just a guess, based on how the index differs when you don't change the column names.
>>> # without changing column names
>>> s.index
MultiIndex(levels=[[0, 1, 2, 3, 4], ['Duration', 'End station', 'Start station']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 2, 0, 2, 0, 1, 0, 2, 0, 2]])
>>> # column names the same
>>> s.index
MultiIndex(levels=[[0, 1, 2, 3, 4], ['Duration', 'station']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]])
Seems a bit tricky, maybe someone will comment on it.
Alternative: using pd.concat and .dropna
>>> stations = pd.concat([df.iloc[:,1],df.iloc[:,2]]).dropna()
>>> stations.name = 'stations'
>>> stations
2 15th & P St NW.
0 14th & V St NW
1 21st & I St NW
3 Massachusetts Ave & Dupont Circle NW
4 Adams Mill & Columbia Rd NW
Name: stations, dtype: object
>>> df2 = pd.concat([df['Duration'], stations], axis=1)
>>> df2
Duration stations
0 1407 14th & V St NW
1 509 21st & I St NW
2 638 15th & P St NW.
3 1532 Massachusetts Ave & Dupont Circle NW
4 759 Adams Mill & Columbia Rd NW
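The same idea can be written against the column names instead of positions, which makes it less dependent on column order. A sketch assuming a small made-up frame and that every row has exactly one of the two stations filled in (otherwise the concatenated Series carries duplicate index labels and the final alignment is no longer one-to-one):

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame mirroring the question's data.
df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638],
        "End station": [np.nan, np.nan, "15th & P St NW"],
        "Start station": ["14th & V St NW", "21st & I St NW", np.nan],
    },
    index=[14, 19, 20],
)

# Stack the two columns end to end, drop the nulls, and name the result.
stations = (
    pd.concat([df["Start station"], df["End station"]])
    .dropna()
    .rename("stations")
)

# Concatenating along axis=1 aligns on the index labels,
# so each station lands back on the row it came from.
result = pd.concat([df["Duration"], stations], axis=1)

print(result)
```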
Using combine_first, which replaces null values in col1 with the values from col2:
df["station"] = df["End station"].combine_first(df["Start station"])
# pass the labels via columns=; a positional axis argument is rejected by pandas 2.0+
df.drop(columns=["End station", "Start station"], inplace=True)
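Using ffill along each row. The slice has to cover the two station columns, which sit at positions 1:3 in the frame shown above:
df.iloc[:, 1:3] = df.iloc[:, 1:3].ffill(axis=1)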
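A row-wise forward fill copies the left column's value into the right column wherever the right one is null, so the right-hand column ends up complete. A minimal sketch selecting the columns by name, assuming a made-up sample frame with 'End station' to the left of 'Start station':

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame mirroring the question's data.
df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638],
        "End station": [np.nan, np.nan, "15th & P St NW"],
        "Start station": ["14th & V St NW", "21st & I St NW", np.nan],
    },
    index=[14, 19, 20],
)

# Forward-fill left to right across the two station columns;
# after this, 'Start station' has a value in every row.
filled = df[["End station", "Start station"]].ffill(axis=1)

# Keep Duration plus the completed column under a new name.
result = df[["Duration"]].assign(station=filled["Start station"])

print(result)
```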