
Merge two columns into one within the same data frame in pandas/python

I have a question about merging two columns into one within the same DataFrame (start_end) while also removing the null values. I want to merge 'Start station' and 'End station' into a single 'station' column and keep 'Duration' aligned with the new column. I have tried pd.merge, pd.concat, and pd.append, but I cannot work it out.

The start_end DataFrame:

    Duration  End station      Start station
14  1407      NaN              14th & V St NW
19  509       NaN              21st & I St NW
20  638       15th & P St NW.  NaN
27  1532      NaN              Massachusetts Ave & Dupont Circle NW
28  759       NaN              Adams Mill & Columbia Rd NW

Expected output:

    Duration    stations
14  1407        14th & V St NW
19  509         21st & I St NW
20  638         15th & P St NW
27  1532        Massachusetts Ave & Dupont Circle NW
28  759         Adams Mill & Columbia Rd NW

Code I have so far:

#start_end is the dataframe, 'start station', 'end station', 'duration'
start_end = pd.concat([df_start, df_end])

This is what I attempted:

station = pd.merge([start_end['Start station'],start_end['End station']])

fillna

If the NaN values are truly nulls:

df.assign(**{
    'Start station': df['Start station'].fillna(df['End station'])})

    Duration      End station                         Start station
14      1407              NaN                        14th & V St NW
19       509              NaN                        21st & I St NW
20       638  15th & P St NW.                       15th & P St NW.
27      1532              NaN  Massachusetts Ave & Dupont Circle NW
28       759              NaN           Adams Mill & Columbia Rd NW

mask

If the NaN values are the string 'NaN':

df.assign(**{
    'Start station': df['Start station'].mask(
        lambda x: x == 'NaN', df['End station'])})

    Duration      End station                         Start station
14      1407              NaN                        14th & V St NW
19       509              NaN                        21st & I St NW
20       638  15th & P St NW.                       15th & P St NW.
27      1532              NaN  Massachusetts Ave & Dupont Circle NW
28       759              NaN           Adams Mill & Columbia Rd NW
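Both cases above can be handled in one pass by first turning any literal 'NaN' strings into real nulls and then filling. The following is a minimal sketch, assuming the sample frame from the question with one string 'NaN' planted in it for illustration; the names `station_cols`, `cleaned`, and `out` are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample frame from the question; row 20 uses the *string* "NaN" on purpose
# to illustrate the second case.
df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638, 1532, 759],
        "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
        "Start station": [
            "14th & V St NW",
            "21st & I St NW",
            "NaN",
            "Massachusetts Ave & Dupont Circle NW",
            "Adams Mill & Columbia Rd NW",
        ],
    },
    index=[14, 19, 20, 27, 28],
)

station_cols = ["Start station", "End station"]

# mask() with no replacement value turns matching cells into real nulls.
cleaned = df[station_cols].mask(df[station_cols] == "NaN")

# Fill gaps in 'Start station' from 'End station', then drop the originals.
out = df.assign(
    station=cleaned["Start station"].fillna(cleaned["End station"])
).drop(columns=station_cols)

print(out)
```

Unlike the snippets above, this also drops the two source columns, so the result has only 'Duration' and 'station'.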
>>> df
   Duration      End station                         Start station
0      1407              NaN                        14th & V St NW
1       509              NaN                        21st & I St NW
2       638  15th & P St NW.                                   NaN
3      1532              NaN  Massachusetts Ave & Dupont Circle NW
4       759              NaN           Adams Mill & Columbia Rd NW

Give the two columns the same name:

>>> # regex=True is required on recent pandas, where str.replace defaults to literal matching
>>> df.columns = df.columns.str.replace('.*?station', 'station', regex=True)
>>> df
   Duration          station                               station
0      1407              NaN                        14th & V St NW
1       509              NaN                        21st & I St NW
2       638  15th & P St NW.                                   NaN
3      1532              NaN  Massachusetts Ave & Dupont Circle NW
4       759              NaN           Adams Mill & Columbia Rd NW

Stack, then unstack:

>>> s = df.stack()
>>> s
0  Duration                                    1407
   station                           14th & V St NW
1  Duration                                     509
   station                           21st & I St NW
2  Duration                                     638
   station                          15th & P St NW.
3  Duration                                    1532
   station     Massachusetts Ave & Dupont Circle NW
4  Duration                                     759
   station              Adams Mill & Columbia Rd NW
dtype: object
>>> df = s.unstack()
>>> df
  Duration                               station
0     1407                        14th & V St NW
1      509                        21st & I St NW
2      638                       15th & P St NW.
3     1532  Massachusetts Ave & Dupont Circle NW
4      759           Adams Mill & Columbia Rd NW
>>> 

This is how I think this works:

.stack creates a Series with a MultiIndex and drops the null values for you. It aligns the second level on the column names, and because the two column names are now identical there is only one distinct label, so unstacking produces a single column.

That's really just a guess, based on how the index differs when you don't change the column names.

>>> # without changing column names
>>> s.index
MultiIndex(levels=[[0, 1, 2, 3, 4], ['Duration', 'End station', 'Start station']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 2, 0, 2, 0, 1, 0, 2, 0, 2]])

>>> # column names the same
>>> s.index
MultiIndex(levels=[[0, 1, 2, 3, 4], ['Duration', 'station']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]])

It seems a bit tricky; maybe someone will comment on it.
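The same "go long, drop the nulls, come back" idea can be spelled out without relying on duplicate column names. This is a sketch of a swapped-in technique (melt plus dropna, not the stack/unstack trick itself), assuming the sample frame shown above; the names `long` and `result` are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample frame as shown in this answer (default 0..4 index).
df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638, 1532, 759],
        "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
        "Start station": [
            "14th & V St NW",
            "21st & I St NW",
            np.nan,
            "Massachusetts Ave & Dupont Circle NW",
            "Adams Mill & Columbia Rd NW",
        ],
    }
)

# Long format: one row per (original row, station column) pair.
# ignore_index=False keeps the original row labels so they can be restored.
long = df.melt(
    id_vars="Duration",
    value_vars=["End station", "Start station"],
    value_name="station",
    ignore_index=False,
)

# Drop the empty cells, discard the column-of-origin label,
# and put the rows back in their original order.
result = (
    long.dropna(subset=["station"])
    .drop(columns="variable")
    .sort_index()
)

print(result)
```

This only yields one row per original row because each row has exactly one non-null station; a row with both stations filled in would appear twice.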


Alternative - using pd.concat and .dropna

>>> stations = pd.concat([df.iloc[:,1],df.iloc[:,2]]).dropna()
>>> stations.name = 'stations'
>>> stations
2                         15th & P St NW.
0                          14th & V St NW
1                          21st & I St NW
3    Massachusetts Ave & Dupont Circle NW
4             Adams Mill & Columbia Rd NW
Name: stations, dtype: object

>>> df2 = pd.concat([df['Duration'], stations], axis=1)
>>> df2
   Duration                              stations
0      1407                        14th & V St NW
1       509                        21st & I St NW
2       638                       15th & P St NW.
3      1532  Massachusetts Ave & Dupont Circle NW
4       759           Adams Mill & Columbia Rd NW

Using combine_first, which replaces null values in col1 with the values from col2:

df["station"] = df["End station"].combine_first(df["Start station"])
# pass axis by keyword; the positional form was removed in pandas 2.0
df.drop(["End station", "Start station"], axis=1, inplace=True)
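As a self-contained illustration of combine_first on the question's data, here is a minimal sketch; the frame is rebuilt from the sample shown in the question, and the name `merged` is hypothetical.

```python
import numpy as np
import pandas as pd

# Sample frame from the question, with its original row labels.
df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638, 1532, 759],
        "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
        "Start station": [
            "14th & V St NW",
            "21st & I St NW",
            np.nan,
            "Massachusetts Ave & Dupont Circle NW",
            "Adams Mill & Columbia Rd NW",
        ],
    },
    index=[14, 19, 20, 27, 28],
)

# Take 'End station' where it is present, otherwise fall back to 'Start station'.
merged = df["End station"].combine_first(df["Start station"])

# Build the final frame without mutating df in place.
out = df.assign(station=merged).drop(columns=["End station", "Start station"])

print(out)
```

Because each row has exactly one non-null station, the order of the two columns in combine_first does not change the result here.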

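Using ffill:

# columns 1 and 2 are 'End station' and 'Start station' in the frame shown above;
# filling left to right copies 'End station' into an empty 'Start station'
df.iloc[:, 1:3] = df.iloc[:, 1:3].ffill(axis=1)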
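The row-wise fill can also be written with column labels instead of positions, which avoids depending on column order. This is a sketch under the assumption that the frame has the three columns shown in the question; the names `filled` and `out` are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample frame from the question (default 0..4 index).
df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638, 1532, 759],
        "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
        "Start station": [
            "14th & V St NW",
            "21st & I St NW",
            np.nan,
            "Massachusetts Ave & Dupont Circle NW",
            "Adams Mill & Columbia Rd NW",
        ],
    }
)

# Forward-fill along each row: a null 'Start station' receives the value
# from 'End station', the column to its left in this selection.
filled = df[["End station", "Start station"]].ffill(axis=1)

# After the fill, the right-hand column holds the merged station names.
out = pd.DataFrame(
    {"Duration": df["Duration"], "station": filled["Start station"]}
)

print(out)
```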
