
Pandas DataFrame - Splitting Series Strings into Multiple Columns

My question is about the methodology and syntax described in a previous post, which covers different approaches to the same goal: splitting string values into lists and assigning each list item to a new column. Here's the post: Pandas DataFrame, how do i split a column into two

df:

                          GDP
Date                        
Mar 31, 2017  19.03 trillion
Dec 31, 2016  18.87 trillion

Script 1 + output:

>>> df['GDP'], df['Units'] = df['GDP'].str.split(' ', 1).str
>>> print(df)

                GDP     Units
Date                         
Mar 31, 2017  19.03  trillion
Dec 31, 2016  18.87  trillion

Script 2 + output:

>>> df[['GDP', 'Units']] = df['GDP'].str.split(' ', 1, expand=True)
>>> print(df)

                GDP     Units
Date                         
Mar 31, 2017  19.03  trillion
Dec 31, 2016  18.87  trillion

Script 3 + output:

>>> df['GDP'], df['Units'] = df['GDP'].str.split(' ', 1, expand=True)
>>> print(df)

              GDP  Units
Date                    
Mar 31, 2017    0      1
Dec 31, 2016    0      1

Can anyone explain what is going on? Why does script 3 produce these values in the output?

Let's start by looking at this:

df['GDP'].str.split(' ', 1)

Date
Mar 31, 2017    [19.03, trillion]
Dec 31, 2016    [18.87, trillion]
Name: GDP, dtype: object

It produces a Series of lists. However, the string accessor, pd.Series.str, lets us access the first, second, ... elements of these embedded lists via ordinary Python list indexing.

df['GDP'].str.split(' ', 1).str[0]

Date
Mar 31, 2017    19.03
Dec 31, 2016    18.87
Name: GDP, dtype: object

Or

df['GDP'].str.split(' ', 1).str[1]

Date
Mar 31, 2017    trillion
Dec 31, 2016    trillion
Name: GDP, dtype: object
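
As a self-contained illustration of this element-wise indexing, here is a minimal sketch that rebuilds the question's data by hand. The names `parts`, `first` and `second` are illustrative only, and `n=1` is passed by keyword because recent pandas versions no longer accept it positionally.

```python
import pandas as pd

# Sample data rebuilt from the question (an assumption about how df was created).
df = pd.DataFrame(
    {"GDP": ["19.03 trillion", "18.87 trillion"]},
    index=pd.Index(["Mar 31, 2017", "Dec 31, 2016"], name="Date"),
)

# Each row becomes a two-element list: [number, unit].
parts = df["GDP"].str.split(" ", n=1)

# The .str accessor indexes into each row's list element-wise.
first = parts.str[0]
second = parts.str[1]

print(first.tolist())
print(second.tolist())
```

Both results keep the original Date index, which is why they can later be assigned straight back onto the frame.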

So, if we split into two-element lists with split(' ', 1), we can treat the object returned by an additional .str as an iterable and unpack it:

a, b = df['GDP'].str.split(' ', 1).str

a

Date
Mar 31, 2017    19.03
Dec 31, 2016    18.87
Name: GDP, dtype: object

And

b

Date
Mar 31, 2017    trillion
Dec 31, 2016    trillion
Name: GDP, dtype: object

OK, we can shortcut the creation of two new columns by leveraging this iterable unpacking:

df['GDP'], df['Units'] = df['GDP'].str.split(' ', 1).str
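
Unpacking the .str accessor depends on the accessor being iterable. That held for the pandas version used in this thread but, as far as I know, the behaviour was later deprecated and removed. A sketch of the same shortcut that does not rely on it, assuming the same hand-built sample data, indexes the two pieces explicitly:

```python
import pandas as pd

# Sample data rebuilt from the question (an assumption about how df was created).
df = pd.DataFrame(
    {"GDP": ["19.03 trillion", "18.87 trillion"]},
    index=pd.Index(["Mar 31, 2017", "Dec 31, 2016"], name="Date"),
)

parts = df["GDP"].str.split(" ", n=1)

# Explicit element-wise indexing stands in for `a, b = parts.str`.
df["GDP"], df["Units"] = parts.str[0], parts.str[1]

print(df)
```

The right-hand side is an ordinary two-element tuple of Series, so the unpacking is plain Python and does not depend on how the accessor behaves.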

Alternatively, we can pass expand=True to expand the split lists into new DataFrame columns:

df['GDP'].str.split(' ', 1, expand=True)

                  0         1
Date                         
Mar 31, 2017  19.03  trillion
Dec 31, 2016  18.87  trillion

Now we can assign that DataFrame to new columns of another DataFrame like so:

df[['GDP', 'Units']] = df['GDP'].str.split(' ', 1, expand=True)
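
A runnable sketch of this assignment, again on hand-built sample data. `split_frame` is an illustrative name, and `n=1` is passed by keyword for compatibility with recent pandas versions.

```python
import pandas as pd

# Sample data rebuilt from the question (an assumption about how df was created).
df = pd.DataFrame(
    {"GDP": ["19.03 trillion", "18.87 trillion"]},
    index=pd.Index(["Mar 31, 2017", "Dec 31, 2016"], name="Date"),
)

# expand=True returns a DataFrame whose columns are labelled 0 and 1.
split_frame = df["GDP"].str.split(" ", n=1, expand=True)
print(list(split_frame.columns))  # [0, 1]

# Assigning a two-column frame to a list of two column names
# fills them in order: column 0 -> 'GDP', column 1 -> 'Units'.
df[["GDP", "Units"]] = split_frame

print(df)
```

Here pandas sees a single DataFrame on the right-hand side and matches its columns to the listed names by position; no Python-level unpacking is involved.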

However, when we do

df['GDP'], df['Units'] = df['GDP'].str.split(' ', 1, expand=True)

The return value of df['GDP'].str.split(' ', 1, expand=True) is a DataFrame, and unpacking a DataFrame iterates over its column labels, not its column data. As the output just above shows, those labels are 0 and 1. So in this case the scalar 0 is assigned to df['GDP'] and the scalar 1 is assigned to df['Units'], each broadcast to every row.
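
The following sketch reproduces the effect on hand-built sample data and shows one possible way to unpack the actual columns instead. The names `split_frame`, `broken` and `fixed` are illustrative, and the `items()` approach is just one option, not something taken from the original thread.

```python
import pandas as pd

# Sample data rebuilt from the question (an assumption about how df was created).
df = pd.DataFrame(
    {"GDP": ["19.03 trillion", "18.87 trillion"]},
    index=pd.Index(["Mar 31, 2017", "Dec 31, 2016"], name="Date"),
)

split_frame = df["GDP"].str.split(" ", n=1, expand=True)

# Iterating a DataFrame yields its column labels.
labels = list(split_frame)
print(labels)  # [0, 1]

# Tuple unpacking therefore binds the labels 0 and 1,
# and each scalar is broadcast down its target column.
broken = df.copy()
broken["GDP"], broken["Units"] = split_frame
print(broken)

# To unpack the column data, iterate over (label, Series) pairs instead.
fixed = df.copy()
fixed["GDP"], fixed["Units"] = (column for _, column in split_frame.items())
print(fixed)
```

`broken` matches the output of script 3, while `fixed` matches the output of scripts 1 and 2.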
