How to split a dataframe string column into two columns?
I have a data frame with one (string) column and I'd like to split it into two (string) columns, with one column header as 'fips' and the other 'row'.
My dataframe df looks like this:
row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
I do not know how to use df.row.str[:] to achieve my goal of splitting the row cell. I can use df['fips'] = hello to add a new column and populate it with hello. Any ideas?
fips row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
For the simple case of having a text column with a delimiter and wanting two columns, the simplest solution is:
df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)
You must use expand=True if your strings have a non-uniform number of splits and you want None to replace the missing values.
Notice how, in either case, the .tolist() method is not necessary. Neither is zip().
Andy Hayden's solution is most excellent in demonstrating the power of the str.extract() method.
But for a simple split over a known separator (like splitting by dashes, or splitting by whitespace), the .str.split() method is enough [1]. It operates on a column (Series) of strings, and returns a column (Series) of lists:
>>> import pandas as pd
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})
>>> df
AB
0 A1-B1
1 A2-B2
>>> df['AB_split'] = df['AB'].str.split('-')
>>> df
AB AB_split
0 A1-B1 [A1, B1]
1 A2-B2 [A2, B2]
[1]: If you're unsure what the first two parameters of .str.split() do, I recommend the docs for the plain Python version of the method.
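As a quick illustration of that footnote, here is a minimal sketch of the plain Python str.split(sep, maxsplit) that the pandas method mirrors. The sample string is borrowed from the question; the variable names are only for illustration.

```python
# Plain Python str.split(sep, maxsplit): the first argument is the separator,
# the second caps how many splits are made (counted from the left).
text = "01001 Autauga County, AL"

all_parts = text.split(" ")      # split at every space
two_parts = text.split(" ", 1)   # split only at the first space

print(all_parts)   # ['01001', 'Autauga', 'County,', 'AL']
print(two_parts)   # ['01001', 'Autauga County, AL']
```

In pandas, the same cap is passed as the n keyword of .str.split().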
But how do you go from a column containing two-element lists to two columns, each containing the respective element of the lists?
Well, we need to take a closer look at the .str attribute of a column.
It's a magical object that collects methods that treat each element in a column as a string, and then applies the respective method to each element as efficiently as possible:
>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df
U
0 A
1 B
2 C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df
U L
0 A a
1 B b
2 C c
But it also has an "indexing" interface for getting each element of a string by its index:
>>> df['AB'].str[0]
0 A
1 A
Name: AB, dtype: object
>>> df['AB'].str[1]
0 1
1 2
Name: AB, dtype: object
Of course, this indexing interface of .str doesn't really care if each element it's indexing is actually a string, as long as it can be indexed, so:
>>> df['AB'].str.split('-', n=1).str[0]
0 A1
1 A2
Name: AB, dtype: object
>>> df['AB'].str.split('-', n=1).str[1]
0 B1
1 B2
Name: AB, dtype: object
Then, it's a simple matter of taking advantage of Python tuple unpacking of iterables (note that iterating over .str this way was deprecated in later pandas releases, so it may warn or fail on a recent install):
>>> df['A'], df['B'] = df['AB'].str.split('-', n=1).str
>>> df
AB AB_split A B
0 A1-B1 [A1, B1] A1 B1
1 A2-B2 [A2, B2] A2 B2
Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:
>>> df['AB'].str.split('-', n=1, expand=True)
0 1
0 A1 B1
1 A2 B2
So, another way of accomplishing what we wanted is to do:
>>> df = df[['AB']]
>>> df
AB
0 A1-B1
1 A2-B2
>>> df.join(df['AB'].str.split('-', n=1, expand=True).rename(columns={0:'A', 1:'B'}))
AB A B
0 A1-B1 A1 B1
1 A2-B2 A2 B2
The expand=True version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn't deal well with splits of different lengths:
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2', 'A3-B3-C3']})
>>> df
AB
0 A1-B1
1 A2-B2
2 A3-B3-C3
>>> df['A'], df['B'], df['C'] = df['AB'].str.split('-')
Traceback (most recent call last):
[...]
ValueError: Length of values does not match length of index
>>>
But expand=True handles it nicely by placing None in the columns for which there aren't enough "splits":
>>> df.join(
... df['AB'].str.split('-', expand=True).rename(
... columns={0:'A', 1:'B', 2:'C'}
... )
... )
AB A B C
0 A1-B1 A1 B1 None
1 A2-B2 A2 B2 None
2 A3-B3-C3 A3 B3 C3
There might be a better way, but here's one approach:
row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
df = pd.DataFrame(df.row.str.split(' ', n=1).tolist(),
                  columns=['fips', 'row'])
fips row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
You can extract the different parts out quite neatly using a regex pattern:
In [11]: df.row.str.extract(r'(?P<fips>\d{5})((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))')
Out[11]:
fips 1 state county state_code
0 00000 UNITED STATES UNITED STATES NaN NaN
1 01000 ALABAMA ALABAMA NaN NaN
2 01001 Autauga County, AL NaN Autauga County AL
3 01003 Baldwin County, AL NaN Baldwin County AL
4 01005 Barbour County, AL NaN Barbour County AL
[5 rows x 5 columns]
To explain the somewhat long regex:
(?P<fips>\d{5})
Matches the five digits (\d) and names them "fips".
The next part:
((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))
Does either (|) one of two things:
(?P<state>[A-Z ]*$)
Matches any number (*) of capital letters or spaces ([A-Z ]) and names this "state" before the end of the string ($),
or
(?P<county>.*?), (?P<state_code>[A-Z]{2}$))
Matches anything else (.*), then a comma and a space, then the two-letter state_code before the end of the string ($).
In the example, note that the first two rows hit the "state" (leaving NaN in the county and state_code columns), whilst the last three hit the county and state_code (leaving NaN in the state column).
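To see the two alternatives in action outside pandas, here is a small sketch that runs the same pattern through Python's re module on two of the sample rows. The names PATTERN, state_row and county_row are illustrative only. Note that the space after the digits ends up inside the captured group, so the sketch strips it.

```python
import re

# Same pattern as in the answer, written as a raw string.
PATTERN = re.compile(
    r'(?P<fips>\d{5})'
    r'((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))'
)

state_row = PATTERN.match('01000 ALABAMA').groupdict()
county_row = PATTERN.match('01001 Autauga County, AL').groupdict()

# The separating space is captured too, so strip it before use.
print(state_row['fips'], state_row['state'].strip())
print(county_row['county'].strip(), county_row['state_code'])
```

Groups belonging to the alternative that did not match come back as None here, which corresponds to the NaN cells in the pandas output.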
df[['fips', 'row']] = df['row'].str.split(' ', n=1, expand=True)
You can use str.split by whitespace (the default separator) with the parameter expand=True to get a DataFrame, and assign it to new columns:
df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA',
'01001 Autauga County, AL', '01003 Baldwin County, AL',
'01005 Barbour County, AL']})
print (df)
row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
df[['a','b']] = df['row'].str.split(n=1, expand=True)
print (df)
row a b
0 00000 UNITED STATES 00000 UNITED STATES
1 01000 ALABAMA 01000 ALABAMA
2 01001 Autauga County, AL 01001 Autauga County, AL
3 01003 Baldwin County, AL 01003 Baldwin County, AL
4 01005 Barbour County, AL 01005 Barbour County, AL
A modification, if you need to remove the original column, with DataFrame.pop:
df[['a','b']] = df.pop('row').str.split(n=1, expand=True)
print (df)
a b
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
Which is the same as:
df[['a','b']] = df['row'].str.split(n=1, expand=True)
df = df.drop('row', axis=1)
print (df)
a b
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
If you get an error:
#remove n=1 for split by all whitespaces
df[['a','b']] = df['row'].str.split(expand=True)
ValueError: Columns must be same length as key
You can check that it returns a 4-column DataFrame, not just 2:
print (df['row'].str.split(expand=True))
0 1 2 3
0 00000 UNITED STATES None
1 01000 ALABAMA None None
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
Then the solution is to append the new DataFrame with join:
df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA',
'01001 Autauga County, AL', '01003 Baldwin County, AL',
'01005 Barbour County, AL'],
'a':range(5)})
print (df)
a row
0 0 00000 UNITED STATES
1 1 01000 ALABAMA
2 2 01001 Autauga County, AL
3 3 01003 Baldwin County, AL
4 4 01005 Barbour County, AL
df = df.join(df['row'].str.split(expand=True))
print (df)
a row 0 1 2 3
0 0 00000 UNITED STATES 00000 UNITED STATES None
1 1 01000 ALABAMA 01000 ALABAMA None None
2 2 01001 Autauga County, AL 01001 Autauga County, AL
3 3 01003 Baldwin County, AL 01003 Baldwin County, AL
4 4 01005 Barbour County, AL 01005 Barbour County, AL
To also remove the original column (if there are other columns as well):
df = df.join(df.pop('row').str.split(expand=True))
print (df)
a 0 1 2 3
0 0 00000 UNITED STATES None
1 1 01000 ALABAMA None None
2 2 01001 Autauga County, AL
3 3 01003 Baldwin County, AL
4 4 01005 Barbour County, AL
If you don't want to create a new dataframe, or if your dataframe has more columns than just the ones you want to split, you could:
df["fips"], df["row_name"] = zip(*df["row"].str.split(n=1).tolist())
del df["row"]
If you want to split a string into more than two columns based on a delimiter, you can omit the 'maximum splits' parameter.
You can use:
df['column_name'].str.split('/', expand=True)
This will automatically create as many columns as the maximum number of fields included in any of your initial strings.
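A minimal sketch of that behaviour, assuming a made-up column whose rows contain one, two and three '/'-separated fields (the sample values are hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical sample data: rows hold different numbers of '/'-separated fields.
df = pd.DataFrame({'column_name': ['a/b', 'a/b/c', 'a']})

# No n= argument, so every '/' is a split point.
parts = df['column_name'].str.split('/', expand=True)
print(parts)
```

The result has three columns because the longest row has three fields; shorter rows are padded with missing values.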
Surprised I haven't seen this one yet. If you only need two splits, I highly recommend Series.str.partition. partition performs one split on the separator, and is generally quite performant.
df['row'].str.partition(' ')[[0, 2]]
0 2
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
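The reason columns 0 and 2 are selected is that partition yields three pieces: the text before the separator, the separator itself, and the text after it. A short sketch with the plain Python str.partition, which behaves the same way element by element (variable names are illustrative):

```python
row = "01001 Autauga County, AL"

# partition splits once, at the first occurrence of the separator.
before, sep, after = row.partition(" ")
print((before, sep, after))   # ('01001', ' ', 'Autauga County, AL')

# With no separator present, the whole string lands in the first slot.
no_sep = "ALABAMA".partition(" ")
print(no_sep)                 # ('ALABAMA', '', '')
```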
If you need to rename the columns,
df['row'].str.partition(' ')[[0, 2]].rename({0: 'fips', 2: 'row'}, axis=1)
fips row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
If you need to join this back to the original, use join or concat:
df.join(df['row'].str.partition(' ')[[0, 2]])
pd.concat([df, df['row'].str.partition(' ')[[0, 2]]], axis=1)
row 0 2
0 00000 UNITED STATES 00000 UNITED STATES
1 01000 ALABAMA 01000 ALABAMA
2 01001 Autauga County, AL 01001 Autauga County, AL
3 01003 Baldwin County, AL 01003 Baldwin County, AL
4 01005 Barbour County, AL 01005 Barbour County, AL
Use df.assign to create a new df. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
split = df_selected['name'].str.split(',', n=1, expand=True)
df_split = df_selected.assign(first_name=split[0], last_name=split[1])
df_split.drop(columns='name', inplace=True)
Or in method chain form:
df_split = (df_selected
            .assign(list_col=lambda df: df['name'].str.split(',', n=1, expand=False),
                    first_name=lambda df: df.list_col.str[0],
                    last_name=lambda df: df.list_col.str[1])
            .drop(columns=['list_col']))
I prefer exporting the corresponding pandas series (i.e. the columns I need), using the apply function to split the column content into multiple series, and then joining the generated columns to the existing DataFrame. Of course, the source column should be removed.
e.g.
col1 = df["<col_name>"].apply(<function>)
col2 = ...
df = df.join(col1.to_frame(name="<name1>"))
df = df.join(col2.to_frame(name="<name2>"))
df = df.drop(["<col_name>"], axis=1)
To split two-word strings, the function should be something like this:
lambda x: x.split(" ")[0] # for the first element
lambda x: x.split(" ")[-1] # for the last element
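Putting the template above together, here is a runnable sketch on two rows from the question. One adjustment is an assumption on my part: the lambdas pass maxsplit=1 so that multi-word names stay in one piece, whereas the [0] / [-1] versions above keep a single word each. The column names 'fips' and 'name' are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01001 Autauga County, AL']})

# maxsplit=1 keeps everything after the first space together
# (an adjustment to the lambdas above, which keep a single word each).
fips = df['row'].apply(lambda x: x.split(' ', 1)[0])
name = df['row'].apply(lambda x: x.split(' ', 1)[1])

# Join the new Series back as named columns, then drop the source column.
df = df.join(fips.to_frame(name='fips'))
df = df.join(name.to_frame(name='name'))
df = df.drop(['row'], axis=1)
print(df)
```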
I saw that no one had used the slice method, so I'll put my 2 cents here.
df["<col_name>"].str.slice(stop=5)
df["<col_name>"].str.slice(start=6)
This method will create two new columns.
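A small sketch of the slice approach on the question's data. It relies on the assumption that the code is always exactly five characters followed by one space; the column names 'fips' and 'name' are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01001 Autauga County, AL']})

# Characters 0-4 are the code; character 5 is the separating space.
df['fips'] = df['row'].str.slice(stop=5)
df['name'] = df['row'].str.slice(start=6)
print(df[['fips', 'name']])
```

Because it cuts at fixed positions, this only suits fixed-width prefixes; for variable-width fields one of the split-based approaches is the safer choice.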