Split text in a column into three columns

This question is a follow-up to Pietro's fantastic answer on how to split a column into multiple columns. My goal is to take a column from an existing data frame, split it on whitespace, and place the first three or four split values in particular columns, ignoring the remainder.

The issue with this split is that the number of whitespace-separated words varies between rows. Sometimes the data looks like "Fort Lee NJ 07024"; other times it looks like "NY NY 10000". I'm not sure if there's an easy fix.

df['City, State, Zip'].str.split()
# Returns a variable length row. 
# I need to take the first three or four values, and add them to columns: City/State/Zip
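To make the problem concrete, here is a minimal sketch that counts the tokens per row after a plain split. The two sample rows are the ones quoted in the question; the `tokens` and `token_counts` names are only illustrative.

```python
import pandas as pd

# Sample rows quoted in the question; the column name matches the question's.
df = pd.DataFrame({'City, State, Zip': ['Fort Lee NJ 07024', 'NY NY 10000']})

tokens = df['City, State, Zip'].str.split()  # Series of lists, one list per row
token_counts = tokens.str.len()              # number of tokens in each row
print(token_counts.tolist())
```

Because the counts differ between rows, the city cannot be picked out by a fixed position from the left.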

Assuming that state and zip are always present and contain valid data, one way to solve this problem is to first split the string on whitespace. The state and zip are then simply the second-to-last and last elements, respectively. I've used a list comprehension to extract them from city_state_zip. To extract the city, I've used a nested list comprehension together with join. The last two elements are the state and zip, so the length of the list minus two tells you how many elements belong to the city name. You then just need to join them with a space.

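import pandas as pd

df = pd.DataFrame({'city_state_zip': ['Fort Lee NJ 07024',
                                      'NY NY 10000',
                                      'Carmel by the Sea CA 93922']})

# Split each string on whitespace into a list of words.
city_state_zip = df.city_state_zip.apply(lambda x: x.split())
# Everything except the last two words is the city name.
df['city'] = [" ".join([x[c] for c in range(len(x) - 2)]) for x in city_state_zip]
# The last two words are the state and the zip code.
df['state'] = [x[-2] for x in city_state_zip]
df['zip'] = [x[-1] for x in city_state_zip]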
>>> df
               city_state_zip               city state    zip
0           Fort Lee NJ 07024           Fort Lee    NJ  07024
1                 NY NY 10000                 NY    NY  10000
2  Carmel by the Sea CA 93922  Carmel by the Sea    CA  93922
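If the last two words are always the state and zip, the same result can probably be reached more compactly by splitting from the right. The following is a sketch of that alternative, not the method used above; it relies on pandas' `Series.str.rsplit` with `n=2` and `expand=True`, and `parts` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'city_state_zip': ['Fort Lee NJ 07024',
                                      'NY NY 10000',
                                      'Carmel by the Sea CA 93922']})

# Split on whitespace from the right at most twice, so whatever remains
# on the left (one word or several) is kept together as the city.
parts = df['city_state_zip'].str.rsplit(n=2, expand=True)
parts.columns = ['city', 'state', 'zip']

df = df.join(parts)
print(df)
```

Like the list-comprehension version, this assumes every row contains at least three words; shorter rows would produce missing values rather than an error.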

EDIT: As suggested by DSM, it looks like the last two words are the state and zip code, in which case you can do

df = pd.DataFrame({'city_state_zip': ['Fort Lee NJ 07024', 
                                      'NY NY 10000', 
                                      'Carmel by the Sea CA 93922']})

In [50]: regex = r'(?P<City>[a-zA-Z ]*) (?P<State>[A-Z]{2}) (?P<Zip>[\d-]*)'
         df.city_state_zip.str.extract(regex)
Out[50]:
    City             State  Zip
0   Fort Lee            NJ  07024
1   NY                  NY  10000
2   Carmel by the Sea   CA  93922

This method extracts with a regex that uses multiple named groups, one each for City, State and Zip. The result of the extract method is a dataframe with 3 columns, as shown. A group is defined by enclosing its regex in parentheses. To name a group, insert ?P<group name> right after the opening parenthesis, before the group's regex. This solution assumes that city names contain only upper- and lower-case letters and spaces, and that state abbreviations contain exactly 2 capital letters, but you can adjust it if this isn't the case. Note that the spaces between the groups in the regex matter, as they represent the spaces between the city, state and zip.
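As one example of adjusting the pattern, the sketch below anchors it to the whole string and restricts the zip to five digits with an optional four-digit extension. This stricter pattern and the extra `'no zip here'` row are assumptions added for illustration, not part of the answer; rows that do not match come back as missing values, which makes them easy to find.

```python
import pandas as pd

df = pd.DataFrame({'city_state_zip': ['Fort Lee NJ 07024',
                                      'NY NY 10000',
                                      'Carmel by the Sea CA 93922',
                                      'no zip here']})  # hypothetical bad row

# Assumed stricter variant: anchored at both ends, zip is 5 digits
# optionally followed by a hyphen and 4 more digits.
regex = r'^(?P<City>[A-Za-z ]+) (?P<State>[A-Z]{2}) (?P<Zip>\d{5}(?:-\d{4})?)$'

extracted = df['city_state_zip'].str.extract(regex)
unmatched = extracted['Zip'].isna()  # True where the pattern did not match
print(extracted)
print(df.loc[unmatched, 'city_state_zip'].tolist())
```

Checking `unmatched` before assigning the new columns is a cheap way to catch rows that break the stated assumptions about city names and state abbreviations.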
