Extract string from column following a specific pattern

Please forgive my pandas newbie question. I have a column of US towns and states, such as the truncated version shown below. (For some strange reason, the column is named 'Alabama[edit]', which is really the state header for the towns in rows 0-7.)

0                          Auburn (Auburn University)[1]
1                 Florence (University of North Alabama)
2        Jacksonville (Jacksonville State University)[2]
3             Livingston (University of West Alabama)[2]
4               Montevallo (University of Montevallo)[2]
5                              Troy (Troy University)[2]
6      Tuscaloosa (University of Alabama, Stillman Co...
7                      Tuskegee (Tuskegee University)[5]
8                                           Alaska[edit]
9          Fairbanks (University of Alaska Fairbanks)[2]
10                                         Arizona[edit]
11            Flagstaff (Northern Arizona University)[6]
12                      Tempe (Arizona State University)
13                        Tucson (University of Arizona)
14                                        Arkansas[edit]
15     Arkadelphia (Henderson State University, Ouach...
16     Conway (Central Baptist College, Hendrix Colle...
17              Fayetteville (University of Arkansas)[7]
18              Jonesboro (Arkansas State University)[8]
19            Magnolia (Southern Arkansas University)[2]
20     Monticello (University of Arkansas at Monticel...
21            Russellville (Arkansas Tech University)[2]
22                        Searcy (Harding University)[5]
23                                      California[edit]

The towns in each state are listed below that state's name, e.g. Fairbanks (row 9) is a town in the state of Alaska.

What I want to do is split up the town names by state, so that I have two columns, 'RegionName' and 'State', where each town name is paired with its state name, like so:

    RegionName                                         State
0   Auburn (Auburn University)[1]                      Alabama
1   Florence (University of North Alabama)             Alabama
2   Jacksonville (Jacksonville State University)[2]    Alabama
3   Livingston (University of West Alabama)[2]         Alabama
4   Montevallo (University of Montevallo)[2]           Alabama
5   Troy (Troy University)[2]                          Alabama
6   Tuscaloosa (University of Alabama, Stillman Co...  Alabama
7   Tuskegee (Tuskegee University)[5]                  Alabama
8   Fairbanks (University of Alaska Fairbanks)[2]      Alaska
9   Flagstaff (Northern Arizona University)[6]         Arizona
10  Tempe (Arizona State University)                   Arizona
11  Tucson (University of Arizona)                     Arizona
12  Arkadelphia (Henderson State University, Ouach...  Arkansas

... etc.

I know that each state name is followed by the string '[edit]', which I assume I can use to split the column and assign the town names to states. But I don't know how to do this.

Also, I know that there's a lot of other data cleaning I need to do, such as removing the strings within parentheses and within the brackets '[]'. That can be done later. The important part is splitting up the states and towns and assigning each town to its proper US state. Any advice would be most appreciated.

Without much context or access to your data, I'd suggest something along these lines. First, modify the code that reads your data:

import pandas as pd

df = pd.read_csv(..., header=None, names=['RegionName'])
# header=None makes pandas read the first row as data rather than as column names
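
As a small illustration of why `header=None` matters here, the sketch below reads a few of the question's rows from an in-memory string. The string `raw` is a hypothetical stand-in for the real file, whose path and format are not shown, and `sep="\t"` is an assumption made only so that commas inside the parentheses are not treated as delimiters.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: one entry per line.
raw = (
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)[2]\n"
)

# Default behaviour: the first line is consumed as the column name,
# which is how a column ends up being called 'Alabama[edit]'.
with_header = pd.read_csv(io.StringIO(raw), sep="\t")

# header=None keeps the first line as data; names= supplies the column label.
no_header = pd.read_csv(
    io.StringIO(raw), sep="\t", header=None, names=["RegionName"]
)

print(list(with_header.columns))
print(len(with_header), len(no_header))
```

With the default settings the first state header is lost from the data, so the Alabama towns would have no state row above them to fill from.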

Now, extract the state name using str.extract. This should only extract a name when it is followed by the substring "[edit]"; every other row gets NaN. You can then forward fill all NaN values using ffill, so that each town row picks up the most recent state name above it.

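df['State'] = df['RegionName'].str.extract(
    r'(?P<State>.*)(?=\s*\[edit\])',
    expand=False,  # return a Series rather than a one-column DataFrame
).ffill()  # carry each state name down to the town rows below it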
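
Note that after the forward fill, the state header rows themselves (e.g. 'Alabama[edit]') are still in the frame, whereas the desired output in the question omits them. Below is a minimal end-to-end sketch, assuming a small hand-built sample of the question's data. The helper name `split_states` is hypothetical, and the pattern is a slightly different one from the above (anchored and non-greedy) so that any whitespace before "[edit]" is not captured. The last step shows one possible way to do the later cleanup the question mentions, assuming town names themselves never contain parentheses or brackets.

```python
import pandas as pd


def split_states(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add a 'State' column and drop the state header rows."""
    out = df.copy()
    # Capture the text before "[edit]"; rows without it yield NaN.
    state = out["RegionName"].str.extract(r"^(.*?)\s*\[edit\]", expand=False)
    # Carry each state name down to the town rows beneath it.
    out["State"] = state.ffill()
    # Town rows are exactly those where the extraction did not match.
    return out[state.isna()].reset_index(drop=True)


sample = pd.DataFrame({"RegionName": [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
]})

result = split_states(sample)
print(result)

# Optional cleanup: drop everything from the first "(" or "[" onward.
towns = result["RegionName"].str.replace(r"\s*[(\[].*$", "", regex=True)
print(towns.tolist())
```

Keeping the un-filled `state` Series around is a design choice: it doubles as the mask that separates header rows from town rows, so no second regex pass is needed.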
