
How do I remove numbers and/or parentheses from names in a DataFrame column?

Several country names in my column contain numbers and/or parenthesized text that I need to remove.

For example:

  • 'Bolivia (Plurinational State of)' should be 'Bolivia'
  • 'Switzerland17' should be 'Switzerland'

The column in question is also set as my index, in case that matters.

Try this:

In [121]: df
Out[121]:
                                     expected
Bolivia (Plurinational State of)      Bolivia
Switzerland17                     Switzerland

In [122]: df.set_index(df.index.str.replace(r'\s*\(.*?\)\s*', '', regex=True).str.replace(r'\d+', '', regex=True), inplace=True)

In [123]: df
Out[123]:
                expected
Bolivia          Bolivia
Switzerland  Switzerland

In [124]: df.index == df.expected
Out[124]: array([ True,  True], dtype=bool)

In [125]: (df.index == df.expected).all()
Out[125]: True
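
Alternatively, cut each name off at the first digit or opening parenthesis:

def remove(data):
    # Truncate at the first digit or '(' and drop any trailing whitespace.
    for i, ch in enumerate(data):
        if ch.isdigit() or ch == '(':
            return data[:i].rstrip()
    return data

df['Country'] = df['Country'].apply(remove)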
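
Another option strips every digit, then drops everything from the first '(' onward:

def remove_digit(data):
    # Remove all digits, wherever they appear in the name.
    new_data = ''.join(ch for ch in data if not ch.isdigit())
    # Cut the name at the first opening parenthesis, if there is one.
    i = new_data.find('(')
    if i > -1:
        new_data = new_data[:i]
    return new_data.strip()

energy['Country'] = energy['Country'].apply(remove_digit)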

One way to do this without touching the index is to apply the substitution to the column itself:

import re
df['Country'] = df['Country'].apply(lambda x: re.sub(r'\s*\(.*?\)\s*|\d+', '', x))
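
The combined pattern above can also be wrapped in a small reusable function. This is only a sketch: `clean_name` and `_NOISE` are made-up names for illustration, and the sample list stands in for the real column.

```python
import re

# Matches either a parenthesized group (plus surrounding whitespace)
# or a run of digits. The name _NOISE is hypothetical.
_NOISE = re.compile(r"\s*\(.*?\)\s*|\d+")


def clean_name(name):
    """Return the name with parenthesized text and digits removed."""
    return _NOISE.sub("", name).strip()


names = ["Bolivia (Plurinational State of)", "Switzerland17", "France"]
cleaned = [clean_name(n) for n in names]
print(cleaned)
```

With pandas, the same function should work on a column via `df['Country'].apply(clean_name)`, or on the index via `df.index = df.index.map(clean_name)`. Note that the pattern also swallows the whitespace on both sides of a parenthesized group, so text in the middle of a name would be joined without a space.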
