Extracting specific words from text using Pandas

Question

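In my DataFrame, several country names contain digits and/or parentheses. I want to remove the parentheses and the digits from these country names.

For example: 'Bolivia (Plurinational State of)' should become 'Bolivia', and 'Switzerland17' should become 'Switzerland'.

Here is my code, but it doesn't seem to work: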

import numpy as np 
import pandas as pd 


def func():
    energy=pd.ExcelFile('Energy Indicators.xls').parse('Energy')
    energy=energy.iloc[16:243][['Environmental Indicators: Energy','Unnamed: 3','Unnamed: 4','Unnamed: 5']].copy()
    energy.columns=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
    o="..."
    n=np.NaN
    energy = energy.replace('...', np.nan)




    energy['Energy Supply']=energy['Energy Supply']*1000000

    old=["Republic of Korea","United States of America","United Kingdom of " 
                                +"Great Britain and Northern Ireland","China, Hong "
                                +"Kong Special Administrative Region"]
    new=["South Korea","United States","United Kingdom","Hong Kong"]
    for i in range(0,4):

        energy = energy.replace(old[i], new[i])

    #I'm trying to remove it here =====> 

    p="("

    for j in range(16,243):
        if p in energy.iloc[j]['Country']:
            country=""
            for c in energy.iloc[j]['Country'] : 

                while(c!=p & !c.isnumeric()):
                    country=c+country
            energy = energy.replace(energy.iloc[j]['Country'], country)


    return energy

This is the .xls file I'm working with: https://drive.google.com/file/d/0B80lepon1RrYeDRNQVFWYVVENHM/view?usp=sharing

Answer 1

Use str.extract:

energy['Country'] = energy['Country'].str.extract(r'(^[a-zA-Z]+)', expand=False)

df

                            country
0  Bolivia (Plurinational State of)
1                     Switzerland17

df['country'] = df['country'].str.extract('(^[a-zA-Z]+)', expand=False)
df

       country
0      Bolivia
1  Switzerland
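
As a side note on the `expand=False` argument used above, here is a minimal self-contained sketch. The two-row frame is the sample data from this answer, rebuilt by hand; the variable names `pattern`, `as_series` and `as_frame` are only illustrative. With a single capture group, `expand=False` gives back a Series that can be assigned straight to the column, while `expand=True` gives back a one-column DataFrame.

```python
import pandas as pd

# Sample data from the answer, rebuilt by hand
df = pd.DataFrame(
    {"country": ["Bolivia (Plurinational State of)", "Switzerland17"]}
)

# One capture group: the leading run of ASCII letters
pattern = r"(^[a-zA-Z]+)"

# expand=False with one group returns a Series
as_series = df["country"].str.extract(pattern, expand=False)

# expand=True returns a DataFrame with one column per group
as_frame = df["country"].str.extract(pattern, expand=True)

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_series.tolist())        # ['Bolivia', 'Switzerland']
```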

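To handle countries with spaces in their names (which is very common), a small tweak to the regular expression is enough: add \s to the character class, then strip the trailing whitespace.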

df

                            country
0  Bolivia (Plurinational State of)
1                     Switzerland17
2             West Indies (foo bar)

df['country'] = df['country'].str.extract(r'(^[a-zA-Z\s]+)', expand=False).str.strip()
df

       country
0      Bolivia
1  Switzerland
2  West Indies
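
One caveat: the character class `[a-zA-Z\s]` only keeps ASCII letters and whitespace, so a name containing an accent, apostrophe or hyphen would be cut at the first such character. A possible alternative, sketched here rather than taken from the answer, is to delete what the question asks to delete (the parenthesised part and the digits) with `str.replace`. The last two rows are hypothetical additions for illustration and are not part of the original sample.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": [
            "Bolivia (Plurinational State of)",
            "Switzerland17",
            "West Indies (foo bar)",
            "Côte d'Ivoire",   # hypothetical row: accent and apostrophe
            "Guinea-Bissau",   # hypothetical row: hyphen
        ]
    }
)

cleaned = (
    df["country"]
    # Drop a parenthesised part together with the whitespace before it
    .str.replace(r"\s*\(.*\)", "", regex=True)
    # Drop any digits (e.g. footnote markers glued to the name)
    .str.replace(r"\d+", "", regex=True)
    # Remove leftover leading/trailing whitespace
    .str.strip()
)

print(cleaned.tolist())
# ['Bolivia', 'Switzerland', 'West Indies', "Côte d'Ivoire", 'Guinea-Bissau']
```

Assigning `cleaned` back to the column replaces the names in place; whether removing or extracting is preferable depends on what else appears in the real data.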

Answer 1: score 2, accepted, 2017-10-22 01:14:55

使用 Pandas 從文本中提取特定單詞

問題描述

1 個解決方案

解決方案1 2 已采納 2017-10-22 01:14:55

解決方案1
2 已采納 2017-10-22 01:14:55