Extracting certain integers from string values of different lengths that also contain unwanted integers: pattern or position?

Question

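I'm a fairly new programmer and am looking for help with, and an explanation of, a problem. I want to extract ID numbers from a string into a new column, then fill in the missing numbers.

I am working with a pandas DataFrame and have the following set of street names, some with an ID number and others missing it: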

*Start station*:
"19th & L St (31224)"
"14th & R St NW (31202)"
"Paul Rd & Pl NW (31602)"
"14th & R St NW"
"19th & L St"
"Paul Rd & Pl NW"

My desired outcome:
*Start station*         *StartStatNum*
"14th & R St NW"        31202
"19th & L St"           31224
"Paul Rd & Pl NW"       31602
"14th & R St NW"        31202
"19th & L St"           31224
"Paul Rd & Pl NW"       31602

I get stuck after the first step, the split. I can split based on position with the following:

def Stat_Num(Stat_Num):
    return Stat_Num.split('(')[-1].split(')')[0].strip()

db["StartStatNum"] = pd.DataFrame({'Num':db['Start station'].apply(Stat_Num)})

But this gives:
*Start station*         *StartStatNum*
"19th & L St (31224)"        31224
"14th & R St NW (31202)"     31202
"Paul Rd & Pl NW (31602)"    31602
"14th & R St NW"            "14th & R St NW"
"19th & L St"               "19th & L St"
"Paul Rd & Pl NW"           "Paul Rd & Pl NW"

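The problem comes when I then want to look up and fill in StartStatNum for the rows where the station ID number is missing.

I have been trying to get to grips with str.extract, str.contains and re.findall, and tried the following as possible stepping stones: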

db['Start_S2']  = db['Start_Stat_Num'].str.extract(" ((\d+))")
db['Start_S2']  = db['Start station'].str.contains(" ((\d+))")
db['Start_S2']  = db['Start station'].re.findall(" ((\d+))")

I also tried the following, taken from another answer:

def parseIntegers(mixedList):
    return [x for x in db['Start station'] if (isinstance(x, int) or isinstance(x, long)) and not isinstance(x, bool)]

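However, when I pass values in, I get a list 'x' with 1 value. Being a bit of a noob, I don't think the pattern route is the best one, as it would also pick up unwanted integers (although I could possibly turn those into NaN, since they would be less than 30000, the lowest ID number). I also have an inkling that I'm overlooking something simple, but after about 20 hours straight and a lot of searching, I'm at a bit of a loss.

Any help would be much appreciated.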

Answer 1

One solution could be to build a DataFrame that holds the mapping

station -> id

like this:

import pandas as pd

l = ["19th & L St (31224)",
    "14th & R St NW (31202)",
    "Paul Rd & Pl NW (31602)",
    "14th & R St NW",
    "19th & L St",
    "Paul Rd & Pl NW",]

df = pd.DataFrame( {"station":l})
# Named groups become column names; rows without "(digits)" are dropped.
df_dict = df['station'].str.extract(r"(?P<station_name>.*)\((?P<id>\d+)\)").dropna()
print(df_dict)

 # result:
       station_name     id
 0      19th & L St   31224
 1   14th & R St NW   31202
 2  Paul Rd & Pl NW   31602
 [3 rows x 2 columns]

From there, you can use a list comprehension:

l2 = [ [row["station_name"], row["id"]]
       for line in l
       for k,row in df_dict.iterrows()
       if row["station_name"].strip() in line]

to get:

 [['19th & L St ', '31224'], 
  ['14th & R St NW ', '31202'], 
  ['Paul Rd & Pl NW ', '31602'], 
  ['14th & R St NW ', '31202'], 
  ['19th & L St ', '31224'], 
  ['Paul Rd & Pl NW ', '31602']]

I'll leave it to you to convert that list into a DataFrame...

There may well be a better solution, at least for the last part...
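For completeness, here is a minimal sketch of that last conversion step. The column names `Start station` and `StartStatNum` are taken from the question; the hard-coded `l2` simply repeats the output shown above, and stripping the trailing space and casting the ID to an integer are assumptions about what is wanted rather than part of the original answer.

```python
import pandas as pd

# Same pairs as the list comprehension output above (note the trailing spaces).
l2 = [['19th & L St ', '31224'],
      ['14th & R St NW ', '31202'],
      ['Paul Rd & Pl NW ', '31602'],
      ['14th & R St NW ', '31202'],
      ['19th & L St ', '31224'],
      ['Paul Rd & Pl NW ', '31602']]

# Each inner list becomes one row; the names label the two columns.
result = pd.DataFrame(l2, columns=['Start station', 'StartStatNum'])
result['Start station'] = result['Start station'].str.strip()  # drop trailing space
result['StartStatNum'] = result['StartStatNum'].astype(int)    # '31224' -> 31224
print(result)
```

Note that the list comprehension matches names by substring, so this only behaves well if no station name is contained inside another one.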

Answer 2

Here is an approach that worked for me. First, extract the numbers in parentheses:

In [71]:

df['start stat num'] = df['Start station'].str.findall(r'\((\d+)\)').str[0]
df
Out[71]:
             Start station start stat num
0      19th & L St (31224)          31224
1   14th & R St NW (31202)          31202
2  Paul Rd & Pl NW (31602)          31602
3           14th & R St NW            NaN
4              19th & L St            NaN
5          Paul Rd & Pl NW            NaN
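
As a possible alternative for this first step, `str.extract` with `expand=False` returns the first match directly as a Series, so the `.str[0]` indexing is not needed. Anchoring the pattern on the literal parentheses is also what keeps unwanted integers, such as the `19` in `19th`, out of the result. The sample frame below is rebuilt from the question's data and is only a sketch.

```python
import pandas as pd

df = pd.DataFrame({'Start station': [
    "19th & L St (31224)",
    "14th & R St NW (31202)",
    "Paul Rd & Pl NW (31602)",
    "14th & R St NW",
    "19th & L St",
    "Paul Rd & Pl NW",
]})

# Capture only digits wrapped in literal parentheses; rows without a
# parenthesised number get NaN.
df['start stat num'] = df['Start station'].str.extract(r'\((\d+)\)', expand=False)
print(df)
```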

Now remove the number from the station name, as we no longer need it there:

In [72]:

df['Start station'] = df['Start station'].str.split(r' \(').str[0]
df
Out[72]:
     Start station start stat num
0      19th & L St          31224
1   14th & R St NW          31202
2  Paul Rd & Pl NW          31602
3   14th & R St NW            NaN
4      19th & L St            NaN
5  Paul Rd & Pl NW            NaN

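Now we can fill in the missing station numbers by calling map with a lookup built from df: drop the NaN rows and set the station name as the index, so that map looks up each station name and returns its station number: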

In [73]:

df['start stat num'] = df['Start station'].map(df.dropna().set_index('Start station')['start stat num'])
df
Out[73]:
     Start station start stat num
0      19th & L St          31224
1   14th & R St NW          31202
2  Paul Rd & Pl NW          31602
3   14th & R St NW          31202
4      19th & L St          31224
5  Paul Rd & Pl NW          31602
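
One caveat worth sketching: `map` with a Series needs a unique index, so if the real data contains the same station with its ID more than once, the lookup should be de-duplicated first. The example below assumes each station name has exactly one ID; the extra duplicated row and the final integer cast are additions for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Start station': ['19th & L St', '14th & R St NW', '19th & L St',
                      '14th & R St NW', '19th & L St'],
    'start stat num': ['31224', '31202', '31224', None, None],
})

# Build a station -> ID lookup with one row per station name.
lookup = (df.dropna(subset=['start stat num'])
            .drop_duplicates(subset='Start station')
            .set_index('Start station')['start stat num'])

# Every station here has a known ID, so the cast to int is safe.
df['start stat num'] = df['Start station'].map(lookup).astype(int)
print(df)
```

If some station never appears with an ID, its rows stay NaN after `map`, and the integer cast would have to be skipped or replaced with a nullable type.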

Answer 1
1 2015-05-14 17:57:25

Answer 2
1 Accepted 2015-05-14 18:17:47
