I am somewhat of a beginner programmer and am looking for help and an explanation of a problem. I am looking to extract the ID numbers from a string into new column, then fill in missing numbers.
I am working with pandas dataframe and I have the following set of street names, some with an ID number and others missing:
*Start station*:
"19th & L St (31224)"
"14th & R St NW (31202)"
"Paul Rd & Pl NW (31602)"
"14th & R St NW"
"19th & L St"
"Paul Rd & Pl NW"
My desired outcome:
*Start station* *StartStatNum*
"14th & R St NW" 31202
"19th & L St" 31224
"Paul Rd & Pl NW" 31602
"14th & R St NW" 31202
"19th & L St" 31224
"Paul Rd & Pl NW" 31602
I am having difficulty after my first step of splitting. I can split based on position with the following:
def Stat_Num(Stat_Num):
return Stat_Num.split('(')[-1].split(')')[0].strip()
db["StartStatNum"] = pd.DataFrame({'Num':db['Start station'].apply(Stat_Num)})
But this gives:
*Start station* *StartStatNum*
"19th & L St (31224)" 31202
"14th & R St NW (31202)" 31224
"Paul Rd & Pl NW (31602)" 31602
"14th & R St NW" "14th & R St NW"
"19th & L St" "19th & L St"
"Paul Rd & Pl NW" "Paul Rd & Pl NW"
The problem would then arise when I want to find/fill StartStatNum with the station ID numbers that I don't have.
I have been trying to get to know str.extract, str.contains, re.findall
and tried the following as a possible stepping stone:
db['Start_S2'] = db['Start_Stat_Num'].str.extract(" ((\d+))")
db['Start_S2'] = db['Start station'].str.contains(" ((\d+))")
db['Start_S2'] = db['Start station'].re.findall(" ((\d+))")
I have also tried the following this from here
def parseIntegers(mixedList):
return [x for x in db['Start station'] if (isinstance(x, int) or isinstance(x, long)) and not isinstance(x, bool)]
However when I pass values in, I get a list 'x' with 1 value. As a bit of a noob, I don't think going the pattern route is best as it will also take in unwanted integers (although I could possibly turn to Nan's as they would be less than 30000 (the lowest value for ID number) I also have an idea that it could be something simple that I'm overlooking, but after about 20 straight hours and a lot of searching, I am at a bit of a loss.
Any help would be extremely helpful.
A solution could be to create a dataframe with the transformation
station -> id
like
l = ["19th & L St (31224)",
"14th & R St NW (31202)",
"Paul Rd & Pl NW (31602)",
"14th & R St NW",
"19th & L St",
"Paul Rd & Pl NW",]
df = pd.DataFrame( {"station":l})
df_dict = df['station'].str.extract("(?P<station_name>.*)\((?P<id>\d+)\)").dropna()
print df_dict
# result:
station_name id
0 19th & L St 31224
1 14th & R St NW 31202
2 Paul Rd & Pl NW 31602
[3 rows x 2 columns]
Starting from there, you can use some list comprehension:
l2 = [ [row["station_name"], row["id"]]
for line in l
for k,row in df_dict.iterrows()
if row["station_name"].strip() in line]
to get:
[['19th & L St ', '31224'],
['14th & R St NW ', '31202'],
['Paul Rd & Pl NW ', '31602'],
['14th & R St NW ', '31202'],
['19th & L St ', '31224'],
['Paul Rd & Pl NW ', '31602']]
I let you transform the later in dataframe...
There might be nicer solutions for the last part at least...
Here's a way that worked for me, firstly extract the numbers in the braces:
In [71]:
df['start stat num'] = df['Start station'].str.findall(r'\((\d+)\)').str[0]
df
Out[71]:
Start station start stat num
0 19th & L St (31224) 31224
1 14th & R St NW (31202) 31202
2 Paul Rd & Pl NW (31602) 31602
3 14th & R St NW NaN
4 19th & L St NaN
5 Paul Rd & Pl NW NaN
Now remove the number as we don't need it anymore:
In [72]:
df['Start station'] = df['Start station'].str.split(' \(').str[0]
df
Out[72]:
Start station start stat num
0 19th & L St 31224
1 14th & R St NW 31202
2 Paul Rd & Pl NW 31602
3 14th & R St NW NaN
4 19th & L St NaN
5 Paul Rd & Pl NW NaN
Now we can fill in the missing station number by calling map on the df with the NaN
rows removed, and the station name set as the index, this will lookup the station name and return the station number:
In [73]:
df['start stat num'] = df['Start station'].map(df.dropna().set_index('Start station')['start stat num'])
df
Out[73]:
Start station start stat num
0 19th & L St 31224
1 14th & R St NW 31202
2 Paul Rd & Pl NW 31602
3 14th & R St NW 31202
4 19th & L St 31224
5 Paul Rd & Pl NW 31602
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.