
Extract ID integers from string values of different lengths that also contain unwanted integers: by pattern or by position?

I am somewhat of a beginner programmer and am looking for help with, and an explanation of, a problem. I want to extract the ID numbers from a string into a new column, then fill in the missing numbers.

I am working with a pandas DataFrame and I have the following set of street names, some with an ID number and others missing it:

*Start station*:
"19th & L St (31224)"
"14th & R St NW (31202)"
"Paul Rd & Pl NW (31602)"
"14th & R St NW"
"19th & L St"
"Paul Rd & Pl NW"

My desired outcome:
*Start station*         *StartStatNum*
"14th & R St NW"        31202
"19th & L St"           31224
"Paul Rd & Pl NW"       31602
"14th & R St NW"        31202
"19th & L St"           31224
"Paul Rd & Pl NW"       31602

I am having difficulty after my first step of splitting. I can split based on position with the following:

def Stat_Num(Stat_Num):
    return Stat_Num.split('(')[-1].split(')')[0].strip()

db["StartStatNum"] = pd.DataFrame({'Num':db['Start station'].apply(Stat_Num)})

But this gives:
*Start station*         *StartStatNum*
"19th & L St (31224)"        31224
"14th & R St NW (31202)"     31202
"Paul Rd & Pl NW (31602)"    31602
"14th & R St NW"            "14th & R St NW"
"19th & L St"               "19th & L St"
"Paul Rd & Pl NW"           "Paul Rd & Pl NW"

The problem would then arise when I want to find/fill StartStatNum with the station ID numbers that I don't have.

I have been trying to get to know str.extract, str.contains, re.findall and tried the following as a possible stepping stone:

db['Start_S2']  = db['Start_Stat_Num'].str.extract(" ((\d+))")
db['Start_S2']  = db['Start station'].str.contains(" ((\d+))")
db['Start_S2']  = db['Start station'].re.findall(" ((\d+))")

I have also tried the following, which I found in another answer:

def parseIntegers(mixedList):
    return [x for x in db['Start station'] if (isinstance(x, int) or isinstance(x, long)) and not isinstance(x, bool)]

However, when I pass values in, I get a list 'x' with 1 value. As a bit of a noob, I don't think going the pattern route is best, as it will also take in unwanted integers (although I could possibly turn those into NaNs, since they would be less than 30000, the lowest value for an ID number). I also suspect it could be something simple that I'm overlooking, but after about 20 straight hours and a lot of searching, I am at a bit of a loss.

Any help would be greatly appreciated.

A solution could be to create a dataframe with the transformation

station -> id 

like

import pandas as pd

l = ["19th & L St (31224)",
    "14th & R St NW (31202)",
    "Paul Rd & Pl NW (31602)",
    "14th & R St NW",
    "19th & L St",
    "Paul Rd & Pl NW",]

df = pd.DataFrame({"station": l})
# Named groups become column names; rows without "(digits)" yield NaN and are dropped
df_dict = df['station'].str.extract(r"(?P<station_name>.*)\((?P<id>\d+)\)").dropna()
print(df_dict)

 # result:
       station_name     id
 0      19th & L St   31224
 1   14th & R St NW   31202
 2  Paul Rd & Pl NW   31602
 [3 rows x 2 columns]

Starting from there, you can use a list comprehension to pair each original line with its station name and ID:

l2 = [ [row["station_name"], row["id"]]
       for line in l
       for k,row in df_dict.iterrows()
       if row["station_name"].strip() in line]

to get:

 [['19th & L St ', '31224'], 
  ['14th & R St NW ', '31202'], 
  ['Paul Rd & Pl NW ', '31602'], 
  ['14th & R St NW ', '31202'], 
  ['19th & L St ', '31224'], 
  ['Paul Rd & Pl NW ', '31602']]

I'll leave it to you to turn the latter into a DataFrame.

There may be nicer solutions, at least for that last part.
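
One possible way to finish that last step, sketched with pandas only: build a name-to-ID lookup from the extracted rows and map it over the cleaned station names. The names `stations`, `pairs`, `lookup` and `clean_name` are illustrative and not part of the original answer, and the pattern uses a lazy `.*?` plus `\s*` so the captured names carry no trailing space.

```python
import pandas as pd

stations = [
    "19th & L St (31224)",
    "14th & R St NW (31202)",
    "Paul Rd & Pl NW (31602)",
    "14th & R St NW",
    "19th & L St",
    "Paul Rd & Pl NW",
]
df = pd.DataFrame({"station": stations})

# Rows that carry an ID: split into name and ID, drop rows without a match.
pairs = df["station"].str.extract(
    r"(?P<station_name>.*?)\s*\((?P<id>\d+)\)"
).dropna()

# Illustrative lookup: station name -> ID (IDs are still strings here).
lookup = dict(zip(pairs["station_name"], pairs["id"]))

# Strip any trailing "(12345)" so every row has a bare name, then look it up.
df["clean_name"] = df["station"].str.replace(r"\s*\(\d+\)$", "", regex=True)
df["id"] = df["clean_name"].map(lookup)
print(df)
```

This assumes each station name appears with at most one distinct ID; if a name were listed with two different IDs, the dict would silently keep the last one.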

Here's a way that worked for me. First, extract the numbers in the parentheses:

In [71]:

df['start stat num'] = df['Start station'].str.findall(r'\((\d+)\)').str[0]
df
Out[71]:
             Start station start stat num
0      19th & L St (31224)          31224
1   14th & R St NW (31202)          31202
2  Paul Rd & Pl NW (31602)          31602
3           14th & R St NW            NaN
4              19th & L St            NaN
5          Paul Rd & Pl NW            NaN
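
On the worry about unwanted integers: because the pattern requires literal parentheses around the digits, numbers inside the street name (the "14" in "14th", say) should not be picked up. A small sketch of the same idea using `str.extract` with `expand=False`, which returns a Series directly; the two-row frame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Start station": ["14th & R St NW (31202)", "19th & L St"]})

# Only digits wrapped in parentheses are captured; the "14" in "14th" and
# the "19" in "19th" are ignored because they are not enclosed in "(...)".
df["start stat num"] = df["Start station"].str.extract(r"\((\d+)\)", expand=False)
print(df)
```

Rows without a parenthesised number come back as NaN, which is what the later fill step relies on.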

Now remove the number as we don't need it anymore:

In [72]:

df['Start station'] = df['Start station'].str.split(r' \(').str[0]
df
Out[72]:
     Start station start stat num
0      19th & L St          31224
1   14th & R St NW          31202
2  Paul Rd & Pl NW          31602
3   14th & R St NW            NaN
4      19th & L St            NaN
5  Paul Rd & Pl NW            NaN

Now we can fill in the missing station numbers by calling map on the station names. The lookup is built from the df with the NaN rows removed and the station name set as the index, so map looks up each station name and returns its station number:

In [73]:

df['start stat num'] = df['Start station'].map(df.dropna().set_index('Start station')['start stat num'])
df
Out[73]:
     Start station start stat num
0      19th & L St          31224
1   14th & R St NW          31202
2  Paul Rd & Pl NW          31602
3   14th & R St NW          31202
4      19th & L St          31224
5  Paul Rd & Pl NW          31602
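
The three steps can be collected into one function. This is a sketch under assumptions rather than a drop-in solution: `fill_station_ids` and its parameters are hypothetical names, and the `drop_duplicates` call is an addition, since real trip data will likely list the same station with its ID many times and `map` expects a lookup Series with a unique index.

```python
import pandas as pd

def fill_station_ids(df, col="Start station", out="start stat num"):
    """Split 'Name (12345)' into name and ID, then fill missing IDs by name."""
    result = df.copy()
    # Step 1: capture digits inside a trailing "(...)"; NaN where absent.
    result[out] = result[col].str.extract(r"\((\d+)\)\s*$", expand=False)
    # Step 2: remove the trailing "(12345)" from the name.
    result[col] = result[col].str.replace(r"\s*\(\d+\)\s*$", "", regex=True)
    # Step 3: one row per station name, so the lookup index is unique.
    lookup = (
        result.dropna(subset=[out])
        .drop_duplicates(subset=[col])
        .set_index(col)[out]
    )
    result[out] = result[col].map(lookup)
    return result

df = pd.DataFrame({"Start station": [
    "19th & L St (31224)",
    "19th & L St (31224)",   # repeated station, as in real trip data
    "14th & R St NW (31202)",
    "14th & R St NW",
    "19th & L St",
    "Unknown St",            # never seen with an ID: stays NaN
]})
filled = fill_station_ids(df)
print(filled)
```

The IDs come back as strings; `pd.to_numeric(filled["start stat num"])` would convert them if numeric IDs are needed. Stations that never appear with an ID stay NaN.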
