Extract certain integers from string value, of different length, which contains unwanted integers. Pattern or Position

I am a fairly new programmer and am looking for help with, and an explanation of, a problem. I want to extract the ID numbers from a string into a new column, then fill in the missing numbers.

I am working with a pandas dataframe and I have the following set of street names, some with an ID number and others without:

*Start station*:
"19th & L St (31224)"
"14th & R St NW (31202)"
"Paul Rd & Pl NW (31602)"
"14th & R St NW"
"19th & L St"
"Paul Rd & Pl NW"

My desired outcome:
*Start station*         *StartStatNum*
"14th & R St NW"        31202
"19th & L St"           31224
"Paul Rd & Pl NW"       31602
"14th & R St NW"        31202
"19th & L St"           31224
"Paul Rd & Pl NW"       31602

I am having difficulty after my first step of splitting. I can split based on position with the following:

def Stat_Num(Stat_Num):
    return Stat_Num.split('(')[-1].split(')')[0].strip()

db["StartStatNum"] = pd.DataFrame({'Num':db['Start station'].apply(Stat_Num)})

But this gives:
*Start station*         *StartStatNum*
"19th & L St (31224)"        31224
"14th & R St NW (31202)"     31202
"Paul Rd & Pl NW (31602)"    31602
"14th & R St NW"            "14th & R St NW"
"19th & L St"               "19th & L St"
"Paul Rd & Pl NW"           "Paul Rd & Pl NW"

The problem then arises when I want to fill StartStatNum with the station ID numbers for the rows that don't have one.

I have been trying to get to know str.extract, str.contains and re.findall, and tried the following as a possible stepping stone:

db['Start_S2']  = db['Start_Stat_Num'].str.extract(" ((\d+))")
db['Start_S2']  = db['Start station'].str.contains(" ((\d+))")
db['Start_S2']  = db['Start station'].re.findall(" ((\d+))")

I have also tried the following, which I found in another post:

def parseIntegers(mixedList):
    return [x for x in db['Start station'] if (isinstance(x, int) or isinstance(x, long)) and not isinstance(x, bool)]

However, when I pass values in, I get a list 'x' with 1 value. As a bit of a noob, I don't think going the pattern route is best, as it will also pick up unwanted integers (although I could possibly turn those into NaNs, as they would be less than 30000, the lowest value for an ID number). I also have a feeling that it could be something simple that I'm overlooking, but after about 20 straight hours and a lot of searching, I am at a bit of a loss.

Any help would be greatly appreciated.

A solution could be to create a dataframe holding the mapping

station -> id

like this:

import pandas as pd

l = ["19th & L St (31224)",
    "14th & R St NW (31202)",
    "Paul Rd & Pl NW (31602)",
    "14th & R St NW",
    "19th & L St",
    "Paul Rd & Pl NW",]

df = pd.DataFrame( {"station":l})
# The named groups become the column names; rows without "(id)" give NaN and are dropped.
df_dict = df['station'].str.extract(r"(?P<station_name>.*)\((?P<id>\d+)\)").dropna()
print(df_dict)

 # result:
       station_name     id
 0      19th & L St   31224
 1   14th & R St NW   31202
 2  Paul Rd & Pl NW   31602
 [3 rows x 2 columns]

Starting from there, you can use a list comprehension:

l2 = [ [row["station_name"], row["id"]]
       for line in l
       for k,row in df_dict.iterrows()
       if row["station_name"].strip() in line]

to get:

 [['19th & L St ', '31224'], 
  ['14th & R St NW ', '31202'], 
  ['Paul Rd & Pl NW ', '31602'], 
  ['14th & R St NW ', '31202'], 
  ['19th & L St ', '31224'], 
  ['Paul Rd & Pl NW ', '31602']]

I'll leave it to you to turn the latter into a dataframe.

There may well be nicer solutions, at least for the last part.
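As one possible take on that last part, here is a minimal sketch that swaps the nested loop for a dictionary lookup with `Series.map`. The names `stations`, `lookup` and `name_to_id` are made up for illustration, and the regex is a slightly changed variant (non-greedy, with `\s*`) so that the captured name has no trailing space. It assumes each station name maps to a single ID.

```python
import pandas as pd

stations = [
    "19th & L St (31224)",
    "14th & R St NW (31202)",
    "Paul Rd & Pl NW (31602)",
    "14th & R St NW",
    "19th & L St",
    "Paul Rd & Pl NW",
]
df = pd.DataFrame({"station": stations})

# Build the name -> id lookup from the rows that carry an ID.
# Rows without "(digits)" do not match, give NaN, and are dropped.
lookup = df["station"].str.extract(
    r"(?P<station_name>.*?)\s*\((?P<id>\d+)\)"
).dropna()
name_to_id = dict(zip(lookup["station_name"], lookup["id"]))

# Strip any trailing "(12345)" suffix, then look each name up in the dictionary.
df["station_name"] = df["station"].str.replace(r"\s*\(\d+\)$", "", regex=True)
df["id"] = df["station_name"].map(name_to_id)

print(df[["station_name", "id"]])
```

The IDs stay as strings here; a station that never appears with an ID would simply get NaN.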

Here's a way that worked for me. First, extract the numbers in the parentheses:

In [71]:

df['start stat num'] = df['Start station'].str.findall(r'\((\d+)\)').str[0]
df
Out[71]:
             Start station start stat num
0      19th & L St (31224)          31224
1   14th & R St NW (31202)          31202
2  Paul Rd & Pl NW (31602)          31602
3           14th & R St NW            NaN
4              19th & L St            NaN
5          Paul Rd & Pl NW            NaN

Now remove the number from the station name, as we don't need it there anymore:

In [72]:

df['Start station'] = df['Start station'].str.split(r' \(').str[0]
df
Out[72]:
     Start station start stat num
0      19th & L St          31224
1   14th & R St NW          31202
2  Paul Rd & Pl NW          31602
3   14th & R St NW            NaN
4      19th & L St            NaN
5  Paul Rd & Pl NW            NaN

Now we can fill in the missing station numbers with map. Drop the NaN rows from the df and set the station name as the index, which gives a lookup from station name to station number; map then looks up each station name and returns its number:

In [73]:

df['start stat num'] = df['Start station'].map(df.dropna().set_index('Start station')['start stat num'])
df
Out[73]:
     Start station start stat num
0      19th & L St          31224
1   14th & R St NW          31202
2  Paul Rd & Pl NW          31602
3   14th & R St NW          31202
4      19th & L St          31224
5  Paul Rd & Pl NW          31602
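For comparison, here is a hedged sketch of the same three steps run as one script, with two swapped-in choices that are not part of the answer above: `str.extract` with the pattern anchored to a parenthesised number at the end of the string (so street numbers such as "19th" are never picked up), and `groupby(...).transform("first")` for the fill, which also copes with a station that appears several times with its ID. The column name `StartStatNum` is taken from the question.

```python
import pandas as pd

db = pd.DataFrame({"Start station": [
    "19th & L St (31224)",
    "14th & R St NW (31202)",
    "Paul Rd & Pl NW (31602)",
    "14th & R St NW",
    "19th & L St",
    "Paul Rd & Pl NW",
]})

# 1. Capture only digits wrapped in parentheses at the end of the string.
db["StartStatNum"] = db["Start station"].str.extract(r"\((\d+)\)\s*$", expand=False)

# 2. Remove the "(digits)" suffix so that names with and without an ID compare equal.
db["Start station"] = db["Start station"].str.replace(r"\s*\(\d+\)\s*$", "", regex=True)

# 3. Within each station name, copy the first non-missing ID to every row.
db["StartStatNum"] = db.groupby("Start station")["StartStatNum"].transform("first")

# Convert the ID strings to numbers.
db["StartStatNum"] = pd.to_numeric(db["StartStatNum"])

print(db)
```

If some station never appears with an ID, its rows stay NaN and the column becomes float rather than integer.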
