I have a dataframe like this:
name link
apple example1.com/dsa/es?id=2812168&width=1200/web/map&resize.html
banana. example2.com/es?id=28132908&width=1220/web/map_resize.html
orange. example3.com/es?id=3209908&width=1120/web&map_resize.html
Each name's ID is buried in the link, and the links may have different structures. However, I know that the pattern is always 'id=' + 'what I want' + '&'.
Is there a way to extract the ID from the link and put it back into the dataframe, to get the following:
name link
apple 2812168
banana. 28132908
orange. 3209908
I tried this:
df['name'] = df['name'].str.extract(r'id=\s*([^\.]*)\s*\\&', expand=False)
but it returns a column of all NaN.
Also, there may be more than one & in the link.
We can make use of a positive lookbehind and a positive lookahead:
df['link'] = df['link'].str.extract(r'(?<=id\=)(.*?)(?=\&)', expand=False)
name link
0 apple 2812168
1 banana. 28132908
2 orange. 3209908
Details:
(?<=id\=) : positive lookbehind on id=
(.*?) : as few characters as possible (lazy match), so it stops at the first & that follows
(?=\&) : positive lookahead on &
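To check this end to end, here is a minimal, self-contained sketch that rebuilds the sample frame from the question and applies the lookaround pattern. The variable name ids is only illustrative. The backslashes before = and & are optional in a regex, so they are dropped here, and expand=False is passed so that str.extract returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["apple", "banana.", "orange."],
    "link": [
        "example1.com/dsa/es?id=2812168&width=1200/web/map&resize.html",
        "example2.com/es?id=28132908&width=1220/web/map_resize.html",
        "example3.com/es?id=3209908&width=1120/web&map_resize.html",
    ],
})

# The lazy (.*?) stops at the first '&' after 'id=',
# even when the link contains several '&' characters.
ids = df["link"].str.extract(r"(?<=id=)(.*?)(?=&)", expand=False)
print(ids.tolist())  # ['2812168', '28132908', '3209908']
```

A greedy (.*) would instead run to the last & in the link, which is why the lazy form matters for links such as the first one.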
I think the IDs are always numbers, so this is somewhat cleaner:
df["link"] = df['link'].str.extract(r'id=(\d+)&', expand=False)
print(df)
# name link
#0 apple 2812168
#1 banana. 28132908
#2 orange. 3209908
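The attempt in the question fails for two reasons that this pattern avoids: it reads from df['name'] instead of df['link'], and inside a raw string \\& matches a literal backslash followed by &, which never occurs in these links. The sketch below shows the difference on one link using Python's re module directly. It then converts the extracted strings to a nullable integer dtype, assuming the IDs really are all digits. That conversion is an optional extra rather than part of the answer above, and the second link in the Series is made up to show the no-match case.

```python
import re
import pandas as pd

link = "example1.com/dsa/es?id=2812168&width=1200/web/map&resize.html"

# Pattern from the question: '\\&' requires a literal backslash before '&'.
broken = re.search(r"id=\s*([^\.]*)\s*\\&", link)
# Pattern from this answer: digits between 'id=' and the next '&'.
fixed = re.search(r"id=(\d+)&", link)
print(broken)          # None, which is why pandas reports NaN
print(fixed.group(1))  # 2812168

# Optional: extracted values are strings; convert them if numbers are needed.
s = pd.Series([link, "example4.com/no-id-here"])  # second link is hypothetical
ids = s.str.extract(r"id=(\d+)&", expand=False)
ids = pd.to_numeric(ids).astype("Int64")  # rows without a match become <NA>
print(ids)
```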
Let's try split:
df['link'].str.split('id=').str[1].str.split('&').str[0]
0 2812168
1 28132908
2 3209908
Name: link, dtype: object
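A small sketch of how the split version behaves when assigned back to the frame, and what it does with a link that has no id= at all. The third row (kiwi and its link) is invented for illustration and is not part of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["apple", "banana.", "kiwi"],
    "link": [
        "example1.com/dsa/es?id=2812168&width=1200/web/map&resize.html",
        "example2.com/es?id=28132908&width=1220/web/map_resize.html",
        "example5.com/no-id-here",  # hypothetical link without an id
    ],
})

# Take the text after the first 'id=', then cut it at the next '&'.
# When 'id=' is absent, .str[1] is out of range and yields NaN,
# and the remaining .str calls pass that NaN through.
df["link"] = df["link"].str.split("id=").str[1].str.split("&").str[0]
print(df)
```

Note that, unlike the id=(\d+)& regex, this version does not require a trailing &: if id= is the last parameter, everything after it is returned.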