[英]How to extract URL from a redirect URL using regex in Python?
I have the following test_string
from which I need to obtain the actual URL.我有以下
test_string
,我需要从中获取实际的 URL。
Test string (partly shown):测试字符串(部分显示):
An experimental and modeling study of autoignition characteristics of
butanol/diesel blends over wide temperature ranges
<http://scholar.google.com/scholar_url?url=3Dhttps://www.sciencedirect.com/=
science/article/pii/S0010218020301346&hl=3Den&sa=3DX&d=3D448628313728630325=
1&scisig=3DAAGBfm26Wh2koXdeGZkQxzZbenQYFPytLQ&nossl=3D1&oi=3Dscholaralrt&hi=
st=3Dv2Y_3P0AAAAJ:17949955323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
Y Qiu, W Zhou, Y Feng, S Wang, L Yu, Z Wu, Y Mao=E2=80=A6 - Combustion and =
Flame,
2020
Desired output for part of test_string部分 test_string 所需的 output
https://www.sciencedirect.com/science/article/pii/S0010218020301346
I have been trying to obtain this with the MWE given below applied to many strings, but it gives only one URL.我一直在尝试将下面给出的 MWE 应用于许多字符串,但它只给出了一个 URL。
MWE MWE
from urlparse import urlparse, parse_qs
import re
from re import search
test_string = '''
Production, Properties, and Applications of ALPHA-Terpineol
<http://scholar.google.com/scholar_url?url=https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf&hl=en&sa=X&d=12771069332921982368&scisig=AAGBfm1tFjLUm7GV1DRnuYCzvR4uGWq9Cg&nossl=1&oi=scholaralrt&hist=v2Y_3P0AAAAJ:17949955323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
A Sales, L de Oliveira Felipe, JL Bicas
Abstract ALPHA-Terpineol (CAS No. 98-55-5) is a tertiary monoterpenoid
alcohol widely
and commonly used in the flavors and fragrances industry for its sensory
properties.
It is present in different natural sources, but its production is mostly
based on ...
Save
<http://scholar.google.com/citations?update_op=email_library_add&info=oB2z7uTzO7EJ&citsig=AMD79ooAAAAAYLfmix3sQyUWnFrHeKYZxuK31qlqlbCh&hl=en>
Twitter
<http://scholar.google.com/scholar_share?hl=en&oi=scholaralrt&ss=tw&url=https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf&rt=Production,+Properties,+and+Applications+of+%CE%B1-Terpineol&scisig=AAGBfm0yXFStqItd97MUyPT5nRKLjPIK6g>
Facebook
<http://scholar.google.com/scholar_share?hl=en&oi=scholaralrt&ss=fb&url=https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf&rt=Production,+Properties,+and+Applications+of+%CE%B1-Terpineol&scisig=AAGBfm0yXFStqItd97MUyPT5nRKLjPIK6g>
An experimental and modeling study of autoignition characteristics of
butanol/diesel blends over wide temperature ranges
<http://scholar.google.com/scholar_url?url=3Dhttps://www.sciencedirect.com/=
science/article/pii/S0010218020301346&hl=3Den&sa=3DX&d=3D448628313728630325=
1&scisig=3DAAGBfm26Wh2koXdeGZkQxzZbenQYFPytLQ&nossl=3D1&oi=3Dscholaralrt&hi=
st=3Dv2Y_3P0AAAAJ:17949955323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
Y Qiu, W Zhou, Y Feng, S Wang, L Yu, Z Wu, Y Mao=E2=80=A6 - Combustion and =
Flame,
2020
Butanol/diesel blend is considered as a very promising alternative fuel
with
agreeable combustion and emission performance in engines. This paper
intends to
further investigate its autoignition characteristics with the combination
of a heated =E2=80=A6
[image: Save]
<http://scholar.google.com/citations?update_op=3Demail_library_add&info=3DE=
27Gd756Qj4J&citsig=3DAMD79ooAAAAAYImDxwWCwd5S5xIogWp9RTavFRMtTDgS&hl=3Den>
[image:
Twitter]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dtw&u=
rl=3Dhttps://www.sciencedirect.com/science/article/pii/S0010218020301346&rt=
=3DAn+experimental+and+modeling+study+of+autoignition+characteristics+of+bu=
tanol/diesel+blends+over+wide+temperature+ranges&scisig=3DAAGBfm19DOLNm3-Fl=
WaO0trAxZkeidxYWg>
[image:
Facebook]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dfb&u=
rl=3Dhttps://www.sciencedirect.com/science/article/pii/S0010218020301346&rt=
=3DAn+experimental+and+modeling+study+of+autoignition+characteristics+of+bu=
tanol/diesel+blends+over+wide+temperature+ranges&scisig=3DAAGBfm19DOLNm3-Fl=
WaO0trAxZkeidxYWg>
Using NMR spectroscopy to investigate the role played by copper in prion
diseases.
<http://scholar.google.com/scholar_url?url=3Dhttps://europepmc.org/article/=
med/32328835&hl=3Den&sa=3DX&d=3D16122276072657817806&scisig=3DAAGBfm1AE6Kyl=
jWO1k0f7oBnKFClEzhTMg&nossl=3D1&oi=3Dscholaralrt&hist=3Dv2Y_3P0AAAAJ:179499=
55323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
RA Alsiary, M Alghrably, A Saoudi, S Al-Ghamdi=E2=80=A6 - =E2=80=A6 and of =
the Italian
Society of =E2=80=A6, 2020
Prion diseases are a group of rare neurodegenerative disorders that develop
as a
result of the conformational conversion of normal prion protein (PrPC) to
the disease-
associated isoform (PrPSc). The mechanism that actually causes disease
remains =E2=80=A6
[image: Save]
<http://scholar.google.com/citations?update_op=3Demail_library_add&info=3Dz=
pCMKavUvd8J&citsig=3DAMD79ooAAAAAYImDx3r4gltEWBAkhl0g2POsXB9Qn4Lk&hl=3Den>
[image:
Twitter]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dtw&u=
rl=3Dhttps://europepmc.org/article/med/32328835&rt=3DUsing+NMR+spectroscopy=
+to+investigate+the+role+played+by+copper+in+prion+diseases.&scisig=3DAAGBf=
m1RidyRD-x2FOemP6iqCsr-6GAVKA>
[image:
Facebook]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dfb&u=
rl=3Dhttps://europepmc.org/article/med/32328835&rt=3DUsing+NMR+spectroscopy=
+to+investigate+the+role+played+by+copper+in+prion+diseases.&scisig=3DAAGBf=
m1RidyRD-x2FOemP6iqCsr-6GAVKA>
'''
regex = re.compile('(http://scholar.*?)&')
url_all = regex.findall(test_string)
citation_url = []
for i in url_all:
if search('scholar.google.com',i):
qs = parse_qs(urlparse(i).query).values()
if search('http',str(qs[0])):
citation_url.append(qs[0])
print citation_url
Present output目前output
https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf
Desired output所需 output
https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf
https://www.sciencedirect.com/science/article/pii/S0010218020301346
https://europepmc.org/article/med/3232883
How to get handle URL text wrapping with equal sign and extracting the redirect URL in Python?如何处理 URL 文本换行与等号并在 Python 中提取重定向 URL?
You could match either a question mark or ampersand [&?]
using a character class.您可以使用字符 class 匹配问号或与号
[&?]
。 Looking at the example data, for the url=
part, you can add optional newlines and an optional equals sign and adjust accordingly.查看示例数据,对于
url=
部分,您可以添加可选的换行符和可选的等号并进行相应调整。
Some urls start with 3D, you can make that part optional using a non capturing group (?:3D)?
一些 url 以 3D 开头,您可以使用非捕获组
(?:3D)?
Then capture in group 1 matching http followed by matching all chars except &
然后在匹配 http 的第 1 组中捕获,然后匹配除
&
之外的所有字符
\bhttp://scholar\.google\.com.*?[&?]\n?u=?\n?r\n?l\n?=(?:3D)?(http[^&]+)
see this regex pattern i think it might help to extract redirect uri看到这个正则表达式模式,我认为它可能有助于提取重定向 uri
(http:\/\/scholar[\w.\/=&?]*)[?]?u[=]?rl=([\w\:.\/\-=]+)
also see this example here https://regex101.com/r/dmkF3h/3也可以在这里看到这个例子https://regex101.com/r/dmkF3h/3
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.