简体   繁体   English

如何在 Python 中使用正则表达式从重定向 URL 中提取 URL?

[英]How to extract URL from a redirect URL using regex in Python?

I have the following test_string from which I need to obtain the actual URL.我有以下test_string ,我需要从中获取实际的 URL。

Test string (partly shown):测试字符串(部分显示):

An experimental and modeling study of autoignition characteristics of
butanol/diesel blends over wide temperature ranges
<http://scholar.google.com/scholar_url?url=3Dhttps://www.sciencedirect.com/=
science/article/pii/S0010218020301346&hl=3Den&sa=3DX&d=3D448628313728630325=
1&scisig=3DAAGBfm26Wh2koXdeGZkQxzZbenQYFPytLQ&nossl=3D1&oi=3Dscholaralrt&hi=
st=3Dv2Y_3P0AAAAJ:17949955323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
Y Qiu, W Zhou, Y Feng, S Wang, L Yu, Z Wu, Y Mao=E2=80=A6 - Combustion and =
Flame,
2020

Desired output for part of test_string部分 test_string 所需的 output

https://www.sciencedirect.com/science/article/pii/S0010218020301346

I have been trying to obtain this with the MWE given below applied to many strings, but it gives only one URL.我一直在尝试将下面给出的 MWE 应用于许多字符串,但它只给出了一个 URL。

MWE MWE

from urlparse import urlparse, parse_qs
import re
from re import search

test_string = '''
Production, Properties, and Applications of ALPHA-Terpineol
<http://scholar.google.com/scholar_url?url=https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf&hl=en&sa=X&d=12771069332921982368&scisig=AAGBfm1tFjLUm7GV1DRnuYCzvR4uGWq9Cg&nossl=1&oi=scholaralrt&hist=v2Y_3P0AAAAJ:17949955323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>

A Sales, L de Oliveira Felipe, JL Bicas
Abstract ALPHA-Terpineol (CAS No. 98-55-5) is a tertiary monoterpenoid 
alcohol widely
and commonly used in the flavors and fragrances industry for its sensory 
properties.
It is present in different natural sources, but its production is mostly 
based on ...
Save 
<http://scholar.google.com/citations?update_op=email_library_add&info=oB2z7uTzO7EJ&citsig=AMD79ooAAAAAYLfmix3sQyUWnFrHeKYZxuK31qlqlbCh&hl=en> 
    Twitter 
<http://scholar.google.com/scholar_share?hl=en&oi=scholaralrt&ss=tw&url=https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf&rt=Production,+Properties,+and+Applications+of+%CE%B1-Terpineol&scisig=AAGBfm0yXFStqItd97MUyPT5nRKLjPIK6g> 
    Facebook 
<http://scholar.google.com/scholar_share?hl=en&oi=scholaralrt&ss=fb&url=https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf&rt=Production,+Properties,+and+Applications+of+%CE%B1-Terpineol&scisig=AAGBfm0yXFStqItd97MUyPT5nRKLjPIK6g> 

An experimental and modeling study of autoignition characteristics of
butanol/diesel blends over wide temperature ranges
<http://scholar.google.com/scholar_url?url=3Dhttps://www.sciencedirect.com/=
science/article/pii/S0010218020301346&hl=3Den&sa=3DX&d=3D448628313728630325=
1&scisig=3DAAGBfm26Wh2koXdeGZkQxzZbenQYFPytLQ&nossl=3D1&oi=3Dscholaralrt&hi=
st=3Dv2Y_3P0AAAAJ:17949955323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
Y Qiu, W Zhou, Y Feng, S Wang, L Yu, Z Wu, Y Mao=E2=80=A6 - Combustion and =
Flame,
2020
Butanol/diesel blend is considered as a very promising alternative fuel
with
agreeable combustion and emission performance in engines. This paper
intends to
further investigate its autoignition characteristics with the combination
of a heated =E2=80=A6
[image: Save]
<http://scholar.google.com/citations?update_op=3Demail_library_add&info=3DE=
27Gd756Qj4J&citsig=3DAMD79ooAAAAAYImDxwWCwd5S5xIogWp9RTavFRMtTDgS&hl=3Den>
[image:
Twitter]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dtw&u=
rl=3Dhttps://www.sciencedirect.com/science/article/pii/S0010218020301346&rt=
=3DAn+experimental+and+modeling+study+of+autoignition+characteristics+of+bu=
tanol/diesel+blends+over+wide+temperature+ranges&scisig=3DAAGBfm19DOLNm3-Fl=
WaO0trAxZkeidxYWg>
[image:
Facebook]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dfb&u=
rl=3Dhttps://www.sciencedirect.com/science/article/pii/S0010218020301346&rt=
=3DAn+experimental+and+modeling+study+of+autoignition+characteristics+of+bu=
tanol/diesel+blends+over+wide+temperature+ranges&scisig=3DAAGBfm19DOLNm3-Fl=
WaO0trAxZkeidxYWg>

Using NMR spectroscopy to investigate the role played by copper in prion
diseases.
<http://scholar.google.com/scholar_url?url=3Dhttps://europepmc.org/article/=
med/32328835&hl=3Den&sa=3DX&d=3D16122276072657817806&scisig=3DAAGBfm1AE6Kyl=
jWO1k0f7oBnKFClEzhTMg&nossl=3D1&oi=3Dscholaralrt&hist=3Dv2Y_3P0AAAAJ:179499=
55323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
RA Alsiary, M Alghrably, A Saoudi, S Al-Ghamdi=E2=80=A6 - =E2=80=A6 and of =
the Italian
Society of =E2=80=A6, 2020
Prion diseases are a group of rare neurodegenerative disorders that develop
as a
result of the conformational conversion of normal prion protein (PrPC) to
the disease-
associated isoform (PrPSc). The mechanism that actually causes disease
remains =E2=80=A6
[image: Save]
<http://scholar.google.com/citations?update_op=3Demail_library_add&info=3Dz=
pCMKavUvd8J&citsig=3DAMD79ooAAAAAYImDx3r4gltEWBAkhl0g2POsXB9Qn4Lk&hl=3Den>
[image:
Twitter]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dtw&u=
rl=3Dhttps://europepmc.org/article/med/32328835&rt=3DUsing+NMR+spectroscopy=
+to+investigate+the+role+played+by+copper+in+prion+diseases.&scisig=3DAAGBf=
m1RidyRD-x2FOemP6iqCsr-6GAVKA>
[image:
Facebook]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dfb&u=
rl=3Dhttps://europepmc.org/article/med/32328835&rt=3DUsing+NMR+spectroscopy=
+to+investigate+the+role+played+by+copper+in+prion+diseases.&scisig=3DAAGBf=
m1RidyRD-x2FOemP6iqCsr-6GAVKA>


'''

regex = re.compile('(http://scholar.*?)&')
url_all = regex.findall(test_string)
citation_url = []
for i in url_all:
    if search('scholar.google.com',i):
        qs = parse_qs(urlparse(i).query).values()
        if search('http',str(qs[0])):
            citation_url.append(qs[0])
print citation_url

Present output目前output

https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf

Desired output所需 output

https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf
https://www.sciencedirect.com/science/article/pii/S0010218020301346
https://europepmc.org/article/med/3232883

How to get handle URL text wrapping with equal sign and extracting the redirect URL in Python?如何处理 URL 文本换行与等号并在 Python 中提取重定向 URL?

You could match either a question mark or ampersand [&?] using a character class.您可以使用字符 class 匹配问号或与号[&?] Looking at the example data, for the url= part, you can add optional newlines and an optional equals sign and adjust accordingly.查看示例数据,对于url=部分,您可以添加可选的换行符和可选的等号并进行相应调整。

Some urls start with 3D, you can make that part optional using a non capturing group (?:3D)?一些 url 以 3D 开头,您可以使用非捕获组(?:3D)?

Then capture in group 1 matching http followed by matching all chars except &然后在匹配 http 的第 1 组中捕获,然后匹配除&之外的所有字符

\bhttp://scholar\.google\.com.*?[&?]\n?u=?\n?r\n?l\n?=(?:3D)?(http[^&]+)

Regex demo正则表达式演示

see this regex pattern i think it might help to extract redirect uri看到这个正则表达式模式,我认为它可能有助于提取重定向 uri

(http:\/\/scholar[\w.\/=&?]*)[?]?u[=]?rl=([\w\:.\/\-=]+)

also see this example here https://regex101.com/r/dmkF3h/3也可以在这里看到这个例子https://regex101.com/r/dmkF3h/3

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM