Python, regular expressions: match a string, then extract the string after it

Question

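I would like to use a regular expression to match a pattern and extract part of it.

I have scraped some HTML data; an illustrative snippet looks like this: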

</script>
</li>
<li itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
<span class="hide" itemprop="position">1</span>
<div class="result-heading">
<a class="project-icon show-outline" href="/projects/quickfixj/" title="Find out more about QuickFIX/J - Open Source Java FIX Engine">
<img alt="QuickFIX/J - Open Source Java FIX Engine Icon" src="//a.fsdn.com/allura/p/quickfixj/icon?1533295730"/></a>
<div class="result-heading-texts">
<a href="/projects/quickfixj/" itemprop="url" title="Find out more 
<a href="/projects/desmoj/" itemprop="url" title="Find out more about DESMO-J"><h2>DESMO-J</h2></a>
<div class="description">
<p class="description-inner">DESMO-<em>J</em> is a framework for 
<a href="/projects/desmoj/files/stats/timeline" title="Downloads This Week">29 This Week</a>
</strong>
<strong>

A more representative subset of the find_all('a') results that highlights the problem:

<!-- Menu -->
<ul class="header-nav-menulist">
<li class="highlight social row">
<span class="social-label">Connect</span>
<span class="social-icons">
<span></span>
<a class="twitter" href="https://twitter.com/sourceforge" rel="nofollow" target="_blank">
<svg viewbox="0 0 1792 1792" xmlns="http://www.w3.org/2000/svg"><path d="M1684 408q-67 98-162 167 1 14 1 42 0 130-38 259.5t-115.5 248.5-184.5 210.5-258 146-323 54.5q-271 0-496-145 35 4 78 4 225 0 401-138-105-2-188-64.5t-114-159.5q33 5 61 5 43 0 85-11-112-23-185.5-111.5t-73.5-205.5v-4q68 38 146 41-66-44-105-115t-39-154q0-88 44-163 121 149 294.5 238.5t371.5 99.5q-8-38-8-74 0-134 94.5-228.5t228.5-94.5q140 0 236 102 109-21 205-78-37 115-142 178 93-10 186-50z"></path></svg></a>
<a class="facebook" href="https://www.facebook.com/sourceforgenet/" rel="nofollow" target="_blank">

The HTML is currently stored as a BeautifulSoup object, i.e. it has been passed through:

html_soup= BeautifulSoup(response.text, 'html.parser')

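I would like to search the whole object for every instance of /projects/ and extract the string between it and the following slash. For example:

From "/projects/quickfixj/" I would like to store "quickfixj".

My initial thought was to use re.findall() and try to match (/projects/./)* but this does not work.

Any help is greatly appreciated.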

Answer 1

You are already halfway there:

a='''</script>
</li>
<li itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
<span class="hide" itemprop="position">1</span>
<div class="result-heading">
<a class="project-icon show-outline" href="/projects/quickfixj/" title="Find out more about QuickFIX/J - Open Source Java FIX Engine">
<img alt="QuickFIX/J - Open Source Java FIX Engine Icon" src="//a.fsdn.com/allura/p/quickfixj/icon?1533295730"/></a>
<div class="result-heading-texts">
<a href="/projects/quickfixj/" itemprop="url" title="Find out more 
<a href="/projects/desmoj/" itemprop="url" title="Find out more about DESMO-J"><h2>DESMO-J</h2></a>
<div class="description">
<p class="description-inner">DESMO-<em>J</em> is a framework for 
<a href="/projects/desmoj/files/stats/timeline" title="Downloads This Week">29 This Week</a>
</strong>
<strong>'''
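import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(a, "html.parser")
# href=True skips anchors without an href, which would otherwise pass None to re
for i in soup.find_all('a', href=True):
    # Capture the path segment that follows /projects/
    print(re.findall(r'/projects/(\w+)/', i['href']))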

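If you need the unique projects, change the last few lines to:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(a, "html.parser")
project_set = set()
for i in soup.find_all('a', href=True):
    # update() copes with zero or several matches; add(*matches) raises TypeError
    project_set.update(re.findall(r'/projects/(\w+)/', i['href']))

print(project_set)  # {'desmoj', 'quickfixj'} (set ordering may vary)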
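
As for why the pattern in the question finds nothing useful: a bare . matches exactly one character, so /projects/./ only fits one-letter project names, and wrapping the whole thing in (...)* lets it match the empty string at every position. The sketch below checks this on a single href. It also shows that \w+ stops at a hyphen, so [^/]+ may be the safer capture if project names can contain one. The name my-project is made up for illustration.

```python
import re

href = '/projects/quickfixj/'

# '.' matches exactly one character, so this only fits a one-letter name
one_char = re.findall(r'/projects/./', href)

# A starred group may match zero times, so findall reports only empty strings
starred = re.findall(r'(/projects/./)*', href)

# A quantified capture group returns just the name between the slashes
word = re.findall(r'/projects/(\w+)/', href)

# Hypothetical hyphenated name: \w+ fails here, [^/]+ does not
hyphen_w = re.findall(r'/projects/(\w+)/', '/projects/my-project/')
hyphen_any = re.findall(r'/projects/([^/]+)/', '/projects/my-project/')

print(one_char, word, hyphen_w, hyphen_any)
```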

Answer 2

You can extract all the links and then apply a regular expression:

import re
from bs4 import BeautifulSoup

html = '''</script>
</li>
<li itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
<span class="hide" itemprop="position">1</span>
<div class="result-heading">
<a class="project-icon show-outline" href="/projects/quickfixj/" title="Find out more about QuickFIX/J - Open Source Java FIX Engine">
<img alt="QuickFIX/J - Open Source Java FIX Engine Icon" src="//a.fsdn.com/allura/p/quickfixj/icon?1533295730"/></a>
<div class="result-heading-texts">
<a href="/projects/quickfixj/" itemprop="url" title="Find out more 
<a href="/projects/desmoj/" itemprop="url" title="Find out more about DESMO-J"><h2>DESMO-J</h2></a>
<div class="description">
<p class="description-inner">DESMO-<em>J</em> is a framework for 
<a href="/projects/desmoj/files/stats/timeline" title="Downloads This Week">29 This Week</a>
</strong>
<strong>'''

html_soup = BeautifulSoup(html, 'html.parser')

links = [i.get('href') for i in html_soup.find_all('a', href=True)]

Output:

['/projects/quickfixj/', '/projects/quickfixj/', '/projects/desmoj/files/stats/timeline']

You can then apply the regular expression:

cleaned = [re.findall(r'(?<=projects\/)(.*?)\/', i)[0] for i in links]

Output:

['quickfixj', 'quickfixj', 'desmoj']
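
On the full page the link list will also contain hrefs with no /projects/ segment (the Twitter and Facebook links in the question, for instance), and indexing the first findall result raises IndexError for those. A minimal sketch of a more forgiving variant follows. The helper name project_slug and the sample links list are assumptions made for illustration, not part of the original code.

```python
import re
from typing import Optional

# Anchored at the start so only site-relative /projects/... links qualify
PROJECT_RE = re.compile(r'^/projects/([^/]+)/')


def project_slug(href: str) -> Optional[str]:
    """Return the path segment after /projects/, or None if there is none."""
    match = PROJECT_RE.match(href)
    return match.group(1) if match else None


# Sample hrefs taken from the snippets in the question
links = [
    '/projects/quickfixj/',
    '/projects/desmoj/files/stats/timeline',
    'https://twitter.com/sourceforge',
    'https://www.facebook.com/sourceforgenet/',
]

# Drop the links that are not project pages instead of failing on them
slugs = [s for s in map(project_slug, links) if s is not None]
print(slugs)  # ['quickfixj', 'desmoj']
```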

Answer 3

A regular expression like this should do the trick: (?<=\/projects\/).+?(?=\/)

It would work like this:

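import re

# Raw string so the backslashes reach the regex engine unchanged
regex = r"(?<=\/projects\/).+?(?=\/)"
# Single-quoted so the double quotes inside the HTML need no escaping
string = '<a href="/projects/quickfixj/" itemprop="url" title="Find out more...."'
matches = re.findall(regex, string)
print(matches)

Output: ['quickfixj']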
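
If you run the regex over the raw response text rather than a single tag, the same lookaround pattern still works. A capture group tied to href=" is a slightly stricter alternative, since it ignores any /projects/ that appears outside a link target. The sketch below compares the two on a shortened, hand-made version of the question's HTML and removes duplicates while keeping first-seen order. The variable names are illustrative only.

```python
import re

# Shortened stand-in for response.text, based on the question's snippets
html = '''<a class="project-icon" href="/projects/quickfixj/" title="QuickFIX/J">
<a href="/projects/quickfixj/" itemprop="url">
<a href="/projects/desmoj/files/stats/timeline" title="Downloads This Week">
<a class="twitter" href="https://twitter.com/sourceforge">'''

# Lookbehind/lookahead: the match itself is the text between the slashes
lookaround = re.findall(r'(?<=/projects/).+?(?=/)', html)

# Capture group anchored to the href attribute
captured = re.findall(r'href="/projects/([^/"]+)/', html)

# dict.fromkeys removes duplicates and preserves insertion order
unique = list(dict.fromkeys(captured))

print(lookaround)  # ['quickfixj', 'quickfixj', 'desmoj']
print(unique)      # ['quickfixj', 'desmoj']
```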

Answer 1
1 2018-10-09 17:51:28

Answer 2
0 2018-10-09 17:49:03

Answer 3
0 2018-10-09 17:49:06
