Extracting a specific substring from a specific hyperlink using Python

Question

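I am new to Python, and for my second attempt at a project I want to extract a substring, specifically an identifying number, from a hyperlink (href) on a web page.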

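For example, the page returned by my search query contains the href http://www.chessgames.com/perl/chessgame?gid=1012809. From this I want to extract the identifying number "1012809" and append it to navigate to the URL http://www.chessgames.com/perl/chessgame?gid=1012809, after which I plan to download the file at http://www.chessgames.com/pgn/alekhine_naegeli_1932.pgn?gid=1012809. But I am currently a few steps short of that, because I can't figure out how to extract the identifier.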

Here is my MWE:

import re
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
y = str(soup)
x = re.findall("gid=[0-9]+", y)
print x
z = re.sub("gid=", "", x(1))  #At this point, things have completely broken down...

Answer 1

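As Albin Paul said, re.findall returns a list, and you need to extract the element you want from it by indexing. By the way, you don't need BeautifulSoup here: use urllib2.urlopen(url).read() to get the page content as a string. You don't need re.sub either; a single regular expression pattern, (?:gid=)([0-9]+), is enough, because the capture group returns only the digits.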

import re
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'

page = urllib2.urlopen(url).read()

result = re.findall(r"(?:gid=)([0-9]+)",page)

print(result[0])
#'1012809'
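
For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched offline as below. The sample_html string and the extract_gids helper are made up for illustration and are not the real chessgames.com markup; the point is only that a capture group makes re.findall return the digits alone, and that the resulting list is indexed with square brackets.

```python
import re

# Hypothetical snippet standing in for the downloaded page
# (an assumption, not the real chessgames.com markup).
sample_html = '''
<a href="/perl/chessgame?gid=1012809">Alekhine vs Naegeli</a>
<a href="/perl/chessplayer?pid=10240">Alekhine</a>
'''

def extract_gids(html):
    """Return every gid value found in html as a list of digit strings."""
    # With exactly one capture group, re.findall returns only the captured text.
    return re.findall(r"gid=([0-9]+)", html)

gids = extract_gids(sample_html)
if gids:
    # findall returns a list, so index it with [] rather than calling it with ().
    game_url = "http://www.chessgames.com/perl/chessgame?gid=" + gids[0]
    print(game_url)
```

Checking that the list is non-empty before indexing avoids an IndexError when the search returns no games.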

Answer 2

You don't need a regular expression here at all. A CSS selector combined with a string operation will get you there. Try the following script:

import requests
from bs4 import BeautifulSoup

page_link = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
soup = BeautifulSoup(requests.get(page_link).text, 'lxml')
item_num = soup.select_one("[href*='gid=']")['href'].split("gid=")[1]
print(item_num)

Output:

1012809
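
As a possible alternative to splitting on "gid=", the href's query string can be parsed with the standard library, which should keep working if other parameters follow the id. This is a minimal sketch for Python 3; the gid_from_href helper is a hypothetical name, and the href value is the one quoted in the question.

```python
from urllib.parse import urlparse, parse_qs

def gid_from_href(href):
    """Return the gid query parameter of href, or None if it has none."""
    # parse_qs maps each parameter name to a list of its values.
    query = parse_qs(urlparse(href).query)
    values = query.get("gid")
    return values[0] if values else None

href = "/perl/chessgame?gid=1012809"  # href format taken from the question
print(gid_from_href(href))
```

In the script above, the href would come from soup.select_one("[href*='gid=']")['href'] instead of a literal string.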
