Extract text from strong tag
I am trying to extract the exterior color, interior color, and transmission information separately from cars.com.
HTML:
<ul class="listing-row__meta">
  <li>
    <strong>
      Ext. Color:
    </strong>
    Gray
  </li>
  <li>
    <strong>
      Int. Color:
    </strong>
    White
  </li>
  <li>
    <strong>
      Transmission:
    </strong>
    Automatic
  </li>
</ul>
I tried the following code, but it raises "expected string or bytes-like object". Any suggestions or solutions would be greatly appreciated.
from bs4 import BeautifulSoup
import urllib
import re
url ='https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')
all_matches = soup.find_all('div',{'class':'shop-srp-listings__listing-container'})
for each in all_matches:
    info=each.findAll('ul',class_='listing-row__meta')
    pattern=re.compile(r'Ext. Color:')
    matches=pattern.finditer(info)
    for match in matches:
        print(match.text)
Maybe this comes closer to what you are trying to extract. My guess is an expression along these lines:
(?is)<strong>\s*([^<]*?)\s*<\/strong>
Or:
(?is)(?<=<strong>)\s*[^<]*?\s*(?=<\/strong>)
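To see how the two expressions differ, here is a minimal offline sketch that runs both against the markup from the question (with a closing </ul> assumed). The names SAMPLE_HTML, CAPTURE and LOOKAROUND are made up for this illustration.

```python
import re

# Sample markup copied from the question (one listing's metadata list).
SAMPLE_HTML = """
<ul class="listing-row__meta">
<li>
<strong>
Ext. Color:
</strong>
Gray
</li>
<li>
<strong>
Int. Color:
</strong>
White
</li>
<li>
<strong>
Transmission:
</strong>
Automatic
</li>
</ul>
"""

# Capture-group version: the group itself excludes the surrounding whitespace.
CAPTURE = re.compile(r'(?is)<strong>\s*([^<]*?)\s*<\/strong>')
# Lookaround version: the match keeps the whitespace, so strip it afterwards.
LOOKAROUND = re.compile(r'(?is)(?<=<strong>)\s*[^<]*?\s*(?=<\/strong>)')

labels_from_capture = CAPTURE.findall(SAMPLE_HTML)
labels_from_lookaround = [m.group().strip() for m in LOOKAROUND.finditer(SAMPLE_HTML)]

print(labels_from_capture)
print(labels_from_lookaround)
```

Both variants return the label text inside each strong tag; the capture-group form is usually easier to extend, since a second group can pick up the value that follows the tag.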
Of course, you could also do this with bs4's built-in functions.
from bs4 import BeautifulSoup
import urllib
import re
import requests
url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')
all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})
for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    # Capture only the text that follows each closing </strong> tag.
    matches = re.findall(
        r'(?is)<strong>\s*[^<]*?\s*<\/strong>\s*([^<]*?)\s*<', str(info[0]))
    for match in matches:
        print(match)
Output:
Gray
Beige
Automatic
AWD
Gray
White
Automatic
AWD
Black
If you like, you can also build a dict with a few modifications:
from bs4 import BeautifulSoup
import urllib
import re
import requests
url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')
all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})
for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    # Two groups per match: the label inside <strong> and the value after it.
    matches = dict(re.findall(
        r'(?is)<strong>\s*([^<]*?)\s*<\/strong>\s*([^<]*?)\s*<', str(info[0])))
    for k, v in matches.items():
        print(f'{k} {v}')
Output:
Ext. Color: Gray
Int. Color: Beige
Transmission: Automatic
Drivetrain: AWD
Ext. Color: Gray
Int. Color: White
Transmission: Automatic
Drivetrain: AWD
Ext. Color: Black
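The label/value pairing can be tried without hitting the site. The following is a small sketch on a hand-written listing; LISTING_HTML and the helper name parse_listing are assumptions made for this example, not part of the original answer.

```python
import re

# Hand-written stand-in for str(info[0]) from one listing.
LISTING_HTML = """<ul class="listing-row__meta">
<li><strong> Ext. Color: </strong> Gray </li>
<li><strong> Int. Color: </strong> White </li>
<li><strong> Transmission: </strong> Automatic </li>
</ul>"""

# Group 1 is the label inside <strong>, group 2 the text after </strong>.
PAIR = re.compile(r'(?is)<strong>\s*([^<]*?)\s*<\/strong>\s*([^<]*?)\s*<')


def parse_listing(html):
    """Return {label: value} for one listing's metadata list."""
    return dict(PAIR.findall(html))


fields = parse_listing(LISTING_HTML)
print(fields)
# A field the listing does not have is simply absent from the dict.
print(fields.get('Drivetrain:', 'n/a'))
```

Because findall returns 2-tuples when the pattern has two groups, dict() can consume the result directly; a label that occurred twice in one listing would keep only its last value.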
If you want lists instead:
from bs4 import BeautifulSoup
import urllib
import re
import requests
url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')
all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})
for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    matches = re.findall(
        r'(?is)<strong>\s*([^<]*?)\s*<\/strong>\s*([^<]*?)\s*<', str(info[0]))
    for match in matches:
        # Each match is a (label, value) tuple; print it as a list.
        print(list(match))
Output:
['Transmission:', 'Automatic']
['Drivetrain:', 'RWD']
['Ext. Color:', 'Gray']
['Int. Color:', 'Gray']
['Transmission:', 'Automatic']
['Drivetrain:', 'RWD']
['Ext. Color:', 'White']
['Int. Color:', 'Black']
['Transmission:', 'Automatic']
['Drivetrain:', 'RWD']
['Ext. Color:', 'White']
['Int. Color:', 'Beige']
['Transmission:', 'Automatic']
['Drivetrain:', 'AWD']
['Ext. Color:', 'Gray']
['Int. Color:', 'Beige']
['Transmission:', 'Automatic']
['Drivetrain:', 'AWD']
['Ext. Color:', 'White']
To collect the values for each key across all listings:
from bs4 import BeautifulSoup
import urllib
import re
import requests
url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')
all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})
keys = ['Ext. Color', 'Int. Color', 'Transmission', 'Drivetrain']
outputs = dict()
for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    # The colon is matched outside group 1, so the keys carry no trailing ':'.
    matches = dict(re.findall(
        r'(?is)<strong>\s*([^<:]*?)\s*:\s*<\/strong>\s*([^<]*?)\s*<', str(info[0])))
    for key, value in matches.items():
        if key in keys:
            # Append each value exactly once per listing.
            outputs.setdefault(key, []).append(value)
print(outputs)
Output:
{'Ext. Color': ['Silver', 'White', 'White', 'Black', 'Gray', 'Gray', 'Black', 'Black', 'White', 'Blue', 'Red', 'Silver', 'Gray', 'Black', 'White', 'Black', 'Gray', 'White', 'Black', 'Black'], 'Int. Color': ['Beige', 'Black', 'White', 'Black', 'Black', 'Gray', 'Beige', 'Black', 'Black', 'Beige', 'Beige', 'Black', 'Black', 'Black', 'Black', 'Black', 'Black', 'White', 'White', 'Black'], 'Transmission': ['Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic'], 'Drivetrain': ['AWD', 'AWD', 'AWD', 'RWD', 'RWD', 'RWD', 'RWD', 'AWD', 'RWD', 'RWD', 'RWD', 'AWD', 'RWD', 'RWD', 'AWD', 'RWD', 'AWD', 'AWD', 'AWD', 'AWD']}
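The accumulation step can be checked offline. Below is a minimal sketch that uses collections.defaultdict on two invented listings; LISTINGS, WANTED, collected and the "Stock #" field are hypothetical and exist only for this illustration.

```python
import re
from collections import defaultdict

# Two invented listings, already converted to strings (e.g. via str(tag)).
LISTINGS = [
    '<ul><li><strong>Ext. Color:</strong> Gray </li>'
    '<li><strong>Drivetrain:</strong> AWD </li></ul>',
    '<ul><li><strong>Ext. Color:</strong> White </li>'
    '<li><strong>Drivetrain:</strong> RWD </li>'
    '<li><strong>Stock #:</strong> 123 </li></ul>',  # hypothetical extra field
]

# The colon sits outside group 1, so keys come back without it.
PAIR = re.compile(r'(?is)<strong>\s*([^<:]*?)\s*:\s*<\/strong>\s*([^<]*?)\s*<')
WANTED = {'Ext. Color', 'Int. Color', 'Transmission', 'Drivetrain'}

collected = defaultdict(list)
for html in LISTINGS:
    for key, value in PAIR.findall(html):
        if key in WANTED:
            # One append per listing and key; unwanted keys are skipped.
            collected[key].append(value)

print(dict(collected))
```

With defaultdict(list) there is no separate "first time seen" branch, which avoids accidentally storing the first value of a key twice.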
To also drop the duplicate values:
from bs4 import BeautifulSoup
import urllib
import re
import requests
url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')
all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})
keys = ['Ext. Color', 'Int. Color', 'Transmission', 'Drivetrain']
outputs = dict()
for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    matches = dict(re.findall(
        r'(?is)<strong>\s*([^<:]*?)\s*:\s*<\/strong>\s*([^<]*?)\s*<', str(info[0])))
    for key, value in matches.items():
        if key in keys:
            # Append each value exactly once per listing.
            outputs.setdefault(key, []).append(value)
print(outputs)
print('*' * 50)
# A set removes duplicates; the order of the remaining values is arbitrary.
no_duplicate_outputs = {key: list(set(values)) for key, values in outputs.items()}
print(no_duplicate_outputs)
Output:
{'Ext. Color': ['Black', 'White', 'Black', 'Other', 'Gray', 'White', 'White', 'Gray', 'White', 'Gray', 'Silver', 'Blue', 'Black', 'Silver', 'Silver', 'Black', 'Blue', 'Blue', 'Black', 'White'], 'Int. Color': ['Black', 'Beige', 'Beige', 'Black', 'Gray', 'Black', 'Beige', 'Beige', 'White', 'Black', 'Black', 'Gray', 'Black', 'Black', 'Gray', 'Black', 'Black', 'Black', 'White', 'Black'], 'Transmission': ['Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic'], 'Drivetrain': ['AWD', 'RWD', 'RWD', 'RWD', 'RWD', 'RWD', 'AWD', 'AWD', 'AWD', 'RWD', 'AWD', 'AWD', 'AWD', 'AWD', 'AWD', 'RWD', 'AWD', 'AWD', 'AWD', 'AWD']}
**************************************************
{'Ext. Color': ['Silver', 'White', 'Blue', 'Other', 'Black', 'Gray'], 'Int. Color': ['Beige', 'White', 'Black', 'Gray'], 'Transmission': ['Automatic'], 'Drivetrain': ['RWD', 'AWD']}
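Since set() does not keep the order in which values were first seen, a variant based on dict.fromkeys may be preferable when order matters. This is a small sketch on made-up sample values; the name unique_in_order is an assumption for this example.

```python
# Made-up sample in the same shape as the collected outputs dict.
outputs = {
    'Ext. Color': ['Black', 'White', 'Black', 'Other', 'Gray', 'White'],
    'Drivetrain': ['AWD', 'RWD', 'RWD', 'AWD'],
}

# dict keys are unique and keep insertion order, so this removes
# duplicates while preserving the first-seen order of each value.
unique_in_order = {
    key: list(dict.fromkeys(values)) for key, values in outputs.items()
}

print(unique_in_order)
```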
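If you want to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you like, you can also watch there how it matches against some sample inputs.
jex.im can be used to visualize regular expressions.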
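BeautifulSoup's findAll function returns a list of results, so info is a list, not a single string. You may also need to iterate over each item in info.
The items are bs4.Tag objects (not strings), which can be converted to strings so that they can be passed to finditer. (This is especially confusing because when you print the info object, bs4 renders them as strings!)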
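for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    for item in info:
        pattern = re.compile(r'Ext. Color:')
        # item is a bs4.Tag, so convert it to str before searching.
        matches = pattern.finditer(str(item))
        for match in matches:
            # re.Match objects have no .text attribute; use .group().
            print(match.group())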
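In this example, info is probably a list of length 1. In that case, if you are sure you only need the first result and there is only one, you can switch to a call that returns a single match (such as find), or simply take the first result with this line: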
info = each.findAll('ul', class_='listing-row__meta')[0]
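Then use the code from the question, still passing str(info) to finditer, since info is then a bs4.Tag rather than a string.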
The error you are getting can be fixed by casting to str. Change this line:
matches=pattern.finditer(info)
to:
matches=pattern.finditer(str(info))
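The error and the str() fix can be reproduced without any network access. In this sketch a plain list stands in for what findAll returns (an assumption for illustration; the real object is a list-like ResultSet of tags), and the dot in the pattern is escaped so it only matches a literal period.

```python
import re

pattern = re.compile(r'Ext\. Color:')

# Stand-in for the findAll result: a list, not a string.
info = ['<ul><li><strong>Ext. Color:</strong> Gray</li></ul>']

try:
    list(pattern.finditer(info))
    error_message = None
except TypeError as exc:
    # The regex engine only accepts str or bytes-like input.
    error_message = str(exc)
print(error_message)

# Casting to str gives the regex engine something it can scan.
found = [match.group() for match in pattern.finditer(str(info))]
print(found)
```

Note that match.group() is used to read the matched text; match objects from the re module do not have a .text attribute.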
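There is absolutely no need for regex here. The HTML is regular, and with bs4 4.7.1+ you can use :contains to target the appropriate elements by their text, then use next_sibling to get the adjacent node that holds the value. (Recent Soup Sieve releases spell this selector :-soup-contains and treat :contains as a deprecated alias.) Zip the resulting lists and convert them to a dataframe: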
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
headers = ['Make','Ext','Int','Trans','Drive']
r = requests.get('https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX')
soup = bs(r.content, 'lxml')
make = [i.text.strip() for i in soup.select('.listing-row__title')]
ext_color = [i.next_sibling.strip() for i in soup.select('strong:contains("Ext. Color:")')]
int_color = [i.next_sibling.strip() for i in soup.select('strong:contains("Int. Color:")')]
transmission = [i.next_sibling.strip() for i in soup.select('strong:contains("Transmission:")')]
drive = [i.next_sibling.strip() for i in soup.select('strong:contains("Drivetrain:")')]
df = pd.DataFrame(zip(make, ext_color, int_color, transmission, drive), columns = headers)
print(df)
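As a variation on the idea above, the fields can also be gathered per listing, which keeps the values of one car together even when a listing lacks a field (zipping separate lists would then shift values between rows). This is a minimal offline sketch on invented markup that mirrors the class names from the question; SAMPLE_HTML and rows are hypothetical names, and html.parser is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Invented markup using the class names from the question.
SAMPLE_HTML = """
<div class="shop-srp-listings__listing-container">
  <ul class="listing-row__meta">
    <li><strong>Ext. Color:</strong> Gray</li>
    <li><strong>Int. Color:</strong> White</li>
    <li><strong>Transmission:</strong> Automatic</li>
  </ul>
</div>
<div class="shop-srp-listings__listing-container">
  <ul class="listing-row__meta">
    <li><strong>Ext. Color:</strong> Black</li>
    <li><strong>Transmission:</strong> Automatic</li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

rows = []
for container in soup.select('div.shop-srp-listings__listing-container'):
    row = {}
    for label in container.select('ul.listing-row__meta strong'):
        key = label.get_text(strip=True).rstrip(':')
        value = label.next_sibling  # text node right after </strong>
        # NavigableString is a str subclass; anything else means no text value.
        row[key] = value.strip() if isinstance(value, str) else None
    rows.append(row)

for row in rows:
    print(row)
```

A list of dicts like rows can be passed straight to pd.DataFrame(rows); keys missing from a listing then show up as NaN in that row.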