How to web scrape the <p> tags inside <div> tags that have a class/ID in HTML, using Python

Question

I want to extract data such as

Release date: June 16, 2016 Vulnerability identifier: APSB16-23 Priority: 3 CVE number: CVE-2016-4126

from https://helpx.adobe.com/security/products/air/apsb16-23.ug.html

Code:

import requests
from bs4 import BeautifulSoup as bs
from pprint import pprint
    
r = requests.get('https://helpx.adobe.com/cy_en/security/products/air/apsb16-31.html')
soup = bs(r.content, 'html.parser')
pprint([i.text for i in soup.select('div > .text > p', limit=4)])

Output:

['Release date:\xa0September 13, 2016',
 'Vulnerability identifier: APSB16-31',
 'Priority: 3',
 'CVE number:\xa0CVE-2016-6936']

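The problem is the \xa0. How should I remove it? Is there other, more efficient code for this? I would also like to convert the result to a CSV file. Thank you.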

Answer 1

\xa0 is actually the non-breaking space in Latin-1 (ISO 8859-1), which is also chr(160). You should replace it with a regular space.
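As a quick check of that claim, the sketch below confirms that \xa0 is code point 160 and shows two ways to turn it into a regular space. The sample string is copied from the question's output, and the helper name clean_nbsp is hypothetical, introduced only for illustration.

```python
import unicodedata

NBSP = "\xa0"  # the non-breaking space, same character as chr(160)


def clean_nbsp(text):
    """Replace every non-breaking space with a regular space."""
    return text.replace(NBSP, " ")


# Sample string copied from the question's output.
sample = "Release date:\xa0September 13, 2016"

print(ord(NBSP))               # code point of the character
print(unicodedata.name(NBSP))  # official Unicode name
print(clean_nbsp(sample))

# NFKC normalisation also maps the non-breaking space to a plain space,
# but it rewrites other compatibility characters as well, so str.replace
# is the narrower, more predictable choice.
normalized = unicodedata.normalize("NFKC", sample)
print(normalized == clean_nbsp(sample))
```

Using str.replace keeps the change limited to this one character, which is usually what is wanted when cleaning scraped text.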

Try this:

import requests
from bs4 import BeautifulSoup as bs
from pprint import pprint

r = requests.get('https://helpx.adobe.com/cy_en/security/products/air/apsb16-31.html')
soup = bs(r.content, 'html.parser')
pprint([i.text.replace(u'\xa0', u' ') for i in soup.select('div > .text >  p', limit=4)])

Output:

['Release date: September 13, 2016',
 'Vulnerability identifier: APSB16-31',
 'Priority: 3',
 'CVE number: CVE-2016-6936']

Edit: To write the results to a .csv file, use pandas.

Like this:

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://helpx.adobe.com/cy_en/security/products/air/apsb16-31.html')
soup = bs(r.content, 'html.parser')
release = [
    i.getText().replace(u'\xa0', u' ').split(": ") for i
    in soup.select('div > .text >  p', limit=4)
]
pd.DataFrame(release).set_index(0).T.to_csv("release_data.csv", index=False)
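If pandas is not available, the same one-row CSV can probably be produced with the standard library alone. This is a minimal sketch, not the answer's method: it works on a hard-coded copy of the cleaned lines rather than a live request, and the helper names lines_to_row and write_csv are hypothetical.

```python
import csv
import io

# Cleaned lines copied from the output above, so no network access is needed.
lines = [
    "Release date: September 13, 2016",
    "Vulnerability identifier: APSB16-31",
    "Priority: 3",
    "CVE number: CVE-2016-6936",
]


def lines_to_row(lines):
    """Split each 'label: value' line once and return a label -> value dict."""
    row = {}
    for line in lines:
        # maxsplit=1 keeps any later ': ' inside the value intact.
        label, value = line.split(": ", 1)
        row[label] = value.strip()
    return row


def write_csv(row, handle):
    """Write one header row and one data row to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)


row = lines_to_row(lines)
buffer = io.StringIO()  # in-memory stand-in for a file
write_csv(row, buffer)
print(buffer.getvalue())
```

To write a real file instead of the in-memory buffer, pass a handle opened with open("release_data.csv", "w", newline="", encoding="utf-8"). The csv module quotes the date automatically because it contains a comma.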


Answer 2

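I just took your code and added a for loop over the extracted HTML tags. It seems the Unicode conversion does not happen when a list comprehension is used, though that is only a guess.

Here is the script I just improvised: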

import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = "https://helpx.adobe.com/cy_en/security/products/air/apsb16-31.html"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
data = [i for i in soup.select('div > .text >  p', limit=4)]

for i in data:
    print(i.text)
    print("-"*20)

This will give you the desired output.
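One likely explanation for the difference (an interpretation, not something the answer itself states): pprint displays the repr of each string in the list, which escapes the non-breaking space as \xa0, whereas print writes the character itself, which most terminals render like an ordinary space. The character is still in the string either way, as this small sketch with a hard-coded sample suggests.

```python
# Sample string copied from the question's output.
text = "Release date:\xa0September 13, 2016"

# Printing a list shows the repr() of each element, so the escape is visible.
print([text])

# Printing the string itself writes the raw character, which looks like a space.
print(text)

# The non-breaking space is present in the string in both cases.
has_nbsp = "\xa0" in text
escaped = repr(text)
print(has_nbsp)
print(escaped)
```

So the loop changes only how the text is displayed. If the data is going into a CSV file, the character still has to be replaced explicitly.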

Problem description

2 solutions

Solution 1
2 Accepted 2021-04-01 08:58:44

Solution 2
0 2021-04-01 09:02:00
