Extracting data from a list of websites with BeautifulSoup

Question

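I have a small script that collects data from a list of websites. I had been using lynx, but after going through the data I noticed that some websites returned no results.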

#!/bin/bash

[ "$1" ] || exit 1

tmp=$(mktemp "${1}_XXXXXXXXX")

cat <<EOF > "$tmp"
https://google.com/search?q=${1}
https://duckduckgo.com/?q=${1}
https://www.bing.com/search?q=${1}
EOF

while read; do

    lynx -nonumbers -dump -hiddenlinks=merge -listonly "$REPLY" | \
    grep -i "${1}" | awk '!x[$0]++' >> file.txt

done < "$tmp"

rm "$tmp"

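It turned out to be a certificate validation problem. Apparently lynx has no flag to ignore validation. I understand that validation is in everyone's best interest, but I need to be able to extract data from every website in the list.

So I started looking into Python and BeautifulSoup. From this answer I can extract the links from a single URL, and from this answer I can ignore validation.

This is what I have so far, using Python 3.6: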

from bs4 import BeautifulSoup
import urllib.request
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

resp = urllib.request.urlopen('https://google.com', context=ctx)
soup = BeautifulSoup(resp, "lxml")

for link in soup.find_all('a', href=True):
    print(link['href'])

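I want to pass the same list from the bash script to the Python script, so that it extracts the links from every URL in the list. Basically, each line of this list

https://google.com/search?q=${1}
https://duckduckgo.com/?q=${1}
https://www.bing.com/search?q=${1}

would be passed as URLS to resp = urllib.request.urlopen('URLS', context=ctx)

How can I do this?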

Answer 1

Try Python string formatting.

If you are looking for 'https://google.com/search?q=text', then 'https://google.com/search?q=%s' % ('text',) produces 'https://google.com/search?q=text'.
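
As a rough sketch of how that formatting could be applied to all three URLs from the question: the names TEMPLATES and build_urls below are made up for illustration, and the call to quote_plus is an addition that the answer does not mention. It is there on the assumption that search terms may contain spaces or characters such as &, which would otherwise break the query string.

```python
from urllib.parse import quote_plus

# Templates taken from the question; {} marks where the search term goes.
# TEMPLATES and build_urls are hypothetical names used only in this sketch.
TEMPLATES = [
    "https://google.com/search?q={}",
    "https://duckduckgo.com/?q={}",
    "https://www.bing.com/search?q={}",
]


def build_urls(term):
    """Return one search URL per template, with the term percent-encoded."""
    encoded = quote_plus(term)
    return [template.format(encoded) for template in TEMPLATES]
```

Calling build_urls('text') gives a list whose first entry is the same 'https://google.com/search?q=text' string shown above, so the list can be looped over instead of being written to a temporary file.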

Answer 2

Read the site names from a list, iterate over them, send a request to each one, and parse the response.

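from bs4 import BeautifulSoup
import urllib.request

site_list = ['http://example.com', 'https://google.com']

for site in site_list:
    # Fetch the page and parse it
    resp = urllib.request.urlopen(site)
    soup = BeautifulSoup(resp, "lxml")

    # Print the target of every link on the page
    for link in soup.find_all('a', href=True):
        print(link['href'])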
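
The loop above does not use the unverified SSL context from the question, and it does not read the list that the bash script writes. One possible way to combine those pieces is sketched below. All function names (read_urls, unverified_context, unique_matching, links_from, collect) are hypothetical. The filtering step is an attempt to mirror the grep -i and awk '!x[$0]++' stages of the original pipeline, and skipping sites that raise an error is an assumption about the desired behaviour rather than something the question asks for.

```python
import ssl
import urllib.request


def read_urls(lines):
    """Return the non-empty, stripped lines of an iterable (file, stdin, list)."""
    return [line.strip() for line in lines if line.strip()]


def unverified_context():
    """Build an SSL context that skips certificate checks, as in the question."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be disabled before verify_mode is changed
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def unique_matching(links, term):
    """Keep links containing term (case-insensitive), first occurrence only."""
    seen = set()
    kept = []
    for link in links:
        if term.lower() in link.lower() and link not in seen:
            seen.add(link)
            kept.append(link)
    return kept


def links_from(url, ctx):
    """Fetch one URL and return the href of every <a> tag."""
    # Third-party import kept local so the helpers above work without bs4.
    from bs4 import BeautifulSoup

    with urllib.request.urlopen(url, context=ctx) as resp:
        soup = BeautifulSoup(resp, "lxml")
    return [a["href"] for a in soup.find_all("a", href=True)]


def collect(urls, term, ctx):
    """Gather matching links from every URL, skipping sites that fail."""
    found = []
    for url in urls:
        try:
            found.extend(links_from(url, ctx))
        except OSError as err:  # URLError and ssl.SSLError are OSError subclasses
            print(f"skipped {url}: {err}")
    return unique_matching(found, term)
```

If these functions were saved in a script (say links.py, a made-up name) that ends by printing each item of collect(read_urls(sys.stdin), sys.argv[1], unverified_context()), the bash script could replace its while loop with python3 links.py "$1" < "$tmp" and keep the rest unchanged.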

Solution 1
1 2017-06-09 08:33:25

Solution 2
1 2017-06-09 09:05:12
