[英]scraping multiple URLs with bs4
我正在嘗試使用 BeautifulSoup 從 USPTO 網頁編譯專利文件。
df['link']
urls=df['link'].to_numpy()
urls
for i in urls:
page = requests.get(i)
## storing the content of the page in a variable
txt = page.text
## creating BeautifulSoup object
soup = bs4.BeautifulSoup(txt, 'html.parser')
soup
但是,它只打印 URL 之一,而不是所有 5 個鏈接。 我需要將所有 5 個鏈接作為文本抓取。
任何建議表示贊賞。 干杯
我需要抓取的鏈接#
array(['http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n',
'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=2&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n',
'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=3&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n',
'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=4&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n',
'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=5&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n'],
dtype=object)
import pandas as pd
def Main(url):
for item in range(1, 6):
df = pd.read_html(url.format(item), attrs={
'width': '100%'}, skiprows=1, match=r"^\d{4}/\d+")[0]
df.to_csv("data.csv", index=False, mode="a")
Main("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r={}&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n")
輸出: 在線查看
截屏:
評論中每個使用請求的新代碼:
import requests
def Main(url):
for item in range(1, 6):
with requests.Session() as req:
r = req.get(url.format(item))
with open("data.txt", 'a') as f:
f.writelines(r.text)
Main("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r={}&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n")
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.