
Crawl data from an HTML table in Python

I am a beginner in web crawling and I need help getting the values from a table. I have all the required fields (LOCATION, DATE, SUMMARY, DEADLINE). The SUMMARY cell also contains a URL to another page, and I want that URL appended along with the other fields, i.e. (LOCATION, DATE, SUMMARY, DEADLINE, URL).

This is the website.

This is my code so far, but it's not working.

import requests as rq
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.tendersinfo.com/global-information-technology-tenders-{}.php'

amount_of_pages = 2 #5194 
rows = []

for i in range(1,amount_of_pages):
    response = rq.get(url.format(i))


    if response.status_code == 200:
        soup = BeautifulSoup(response.text,'html.parser')
        table = soup.find('table',{'id':'datatable'})

        headers = []

        for th in table.find("tr").find_all("th"):
           headers.append(th.text.strip())

        for tr in table.find_all("tr")[1:]:
            cells = []
            tds = tr.find_all("td")

            if len(tds) == 0:
                ths = tr.find_all("th")

                for th in ths:
                    cells.append(th.text.strip())
            else:
                for td in tds:
                    cells.append(td.text.strip())
                    cells.append('https://www.tendersinfo.com/' + td.find('a')['href'])

            rows.append(cells)   
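The likely failure is the line that appends the link: `td.find('a')` returns `None` for cells without an `<a>` tag, so indexing it raises a `TypeError`, and the URL is appended once per cell instead of once per row. A minimal sketch of the guarded parsing, run on inline sample HTML rather than the live site (the `href` value is hypothetical):

```python
from bs4 import BeautifulSoup

# Small sample of the table structure (hypothetical data and href).
html = """
<table id="datatable">
  <tr><th>LOCATION</th><th>DATE</th><th>SUMMARY</th><th>DEADLINE</th></tr>
  <tr>
    <td>India</td><td>21-May-2020</td>
    <td><a href="tender_detail.php?id=1">Random Access Memory.</a></td>
    <td>01-Jun-2020</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find("table", {"id": "datatable"}).find_all("tr")[1:]:
    cells = [td.text.strip() for td in tr.find_all("td")]
    link = tr.find("a")  # None when the row has no link
    cells.append("https://www.tendersinfo.com/" + link["href"] if link else "")
    rows.append(cells)

print(rows[0])
```

This keeps the four text columns and appends the absolute URL exactly once per row, with an empty string when a row carries no link.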

Here you go; I just re-coded the majority of it.

import requests as rq
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.tendersinfo.com/global-information-technology-tenders-{}.php'

amount_of_pages = 10  # Max is 5194 currently
rows = []
headers = []

for i in range(1, amount_of_pages):
    response = rq.get(url.format(i))
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        table = soup.find('table', {'id': 'datatable'})

        # Column names come from the header row
        headers = []
        for th in table.find("tr").find_all("th"):
            headers.append(th.text.strip())

        # Data rows: skip the header row
        for tr in table.find_all("tr")[1:]:
            cells = []
            tds = tr.find_all("td")
            if len(tds) == 0:
                for th in tr.find_all("th"):
                    cells.append(th.text.strip())
            else:
                for td in tds:
                    cells.append(td.text.strip())
            rows.append(cells)

pd.DataFrame(rows, columns=headers).to_csv(r"C:\Users\HP\Desktop\Web Scraping (RFP's)\RFP_SCRAPED_DATA.csv", index=False)

Since you're using pandas, why not use read_html, which returns the extracted tables as a list of DataFrames.

>>> tables = pd.read_html("https://www.tendersinfo.com/global-information-technology-tenders.php")

>>> tables[1]

  LOCATION         DATE                                        SUMMARY     DEADLINE
0    India  21-May-2020  Liquid Crystal Display Lcd Panel Or Monitors.  01-Jun-2020
1    India  21-May-2020                          Random Access Memory.  01-Jun-2020
2    India  21-May-2020         Supply Of Analog Transceiver-handheld.  01-Jun-2020
3    India  21-May-2020                   Supply Of Computer Printers.  01-Jun-2020
4    India  21-May-2020                                 All In One Pc.  01-Jun-2020

You can get the table easily using pd.read_html and save the data to a CSV file using df.to_csv().

import pandas as pd

url = "https://www.tendersinfo.com/ajax_all_new_search.php?country=information-technology&increment=1&%20select=500&%20total=259655&%20search_id=19906&%20order=id&%20imagevalue=1"

df = pd.read_html(url)[0]

df.to_csv("RFP_SCRAPED_DATA.csv", index=False)
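Note that read_html keeps only the cell text, so the SUMMARY links are lost. One way to keep them (shown here on inline sample HTML with a hypothetical href, and assuming every data row contains exactly one link) is to parse the same HTML once with read_html for the table and once with BeautifulSoup for the anchors, then attach the hrefs as a new column:

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Sample of the table structure (hypothetical data and href).
html = """
<table id="datatable">
  <tr><th>LOCATION</th><th>DATE</th><th>SUMMARY</th><th>DEADLINE</th></tr>
  <tr><td>India</td><td>21-May-2020</td>
      <td><a href="tender_detail.php?id=1">Random Access Memory.</a></td>
      <td>01-Jun-2020</td></tr>
</table>
"""

# read_html extracts the text; BeautifulSoup recovers the anchor hrefs.
df = pd.read_html(StringIO(html))[0]
soup = BeautifulSoup(html, "html.parser")
df["URL"] = ["https://www.tendersinfo.com/" + a["href"]
             for a in soup.select("#datatable tr a")]

print(df)
```

If some rows have no link, the list of anchors will be shorter than the DataFrame and the assignment will fail, so a per-row lookup (as in the BeautifulSoup loop above) is safer for real pages.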
