
How to export all the details in the div to excel/csv using Beautiful Soup (Python)?

I am a newbie to BeautifulSoup/Python and I am trying to find the data. My website structure looks like this (screenshot):

If I open the div with class border, it looks like this (image below).

I have done something like this:

for P in soup.find_all('p', attrs={'class': 'bid_no pull-left'}):
    print(P.find('a').contents[0])


A div structure looks like the one below. There are around 10 such divs on each page, and I want to extract the Items, Quantity Required, Bid number and End date from each of them. Please help me.

<div class="border block " style="display: block;">
    <div class="block_header">
        <p class="bid_no pull-left"> BID NO: <a style="color:#fff !important" href="/showbidDocument/1844736">GEM/2020/B/763154</a></p> 
        <p class="pull-right view_corrigendum" data-bid="1844736" style="display:none; margin-left: 10px;"><a href="#">View Corrigendum</a></p>

         <div class="clearfix"></div>
    </div>

    <div class="col-block">
        <p><strong style="text-transform: none !important;">Item(s): </strong><span>Compatible Cartridge</span></p>
        <p><strong>Quantity Required: </strong><span>8</span></p>

        <div class="clearfix"></div>
    </div>
    <div class="col-block">
        <p><strong>Department Name And Address:</strong></p>
        <p class="add-height">
            Ministry Of Railways<br> Na<br> South Central Railway N/a
        </p>
        <div class="clearfix"></div>
    </div>
    <div class="col-block">
        <p><strong>Start Date: </strong><span>25-08-2020 02:54 PM</span></p>
        <p><strong>End Date: </strong><span>04-09-2020 03:00 PM</span></p>
        <div class="clearfix"></div>

    </div>


    <div class="clearfix"></div>
</div>
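
For reference, here is a minimal sketch of how these four fields could be pulled out of one such div with BeautifulSoup. It is not the accepted answer's approach: the `field()` helper is a hypothetical convenience, and the `html` string below is just a trimmed copy of the div above so the snippet is self-contained; in practice it would hold the real page source.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the div shown above, used only so this sketch runs on its own.
html = """
<div class="border block">
  <div class="block_header">
    <p class="bid_no pull-left"> BID NO: <a href="/showbidDocument/1844736">GEM/2020/B/763154</a></p>
  </div>
  <div class="col-block">
    <p><strong>Item(s): </strong><span>Compatible Cartridge</span></p>
    <p><strong>Quantity Required: </strong><span>8</span></p>
  </div>
  <div class="col-block">
    <p><strong>Start Date: </strong><span>25-08-2020 02:54 PM</span></p>
    <p><strong>End Date: </strong><span>04-09-2020 03:00 PM</span></p>
  </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

def field(block, label):
    """Return the <span> text that follows a <strong> whose label contains `label`."""
    strong = block.find('strong', string=lambda s: s and label in s)
    return strong.find_next('span').get_text(strip=True) if strong else None

for block in soup.find_all('div', class_='border'):
    bid_no = block.find('p', class_='bid_no').a.get_text(strip=True)  # e.g. GEM/2020/B/763154
    print(bid_no, field(block, 'Item'), field(block, 'Quantity'), field(block, 'End Date'))
```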

Error (screenshot):

Try the below approach using requests and BeautifulSoup. I have created the script with the URL fetched from the website, and it then builds a dynamic URL to traverse each and every page to get the data.

What exactly the script is doing:

  1. First, the script creates a URL in which the page_no query string parameter is incremented by 1 upon completion of each traversal.

  2. requests gets the data from the created URL using the get method, and the response is then passed to BeautifulSoup to parse the HTML structure using lxml.

  3. Then the script searches the parsed data for the div where the data is actually present.

  4. Finally, it loops over all the div text data one by one for each page, as in the script below.

```python
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs

def scrap_bid_data():
    page_no = 1  # initial page number
    while True:
        print('Hold on creating URL to fetch data...')
        URL = 'https://bidplus.gem.gov.in/bidlists?bidlists&page_no=' + str(page_no)  # create dynamic URL
        print('URL created: ' + URL)
        scraped_data = requests.get(URL, verify=False)  # request to get the data
        soup_data = bs(scraped_data.text, 'lxml')  # parse the scraped data using lxml
        extracted_data = soup_data.find('div', {'id': 'pagi_content'})  # div which contains the required data
        if len(extracted_data) == 0:  # if the div is empty, stop further execution of the script
            break
        else:
            for idx in range(len(extracted_data)):  # loop through all the child divs and print the data
                if idx % 2 == 1:  # required data is present at odd indexes only
                    bid_data = extracted_data.contents[idx].text.strip().split('\n')
                    print('-' * 100)
                    print(bid_data[0])   # BID number
                    print(bid_data[5])   # Items
                    print(bid_data[6])   # Quantity Required
                    print(bid_data[10] + bid_data[12].strip())  # Department name and address
                    print(bid_data[16])  # Start date
                    print(bid_data[17])  # End date
                    print('-' * 100)
            page_no += 1  # increment the page number by 1

scrap_bid_data()
```
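
Since the question asks for excel/csv output, here is a minimal sketch of how the printed fields could instead be collected and written with Python's standard csv module. The file name bids.csv and the column headers are illustrative choices, not part of the original script; the bid_data indexes are the same as in the code above.

```python
import csv

def write_bids_to_csv(rows, path='bids.csv'):
    """Write the collected bid rows to a CSV file (the path is just an example)."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Bid No', 'Items', 'Quantity Required',
                         'Department Name And Address', 'Start Date', 'End Date'])
        writer.writerows(rows)

# Inside scrap_bid_data(), instead of the print() calls, one row per bid could be appended:
# rows.append([bid_data[0], bid_data[5], bid_data[6],
#              bid_data[10] + bid_data[12].strip(), bid_data[16], bid_data[17]])
# and write_bids_to_csv(rows) called once after the while loop finishes.
```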


Output (screenshot):
