
Web-Scraping using BeautifulSoup (missing values when scraping)

I have been trying to web-scrape a realtor website using BeautifulSoup and ran into two difficulties that I cannot seem to fix.

Difficulties:

  1. When I run my code below, I am missing some date values. The dataframe should hold 68 rows of data scraped from the first page. The title and description scrapes return 68 rows, but the date scrape returns only 66, and no 'N/A' values are returned for the missing ones. Does anyone have an idea why this is? When I inspected the page elements, the affected listings had the same tags, except they are listed as VIP or Special (promotion) apartments.
  2. Secondly, I cannot seem to figure out how to scrape meta itemprop tags. I keep getting blank values when I use:
for tag in soup.findAll('div', attrs={'class':'announcement-block-text-container announcement-block__text-container'}):
    for tag2 in tag.findAll('div', attrs={'class':'announcement-block__date'}):

Thank you in advance for any assistance you could provide.

Python Code:

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup as bsoup
import ssl
import pandas as pd

def get_headers():
    # Headers
    headers = {'accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
               'accept-language':'en-US,en;q=0.9',
               'cache-control':'max-age=0',
               'upgrade-insecure-requests':'1',
               'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}
    return headers

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
count = 1  # for pagination

# Make list holders
title = []
description = []
date = []

urls = ['https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/5-r/']

for x in urls:
    count = 1
    y = x
    while count < 2:  # will get only the 1st page
        print(x)
        req = Request(x, headers=get_headers())  # request with all headers
        htmlfile = urlopen(req)
        htmltext = htmlfile.read()
        soup = bsoup(htmltext, 'html.parser')

        for tag in soup.findAll('div', attrs={'class':'announcement-block-text-container announcement-block__text-container'}):
            for tag2 in tag.findAll('a', attrs={'class':'announcement-block__title'}):
                text = tag2.get_text().strip()
                if len(text) > 0:
                    title.append(text)
                else:
                    title.append('N/A')

        for tag in soup.findAll('div', attrs={'class':'announcement-block-text-container announcement-block__text-container'}):
            for tag2 in tag.findAll('div', attrs={'class':'announcement-block__description'}):
                text = tag2.get_text().strip()
                if len(text) > 0:
                    description.append(text)
                else:
                    description.append('N/A')

        for tag in soup.findAll('div', attrs={'class':'announcement-block-text-container announcement-block__text-container'}):
            for tag2 in tag.findAll('div', attrs={'class':'announcement-block__date'}):
                text = tag2.get_text().strip()
                if len(text) > 0:
                    date.append(text)
                else:
                    date.append('N/A')

        # Go to next page
        count = count + 1
        page = '?page=' + str(count)
        x = y + page

data_frame = pd.DataFrame(list(zip(title, description, date)), columns=['Title', 'Description', 'Date'])

You get 66 items because your date[] list contains only 66 elements; therefore, you need to check all three fields at once in a single for loop. Your if/else checks do nothing because there are no announcement-block__date divs with empty content on the page: in the VIP/Special blocks the div is missing entirely, so the inner loop never runs and nothing gets appended.

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup as bsoup
import ssl
import pandas as pd

def get_headers():
    # Headers
    headers = {'accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
               'accept-language':'en-US,en;q=0.9',
               'cache-control':'max-age=0',
               'upgrade-insecure-requests':'1',
               'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}
    return headers

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
count = 1  # for pagination

# Make list holder
info = {
    'title': [],
    'description': [],
    'date': []
}

urls = ['https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/5-r/']

for x in urls:
    count = 1
    y = x
    while count < 2:  # will get only the 1st page
        print(x)
        req = Request(x, headers=get_headers())  # request with all headers
        htmlfile = urlopen(req)
        htmltext = htmlfile.read()
        soup = bsoup(htmltext, 'html.parser')
        for tag in soup.findAll('div', attrs={'class':'announcement-block-text-container announcement-block__text-container'}):
            # Look up all three fields inside the same block so the lists stay aligned
            title = tag.find('a', attrs={'class':'announcement-block__title'})
            description = tag.find('div', attrs={'class':'announcement-block__description'})
            date = tag.find('div', attrs={'class':'announcement-block__date'})
            info['title'].append(title.get_text().strip() if title else 'N/A')
            info['description'].append(description.get_text().strip() if description else 'N/A')
            info['date'].append(date.get_text().strip() if date else 'N/A')
        # Go to next page
        count = count + 1
        page = '?page=' + str(count)
        x = y + page

data_frame = pd.DataFrame(list(zip(info['title'], info['description'], info['date'])), columns=['Title', 'Description', 'Date'])
print(len(info['title']), len(info['description']), len(info['date']))
print(data_frame)
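
The single loop keeps the three lists aligned: every announcement block appends exactly one value to each list, and a block that lacks a field (such as the VIP/Special listings without a date) now contributes 'N/A' instead of silently shortening date[]. Since zip() truncates to the shortest input, the aligned lists also guarantee the dataframe keeps all 68 rows.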

As for your second question, a similar question has already been answered here.
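
In short, a meta tag is a void element with no text content, so get_text() always returns an empty string; the value lives in the tag's content attribute instead. Below is a minimal sketch, assuming the announcement blocks contain meta itemprop tags as described in the question (the actual itemprop names depend on the page's markup):

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup as bsoup

url = 'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/5-r/'
req = Request(url, headers={'user-agent': 'Mozilla/5.0'})  # minimal header, for illustration only
soup = bsoup(urlopen(req).read(), 'html.parser')

for block in soup.findAll('div', attrs={'class': 'announcement-block-text-container announcement-block__text-container'}):
    # attrs={'itemprop': True} matches any meta tag that has an itemprop attribute
    for meta in block.findAll('meta', attrs={'itemprop': True}):
        # meta tags hold their value in the 'content' attribute, not in their text
        print(meta.get('itemprop'), meta.get('content', 'N/A'))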
