How to scrape all data from this element? Node.js/Puppeteer
I want to collect the name, position, and type (online/in person) from these elements:
<div class="cse-userslist-user" data-user="178">
<div class="cse-ul--img">
<div class="cse-ul--img-child">
<img src="https://secure.gravatar.com/avatar/99574b52aaa5ecb0bea650602fecfbd7?s=100&d=mm&r=g" alt="Dina Abdelma">
</div>
</div>
<div class="cse-ul--content">
<div class="cse-ul--name">Dina Abdelma</div>
<div class="cse-ul--position">Head of SMEs, MDI</div>
<div class="cse-ul--role">Online</div>
</div>
<div class="cse-ul-overlay">
<div class="cse-ul-overlay-bg"></div>
<a class="cse-open-popform cse-btn cse-btn--primary">
Message </a>
<a href="#" class="cse-btn cse-btn--primary disabled">Schedule Meeting</a> </div>
</div>
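I got past the login page, but I can't scrape all the data; I only managed to scrape the first letter of one name.
Also, the number in data-user is always random, and nothing else changes. I want to scrape the data from these three elements and put it into an array/Excel: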
<div class="cse-ul--name">Dina Abdelma</div>
<div class="cse-ul--position">Head of SMEs, MDI</div>
<div class="cse-ul--role">Online</div>
This is my current code for logging in to the page (not relevant, it works):
await page.waitForSelector('#username')
await page.type('#username', login)
await page.type('#password', password)
await page.click('#ur-frontend-form > form > div > div > div > input')
await page.waitForSelector('#cse-main > div > div > section.cse-section.cse-section--links > div > a:nth-child(2)')
await page.click('#cse-main > div > div > section.cse-section.cse-section--links > div > a:nth-child(2)')
await page.waitForSelector('#cse-main > div.cse-page.cse-page--networking.cse-global-bg > section.cse-section.cse-section--userslist > div > div.cse-userslist-button > a')
await page.click('#cse-main > div.cse-page.cse-page--networking.cse-global-bg > section.cse-section.cse-section--userslist > div > div.cse-userslist-button > a')
Edit
var names = await page.$$eval('.cse-ul--name',
elements=> elements.map(item=>item.textContent))
This works, but it doesn't scrape all the data, only the data that is currently visible.
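One way to get name, position, and role together is to query each `.cse-userslist-user` card rather than the name elements alone, which also sidesteps the random `data-user` value. The sketch below is a minimal example, not a confirmed fix: `scrapeUsers` and `loadAllUsers` are hypothetical helper names, and `loadAllUsers` assumes the remaining cards appear when a "load more" style button is clicked, which the page in the question may or may not do.

```javascript
// Hypothetical helper: read one record per user card.
// `page` is a Puppeteer Page that is already on the users list.
async function scrapeUsers(page) {
  return page.$$eval('.cse-userslist-user', (cards) =>
    cards.map((card) => {
      // Return the trimmed text of a child element, or null if it is missing.
      const text = (selector) => {
        const el = card.querySelector(selector);
        return el ? el.textContent.trim() : null;
      };
      return {
        name: text('.cse-ul--name'),
        position: text('.cse-ul--position'),
        role: text('.cse-ul--role'),
      };
    })
  );
}

// Hypothetical helper: keep clicking a "load more" button until it is gone
// or no new cards show up. ASSUMPTION: the page loads extra cards on click;
// `buttonSelector` must be supplied by the caller.
async function loadAllUsers(page, buttonSelector, maxClicks = 50) {
  for (let i = 0; i < maxClicks; i++) {
    const before = await page.$$eval('.cse-userslist-user', (c) => c.length);
    const button = await page.$(buttonSelector);
    if (!button) break; // nothing left to click
    await button.click();
    try {
      // Wait until the number of cards in the DOM grows past `before`.
      await page.waitForFunction(
        (n) => document.querySelectorAll('.cse-userslist-user').length > n,
        { timeout: 5000 },
        before
      );
    } catch {
      break; // timed out: assume everything is loaded
    }
  }
}
```

Usage would look something like `await loadAllUsers(page, '.cse-userslist-button > a'); const users = await scrapeUsers(page);`, where the button selector is a guess based on the login code in the question. If the list loads on scroll instead, the clicking step would need to be replaced with scrolling.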
You can use Beautiful Soup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser') # html = the given html page from your question
# looks for a div, class='cse-ul--name' and decodes the contents of it
print(soup.find('div', 'cse-ul--name').decode_contents())
# looks for a div, class='cse-ul--position' and decodes the contents of it
print(soup.find('div', 'cse-ul--position').decode_contents())
# looks for a div, class='cse-ul--role' and decodes the contents of it
print(soup.find('div', 'cse-ul--role').decode_contents())
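Since the question also asks for the results in an array or Excel, here is a rough Node.js sketch that writes an array of records to a CSV file, which Excel can open. It assumes the scraped records are plain objects with `name`, `position`, and `role` keys; `toCsv` and `saveCsv` are hypothetical helper names, not part of Puppeteer. Values are quoted when needed, because a position such as "Head of SMEs, MDI" contains a comma.

```javascript
const fs = require('fs');

// Convert an array of plain objects into CSV text with a header row.
function toCsv(rows, columns) {
  // Quote a value if it contains a comma, quote, or line break.
  const quote = (value) => {
    const s = value == null ? '' : String(value);
    return /[",\r\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
  };
  const lines = [columns.join(',')];
  for (const row of rows) {
    lines.push(columns.map((c) => quote(row[c])).join(','));
  }
  return lines.join('\r\n') + '\r\n';
}

// Write the records to `path`; the column names are an assumed record shape.
function saveCsv(path, rows, columns = ['name', 'position', 'role']) {
  fs.writeFileSync(path, toCsv(rows, columns), 'utf8');
}
```

For example, `saveCsv('users.csv', users)` would write one line per user, where `users` is whatever array of records the scraping step produced.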