I'd like to scrape a web page for the name of a subject and find all occurrences of a word I choose to search for within the page. My code so far is not working:
import requests
import csv
from bs4 import BeautifulSoup

start_urls = 'https://en.wikipedia.org/wiki/Data_science'
r = requests.get(start_urls)
soup = BeautifulSoup(r.content, 'html.parser')

crawled_page = []
for page in soup.findAll('data'):
    crawled_page.append(page.get('href'))
print(crawled_page)
Error message:
C:\Users\tette\PycharmProjects\WebcrawlerProject\venv\Scripts\python.exe C:/Users/tette/PycharmProjects/WebcrawlerProject/webScrapy/webScrapy/spiders/webcrawler.py
[]

Process finished with exit code 0
If you want to search for a word in the text, then you should use

import re

soup.findAll(string=re.compile('data'))

but this finds strings (NavigableString objects), not tags, so you may have to get their parent to search attributes like href:
import requests
from bs4 import BeautifulSoup, NavigableString
import re

start_urls = 'https://en.wikipedia.org/wiki/Data_science'
r = requests.get(start_urls)
soup = BeautifulSoup(r.content, 'html.parser')

crawled_page = []
for page in soup.findAll(string=re.compile('data')):
    #print(isinstance(page, NavigableString))
    #print(page.parent)
    href = page.parent.get('href')
    if href:  # skip None
        crawled_page.append(href)
print(crawled_page)
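As a quick, self-contained check of the parent-lookup idea (using an inline HTML snippet instead of the live Wikipedia page, so the result is predictable):

```python
import re
from bs4 import BeautifulSoup

# Small made-up HTML document standing in for the fetched page
html = '''
<html><body>
  <a href="/wiki/Data_mining">data mining</a>
  <p>no link here, just data in text</p>
  <a href="/wiki/Statistics">statistics</a>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')

hrefs = []
for text in soup.find_all(string=re.compile('data')):
    # each match is a NavigableString; only its parent tag can carry an href
    href = text.parent.get('href')
    if href:
        hrefs.append(href)

print(hrefs)  # ['/wiki/Data_mining']
```

The `<p>` text also matches "data", but its parent has no `href`, so it is skipped.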
EDIT: the same approach with lxml, using XPath:
import requests
import lxml.html

start_urls = 'https://en.wikipedia.org/wiki/Data_science'
r = requests.get(start_urls)
soup = lxml.html.fromstring(r.content)

crawled_page = []
for page in soup.xpath('//*[contains(text(), "data")]'):
    href = page.attrib.get('href')
    if href:  # skip None
        crawled_page.append(href)
print(crawled_page)
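The same sanity check works with lxml on an inline snippet. Note that XPath 1.0's `contains(text(), ...)` only tests the element's first text node, which is enough for simple link text like this:

```python
import lxml.html

# Same made-up HTML snippet as above, standing in for the fetched page
html = '''
<html><body>
  <a href="/wiki/Data_mining">data mining</a>
  <p>no link here, just data in text</p>
  <a href="/wiki/Statistics">statistics</a>
</body></html>
'''

tree = lxml.html.fromstring(html)

hrefs = []
for el in tree.xpath('//*[contains(text(), "data")]'):
    # the <p> matches too, but has no href attribute, so it is skipped
    href = el.attrib.get('href')
    if href:
        hrefs.append(href)

print(hrefs)  # ['/wiki/Data_mining']
```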