
Webcrawling webpages using Python (PyCharm)

I want to crawl a webpage to get the topic names, and to find all occurrences of a word that I decide to search for in the page. My code so far doesn't work:

import requests
import csv
from bs4 import BeautifulSoup
start_urls = 'https://en.wikipedia.org/wiki/Data_science'
r = requests.get(start_urls)
soup = BeautifulSoup(r.content, 'html.parser')
crawled_page = []
for page in soup.findAll('data'):
    crawled_page.append(page.get('href'))
print(crawled_page)


Error message:

C:\Users\tette\PycharmProjects\WebcrawlerProject\venv\Scripts\python.exe C:/Users/tette/PycharmProjects/WebcrawlerProject/webScrapy/webScrapy/spiders/webcrawler.py
[]

Process finished with exit code 0

If you want to search for a word in the text, then you should use

import re

soup.findAll(string=re.compile('data'))

but it finds strings (NavigableString), not tags, so you may have to get their parents in order to search for the href attribute:

import requests
from bs4 import BeautifulSoup, NavigableString
import re

start_urls = 'https://en.wikipedia.org/wiki/Data_science'

r = requests.get(start_urls)

soup = BeautifulSoup(r.content, 'html.parser')
crawled_page = []
for page in soup.findAll(string=re.compile('data')):
    #print(isinstance(page, NavigableString))
    #print(page.parent)
    href = page.parent.get('href')
    if href: # skip None
        crawled_page.append(href)
print(crawled_page)
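
A possible simplification (not from the original answer, just a sketch under the same assumptions): if only links are of interest, BeautifulSoup can match <a> tags whose text matches the regex directly, which avoids the step through parent:

import requests
import re
from bs4 import BeautifulSoup

start_url = 'https://en.wikipedia.org/wiki/Data_science'
r = requests.get(start_url)
soup = BeautifulSoup(r.content, 'html.parser')

# find <a> tags whose text matches the regex, then read href directly
links = [a.get('href')
         for a in soup.find_all('a', string=re.compile('data'))
         if a.get('href')]  # skip tags without href
print(links)

Note that find_all('a', string=...) only matches anchors whose entire text is a single string, so it is slightly stricter than the parent-based approach above.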

EDIT: something similar with lxml, using xpath

import requests
import lxml.html

start_urls = 'https://en.wikipedia.org/wiki/Data_science'

r = requests.get(start_urls)

soup = lxml.html.fromstring(r.content)

crawled_page = []

for page in soup.xpath('//*[contains(text(), "data")]'):
    href = page.attrib.get('href')
    if href: # skip None
        crawled_page.append(href)

print(crawled_page)
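
Similarly (again only a sketch, assuming links are the target), the XPath can be restricted to a elements and select the href attribute in one expression:

import requests
import lxml.html

start_url = 'https://en.wikipedia.org/wiki/Data_science'
r = requests.get(start_url)
root = lxml.html.fromstring(r.content)

# select href attributes of <a> elements whose text contains "data"
hrefs = root.xpath('//a[contains(text(), "data")]/@href')
print(hrefs)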
