How do I make a crawler that extracts information from relative paths?

I am trying to make a simple crawler that extracts the links from the "See also" section of this page: https://en.wikipedia.org/wiki/Web_scraping . That is 19 links in total, which I have managed to extract using Beautiful Soup. However, I get them as relative links in a list, and I still need to turn them into absolute links. The intended result is a list of 19 absolute URLs.
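For the relative-to-absolute step, urllib.parse.urljoin from the standard library resolves a relative href against the page it was found on. A minimal sketch (the example href is hypothetical, but any href from the extracted list works the same way):

from urllib.parse import urljoin

base = 'https://en.wikipedia.org/wiki/Web_scraping'
href = '/wiki/Data_scraping'  # hypothetical relative href from the "See also" list
print(urljoin(base, href))    # https://en.wikipedia.org/wiki/Data_scraping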

Then I wanted to use those same 19 links and extract further information from them, for example the first paragraph of each of the 19 pages. So far I have this:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://en.wikipedia.org/wiki/Web_scraping'
data = requests.get(url).text

soup = BeautifulSoup(data, 'html.parser')

links = soup.find('div', {'class':'div-col'})
test = links.find_all('a', href=True)

data = []
for link in links.find_all('a'):
    data.append(link.get('href'))
#print(data)

soupNew = BeautifulSoup(''.join(data), 'html.parser')
# parsing the joined href strings as HTML yields no <p> tags,
# so indexing the empty result raises an IndexError:
# print(soupNew.find_all('p')[0].text)

# test whether there is any <p> tag; this returns empty, so I have not looped correctly
x = soupNew.find_all('p')
if x is not None and len(x) > 0:
    section = x[0]
print(x)

My main issue is that I simply can't find a way to loop through the 19 links and look for the information I need. I am trying to learn Beautiful Soup and Python, so I would prefer to stick with those for now, even though there might be better options out there. I just need some help, or preferably a simple example, explaining the process of doing the things above. Thanks!

You should split your code like you split your problems.

  1. Your first problem was to get a list of links, so you could write a function called get_urls:

     def get_urls():
         url = 'https://en.wikipedia.org/wiki/Web_scraping'
         data = requests.get(url).text
         soup = BeautifulSoup(data, 'html.parser')
         links = soup.find('div', {'class':'div-col'})
         data = []
         for link in links.find_all('a'):
             data.append("https://en.wikipedia.org" + link.get('href'))
         return data
  2. You wanted to get the first paragraph of every URL. With a little research I got this one:

     def get_first_paragraph(url):
         data = requests.get(url).text
         soup = BeautifulSoup(data, 'html.parser')
         return soup.p.text
  3. Now it all has to be wired up (a complete, runnable sketch follows this list):

     def iterate_through_urls(urls):
         for url in urls:
             print(get_first_paragraph(url))

     def run():
         urls = get_urls()
         iterate_through_urls(urls)
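For completeness, here is a minimal, self-contained sketch that wires the three functions into one runnable script. Two details are my own assumptions rather than part of the answer above: it uses urljoin (already imported in the question) instead of string concatenation to build the absolute URLs, and get_first_paragraph skips empty paragraphs, since on some Wikipedia pages the first <p> tag is an empty placeholder.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = 'https://en.wikipedia.org/wiki/Web_scraping'

def get_urls():
    # collect the "See also" links and resolve each relative href
    # against the page it came from
    data = requests.get(BASE).text
    soup = BeautifulSoup(data, 'html.parser')
    links = soup.find('div', {'class': 'div-col'})
    return [urljoin(BASE, a.get('href')) for a in links.find_all('a', href=True)]

def get_first_paragraph(url):
    # return the first non-empty paragraph of the page
    data = requests.get(url).text
    soup = BeautifulSoup(data, 'html.parser')
    for p in soup.find_all('p'):
        if p.text.strip():
            return p.text
    return ''

def iterate_through_urls(urls):
    for url in urls:
        print(get_first_paragraph(url))

def run():
    iterate_through_urls(get_urls())

if __name__ == '__main__':
    run()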
