How do I make a crawler that extracts information from relative paths?

I am trying to make a simple crawler that extracts the links from the "See also" section of this page: https://en.wikipedia.org/wiki/Web_scraping . That is 19 links in total, which I have managed to extract using Beautiful Soup. However, I get them as relative links in a list, and I still need to turn them into absolute links. The intended result is a list of 19 absolute URLs.
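For the relative-to-absolute step, urllib.parse.urljoin from the standard library resolves a relative href against the page it was found on. A minimal sketch (the example href is hypothetical, but any href from the extracted list works the same way):

from urllib.parse import urljoin

base = 'https://en.wikipedia.org/wiki/Web_scraping'
href = '/wiki/Data_scraping'  # hypothetical relative href from the "See also" list
print(urljoin(base, href))    # https://en.wikipedia.org/wiki/Data_scraping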

Then I wanted to use those same 19 links and extract further information from them, for example the first paragraph of each of the 19 pages. So far I have this:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://en.wikipedia.org/wiki/Web_scraping'
data = requests.get(url).text

soup = BeautifulSoup(data, 'html.parser')

links = soup.find('div', {'class':'div-col'})
test = links.find_all('a', href=True)

data = []
for link in links.find_all('a'):
    data.append(link.get('href'))
#print(data)

soupNew = BeautifulSoup(''.join(data), 'html.parser')
# parsing the joined href strings as HTML yields no <p> tags,
# so indexing the empty result raises an IndexError:
# print(soupNew.find_all('p')[0].text)

# test whether there is any <p> tag; this returns empty, so I have not looped correctly
x = soupNew.find_all('p')
if x is not None and len(x) > 0:
    section = x[0]
print(x)

My main issue is that I simply can't find a way to loop through the 19 links and look for the information I need. I am trying to learn Beautiful Soup and Python, so I would prefer to stick with those for now, even though there might be better options out there. I just need some help, or preferably a simple example, explaining the process of doing the things above. Thanks!

You should split your code like you split your problems.

  1. Your first problem was to get a list of links, so you could write a function called get_urls:

     def get_urls():
         url = 'https://en.wikipedia.org/wiki/Web_scraping'
         data = requests.get(url).text
         soup = BeautifulSoup(data, 'html.parser')
         links = soup.find('div', {'class':'div-col'})
         data = []
         for link in links.find_all('a'):
             data.append("https://en.wikipedia.org" + link.get('href'))
         return data
  2. You wanted to get the first paragraph of every URL. With a little research I got this one:

     def get_first_paragraph(url):
         data = requests.get(url).text
         soup = BeautifulSoup(data, 'html.parser')
         return soup.p.text
  3. Now it all has to be wired up (a complete, runnable sketch follows this list):

     def iterate_through_urls(urls):
         for url in urls:
             print(get_first_paragraph(url))

     def run():
         urls = get_urls()
         iterate_through_urls(urls)
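For completeness, here is a minimal, self-contained sketch that wires the three functions into one runnable script. Two details are my own assumptions rather than part of the answer above: it uses urljoin (already imported in the question) instead of string concatenation to build the absolute URLs, and get_first_paragraph skips empty paragraphs, since on some Wikipedia pages the first <p> tag is an empty placeholder.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = 'https://en.wikipedia.org/wiki/Web_scraping'

def get_urls():
    # collect the "See also" links and resolve each relative href
    # against the page it came from
    data = requests.get(BASE).text
    soup = BeautifulSoup(data, 'html.parser')
    links = soup.find('div', {'class': 'div-col'})
    return [urljoin(BASE, a.get('href')) for a in links.find_all('a', href=True)]

def get_first_paragraph(url):
    # return the first non-empty paragraph of the page
    data = requests.get(url).text
    soup = BeautifulSoup(data, 'html.parser')
    for p in soup.find_all('p'):
        if p.text.strip():
            return p.text
    return ''

def iterate_through_urls(urls):
    for url in urls:
        print(get_first_paragraph(url))

def run():
    iterate_through_urls(get_urls())

if __name__ == '__main__':
    run()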
