简体   繁体   中英

Python, extract text from webpage

I am working on a project where I am crawling thousands of websites to extract text data, the end use case is natural language processing.

EDIT * since I am crawling 100's of thousands of websites I cannot tailor a scraping code for each one, which means I cannot search for specific element id's, the solution I am looking for is a general one *

I am aware of solutions such as the.get_text() function from beautiful soup. The issue with this method is that it gets all the text from the website, much of it being irrelevant to the main topic on that particular page. for the most part a website page will be dedicated to a single main topic, however on the sides and top and bottom there may be links or text about other subjects or promotions or other content.

With the.get_text() function it return all the text on the site page in one go. the problem is that it combines it all (the relevant parts with the irrelevant ones. is there another function similar to.get_text() that returns all text but as a list and every list object is a specific section of the text, that way it can be know where new subjects start and end.

As a bonus, is there a way to identify the main body of text on a web page?

Below I have mentioned snippets that you could use to query data in desired way using BeautifulSoup4 and Python3:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://yoursite/page')
soup = BeautifulSoup(response.text, 'html.parser')
# Print the body content in list form
print(soup.body.contents[0])
# Print the first found div on html page
print(soup.find('div'))
# Print the all divs on html page in list form
print(soup.find_all('div'))
# Print the element with 'required_element_id' id
print(soup.find(id='required_element_id'))
# Print the all html elements in list form that matches the selectors
print(soup.select(required_css_selectors))
# Print the attribute value in list form
print(soup.find(id='someid').get("attribute-name"))
# You can also break your one large query into multiple queries
parent = soup.find(id='someid')
# getText() return the text between opening and closing tag
print(parent.select(".some-class")[0].getText())

For your more advance requirement, you can check Scrapy as well. Let me know if you face any challenge in implementing this or if your requirement is something else.

thank you very much for the knowledge, it def helped when it came to practice. seen that i am new in the world of python, i do have a quick question

I am trying to scrape the following:

However i am stuck at this particular syntax as i don't know what to edit: print(soup.find(id='someid').get("attribute-name"))

could you kindly assist? Thanks =) cheers

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM