
Read the text of a web page in Python

I know this question, or similar ones, has already been asked, but the answers I found didn't work for me, so I am asking here.

How can I get the text of an HTML page so that I can compare it with other given values?

Let's say I have this web page:

<html>
<head>
<title>This is my page</title>

<center>
<div class="mon_title">Some title here</div>
<table class="mon_list" >
<tr class='list'><th class="list" align="center"></th><th class="list" align="center">Set 1</th><th class="list" align="center">Set 2</th><th class="list" align="center">Set 4</th><th class="list" align="center">Set 5</th><th class="list" align="center">Set 6</th><th class="list" align="center">Set 7</th><th class="list" align="center">Set 8</th><th class="list" align="center">Set 9</th><th class="list" align="center">Set 10</th><th class="list" align="center">Set 11</th><th class="list" align="center">Set 12</th></tr>
<tr class='list even'><td class="list" align="center">Value 1</td><td class="list" align="center">Value 2</td><td class="list" align="center">Value 3</td><td class="list" align="center">Value 4</td><td class="list" align="center">Value 5</td><td class="list">Value 6</td><td class="list">Value 7</td><td class="list" align="center">Value 8</td><td class="list" align="center">Value 9</td><td class="list" align="center">Value 10</td><td class="list" align="center">Value 11</td><td class="list" align="center">Value 12</td></tr>
<tr class='list even'><td class="list" align="center">Value 1</td><td class="list" align="center">Value 2</td><td class="list" align="center">Value 3</td><td class="list" align="center">Value 4</td><td class="list" align="center">Value 5</td><td class="list">Value 6</td><td class="list">Value 7</td><td class="list" align="center">Value 8</td><td class="list" align="center">Value 9</td><td class="list" align="center">Value 10</td><td class="list" align="center">Value 11</td><td class="list" align="center">Value 12</td></tr>
</table>

Sorry for any typos or missing parts; I hope you get the point of the page. My program should check whether given values appear in the table, for example "Is Value 2 somewhere in it?", and, if so, then check "Is Value 5 in the same row?"

Is that generally possible? How much effort would be needed to construct the program?

All I have so far is a download of the full HTML of the page, using this Python code:

import requests

url = 'http://some.random.site.com/you/ad/here'
print (requests.get(url).text)

which gives me the HTML code you see above. What I want instead is the text you get when you press Ctrl+A on a website and copy and paste it into an editor.

PS: I'm fairly new to programming, so sorry if there are any concepts I don't really get. Also, sorry for my English; I'm German.

You can use urllib and re to find the values:

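import re
import urllib.request

url = 'http://some.random.site.com/you/ad/here'
# read() returns bytes, so decode it (UTF-8 assumed) rather than calling str() on it
data = urllib.request.urlopen(url).read().decode('utf-8')

# raw string, so \d reaches the regex engine unchanged
values = re.findall(r"Value \d+", data)
print(values)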

Output:

['Value 1', 'Value 2', 'Value 3', 'Value 4', 'Value 5', 'Value 6', 'Value 7', 'Value 8', 'Value 9', 'Value 10', 'Value 11', 'Value 12', 'Value 1', 'Value 2', 'Value 3', 'Value 4', 'Value 5', 'Value 6', 'Value 7', 'Value 8', 'Value 9', 'Value 10', 'Value 11', 'Value 12']
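Note that the flat list above no longer tells you which row a value came from, which the "same row" part of the question needs. As a rough sketch (regular expressions are brittle on real-world HTML, and the shortened `html` string here is made-up sample data modeled on the page in the question), you could match each `<tr>` first and then the `<td>` cells inside it:

```python
import re

# Shortened, hypothetical sample modeled on the table in the question.
html = """<table class="mon_list">
<tr class='list'><th class="list">Set 1</th><th class="list">Set 2</th></tr>
<tr class='list even'><td class="list">Value 1</td><td class="list">Value 2</td></tr>
<tr class='list even'><td class="list">Value 3</td><td class="list">Value 4</td></tr>
</table>"""

rows = []
# Non-greedy match, so each <tr>...</tr> is captured separately.
for row_html in re.findall(r"<tr\b.*?</tr>", html, flags=re.DOTALL):
    # Capture only the text between <td ...> and </td>.
    cells = re.findall(r"<td\b[^>]*>(.*?)</td>", row_html, flags=re.DOTALL)
    if cells:  # skip the header row, which has <th> cells only
        rows.append(cells)

print(rows)
```

This keeps one inner list per table row, so a later check can look at a single row at a time.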

You could use a parsing library such as Beautiful Soup. Your question is also answered here.
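For the "Ctrl+A and copy" text the question asks for, Beautiful Soup's `get_text()` is the usual route. If you would rather not install anything, the standard library's `html.parser` can do a rough version. The following is only a sketch: the `TextExtractor` class, the `visible_text` helper and the `sample` string are names made up for this example, and it does not filter out `<title>`, `<script>` or `<style>` contents the way a browser would.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text outside of tags.
        text = data.strip()
        if text:  # ignore whitespace-only runs such as line breaks
            self.chunks.append(text)


def visible_text(html):
    """Return the text pieces of an HTML string, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Shortened sample modeled on the page in the question.
sample = ('<div class="mon_title">Some title here</div>'
          '<table class="mon_list"><tr><td>Value 1</td><td>Value 2</td></tr></table>')
print(visible_text(sample))
```

You would pass `requests.get(url).text` instead of `sample` to run it against the real page.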

import requests
from bs4 import BeautifulSoup

url = 'http://some.random.site.com/you/ad/here'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')  # name the parser explicitly
table = soup.find(class_='mon_list')  # the <table class="mon_list"> element
listy = []
for tr in table.find_all('tr'):
    # the header row has only <th> cells, so it produces an empty list
    cols = tr.find_all('td')
    listy.append([elem.get_text() for elem in cols])
print(listy)

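This gives you the values as a nested list, one inner list per table row (the first one is empty because the header row contains only th cells):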

[[], ['Value 1', 'Value 2', 'Value 3', 'Value 4', 'Value 5', 'Value 6', 'Value 7', 'Value 8', 'Value 9', 'Value 10', 'Value 11', 'Value 12'], ['Value 1', 'Value 2', 'Value 3', 'Value 4', 'Value 5', 'Value 6', 'Value 7', 'Value 8', 'Value 9', 'Value 10', 'Value 11', 'Value 12']]
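With a nested list like the one above, the two checks from the question ("Is Value 2 somewhere in it?" and "Is Value 5 in the same row?") come down to plain list membership tests. A minimal sketch, where `rows_with` and `same_row` are hypothetical helper names and `rows` is shortened, made-up sample data in the same shape:

```python
def rows_with(rows, value):
    """Return the indexes of all rows that contain the given value."""
    return [i for i, row in enumerate(rows) if value in row]


def same_row(rows, first, second):
    """Return True if some single row contains both values."""
    return any(first in row and second in row for row in rows)


# Made-up sample in the same shape as the nested list above.
rows = [
    [],  # header row: no <td> cells
    ['Value 1', 'Value 2', 'Value 3', 'Value 4', 'Value 5'],
    ['Value 6', 'Value 7', 'Value 8', 'Value 9', 'Value 10'],
]

print(rows_with(rows, 'Value 2'))             # which rows hold 'Value 2'
print(same_row(rows, 'Value 2', 'Value 5'))   # both in one row?
print(same_row(rows, 'Value 2', 'Value 7'))   # in different rows
```

The comparison is an exact string match, so you may want to call `.strip()` on each cell first if the real page pads its values with whitespace.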

