How can I print a webpage line by line in Python 3.x

Question

All I want to do is print the HTML text of a simple website. When I try printing, I get the text below in raw format, with newline characters (\n) instead of actual new lines.

This is my code:

import urllib.request

page = urllib.request.urlopen('http://www.york.ac.uk/teaching/cws/wws/webpage1.html', data = None)
pageText = page.read()
pageLines = page.readlines()
print(pageLines)
print(pageText)

I've tried all kinds of other things and discovered a few things along the way. When I search the pageText variable, even after converting it to a string, it does not find any \n character. If I copy the raw text myself, with the new lines represented as \n, and print() that, it converts the \n characters into actual new lines, which is what I want. The problem is that I can't get that result without copying it myself.

To show you what I mean, here are some HTML snippets:

Raw text:

b'<HMTL>\n<HEAD>\n<TITLE>webpage1</TITLE>\n</HEAD>\n<BODY BGCOLOR="FFFFFf" LINK="006666" ALINK="8B4513" VLINK="006666">\n

What I want:

b'<HMTL>
<HEAD>
<TITLE>webpage1</TITLE>
</HEAD>
<BODY BGCOLOR='FFFFFf' LINK='006666' ALINK='8B4513' VLINK='006666'>

I also used:

page = str(page)
lines = page.split('\n')

and it surprisingly did nothing. It just printed it as one line.

Please, help me. I am surprised that I found nothing that worked for me. Even on forums, nothing worked.

Answer 1

One way to do it is by using Python's requests module. You can install it with pip install requests (you may need sudo if you're not using a virtualenv).

import requests

res = requests.get('http://www.york.ac.uk/teaching/cws/wws/webpage1.html')
if res.status_code == 200:  # check that the request went through
    # res.text is the decoded body (a str), so print() shows real line breaks;
    # res.content is raw bytes and would print as b'...\n...' on a single line
    print(res.text)

    # if you want to split the HTML into lines, use splitlines() like below
    # lines = res.text.splitlines()
    # print(lines)

Answer 2

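Your byte string appears to have hard-coded \n in it, because what you see is the printed representation of a bytes object rather than text.

For example, you can't split the value on a str separator initially.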

In [1]: s = b'<HMTL>\n<HEAD>\n'

In [2]: s.split('\n')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-e85dffa8b351> in <module>()
----> 1 s.split('\n')

TypeError: a bytes-like object is required, not 'str'

So, you str() it, but that doesn't seem to work either.

In [3]: str(s).split('\n')
Out[3]: ["b'<HMTL>\\n<HEAD>\\n'"]

But if you escape the backslash, so that you split on the two characters \ and n, it does somewhat work.

In [4]: str(s).split('\\n')
Out[4]: ["b'<HMTL>", '<HEAD>', "'"]

You could use a raw string to split on

In [5]: for line in str(s).split(r'\n'):
   ...:     print(line)
   ...:
b'<HMTL>
<HEAD>
'

Or, if you don't want the leading b, you can decode the byte string into a string object that you can then split.

In [9]: for line in s.decode("UTF-8").split('\n'):
   ...:     print(line)
   ...:
<HMTL>
<HEAD>
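
As a side note to the steps above, a bytes object can also be split directly, as long as the separator is bytes too. The sketch below uses the same sample value and also shows why str() did not help: without an encoding it returns the printed representation, in which the newline becomes a backslash followed by n. The variable names are only illustrative.

```python
# Sample value from the session above.
s = b'<HMTL>\n<HEAD>\n'

# bytes.split() works when the separator is bytes as well.
byte_lines = s.split(b'\n')  # the last item is an empty bytes object

# str() on bytes without an encoding returns the repr, so each newline
# turns into the two characters backslash + n.
as_repr = str(s)

# Decoding gives real text; splitlines() leaves no trailing empty item.
text_lines = s.decode('utf-8').splitlines()

for line in text_lines:
    print(line)
```

Decoding first is usually the more convenient route, since print() then shows each line without the b'...' wrapper.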

Answer 3

What you have is not text but bytes. If you want text, just decode it.

b = b'<HMTL>\n<HEAD>\n<TITLE>webpage1</TITLE>\n</HEAD>\n<BODY BGCOLOR="FFFFFf" LINK="006666" ALINK="8B4513" VLINK="006666">\n'
s = b.decode()  # might need to specify an encoding
print(s)

Output:

<HMTL>
<HEAD>
<TITLE>webpage1</TITLE>
</HEAD>
<BODY BGCOLOR="FFFFFf" LINK="006666" ALINK="8B4513" VLINK="006666">
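
Applied to the urllib code in the question, this could look like the sketch below. The helper names decode_lines and fetch_lines are made up for illustration, and falling back to UTF-8 when the server sends no charset is an assumption. Note also that the response body can only be read once, so calling readlines() after read(), as the question does, returns an empty list.

```python
import urllib.request


def decode_lines(raw, encoding='utf-8'):
    """Decode a bytes payload and return it as a list of text lines."""
    return raw.decode(encoding, errors='replace').splitlines()


def fetch_lines(url):
    """Download url and return its body as a list of text lines."""
    with urllib.request.urlopen(url) as page:
        raw = page.read()  # read the body once; a second read returns b''
        # Use the charset from the Content-Type header if there is one
        # (assumption: fall back to UTF-8 otherwise).
        encoding = page.headers.get_content_charset() or 'utf-8'
    return decode_lines(raw, encoding)


# Offline check with bytes like the ones shown in the question:
sample = b'<HMTL>\n<HEAD>\n<TITLE>webpage1</TITLE>\n</HEAD>\n'
for line in decode_lines(sample):
    print(line)
```

With network access, looping over fetch_lines('http://www.york.ac.uk/teaching/cws/wws/webpage1.html') and printing each item should print the page one line at a time.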

3 answers

solution1
0 ACCEPTED 2016-10-04 01:14:24

solution2
0 2016-10-05 00:13:11

solution3
0 2016-10-05 07:42:56
