
How can I print a webpage line by line in Python 3.x

All I want to do is print the HTML text of a simple website. When I try printing, I get the text below in raw format with newline characters (\n) instead of actual new lines.

This is my code:

import urllib.request

page = urllib.request.urlopen('http://www.york.ac.uk/teaching/cws/wws/webpage1.html', data = None)
pageText = page.read()
pageLines = page.readlines()
print(pageLines)
print(pageText)

I've tried all kinds of other things and noticed a few details. When I search the pageText variable, even after converting it to a string, it does not find any \n character. If I copy the raw text myself, with the new lines represented as \n, and print() that, the \n characters become actual new lines, which is what I want. The problem is that I can't get that result without copying it myself.

To show you what I mean, here are some HTML snippets:

Raw text:

b'<HMTL>\n<HEAD>\n<TITLE>webpage1</TITLE>\n</HEAD>\n<BODY BGCOLOR="FFFFFf" LINK="006666" ALINK="8B4513" VLINK="006666">\n

What I want:

b'<HMTL>
<HEAD>
<TITLE>webpage1</TITLE>
</HEAD>
<BODY BGCOLOR='FFFFFf' LINK='006666' ALINK='8B4513' VLINK='006666'>

I also used:

page = str(page)
lines = page.split('\n')

and it surprisingly did nothing. It just printed it as one line.

Please, help me. I am surprised that I found nothing that worked for me. Even on forums, nothing worked.

One way to do it is with Python's requests module. You can install it with pip install requests (you may need sudo if you're not using a virtualenv).

import requests

res = requests.get('http://www.york.ac.uk/teaching/cws/wws/webpage1.html')
if res.status_code == 200:  # check that the request went through
    # res.content is bytes and would print as b'...\n...'; res.text is the decoded str,
    # so its newlines show up as real line breaks on screen
    print(res.text)

    # if you want to split the HTML into lines, use splitlines() like below
    # lines = res.text.splitlines()
    # print(lines)

Your byte string holds real newline bytes; the \n you see is only how a bytes object is displayed when printed.

For example, you can't split the bytes value on a str separator:

In [1]: s = b'<HMTL>\n<HEAD>\n'

In [2]: s.split('\n')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-e85dffa8b351> in <module>()
----> 1 s.split('\n')

TypeError: a bytes-like object is required, not 'str'

So, you str() it, but that doesn't work either: str() of a bytes object gives its printed form, with each newline written as a backslash followed by n.

In [3]: str(s).split('\n')
Out[3]: ["b'<HMTL>\\n<HEAD>\\n'"]

But if you split on an escaped backslash followed by n, it does somewhat work.

In [4]: str(s).split('\\n')
Out[4]: ["b'<HMTL>", '<HEAD>', "'"]

You could also use a raw string as the separator:

In [5]: for line in str(s).split(r'\n'):
   ...:     print(line)
   ...:
b'<HMTL>
<HEAD>
'

Or, if you don't want the leading b, you can decode the byte string into a str object that you can then split.

In [9]: for line in s.decode("UTF-8").split('\n'):
   ...:     print(line)
   ...:
<HMTL>
<HEAD>
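To put the steps above side by side, here is a minimal sketch using a shortened sample of the page bytes (the sample value and the variable names are only illustrative). It shows that str() without an encoding gives the printed form of the bytes, while decoding gives real text that str.splitlines() can break into lines.

```python
# Shortened sample of the bytes returned by page.read() (illustrative value).
raw = b'<HMTL>\n<HEAD>\n<TITLE>webpage1</TITLE>\n'

# str() on bytes without an encoding returns the repr, escapes included.
as_repr = str(raw)

# Passing an encoding decodes instead; same result as raw.decode('utf-8').
as_text = str(raw, 'utf-8')

# splitlines() splits on \n, \r\n and \r and drops the line endings.
lines = as_text.splitlines()

for line in lines:
    print(line)
```

Using splitlines() instead of split('\n') avoids a trailing empty string when the text ends with a newline, and it also copes with Windows-style \r\n endings.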

What you have is not text but bytes. If you want text, just decode it.

b = b'<HMTL>\n<HEAD>\n<TITLE>webpage1</TITLE>\n</HEAD>\n<BODY BGCOLOR="FFFFFf" LINK="006666" ALINK="8B4513" VLINK="006666">\n'
s = b.decode()  # might need to specify an encoding
print(s)

Output:

<HMTL>
<HEAD>
<TITLE>webpage1</TITLE>
</HEAD>
<BODY BGCOLOR="FFFFFf" LINK="006666" ALINK="8B4513" VLINK="006666">
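On the "might need to specify an encoding" point: one possible approach is to read the charset from the Content-Type header and fall back to UTF-8 when none is given. The sketch below is only an illustration under those assumptions; iter_text_lines, fake_headers and fake_body are hypothetical names, and the in-memory objects stand in for the real urlopen() response so the example runs offline.

```python
import io
from email.message import Message


def iter_text_lines(response, headers, fallback='utf-8'):
    """Yield decoded lines, without line endings, from a binary response."""
    # get_content_charset() returns None when Content-Type names no charset.
    encoding = headers.get_content_charset() or fallback
    # Binary file-like objects yield one line of bytes per iteration.
    for raw_line in response:
        yield raw_line.decode(encoding, errors='replace').rstrip('\r\n')


# Offline stand-ins (hypothetical) for urlopen(...) and its .headers attribute.
fake_headers = Message()
fake_headers['Content-Type'] = 'text/html; charset=ISO-8859-1'
fake_body = io.BytesIO(b'<HMTL>\r\n<HEAD>\r\n<TITLE>caf\xe9</TITLE>\r\n')

decoded = list(iter_text_lines(fake_body, fake_headers))
for line in decoded:
    print(line)
```

With a real request, the same function should accept the object returned by urllib.request.urlopen(url) as response and its headers attribute as headers, since that attribute is an email.message.Message subclass.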
