
How do you convert Python's urllib2.urlopen() to text?

I'm writing a Python program that does the following:

  • Gets information from a web page.
  • Writes it to a .txt file.

I've used urllib2.urlopen() to get the HTML code, but what I want is the information on the page. To be clear: urllib2.urlopen() returns HTML, and I want the text of that page written to the file, not the HTML markup.

My program at the moment:

import urllib2
import time
url = urllib2.urlopen('http://www.dev-explorer.com/articles/using-python-httplib')
html = url.readlines()
for line in html:
    print line

time.sleep(5)

You have to call one of the read methods on the object that urlopen() returns:

import urllib2

url = urllib2.urlopen('someURL')
html = url.readlines()
for line in html:
    # At this point 'line' is already a str
    print line  # do something with it

Other methods are available too: read() and readline().
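
To make the difference between the three methods concrete, here is a minimal sketch. It assumes Python 3, where urllib2.urlopen() became urllib.request.urlopen() and the response yields bytes. To run without a network connection it uses io.BytesIO as a stand-in for the response; fake_urlopen and the sample markup are made up for illustration.

```python
import io

# Stand-in for the object returned by urlopen(); a real response is also
# a file-like object that yields bytes. The markup below is made up.
SAMPLE = b"<html>\n<body>Hello</body>\n</html>\n"

def fake_urlopen():
    """Hypothetical helper: returns a fresh file-like object each call."""
    return io.BytesIO(SAMPLE)

# read() returns the whole body as a single bytes object.
whole = fake_urlopen().read()

# readline() returns one line per call, newline included.
first_line = fake_urlopen().readline()

# readlines() returns a list with every line.
all_lines = fake_urlopen().readlines()

# The data is bytes, so decode it before treating it as text.
text = whole.decode("utf-8")
print(text)
```

All three return the markup itself; none of them strips the tags, which is why a parsing step is still needed to get the page text.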

Edit:

As I said in one of my comments in this thread, you may need to use BeautifulSoup to scrape the content you want. I think this was already solved here.

You have to install BeautifulSoup (the bs4 module used below is distributed as the beautifulsoup4 package):

pip install beautifulsoup4

Then follow the example:

from bs4 import BeautifulSoup
import urllib2
import re

html = urllib2.urlopen('someURL').read()
soup = BeautifulSoup(html, 'html.parser')
texts = soup.findAll(text=True)

def visible(element):
    # Skip text that lives in elements the browser does not render.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments.
    elif re.match('<!--.*-->', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)

If you run into problems with non-ASCII characters, change str(element) to unicode(element) in the visible function.
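
If installing a third-party package is not an option, the same filtering idea can be sketched with the standard library alone. This is an alternative technique, not the BeautifulSoup approach above: it assumes Python 3 and uses html.parser.HTMLParser. The names VisibleTextParser and visible_text and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text that is not inside style, script, head, or title."""

    SKIPPED = {"style", "script", "head", "title"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never arrive here; HTMLParser routes them to handle_comment().
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

sample = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><!-- hidden --><h1>Heading</h1><p>First paragraph.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(visible_text(sample))
```

The standard-library parser is less forgiving of badly broken markup than BeautifulSoup, so treat this as a starting point for simple pages.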

You could use the requests package, which I prefer over urllib. This returns all the HTML from the web page.

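import io
import requests

response = requests.get('http://stackoverflow.com/questions/34157599/how-do-you-convert-pythons-urllib2-urlopen-to-text')

# response.text is unicode, so open the file with an explicit encoding.
with io.open('test.txt', 'w', encoding='utf-8') as f:
    f.write(response.text)
# No f.close() needed: the with block closes the file.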
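
Whatever produces the text, the last step in the question is writing it to a .txt file. A minimal sketch of that step follows, assuming Python 3 and UTF-8 output. The save_text helper, the file name page.txt, and the sample string are made up for illustration, and the demo writes into a temporary directory so that no existing file is touched.

```python
import io
import os
import tempfile

def save_text(text, path):
    """Write text to path as UTF-8, ending with exactly one newline."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text.rstrip("\n") + "\n")

# Demonstration in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "page.txt")
save_text("Heading\nCafé au lait", demo_path)

# Read the file back to confirm what was written.
with io.open(demo_path, encoding="utf-8") as f:
    saved = f.read()
print(saved)
```

Passing the encoding explicitly keeps non-ASCII characters from failing on platforms whose default encoding is not UTF-8.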
