Extract text from a file object using .read()

I'm trying to read the source of a website with this code:

import urllib2
z=urllib2.urlopen('http://skreemr.com/results.jsp?q=said+the+whale&search=SkreemR+Search')
z.read()
print z
txt = open('music.txt','w')
txt.write(str(z))
txt.close()
for i in open('music.txt','r'):
        if '''onclick="javascript:pageTracker._trackPageview('/clicks/''' in i:
                print i

And all I get for the source code is:

<addinfourl at 51561608L whose fp = <socket._fileobject object at 0x0000000002CCA480>>

It might be an error, I don't know.
Does anyone know of a better way to do the job above without putting it into a text file first?

z is a file object. In fact, your code prints the object's description. You need to store the result of z.read() in a variable (or print it directly).

You should do:

import urllib2
z=urllib2.urlopen('http://skreemr.com/results.jsp?q=said+the+whale&search=SkreemR+Search')
i = z.read()
print i
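To see the difference between the object and its contents without touching the network, here is a minimal sketch. It assumes io.StringIO as a stand-in for the object returned by urlopen, the sample HTML is made up, and it uses Python 3 syntax rather than the Python 2 of the code above.

```python
import io

# Stand-in for the file-like object returned by urlopen (hypothetical sample data).
z = io.StringIO("<html><body>hello</body></html>")

description = str(z)   # the object's description, not its contents
contents = z.read()    # the actual text held by the object

print(description)
print(contents)
```

Only the second print shows the page text; the first shows the same kind of description the question ran into.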

Calling .read() returns the contents as a string; it does not turn z itself into that string. Use z = z.read() instead.
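A small sketch of why the assignment matters, again assuming io.StringIO as a stand-in for the urlopen result and Python 3 syntax. The return value of read() is the string; the name z stays bound to the file-like object, and a second read() comes back empty because the stream has already been consumed.

```python
import io

z = io.StringIO("page source")  # hypothetical stand-in for urllib2.urlopen(...)

first = z.read()    # returns the whole contents as a string
second = z.read()   # the stream is already at its end, so this is empty

still_file_like = hasattr(z, "read")  # z itself is still the file-like object

print(repr(first), repr(second), still_file_like)
```

This is why the result has to be kept at the first call: it cannot be fetched again from the same object later.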

z is the file-like object. str(z) just gives you the representation you're seeing.

You need to keep the string (the contents of the file) that's returned by z.read().

Better yet, just iterate over it directly:

import urllib2
z=urllib2.urlopen('http://skreemr.com/results.jsp?q=said+the+whale&search=SkreemR+Search')
for i in z:
    if '''onclick="javascript:pageTracker._trackPageview('/clicks/''' in i:
        print i
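The same line-by-line filter can be tried offline. This is a sketch under assumptions: matching_lines is a hypothetical helper, the sample markup is invented, and io.StringIO stands in for the response object. It is written for Python 3, where a real urllib.request.urlopen response yields bytes, so the lines would need decoding (or the marker would need to be bytes) first.

```python
import io

MARKER = "onclick=\"javascript:pageTracker._trackPageview('/clicks/"

def matching_lines(file_like, marker=MARKER):
    """Return the lines of file_like that contain marker (hypothetical helper)."""
    return [line for line in file_like if marker in line]

# Invented sample page: one plain link and one tracked link.
sample = io.StringIO(
    "<a href=\"#\">skip me</a>\n"
    "<a onclick=\"javascript:pageTracker._trackPageview('/clicks/song.mp3');\">hit</a>\n"
)

hits = matching_lines(sample)
for line in hits:
    print(line, end="")
```

Iterating this way never builds the whole page as one string and needs no intermediate text file.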

I think you're missing what read does. Try:

data = z.read()
print data
with open('music.txt','w') as txt:
    txt.write(data)
# Or, more concisely, write the response straight to the file:
with open('music.txt','w') as out:
    out.write(urllib2.urlopen('http://skreemr.com/results.jsp?q=said+the+whale&search=SkreemR+Search').read())

But this is just the HTML for the page; you will need to parse it yourself using Beautiful Soup or lxml.
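As a rough illustration of what parsing rather than string matching looks like, here is a sketch that swaps in the standard library's html.parser instead of Beautiful Soup or lxml, so it runs without extra installs. The OnclickCollector class and the sample markup are assumptions made up for this example, and it targets Python 3.

```python
from html.parser import HTMLParser

class OnclickCollector(HTMLParser):
    """Collect the onclick attribute of every tag that has one (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.onclicks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        for name, value in attrs:
            if name == "onclick" and value is not None:
                self.onclicks.append(value)

# Invented sample markup: one plain link and one tracked link.
sample_html = (
    "<a href='/a'>plain link</a>"
    "<a href='/b' onclick=\"javascript:pageTracker._trackPageview('/clicks/song.mp3');\">tracked</a>"
)

collector = OnclickCollector()
collector.feed(sample_html)
print(collector.onclicks)
```

A parser keeps working when an attribute moves to a different line or position in the tag, which a plain substring test on each line may not.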
