Writing crawled data to file using Python

I crawled a Google search result page using urllib2 and wrote the data to a file. But when I open the saved HTML file in a browser, I see some garbled UTF-8 characters.

Here is my Python code for fetching and saving the page.

import os
import urllib2
import webbrowser
url = 'https://www.google.co.in/search?q=lcd+tv'
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 '
                     '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}
req = urllib2.Request(url, headers=hdr)
response = urllib2.urlopen(req)
f = response.read()
file = open('file.html','w+')
file.write(f)
file.close()

Here is the screenshot of the saved page.

[Image: screenshot of the saved page]

We can see stray characters in some of the titles. Even the ad images are not loading :(

How can I remove those stray Unicode characters?

Thanks in advance.

The web server sent UTF-8 encoded data, and response.read() returns those bytes undecoded, but you wrote them to a file opened in text mode. In Python 2, text mode does not re-encode the data, though on Windows it translates line endings, so the file may not match what the server sent byte for byte. Open the file with mode "wb" (binary) so the bytes are written unchanged; this may resolve your issue.
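
A minimal sketch of the binary-mode write, assuming the response body has already been read into a bytes object. The `page` value below is a made-up stand-in for `response.read()`, and `save_raw_bytes` is a hypothetical helper name, not something from the question:

```python
import os
import tempfile


def save_raw_bytes(path, data):
    """Write bytes to path in binary mode, leaving them unchanged."""
    with open(path, 'wb') as out:
        out.write(data)


# Assumed stand-in for response.read(): UTF-8 bytes with a non-ASCII dash.
page = u'<title>LCD TV \u2013 prices</title>\r\n'.encode('utf-8')

# Write into a temporary directory so the sketch does not touch real files.
target = os.path.join(tempfile.mkdtemp(), 'file.html')
save_raw_bytes(target, page)

# Reading back in binary mode returns exactly the bytes that were written.
with open(target, 'rb') as saved:
    round_trip = saved.read()
```

Using a `with` block also closes the file even if the write fails, which the explicit `file.close()` in the question does not guarantee.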

In addition, Google does not supply encoding information in the page itself, but only in the Content-Type header. The browser may therefore not recognize the file as UTF-8 when loading it from disk. You can try adding a meta tag to the document:

 <meta http-equiv="content-type" content="text/html; charset=utf-8">
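
One way this could be automated is sketched below, written for Python 3. It reads the charset out of a Content-Type header value and inserts the meta tag above into the saved bytes. The helper names `charset_from_header` and `add_charset_meta` and the sample markup are assumptions for illustration, and the regular expressions are a rough heuristic rather than a full HTML parser:

```python
import re
from email.message import Message

META_TAG = b'<meta http-equiv="content-type" content="text/html; charset=utf-8">'


def charset_from_header(content_type, default='utf-8'):
    """Return the lowercased charset named in a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset(default)


def add_charset_meta(html_bytes):
    """Insert META_TAG after the opening <head> tag unless a charset is declared."""
    if re.search(br'<meta[^>]+charset', html_bytes, re.IGNORECASE):
        return html_bytes  # the page already declares a charset
    head = re.search(br'<head(\s[^>]*)?>', html_bytes, re.IGNORECASE)
    if head is None:
        return META_TAG + html_bytes  # no <head> found: put the tag first
    return html_bytes[:head.end()] + META_TAG + html_bytes[head.end():]


# Hypothetical sample page without any charset declaration.
sample = b'<html><head><title>t</title></head><body></body></html>'
fixed = add_charset_meta(sample)
header_charset = charset_from_header('text/html; charset=UTF-8')
```

The tag is only correct if the bytes really are UTF-8, so checking `charset_from_header(...)` against the real response header before calling `add_charset_meta` is a reasonable precaution.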

About the ads, note that relative URLs in a page opened from disk are resolved against your hard drive rather than the original server, so the browser looks for those files locally and does not find them.
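
To illustrate the point about relative URLs, the sketch below resolves a few links against the page's original address with `urljoin`. It is written for Python 3, where the function lives in `urllib.parse` (in Python 2 it is in the `urlparse` module). The helper name `absolutize` and the example links are made up; only the page URL comes from the question:

```python
from urllib.parse import urljoin

# The address the page was fetched from, as in the question.
PAGE_URL = 'https://www.google.co.in/search?q=lcd+tv'


def absolutize(link, base=PAGE_URL):
    """Resolve a possibly relative link against the page's original URL."""
    return urljoin(base, link)


# Hypothetical links of the kinds that break when the page is opened from disk.
examples = [absolutize(link) for link in (
    '/images/logo.png',           # root-relative
    '//ssl.gstatic.com/a.png',    # scheme-relative
    'thumb.jpg',                  # relative to the page's directory
    'https://example.com/x.gif',  # already absolute, left unchanged
)]
```

Rewriting links this way makes the saved page load its resources from the original servers again, which is an alternative to saving every resource locally.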

If you need the ad images displayed, they should be saved separately. You can parse <img> tags using the HTMLParser class from the standard HTMLParser module (it is very simple to use) and save the images into separate files. Of course, the link in every <img> tag should then be replaced by the local file path.
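
A rough sketch of that approach, written for Python 3, where the class is `html.parser.HTMLParser` (in Python 2 it is `HTMLParser.HTMLParser`). The names `ImgCollector`, `local_name`, and `rewrite_img_links`, the `img_<n>` naming scheme, and the sample markup are all assumptions for illustration. Downloading each image is left out; it would use the same request code as the question:

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def local_name(index, src):
    """Hypothetical naming scheme: img_<index> plus the URL's file extension."""
    ext = os.path.splitext(urlparse(src).path)[1]
    return 'img_%d%s' % (index, ext)


def rewrite_img_links(html):
    """Return (html with local image paths, {original src: local name})."""
    collector = ImgCollector()
    collector.feed(html)
    mapping = {}
    for src in collector.sources:
        if src not in mapping:
            mapping[src] = local_name(len(mapping), src)
    # Simplification: only double-quoted src values without HTML entities
    # are rewritten, because the parser reports unescaped attribute values.
    for src, name in mapping.items():
        html = html.replace('src="%s"' % src, 'src="%s"' % name)
    return html, mapping


# Hypothetical sample markup with a repeated image and a self-closing tag.
sample = ('<p><img src="/images/a.png">'
          '<img alt="x" src="https://example.com/b.jpg?w=10"/>'
          '<img src="/images/a.png"></p>')
rewritten, mapping = rewrite_img_links(sample)
```

Each key in `mapping` is an image to download and each value is the file name to save it under, next to the rewritten HTML.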

