
Writing crawled data to file using python

I fetched a Google search results page using urllib2 and wrote it to a file. But when I open the saved HTML file in a browser, I see some garbled UTF-8 characters.

Here is my Python code for fetching and saving the page.

import os
import urllib2
import webbrowser
url = 'https://www.google.co.in/search?q=lcd+tv'
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 '
                     '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}
req = urllib2.Request(url, headers=hdr)
response = urllib2.urlopen(req)
f = response.read()
file = open('file.html','w+')
file.write(f)
file.close()

Here is the screenshot of the parsed page.

[Screenshot: the saved page as rendered in the browser]

We can see stray characters in some of the titles. Even the ad images are not loading :( .

How can I remove those Unicode characters?

Thanks in advance.

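The web server sent UTF-8 encoded bytes, and you wrote them to a file opened in text mode. In Python 2 the byte string returned by response.read() is written as-is (ASCII is only the default codec for implicit unicode conversions), but text mode can still alter the bytes on Windows through newline translation. Open the file with mode "wb" (binary) so the data is saved exactly as received; that may resolve your issue.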
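A minimal sketch of the binary-mode write, assuming the body has already been fetched as a byte string. The `save_page` helper, the temporary path, and the sample bytes are illustrative stand-ins, not part of the original code:

```python
import io
import os
import tempfile


def save_page(path, body):
    """Write the response bytes to disk unchanged (no re-encoding,
    no newline translation)."""
    with io.open(path, 'wb') as out:
        out.write(body)


# Stand-in for response.read(): UTF-8 bytes with non-ASCII characters.
sample = u'<title>T\u00e9l\u00e9viseur LCD</title>'.encode('utf-8')

# Hypothetical output location; the question writes to 'file.html'.
path = os.path.join(tempfile.mkdtemp(), 'file.html')
save_page(path, sample)

# Reading the file back in binary mode yields the same bytes.
with io.open(path, 'rb') as saved:
    round_trip = saved.read()
```

In the question's script, this would correspond to passing the result of response.read() as `body`.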

In addition, Google does not declare the encoding in the page itself, only in the Content-Type header, so the browser may not recognize the file as UTF-8 when loading it from disk. You can try adding a meta tag to the document:

 <meta http-equiv="content-type" content="text/html; charset=utf-8">
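One way to add that tag before saving is sketched below. It assumes the page is held as a byte string; `add_charset_meta` is a hypothetical helper, and the regular expression is a simple approximation rather than a full HTML parser:

```python
import re

# The tag suggested above, as bytes so it can be spliced into the raw page.
META = b'<meta http-equiv="content-type" content="text/html; charset=utf-8">'


def add_charset_meta(html_bytes):
    """Insert the charset meta tag right after the opening <head> tag.

    If no <head> tag is found, the tag is prepended to the document.
    """
    head = re.search(br'<head(\s[^>]*)?>', html_bytes, re.IGNORECASE)
    if head is None:
        return META + html_bytes
    return html_bytes[:head.end()] + META + html_bytes[head.end():]


patched = add_charset_meta(b'<html><head><title>x</title></head></html>')
```

The patched bytes can then be written to the file in binary mode.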

About the ads, note that relative URLs in the saved page are resolved against the file's location on your hard drive rather than against the actual servers, so those resources are not found.
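To illustrate, a relative link can be turned back into an absolute one with urljoin from the standard library. This is only a sketch: the `absolutize` helper and the image paths in the usage lines are made up for the example, and the import falls back to the Python 2 module name:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# The URL the page was fetched from, as in the question.
page_url = 'https://www.google.co.in/search?q=lcd+tv'


def absolutize(link, base=page_url):
    """Resolve a possibly relative link against the page's original URL."""
    return urljoin(base, link)


# Hypothetical links of the kinds a saved page may contain.
rooted = absolutize('/images/logo.png')
scheme_relative = absolutize('//ssl.gstatic.com/a.png')
already_absolute = absolutize('https://example.com/b.png')
```

Rewriting the links this way (or adding a base tag that points at the original URL) lets the browser fetch the resources from the server again.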

If you need the ad images displayed, save them separately. You can find the <img> tags with the HTMLParser class from the standard HTMLParser module (it is very simple to use) and download each image to its own file. Of course, the link in every <img> tag then has to be replaced with the local file path.
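A rough sketch of the first step, collecting the src of every <img> tag. The `ImgCollector` class and `find_image_sources` function are illustrative names; downloading the images and rewriting the links are left out, and the import covers both the Python 2 module and its Python 3 location:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that has one."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from the parser.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)


def find_image_sources(html_text):
    parser = ImgCollector()
    parser.feed(html_text)
    parser.close()
    return parser.sources


found = find_image_sources(
    '<p><img src="/a.png"><IMG SRC="b.jpg" alt="x"/><img alt="no src"></p>')
```

Each collected link would then be fetched, saved under a local name, and substituted into the saved HTML.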


 