
Error writing CSV with Python, probably a unicode/encode/decode issue

I've been trying to find answers elsewhere, but either I do not understand the explanation or the resolution does not work for my case.

So for this case:
1. the output characters are Chinese
2. the reading part works perfectly fine; only the writing fails
3. I'm using Python 2.7.13

Please help!

BTW, I'm pretty new to Python, so if you spot anything that could be improved with better practices, please point it out! I would really appreciate it!

Thank you!

Here's the code:

# -*- coding: utf-8 -*-
import csv
import urllib2
from bs4 import BeautifulSoup
import socket
import httplib
# import sys  <= this did not work
# reload(sys)
# sys.setdefaultencoding('utf-8')

with open('/users/Rachael/Desktop/BDnodes.csv', 'r') as readcsv, \
        open("/users/Rachael/Desktop/CheckTitle.csv", 'wb') as writecsv:
    writer = csv.writer(writecsv)
    for row in readcsv.readlines():
        opener = urllib2.build_opener()
        opener.addheaders = [('User-Agent',
                          'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
        urllib2.install_opener(opener)
        openpage = urllib2.urlopen(row).read()
        soup = BeautifulSoup(openpage, "lxml")
        # print "page results:"
        for child in soup.findAll("h3", {"class": "t"}):
            try:
                geturls = child.a.get('href')
                # print urllib2.urlopen(geturls).geturl()
                url_result = urllib2.urlopen(geturls).geturl()
                # print url_result
                try:
                    openitem = urllib2.urlopen(url_result).read()
                    gettitle = BeautifulSoup(openitem, 'lxml')
                    url_title = gettitle.title.text
                except urllib2.HTTPError:
                    url_title = 'passed http error'
                    pass
                except urllib2.URLError:
                    url_title = 'passed url error'
                    pass
                except socket.timeout:
                    url_title = 'passed timeout'
                    pass
                except httplib.BadStatusLine:
                    url_title = 'passed badstatus'
                    pass
                except:
                    url_title = 'unknown'
                    pass
            except urllib2.HTTPError as e:
                pass
            except urllib2.URLError:
                pass
            except socket.timeout:
                pass
            except httplib.BadStatusLine:
                pass
            writer.writerow([url_result, url_title])
            # writer.writerow([url_result, url_title.encode('utf-8')]) did not work either, even tried with 'utf-16'
writecsv.close()

The error was:

C:\Python27\python.exe C:/Users/Rachael/PycharmProjects/untitled1/OpenNGet.py
Traceback (most recent call last):
  File "C:/Users/Rachael/PycharmProjects/untitled1/OpenNGet.py", line 55, in <module>
    writer.writerow([url_result, url_title])
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Process finished with exit code 1

You can pass an encoding parameter when opening the file, using codecs.open.

import codecs

# csv is imported as in the question
with codecs.open("/users/Rachael/Desktop/CheckTitle.csv", 'wb', encoding='utf-8') as writecsv:
    writer = csv.writer(writecsv)
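For comparison, and assuming a move to Python 3 is an option (the question uses 2.7.13, so this is not a drop-in fix), the built-in open accepts an encoding argument directly and the csv module works with text strings. The sketch below uses a temporary output path and a placeholder row; both are illustrative and not taken from the question.

```python
import csv
import os
import tempfile

# Hypothetical output path in a temp directory; not the path from the question.
out_path = os.path.join(tempfile.mkdtemp(), "CheckTitle.csv")

# Placeholder row standing in for [url_result, url_title].
rows = [["http://example.com/", "中文标题"]]

# Python 3 only: text mode with an explicit encoding. newline="" is what the
# csv documentation recommends so the writer controls line endings itself.
with open(out_path, "w", encoding="utf-8", newline="") as writecsv:
    writer = csv.writer(writecsv)
    writer.writerows(rows)

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, "r", encoding="utf-8", newline="") as readcsv:
    read_back = list(csv.reader(readcsv))
```

On Python 2.7 the csv module expects byte strings instead, so there the cells would need to be encoded before they reach writerow.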

Could it be that your original solution is correct, but that the problem is in the 'url_result' variable instead of in the title?

Try something like:

writer.writerow([url_result.encode('utf-8'), url_title.encode('utf-8')])
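One caveat with calling .encode on every cell: on Python 2, calling .encode('utf-8') on a byte string that already holds non-ASCII bytes first triggers an implicit ASCII decode, which raises UnicodeDecodeError. A minimal sketch of a more defensive variant is below; to_utf8 and encode_row are hypothetical helper names, and the sample values are placeholders rather than output from the question's script.

```python
# -*- coding: utf-8 -*-

try:
    text_type = unicode  # Python 2: the text type is `unicode`
except NameError:
    text_type = str      # Python 3 fallback so the helper still imports


def to_utf8(value):
    """Return UTF-8 bytes for text; pass byte strings through unchanged."""
    if isinstance(value, text_type):
        return value.encode('utf-8')
    return value


def encode_row(row):
    """Encode every cell of a row; hypothetical helper, not part of csv."""
    return [to_utf8(cell) for cell in row]


# Placeholder values standing in for url_result and url_title.
encoded = encode_row(['http://example.com/', u'\u4e2d\u6587'])
```

In the question's loop, on Python 2.7, the call would then read writer.writerow(encode_row([url_result, url_title])).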
