Decoding from CSV - French and Spanish special characters

I'm writing my CSV table from a scraping process like this:

import csv

with open("Raw_table.csv", 'w', encoding="utf-8") as outfile:
   csv_writer = csv.writer(outfile, delimiter=';', quotechar='|', quoting=csv.QUOTE_MINIMAL)

Usually, when I want to use these files, I read them with a CSV parser like this:

def parse_csv(content, delimiter = ';'):  
  csv_data = []
  for line in content.split('\n'):
    csv_data.append( [x.strip() for x in line.split( delimiter )] ) # strips spaces also
  return csv_data


list_raw=parse_csv(open('Raw_RC.csv','r',encoding="utf-8").read())

It works when I'm scraping US or UK websites. Here I have to deal with French, Spanish and German content, and I get the following error when trying to read from the CSV with parse_csv:

    csv_writer.writerow([k] + v)
'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)
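
A note on the message itself: 0xc3 is the first byte of the two-byte UTF-8 encoding of many accented Latin letters (é, è, ñ, ü and so on), so the error suggests that UTF-8 bytes were decoded with the ASCII codec at some point. Here is a minimal sketch that reproduces the same kind of message, assuming Python 3. The sample string and the helper `try_decode` are made up for illustration and are not part of the original code.

```python
# Hypothetical sample text; any accented letter behaves the same way.
sample = "café"

# UTF-8 stores "é" as two bytes: 0xc3 0xa9.
utf8_bytes = sample.encode("utf-8")

def try_decode(data, codec):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError as exc:
        return str(exc)

# ASCII only covers bytes 0-127, so 0xc3 is rejected.
ascii_result = try_decode(utf8_bytes, "ascii")
# Decoding with the codec that produced the bytes gives the text back.
utf8_result = try_decode(utf8_bytes, "utf-8")

print(ascii_result)
print(utf8_result)
```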

How can I fix this?

Subsidiary questions:

  1. Should I encode the CSV differently, or scrape the site another way (e.g. configure BeautifulSoup differently), when the content is German or French?
  2. Could this encoding problem be related to all of the \xa0 I get from scraping? I don't think so, because I'm able to parse the UK and USA CSVs, which are also full of them.

Every byte of your time you take to solve this is appreciated! :)

When working with French/German/Spanish characters (a website written in one of those languages), don't use encoding='utf-8'; use encoding='ISO-8859-1' instead.

So, for writing:

with open("Raw_table.csv", 'w', encoding="ISO-8859-1") as outfile:
   csv_writer = csv.writer(outfile, delimiter=';', quotechar='|', quoting=csv.QUOTE_MINIMAL)

And for reading:

list_raw=parse_csv(open('Raw_RC.csv','r',encoding="ISO-8859-1").read())
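
The two snippets above can be combined into a round trip to check that the encoding used for reading matches the one used for writing. This is only a sketch: the rows, the temporary path, and the helpers `write_rows` and `read_rows` are hypothetical, and it uses `csv.reader` with the same delimiter and quote character instead of the hand-written `parse_csv`.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for scraped data.
rows = [
    ["ville", "pays"],
    ["Orléans", "France"],
    ["Córdoba", "España"],
    ["Köln", "Deutschland"],
]

def write_rows(path, data, encoding):
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding=encoding, newline="") as outfile:
        writer = csv.writer(outfile, delimiter=";", quotechar="|",
                            quoting=csv.QUOTE_MINIMAL)
        writer.writerows(data)

def read_rows(path, encoding):
    with open(path, "r", encoding=encoding, newline="") as infile:
        reader = csv.reader(infile, delimiter=";", quotechar="|")
        return [row for row in reader]

path = os.path.join(tempfile.mkdtemp(), "Raw_table.csv")

# Same encoding on both sides: the accented characters survive.
write_rows(path, rows, "ISO-8859-1")
round_trip = read_rows(path, "ISO-8859-1")

# A different codec on the reading side fails on the first accented byte.
try:
    read_rows(path, "utf-8")
    mismatch_error = None
except UnicodeDecodeError as exc:
    mismatch_error = exc

print(round_trip == rows)
print(mismatch_error)
```

One caveat with this choice: ISO-8859-1 only covers Western European characters, so text containing something outside it (for example "€" or "œ") would raise a UnicodeEncodeError on write.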

The \xa0 problem is not related. Indeed, in my case it occurs only with UTF-8, so my specific French/German typography wasn't the cause. To go further on this matter (which wasn't the core of the question), please see the link suggested by tripleee.
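
For context, \xa0 is the no-break space (U+00A0), which scraped HTML often contains in place of an ordinary space. If those characters get in the way later, one possible cleanup is sketched below. The helper `normalize_spaces` and the sample string are hypothetical and are not taken from the linked discussion.

```python
NBSP = "\xa0"  # U+00A0 NO-BREAK SPACE

def normalize_spaces(text):
    """Replace no-break spaces with ordinary spaces and trim both ends."""
    return text.replace(NBSP, " ").strip()

# Hypothetical scraped value, e.g. a price formatted French-style.
scraped = "1\xa0234,50\xa0€\xa0"
cleaned = normalize_spaces(scraped)

# The character is a single byte in ISO-8859-1 and two bytes in UTF-8.
nbsp_latin1 = NBSP.encode("iso-8859-1")
nbsp_utf8 = NBSP.encode("utf-8")

print(repr(cleaned))
print(nbsp_latin1, nbsp_utf8)
```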
