
Importing .csv files with “special” characters

I'm trying to read a .csv file into R. The .csv file was created in Excel, and it contains "long" dashes, which are the result of Excel "auto-correcting" the sequence space-dash-space. Sample entries that contain these "long" dashes:

US – California – LA
US – Washington – Seattle

I've experimented with different encodings, including the following three options:

x <- read.csv(filename, encoding="windows-1252") # Motivated by http://www.perlmonks.org/?node_id=551123
x <- read.csv(filename, encoding="latin1")
x <- read.csv(filename, encoding="UTF-8")

But the long dashes show up either as an unreadable character (first and second options) or as <U+0096> (third option).

I realize that I can store the file in different formats or use different software (Excel to CSV with UTF8 encoding), but that's not the point.

Has anyone figured out what encoding option in R works in such cases?
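
One thing that may be worth checking, offered as a sketch rather than a confirmed fix: in `read.csv`, the `encoding` argument only marks the strings that were read, while `fileEncoding` re-encodes the file as it is read. The byte 0x96 is the en dash in Windows-1252, which would be consistent with the `<U+0096>` seen above. The example below builds its own temporary file (the file contents and the `region` column name are made up for illustration) and assumes the R session runs in a UTF-8 locale.

```r
# Byte 0x96 is the en dash in Windows-1252; convert it explicitly to UTF-8.
dash <- iconv(rawToChar(as.raw(0x96)), from = "windows-1252", to = "UTF-8")

# Write a small Windows-1252 file by hand (hypothetical contents).
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(c(charToRaw("region\nUS "), as.raw(0x96),
           charToRaw(" California "), as.raw(0x96),
           charToRaw(" LA\n")), con)
close(con)

# fileEncoding declares the encoding of the file on disk, so the text is
# converted while reading; encoding = "..." would only label the strings.
x <- read.csv(tmp, fileEncoding = "windows-1252", stringsAsFactors = FALSE)

# In a UTF-8 session the dashes should now be real en dashes (U+2013).
has_en_dash <- grepl("\u2013", x$region[1], fixed = TRUE)
print(has_en_dash)
```

If the real file was saved by Excel on Windows with a Western European code page, passing `fileEncoding = "windows-1252"` to the original `read.csv(filename, ...)` call is the analogous change; whether that matches the actual file is an assumption.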

If you are using RStudio, use Import Dataset.

  • Heading: No
  • Separator: Whitespace
  • Decimal: Period
  • Quote: Double quote
  • Uncheck "Strings as factors"

When your document is loaded, you can simply remove the columns that now show as '?'. You can see that these are column 2 and column 4. If you have a data frame, mydf, you would delete the second column like this:

mydf_new <- mydf[-2]

You could do the same thing for the other column, which is now column 3.
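
The dialog settings and the column removal above can be approximated in a script. This is a rough sketch, not the exact code RStudio generates: the sample lines are hypothetical, and the dashes are written as plain "-" so the example does not depend on the file's encoding (in the real file they would show up as '?').

```r
# Hypothetical two-line sample standing in for the real file.
lines <- c("US - California - LA", "US - Washington - Seattle")
tmp <- tempfile(fileext = ".csv")
writeLines(lines, tmp)

# Approximate equivalent of the dialog settings: no header, whitespace
# separator, "." as decimal mark, double quote as quote character,
# and strings kept as character rather than factors.
mydf <- read.table(tmp, header = FALSE, sep = "", dec = ".",
                   quote = "\"", stringsAsFactors = FALSE)

# Columns 2 and 4 hold the dashes; drop both at once so the column
# numbers do not shift between two separate deletions.
mydf_new <- mydf[-c(2, 4)]
print(mydf_new)
```

Dropping both columns in a single index avoids having to remember that the old column 4 becomes column 3 after the first deletion.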
