RCurl: handling Chinese characters

This is my code. Why does it not decode Chinese characters correctly?

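    library(XML)
    library(RCurl)

    myURL <- "http://data.eastmoney.com/zjlx/600066.html"
    # Download the page, telling RCurl to treat the response as GB2312
    html <- getURL(myURL, .encoding = "gb2312")
    print(Encoding(html))
    # Parse the HTML and read every table found on the page
    basicInfo <- htmlParse(html)
    # print(Encoding(basicInfo))
    tables <- readHTMLTable(basicInfo)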

The problem is that the website adds the data to the tables dynamically with JavaScript. If you load the page in a browser with JavaScript disabled, you'll notice that you don't see any data there either.
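One way to confirm this without a browser is to count the data rows that are actually present in the downloaded markup. The sketch below uses a made-up HTML string (`static_html`) as a stand-in for what a non-JavaScript client receives; the table id `dt_1` is borrowed from the code further down, and the real page's markup may differ.

```r
library(XML)

# Hypothetical stand-in for the static HTML: the table skeleton exists,
# but its body is empty until a script fills it in.
static_html <- '<html><body>
<table id="dt_1">
  <thead><tr><th>Date</th><th>Close</th></tr></thead>
  <tbody></tbody>
</table>
</body></html>'

doc <- htmlParse(static_html, asText = TRUE)

# Select only rows that contain data cells (<td>), ignoring the header row.
rows <- getNodeSet(doc, "//table[@id='dt_1']//tr[td]")
n_rows <- length(rows)
print(n_rows)
```

If the same check on the string returned by `getURL()` also finds no data rows, the encoding is not what is keeping the table empty.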

I had some limited success on the JavaScript side, but not with the encoding: something goes wrong with the character encoding when the page source is handed over to R, and I don't know how to correct it:

    # On Windows, install the packages and required files
    library(devtools)
    install_github('LluisRamon/seleniumJars')
    install_github('LluisRamon/relenium')

    # Load the package
    library(relenium)

    # Start a new instance of Firefox (it must already be installed on your computer)
    firefox <- firefoxClass$new()

    # Go to the URL using the get function
    firefox$get("http://data.eastmoney.com/zjlx/600066.html")

    # getPageSource returns the rendered page's HTML as a character string
    html <- firefox$getPageSource()

    # Parse the HTML using the XML package
    doc <- htmlParse(html)

    # Extract the table
    tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
    mytable <- tables$dt_1

               V1    V2     V3          V4      V5           V6      V7          V8      V9         V10     V11          V12     V13
    1  2014-07-24 18.30  2.81%  3893万  10.63%   -323万  -0.88%  4217万  11.52% -1600万  -4.37%  -2293万  -6.26%
    2  2014-07-23 17.80 -0.50%  1287万   8.63%  27.48万   0.18%  1259万   8.44%  -333万  -2.24%   -953万  -6.39%
    3  2014-07-22 17.89  4.25%  7765万  18.46%   5729万  13.62%  2036万   4.84% -4574万 -10.87%  -3190万  -7.58%

I don't know if the stringi package will help in this case. It may work under Linux (I often find that text encoding issues are far fewer on Linux than on Windows, but that is only anecdotal).
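For what it is worth, base R's `iconv()` is one thing to try when text arrives as GB2312 bytes that R does not recognise. The sketch below is not verified against this site: it builds a two-byte sample by hand (assumed to be the GB2312 encoding of 万, U+4E07) and converts it to UTF-8. Whether the name "GB2312" is accepted depends on the platform's iconv implementation.

```r
# Assumed sample input: the GB2312 byte pair for the character U+4E07.
gb_bytes <- as.raw(c(0xCD, 0xF2))
gb_text <- rawToChar(gb_bytes)

# Convert from GB2312 to UTF-8; iconv() returns NA if the conversion fails.
utf8_text <- iconv(gb_text, from = "GB2312", to = "UTF-8")

# Inspect the declared encoding and the resulting Unicode code point.
print(Encoding(utf8_text))
code_point <- utf8ToInt(utf8_text)
print(code_point)
```

The same call could be applied to the string returned by `getPageSource()` or `getURL()`, with `from` set to whatever encoding the text really is in, before passing it to `htmlParse()`.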
