HTTP GET Chinese character using luasocket

Question

I use luasocket to GET a web page which contains Chinese characters "开奖结果" (the page itself is encoded in charset="gb2312"), as below:

require "socket"
host = '61.129.89.226'
fileformat = '/fcopen/cp_kjgg_dfw.jsp?lottery_type=ssq&lottery_issue=%s'
function getlottery(num)
  c = assert(socket.connect(host, 80))
  c:send('GET ' .. string.format(fileformat, num)  .. " HTTP/1.0\r\n\r\n")
  content = c:receive('*l')
  while content do
    if content and content:find('开奖结果') then -- failed
      print(content)
    end
    content = c:receive('*l')
  end
  c:close()
end

--http://61.129.89.226/fcopen/cp_kjgg_dfw.jsp?lottery_type=ssq&lottery_issue=2012138
getlottery('2012138')

Unfortunately, it fails to match the expected characters:

content:find('开奖结果') -- failed

I know Lua is capable of finding unicode characters:

Lua 5.1.4  Copyright (C) 1994-2008 Lua.org, PUC-Rio
> if string.find("This is 开奖结果", "开奖结果") then print("found!") end
found!

So I guess the problem is caused by how luasocket retrieves data from the web. Could anyone shed some light on this?

Thanks.

Answer 1

If the page is encoded in GB2312 and your script (the file itself) is saved as UTF-8, the match cannot work. string.find compares raw bytes, so it searches for the UTF-8 byte sequence of your literal and slides right past the characters you're looking for, because the server sends them as different bytes:

          开    奖      结     果
GB      bfaa   bdb1   bde1   b9fb
UTF-16  5f00   5956   7ed3   679c
UTF-8   e5bc80 e5a596 e7bb93 e69e9c
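
One way to act on this, as a minimal sketch rather than a definitive fix: build the search string from the GB2312 bytes in the table above, so the match no longer depends on how the script file is saved. The helper name contains_gb and the sample lines are hypothetical, made up for illustration; they are not real output from the server. string.char with hex numbers is used because Lua 5.1 has no "\x" string escapes.

```lua
-- GB2312 bytes for "开奖结果", copied from the GB row of the table above.
-- Building the needle with string.char keeps it independent of the
-- encoding the script file itself is saved in.
local needle_gb = string.char(0xBF, 0xAA, 0xBD, 0xB1, 0xBD, 0xE1, 0xB9, 0xFB)

-- The same four characters as UTF-8 bytes (UTF-8 row of the table),
-- kept only to show that the two encodings do not match each other.
local needle_utf8 = string.char(0xE5, 0xBC, 0x80, 0xE5, 0xA5, 0x96,
                                0xE7, 0xBB, 0x93, 0xE6, 0x9E, 0x9C)

-- Hypothetical helper: plain byte search (4th argument true disables
-- Lua pattern matching) for the GB2312 needle in one received line.
local function contains_gb(line)
  return line:find(needle_gb, 1, true) ~= nil
end

-- Assumed sample lines standing in for what c:receive('*l') might return.
local sample = "<td>" .. needle_gb .. "</td>"
local utf8_sample = "<td>" .. needle_utf8 .. "</td>"

print(contains_gb(sample))       -- true: GB2312 bytes match GB2312 bytes
print(contains_gb(utf8_sample))  -- false: UTF-8 bytes never match them
```

In the original loop, the check would then read something like `if contains_gb(content) then print(content) end`. Other options are to save the script itself as GB2312, or to convert each received line to UTF-8 with an iconv binding before searching.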

solution1
4 ACCEPTED 2012-11-24 04:46:49

HTTP GET Chinese character using luasocket

Question

1 answers

solution1 4 ACCPTED 2012-11-24 04:46:49

solution1
4 ACCPTED 2012-11-24 04:46:49