简体   繁体   中英

HTTP GET Chinese character using luasocket

I use luasocket to GET a web page which contains Chinese characters "开奖结果" (the page itself is encoded in charset="gb2312"), as below:

require "socket"
host = '61.129.89.226'
fileformat = '/fcopen/cp_kjgg_dfw.jsp?lottery_type=ssq&lottery_issue=%s'
function getlottery(num)
  c = assert(socket.connect(host, 80))
  c:send('GET ' .. string.format(fileformat, num)  .. " HTTP/1.0\r\n\r\n")
  content = c:receive('*l')
  while content do
    if content and content:find('开奖结果') then -- failed
      print(content)
    end
    content = c:receive('*l')
  end
  c:close()
end

--http://61.129.89.226/fcopen/cp_kjgg_dfw.jsp?lottery_type=ssq&lottery_issue=2012138
getlottery('2012138')

Unfortunately, it fails to match the expected characters:

content:find('开奖结果') -- failed

I know Lua is capable of finding unicode characters:

Lua 5.1.4  Copyright (C) 1994-2008 Lua.org, PUC-Rio
> if string.find("This is 开奖结果", "开奖结果") then print("found!") end
found!

Then I guess it might be caused by how luasocket retrieves data from the web. Could anyone shed some lights on this?

Thanks.

If the page is encoded in GB2312, and your script (the file itself) is encoded in utf-8, there's no way the match will work. Because .find() will look for utf-8 codepoints, and it will just slide over the characters you're looking for, because they're not encoded the same way...

          开    奖      结     果
GB      bfaa   bdb1   bde1   b9fb
UTF-16  5f00   5956   7ed3   679c
UTF-8   e5bc80 e5a596 e7bb93 e69e9c

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM