
How much data can you encode in a single character?

If I were creating a videogame level editor in AS3 or .NET with a string-based level format that can be copied, pasted and emailed, how much data could I encode into each character? What matters is getting the maximum amount of data into the minimum number of characters displayed on the screen, regardless of how many bytes the computer actually uses to store those characters.

For example, if I wanted to store the horizontal position of an object in one string character, how many possible values could it have? Are there any characters that can't be sent over the internet, or that can't be copied and pasted? What difference would things like UTF-8 make? Answers for either AS3 or C#/.NET, or both, please.

2nd update: OK, so Flash uses UTF-16 for its String class. There are lots of control characters that I cannot use. How could I manage which characters are OK to use? Just a big lookup table? And can operating systems and browsers handle UTF-16 well enough that you can safely copy and paste a UTF-16 string into an email, Notepad, etc.?

Updated: "update 1", "update 2"

You can store 8 bits in a single character with a single-byte ANSI code page. Plain ASCII, and the one-byte range of UTF-8, gives you only 7 bits.

But if, for example, you want to use ASCII encoding, you shouldn't use the codes 0x00 to 0x1F (the values that fit in the low 5 bits, 0001 1111 = 0x1F) or the character 0x7F. They represent control characters such as NUL, Escape, start of text and end of text, which cannot be copied and pasted. Counting the 224 codes from 0x20 upward (1110 0000 = 0xE0 = 224) and leaving out 0x7F, you can store 223 different values in one 8-bit character.

If you use UTF-16 you have 2 bytes = 16 bits, minus the control characters, to store your information.

A in UTF-8 encoding: 0x41 (ASCII characters take a single byte)
A in UTF-16 encoding: 0x0041 (for other characters the first 2 hex digits can be higher than 0)
A in ASCII encoding: 0x41
A in ANSI encoding: 0x41

See the images at the end of this post.

update 1:

If the values do not need to be editable by hand without a tool (a C# tool, a JavaScript-based web page, ...), you can alternatively Base64-encode, or zip and then Base64-encode, your data. This avoids the problem you describe in your 2nd update: "There are lots of control characters that I cannot use. How could I manage which characters are ok to use?"
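As a rough illustration of the zip + Base64 idea, here is a minimal Python sketch (the question asks about AS3 and C#, so this shows only the principle). The function names `pack_level` and `unpack_level` and the sample tile data are hypothetical, and zlib stands in for "zip".

```python
import base64
import zlib


def pack_level(raw: bytes) -> str:
    """Compress the raw level bytes, then Base64-encode them to plain ASCII."""
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def unpack_level(text: str) -> bytes:
    """Reverse pack_level: Base64-decode, then decompress."""
    return zlib.decompress(base64.b64decode(text))


# Hypothetical level data: a repetitive tile map, which compresses well.
level = bytes([0, 0, 0, 1, 2, 2, 2, 2] * 50)
text = pack_level(level)
restored = unpack_level(text)
```

The resulting string contains only letters, digits, `+`, `/` and `=`, so no lookup table of forbidden characters is needed. Compression helps only when the data is repetitive; for very small levels it can make the string longer.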

If this is not an option, you cannot avoid using some kind of lookup table. The shortest form of a lookup table is:

var illegalCharCodes = new byte[]{0x00, 0x01, 0x02, ..., 0x1f, 0x7f};

or you can code it like this:

//The example is based on ANSI encoding, but in principle it is the same with UTF-16
var value = 0;
if(charcode > 0x7f)
  value = charcode - 0x1f - 1; //-1 because 0x7f is the first illegal char code higher than 0x1f
else
  value = charcode - 0x1f;
value -= 1; //because you need a 0 value
//charcode: 0x20 (' ') -> value: 0
//charcode: 0x21 ('!') -> value: 1
//charcode: 0x22 ('"') -> value: 2
//charcode: 0x7e ('~') -> value: 94
//charcode: 0x80 ('€') -> value: 95
//charcode: 0x81 (undefined in Windows-1252) -> value: 96
//..
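The snippet above only goes from character code to value. A minimal Python sketch of both directions might look like the following, assuming an 8-bit code page where only 0x00 to 0x1F and 0x7F are off limits. The function names are hypothetical.

```python
CONTROL_MAX = 0x1F  # 0x00..0x1F are control characters
DEL = 0x7F          # the one forbidden code above 0x1F


def charcode_to_value(charcode: int) -> int:
    """Map an allowed 8-bit char code to a value in 0..222."""
    if charcode <= CONTROL_MAX or charcode == DEL or charcode > 0xFF:
        raise ValueError("char code is not in the allowed set")
    value = charcode - CONTROL_MAX - 1  # 0x20 -> 0
    if charcode > DEL:
        value -= 1                      # close the gap left by 0x7F
    return value


def value_to_charcode(value: int) -> int:
    """Inverse mapping: a value in 0..222 back to an allowed char code."""
    if not 0 <= value <= 222:
        raise ValueError("value out of range")
    charcode = value + CONTROL_MAX + 1
    if charcode >= DEL:
        charcode += 1                   # skip over 0x7F
    return charcode
```

Whether every code from 0x80 to 0xFF survives copy and paste depends on the code page in use, so a real format may need to exclude more codes than this sketch does.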

update 2:

For Unicode (UTF-16) you can use this table: http://www.tamasoft.co.jp/en/general-info/unicode.html You should not use any character that the table shows as a placeholder symbol or as empty. So you cannot store 50,000 possible values in one UTF-16 character if the string has to survive copy and paste. You need a special encoder, and you must use 2 UTF-16 characters, like:

//charcode: 0x0020 + 0x0020 ('  ') > value: 0
//charcode: 0x0020 + 0x0021 (' !') > value: 1
//charcode: 0x0021 + 0x0041 ('!A') > value: something higher than 40,000; I don't know exactly because I haven't counted the illegal characters in UTF-16 :D
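One way to build such a two-character encoder without a hand-written table is to filter the Basic Multilingual Plane by Unicode category and treat the survivors as digits of a positional number system. The Python sketch below does this under some assumptions of my own: it excludes control, format, surrogate, private-use and unassigned code points, and also whitespace and combining marks, which is stricter than the comments above (space is excluded, so the zero value is `!!`). All names are hypothetical.

```python
import unicodedata


def build_alphabet() -> list[str]:
    """Collect BMP characters that are assigned and visible on their own."""
    alphabet = []
    for cp in range(0x20, 0x10000):
        category = unicodedata.category(chr(cp))
        # Skip C* (control, format, surrogate, private use, unassigned),
        # Z* (separators and whitespace) and M* (combining marks).
        if category[0] in "CZM":
            continue
        alphabet.append(chr(cp))
    return alphabet


ALPHABET = build_alphabet()
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
BASE = len(ALPHABET)  # number of usable "digits" per character


def encode_pair(value: int) -> str:
    """Encode a value in 0..BASE*BASE-1 as exactly two characters."""
    if not 0 <= value < BASE * BASE:
        raise ValueError("value out of range")
    high, low = divmod(value, BASE)
    return ALPHABET[high] + ALPHABET[low]


def decode_pair(text: str) -> int:
    """Decode a two-character string produced by encode_pair."""
    if len(text) != 2:
        raise ValueError("expected exactly two characters")
    return INDEX[text[0]] * BASE + INDEX[text[1]]
```

The alphabet here depends on the Unicode version shipped with the Python interpreter, so a real level format would freeze the table in a file instead of rebuilding it. The filter also does not guard against fonts that lack a glyph or software that normalizes text on paste.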

[Images: ASCII table, extended ASCII table]
(source: asciitable.com)

In C, a char is a type of integer, and it is most typically one byte wide. One byte is 8 bits, so that's 2 to the power 8, or 256, possible values (as noted in another answer).

In other languages, a 'character' is a completely different thing from an integer (as it should be), and has to be explicitly encoded to turn it into bytes. Java, for example, makes this relatively simple by storing characters internally in a UTF-16 encoding (forgive me some details), so they take up 16 bits, but that's just an implementation detail. Different encodings such as UTF-8 mean that a character, when encoded for transmission, could occupy anything from one to four bytes.

Thus your question is slightly malformed (which is to say it's actually several distinct questions in one).

How many values can a byte have? 256.

What characters can be sent in emails? Mostly those ASCII characters from space (32) to tilde (126).

What bytes can be sent over the internet? Any you like, as long as you encode them for transmission.

What can be cut-and-pasted? If your platform can do Unicode, then all of Unicode; if not, not.

Does UTF-8 make a difference? UTF-8 is a standard way of encoding a string of characters into a string of bytes, and probably doesn't have much to do with your question (Joel Spolsky has a very good account in The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)).

So pick a question!

Edit, following edit to question: Aha! If the question is 'how do I encode data in such a way that it can be mailed?', then the answer is probably 'use base64'. That is, if you have some purely binary format for your levels, then base64 is the 'standard' (very much quotes-standard) way of encoding that binary blob so that it will make it through mail. The things you want to google for are 'serialization' and 'deserialization'. Base64 is probably close to the practical maximum of information per mailable character.

(Another answer is 'use XML', but the question seems to imply some preference for compactness, and that a basically binary format is desirable.)

Confusingly, a char is not the same thing as a character. In C and C++, a char is virtually always an 8-bit type. In Java and C#, a char is a UTF-16 code unit and thus a 16-bit type.

But in Unicode, a character is represented by a "code point" that ranges from 0 to 0x10FFFF, for which a 16-bit type is inadequate. So a character must either be represented by a 21-bit type (in practice, a 32-bit type), or use multiple "code units". Specifically,

  • In UTF-32, all characters require 32 bits.
  • In UTF-16, characters U+0000 to U+FFFF (the "basic multilingual plane"), except for U+D800 to U+DFFF which cannot be represented, require 16 bits, and all other characters require 32 bits.
  • In UTF-8, characters U+0000 to U+007F (the ASCII repertoire) require 8 bits, U+0080 to U+07FF require 16 bits, U+0800 to U+FFFF require 24 bits, and all other characters require 32 bits.
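The sizes in the list above can be checked with a short Python sketch. The helper name `encoded_bits` and the four sample characters are my own choice, one per range; the little-endian codec names are used so that no byte-order mark is counted.

```python
# One sample character from each range mentioned in the list above.
SAMPLES = {
    "A": "U+0041, ASCII range",
    "\u00e9": "U+00E9, inside U+0080..U+07FF",
    "\u20ac": "U+20AC, inside U+0800..U+FFFF",
    "\U0001F600": "U+1F600, outside the basic multilingual plane",
}


def encoded_bits(ch: str) -> dict[str, int]:
    """Return the number of bits one character occupies in each encoding."""
    return {
        enc: len(ch.encode(enc)) * 8
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }


for ch, note in SAMPLES.items():
    print(note, encoded_bits(ch))
```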

If I were creating a videogame level editor with a string-based level format, how much data could I encode into each char? For example if I wanted to store the horizontal position of an object in 1 char, how many possible values could that have?

Since you wrote char rather than "character", the answer is 256 for C and 65,536 for C#.

But char isn't designed to be a binary data type. byte or short would be more appropriate.

Are there any characters that can't be sent over the internet, or that can't be copied and pasted?

There aren't any characters that can't be sent over the Internet, but you have to be careful using "control characters" or non-ASCII characters.

Many Internet protocols (especially SMTP) are designed for text rather than binary data. If you want to send binary data, you can Base64 encode it. That gives you 6 bits of information for each byte of the message.
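The 6-bits-per-character figure can be verified with a little arithmetic. The Python sketch below uses an arbitrary 3-byte payload of my own choosing, and compares the result with the theoretical ceiling if every printable ASCII character from space (0x20) to tilde (0x7E) were usable.

```python
import base64
import math

# 3 bytes = 24 bits, which Base64 spreads over exactly 4 characters.
payload = bytes([0xDE, 0xAD, 0xBE])
encoded = base64.b64encode(payload).decode("ascii")
bits_per_char = len(payload) * 8 / len(encoded)

# Ceiling for an alphabet of all 95 printable ASCII characters.
printable_bound = math.log2(95)

print(encoded, bits_per_char, round(printable_bound, 2))
```

So Base64 carries 6 bits per character against a ceiling of roughly 6.57 bits for printable ASCII, which is why it is close to the practical maximum for text that must pass through mail.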

The number of different states a variable can hold is two to the power of the number of bits it has. How many bits a variable has is likely to vary with the compiler and machine used. But in most cases a char will have eight bits, and two to the power eight is two hundred and fifty-six.

Modern screen resolutions being what they are, you will most likely need more than one char for the horizontal position of anything.
