
`Data.Text` vs `Data.Vector.Unboxed Char`

Original post: 2013-12-19 20:53:01 7 2 haskell

Is there any difference in how `Data.Text` and `Data.Vector.Unboxed Char` work internally? Why would I choose one over the other?

I always thought it was cool that Haskell defines `String` as `[Char]`. Is there a reason that something analogous wasn't done for `Text` and `Vector Char`?

There certainly would be an advantage to making them the same: Text-y and Vector-y tools could be written to be used in both camps. Imagine Ropes of Ints, or Regexes on strings of poker cards.

Of course, I understand that there were probably historical reasons, and I understand that most current libraries use `Data.Text`, not `Vector Char`, so there are many practical reasons to favor one over the other. But I am more interested in learning about the abstract qualities, not the current state that we happen to be in. If the whole thing were rewritten tomorrow, would it be better to unify the two?

Edit, with more info:

To put things into perspective:

According to this page, http://www.haskell.org/haskellwiki/GHC/Memory_Footprint , GHC uses 16 bytes for each Char in your program!
`Data.Text` is not O(1) indexable; indexing is O(n).
Ropes (binary trees wrapped around text) can also hold strings. They have better complexity for index/insert/delete, although depending on the number of nodes and the balance of the tree, index could be close to that of Text.

This is my takeaway from this:

Text and Vector Char are different internally.
Use String if you don't care about performance.
If performance is important, default to using Text.
If fast indexing of chars is necessary, and you don't mind a lot of memory overhead (up to 16x), use Vector Char.
If you want to insert/delete a lot of data, use Ropes.

2 Answers

It's a fairly bad idea to think of `Text` as being a list of characters. `Text` is designed to be thought of as an opaque, user-readable blob of Unicode text. Character boundaries might be defined based on encoding, locale, language, time of month, phase of the moon, coin flips performed by a blinded participant, and migratory patterns of Venezuela's national bird, whatever it may be. The same story happens with sorting, up-casing, reversing, etc.
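As a small illustration of the up-casing point, here is a minimal sketch that compares per-`Char` upper-casing of a `String` with `Data.Text.toUpper`. The names `upperString`, `upperText`, and `sharpS` are made up for this example; only `toUpper` from `Data.Char` and `toUpper` from `Data.Text` come from the libraries.

```haskell
import Data.Char (toUpper)
import qualified Data.Text as T

-- Per-Char upper-casing: one Char in, one Char out,
-- so the result always has the same length as the input.
upperString :: String -> String
upperString = map toUpper

-- Text-level upper-casing: converts the value as a whole,
-- so the result may be longer than the input.
upperText :: T.Text -> T.Text
upperText = T.toUpper

-- U+00DF LATIN SMALL LETTER SHARP S, written as an escape
-- so the example does not depend on the source file encoding.
sharpS :: String
sharpS = "\223"
```

With these definitions, `upperString sharpS` still has one element, while `upperText (T.pack sharpS)` is the two-letter text `SS`. A list of characters cannot express that kind of change with a plain `map`.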

Which is a long way of saying that `Text` is an abstract type representing human language and goes far out of its way to not behave just the same way as its implementation, be it a `ByteString`, a `Vector UTF16CodePoint`, or something totally unique (which is the case).

To clarify this distinction, take note that there's no guarantee that `unpack . pack` witnesses an isomorphism, that the preferred ways of converting between `Text` and `ByteString` are in `Data.Text.Encoding` and that decoding there is partial, and that there's a whole sophisticated plug-in module, `text-icu`, littered with complex ways of handling human language strings.
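Both points can be checked with a short sketch. It relies on two documented behaviours of the `text` package: `pack` replaces invalid scalar values (such as lone surrogates) with U+FFFD, and `decodeUtf8'` returns a `Left` for bytes that are not valid UTF-8. The helper names `roundTrip`, `loneSurrogate`, and `decodeMaybe` are hypothetical and exist only for this example.

```haskell
import Data.Char (chr)
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')

-- String -> Text -> String. Not the identity on every input.
roundTrip :: String -> String
roundTrip = T.unpack . T.pack

-- A String holding a lone surrogate code point (U+D800).
-- This is a legal Char, but not a valid Unicode scalar value.
loneSurrogate :: String
loneSurrogate = [chr 0xD800]

-- Total wrapper around decoding: Nothing for invalid UTF-8.
decodeMaybe :: B.ByteString -> Maybe T.Text
decodeMaybe = either (const Nothing) Just . decodeUtf8'
```

Here `roundTrip "abc"` gives back `"abc"`, but `roundTrip loneSurrogate` yields `"\xFFFD"` rather than the original string. Likewise `decodeMaybe (B.pack [0xFF])` is `Nothing`, because the byte `0xFF` never appears in valid UTF-8.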

You absolutely should use `Text` if you're dealing with a human language string. You should also treat it with care, since human language strings are not easily amenable to computer processing. If your string is better thought of as a machine string, you probably should use `ByteString`.

The pedagogical advantages of `type String = [Char]` are high, but the practical advantages are quite low.

To add to what J. Abrahamson said, it's also worth making the distinction between iterating over runes (roughly character by character, but really could be ideograms too) and iterating over unitary logical Unicode code points. Sometimes you need to know if you're looking at a code point that has been "decorated" by a previous code point.

In the case of the latter, you then have to make the distinction between code points that stand alone (such as letters and ideograms) and those that modify other text (a right-to-left code point affects the text that follows; a combining diacritic attaches to the code point before it).

Well-implemented Unicode libraries will typically abstract these details away and let you process the text in a more or less character-by-character fashion, but you have to drop certain assumptions that come from thinking in terms of ASCII.

A byte is not a character. A logical unit of text isn't necessarily a "character". Not every code point stands alone; some decorate/annotate the following code point or even the rest of the byte stream until invalidated (right-to-left).
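To make the gap between code points and logical units concrete, here is a rough sketch using a combining accent, where the mark is encoded after the base letter it belongs to. The function `naiveClusters` is a hypothetical helper that only glues combining marks onto the preceding code point; it is an approximation for illustration, not the full Unicode segmentation rules that a library such as `text-icu` implements.

```haskell
import Data.Char (isMark)
import qualified Data.Text as T

-- "e" followed by U+0301 COMBINING ACUTE ACCENT:
-- two code points that render as one accented letter.
eAcute :: T.Text
eAcute = T.pack ['e', '\769']

-- Naive clustering: a group keeps growing while the next
-- code point is a combining mark, and a new group starts
-- at the first code point that is not one.
naiveClusters :: T.Text -> [T.Text]
naiveClusters = T.groupBy (\_ c -> isMark c)
```

`T.length eAcute` is 2, while `naiveClusters eAcute` contains a single group. Code-point-level operations ignore that grouping: `T.reverse` applied to `a` followed by `eAcute` puts the accent first, detached from its letter.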

Unicode is hard. There is no one true encoding that will eliminate the difficulty of encapsulating the variety inherent in human language. `Data.Text` does a respectable job of it though.

To summarize:

The methods of processing are:

byte-by-byte - totally invalid for Unicode, only applicable to latin-1/ASCII
code point by code point - works for processing Unicode, but is lower-level than people realize
logical rune-by-rune - what you actually want
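The first two levels can be compared directly with a minimal sketch that counts the same word both ways. The names `byteCount`, `codePointCount`, `precomposed`, and `decomposed` are invented for this example, and UTF-8 is assumed as the byte encoding. The rune level is not shown, since it needs segmentation rules from a library such as `text-icu`.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)

-- "naive" spelled with a precomposed U+00EF (i with diaeresis).
precomposed :: T.Text
precomposed = T.pack "na\239ve"

-- The same word spelled with a plain "i" followed by
-- U+0308 COMBINING DIAERESIS.
decomposed :: T.Text
decomposed = T.pack "nai\776ve"

-- Byte level: length of the UTF-8 encoding.
byteCount :: T.Text -> Int
byteCount = B.length . encodeUtf8

-- Code point level: Data.Text.length counts Chars.
codePointCount :: T.Text -> Int
codePointCount = T.length
```

A reader sees five letters in both spellings, yet `precomposed` has 5 code points and 6 UTF-8 bytes, while `decomposed` has 6 code points and 7 bytes. Neither count is the "what you actually want" number for both inputs.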

The types are:

String (aka [Char]) - has a limited scope. Best used for teaching Haskell or for legacy use-cases.
Text - the preferred way to handle "human" text.
ByteString - for byte streams, raw data, binary etc.