
Optimizing WordWrap Algorithm

I have a word-wrap algorithm that basically generates lines of text that fit the width of the text box. Unfortunately, it gets slow when I add too much text.

I was wondering if I overlooked any major optimizations that could be made. Also, if anyone has a better design that would still allow strings of lines or string pointers of lines, I'd be open to rewriting the algorithm.

Thanks

void AguiTextBox::makeLinesFromWordWrap()
{
    textRows.clear();
    textRows.push_back("");
    std::string curStr;
    std::string curWord;

    int curWordWidth = 0;
    int curLetterWidth = 0;
    int curLineWidth = 0;

    bool isVscroll = isVScrollNeeded();
    int voffset = 0;
    if(isVscroll)
    {
        voffset = pChildVScroll->getWidth();
    }
    int AdjWidthMinusVoffset = getAdjustedWidth() - voffset;
    int len = getTextLength();
    int bytesSkipped = 0;
    int letterLength = 0;
    size_t ind = 0;

    for(int i = 0; i < len; ++i)
    {

        //get the unicode character
        letterLength = _unicodeFunctions.bringToNextUnichar(ind,getText());
        curStr = getText().substr(bytesSkipped,letterLength);


        bytesSkipped += letterLength;

        curLetterWidth = getFont().getTextWidth(curStr);

        //push a new line
        if(curStr[0] == '\n')
        {
            textRows.back() += curWord;
            curWord = "";
            curLetterWidth = 0;
            curWordWidth = 0;
            curLineWidth = 0;
            textRows.push_back("");
            continue;
        }



        //ensure word is not longer than the width
        if(curWordWidth + curLetterWidth >= AdjWidthMinusVoffset &&
            curWord.length() >= 1)
        {
            textRows.back() += curWord;

            textRows.push_back("");
            curWord = "";
            curWordWidth = 0;
            curLineWidth = 0;
        }

        //add letter to word
        curWord += curStr;
        curWordWidth += curLetterWidth;


        //if we need a Vscroll bar start over
        if(!isVscroll && isVScrollNeeded())
        {
            isVscroll = true;
            voffset = pChildVScroll->getWidth();
            AdjWidthMinusVoffset = getAdjustedWidth() - voffset;
            i = -1;
            curWord = "";
            curStr = "";
            textRows.clear();
            textRows.push_back("");
            ind = 0;

            curWordWidth = 0;
            curLetterWidth = 0;
            curLineWidth = 0;

            bytesSkipped = 0;
            continue;
        }

        if(curLineWidth + curWordWidth >= 
            AdjWidthMinusVoffset && textRows.back().length() >= 1)
        {
            textRows.push_back("");
            curLineWidth = 0;
        }

        if(curStr[0] == ' ' || curStr[0] == '-')
        {
            textRows.back() += curWord;
            curLineWidth += curWordWidth;
            curWord = "";
            curWordWidth = 0;
        }
    }

    if(curWord != "")
    {
        textRows.back() += curWord;
    }

    updateWidestLine();
}

There are two main things making this slower than it could be, I think.

The first, and probably less important: as you build up each line, you're appending words to the line. Each such operation may require the line to be reallocated and its old contents copied. For long lines, this is inefficient. However, I'm guessing that in actual use your lines are quite short (say 60-100 characters), in which case the cost is unlikely to be huge. Still, there's probably some efficiency to be won there.

The second, and probably much more important: you're apparently using this for a text-area in some sort of GUI, and I'm guessing that it's being typed into. If you're recomputing for every character typed, that's really going to hurt once the text gets long.

As long as the user is only adding characters at the end -- which is surely the most common case -- you can make effective use of the fact that, with your "greedy" line-breaking algorithm, changes never affect anything on earlier lines: so just recompute from the start of the last line.
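A minimal sketch of that idea, under some loud assumptions: every byte is one unit wide (a real text box would ask its font), breaks happen after spaces, and spaces are allowed to hang past the margin. The names WrapState, rewrapFrom and appendText are hypothetical and not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical state: the text plus the byte offset where each wrapped line begins.
struct WrapState {
    std::string text;
    std::vector<std::size_t> lineStarts{0};
    std::size_t maxWidth;  // assumption: one width unit per byte
    explicit WrapState(std::size_t w) : maxWidth(w) {}
};

// Re-run the greedy wrap from the start of line `fromLine`, discarding later lines.
void rewrapFrom(WrapState& s, std::size_t fromLine) {
    assert(fromLine < s.lineStarts.size());
    s.lineStarts.resize(fromLine + 1);
    std::size_t lineStart = s.lineStarts.back();
    std::size_t lastBreak = lineStart;  // position just after the last space on this line
    for (std::size_t i = lineStart; i < s.text.size(); ++i) {
        const char c = s.text[i];
        if (c == '\n') {
            lineStart = lastBreak = i + 1;  // hard break chosen by the user
            s.lineStarts.push_back(lineStart);
        } else if (c == ' ') {
            lastBreak = i + 1;  // spaces may hang past the margin
        } else if (i + 1 - lineStart > s.maxWidth && i > lineStart) {
            // Overflow: break after the last space, or mid-word if there is none.
            lineStart = (lastBreak > lineStart) ? lastBreak : i;
            lastBreak = lineStart;
            s.lineStarts.push_back(lineStart);
        }
    }
}

// Appending only ever re-wraps the last line; earlier breaks are kept as they are.
void appendText(WrapState& s, const std::string& more) {
    s.text += more;
    rewrapFrom(s, s.lineStarts.size() - 1);
}
```

Calling appendText for each typed chunk should give the same line starts as a full rewrapFrom(state, 0), while only scanning the tail of the text.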

If you want to make it fast even when the user is typing (or deleting or whatever) somewhere in the middle of the text, your code will need to do more work and store more information. For instance: whenever you build a line, remember "if you start a line with this word, it ends with that word and this is the whole resulting line". Invalidate this information when anything changes within that line. Now, after a little editing, most changes will not require very much recalculation. You should work out the details of this for yourself because (1) it's a good exercise and (2) I need to go to bed now.

(To save on memory, you might prefer not to store whole lines at all, whether or not you implement the sort of trick I just described. Instead, just store here's-the-next-line-break information and build up lines as your UI needs to render them.)
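One way this could look, as a sketch only: keep the offsets where lines start and hand out a view of a line when the UI asks for it. The function name lineAt and the lineStarts layout are assumptions for illustration, and std::string_view needs C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// View of line k, given the byte offsets where each wrapped line starts.
// Nothing is copied: the view points into `text`, so `text` must outlive it.
std::string_view lineAt(const std::string& text,
                        const std::vector<std::size_t>& lineStarts,
                        std::size_t k) {
    assert(k < lineStarts.size());
    const std::size_t begin = lineStarts[k];
    const std::size_t end =
        (k + 1 < lineStarts.size()) ? lineStarts[k + 1] : text.size();
    return std::string_view(text).substr(begin, end - begin);
}
```

The renderer would then call lineAt only for the lines that are currently visible.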

It's probably more complication than you want to take on board right now, but you should also look up Donald Knuth's dynamic-programming-based line-breaking algorithm. It's substantially more complicated than yours but can still be made quite quick, and it produces distinctly better results. See, e.g., http://defoe.sourceforge.net/folio/knuth-plass.html .
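To give a flavour of the dynamic-programming approach, here is a much simplified "minimum raggedness" sketch. It is not the full Knuth-Plass algorithm (no stretchable glue, hyphenation or penalties). It assumes integer word widths, one unit per inter-word space, and a cost equal to the squared leftover space on every line except the last. The name breakLines is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Returns the index of the first word on each line.
std::vector<std::size_t> breakLines(const std::vector<int>& wordWidths, int lineWidth) {
    assert(lineWidth > 0);
    const std::size_t n = wordWidths.size();
    const long long INF = std::numeric_limits<long long>::max();
    std::vector<long long> cost(n + 1, INF);  // cost[i]: best cost for words i..n-1
    std::vector<std::size_t> next(n + 1, n);  // next[i]: first word of the following line
    cost[n] = 0;
    for (std::size_t i = n; i-- > 0;) {
        int width = 0;
        for (std::size_t j = i; j < n; ++j) {
            width += wordWidths[j] + (j > i ? 1 : 0);
            if (j > i && width > lineWidth) break;  // an over-long single word still gets a line
            const long long slack = lineWidth - width;
            const long long penalty = (j + 1 == n) ? 0 : slack * slack;  // last line is free
            if (penalty + cost[j + 1] < cost[i]) {
                cost[i] = penalty + cost[j + 1];
                next[i] = j + 1;
            }
        }
    }
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i < n; i = next[i]) starts.push_back(i);
    return starts;
}
```

For the words "aaa bb cc ddddd" at width 6, a greedy wrap gives "aaa bb" / "cc" / "ddddd", whereas this sketch prefers the more even "aaa" / "bb cc" / "ddddd".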

Problems with algorithms often come with problems with data structures.

Let's make a few observations, first:

  • paragraphs can be treated independently
  • editing at a given index only invalidates the current word and those that follow
  • it is unnecessary to copy whole words when their index suffices for retrieving them and only their length matters for the computation

Paragraph

I would begin by introducing the notion of a paragraph, which is delimited by user-introduced line-breaks. When an edit takes place, you need to locate the paragraph concerned, which requires a look-up structure.

The "ideal" structure here would be a Fenwick Tree; for a small text box, however, this seems overkill. We'll just have each paragraph store the number of displayed lines that make up its representation, and you'll count from the beginning. Note that an access to the last displayed line is an access to the last paragraph.
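The counting look-up could be sketched like this, assuming the per-paragraph line counts are available as a plain vector. The names Location and locate are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Location {
    std::size_t paragraph;        // index of the paragraph
    std::size_t lineInParagraph;  // displayed line within that paragraph
};

// Linear scan from the beginning: subtract each paragraph's displayed-line
// count until the requested displayed line falls inside one of them.
Location locate(const std::vector<std::size_t>& lineCounts, std::size_t displayedLine) {
    std::size_t p = 0;
    while (p < lineCounts.size() && displayedLine >= lineCounts[p]) {
        assert(lineCounts[p] > 0);  // every paragraph displays at least one line
        displayedLine -= lineCounts[p];
        ++p;
    }
    return {p, displayedLine};  // p == lineCounts.size() means "past the end"
}
```

This is O(number of paragraphs) per look-up, which is the trade-off accepted above in place of a Fenwick Tree.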

The paragraphs are thus stored as a contiguous sequence. In C++ terms, we'll probably take the hit of an indirection (i.e. storing pointers) to avoid moving them around when a paragraph in the middle is removed.

Each paragraph will store:

  • its content, the simplest being a single std::string to represent it
  • its display, in editable form (which we still need to determine)

Each paragraph will cache its display; this paragraph cache will be invalidated whenever an edit is made.

The actual rendering will be done for only a couple of paragraphs at a time (and better, a couple of displayed lines): those which are visible.

Displayed Line

A paragraph is displayed with at least one line, but there is no maximum. We need to store the "display" in editable form, that is, a form suitable for editing.

A single chunk of characters with \n thrown in is not suitable. Changes imply moving lots of characters around, and users are supposed to be changing the text, so we need something better.

Using lengths instead of characters, we may actually only store a mere 4 bytes per length (if the string takes more than 3GB... I don't guarantee much about this algorithm).

My first idea was to use the character index; however, in case of an edit all subsequent indexes change, and the propagation is error prone. Lengths are offsets, so we have an index relative to the position of the previous word. It does pose the issue of what a word (or token) is. Notably, do you collapse multiple spaces? How do you handle them? Here I'll assume that words are separated from one another by a single whitespace.

For "fast" retrieval, I'll store the length of the whole displayed line as well. This allows quickly skipping the first displayed lines when an edit is made at character 503 of the paragraph.

A displayed line will thus be composed of:

  • a total length (at most the maximum displayed length of the box, once computation has ended)
  • a sequence of word (token) lengths

This sequence should be editable efficiently at both ends (since for wrapping we'll push/pop words at both ends depending on whether an edit added or removed words). It's not so important if we're not that efficient in the middle, because only one line at a time is edited in the middle.

In C++, either a vector or deque should be fine. While in theory a list would be "perfect", in practice its poor memory locality and high memory overhead will offset its asymptotic guarantees. A line is composed of few words, so the asymptotic behavior does not matter and high constants do.
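As a rough illustration of this representation, the sketch below splits a paragraph on single spaces and packs word lengths greedily into lines. It assumes a "width" measured in characters, not pixels. The names Line and wrapParagraph are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

struct Line {
    std::uint32_t length = 0;          // characters on the line, inner spaces included
    std::vector<std::uint16_t> words;  // length of each word (token) on the line
};

// Greedy packing of word lengths; words are assumed to be separated by one space.
std::vector<Line> wrapParagraph(const std::string& content, std::uint32_t maxLength) {
    std::vector<Line> lines(1);
    std::size_t pos = 0;
    while (pos < content.size()) {
        std::size_t space = content.find(' ', pos);
        if (space == std::string::npos) space = content.size();
        assert(space - pos <= std::numeric_limits<std::uint16_t>::max());
        const std::uint16_t len = static_cast<std::uint16_t>(space - pos);

        const bool empty = lines.back().words.empty();
        std::uint32_t needed = empty ? len : lines.back().length + 1u + len;
        if (!empty && needed > maxLength) {  // word does not fit: start a new line
            lines.push_back(Line());
            needed = len;
        }
        lines.back().words.push_back(len);
        lines.back().length = needed;
        pos = space + 1;  // skip the separating space
    }
    return lines;
}
```

Only lengths are stored, so an edit never has to shift character data inside the display structure.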

Rendering

For the rendering, pick a buffer of already sufficient length (a std::string with a call to reserve will do). Normally, you'd clear and rewrite the buffer each time, so no memory allocation occurs.

You need not display what cannot be seen, but you do need to know how many lines there are, to pick the correct paragraph.

Once you get the paragraph:

  • set offset to 0
  • for each hidden line, increment offset by its length (+ 1 for the space after it)
  • a word is accessed as a substring of _content; you can use the append method on buffer: buffer.append(_content, offset, length)

The difficulty is in maintaining offset, but that's what makes the algorithm efficient.
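The steps above might be sketched as follows, assuming a Line type that holds a total length and a list of word lengths, and words separated by exactly one space. The names Line and renderLine are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Line {
    std::uint32_t length;              // characters on the line, inner spaces included
    std::vector<std::uint16_t> words;  // length of each word on the line
};

// Rebuild displayed line `index` of a paragraph into `buffer`, reusing its storage.
void renderLine(const std::string& content, const std::vector<Line>& lines,
                std::size_t index, std::string& buffer) {
    assert(index < lines.size());
    std::size_t offset = 0;
    for (std::size_t i = 0; i < index; ++i) {
        offset += lines[i].length + 1;  // + 1 for the space consumed by the wrap
    }
    buffer.clear();  // keeps the capacity, so normally no allocation happens
    bool first = true;
    for (const std::uint16_t len : lines[index].words) {
        if (!first) buffer += ' ';
        buffer.append(content, offset, len);  // copy the word straight out of the content
        offset += len + 1u;                   // skip the word and its trailing space
        first = false;
    }
}
```

A caller would reserve the buffer once and then call renderLine for each visible line.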

Structures

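#include <boost/noncopyable.hpp>
#include <boost/ptr_container/ptr_vector.hpp>
#include <cstdint>
#include <string>
#include <vector>

struct Paragraph; // forward declaration: LineDisplay refers back to its paragraph

struct LineDisplay: private boost::noncopyable
{
  explicit LineDisplay(Paragraph& paragraph): _paragraph(paragraph), _length(0) {}

  Paragraph& _paragraph;        // paragraph this displayed line belongs to
  uint32_t _length;             // total length of the displayed line
  std::vector<uint16_t> _words; // word lengths; copying around can be done with memmove
};

struct Paragraph
{
  std::string _content;                  // raw text of the paragraph
  boost::ptr_vector<LineDisplay> _lines; // cached display, invalidated on edit
};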

With this structure, implementation should be straightforward, and should not slow down as much when the content grows.

General change to the algorithm -

  1. work out as cheaply as you can whether you need the scroll bar, i.e. count the number of \n in the text and, if it's greater than the vheight, turn on the scroll; check lengths, and so on.
  2. prepare the text into appropriate lines for the control, now that you know whether you need a scroll bar or not.
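A cheap pre-check along the lines of step 1 could look like the sketch below. It only gives a lower bound: wrapping can add lines but never removes hard line breaks, so a true result is definite and a false result still needs the full wrap. The function name and the pixel-height parameters are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// True when the hard line breaks alone already overflow the visible height.
bool definitelyNeedsVScroll(const std::string& text, int lineHeight, int visibleHeight) {
    assert(lineHeight > 0);
    const auto hardLines = std::count(text.begin(), text.end(), '\n') + 1;
    return hardLines * lineHeight > visibleHeight;
}
```

When this returns true, the wrap can be done once with the scroll-bar width already subtracted, which avoids the restart in the middle of the loop.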

This allows you to remove/reduce the test if(!isVscroll && isVScrollNeeded()), which is run on almost every character. isVScrollNeeded() is probably not cheap; the example code doesn't seem to pass knowledge of lines to the function, so I can't see how it tells if it is needed.

Assuming textRows is a vector<string>, textRows.back() += is kind of expensive; looking up the back costs little, but += on a string is not efficient. I'd change to using an ostringstream for gathering the row and push it in when it is done.

getFont().getTextWidth() is likely to be expensive. Is the font changing? How greatly does the width differ between the smallest and largest letters? There are shortcuts for fixed-width fonts.
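If measuring is the bottleneck, one hedged option is to cache per-character widths, as sketched below. The class GlyphWidthCache and its measure callback are hypothetical; the callback stands in for whatever the real font API offers. Summing cached glyph widths ignores kerning, so it may not match a whole-string measurement exactly.

```cpp
#include <cassert>
#include <functional>
#include <unordered_map>
#include <utility>

// Remembers the width of each code point so the font is asked only once per glyph.
class GlyphWidthCache {
public:
    explicit GlyphWidthCache(std::function<int(char32_t)> measure)
        : measure_(std::move(measure)) {
        assert(static_cast<bool>(measure_));  // a measuring callback is required
    }

    int width(char32_t codePoint) {
        const auto it = cache_.find(codePoint);
        if (it != cache_.end()) return it->second;
        const int w = measure_(codePoint);  // the slow call, done once per glyph
        cache_.emplace(codePoint, w);
        return w;
    }

    // Call this whenever the font or its size changes.
    void clear() { cache_.clear(); }

private:
    std::function<int(char32_t)> measure_;
    std::unordered_map<char32_t, int> cache_;
};
```

For a fixed-width font the cache degenerates to a single value, which is the shortcut mentioned above.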

Use native methods where possible to get the size of a word, since you don't want to break them - GetTextExtentPoint32

Often there will be sufficient space on a line to allow for the VScroll when you change between. Restarting from the beginning with measuring could cost you up to twice the time. Store the width of the line with each line so you can skip over the ones that still fit. Or don't build the line strings directly; keep the words separate, along with their sizes.

How accurate does it really need to be? Apply some pragmatism...
Just assume VScroll will be needed; mostly the wrapping won't change much even if it isn't (one-letter words at the end/start of a line).

Try to work more with words than with letters; checking the remaining space for each letter can waste time. Assume each letter in the string is the widest letter: if letters x widest < remaining space, then put the word in.
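That last shortcut could be sketched like this. It assumes the widest glyph width is known, and measureExactly is a hypothetical callback wrapping the real (slow) font measurement. Using the byte count as the letter count stays a safe upper bound for UTF-8 text, since there are never more characters than bytes.

```cpp
#include <cassert>
#include <string>

// Returns true if `word` fits in `remaining` width units, measuring only when needed.
template <typename Measure>
bool wordFits(const std::string& word, int widestGlyph, int remaining,
              Measure measureExactly) {
    assert(widestGlyph > 0);
    // Upper bound: no word can be wider than (letters x widest glyph).
    if (static_cast<long long>(word.size()) * widestGlyph < remaining) return true;
    return measureExactly(word) < remaining;  // fall back to the exact measurement
}
```

Short words near the start of a line then skip the font call entirely, and only words close to the margin pay for an exact measurement.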
