How to split a [String] into [[String]] based on length

I'm trying to split a list of Strings into a list of lists of Strings, as in the title: [String] -> [[String]].

This has to be done based on character length, so that no list in the output holds more than 10 characters in total. So if the input had a total length of 20, it would be broken down into 2 lists, and if it had a length of 21, into 3 lists.

I'm not sure what to use to do this. I don't even know how to break down a list into a list of lists, never mind based on a certain length.

For example, if the limit was 5 and the input was:

["abc","cd","abcd","ab"]

The output would be:

[["abc","cd"],["abcd"],["ab"]]

I'd like to be pointed in the right direction and told what methods to use: list comprehension? recursion?

Here's an intuitive solution:

import Data.List (foldl')

breakup :: Int -> [[a]] -> [[[a]]]
breakup size = foldl' accumulate [[]]
  where accumulate broken l
         | length l > size = error "Breakup size too small."
         | sum (map length (last broken ++ [l])) <= size
               = init broken ++ [last broken ++ [l]]
         | otherwise = broken ++ [[l]]

Now, let's go through it line-by-line:

breakup :: Int -> [[a]] -> [[[a]]]

Since you hinted that you may want to generalize the function to accept different size limits, our type signature reflects this. We also generalize beyond [String] (that is, [[Char]]), since our problem is not specific to [[Char]], and could equally apply to any [[a]].

breakup size = foldl' accumulate [[]]

We're using a left fold because we want to transform a list, left-to-right, into our target, which will be a list of sub-lists. Even though we're not concerned with efficiency, we're using Data.List.foldl' instead of Prelude's own foldl because this is standard practice: foldl' forces the accumulator at each step, so it avoids building up a long chain of unevaluated thunks. You can read more about foldl vs. foldl' here.

Our folding function is called accumulate. It will consider a new item and decide whether to place it in the last-created sub-list or to start a new sub-list. To make that judgment, it uses the size we passed in. We start with an initial value of [[]], that is, a list with one empty sub-list.

Now the question is, how should you accumulate your target?

  where accumulate broken l

We're using broken to refer to our constructed target so far, and l (for "list") to refer to the next item to process. We'll use guards for the different cases:

         | length l > size = error "Breakup size too small."

We need to raise an error if the item surpasses the size limit on its own, since there's no way to place it in a sub-list that satisfies the size limit. (Alternatively, we could build a safe function by wrapping our return value in the Maybe monad, and that's something you should definitely try out on your own.)

         | sum (map length (last broken ++ [l])) <= size
               = init broken ++ [last broken ++ [l]]

The guard condition is sum (map length (last broken ++ [l])) <= size, and the return value for this guard is init broken ++ [last broken ++ [l]]. Translated into plain English, we might say, "If the item can fit in the last sub-list without going over the size limit, append it there."

         | otherwise = broken ++ [[l]]

On the other hand, if there isn't enough "room" in the last sub-list for this item, we start a new sub-list, containing only this item. When the accumulate helper is applied to the next item in the input list, it will decide whether to place that item in this sub-list or start yet another sub-list, following the same logic.
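To watch the fold make these decisions one item at a time, here is a small sketch that lifts the same guards into a top-level function and swaps foldl' for scanl, which keeps every intermediate accumulator. The names step and breakupSteps are made up for this illustration and are not part of the solution above.

```haskell
-- Same three guards as accumulate, with size passed explicitly.
step :: Int -> [[[a]]] -> [a] -> [[[a]]]
step size broken l
  | length l > size = error "Breakup size too small."
  | sum (map length (last broken ++ [l])) <= size
      = init broken ++ [last broken ++ [l]]
  | otherwise = broken ++ [[l]]

-- scanl returns the initial value followed by the accumulator after each item.
breakupSteps :: Int -> [[a]] -> [[[[a]]]]
breakupSteps size = scanl (step size) [[]]

-- breakupSteps 5 ["abc","cd","abcd","ab"] evaluates to:
--   [ [[]]
--   , [["abc"]]
--   , [["abc","cd"]]
--   , [["abc","cd"],["abcd"]]
--   , [["abc","cd"],["abcd"],["ab"]] ]
```

The last element of the trace is what breakup itself returns for the same input.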

There you have it. Don't forget to import Data.List (foldl') up at the top. As another answer points out, this is not a performant solution if you plan to process 100,000 strings. However, I believe this solution is easier to read and understand. In many cases, readability is the more important optimization.
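For anyone who wants to compare notes on the Maybe-based alternative mentioned in the walkthrough, one possible shape is sketched below. It is only a sketch: breakupSafe is a made-up name, and foldlM from Data.Foldable is swapped in for foldl' so that the whole fold yields Nothing as soon as one item is too long.

```haskell
import Data.Foldable (foldlM)

-- Hypothetical safe variant: Nothing if any single item exceeds the limit.
breakupSafe :: Int -> [[a]] -> Maybe [[[a]]]
breakupSafe size = foldlM accumulate [[]]
  where
    accumulate broken l
      | length l > size = Nothing
      | sum (map length (last broken ++ [l])) <= size
          = Just (init broken ++ [last broken ++ [l]])
      | otherwise = Just (broken ++ [[l]])

-- breakupSafe 5 ["abc","cd","abcd","ab"]  gives  Just [["abc","cd"],["abcd"],["ab"]]
-- breakupSafe 5 ["abc","abcdef"]          gives  Nothing
```

The guards are the same as before; only the error case and the wrapping of the results change.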

Thanks for the fun question. Good luck with Haskell, and happy coding!

You can do something like this:

splitByLen :: Int -> [String] -> [[String]]
splitByLen n s = go (zip s $ scanl1 (+) $ map length s) 0
  where go [] _ = []
        go xs prev = let (lst, rest) = span (\ (x, c) -> c - prev <= n) xs
                     in (map fst lst) : go rest (snd $ last lst)

And then:

*Main> splitByLen 5 ["abc","cd","abcd","ab"]
[["abc","cd"],["abcd"],["ab"]]
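The key idea in splitByLen is the list of running totals it builds before splitting. As a rough illustration, the helper below (runningTotals is a name invented here, not part of the answer) exposes that intermediate list:

```haskell
-- Pair each string with the total length of all strings up to and including it.
runningTotals :: [String] -> [(String, Int)]
runningTotals s = zip s (scanl1 (+) (map length s))

-- runningTotals ["abc","cd","abcd","ab"]
--   gives [("abc",3),("cd",5),("abcd",9),("ab",11)]
```

go then takes the longest prefix whose running total, minus the total at the end of the previous group (prev), is still within n. With n = 5 that is 3 and 5 for the first group, 9 - 5 = 4 for the second, and 11 - 9 = 2 for the third.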

If any string is longer than n, this function will fail with a runtime error (span returns an empty prefix, and last is then applied to an empty list). What you want to do in those cases depends on your requirements, which were not specified in your question.
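As one possible policy, and purely as an assumption since the question does not say what should happen, an oversized string could be given a group of its own. A sketch of that variant, under the made-up name splitByLenLenient:

```haskell
-- Variant of splitByLen that never calls last on an empty list:
-- a string longer than n is placed alone in its own group (assumed policy).
splitByLenLenient :: Int -> [String] -> [[String]]
splitByLenLenient n s = go (zip s (scanl1 (+) (map length s))) 0
  where
    go [] _ = []
    go xs@((x, c) : rest0) prev =
      case span (\(_, c') -> c' - prev <= n) xs of
        ([], _)     -> [x] : go rest0 c                       -- x alone exceeds n
        (lst, rest) -> map fst lst : go rest (snd (last lst)) -- lst is non-empty here

-- splitByLenLenient 5 ["abc","cd","abcdef","ab"]
--   gives [["abc","cd"],["abcdef"],["ab"]]
```

Inputs with no oversized strings are split exactly as by splitByLen.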


[Update]

As requested by @amar47shah, I made a benchmark comparing his solution (breakup) with mine (splitByLen):

import Data.List
import Data.Time.Clock
import Control.DeepSeq
import System.Random

main :: IO ()
main = do
  s <- mapM (\ _ -> randomString 10) [1..10000]
  test "breakup    10000" $ breakup    10 s
  test "splitByLen 10000" $ splitByLen 10 s
  putStrLn ""
  r <- mapM (\ _ -> randomString 10) [1..100000]
  test "breakup    100000" $ breakup    10 r
  test "splitByLen 100000" $ splitByLen 10 r

test :: (NFData a) => String -> a -> IO ()
test s a = do time1 <- getCurrentTime
              time2 <- a `deepseq` getCurrentTime
              putStrLn $ s ++ ": " ++ show (diffUTCTime time2 time1)

randomString :: Int -> IO String
randomString n = do
  l <- randomRIO (1,n)
  mapM (\ _ -> randomRIO ('a', 'z')) [1..l]

Here are the results:

breakup    10000: 0.904012s
splitByLen 10000: 0.005966s

breakup    100000: 150.945322s
splitByLen 100000: 0.058658s

Here is another approach. It is clear from the problem that the result is a list of lists, and we need a running length and an inner list to keep track of how much we have accumulated (we use foldl' with these two as input). We then describe what we want, which is basically:

  1. If the length of the current input string itself exceeds the length limit, we ignore that string (you may change this if you want a different behavior).
  2. If the new length after we have added the length of the current string is within the length limit, we add the string to the current result list.
  3. If the new length exceeds the length limit, we add the result so far to the output and start a new result list.

import Data.List (foldl')

chunks len = reverse . map reverse . snd . foldl' f (0, [[]]) where
  -- The accumulator is (length of the current group, groups so far, newest first).
  f resSoFar@(lenSoFar, currRes:acc) curr
    | currLength > len = resSoFar -- ignore
    | newLen <= len    = (newLen, (curr:currRes):acc)
    | otherwise        = (currLength, [curr]:currRes:acc)
    where
      newLen = lenSoFar + currLength
      currLength = length curr

Every time we add a result to the output list, we add it to the front, hence we need reverse . map reverse at the end.
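To see why both reversals are needed, it can help to look at the accumulator before they are applied. The sketch below restates the folding step as a top-level function; chunkStep and rawChunks are names made up for this illustration, and the extra clause for an empty group list is never reached when the fold starts from [[]].

```haskell
import Data.List (foldl')

-- The folding step from chunks, with the limit passed explicitly.
chunkStep :: Int -> (Int, [[[a]]]) -> [a] -> (Int, [[[a]]])
chunkStep len resSoFar@(lenSoFar, currRes:acc) curr
  | currLength > len = resSoFar
  | newLen <= len    = (newLen, (curr:currRes):acc)
  | otherwise        = (currLength, [curr]:currRes:acc)
  where
    newLen = lenSoFar + currLength
    currLength = length curr
chunkStep _ res@(_, []) _ = res -- unreachable from the seed (0, [[]])

-- The groups exactly as the fold leaves them, with no reversing.
rawChunks :: Int -> [[a]] -> [[[a]]]
rawChunks len = snd . foldl' (chunkStep len) (0, [[]])

-- rawChunks 5 ["abc","cd","abcd","ab"]
--   gives [["ab"],["abcd"],["cd","abc"]]
```

Both the groups and the strings inside each group come out newest-first, so the outer reverse restores the group order and map reverse restores the order within each group.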

> chunks 5 ["abc","cd","abcd","ab"]
[["abc","cd"],["abcd"],["ab"]]

> chunks 5 ["abc","cd","abcdef","ab"]
[["abc","cd"],["ab"]]

Here is an elementary approach. First, the type String doesn't matter, so we can define our function in terms of a general type a:

breakup :: [a] -> [[a]]

I'll illustrate with a limit of 3 instead of 10. It'll be obvious how to implement it with another limit.

The first pattern will handle lists which are of size >= 3, and the second pattern handles all of the other cases:

breakup (a1 : a2 : a3 : as) = [a1, a2, a3] : breakup as
breakup as = [ as ]

It is important to have the patterns in this order. That way the second pattern will only be used when the first pattern does not match, i.e. when there are fewer than 3 elements in the list.

Examples of running this on some inputs:

breakup [1..5]       -> [ [1,2,3], [4,5] ]
breakup [1..4]       -> [ [1,2,3], [4] ]
breakup [1..2]       -> [ [1,2] ]
breakup [1..3]       -> [ [1,2,3], [] ]

We see there is an extra [] when we run the function on [1..3]. Fortunately this is easy to fix by inserting another rule before the last one:

breakup [] = []

The complete definition is:

breakup :: [a] -> [[a]]
breakup [] = []
breakup (a1 : a2 : a3 : as) = [a1, a2, a3] : breakup as
breakup as = [ as ]
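For an arbitrary limit, writing one pattern per element stops being practical. A minimal sketch of the same idea using splitAt from the Prelude follows; breakupN is a name made up here, and like the definition above it counts list elements, not characters.

```haskell
-- Split a list into consecutive pieces of at most n elements each.
breakupN :: Int -> [a] -> [[a]]
breakupN n xs
  | n <= 0    = error "breakupN: limit must be positive" -- otherwise the recursion would never shrink xs
  | null xs   = []
  | otherwise = piece : breakupN n rest
  where
    (piece, rest) = splitAt n xs

-- breakupN 3 [1..5]  gives [[1,2,3],[4,5]]
-- breakupN 3 [1..3]  gives [[1,2,3]]
```

With a limit of 10, an input of 20 elements yields 2 lists and an input of 21 elements yields 3.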
