
Segmented Least Squares

Give an algorithm that takes a sequence of points in the plane (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) and an integer k as input and returns the best piecewise linear function f consisting of at most k pieces that minimizes the sum of squared errors. You may assume that you have access to an algorithm that computes the sum of squared errors for one segment through a set of n points in Θ(n) time. The solution should use O(n^2 k) time and O(nk) space.

Can anyone help me with this problem? Thank you so much!

(This is too late for your homework, but I hope it helps anyway.)
First, here is dynamic programming in Python / NumPy for k = 4 only, to help you understand how dynamic programming works; once you understand that, writing a loop for any k should be easy.
Also, Cost[] here is a full 2-d matrix, space O(n^2); see the notes at the end for getting down to space O(nk).
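
For reference, the recurrence that the k = 4 code below instantiates is the standard segmented-least-squares one; restated here in the code's Cost[] notation (my summary, not part of the original answer):

C[1,nleft] = Cost[0,nleft]
C[k,nleft] = min over j = k-1 .. nleft of ( C[k-1,j-1] + Cost[j,nleft] ),  k = 2, 3, ...

Here C[k,nleft] is the best cost of splitting points 0 .. nleft into k pieces (C2, C3 in the code), and the split positions are recovered by following the argmin indices (J2, J3, ...) backwards.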

#!/usr/bin/env python
""" split4.py: min-cost split into 4 pieces, dynamic programming k=4 """

from __future__ import division
import numpy as np

__version__ = "2014-03-09 mar denis"

#...............................................................................
def split4( Cost, verbose=1 ):
    """ split4.py: min-cost split into 4 pieces, dynamic programming k=4
        min Cost[0:a] + Cost[a:b] + Cost[b:c] + Cost[c:n]
        Cost[a,b] = error in least-squares line fit to xy[a] .. xy[b] *including b*
        or error in lsq horizontal lines, sum (y_j - av y) ^2 for each piece --

          o--
        o-
             o---
                 o----
        | |  |   |
        0 2  5   9

        (Why 4 ? to walk through step by step, then put in a loop)
    """
        # speedup: maxlen 2 n/k or so

    Cost = np.asanyarray(Cost)
    n = Cost.shape[1]
        # C2 C3 ... costs, J2 J3 ... indices of best splits
    J2 = - np.ones(n, dtype=int)  # -1, NaN mark undefined / bug
    C2 = np.ones(n) * np.nan
    J3 = - np.ones(n, dtype=int)
    C3 = np.ones(n) * np.nan

        # best 2-splits of the left 2 3 4 ...
    for nleft in range( 1, n ):
        J2[nleft] = j = np.argmin([ Cost[0,j-1] + Cost[j,nleft]  for j in range( 1, nleft+1 )]) + 1
        C2[nleft] =                 Cost[0,j-1] + Cost[j,nleft]
            # an idiom for argmin j, min value c together

        # best 3-splits of the left 3 4 5 ...
    for nleft in range( 2, n ):
        J3[nleft] = j = np.argmin([ C2[j-1] + Cost[j,nleft]  for j in range( 2, nleft+1 )]) + 2
        C3[nleft] =                 C2[j-1] + Cost[j,nleft]

        # best 4-split of all n --
    j4 = np.argmin([ C3[j-1] + Cost[j,n-1]  for j in range( 3, n )]) + 3
    c4 =             C3[j4-1] + Cost[j4,n-1]
    j3 = J3[j4 - 1]  # the first 3 pieces cover y[0] .. y[j4 - 1]
    j2 = J2[j3 - 1]  # the first 2 pieces cover y[0] .. y[j3 - 1]
    jsplit = np.array([ 0, j2, j3, j4, n ])
    if verbose:
        print( "split4: len %s  pos %s  cost %.3g" % (np.diff(jsplit), jsplit, c4) )
        print( "split4: J2 %s  C2 %s" % (J2, C2) )
        print( "split4: J3 %s  C3 %s" % (J3, C3) )
    return jsplit

#...............................................................................
if __name__ == "__main__":
    import random
    import sys
    import spread  # the second code block below: spreads(), spreads_allij()

    n = 10
    ncycle = 2
    plot = 0
    seed = 0

        # run this.py a=1 b=None c=[3] 'd = expr' ...  in sh or ipython
    for arg in sys.argv[1:]:
        exec( arg )
    np.set_printoptions( 1, threshold=100, edgeitems=10, linewidth=100, suppress=True )
    np.random.seed(seed)
    random.seed(seed)

    print( "\n", 80 * "-" )
    title = "Dynamic programming least-square horizontal lines  %s  n %d  seed %d" % (
            __file__, n, seed)
    print( title )

    x = np.arange( n + 0. )
    y = np.sin( 2*np.pi * x * ncycle / n )
        # synthetic time series ?
    print( "y: %s  av %.3g  variance %.3g" % (y, y.mean(), np.var(y)) )

    print( "Cost[j,k] = sum (y - av y)^2 --" )  # len * var y[j:k+1]
    Cost = spread.spreads_allij( y )
    print( Cost )  # .round().astype(int)

    jsplit = split4( Cost )
        # split4: len [3 2 3 2]  pos [ 0  3  5  8 10]

    if plot:
        import matplotlib.pyplot as pl
        title += "\n lengths: %s" % np.diff(jsplit)
        pl.title( title )
        pl.plot( y )
        for js, js1 in zip( jsplit[:-1], jsplit[1:] ):
            if js1 <= js:  continue
            yav = y[js:js1].mean() * np.ones( js1 - js + 1 )
            pl.plot( np.arange( js, js1 + 1 ), yav )
        # pl.legend()
        pl.show()

Next, the following code computes Cost[] for horizontal lines only, slope 0; extending it to line segments of any slope, still in O(n) time, is left as an exercise (one possible sketch follows the code below).

""" spreads( all y[:j] ) in time O(n)

define spread( y[] ) = sum (y - average y)^2
    e.g. spread of 24 hourly temperatures y[0:24] i.e. y[0] .. y[23]
    around a horizontal line at the average temperature
        (spread = 0 for constant temperature,
        24 c^2 for constant + [c -c c -c ...],
        24 * variance(y) )

    How fast can one compute all 24 spreads
    1 hour (midnight to 1 am), 2 hours ... all 24 ?

    A simpler problem: compute all 24 averages in time O(n):
        N = np.arange( 1, len(y)+1 )
        allav = np.cumsum(y) / N
            = [ y0,  (y0 + y1) / 2,  (y0 + y1 + y2) / 3 ...]
    An identity:
        spread(y) = sum(y^2) - n * (av y)^2
    Voila: the code below, all spreads() in time O(n).

    Exercise: extend this to spreads around least-squares lines
    fit to [ y0,  [y0 y1],  [y0 y1 y2] ... ], not just horizontal lines.
"""

from __future__ import division
import sys
import numpy as np


#...............................................................................
def spreads( y ):
    """ [ spread y[:1], spread y[:2] ... spread y ] in time O(n)
        where spread( y[] ) = sum (y - average y )^2
            = n * variance(y)
    """
    N = np.arange( 1, len(y)+1 )
    return np.cumsum( y**2 ) - np.cumsum( y )**2 / N

def spreads_allij( y ):
    """ -> A[i,j] = sum (y - av y)^2, spread of y around its average
        for all y[i:j+1] 
        time, space O(n^2)
    """
    y = np.asanyarray( y, dtype=float )
    n = len(y)
    A = np.zeros((n,n))
    for i in range(n):
        A[i,i:] = spreads( y[i:] )
    return A
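
The "any slope" exercise can be done with the same running-sum trick: keep cumulative sums of x, y, x^2, y^2 and x*y, and use the least-squares identity sse = Cyy - Cxy^2 / Cxx on the centered sums Cxx = sum (x - av x)^2, Cxy = sum (x - av x)(y - av y), Cyy = sum (y - av y)^2. Here is one possible sketch (my addition, not from the original answer; the name line_spreads is made up for illustration):

import numpy as np

def line_spreads( x, y ):
    """ [ sse of the least-squares line fit to (x,y)[:1], (x,y)[:2] ... (x,y)[:n] ]
        in time O(n), via running sums of x, y, x^2, y^2 and x*y -- a sketch
    """
    x = np.asarray( x, dtype=float )
    y = np.asarray( y, dtype=float )
    N = np.arange( 1, len(y)+1 )
    Sx, Sy = np.cumsum( x ), np.cumsum( y )
    Sxx, Syy, Sxy = np.cumsum( x*x ), np.cumsum( y*y ), np.cumsum( x*y )
        # centered sums for each prefix: Cxx = sum (x - av x)^2  etc.
    Cxx = Sxx - Sx**2 / N
    Cyy = Syy - Sy**2 / N
    Cxy = Sxy - Sx * Sy / N
        # sse of the best-fit line = Cyy - Cxy^2 / Cxx; a one-point prefix (Cxx == 0) has sse 0
    with np.errstate( divide="ignore", invalid="ignore" ):
        sse = np.where( Cxx > 0, Cyy - Cxy**2 / Cxx, 0.0 )
    return sse

Looping this over start indices, exactly as spreads_allij loops spreads, would give the full any-slope Cost[i,j] in O(n^2) time.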

So far we have an n x n cost matrix, space O(n^2). To get down to space O(nk), look closely at the pattern of Cost[i,j] accesses in the dynamic-programming code:

for nleft in 1 .. n-1:
    Cost_nleft[j] = Cost[j,nleft]  for j <= nleft   -- one row; time nleft, or nleft^2 naively
    for k in 3 4 5 ...:
        C[k,nleft] = min [ C[k-1,j-1] + Cost_nleft[j]  for j .. to nleft ]

Here Cost_nleft is one row of the full n x n cost matrix, ~ n segments, generated as needed. This can be done in time O(n) for line segments. But if "error for one segment through a set of n points takes O(n) time", it seems we're up to time O(n^3). Comments, anyone?
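
To make that loop concrete, here is one possible way to write the general-k dynamic program for the horizontal-line cost above, generating each row Cost_nleft from prefix sums of y and y^2 as needed. A row then costs O(n), so the total stays O(n^2 k) time and O(nk) space; the same goes for sloped segments if a row is generated from line_spreads-style running sums, which is what keeps it from climbing to O(n^3). (This is my sketch, not part of the original answers; segmented_horizontal and costrow are made-up names.)

import numpy as np

def segmented_horizontal( y, K ):
    """ best split of y into at most K pieces, horizontal lines --
        dynamic programming, time O(n^2 K), space O(n K)  (a sketch)
    """
    y = np.asarray( y, dtype=float )
    n = len(y)
    Sy  = np.concatenate(( [0.], np.cumsum( y )))    # Sy[j]  = y[0] + ... + y[j-1]
    Sy2 = np.concatenate(( [0.], np.cumsum( y*y )))  # Sy2[j] = y[0]^2 + ... + y[j-1]^2

    def costrow( nleft ):
        """ Cost[j,nleft] for j = 0 .. nleft -- one row, time O(nleft) """
        j = np.arange( nleft + 1 )
        m = nleft + 1 - j                            # piece lengths
        sumy, sumy2 = Sy[nleft+1] - Sy[j], Sy2[nleft+1] - Sy2[j]
        return sumy2 - sumy**2 / m                   # the spread identity, as in spreads()

    C = np.full( (K+1, n), np.inf )                  # C[k,nleft]: best cost, k pieces for y[0 .. nleft]
    J = np.zeros( (K+1, n), dtype=int )              # J[k,nleft]: start of the last piece
    for nleft in range( n ):
        row = costrow( nleft )                       # generated as needed -- space O(n), not O(n^2)
        C[1,nleft] = row[0]
        for k in range( 2, min( K, nleft+1 ) + 1 ):
            j = np.arange( k-1, nleft+1 )            # possible starts of the last piece
            cand = C[k-1, j-1] + row[j]
            i = np.argmin( cand )
            C[k,nleft], J[k,nleft] = cand[i], j[i]

    k = int( np.argmin( C[1:, n-1] )) + 1            # best number of pieces <= K
    splits, nleft = [n], n - 1
    while k >= 1:                                    # follow J backwards
        j = J[k,nleft] if k > 1 else 0
        splits.append( j )
        nleft, k = j - 1, k - 1
    return np.array( splits[::-1] )                  # piece boundaries, like split4's jsplit

On the sine example above (n = 10, ncycle = 2), segmented_horizontal( y, 4 ) should reproduce split4's boundaries [0 3 5 8 10], since it uses the same costs and the same first-argmin tie-breaking.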

If you can do least squares for some segment in n^2, it's easy to do what you want in n^2 k^2 with dynamic programming. You might be able to optimize that down to a single factor of k.
