
Map/reduce in RavenDB, update 1

Update 1, following Ayende's answer

This is my first journey into RavenDB, and to experiment with it I wrote a small map/reduce index. Unfortunately, the result is empty.

I have around 1.6 million documents loaded into RavenDB.

A document:

public class Tick
{
    public DateTime Time;
    public decimal Ask;
    public decimal Bid;
    public double AskVolume;
    public double BidVolume;
}

and I want to get the Min and Max of Ask over a specific period of Time.

The collection filtered by Time is defined as:

var ticks = session.Query<Tick>().Where(x => x.Time > new DateTime(2012, 4, 23) && x.Time < new DateTime(2012, 4, 24, 00, 0, 0)).ToList();

That gives me 90,280 documents, so far so good.

But then the map/reduce:

Map = rows => from row in rows
              select new
              {
                  Max = row.Bid,
                  Min = row.Bid,
                  Time = row.Time,
                  Count = 1
              };

Reduce = results => from result in results
                    group result by new { result.MaxBid, result.Count } into g
                    select new
                    {
                        Max = g.Key.MaxBid,
                        Min = g.Min(x => x.MaxBid),
                        Time = g.Key.Time,
                        Count = g.Sum(x => x.Count)
                    };

...

private class TickAggregationResult
{
    public decimal MaxBid { get; set; }
    public decimal MinBid { get; set; }
    public int Count { get; set; }
}

I then create the index and try to Query it:

Raven.Client.Indexes.IndexCreation.CreateIndexes(typeof(TickAggregation).Assembly, documentStore);

var session = documentStore.OpenSession();

var g1 = session.Query<TickAggregationResult>(typeof(TickAggregation).Name);

var group = session.Query<Tick, TickAggregation>()
                   .Where(x => x.Time > new DateTime(2012, 4, 23) &&
                               x.Time < new DateTime(2012, 4, 24, 0, 0, 0))
                   .Customize(x => x.WaitForNonStaleResults())
                   .AsProjection<TickAggregationResult>();

But the group is just empty :(

As you can see, I've tried two different queries. I'm not sure about the difference, can someone explain?

Now I get an error (a screenshot of the error message was attached here).

The group is still empty :(

Let me explain what I'm trying to accomplish in pure SQL:

select min(Ask), count(*) as TickCount from Ticks
where Time between '2012-04-23' and '2012-04-24'

Unfortunately, Map/Reduce doesn't work that way. Well, at least the Reduce part of it doesn't. In order to reduce your set, you would have to predefine specific time ranges to group by, for example - daily, weekly, monthly, etc. You could then get min/max/count per day if you reduced daily.
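To make the "predefine time ranges" idea concrete, here is a minimal sketch of what a daily reduction could look like, written against the `Tick` class from the question. The index name `Ticks_DailyAggregate` and its `Result` class are hypothetical, and whether `Time.Date` translates cleanly in your RavenDB client version's index expressions is something you would need to verify; the point is only that the day becomes part of the reduce key, so each day keeps its own min/max/count.

```csharp
using System;
using System.Linq;
using Raven.Client.Indexes;

// Hypothetical daily map/reduce index: one reduced entry per calendar day.
class Ticks_DailyAggregate : AbstractIndexCreationTask<Tick, Ticks_DailyAggregate.Result>
{
    public class Result
    {
        public DateTime Day { get; set; }
        public decimal Min { get; set; }
        public decimal Max { get; set; }
        public int Count { get; set; }
    }

    public Ticks_DailyAggregate()
    {
        Map = ticks => from tick in ticks
                       select new
                       {
                           Day = tick.Time.Date,  // truncate to the day
                           Min = tick.Bid,
                           Max = tick.Bid,
                           Count = 1
                       };

        // Group by day instead of by 0, so the totals are per-day
        // rather than for all time.
        Reduce = results => from result in results
                            group result by result.Day into g
                            select new
                            {
                                Day = g.Key,
                                Min = g.Min(x => x.Min),
                                Max = g.Max(x => x.Max),
                                Count = g.Sum(x => x.Count)
                            };
    }
}
```

You could then query the reduced results with a `Where(x => x.Day >= from && x.Day <= to)` filter, which stays cheap because only one entry per day is scanned. The trade-off is granularity: you can only filter on whole days (or whatever unit you baked into the key), not arbitrary time ranges.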

There is a way to get what you want, but it has some performance considerations. Basically, you don't reduce at all, but you index by time and then do the aggregation when transforming results. This is similar to if you ran your first query to filter and then aggregated in your client code. The only benefit is that the aggregation is done server-side, so you don't have to transmit all of that data to the client.

The performance concern here is how big of a time range are you filtering to, or more precisely, how many items will there be inside your filter range? If it's relatively small, you can use this approach. If it's too large, you will be waiting while the server goes through the result set.

Here is a sample program that illustrates this technique:

using System;
using System.Linq;
using Raven.Client.Document;
using Raven.Client.Indexes;
using Raven.Client.Linq;

namespace ConsoleApplication1
{
  public class Tick
  {
    public string Id { get; set; }
    public DateTime Time { get; set; }
    public decimal Bid { get; set; }
  }

  /// <summary>
  /// This index is a true map/reduce, but its totals are for all time.
  /// You can't filter it by time range.
  /// </summary>
  class Ticks_Aggregate : AbstractIndexCreationTask<Tick, Ticks_Aggregate.Result>
  {
    public class Result
    {
      public decimal Min { get; set; }
      public decimal Max { get; set; }
      public int Count { get; set; }
    }

    public Ticks_Aggregate()
    {
      Map = ticks => from tick in ticks
               select new
                    {
                      Min = tick.Bid,
                      Max = tick.Bid,
                      Count = 1
                    };

      Reduce = results => from result in results
                group result by 0
                  into g
                  select new
                         {
                           Min = g.Min(x => x.Min),
                           Max = g.Max(x => x.Max),
                           Count = g.Sum(x => x.Count)
                         };
    }
  }

  /// <summary>
  /// This index can be filtered by time range, but it does not reduce anything
  /// so it will not be performant if there are many items inside the filter.
  /// </summary>
  class Ticks_ByTime : AbstractIndexCreationTask<Tick>
  {
    public class Result
    {
      public decimal Min { get; set; }
      public decimal Max { get; set; }
      public int Count { get; set; }
    }

    public Ticks_ByTime()
    {
      Map = ticks => from tick in ticks
               select new {tick.Time};

      TransformResults = (database, ticks) =>
                 from tick in ticks
                 group tick by 0
                 into g
                 select new
                      {
                        Min = g.Min(x => x.Bid),
                        Max = g.Max(x => x.Bid),
                        Count = g.Count()
                      };
    }
  }

  class Program
  {
    private static void Main()
    {
      var documentStore = new DocumentStore { Url = "http://localhost:8080" };
      documentStore.Initialize();
      IndexCreation.CreateIndexes(typeof(Program).Assembly, documentStore);


      var today = DateTime.Today;
      var rnd = new Random();

      using (var session = documentStore.OpenSession())
      {
        // Generate 100 random ticks
        for (var i = 0; i < 100; i++)
        {
          var tick = new Tick { Time = today.AddMinutes(i), Bid = rnd.Next(100, 1000) / 100m };
          session.Store(tick);
        }

        session.SaveChanges();
      }


      using (var session = documentStore.OpenSession())
      {
        // Query items with a filter.  This will create a dynamic index.
        var fromTime = today.AddMinutes(20);
        var toTime = today.AddMinutes(80);
        var ticks = session.Query<Tick>()
          .Where(x => x.Time >= fromTime && x.Time <= toTime)
          .OrderBy(x => x.Time);

        // Output the results of the above query
        foreach (var tick in ticks)
          Console.WriteLine("{0} {1}", tick.Time, tick.Bid);

        // Get the aggregates for all time
        var total = session.Query<Tick, Ticks_Aggregate>()
          .As<Ticks_Aggregate.Result>()
          .Single();
        Console.WriteLine();
        Console.WriteLine("Totals");
        Console.WriteLine("Min: {0}", total.Min);
        Console.WriteLine("Max: {0}", total.Max);
        Console.WriteLine("Count: {0}", total.Count);

        // Get the aggregates with a filter
        var filtered = session.Query<Tick, Ticks_ByTime>()
          .Where(x => x.Time >= fromTime && x.Time <= toTime)
          .As<Ticks_ByTime.Result>()
          .Take(1024)  // max you can take at once
          .ToList()    // required!
          .Single();
        Console.WriteLine();
        Console.WriteLine("Filtered");
        Console.WriteLine("Min: {0}", filtered.Min);
        Console.WriteLine("Max: {0}", filtered.Max);
        Console.WriteLine("Count: {0}", filtered.Count);
      }

      Console.ReadLine();
    }
  }
}

I can envision a solution to the problem of aggregating over a time filter with a potentially large scope. The reduce would have to break things down into decreasingly smaller units of time at different levels. The code for this is a bit complex, but I am working on it for my own purposes. When complete, I will post over in the knowledge base at www.ravendb.net.


UPDATE

I was playing with this a bit more, and noticed two things in that last query.

  1. You MUST do a ToList() before calling Single() in order to get the full result set.
  2. Even though this runs on the server, the max you can have in the result range is 1024, and you have to specify a Take(1024) or you get the default of 128 max. Since this runs on the server, I didn't expect this. But I guess it's because you don't normally do aggregations in the TransformResults section.

I've updated the code for this. However, unless you can guarantee that the range is small enough for this to work, I would wait for the better full map/reduce that I spoke of. I'm working on it. :)
