What is the average size ratio between a data file and an ETS table?

I am evaluating the use of Erlang ETS to store a large in-memory data set. My test data source is a CSV file that consumes only 350 MB of disk.

My parser reads the file row by row, splits each row into a list, then creates a tuple and stores it in ETS using a "bag" table.

After loading all the data into ETS, I noticed that my computer's 8 GB of RAM was all gone and the OS had started swapping, with total usage somewhere near 16 GB. The Erlang BEAM process seems to consume about 10 times more memory than the data takes on disk.
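For what it's worth, a quick way to see where the memory goes is erlang:memory/1 in the shell; this is just a diagnostic, not part of my loader:

1> erlang:memory(total).      % total bytes allocated by the emulator
2> erlang:memory(ets).        % bytes currently allocated for ETS tables
3> erlang:memory(processes).  % bytes allocated for Erlang processes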

Here is the test code:

-module(load_test_data).
-author("gextra").

%% API
-export([test/0]).

init_ets() ->
  ets:new(memdatabase, [bag, named_table]).

parse(File) ->
  {ok, F} = file:open(File, [read, raw]),
  parse(F, file:read_line(F), []).

parse(F, eof, Done) ->
  file:close(F),
  lists:reverse(Done);

parse(F, Line, Done) ->
  parse(F, file:read_line(F), [ parse_row_commodity_data(Line) | Done ]).

parse_row_commodity_data(Line) ->
  {ok, Data} = Line,
  %%io:fwrite(Data),
  %% Split the line on commas into a list of character-list strings.
  LineList          = re:split(Data, ",", [{return, list}]),
  ReportingCountry  = lists:nth(1, LineList),
  YearPeriod        = lists:nth(2, LineList),
  Year              = lists:nth(3, LineList),
  Period            = lists:nth(4, LineList),
  TradeFlow         = lists:nth(5, LineList),
  Commodity         = lists:nth(6, LineList),
  PartnerCountry    = lists:nth(7, LineList),
  NetWeight         = lists:nth(8, LineList),
  Value             = lists:nth(9, LineList),
  IsReported        = lists:nth(10, LineList),
  %% Key is the concatenation of three of the string fields.
  ets:insert(memdatabase, {YearPeriod ++ ReportingCountry ++ Commodity,
                           {ReportingCountry, Year, Period, TradeFlow, Commodity,
                            PartnerCountry, NetWeight, Value, IsReported}}).


test() ->
  init_ets(),
  parse("/data/000-2010-1.csv").

It strongly depends on what you mean by "splices it into a list, then create a tuple". Splitting into a list in particular can take a lot of memory: each byte of input occupies 16 B when stored as a character in a list on a 64-bit system, so 350 MB easily becomes 5.6 GB.
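You can check this in the shell with erts_debug:flat_size/1, which returns the size of a term in machine words (figures below are for a 64-bit emulator, 8 bytes per word):

1> erts_debug:flat_size("12345").      % list: 2 words (16 bytes) per character
10
2> erts_debug:flat_size(<<"12345">>).  % heap binary: small header plus one byte per byte
3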

EDIT:

Try this:

parse(File) ->
  {ok, F} = file:open(File, [read, raw, binary]),
  %% Compile the split pattern once: it matches both the comma
  %% separators and the trailing newline.
  ok = parse(F, binary:compile_pattern([<<$,>>, <<$\n>>])),
  ok = file:close(F).

%% Read and insert one row at a time; nothing is accumulated in a list.
parse(F, CP) ->
  case file:read_line(F) of
    {ok, Line} ->
      parse_row_commodity_data(Line, CP),
      parse(F, CP);
    eof -> ok
  end.

parse_row_commodity_data(Line, CP) ->
  %% One binary:split/3 call yields all ten fields; trim drops the
  %% empty binary left after the trailing newline.
  [ReportingCountry, YearPeriod, Year, Period, TradeFlow, Commodity,
   PartnerCountry, NetWeight, Value, IsReported]
      = binary:split(Line, CP, [global, trim]),
  %% The key is a tuple of three binaries instead of a concatenated list.
  true = ets:insert(memdatabase, {
           {YearPeriod, ReportingCountry, Commodity},
           {ReportingCountry, Year, Period, TradeFlow, Commodity,
            PartnerCountry, NetWeight, Value, IsReported}
         }).
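Two things in this version keep the footprint down: binaries store one byte of input per byte plus a small header, rather than 16 bytes per character as lists do, and the key is a three-element tuple instead of a list built by concatenation, so nothing extra is allocated per row.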
