Slow lexer in Clojure

Question

I'm trying to write a simple lexer in Clojure. For now, it only recognizes identifiers separated by whitespace.

(require '[clojure.set :refer [union]]
         'clojure.string)

(defn char-range-set
  "Generate set containing all characters in the range [from; to]"
  [from to]
  (set (map char (range (int from) (inc (int to))))))

(def ident-initial (union (char-range-set \A \Z) (char-range-set \a \z) #{\_}))

(def ident-subseq (union ident-initial (char-range-set \0 \9)))

(defn update-lex [lex token source]
  (assoc (update lex :tokens conj token) :source source))

(defn scan-identifier [lex]
  (assert (ident-initial (first (:source lex))))
  (loop [[c & cs :as source] (rest (:source lex))
         value [(first (:source lex))]]
    (if (ident-subseq c)
      (recur cs (conj value c))
      (update-lex lex {:type :identifier :value value} source))))

(defn scan [{tokens :tokens [c & cs :as source] :source :as lex}]
  (cond
    (Character/isWhitespace c) (assoc lex :source cs)
    (ident-initial c) (scan-identifier lex)))

(defn tokenize [source]
  (loop [lex {:tokens [] :source source}]
    (if (empty? (:source lex))
      (:tokens lex)
      (recur (scan lex)))))

(defn measure-tokenizer [n]
  (let [s (clojure.string/join (repeat n "abcde "))]
    (time (tokenize s))
    (* n (count "abcde "))))

The lexer processes about 6 million characters in roughly 15 seconds.

=> (measure-tokenizer 1000000)
"Elapsed time: 15865.909399 msecs"

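After that, I converted all the maps and vectors to transients. That gave no improvement.

I have also implemented a similar algorithm in C++. It takes only 0.2 seconds for the same input.

My question is: how can I improve my code? Maybe I'm using Clojure data structures incorrectly?

Update:

Here is my C++ code.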

#include <iostream>
#include <vector>
#include <chrono>
#include <unordered_set>
#include <cstdlib>
#include <string>
#include <cctype>
using namespace std;

struct Token
{
   enum { IDENTIFIER = 1 };
   int type;
   string value;
};

class Lexer
{
public:
   Lexer(const std::string& source)
      : mSource(source)
      , mIndex(0)
   {
      initCharSets();
   }

   std::vector<Token> tokenize()
   {
      while (mIndex < mSource.size())
      {
         scan();
      }

      return mResult;
   }

private:

   void initCharSets()
   {
      for (char c = 'a'; c <= 'z'; ++c)
         mIdentifierInitial.insert(c);
      for (char c = 'A'; c <= 'Z'; ++c)
         mIdentifierInitial.insert(c);
      mIdentifierInitial.insert('_');

      mIdentifierSubsequent = mIdentifierInitial;
      for (char c = '0'; c <= '9'; ++c)
         mIdentifierSubsequent.insert(c);
   }

   void scan()
   {
      skipSpaces();

      if (mIndex < mSource.size())
      {
         if (mIdentifierInitial.find(mSource[mIndex]) != mIdentifierInitial.end())
         {
            scanIdentifier();
         }

         mResult.push_back(mToken);
      }
   }

   void scanIdentifier()
   {
      size_t i = mIndex;

      while ((i < mSource.size()) && (mIdentifierSubsequent.find(mSource[i]) != mIdentifierSubsequent.end()))
         ++i;

      mToken.type = Token::IDENTIFIER;
      mToken.value = mSource.substr(mIndex, i - mIndex);
      mIndex = i;
   }

   void skipSpaces()
   {
      while ((mIndex < mSource.size()) && std::isspace(mSource[mIndex]))
         ++mIndex;
   }

   unordered_set<char> mIdentifierInitial;
   unordered_set<char> mIdentifierSubsequent;
   string mSource;
   size_t mIndex;
   vector<Token> mResult;
   Token mToken;
};

void measureBigString(int n)
{
   std::string substri = "jobbi ";
   std::string bigstr;
   for (int i =0 ;i < n;++i)
      bigstr += substri;

   Lexer lexer(bigstr);

   std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();

   lexer.tokenize();

   std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

   std::cout << n << endl;
   std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count() <<std::endl;
   std::cout << "\n\n\n";
}



int main()
{
   measureBigString(1000000);


   return 0;
}

Answer 1

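I don't see anything obviously wrong with this code. I wouldn't expect transients to help you much, since you are not bulk loading but updating once per loop (and I doubt that's actually the slowest part).

My guesses at which things are slow:

Checking whether the character is in the set (this requires hashing and traversing the internal hash tree). Instead of building sets, you could write functions that do int-based checks on the character ranges (> this, < that, etc.). That would not be as pretty, but it would almost certainly be faster, particularly if you take care to use primitive type hints and avoid boxing to objects.
Every time through the loop, you update a value nested inside a hash map. That is not going to be the fastest operation. If you kept the tokens as an independent transient vector instead, that would be faster and would avoid rebuilding the upper-level tree. Depending on how far out of idiomatic Clojure and into Java land you want to go, you could also use a mutable ArrayList. It's dirty, but it's fast. If you constrain the scope of who is exposed to that mutable state, I'd consider something like that. Conceptually it is the same thing as a transient vector.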
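
To make the two guesses above concrete, here is a minimal sketch, not a benchmarked fix. It swaps the set lookups for int range checks and keeps the tokens in a standalone transient vector instead of inside the lexer map. The names `ident-initial?`, `ident-subseq?` and `tokenize-sketch` are hypothetical, and silently skipping characters that are neither whitespace nor identifier characters is an assumption of this sketch.

```clojure
;; Hypothetical names throughout; this is a sketch, not the original code.

(defn ident-initial?
  "True if c is an ASCII letter or underscore. Uses int comparisons
  instead of a set lookup. Returns false for nil (end of input)."
  [c]
  (and (some? c)
       (let [i (int c)]                       ; primitive int inside this let
         (or (and (<= 97 i) (<= i 122))      ; \a .. \z
             (and (<= 65 i) (<= i 90))       ; \A .. \Z
             (== i 95)))))                   ; \_

(defn ident-subseq?
  "True if c may appear after the first character of an identifier."
  [c]
  (or (ident-initial? c)
      (and (some? c)
           (let [i (int c)]
             (and (<= 48 i) (<= i 57))))))   ; \0 .. \9

(defn tokenize-sketch
  "Returns a vector of identifier tokens found in source.
  Tokens accumulate in a transient vector that never leaves this function."
  [source]
  (loop [src    (seq source)
         value  nil                ; chars of the identifier in progress, or nil
         tokens (transient [])]
    (let [c (first src)]
      (cond
        ;; inside an identifier and c continues it
        (and value (ident-subseq? c))
        (recur (next src) (conj value c) tokens)

        ;; the identifier just ended: emit it and look at c again
        value
        (recur src nil (conj! tokens {:type :identifier :value value}))

        ;; end of input
        (nil? src)
        (persistent! tokens)

        ;; start of a new identifier
        (ident-initial? c)
        (recur (next src) [c] tokens)

        ;; whitespace, or any other character (skipped in this sketch)
        :else
        (recur (next src) nil tokens)))))
```

Calling `(tokenize-sketch "ab c1")` yields two `:identifier` tokens whose `:value` is a vector of characters, the same shape the question's code produces. Note that `c` still arrives boxed from the seq; only the comparisons inside each `let` work on a primitive.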

Answer 2

Update:

One more important tuning concerns vector destructuring. Replacing code like this:

(let [[c & cs] xs] ...)

with:

(let [c  (first xs)
      cs (rest xs)] ...)

gives another 2x performance improvement. In total you get a 26x speedup, which should be roughly on par with the C++ implementation.
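
As an illustration, this is roughly what that substitution might look like when applied to the record-based lexer from this answer. It is a sketch that I have not benchmarked, so the 2x figure is the answer's claim and not something this code demonstrates. Two details are my own assumptions: the character sets are built with `into` to keep the block self-contained, and `(char c)` is used as the coercion before calling `Character/isWhitespace`.

```clojure
;; Character sets, built without clojure.set to keep the sketch self-contained.
(def ident-initial
  (into #{\_} (map char (concat (range 65 91) (range 97 123)))))  ; A-Z, a-z

(def ident-subseq
  (into ident-initial (map char (range 48 58))))                   ; 0-9

(defrecord Token [type value])
(defrecord Lex [tokens source])

(defn update-lex [^Lex lex ^Token token source]
  (assoc (update lex :tokens conj token) :source source))

(defn scan-identifier [^Lex lex]
  (let [source (:source lex)]
    ;; first/rest instead of [[c & cs :as source] ...] destructuring
    (loop [src   (rest source)
           value [(first source)]]
      (let [c (first src)]
        (if (ident-subseq c)            ; nil at end of input, so the loop stops
          (recur (rest src) (conj value c))
          (update-lex lex (->Token :identifier value) src))))))

(defn scan [^Lex lex]
  (let [source (:source lex)
        c      (first source)]
    (cond
      ;; (char c) coerces to a primitive char, so no reflection is needed
      (Character/isWhitespace (char c)) (assoc lex :source (rest source))
      (ident-initial c)                 (scan-identifier lex))))
      ;; any other character is not handled, as in the original

(defn tokenize [source]
  (loop [lex (->Lex [] source)]
    (if (empty? (:source lex))
      (:tokens lex)
      (recur (scan lex)))))
```

For example, `(tokenize "ab c ")` returns a vector of two `Token` records, one per identifier.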

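So in short:

Type hints avoid all the reflection calls.
Records give you optimised access to and updates of their fields.
first and rest avoid vector destructuring, which uses nth / nthFrom and accesses the seq sequentially.

Hopefully vector destructuring can be optimised to avoid nthFrom in common cases like this one (where the binding contains only first and rest).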
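
If you want to check what destructuring turns into on your own Clojure version, one option is to call `clojure.core/destructure`, the function `let` uses to flatten its binding vector. The snippet below is only an inspection aid; the var names `expansion` and `explicit` are hypothetical, and the exact calls in the printed expansion differ between Clojure versions, so none are listed here.

```clojure
(require '[clojure.pprint :refer [pprint]])

;; Flat binding vector that `let` would use for the destructuring form.
;; `xs` is just a quoted symbol here; nothing is evaluated.
(def expansion (destructure '[[c & cs] xs]))

;; Print it to see which helper calls your Clojure version generates.
(pprint expansion)

;; With plain symbols on the left there is nothing to destructure,
;; so the bindings come back unchanged.
(def explicit (destructure '[c (first xs) cs (rest xs)]))

(pprint explicit)
```

Comparing the two printed vectors shows how much extra work the destructuring form introduces compared with the explicit `first` / `rest` bindings.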

First tuning, with type hints and records:

You can also use records instead of generic maps:

(require '[clojure.set :refer [union]]
         'clojure.string)

(defn char-range-set
  "Generate set containing all characters in the range [from; to]"
  [from to]
  (set (map char (range (int from) (inc (int to))))))

(def ident-initial (union (char-range-set \A \Z) (char-range-set \a \z) #{\_}))

(def ident-subseq (union ident-initial (char-range-set \0 \9)))

(defrecord Token [type value])
(defrecord Lex [tokens source])

(defn update-lex [^Lex lex ^Token token source]
  (assoc (update lex :tokens conj token) :source source))

(defn scan-identifier [^Lex lex]
  (let [[x & xs] (:source lex)]
    (loop [[c & cs :as source] xs
           value               [x]]
      (if (ident-subseq c)
        (recur cs (conj value c))
        (update-lex lex (Token. :identifier value) source)))))

(defn scan [^Lex lex]
  (let [[c & cs] (:source lex)
        tokens   (:tokens lex)]
    (cond
      (Character/isWhitespace ^char c) (assoc lex :source cs)
      (ident-initial c)                (scan-identifier lex))))

(defn tokenize [source]
  (loop [lex (Lex. [] source)]
    (if (empty? (:source lex))
      (:tokens lex)
      (recur (scan lex)))))

(use 'criterium.core)

(defn measure-tokenizer [n]
  (let [s (clojure.string/join (repeat n "abcde "))]
    (bench (tokenize s))
    (* n (count "abcde "))))

(measure-tokenizer 1000)

Using criterium:

Evaluation count : 128700 in 60 samples of 2145 calls.
             Execution time mean : 467.378916 µs
    Execution time std-deviation : 329.455994 ns
   Execution time lower quantile : 466.867909 µs ( 2.5%)
   Execution time upper quantile : 467.984646 µs (97.5%)
                   Overhead used : 1.502982 ns

Compared with the original code:

Evaluation count : 9960 in 60 samples of 166 calls.
             Execution time mean : 6.040209 ms
    Execution time std-deviation : 6.630519 µs
   Execution time lower quantile : 6.028470 ms ( 2.5%)
   Execution time upper quantile : 6.049443 ms (97.5%)
                   Overhead used : 1.502982 ns

The optimised version is about a 13x speedup. With n = 1,000,000 it now takes about 0.5 seconds.

Answer 1
3 2016-11-16 15:00:53

Answer 2
3 2016-11-17 15:27:33
