Slow lexer in clojure
I am trying to write a simple lexer in Clojure. For now, it only recognizes identifiers separated by whitespace.
(refer 'clojure.set :only '[union])

(defn char-range-set
  "Generate set containing all characters in the range [from; to]"
  [from to]
  (set (map char (range (int from) (inc (int to))))))

(def ident-initial (union (char-range-set \A \Z) (char-range-set \a \z) #{\_}))
(def ident-subseq (union ident-initial (char-range-set \0 \9)))

(defn update-lex [lex token source]
  (assoc (update lex :tokens conj token) :source source))

(defn scan-identifier [lex]
  (assert (ident-initial (first (:source lex))))
  (loop [[c & cs :as source] (rest (:source lex))
         value [(first (:source lex))]]
    (if (ident-subseq c)
      (recur cs (conj value c))
      (update-lex lex {:type :identifier :value value} source))))

(defn scan [{tokens :tokens [c & cs :as source] :source :as lex}]
  (cond
    (Character/isWhitespace c) (assoc lex :source cs)
    (ident-initial c) (scan-identifier lex)))

(defn tokenize [source]
  (loop [lex {:tokens [] :source source}]
    (if (empty? (:source lex))
      (:tokens lex)
      (recur (scan lex)))))

(defn measure-tokenizer [n]
  (let [s (clojure.string/join (repeat n "abcde "))]
    (time (tokenize s))
    (* n (count "abcde "))))
The lexer takes about 15 seconds to process six million characters.
=> (measure-tokenizer 1000000)
"Elapsed time: 15865.909399 msecs"
After that, I converted all the maps and vectors to transients. That brought no improvement.
I have also implemented a similar algorithm in C++. It needs only 0.2 seconds for the same input.
My question is: how can I improve this code? Am I using Clojure's data structures incorrectly?
Update:
So here is my C++ code.
#include <iostream>
#include <vector>
#include <chrono>
#include <unordered_set>
#include <cstdlib>
#include <string>
#include <cctype>

using namespace std;

struct Token
{
    enum { IDENTIFIER = 1 };
    int type;
    string value;
};

class Lexer
{
public:
    Lexer(const std::string& source)
        : mSource(source)
        , mIndex(0)
    {
        initCharSets();
    }

    std::vector<Token> tokenize()
    {
        while (mIndex < mSource.size())
        {
            scan();
        }
        return mResult;
    }

private:
    void initCharSets()
    {
        for (char c = 'a'; c <= 'z'; ++c)
            mIdentifierInitial.insert(c);
        for (char c = 'A'; c <= 'Z'; ++c)
            mIdentifierInitial.insert(c);
        mIdentifierInitial.insert('_');
        mIdentifierSubsequent = mIdentifierInitial;
        for (char c = '0'; c <= '9'; ++c)
            mIdentifierSubsequent.insert(c);
    }

    void scan()
    {
        skipSpaces();
        if (mIndex < mSource.size())
        {
            if (mIdentifierInitial.find(mSource[mIndex]) != mIdentifierInitial.end())
            {
                scanIdentifier();
            }
            mResult.push_back(mToken);
        }
    }

    void scanIdentifier()
    {
        size_t i = mIndex;
        while ((i < mSource.size()) && (mIdentifierSubsequent.find(mSource[i]) != mIdentifierSubsequent.end()))
            ++i;
        mToken.type = Token::IDENTIFIER;
        mToken.value = mSource.substr(mIndex, i - mIndex);
        mIndex = i;
    }

    void skipSpaces()
    {
        while ((mIndex < mSource.size()) && std::isspace(mSource[mIndex]))
            ++mIndex;
    }

    unordered_set<char> mIdentifierInitial;
    unordered_set<char> mIdentifierSubsequent;
    string mSource;
    size_t mIndex;
    vector<Token> mResult;
    Token mToken;
};

void measureBigString(int n)
{
    std::string substri = "jobbi ";
    std::string bigstr;
    for (int i = 0; i < n; ++i)
        bigstr += substri;
    Lexer lexer(bigstr);
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    lexer.tokenize();
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << n << endl;
    std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count() << std::endl;
    std::cout << "\n\n\n";
}

int main()
{
    measureBigString(1000000);
    return 0;
}
I don't see anything blatantly wrong with this code. I wouldn't expect transients to help you much here, since you are not bulk-loading but updating once per loop iteration (and I doubt that is actually the slowest part anyway).
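To illustrate the point about bulk loading (my sketch, not code from the question): transients pay off when many conj! calls are batched between a single transient/persistent! pair, whereas the lexer touches its persistent :tokens vector only once per scanned token:

```clojure
;; Sketch: the bulk-loading pattern transients are designed for.
;; One transient vector, many conj! calls, one persistent! at the end.
(defn bulk-load [xs]
  (persistent!
   (reduce conj! (transient []) xs)))

;; The lexer instead does a single conj per token on a persistent
;; vector, so there is no batch of updates for a transient to amortize.
```

For example, `(bulk-load (range 5))` returns `[0 1 2 3 4]`.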
My guesses as to what is slow:
Update:
One more significant tweak is vector destructuring. By replacing code like this:
(let [[c & cs] xs] ...)
with:
(let [c (first xs)
      cs (rest xs)] ...)
you will get another x2 performance gain. In total, that is a x26 speedup, which should be comparable to the C++ implementation.
In short:
Hopefully vector destructuring will eventually be optimized so that a common case like this one, where only first and rest appear in the binding, does not have to go through nthFrom.
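As a rough illustration (my sketch, with hypothetical helper names), the cost difference can be seen in a loop that merely walks a seq of characters with each binding style; exact timings vary by Clojure version and hardware:

```clojure
;; walk-destructure and walk-first-rest are hypothetical helpers that
;; count the characters of a seq; they differ only in how they bind
;; the head and tail of the sequence.
(defn walk-destructure [s]
  (loop [[c & cs] s
         n 0]
    (if c
      (recur cs (inc n))
      n)))

(defn walk-first-rest [s]
  (loop [s s
         n 0]
    (if (first s)
      (recur (rest s) (inc n))
      n)))

(let [s (seq (apply str (repeat 200000 "abcde ")))]
  (time (dotimes [_ 10] (walk-destructure s)))  ; destructuring machinery
  (time (dotimes [_ 10] (walk-first-rest s))))  ; plain first/rest calls
```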
Step one - with type hints and records:
You can also use records instead of generic maps:
(refer 'clojure.set :only '[union])

(defn char-range-set
  "Generate set containing all characters in the range [from; to]"
  [from to]
  (set (map char (range (int from) (inc (int to))))))

(def ident-initial (union (char-range-set \A \Z) (char-range-set \a \z) #{\_}))
(def ident-subseq (union ident-initial (char-range-set \0 \9)))

(defrecord Token [type value])
(defrecord Lex [tokens source])

(defn update-lex [^Lex lex ^Token token source]
  (assoc (update lex :tokens conj token) :source source))

(defn scan-identifier [^Lex lex]
  (let [[x & xs] (:source lex)]
    (loop [[c & cs :as source] xs
           value [x]]
      (if (ident-subseq c)
        (recur cs (conj value c))
        (update-lex lex (Token. :identifier value) source)))))

(defn scan [^Lex lex]
  (let [[c & cs] (:source lex)
        tokens (:tokens lex)]
    (cond
      (Character/isWhitespace ^char c) (assoc lex :source cs)
      (ident-initial c) (scan-identifier lex))))

(defn tokenize [source]
  (loop [lex (Lex. [] source)]
    (if (empty? (:source lex))
      (:tokens lex)
      (recur (scan lex)))))

(use 'criterium.core)

(defn measure-tokenizer [n]
  (let [s (clojure.string/join (repeat n "abcde "))]
    (bench (tokenize s))
    (* n (count "abcde "))))
(measure-tokenizer 1000)
使用標准:
Evaluation count : 128700 in 60 samples of 2145 calls.
Execution time mean : 467.378916 µs
Execution time std-deviation : 329.455994 ns
Execution time lower quantile : 466.867909 µs ( 2.5%)
Execution time upper quantile : 467.984646 µs (97.5%)
Overhead used : 1.502982 ns
Compared to the original code:
Evaluation count : 9960 in 60 samples of 166 calls.
Execution time mean : 6.040209 ms
Execution time std-deviation : 6.630519 µs
Execution time lower quantile : 6.028470 ms ( 2.5%)
Execution time upper quantile : 6.049443 ms (97.5%)
Overhead used : 1.502982 ns
The optimized version is about a x13 speedup. With n = 1,000,000, it now takes about 0.5 seconds.
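Combining both steps, records plus type hints and no destructuring in the hot loop, scan-identifier might be rewritten as below. This is my sketch on top of the record-based definitions above (Lex, Token, ident-subseq, update-lex), not the exact final code:

```clojure
;; Assumes Lex, Token, ident-subseq and update-lex from the
;; record-based version above. The [c & cs] destructuring in the
;; loop is replaced with explicit first/rest calls.
(defn scan-identifier [^Lex lex]
  (let [src (:source lex)]
    (loop [source (rest src)
           value [(first src)]]
      (let [c (first source)]
        (if (ident-subseq c)
          (recur (rest source) (conj value c))
          (update-lex lex (Token. :identifier value) source))))))
```

Note that `(ident-subseq nil)` is nil, so the loop terminates cleanly at the end of input.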