Why does bitset throw an out_of_range error?

Question

I am implementing a Bloom filter with a bitset in C++ to detect malicious URLs. I have a 100-bit bitset and a simple hash function, but I still get this error.

#include<bits/stdc++.h>
using namespace std;
typedef long long int ll;
#define m 100
#define k 1

ll hash1(string str)
{   
    ll value=0;
    for(ll i=0;i<str.size();i++)
    {
        value=(value+str[i])%m;
    }
    return value;
}

int main(int argc, char *argv[])
{   
    vector<bitset<m>>v(1);
    ifstream fin;
    fin.open(argv[1]);

    string line;
    string temp;
    while(getline(fin,line))
    {   
        vector<string>row;
        stringstream s(line);
        string word;
        while(getline(s ,word, ','))
        {
            row.push_back(word);
        }
        if(row.size()!=2) continue;
        for(ll i=0;i<k;i++)
        {
            if(row[1]=="bad")
            {   
                v[0].set(hash1(row[0]));
                cout<<row[0]<<" inserted into bloom filter\n";
            }
        }
        row.clear();
    }
    //Now bitset contains all the malicious urls
    //Generating validation dataset
    fin.clear();
    fin.seekg(0);

    vector<vector<string>>validation;
        
    while(getline(fin,line))
    {
        vector<string>row;
        
        ll x=rand()%10;
        if(x>0) continue;
        
        string word;
        stringstream s(line);
        while(getline(s,word,','))
        {
            row.push_back(word);
        }
        validation.push_back(row);
        row.clear();
    }

    for(ll i=0;i<validation.size();i++)
    {   
        if(v[0].test(hash1(validation[i][0])))
        { 
            if(validation[i][1]=="bad")
            {
                cout<<i+1<<" : True Positive\n";
            }
            else
            {
                cout<<i+1<<" : False Positive\n";
            }
        }
        else 
        {
            cout<<i+1<<" : True Negative\n";
        }
    }
    return 0;
}

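The error is:

terminate called after throwing an instance of 'std::out_of_range'
  what():  bitset::set: __position (which is 18446744073709551587) >= _Nb (which is 100)
Aborted (core dumped)

The dataset contains 2 columns: the URL and good/bad.

Here is the link to the dataset: https://www.kaggle.com/antonyj453/urldataset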

Answer 1

After reading carefully, I think the cause may be signed chars. If signed chars are involved,

for (char i : str) {
    value = (value + i) % m;
}

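might result in a negative hash. However, that's unlikely to really happen, since URLs don't usually contain high-ASCII characters (in fact, you would expect their IDNA-encoded versions in the list).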

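A quick check shows that there are 70 such domains in the list:

xn---21-6cdxogafsegjgafgrkj4opcf2a.xn--p1ai/images/aaa/,bad
xn--cafsolo-dya.de/media/header/A.php,bad
xn--b1afpdqbb8d.xn--p1ai/web/data/mail/-/aaaaa/Made.htm,bad
xn-----6kcbb3dhfdcijbv8e2b.xn--p1ai/libraries/openid/Auth/OpenID/sec/RBI/index/index.htm,bad

etc.

If that's not the case, then something else is throwing out_of_range. Nothing /should/, because operator[] doesn't do bounds checking according to the standard.

However, perhaps some implementations do bounds checking in Debug builds (looking at MSVC here: it has all manner of iterator debugging enabled by default in Debug builds, so maybe this is another such case?).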

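Ironically, you could use some bounds checking yourself, e.g. here:

int main(int argc, char* argv[]) {
    std::vector<std::string_view> const args(argv, argv + argc);
    std::string const filename(args.at(1));

That way, you don't just invoke undefined behaviour when no command-line argument is given.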

Line Ends

There's a gotcha with line ends. The files use CRLF, so the way you parse the columns leaves the last column containing "bad\r" instead of "bad" on Linux.

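Review/Simplify

While looking for other bugs, I simplified the code. It performs a lot better now. Here's a rundown of the suggested improvements.

Includes. Just include what you need, really:

#include <bitset>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>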

No need for quirky typedefs or macros:

static constexpr size_t m = 100;
static constexpr size_t k = 1; // unused for now

As mentioned, guard against signed char results:

static size_t hash1(std::string_view str) {
    size_t value = 0;
    for (unsigned char i : str) {
        value = (value + i) % m;
    }
    return value;
}

Also as mentioned, guard against trailing special characters in the classification text:

enum goodbad { good, bad, invalid }; // The Good, The Bad and...

static goodbad to_classification(std::string_view text) {
    if (text == "bad") return bad;
    if (text == "good") return good;
    return invalid;
}

Next up, a big one. You parse the same file twice, and you duplicate the code. Let's not. Instead, have one function that parses the file and takes a callback that decides what to do with the parsed data.

While we're at it, let's stop the ubiquitous vector<vector<vector<string> > > disease. Really, you know how many columns there are and what they mean. This also greatly reduces allocations.

void process_csv(std::string filename, auto callback) {
    std::ifstream fin(filename);

    std::stringstream ss;
    for (std::string line; getline(fin, line);) {
        std::string_view row(line);

        // eat line end remnants
        row = row.substr(0, row.find_last_not_of("\r\n") + 1);

        if (auto comma = row.find_last_of(','); comma + 1) {
            auto url     = row.substr(0, comma);
            auto goodbad = row.substr(comma + 1);
            auto classif = to_classification(goodbad);

            if (classif == invalid)
                std::cerr << "Ignored unclassified line '" << row << "'\n";
            else
                callback(url, to_classification(goodbad));
        }
    }
}

That's all. Note the key insight: we split on the last comma only. Otherwise you would get wrong results whenever the URL itself contains a comma.

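Now you can piece together the main program:
```
int main(int argc, char* argv[]) {
    std::vector const args(argv, argv + argc);
    std::string const filename(args.at(1));
```
starting with the safe way of using the command-line arguments mentioned earlier,
```
    std::bitset<m> bloom;
```
which cures the verbose vector<vector<> > syndrome (and improves the name: v?!).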

Here's the first file-reading pass:

    size_t bloom_size = 0;
    process_csv(filename, [&](std::string_view url, goodbad classification) {
        if (classification == bad) {
            bloom_size += 1;
            bloom.set(hash1(url));
            //std::cout << url << " inserted into bloom filter\n";
        }
    });

I decided that printing every bad URL is unnecessary (and slow), so let's print their count instead:

    // Now bitset contains all the malicious urls
    std::cerr << "Bloom filter primed with " << bloom_size << " bad urls\n";

Now the validation pass:

    // do a 1 in 10 validation check
    process_csv(filename, [&bloom, line = 0](std::string_view url, goodbad classification) mutable {
        line += 1;
        if (rand() % 10) return; // TODO #include <random>

        auto hit = bloom.test(hash1(url));
        bool expected = (classification == bad);

        std::cout << line << ": " << std::boolalpha << (hit == expected)
                  << (hit ? " positive" : " negative") << "\n";
    });
}

Live Demo

Live On Coliru

#include <bitset>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

static constexpr size_t m = 100;
//static constexpr size_t k = 1;

static size_t hash1(std::string_view str) {
    size_t value = 0;
    for (unsigned char i : str) {
        value = (value + i) % m;
    }
    return value;
}

enum goodbad { good, bad, invalid }; // The Good, The Bad and ...

static goodbad to_classification(std::string_view text) {
    if (text == "bad") return bad;
    if (text == "good") return good;
    return invalid;
}

void process_csv(std::string filename, auto callback) {
    std::ifstream fin(filename);

    std::stringstream ss;
    for (std::string line; getline(fin, line);) {
        std::string_view row(line);

        // eat line end remnants
        row = row.substr(0, row.find_last_not_of("\r\n") + 1);

        if (auto comma = row.find_last_of(','); comma + 1) {
            auto url     = row.substr(0, comma);
            auto goodbad = row.substr(comma + 1);
            auto classif = to_classification(goodbad);

            if (classif == invalid)
                std::cerr << "Ignored unclassified line '" << row << "'\n";
            else
                callback(url, to_classification(goodbad));
        }
    }
}

int main(int argc, char* argv[]) {
    std::vector const args(argv, argv + argc);
    std::string const filename(args.at(1));

    std::bitset<m> bloom;

    size_t bloom_size = 0;
    process_csv(filename, [&](std::string_view url, goodbad classification) {
        if (classification == bad) {
            bloom_size += 1;
            bloom.set(hash1(url));
        }
    });

    // Now bitset contains all the malicious urls
    std::cerr << "Bloom filter primed with " << bloom_size << " bad urls\n";

    // do a 1 in 10 validation check
    process_csv(filename,
        [&bloom, line = 0](std::string_view url,
                           goodbad classification) mutable {
            line += 1;
            if (rand() % 10) return;

            auto hit      = bloom.test(hash1(url));
            bool expected = (classification == bad);

            std::cout << line << ":" << std::boolalpha
                      << (hit == expected)
                      << (hit ? " positive" : " negative") << "\n";
        });
}

On Coliru only the bad dataset fits, so we never get anything other than true positives:

g++ -std=c++2a -O2 -Wall -pedantic -pthread main.cpp
./a.out bad.csv  | cut -d: -f2 | sort | uniq -c | sort -n

Prints

Ignored unclassified line 'url,label'
Bloom filter primed with 75643 bad urls
Ignored unclassified line 'url,label'
   7602 true positive

On my own system:

~~Oh, and it runs in 0.4s instead of 3.7s before.~~

Even faster after the parsing bug fix: 0.093s on average for the full set.


Solution 1
1 Accepted 2021-04-10 00:41:40
