Why is my program slow? How can I improve its efficiency?

Question

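I have a program that performs a block nested loop join. Basically, it reads the contents of a file (say, a 10 GB file) into buffer1 (say, 400 MB) and puts them into a hash table. It then reads the contents of a second file (say, also 10 GB) into buffer2 (say, 100 MB) and checks whether the elements in buffer2 are present in the hash table. The output does not matter; for now I only care about the efficiency of the program. In this program I need to read 8 bytes at a time from both files, so I use long long int. The problem is that my program is very inefficient. How can I make it more efficient?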

// Compiled with: g++ -o hash hash.c -std=c++0x

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <stdint.h>
#include <math.h>
#include <limits.h>
#include <iostream>
#include <algorithm>
#include <vector>
#include <unordered_map>
using namespace std;

typedef std::unordered_map<unsigned long long int, unsigned long long int> Mymap; 
int main()
{

uint64_t block_size1 = (400*1024*1024)/sizeof(long long int);  //block size of Table A - division operator used to make the block size 1 mb - refer line 26,27 malloc statements.
uint64_t block_size2 = (100*1024*1024)/sizeof(long long int);   //block size of table B

int i=0,j=0, k=0;
uint64_t x,z,l=0;
unsigned long long int *buffer1 = (unsigned long long int *)malloc(block_size1 * sizeof(long long int));
unsigned long long int *buffer2 = (unsigned long long int *)malloc(block_size2 * sizeof(long long int));

Mymap c1 ;                                                          // Hash table
//Mymap::iterator it;

FILE *file1 = fopen64("10G1.bin","rb");  // Input is a binary file of 10 GB
FILE *file2 = fopen64("10G2.bin","rb");

printf("size of buffer1 : %llu \n", block_size1 * sizeof(long long int));
printf("size of buffer2 : %llu \n", block_size2 * sizeof(long long int));


while(!feof(file1))
        {
        k++;
        printf("Iterations completed : %d \n",k);
        fread(buffer1, sizeof(long long int), block_size1, file1);                          // Reading the contents into the memory block from first file

        for ( x=0;x< block_size1;x++)
            c1.insert(Mymap::value_type(buffer1[x], x));                                    // inserting values into the hash table

//      std::cout << "The size of the hash table is" << c1.size() * sizeof(Mymap::value_type) << "\n" << endl;

/*      // display contents of the hash table 
            for (Mymap::const_iterator it = c1.begin();it != c1.end(); ++it) 
            std::cout << " [" << it->first << ", " << it->second << "]"; 
            std::cout << std::endl; 
*/

                while(!feof(file2))
                {   
                    i++;                                                                    // Counting the number of iterations    
//                  printf("%d\n",i);

                    fread(buffer2, sizeof(long long int), block_size2, file2);              // Reading the contents into the memory block from second file

                    for ( z=0;z< block_size2;z++)
                        c1.find(buffer2[z]);                                                // finding the element in hash table

//                      if((c1.find(buffer2[z]) != c1.end()) == true)                       //To check the correctness of the code
//                          l++;
//                  printf("The number of elements equal are : %llu\n",l);                  // If input files have exactly same contents "l" should print out the block_size2
//                  l=0;                    
                }
                rewind(file2);
                c1.clear();                                         //clear the contents of the hash table
    }

    free(buffer1);
    free(buffer2);  
    fclose(file1);
    fclose(file2);
}

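Update:

Is it possible to read a block (say, 400 MB) directly from the file and put it straight into the hash table using a C++ stream reader? I think that could reduce the overhead further.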

Answer 1

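If you are using fread, try using setvbuf(). The default buffers used by the standard library file I/O calls are small (often around 4 kB). When processing large amounts of data quickly, you will be I/O bound, and the overhead of fetching many small buffers' worth of data can become a significant bottleneck. Set the buffer to a larger size (e.g. 64 kB or 256 kB) and you can reduce that overhead and may see significant improvements. Try a few values to see where you get the best gains, as you will hit diminishing returns.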
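A minimal sketch of the setvbuf() suggestion, assuming a caller-owned buffer. The helper name open_with_buffer is hypothetical, and the 256 kB size used in the usage note is only one of the values the answer suggests trying.

```cpp
#include <cstdio>
#include <vector>

// Open a binary file for reading and attach a caller-supplied stdio buffer.
// Assumption: buf stays alive until the stream is closed, because the
// stream keeps using that memory after setvbuf() succeeds.
std::FILE* open_with_buffer(const char* path, std::vector<char>& buf) {
    std::FILE* f = std::fopen(path, "rb");
    if (f == nullptr) return nullptr;
    // setvbuf() has to be called before any other operation on the stream.
    if (std::setvbuf(f, buf.data(), _IOFBF, buf.size()) != 0) {
        std::fclose(f);
        return nullptr;
    }
    return f;
}
```

For example, declare std::vector<char> buf(256 * 1024), call open_with_buffer("10G1.bin", buf), and time the run; then repeat with other sizes and compare.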

Answer 2

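The running time of your program is (l₁ x bs₁ x l₂ x bs₂), where l₁ is the number of lines in the first file, bs₁ is the block size of the first buffer, l₂ is the number of lines in the second file, and bs₂ is the block size of the second buffer, because you have four nested loops. Since your block sizes are constant, you can say that your order is O(n x 400 x m x 400) or O(160000mn), or in the worst case O(160000n²), which essentially ends up being O(n²).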

You can have an O(n) algorithm if you do something like this (pseudocode follows):

map = new Map();
duplicate = new List();
unique = new List();

for each line in file1
   map.put(line, true)
end for

for each line in file2
   if(map.get(line))
       duplicate.add(line)
   else
       unique.add(line)
   fi
end for

Now duplicate will contain the list of duplicate items, and unique will contain the list of unique items.
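The pseudocode above could look roughly like this in C++. This is a sketch under assumptions: the records are the 8-byte values from the question rather than text lines, both inputs already fit in memory as vectors, and the names JoinResult and hash_join are made up for illustration.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

struct JoinResult {
    std::vector<std::uint64_t> duplicate;  // keys of b that also occur in a
    std::vector<std::uint64_t> unique;     // keys of b that do not occur in a
};

// Build a hash set from a once, then probe it once per key of b.
JoinResult hash_join(const std::vector<std::uint64_t>& a,
                     const std::vector<std::uint64_t>& b) {
    std::unordered_set<std::uint64_t> seen(a.begin(), a.end());
    JoinResult r;
    for (std::uint64_t key : b) {
        if (seen.count(key) != 0) {
            r.duplicate.push_back(key);
        } else {
            r.unique.push_back(key);
        }
    }
    return r;
}
```

Each input is traversed once, so the expected cost is linear in the total number of keys, at the price of holding all of a in the set.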

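In your original algorithm, you needlessly traverse the second file for every line in the first file. So you actually end up losing the benefit of the hash (which gives you O(1) lookup time). The trade-off in this case, of course, is that you have to store the entire 10 GB in memory, which is probably not that helpful. In cases like this there is usually a trade-off between run time and memory.

There may be a better way to do this. I need to think about it some more. If not, I'm pretty sure someone will come up with a better idea :).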

UPDATE

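You can probably reduce memory usage if you can find a good way to hash the lines (that you read in from the first file) so that you get a unique value (i.e., a one-to-one mapping between a line and its hash value). Basically you would do something like this: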

for each line in file1
   map.put(hash(line), true)
end for

for each line in file2
   if(map.get(hash(line)))
       duplicate.add(line)
   else
       unique.add(line)
   fi
end for

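Here hash is the function that performs the hashing. This way you don't have to store all the lines in memory; you only have to store their hash values. This might help you a little. Even so, in the worst case (where you are comparing two files that are either identical or entirely different) you can still end up with 10 GB in memory for either the duplicate or the unique list. You can get around that, at the cost of losing some information, if you simply store a count of the unique or duplicate items instead of the items themselves.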
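A sketch of the hashed variant that keeps only counts, as the last sentence suggests. Assumptions: lines are already loaded as strings, std::hash stands in for the hash function, and the names Counts and count_by_hash are hypothetical. Note that std::hash is not a one-to-one mapping, so two different lines can collide and be reported as a duplicate; a true one-to-one hash would have to be supplied by you.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

struct Counts {
    std::size_t duplicate = 0;  // lines of file 2 whose hash was seen in file 1
    std::size_t unique = 0;     // lines of file 2 whose hash was not seen
};

// Store only the hash of each line from file 1, never the line itself.
Counts count_by_hash(const std::vector<std::string>& file1_lines,
                     const std::vector<std::string>& file2_lines) {
    std::hash<std::string> hash_line;  // assumption: collisions are acceptable
    std::unordered_set<std::size_t> hashes;
    for (const std::string& line : file1_lines) {
        hashes.insert(hash_line(line));
    }
    Counts c;
    for (const std::string& line : file2_lines) {
        if (hashes.count(hash_line(line)) != 0) {
            ++c.duplicate;
        } else {
            ++c.unique;
        }
    }
    return c;
}
```

Memory use is then bounded by one hash value per distinct line of the first file plus two counters, regardless of how long the lines are.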

Answer 3

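long long int *ptr = mmap() your files, then compare them with memcmp() in chunks. Once a discrepancy is found, step back one chunk and compare the two in more detail. (More detail means one long long int at a time in this case.)

If you expect to find discrepancies often, don't bother with memcmp(); just write your own loop that compares the long long ints to each other.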
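A rough sketch of the mmap() plus memcmp() idea, assuming a POSIX system and a 64-bit process (so that a 10 GB file fits in the address space). The names Mapping, map_file, unmap_file and first_difference, and the default chunk size, are made up for illustration. It compares the two files position by position, which is what this answer describes.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only view of a whole file; data stays nullptr if mapping failed.
struct Mapping {
    const unsigned char* data = nullptr;
    std::size_t size = 0;
};

Mapping map_file(const char* path) {
    Mapping m;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return m;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void* p = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                       PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            m.data = static_cast<const unsigned char*>(p);
            m.size = static_cast<std::size_t>(st.st_size);
        }
    }
    close(fd);  // the mapping remains valid after the descriptor is closed
    return m;
}

void unmap_file(Mapping& m) {
    if (m.data != nullptr) munmap(const_cast<unsigned char*>(m.data), m.size);
    m = Mapping();
}

// Index of the first differing 8-byte word, or -1 if all whole words in the
// common prefix of the two mappings are equal.
long long first_difference(const Mapping& a, const Mapping& b,
                           std::size_t chunk_words = 1 << 16) {
    const std::size_t word = sizeof(std::uint64_t);
    const std::size_t words = std::min(a.size, b.size) / word;
    for (std::size_t start = 0; start < words; start += chunk_words) {
        const std::size_t n = std::min(chunk_words, words - start);
        const unsigned char* pa = a.data + start * word;
        const unsigned char* pb = b.data + start * word;
        if (std::memcmp(pa, pb, n * word) == 0) continue;  // chunk is equal
        // The chunk differs somewhere: go back over it word by word.
        for (std::size_t i = 0; i < n; ++i) {
            if (std::memcmp(pa + i * word, pb + i * word, word) != 0) {
                return static_cast<long long>(start + i);
            }
        }
    }
    return -1;
}
```

If differences are expected in almost every chunk, the outer memcmp() buys little and the inner word-by-word loop alone is the simpler choice, as the answer notes.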

Answer 4

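The only way to know is to profile it, for example with gprof. Create a benchmark of your current implementation, then methodically try other modifications and re-run the benchmark.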

Answer 5

I'd bet you would get better performance if you read in larger chunks. fread() and process multiple blocks per pass.
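One way to structure chunked reading, as a sketch only. The helper name for_each_chunk and its callback signature are assumptions. It loops on the count returned by fread() instead of testing feof() first, so a short final chunk is processed with its real length and no stale buffer contents are reprocessed.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Reads 8-byte values from f in chunks of up to chunk_values and calls
// process(pointer, count) for each chunk. Returns the total values read.
template <typename Fn>
std::uint64_t for_each_chunk(std::FILE* f, std::size_t chunk_values, Fn process) {
    std::vector<std::uint64_t> buf(chunk_values);
    std::uint64_t total = 0;
    std::size_t got = 0;
    // fread() returns how many complete items it read; 0 means EOF or error.
    while ((got = std::fread(buf.data(), sizeof(std::uint64_t),
                             buf.size(), f)) > 0) {
        process(buf.data(), got);
        total += got;
    }
    return total;
}
```

The chunk size becomes a single parameter, which makes it easy to benchmark several sizes against each other.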

Answer 6

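The problem I see is that you are reading the second file n times. That is really slow.

The best way to make this faster is to pre-sort the files and then perform a sort-merge join. In my experience, the sort is almost always worth it.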
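The merge step of a sort-merge join can be sketched as follows. This is an in-memory illustration only: for two 10 GB files the sort would have to be an external one, which is not shown here, and the name sort_merge_join_count is hypothetical. It counts matching pairs instead of emitting them, since the question says the output does not matter.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts the pairs (i, j) with a[i] == b[j]. Both inputs are taken by
// value so they can be sorted without touching the caller's data.
std::uint64_t sort_merge_join_count(std::vector<std::uint64_t> a,
                                    std::vector<std::uint64_t> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::uint64_t matches = 0;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            ++i;
        } else if (b[j] < a[i]) {
            ++j;
        } else {
            // Equal keys: measure the run of this key on each side.
            const std::uint64_t key = a[i];
            std::uint64_t run_a = 0, run_b = 0;
            while (i < a.size() && a[i] == key) { ++i; ++run_a; }
            while (j < b.size() && b[j] == key) { ++j; ++run_b; }
            matches += run_a * run_b;  // every a-copy joins every b-copy
        }
    }
    return matches;
}
```

After sorting, each input is scanned exactly once, so the second file no longer has to be re-read for every block of the first.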

6 answers

Answer 1
3 2010-10-05 21:59:48

Answer 2
2 accepted 2010-10-05 21:46:57

Answer 3
1 2010-10-05 21:48:07

Answer 4
0 2010-10-05 21:43:08

Answer 5
0 2010-10-05 21:49:30

Answer 6
0 2010-10-05 21:50:54