Iteration performance Java vs. C++
This one has puzzled me for three days. I have an application that needs to evaluate a certain set of integer polynomials (in multiple arguments) which have very few terms. I already have an implementation written in Java and I am currently porting it to C++.
During testing, I noticed that the C++ version is orders of magnitude slower than the Java variant. I know about JIT compilation, of course, and that this scenario is particularly well suited to that kind of compiler, but what I see is far from what I had expected.
The sample code is below. You'll need Boost to compile the C++ code (but that dependency is only required for simple time measurement).
choeger@daishi ~/uebb % clang++ -O3 -std=c++11 polytest.cpp -lboost_timer -lboost_system
choeger@daishi ~/uebb % ./a.out
0.011694s wall, 0.010000s user + 0.000000s system = 0.010000s CPU (85.5%)
Ideal Result: 1e+07
0.421986s wall, 0.420000s user + 0.000000s system = 0.420000s CPU (99.5%)
Result: 1e+07
choeger@daishi ~/uebb % javac PolyTest.java
choeger@daishi ~/uebb % java PolyTest
evals: 10000000 runtime: 17ms
Ideal Result: 1.0E7
evals: 10000000 runtime: 78ms
Result: 1.0E7
Apparently, the C++ version (compiled with clang-3.3) runs slightly faster when it comes to pure computational power, but Java (openjdk 1.7.0.60) does much better when the polynomial is interpreted. My guess so far is that my C++ code is not quite optimal due to the iteration over small (in the sample, 1-element) vectors. I assume the JVM does much better here when it comes to cache hits and misses.
Is there any way to make my C++ version perform better? Is there a different cause I just did not see? And as a side note: is there a way to measure cache coherence for a C++ and a Java process?
The C++ code looks like this:
#include <boost/timer/timer.hpp>
#include <iostream>
#include <vector>
using namespace std;
struct Product {
    int factor;
    vector<int> fields;
};
class SumOfProducts {
public:
    vector<Product> sum;
    /**
     * evaluate the polynomial with arguments separated by width
     */
    inline double eval(const double* arg, const int width) const {
        double res = 0.0;
        for (Product p : sum) {
            double prod = p.factor;
            for (int f : p.fields) {
                prod *= arg[f*width];
            }
            res += prod;
        }
        return res;
    };
};
double idealBenchmark(const double* arg, const int width) {
    boost::timer::auto_cpu_timer t;
    double res = 0.0;
    // run 10M evaluations
    for (long l = 0; l < 10000000; l++) {
        res = res + arg[width] * arg[width];
    }
    return res;
}
double benchmark(const double* arg, const SumOfProducts& poly) {
    boost::timer::auto_cpu_timer t;
    double res = 0.0;
    // run 10M evaluations
    for (long l = 0; l < 10000000; l++) {
        res = res + poly.eval(arg, 1);
    }
    return res;
}
int main() {
    //simple polynomial: x_1^2
    Product p;
    p.factor = 1;
    p.fields.push_back(1);
    p.fields.push_back(1);
    SumOfProducts poly;
    poly.sum.push_back(p);
    double arg[] = { 0, 1 };
    double res = idealBenchmark(arg, 1);
    cout << "Ideal Result: " << res << endl;
    res = benchmark(arg, poly);
    cout << "Result: " << res << endl;
}
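Since Boost is only needed for the time measurement, the timing could probably be done with the standard library instead. The following is a minimal sketch, not part of the original post: `elapsedMilliseconds` is a hypothetical helper name, and it measures wall-clock time only (unlike `auto_cpu_timer`, it does not report user or system CPU time).

```cpp
#include <chrono>

// Hypothetical helper (not in the original code): runs `work` once and
// returns the elapsed wall-clock time in milliseconds.
template <typename Work>
double elapsedMilliseconds(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    // Convert the steady_clock duration to fractional milliseconds.
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A benchmark loop would be passed in as a lambda, for example `elapsedMilliseconds([&] { res = benchmark(arg, poly); })`, after removing the `auto_cpu_timer` from the benchmark functions.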
The Java version looks like this:
public class PolyTest {
    static class Product {
        public final int factor;
        public final int[] fields;
        public Product(int pFactor, int[] pFields) {
            factor = pFactor;
            fields = pFields;
        }
    }
    static class SumOfProducts {
        final Product[] sum;
        public SumOfProducts(Product[] pSum) {
            sum = pSum;
        }
        /**
         * evaluate the polynomial with arguments separated by width
         */
        double eval(final double[] arg, final int width) {
            double res = 0.0;
            for (Product p : sum) {
                double prod = p.factor;
                for (int f : p.fields) {
                    prod *= arg[f*width];
                }
                res += prod;
            }
            return res;
        }
    }
    static double idealBenchmark(final double[] arg, final int width) {
        final long start = System.currentTimeMillis();
        double res = 0.0;
        long evals = 0;
        // run 10M evaluations
        for (long l = 0; l < 10000000; l++) {
            evals++;
            res = res + arg[width] * arg[width];
        }
        System.out.println("evals: " + evals + " runtime: " + (System.currentTimeMillis() - start) + "ms");
        return res;
    }
    static double benchmark(final double[] arg, final SumOfProducts poly) {
        final long start = System.currentTimeMillis();
        double res = 0.0;
        long evals = 0;
        // run 10M evaluations
        for (long l = 0; l < 10000000; l++) {
            evals++;
            res = res + poly.eval(arg, 1);
        }
        System.out.println("evals: " + evals + " runtime: " + (System.currentTimeMillis() - start) + "ms");
        return res;
    }
    public static void main(String[] args) {
        //simple polynomial: x_1^2
        Product p = new Product(1, new int[]{1, 1});
        SumOfProducts poly = new SumOfProducts(new Product[]{p});
        double arg[] = { 0, 1 };
        double res = idealBenchmark(arg, 1);
        System.out.println("Ideal Result: " + res);
        res = benchmark(arg, poly);
        System.out.println("Result: " + res);
    }
}
You are making expensive copies here:
for (Product p : sum)
Each copy means fully copying the std::vector<int> data member of each element. Use references instead:
for (const Product& p : sum)
Note that I made them const, because you do not need to change the elements of the range.
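To make the cost visible, the number of copies can be counted directly. The sketch below is only an illustration: `CountedProduct` is a hypothetical stand-in for `Product` whose copy constructor increments a counter, and the two helper functions are invented names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Product that counts how often it is copied.
struct CountedProduct {
    static std::size_t copies;
    int factor = 1;
    std::vector<int> fields;

    CountedProduct() = default;
    CountedProduct(const CountedProduct& other)
        : factor(other.factor), fields(other.fields) { ++copies; }
    CountedProduct& operator=(const CountedProduct&) = default;
};
std::size_t CountedProduct::copies = 0;

// Iterating by value copy-constructs every element (including its vector).
std::size_t copiesWhenIteratingByValue(const std::vector<CountedProduct>& sum) {
    CountedProduct::copies = 0;
    int total = 0;
    for (CountedProduct p : sum) { total += p.factor; }
    (void)total;  // the sum itself is irrelevant here
    return CountedProduct::copies;
}

// Iterating by const reference only binds a reference; nothing is copied.
std::size_t copiesWhenIteratingByReference(const std::vector<CountedProduct>& sum) {
    CountedProduct::copies = 0;
    int total = 0;
    for (const CountedProduct& p : sum) { total += p.factor; }
    (void)total;
    return CountedProduct::copies;
}
```

For a vector of three elements, the by-value loop performs three copies per pass and the by-reference loop performs none. In the benchmark from the question that pass runs 10 million times.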
For starters, you should change this line
for (Product p : sum)
to become
for (Product const& p: sum)
Every iteration, a new Product with its contained std::vector<int> is allocated, copied, and deallocated. I didn't see any other problem of that kind, but since this one is close to the inner loop I'd expect a large impact.
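Applied to the code from the question, the change would look roughly like the following sketch. It keeps the `Product` and `SumOfProducts` names from the question, uses a plain struct for brevity, and differs from the original `eval` only in the type of the outer loop variable.

```cpp
#include <vector>

struct Product {
    int factor;
    std::vector<int> fields;  // indices of the arguments multiplied together
};

struct SumOfProducts {
    std::vector<Product> sum;

    // Evaluate the polynomial with arguments separated by width.
    double eval(const double* arg, int width) const {
        double res = 0.0;
        for (const Product& p : sum) {  // reference: no vector copy per iteration
            double prod = p.factor;
            for (int f : p.fields) {
                prod *= arg[f * width];
            }
            res += prod;
        }
        return res;
    }
};
```

The inner loop already iterates over plain `int` values, so copying there is cheap and needs no change.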
Based on the answer to my question, it looks like you are using the following structure:
struct Product
{
    int factor;
    vector<int> fields;
};
in a highly inefficient manner. That is, the polynomial 4x^3 would be stored as
Product p { 4, {1, 1, 1} };
This is incredibly inefficient both in terms of processing power and memory. Instead, you could store the coefficients of the polynomial in a single vector:
vector<int> Polynomial { 1, 4, 3, 5 }; // 5x^3 + 3x^2 + 4x + 1
Here the degree of each term is determined by its index. Your function to evaluate the polynomial is then just:
#include <cmath>
#include <vector>
int evaluate(int x, const std::vector<int>& polynomial)
{
    int result = 0;
    for (std::size_t i = 0; i < polynomial.size(); ++i)
    {
        // coefficient times x to the correct power
        // (std::pow returns a double; the sum is converted back to int)
        result += polynomial[i] * std::pow(x, i);
    }
    return result;
}
As a side note: the same optimization can be applied to your Java code.
If you don't want to use std::pow for whatever reason, it is simple enough to implement yourself:
int pow(int x, unsigned int p)
{
    int result = 1;
    for (unsigned int i = 0; i < p; ++i)
    {
        result *= x;
    }
    return result;
}
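For the dense coefficient vector, the power function can be avoided altogether with Horner's scheme, which rewrites 5x^3 + 3x^2 + 4x + 1 as ((5x + 3)x + 4)x + 1. This is a different technique from the one shown above, sketched here under the same assumption that index i holds the coefficient of x^i; `evaluateHorner` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Horner's scheme: one multiplication and one addition per coefficient.
// coefficients[i] is assumed to be the coefficient of x^i.
int evaluateHorner(int x, const std::vector<int>& coefficients) {
    int result = 0;
    // Walk from the highest-degree coefficient down to the constant term.
    for (std::size_t i = coefficients.size(); i-- > 0;) {
        result = result * x + coefficients[i];
    }
    return result;
}
```

With the example vector `{ 1, 4, 3, 5 }` and x = 2 this yields 61, the same value as 5·8 + 3·4 + 4·2 + 1. It stays in integer arithmetic throughout, so there is no conversion to and from `double`.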
And if sparse polynomials are your concern:
#include <algorithm>
#include <vector>
struct SubPolynomial
{
    int Coefficient;
    unsigned int Degree;
};
std::vector<SubPolynomial> polynomial;
// takes a vector of SubPolynomial terms, not a vector of int
int evaluate(int x, const std::vector<SubPolynomial>& polynomial)
{
    int result = 0;
    std::for_each(polynomial.begin(), polynomial.end(), [&](const SubPolynomial& s)
    {
        // coefficient times x to the correct power
        result += s.Coefficient * pow(x, s.Degree);
    });
    return result;
}
Note that if you have a full polynomial, you'll be using twice the memory required by the first example. But if you have a sparse polynomial (e.g. a polynomial of degree N with fewer than N / 2 non-zero coefficients), you'll be using at most the same amount of memory.
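The memory claim can be checked with a little arithmetic. The sketch below counts only the element storage of each vector and ignores the vector's own bookkeeping and any spare capacity; `denseBytes` and `sparseBytes` are hypothetical helper names.

```cpp
#include <cstddef>

struct SubPolynomial {
    int Coefficient;
    unsigned int Degree;
};

// Dense form: one int per degree from 0 to `degree`, zero or not.
constexpr std::size_t denseBytes(std::size_t degree) {
    return (degree + 1) * sizeof(int);
}

// Sparse form: one (coefficient, degree) pair per non-zero term.
constexpr std::size_t sparseBytes(std::size_t nonZeroTerms) {
    return nonZeroTerms * sizeof(SubPolynomial);
}
```

For a degree-10 polynomial with all 11 coefficients present, `sparseBytes(11)` is twice `denseBytes(10)`. With only 5 non-zero terms, `sparseBytes(5)` is no larger than `denseBytes(10)`.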