Optimize sort algorithm – or how I managed to get a high performance gain

By , last updated September 16, 2019

I was recently hired as a C++ guru and performance specialist at a fairly large software consultancy company (1000+ employees). This particular division turned out to be a special division. More on that later. They had some software, which was in dire need of some new set of eyes watching over the code. One should think software developers are quite ubiquitous at a place like that, and they are. However, being mainly a managed shop (C#, Java) the presence of C++ developers is simply stunning (</irony>).

The software package is usually divided into a front end and a back end. The front end is full of GUI and is written in C# and .NET. The back end is written in C++. While GUI is not my dish, the back end is totally my dish.

I think most GUIs are ugly, and mine is no exception.

However, writing back end code and services enables me, as a developer and architect, to write beautiful code with beautiful abstractions. And those beautiful abstractions will stay beautiful no matter how far GUI development and research goes.

The problem

The other day, I was faced with a task. Make this routine in this DLL go faster. For some datasets, the routine did “OK”, for others not so OK.

Sometimes, the problem is very easy to understand. And in this case, it was very easy to understand. Make this go faster.

The analysis

As always when optimizing software, you must have a baseline benchmark. You must know if you are improving the routine or not. Actually, the largest anti pattern when it comes to optimizing software, is not measuring performance before and after. Without a baseline, it’s only guesswork and that’s just as good as doing nothing, which is enough getting fired for.

I used a simple case of std::chrono to capture how much time was used for each call to the method. This particular method had no side-effects. All behavior was determined by the input parameters. I repeated the calls about 5-10 times for each test case, and then took the average and measured it on different computers.

And one other thing I’ve learned recently, is that memory latency in a laptop can not be compared with the latency in a multi-socket NUMA server. The laptop is generally 3-4 times faster than the NUMA server / cluster when it comes to single process latency. However, a server have a much higher throughput than an ordinary laptop / single socket workstation.

After some analysis of the method with the test data, I found some hotspots. Since I was hired as a performance specialist, my request for a license for the Intel Parallel Studio 2015 XE was not denied, and it’s usefulness is invaluable. It directed me to what code the CPU used most time with.

The solution

Since I don’t know anything about the data and the meaning behind it, I had the constraint that the data layout and order of the data could not be altered. This meant I had to analyse the predicates and understand the order the predicate wanted to have the data sorted.

The most prominent culprit was a horrible macro named QSORT. By horrible, I mean a real abomination… However, the algorithm may be OK in itself, but the real problem was how it was used.

Here is the previous predicate.

int compareDatatype(const void *a, const void *b)
    datatype  *p1 = (struct datatype *)a, *p2 = (struct datatype *)b;
        return 1;
        return -1;
    if(p1->p1 < p2->p1)
        return -1;                                      // x1 < x2
    if(p1->p1 > p2->p1)
        return 1;                                       // x1 > x2
    if(p1->p2 < p2->p2)
        return -1;                                      // x1 == x2 && y1 < y2
    if(p1->p2 > p2->p2)
        return 1;                                      // x1 == x2 && y1 > y2
    return 0;

It tries to be smart and do more things than necessary. First, it tries to partition the data into logical deleted and non-deleted elements. Sounds familiar? Say hi to std::partition.

The final code, about 10x faster in the general case, and about 20x faster in the worst case scenario.

size_t candMatchTypesOp::partitionByDeleted(datatypeList & arrayStart, size_t arraySize)
	// Partition the data with std::stable_partition with datatype::deleted as predicate
	auto deletedPredicate = [&](const datatype & cand) -> bool
		return !cand.deleted;

	// From beginning to number of elements
	const auto & begin    = arrayStart.begin();
	const auto & end      = begin + arraySize;

	// std::partition is about 10% faster than std::stable_partition
	//const auto & partit = std::stable_partition(begin, end, deletedPredicate);
	const auto & partit = std::partition(begin, end, deletedPredicate);

	// Size of non-deleted partition
	size_t newArraySize = partit - begin;

	// Can probably resize vector here, removing the deleted elements????
	if (newArraySize != arraySize)

	return newArraySize;

void datatypeOp::sortComparePairs(datatypeList & arrayStart, size_t arraySize)

	auto pairCompare = [&](const datatype & a, const datatype & b) -> bool
		if (a.p1 < b.p1)
			return true;       // x1 < x2
		if (a.p1 > b.p1)
			return false;        // x1 > x2

		if (a.p2 < b.p2)
			return true;       // x1 == x2 && y1 < y2
		if (a.p2 > b.p2)
			return false;        // x1 == x2 && y1 > y2

		return false;

	size_t activeElements = partitionByDeleted(arrayStart, arraySize);

	const auto & begin = arrayStart.begin();
	const auto & end = begin + activeElements;

	std::sort(begin, end, pairCompare);

This code is superior to the above code in many ways:

  1. There is no casting from void * to the data type, and the data is passed as a const reference. This means increased type safety.
  2. The data is partitioned before the sorting takes place, with std::partition. The previous version did partition and sorting in the predicate, and they were not entirely consistent either. This confused the original QSORT2 algorithm, causing it to spin until it reached a “stable state”. This reduced the runtime complexity to about O(N) + O(k log k) from O(n log n), where k is the number of non-deleted elements.
  3. Inlining. With the QSORT2 macro, there is no way the predicate can be inlined. For every compare, the predicate has to be called, the arguments cast to the respective types, and then compare. With std::sort, the predicate can be inlined, causing even faster evaluation of the predicates.
  4. And above all, the code is more clear and easily maintainable.

The bonus

After replacing the horrible calls to the QSORT2 macro, there was one more thing bugging me. There were nothing more to gain by replacing sort algorithms. The remaining was a double for loop, with a slow outer loop and a very hot inner loop.

The structure of a loop, is a loop counter and a loop body.

// Old style loop
size_t n = datatypeList.size();
for (size_t i=0; i<n; ++i)
    // Body

By running the body in parallel, there were additional speedups to be gained. Again, the Intel Parallel Studio 2015 XE suite proved to be indispensable. It showed clearly this loop was not parallelized and most of the remaining execution time were in this loop.

By transforming the code like this, and using the Concurrency Runtime (ConCRT):

auto hotloop = [&](size_t i)
    // Body

Concurrency::parallel_for<size_t>(0, listSize, 1, hotloop, Concurrency::affinity_partitioner());

Another 2x-3x speedups were gained. In total, about 40x speedup from before and after.