Assigning integers to sorted unique elements in std::vector

Question

Example: Given std::vector<string> v = {"C", "A", "B", "A"},

we seek

vector<size_t> s := {2,0,1,0}.

These integers are assigned based on sorted order of unique values in v: 0-"A", 1-"B", 2-"C"

One possible way to do this is:

vector<string> unique(v.begin(), v.end());
std::sort(unique.begin(), unique.end());
unique.erase(std::unique(unique.begin(), unique.end()), unique.end());
vector<size_t> s(v.size());
for(size_t i(0); i < v.size(); i++)
{
   s[i] = std::lower_bound(unique.begin(), unique.end(), v[i]) - unique.begin();
}

Is there a more elegant, more compact and, most importantly, more efficient way to do the same thing? I know how to do this with std::map or unordered_map, but not how to get the integers in sorted order.

UPDATE: Obviously the asymptotic complexity cannot be improved: the lower bound is O(n*logn) (as above). However, a different O(n*logn) algorithm could possibly have a better constant factor, or just be nicer :)

[`std::lower_bound`](http://en.cppreference.com/w/cpp/algorithm/lower_bound) has O(logN) complexity, so your algorithm is O(NlogN) (not including the initial sort, which has the same complexity). You will be hard-pressed to beat that no matter what else you do, be it using a set or some other container. An `unordered_set` could perform better provided the hash is perfect or nearly so. — WhozCraig, Mar 11 '14 at 21:41
@WhozCraig: re unordered set, the problem is getting a sorted list of the items. that reintroduces the n log n. essentially, it transforms that unordered set back into an ordinary ordered set, just more complicated. — Cheers and hth. - Alf, Mar 11 '14 at 21:52
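
The hashing idea from the two comments above can be sketched as follows. This is only an illustration, not code from the thread: the function name `rank_via_hashing` is made up here, and it assumes `std::hash<std::string>` behaves well on the input. Sorting is still needed, but only over the k distinct keys, so the sort costs O(k log k) rather than O(n log n).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical helper (name not from the thread): for each element of v,
// returns its index in the sorted list of distinct values of v.
std::vector<std::size_t> rank_via_hashing(const std::vector<std::string>& v)
{
    // Deduplicate with a hash set: expected O(n).
    std::unordered_set<std::string> seen(v.begin(), v.end());

    // Sort only the k distinct keys: O(k log k).
    std::vector<std::string> keys(seen.begin(), seen.end());
    std::sort(keys.begin(), keys.end());

    // Record each key's position in sorted order.
    std::unordered_map<std::string, std::size_t> rank;
    rank.reserve(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i)
        rank.emplace(keys[i], i);

    // One expected O(1) lookup per input element.
    std::vector<std::size_t> s;
    s.reserve(v.size());
    for (const std::string& x : v)
        s.push_back(rank.at(x));
    return s;
}
```

For v = {"C", "A", "B", "A"} this returns {2, 0, 1, 0}. Whether it beats the sorted-vector version depends on how many duplicates the input contains and on the cost of hashing, so it would need measuring.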

Cheers and hth. - Alf · Answer 1 · 2014-03-11T21:46:56.467

2

off the cuff code:

set<string> unique( v.begin(), v.end() );
vector<int> s( v.size() );
for( int i = 0; i < (int) v.size(); ++i )
{
   s[i] = unique.find( v[i] ) - unique.begin();
}

I think this is more elegant and I suspect that it might be a tad more efficient.

Disclaimer: code not touched by compiler's hands, logic not checked by execution…

Update: checking the code, hey set iterators don't support subtraction. so possibly efficiency of this is not so good. but it looks better (more elegant), i think! :-)

Test code:

#include <iostream>
#include <set>
#include <vector>
#include <string>
#include <algorithm>
#include <iterator>
using namespace std;

auto main()
    -> int
{
    vector<string> const v = {"C", "A", "B", "A"};
    set<string> const unique( v.begin(), v.end() );

    vector<int> s( v.size() );
    for( int i = 0; i < (int) v.size(); ++i )
    {
       s[i] = distance( unique.begin(), unique.find( v[i] ) );
    }

    copy( s.begin(), s.end(), ostream_iterator<int>( cout, " " ) );
    cout << endl;
}

edited Mar 11 '14 at 21:46

answered Mar 11 '14 at 21:37


@Alex: happily that turns out not to be the case. he he. :9 – Cheers and hth. - Alf Mar 11 '14 at 21:49
thanks for response. this is more than O(n*log n) :) find is logarithmic, distance is linear + linear loop :) – Oleg Shirokikh Mar 11 '14 at 21:55
it may still be fast(est) for small size. that sort of thing happens often. :9 – Cheers and hth. - Alf Mar 11 '14 at 21:57
possible.. but i seek for general case - small/huge, O(1) or O(n) unique elements, whatever input – Oleg Shirokikh Mar 11 '14 at 21:59
@Alf but the STL set is a BST and building it is O(n^2) in the worst case, while the sort in the original variant is O(n logn). (I did not notice at first that it is a set.) – Spock77 Mar 11 '14 at 22:00
@Alex: Quicksort, the common implementation of `std::sort`, is O(n^2) in the worst case. So the OP's solution is O(n^2) in the worst case. The above turns out to be O(n^2) in *every* case, but that's not due to the `set`. A `set` guarantees O(n log n) time construction from a range, and O(n) if the range is already sorted. Usually that guarantee is implemented by using a red-black tree, which is a roughly balanced binary search tree. – Cheers and hth. - Alf Mar 11 '14 at 22:06
@Alf quicksort was used for std::sort in C++98/03, it's true, and it gives O(n^2) in the same case as a plain BST (sorted input: inserting all n elements at O(n) each gives O(n^2)). **But in C++11 this is no longer the case**: the standard guarantees O(n logn) for [std::sort for all input](http://en.cppreference.com/w/cpp/algorithm/sort). – Spock77 Mar 11 '14 at 22:21
@Alf but an RB-tree is O(n log n) as it is self-balancing. So the two solutions seem equal in big-O terms. – Spock77 Mar 11 '14 at 22:50
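
The linear `distance` call that the comments above object to can be avoided by storing each key's rank next to it. A minimal sketch, assuming a `std::map` is acceptable in place of the `set` (the name `rank_via_map` is invented for this example and is not from the answer): the map is filled first, then walked once in key order to number the keys, so every later lookup is a single O(log k) `find`.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (name not from the thread): for each element of v,
// returns its index in the sorted list of distinct values of v.
std::vector<std::size_t> rank_via_map(const std::vector<std::string>& v)
{
    // Insert every key once; duplicates are ignored by emplace.
    std::map<std::string, std::size_t> rank;
    for (const std::string& x : v)
        rank.emplace(x, 0);  // placeholder rank, assigned below

    // std::map iterates in ascending key order, so numbering the
    // entries during one pass yields the sorted rank of each key.
    std::size_t next = 0;
    for (auto& entry : rank)
        entry.second = next++;

    // O(log k) lookup per element, with no call to std::distance.
    std::vector<std::size_t> s;
    s.reserve(v.size());
    for (const std::string& x : v)
        s.push_back(rank.find(x)->second);
    return s;
}
```

This keeps the total at O(n log n), the same bound as the sorted-vector version in the question, though the node-based map will likely have a larger constant factor.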

Spock77 · Answer 2 · 2014-03-11T23:59:48.757

A little analysis of the two implementations: the sorted vector (the original) and the set. The original variant with a sorted vector should be faster than using a set.

In big-O terms the two solutions are equivalent. In the initialization phase we sort the vector once or build the set once. The complexity of std::sort is O(n logn) in the worst case since C++11, and inserting n elements into a std::set is likewise O(n logn) (standard 2011 23.2.4). (As a rule, set is implemented as a red-black tree.) In the second step, the search, lower_bound is O(logn) per lookup and so is set::find.

But in terms of constant factors, searching a sorted vector (lower_bound) should be faster than set::find, because the vector uses contiguous memory, which is friendly to the processor caches. For instance, this analysis shows the sorted vector being about twice as fast and using 3 times less memory. It is worth measuring on your own data and hardware (the results can be quite interesting).

So if we do not need to insert into the sorted vector afterwards, it is the preferable choice.

score 0 · Answer 3 · answered Mar 11 '14 at 21:41

If you're going for efficiency, and have only a small number of different values that can easily be mapped to small integers (like letters), take a look at http://en.wikipedia.org/wiki/Counting_sort.

There are plenty of implementation examples out there. See: http://www.codeproject.com/Tips/290197/Cplusplus-Count-Sort-Implementation for one.
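
For the special case described above, here is a rough sketch of the counting idea. It is not taken from either link, and it assumes each element is a single byte, so the input is passed as a `std::string` whose characters are the elements; the function name `rank_small_alphabet` is made up for this example. No comparison sort is needed: one pass marks which byte values occur, one pass over the 256 possible values numbers them in order, and one pass writes the result.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (name not from the thread): treats each character
// of v as one element and returns its index among the sorted distinct
// characters. Assumes single-byte values (at most 256 distinct keys).
std::vector<std::size_t> rank_small_alphabet(const std::string& v)
{
    // Mark which byte values occur: O(n).
    std::array<bool, 256> present{};  // value-initialized to false
    for (unsigned char c : v)
        present[c] = true;

    // Number the occurring values in ascending order: O(256).
    std::array<std::size_t, 256> rank{};
    std::size_t next = 0;
    for (std::size_t c = 0; c < 256; ++c)
        if (present[c])
            rank[c] = next++;

    // Table lookup per element: O(n).
    std::vector<std::size_t> s;
    s.reserve(v.size());
    for (unsigned char c : v)
        s.push_back(rank[c]);
    return s;
}
```

Calling `rank_small_alphabet("CABA")` gives {2, 0, 1, 0}. The approach stops being practical once the keys are arbitrary strings, as in the question, because the lookup table would have to cover the whole key space.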
