C++ set/unordered_set | "Duplicate Removal" Complete Guide

C++ set/unordered_set | "Duplicate Removal" Complete Guide

이 글의 핵심

Covers selection criteria for set and unordered_set, multiset·hash·iterator rules, and practical set operations.

set vs unordered_set

Featuresetunordered_set
SortingO (auto-sorted)X
SpeedO(log n)O(1)
DuplicatesNot allowedNot allowed
TraversalSorted orderRandom

set vs unordered_set Performance Comparison

From time complexity (average/amortized) perspective, set is based on balanced binary search tree (typically red-black), so insert/delete/search is O(log n). unordered_set is a hash table, so average O(1), but with many hash collisions can approach worst case O(n).

Practical differences are:

  • Small n·order needed: If element count is hundreds~thousands and need sorted traversal or range search like lower_bound, set is simpler.
  • Large n·key lookup only: For millions+ where you only need fast “exists/not exists” check and order doesn’t matter, unordered_set is often advantageous.
  • Memory: Hash table has additional overhead from buckets/load factor, tree has pointer overhead per node. Varies by n and key size, so measurement is safe.
  • Cache locality: Neither uses contiguous memory, both have frequent skip access causing cache misses. Hard to say “unordered is always faster” based on traversal alone.

Selection Summary: If need order/range search use set, if keys are hash-friendly (integers, well-distributed strings) and mostly lookups, consider unordered_set first.

multiset and unordered_multiset

set / unordered_set store only one of each key. If need same value multiple times, use multiset, unordered_multiset.

Featuremultisetunordered_multiset
OrderMaintains sort (same value insertion order is implementation-defined)No order
Lookupcount, equal_range for same key rangecount, equal_range
Deleteerase(key) removes all of that key, use iterator to remove oneSame
#include <set>
#include <unordered_set>
#include <iostream>

int main() {
    std::multiset<int> ms{1, 1, 2, 2, 2};
    std::cout << ms.count(2) << '\n';  // 3

    auto [first, last] = ms.equal_range(2);
    for (auto it = first; it != last; ++it) { /* ... */ }

    std::unordered_multiset<std::string> ums;
    ums.insert("a"); ums.insert("a");
    std::cout << ums.count("a") << '\n';  // 2
}

When only top k needed, multiset maintains sorted state making it easy to handle max/min at ends. If only counting frequency and order not needed, unordered_multiset or unordered_map<T, int> is often better.

Custom Comparator and Hash Function

set: Compare and operator<

set sorting criterion must satisfy strict weak ordering. Can use operator< or std::less for custom types.

struct Person {
    std::string name;
    int id;
};

struct ById {
    bool operator()(const Person& a, const Person& b) const {
        return a.id < b.id;
    }
};

std::set<Person, ById> by_id_set;

unordered_set: Hash and KeyEqual

In unordered_set<Key, Hash, KeyEqual>, same key must always produce same hash, and KeyEqual determines actual equality on hash collision. Custom types typically provide both.

struct Point {
    int x, y;
    bool operator==(const Point& o) const { return x == o.x && y == o.y; }
};

struct PointHash {
    std::size_t operator()(const Point& p) const noexcept {
        // Simple combination example (consider boost::hash_combine for projects)
        return std::hash<int>{}(p.x) ^ (std::hash<int>{}(p.y) << 1);
    }
};

std::unordered_set<Point, PointHash> pts;

Caution: When putting strings in unordered_set, bad custom hash causes severe performance drop. If key is std::string, typically use default hash.

Practice: Duplicate Removal·Intersection·Union Summary

  • Duplicate removal only with original order preservation: Putting in set then moving to vector changes order. If order preservation needed, consider sort then unique or unordered_set + separate order array.
  • Intersection·union·difference: Above set_intersection / set_union / set_difference patterns require both ranges sorted. Using set automatically satisfies sorted order via iterators.
  • unordered_set operations: If library doesn’t provide one-shot operations, collect one side into vector and sort, or iterate smaller side and check inclusion in other with count/find (choose better algorithm by size).

Iterator Invalidation

set / multiset

  • insert / emplace: Existing element iterators and references not invalidated.
  • erase(it): Only that iterator invalid. Other iterators maintained.
  • erase(key) / clear: Iterators to deleted elements invalid.

unordered_set / unordered_multiset

  • Rehash changes bucket structure, all iterators may be invalid. insert exceeding load factor can trigger rehash, so indiscriminate insertion during iteration is dangerous.
  • reserve(n) (or appropriate bucket count reservation) reduces rehash frequency, making iterator invalidation less frequent.
  • erase: Only iterator at deleted position invalid, other iterators usually valid after C++11, but implementation-dependent, so using erase’s return iterator pattern is safe.
for (auto it = s.begin(); it != s.end(); ) {
    if (condition) it = s.erase(it);  // erase returns next iterator
    else ++it;
}

set Basic Usage

#include <set>
#include <iostream>
using namespace std;

int main() {
    set<int> s;
    
    // Insert
    s.insert(3);
    s.insert(1);
    s.insert(2);
    s.insert(1);  // Duplicate ignored
    
    // Traverse (auto-sorted: 1, 2, 3)
    for (int x : s) {
        cout << x << " ";
    }
    
    // Search
    if (s.find(2) != s.end()) {
        cout << "\n2 exists" << endl;
    }
    
    // Delete
    s.erase(2);
    
    // Size
    cout << "Size: " << s.size() << endl;
    
    return 0;
}

unordered_set Basic Usage

#include <unordered_set>
#include <iostream>
using namespace std;

int main() {
    unordered_set<string> words;
    
    words.insert("apple");
    words.insert("banana");
    words.insert("apple");  // Duplicate ignored
    
    // Check existence (fast)
    if (words.count("apple") > 0) {
        cout << "apple exists" << endl;
    }
    
    // Traverse (order not guaranteed)
    for (const string& word : words) {
        cout << word << " ";
    }
    
    return 0;
}

Practical Examples

Example 1: Remove Duplicates from Array

#include <iostream>
#include <vector>
#include <set>
using namespace std;

vector<int> removeDuplicates(const vector<int>& arr) {
    set<int> s(arr.begin(), arr.end());
    return vector<int>(s.begin(), s.end());
}

int main() {
    vector<int> arr = {5, 2, 8, 2, 9, 5, 1, 8};
    
    cout << "Original: ";
    for (int x : arr) cout << x << " ";
    
    vector<int> unique = removeDuplicates(arr);
    
    cout << "\nDuplicate removed (sorted): ";
    for (int x : unique) cout << x << " ";
    
    return 0;
}

Explanation: Most basic pattern utilizing set’s auto-sorting and duplicate removal.

Example 2: Intersection/Union of Two Arrays

#include <iostream>
#include <set>
#include <algorithm>
using namespace std;

int main() {
    set<int> a = {1, 2, 3, 4, 5};
    set<int> b = {3, 4, 5, 6, 7};
    
    // Intersection
    set<int> intersection;
    set_intersection(a.begin(), a.end(),
                     b.begin(), b.end(),
                     inserter(intersection, intersection.begin()));
    
    cout << "Intersection: ";
    for (int x : intersection) cout << x << " ";  // 3 4 5
    
    // Union
    set<int> union_set;
    set_union(a.begin(), a.end(),
              b.begin(), b.end(),
              inserter(union_set, union_set.begin()));
    
    cout << "\nUnion: ";
    for (int x : union_set) cout << x << " ";  // 1 2 3 4 5 6 7
    
    // Difference
    set<int> difference;
    set_difference(a.begin(), a.end(),
                   b.begin(), b.end(),
                   inserter(difference, difference.begin()));
    
    cout << "\nDifference (A-B): ";
    for (int x : difference) cout << x << " ";  // 1 2
    
    return 0;
}

Explanation: Can implement mathematical set operations directly. Frequently used in algorithm problems.

Example 3: Visit Check (Graph Traversal)

#include <iostream>
#include <unordered_set>
#include <queue>
#include <vector>
using namespace std;

void bfs(int start, vector<vector<int>>& graph) {
    unordered_set<int> visited;
    queue<int> q;
    
    q.push(start);
    visited.insert(start);
    
    while (!q.empty()) {
        int node = q.front();
        q.pop();
        
        cout << node << " ";
        
        for (int neighbor : graph[node]) {
            if (visited.count(neighbor) == 0) {
                visited.insert(neighbor);
                q.push(neighbor);
            }
        }
    }
}

int main() {
    // Graph: 0-1, 0-2, 1-3, 2-3
    vector<vector<int>> graph = {
        {1, 2},    // Node 0's neighbors
        {0, 3},    // Node 1's neighbors
        {0, 3},    // Node 2's neighbors
        {1, 2}     // Node 3's neighbors
    };
    
    cout << "BFS traversal: ";
    bfs(0, graph);
    
    return 0;
}

Explanation: Use unordered_set to check visited status at O(1) speed. Essential pattern for graph algorithms.

Common Problems

Problem 1: Cannot Modify set Elements

Symptom: Compile error when trying to modify set element directly

Cause: set elements are const to maintain sorting

Solution:

// ❌ Wrong code
set<int> s = {1, 2, 3};
auto it = s.find(2);
*it = 5;  // Compile error! const int&

// ✅ Correct code (delete then reinsert)
set<int> s = {1, 2, 3};
auto it = s.find(2);
if (it != s.end()) {
    s.erase(it);
    s.insert(5);
}

// ✅ For structs (use mutable)
struct Item {
    int id;
    mutable int count;  // mutable can be modified
    
    bool operator<(const Item& other) const {
        return id < other.id;
    }
};

set<Item> items;
auto it = items.find(Item{1, 0});
if (it != items.end()) {
    it->count++;  // OK (mutable)
}

Problem 2: Custom Comparison Function

Symptom: Compile error when inserting custom type into set

Cause: Needs operator< or comparison function

Solution:

// ❌ Compile error
struct Point {
    int x, y;
};

set<Point> points;  // Error! No operator<

// ✅ Method 1: Define operator<
struct Point {
    int x, y;
    
    bool operator<(const Point& other) const {
        if (x != other.x) return x < other.x;
        return y < other.y;
    }
};

set<Point> points;  // OK

// ✅ Method 2: Comparison function object
struct PointCompare {
    bool operator()(const Point& a, const Point& b) const {
        if (a.x != b.x) return a.x < b.x;
        return a.y < b.y;
    }
};

set<Point, PointCompare> points;  // OK

// ✅ Method 3: Lambda (C++11)
auto cmp = [](const Point& a, const Point& b) {
    if (a.x != b.x) return a.x < b.x;
    return a.y < b.y;
};

set<Point, decltype(cmp)> points(cmp);  // OK

Problem 3: Confusion with multiset

Symptom: Want to allow duplicates but set doesn’t

Cause: set doesn’t allow duplicates, multiset does

Solution:

// ❌ set doesn't allow duplicates
set<int> s;
s.insert(1);
s.insert(1);
s.insert(1);
cout << s.size();  // 1 (duplicates ignored)

// ✅ Use multiset
#include <set>
multiset<int> ms;
ms.insert(1);
ms.insert(1);
ms.insert(1);
cout << ms.size();  // 3 (duplicates allowed)

// Count specific value
cout << ms.count(1);  // 3

// Delete all of specific value
ms.erase(1);  // All 1s deleted

// Delete only one
auto it = ms.find(1);
if (it != ms.end()) {
    ms.erase(it);  // Delete only one
}

Performance Optimization

Optimization Strategies

  1. Choose efficient data structure

    • How: Use appropriate STL container for situation
    • Effect: Improve time complexity
  2. Prevent unnecessary copying

    • How: Use reference passing
    • Effect: Reduce memory usage
  3. Compiler optimization

    • How: Use -O2, -O3 flags
    • Effect: Improve execution speed

Benchmark Results

MethodExecution TimeMemory UsageNotes
Basic implementation100ms10MB-
Optimization 180ms8MBReference passing
Optimization 250ms5MBSTL algorithms

Conclusion: Appropriate optimization can improve performance 2x or more

Caution: Absolute time/memory numbers in above table are examples. In real services, remeasure before/after with same input and compile options.


Advanced: reserve·rehash·Load Distribution

unordered_set triggers rehash when exceeding load factor, at which point all iterators may be invalid. If approximate element count known before insertion:

std::unordered_set<int> s;
s.reserve(1'000'000); // Reduce rehash frequency, stabilize benchmark

set doesn’t have reserve, but reducing unnecessary copies/temporary objects helps perceived performance.


Advanced: Heterogeneous Lookup (C++20, unordered_set)

When key is std::string, to prevent find("literal") from creating temporary std::string, need custom policy with transparent hash/transparent equality (std::hash<std::string_view>, std::equal_to<>). Standard specialization varies by implementation/version, so check project’s standard library documentation.

// Concept example: lookup with string_view (must verify project policy/standard version)
// std::unordered_set<std::string, SVHash, SVEqual> etc.

Advanced: node_type / extract (C++17)

set/unordered_set can extract nodes and move to other containers, used for patterns that update keys without reallocation.

std::set<int> a{1, 2, 3};
auto nh = a.extract(2);
if (!nh.empty()) {
    nh.value() = 5;
    a.insert(std::move(nh));
}

unordered_set similar but watch rehash timing.


Advanced: Benchmark Method (Practically)

  1. Fix: Compiler optimization (-O2/-O3), CPU fixed frequency (if possible), same seed.
  2. Warmup: Start timer after cache/hash table warmup.
  3. Metrics: Not just average but p95/p99, especially for unordered_* include worst-case hash input as separate case.
#include <chrono>

template<typename F>
auto bench(F&& f, int reps) {
    using clock = std::chrono::steady_clock;
    auto t0 = clock::now();
    for (int i = 0; i < reps; ++i) f();
    return clock::now() - t0;
}

Advanced: Debugging Guide

SymptomCheck
Slow in unordered_setHash quality, key collision, load factor, wrong operator==
Abnormal behavior during iterationRehash invalidates iterators — use reserve or copy-then-iterate
set sorting is strangeCheck if comparison operation satisfies strict weak ordering

Advanced: Common Mistake Patterns (Additional)

  • Using vector as key in unordered_set with only default hash → Hash inefficient or definition missing. Use immutable key as identifier or define custom hash.
  • Wrong iterator increment during erase while iterating → Use it = container.erase(it) pattern.
  • Assuming const member read-only is safe in multithreading → unordered_set needs synchronization even for reads (read-only snapshot or shared_mutex design).

FAQ

Q1: Can beginners learn this?

A: Yes, this guide is written for beginners. Basic C++ syntax knowledge is sufficient.

Q2: Frequently used in practice?

A: Yes, very frequently. Essential concept in production projects.

Q3: Comparison with other languages?

A: C++ advantage is performance and control. Faster than Python, more flexible than Java.

Q4: How long to learn?

A: Basic concepts 1-2 hours, mastery about 1-2 weeks.

A:

  1. Learn basic syntax
  2. Follow simple examples
  3. Apply to practical projects
  4. Learn advanced techniques

Q6: Common mistakes?

A:

  • Not initializing
  • Memory management mistakes
  • Not considering time complexity
  • Missing exception handling

Posts connected to this topic.

  • C++ string | “String Processing” Complete Guide [Practical Function Summary]
  • C++ Algorithm Set | “Set Algorithms” Guide
  • C++ STL vector | “More Convenient Than Arrays” Vector Complete Guide [Practical Examples]
  • C++ map vs unordered_map (STL Series)
  • C++ map·set Complete Guide | ordered vs unordered· Custom Keys

Practical Tips

Tips you can apply immediately in practice.

Debugging Tips

  • Check compiler warnings first when problems occur
  • Reproduce problem with simple test cases
  • Use print debugging to trace execution

Performance Tips

  • Choose correct data structure for use case
  • Measure before optimizing
  • Profile to find bottlenecks

Code Review Tips

  • Pre-check frequently pointed out parts
  • Follow team coding conventions
  • Write clear variable names

Practical Checklist

When Choosing Container

  • Need sorted order?
  • Need range queries (lower_bound)?
  • Only existence check?
  • What is expected element count?

When Using Custom Types

  • Defined operator< or comparator?
  • Defined hash and operator==?
  • Tested with sample data?

When Iterating

  • Modifying container during iteration?
  • Using safe erase pattern?
  • Considered iterator invalidation?

Keywords

C++, set, unordered_set, Set, STL, Duplicate Removal, Hash Table, Binary Search Tree, Iterator Invalidation, Custom Comparator, multiset