Binary Literacy 2: Static Analysis of C++ with Hex-Rays

This week-long class is designed to teach students the features of C++ that are most commonly encountered in binaries, their implementation details, and how to cope with them in Hex-Rays while reverse engineering. The class also covers Hex-Rays as a tool in great detail. With practice, after completing this course, students should be able to produce databases such as these:

Prerequisites:

  • The student should have a working understanding of C programming. Knowledge of C++ is not required.

  • The student should have a copy of IDA and Hex-Rays, and experience using IDA for static reverse engineering of C code. If you are uncomfortable with static reverse engineering in general, consider taking Binary Literacy 1 before taking this class.

  • Binary Literacy 1 is not a formal prerequisite, but it does cover everything required for this class. In particular, C++ reverse engineering involves a lot of type reconstruction, which Binary Literacy 1 covers methodically and with plenty of exercises. (This class will briefly review type reconstruction at an accelerated pace.)

Course Background

C++ reverse engineering is an uncommon skill and topic. The few writeups on C++ binaries that exist are usually light on details. Roughly every year since the late 1990s, a handful of scattered tutorials have been published on virtual functions, inheritance, exception handling, and/or the standard template library (STL). However, since important parts of C++ are not standardized, and hence are implemented differently between compilers and platforms, these publications generally age poorly. No cohesive, comprehensive materials on C++ reverse engineering have emerged in public.

C++ is a huge, complex, and rapidly evolving language with unique features. Former C++ programmers who return after a hiatus struggle to reacclimate themselves to major features introduced in the meantime. Owing to its complexity and its limitations, hobbyists tend to choose languages other than C++. Owing to its niche specialization in high-performance applications, and its rapid evolution, few programmers who are not employed professionally as C++ developers can justify the time investment of keeping up to date with the language.

Course Philosophy

Binary Literacy 1, the predecessor to this class, contained a module on C++. However, every time we taught the material, we found that students -- even the ones most excited to learn about C++ -- struggled with it. Upon discussion and reflection, the fundamental issue was that students were generally unfamiliar with C++ as a language, and particularly, how C++ programmers use its features to develop real software. A student who doesn't understand why programmers use virtual functions or templates, or what role multiple inheritance plays in software design, has little use for details of their implementation; they will struggle when encountering these constructs in binaries. These observations led to the design philosophy for this course. Students are not assumed to have experience programming in C++.

Most features of C++ that are not in C came about because of common situations in software development for which C offered poor solutions. For every feature of C++ that we cover, we discuss the limitations of C that led to the introduction of that feature, and we show examples of using it in the course of developing real software. Our blog entry about STL template type reconstruction shows an example of this educational approach.
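
As a small illustration of this approach (not taken from the course materials), consider resource management. The sketch below contrasts a C-style structure, whose caller must remember to pair an init call with a free call, with a C++ class whose constructor and destructor tie the allocation to the object's lifetime. All names here (CBuffer, Buffer, live_count, live_after_scope) are hypothetical and chosen only for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// C style: nothing forces the caller to pair cbuffer_init with cbuffer_free.
struct CBuffer {
    char       *data;
    std::size_t size;
};

bool cbuffer_init(CBuffer *b, std::size_t size) {
    b->data = static_cast<char *>(std::malloc(size));
    b->size = b->data ? size : 0;
    return b->data != nullptr;
}

void cbuffer_free(CBuffer *b) {
    std::free(b->data);
    b->data = nullptr;
    b->size = 0;
}

// C++ style: the compiler inserts the destructor call when the object
// goes out of scope, so the cleanup cannot be forgotten.
class Buffer {
public:
    explicit Buffer(std::size_t size) : data_(new char[size]()), size_(size) {
        ++live_count;  // hypothetical counter, used only to observe lifetimes
    }
    ~Buffer() {
        delete[] data_;
        --live_count;
    }
    Buffer(const Buffer &) = delete;             // forbid copies so the
    Buffer &operator=(const Buffer &) = delete;  // allocation has one owner

    std::size_t size() const { return size_; }

    static int live_count;

private:
    char       *data_;
    std::size_t size_;
};

int Buffer::live_count = 0;

// Returns the number of live Buffer objects after a local one has
// gone out of scope.
int live_after_scope() {
    {
        Buffer b(64);
        assert(Buffer::live_count == 1);
    }  // ~Buffer() runs here without any explicit call in the source
    return Buffer::live_count;
}
```

In a binary, the implicit destructor call at the end of the inner scope appears as an ordinary function call that has no counterpart in the source text, which is one reason constructors and destructors receive their own treatment when reverse engineering.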

Course Syllabus

Most of the binaries, and the primary coverage, shall be drawn from Microsoft Visual C++ binaries compiled for Windows. Where other platforms or compilers differ substantially (such as for virtual function tables and multiple inheritance), we shall discuss those differences.

  • Elements of software design in C

    • Modularity
    • Encapsulation
    • Access control
  • Structures

    • Their role in software design
    • Alignment
    • How structure accesses are compiled
    • Passing and returning structures by value
    • Compiler optimizations involving structures
    • Structure reconstruction
  • Classes

    • The limitations of structures
    • Access control
    • Class methods and __thiscall
    • Resource management
    • Global, local, and dynamic storage
    • Constructors and destructors
  • Miscellaneous topics in C++

    • References vs. pointers
    • Name mangling
    • Namespaces
    • Operator overloading
  • Inheritance

    • Motivation from C programming
    • Implementation in C++
    • Inheritance as a data structure design technique
    • Comparison with other programming languages
    • Discovering inheritance relationships
  • Virtual functions

    • How function pointers are used in C programming
    • Virtual functions as an improvement upon function pointers
    • Implementation via virtual tables
    • Reconstructing VTables, resolving virtual functions
  • Multiple inheritance

    • Motivation
    • Similarities with single inheritance
    • Differences with single inheritance
    • Dealing with multiple inheritance in Hex-Rays' type system
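
To make the virtual functions portion of the syllabus concrete, here is a minimal sketch (not taken from the course materials) of the same dispatch written twice: once with a hand-written table of function pointers, as a C programmer might do it, and once with C++ virtual functions. The type and function names are hypothetical. Placing the table pointer as the first member of the C-style object mirrors the layout compilers commonly use for simple single-inheritance classes, but the exact layout is compiler-specific and should be treated as an assumption.

```cpp
#include <cassert>

// C style: a hand-written "virtual table" of function pointers.
struct ShapeC;

struct ShapeVTable {
    int (*area)(const ShapeC *self);
};

struct ShapeC {
    const ShapeVTable *vtable;  // assumed first member, like a typical vptr
    int width;
    int height;
};

int rect_area(const ShapeC *s) { return s->width * s->height; }
int tri_area(const ShapeC *s)  { return s->width * s->height / 2; }

const ShapeVTable kRectVTable = { rect_area };
const ShapeVTable kTriVTable  = { tri_area };

// Indirect call through the table: load the table pointer from the
// object, load the function pointer from the table, then call it.
int area_c(const ShapeC *s) { return s->vtable->area(s); }

// C++ style: the compiler generates the tables and the indirect call.
class Shape {
public:
    Shape(int w, int h) : width_(w), height_(h) {}
    virtual ~Shape() = default;
    virtual int area() const = 0;

protected:
    int width_;
    int height_;
};

class Rect : public Shape {
public:
    using Shape::Shape;
    int area() const override { return width_ * height_; }
};

class Tri : public Shape {
public:
    using Shape::Shape;
    int area() const override { return width_ * height_ / 2; }
};

// The callee is chosen at run time from the object's dynamic type.
int area_cpp(const Shape &s) { return s.area(); }
```

For example, a ShapeC initialized as { &kRectVTable, 4, 6 } and a Rect constructed as Rect(4, 6) both yield an area of 24. The load-load-call pattern in area_c is roughly what a virtual call looks like in a decompiler before the table has been given a structure type.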

Course Limitations

As discussed above, C++ is a huge language and gains new features every few years. A week-long class is not enough to cover even all of the common features of C++ circa 2003, let alone those of the 2020s.

Also as discussed, the goal of the course is to instill in students specific practical skills to use when reverse engineering C++ binaries. Therefore, we opted to cover the features most common in real-world binaries in depth, rather than take time away from the core material for superficial treatment of other features.

Consequently, the material on C++ templates and the STL has been removed, and we will not cover exception handling or virtual inheritance. As for more modern features, if they are not listed in the syllabus above, they will not be covered. Perhaps the future holds a Binary Literacy 3 to cover additional topics.