LLVM is a collection of modular and reusable compiler and toolchain technologies used to build compilers for languages such as C, C++, Rust and CUDA.

Better profiling using compiler-inspired optimisation strategies.

Inspired by LLVM, our Python SDK uses a variation of Profile-Guided Optimisation (PGO) to instrument your code with measurements that our platform can turn into useful insights. PGO is an approach to compiler optimisation that collects profile data about the typical execution of a program and then uses that data to guide optimisation decisions. Our SDK uses a similar strategy to help you collect useful data about your code, which we turn into insights beyond what would typically be available.



PGO: (1) build an execution profile of a program and then (2) use that profile to guide our optimisation.

An Intermediate Representation (IR) is the form that code takes as it moves through some of the stages of a compiler, between source code and machine code.

What does PGO help us to do?

PGO is a useful strategy for doing 'informed' optimisation. The basic idea is that our compilers can better evaluate the trade-offs of different choices if we know a little more about the execution patterns of our code. For example, a compiler might choose to inline a frequently called function or make different assumptions about machine-code layout. The key is that some choices can only be made once we know that certain conditions or patterns hold, and PGO helps us find that information by profiling our code.
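
As a loose illustration of the idea (in Python rather than inside a compiler, with hypothetical names), the sketch below builds a simple execution profile by counting how often each function is called during a sample run; a compiler armed with counts like these could decide, for instance, that a hot function is worth inlining.

import sys
from collections import Counter

call_counts = Counter()

def profiler(frame, event, arg):
    # Record every Python-level function call during the sample run.
    if event == "call":
        call_counts[frame.f_code.co_name] += 1

def hot_loop():
    for _ in range(1000):
        helper()

def helper():
    return sum(range(10))

sys.setprofile(profiler)   # start building the execution profile
hot_loop()
sys.setprofile(None)       # stop profiling

# Counts like these are the raw material for profile-guided decisions.
print(call_counts.most_common(3))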

LLVM partitions PGO into two sorts of profiling strategies:

  • Sampling 'looks in from the outside' to collect data like hardware counters without much overhead or intrusion.
  • Instrumentation measures the execution intrinsically, from within the program itself, to collect more detailed data.

There are different ways of achieving either strategy, but the most modern is 'IR-level' instrumentation, which inserts the instrumentation intrinsics early in compilation, while the program is in its intermediate representation. By gathering data about the execution of a program, we hope to uncover patterns in the code that the compiler can exploit to reduce program size, accelerate runtimes or make better use of memory. Clever instrumentation at the intermediate representation is one way to achieve that.
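
To make the distinction concrete, here is a minimal Python sketch of the instrumentation strategy: a decorator bakes the measurement into the program itself and records detailed, per-call data. (Sampling, by contrast, would observe the running program from the outside.) The decorator and the timings dictionary are purely illustrative, not part of any real toolchain.

import time
from functools import wraps

timings = {}  # function name -> list of per-call durations

def instrument(func):
    """Bake a measurement 'intrinsic' into the function itself."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            timings.setdefault(func.__name__, []).append(time.perf_counter() - start)
    return wrapper

@instrument
def train_step(batch):
    # Stand-in for real work.
    return sum(x * x for x in batch)

for _ in range(100):
    train_step(range(64))

print(len(timings["train_step"]), "measurements collected")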

...but wait, Python isn't a compiled language?

Python is an interpreted language that is compiled into bytecode by the CPython interpreter at runtime. Broadly speaking, the 'compiler' is an implementation detail and modules are more likely to be shared as compressed source code than .pyc files (compiled bytecode). So why do we care about PGO?
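
The bytecode step is easy to see from the standard library: the dis module disassembles the bytecode that CPython compiles a function into.

import dis

def add(a, b):
    return a + b

# CPython has already compiled `add` to bytecode; dis shows that bytecode.
dis.dis(add)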

"Similarly, we use IR-level PGO-inspired strategies to instrument your AI/ML learning code so that we can profile it during execution to help you gather useful insights about your work without much extra work."

The trick of PGO is that we can instrument code by inserting instrumentation intrinsics into the intermediate representation (IR). There is a lot of information available during this stage of compilation that can help us to choose what kind of instrumentation we should use for different components of the code. Similarly, we use IR-level, PGO-inspired strategies to instrument your AI/ML learning code so that we can profile it during execution, helping you gather useful insights into the structure, behaviour and patterns of your learning algorithms.

We can compile Python into .pyc bytecode files, but it definitely isn't a compiled language like C++ or Rust (.whl files are built, not compiled).

PGO workflow from Python code to instrumentation and then to execution.

Abstract Syntax Trees are a representation of code whose syntax we can manipulate, within the bounds of the language, to generate new code.
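
As a quick illustration, the standard ast module can parse source code into such a tree and generate new source from it (assuming Python 3.9+ for ast.unparse).

import ast

tree = ast.parse("def square(x):\n    return x * x")
print(ast.dump(tree, indent=2))  # inspect the syntax tree
print(ast.unparse(tree))         # generate source code back from the tree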

What does that look like?

There are two components to IR-level, instrumentation-based profiling:

  1. Instrumentation of the program where the required measurement intrinsics are inserted into the IR.
  2. Execution of the program to generate profiling data from the measurement intrinsics.

We start by parsing the Python source code into an Abstract Syntax Tree (AST), which our SDK analyses to find the best places to insert our measurement instrumentation. The insertion is tricky because we need to reach inside most components of the code to understand what kind of information we can profile. Our SDK looks for things like how a function affects complexity, safety and performance metrics on your model weights, or whether a function calls out to an API, which is distinct from typical benchmarking strategies. Next, we take the instrumented AST and execute it with sample data provided by you (our user). During the execution, our measurement intrinsics gather profile data and stitch together a representation of the computational fabric of the program, suited to analysis with our API.
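
As a rough sketch of the first component (our SDK does considerably more analysis when deciding where and what to instrument), an ast.NodeTransformer can rewrite every function body so that hypothetical measurement intrinsics wrap its execution.

import ast

SOURCE = """
def train(weights, grads):
    return [w - 0.1 * g for w, g in zip(weights, grads)]
"""

class InsertMeasurements(ast.NodeTransformer):
    """Wrap each function body in hypothetical start/end measurement calls."""
    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        start = ast.parse("start_measurement()").body[0]
        end = ast.parse("end_measurement()").body[0]
        node.body = [start] + node.body + [end]
        return node

tree = InsertMeasurements().visit(ast.parse(SOURCE))
ast.fix_missing_locations(tree)
print(ast.unparse(tree))  # the instrumented code we would go on to execute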

Measurements like performance benchmarks need to be run many times to account for factors like cache warmups that distort the profile.
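
In Python, a common way to handle this is to repeat the measurement over several independent trials and inspect the spread, e.g. with the standard timeit module (the statement here is just a stand-in).

import timeit

# 5 independent trials of 1,000 runs each, so warm-up effects are
# amortised and visible as variation across the trials.
trials = timeit.repeat("sum(range(1000))", repeat=5, number=1000)
print(min(trials), max(trials))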

What are some challenges with this approach?

Instrumentation-based PGO is effective at gathering detailed data because the measurement intrinsics are baked into the execution of the program. If we're not careful, the measurement intrinsics can pollute the collected data because they are themselves a part of the program. If both bar and foo are instrumented with a driver to measure their wall-clock times, then the measured execution time of foo will include the time taken to measure bar!

def bar(*args):
    start_measurement()  # pollutes the measurement of foo
    ...
    end_measurement()

def foo(*args):
    start_measurement()
    bar(*args)
    end_measurement()

One approach is to minimise the 'interference' generated by our instrumentation so that the profile data is noisy but workable. But when the information that we want to collect is complex or difficult to work with, adding a 'non-interference' constraint can make our work excessively challenging. More importantly, minimising the interference of a measurement is likely to reduce the robustness of our profiling, so we're making a direct trade-off between performance and quality. The real problem is that the instrumented program is in fact two programs (the instrumentation and the original program) that need to be separated so that we can collect clean profiling data without noise from our instrumentation.

Our approach

To do that in our SDK, we 'box up' components at the IR level to extract their dependencies and 'isolate' their execution from the rest of the program. To 'box up' a function, we execute it, capture its effect on the state of the program (as well as any dependencies) and then store it in a repeatable 'box'. Isolation lets us repeat pieces of the computation under different conditions and measurements, and even evaluate 'what if' conditions during execution. As a result, we can collect detailed and clean profile data that can be transformed into analysable insights. With that information, we hope to provide the community with new ways of looking at machine-learning pipelines that can help us build bolder, more robust and more intelligent AI systems.
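
As a heavily simplified illustration of the 'boxing' idea (the class and its mechanics here are hypothetical, not our SDK's actual implementation), a box can capture a callable together with copies of its inputs so the call can be replayed in isolation, under different measurements, without re-running the rest of the program.

import copy
import time

class Box:
    """Capture a callable and its inputs so the call can be replayed in isolation."""
    def __init__(self, func, *args, **kwargs):
        self.func = func
        # Deep-copy the inputs so later mutation elsewhere cannot leak into replays.
        self.args = copy.deepcopy(args)
        self.kwargs = copy.deepcopy(kwargs)

    def replay(self, repeats=1):
        """Re-run the boxed call, measuring each repetition separately."""
        results = []
        for _ in range(repeats):
            start = time.perf_counter()
            value = self.func(*copy.deepcopy(self.args), **copy.deepcopy(self.kwargs))
            results.append((value, time.perf_counter() - start))
        return results

def update_weights(weights, grads, lr=0.1):
    return [w - lr * g for w, g in zip(weights, grads)]

box = Box(update_weights, [0.5, 0.2], [0.1, 0.4], lr=0.05)
print(box.replay(repeats=3))  # repeat the isolated computation under measurement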