
The latest 32-bit microcontrollers (MCUs), such as the PIC32 MCUs from Microchip, are a major departure from the 8- and 16-bit era. The new PIC32 devices have as much as 512 Kbytes of Flash, up to 32 Kbytes of SRAM, and 32 general purpose registers. With fast DSP instructions for multiply and divide, a 256-byte instruction cache, a five-stage pipeline, DMA, and fast context switching, the PIC32 offers instruction throughput of 1.56 DMIPS/MHz.
One way to squeeze the maximum performance from a PIC32 is to choose a compilation methodology that can exploit the benefits of the architecture. There are two approaches to compilation for the PIC32 architecture: conventional compilation, which optimizes and generates object code independently for each program module; and omniscient compilation, which optimizes code based on a view of all program modules across the entire program.
Most PIC32 compilers generate code by individually compiling each program module into an independent sequence of low-level machine instructions, without any knowledge about what is in the other modules. Once all the modules are compiled, a linker links the modules together, along with any code being used from pre-compiled libraries.
The drawback to this approach is that the compiler never has complete information about the program being compiled. Optimization is done within modules, but not between them.
This situation leads to restrictive, fixed calling conventions, specified by core and MCU vendors, to prevent register data from being overwritten. Conventional compilers also guard against overwriting by saving and restoring very large contexts that include every register that might be used by an interrupt. About 30% of a RISC CPU's cycles are spent on load/store instructions, many of which are the result of restrictive calling conventions and extensive context saving.
Omniscient Code Generation
On the other hand, omniscient code generation (OCG) fosters more efficient, dynamic register use. Newer compilers are now available with OCG technology. OCG has the intelligence to eliminate arbitrary restrictions on register usage and to save only those context registers that are used for each particular interrupt. OCG works by collecting comprehensive data on register, stack, pointer, object and variable declarations from all program modules before compiling the code (see Fig). It analyzes all the program modules in one step, and extracts a call graph structure. The OCG compiler creates a pointer reference graph that tracks each instance of a variable having its address taken, plus each instance of an assignment of one pointer to another (either directly, via function return, via function parameter passing, or indirectly via another pointer). It then identifies all objects that can possibly be referenced by each pointer. This information is used to determine the exact size and scope of each pointer variable. For PIC32 devices, all pointers are 32 bits wide, as there is no advantage in allocating smaller pointers. However, the OCG compiler detects when a pointer has only one target and side-steps the pointer completely, making it a direct access.
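The single-target case can be illustrated with a minimal sketch. The names `sensor_total`, `total_ptr`, `accumulate_via_pointer`, `accumulate_direct` and `read_total` are hypothetical and do not come from any program discussed in this article. The first function stores through a pointer that is only ever given one address; the second shows the direct access that a whole-program pointer analysis would permit. Whether a particular compiler actually performs this substitution is an assumption here.

```c
#include <stdint.h>

/* Hypothetical object: the only target 'total_ptr' ever has. */
static int32_t sensor_total;

/* External linkage: a per-module compiler must assume that some other
 * module could reassign this pointer, so it cannot prove a single target.
 * A whole-program view can. */
int32_t *total_ptr = &sensor_total;

/* Source as written: an indirect read-modify-write through the pointer. */
void accumulate_via_pointer(int32_t sample)
{
    *total_ptr += sample;
}

/* Equivalent direct access, valid only because the pointer has one target. */
void accumulate_direct(int32_t sample)
{
    sensor_total += sample;
}

/* Accessor so callers can observe the accumulated value. */
int32_t read_total(void)
{
    return sensor_total;
}
```

Both functions update the same object, so replacing one with the other does not change the program's behavior; the direct form simply avoids loading the pointer first.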
Since an OCG compiler knows exactly which registers are available at any point in the program and also which registers will be needed for every interrupt function in the program, it can generate code that maximizes register coverage, and minimizes stack utilization, code size, and the number of cycles required to save and restore those registers.
While conventional compilers save every register that might be used by an interrupt, an OCG compiler knows exactly which registers will be used by every interrupt function in the program. It has the intelligence to minimize the size of the context that needs to be switched. This capability improves performance and reduces code size by limiting the number of save and restore instructions that must be generated and executed. It also conserves SRAM resources by minimizing the amount of SRAM that is used to store saved registers.
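The effect of context size on SRAM can be modeled with a small sketch. This is an illustrative model only, not how any compiler represents contexts internally: register usage is assumed to be a 32-bit mask (bit n set means register n must be saved), and both example masks are hypothetical.

```c
#include <stdint.h>

/* PIC32 general purpose registers are 32 bits (4 bytes) wide. */
#define BYTES_PER_REGISTER 4u

/* Hypothetical masks, chosen only for illustration. */
#define MASK_CONSERVATIVE 0x0000FFFEu  /* every register some handler might use */
#define MASK_THIS_HANDLER 0x0000000Cu  /* registers one handler actually uses */

/* Count the registers named in a mask. */
unsigned registers_in_mask(uint32_t mask)
{
    unsigned count = 0;
    while (mask != 0u) {
        mask &= mask - 1u;  /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Bytes of SRAM needed to save every register named in the mask. */
unsigned context_bytes(uint32_t mask)
{
    return registers_in_mask(mask) * BYTES_PER_REGISTER;
}
```

With these made-up masks, `context_bytes(MASK_CONSERVATIVE)` is 60 bytes and `context_bytes(MASK_THIS_HANDLER)` is 8 bytes. Each register dropped from the context also removes one save and one restore instruction from the handler.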
Depending on the application, the cycle savings can be substantial. When compiled by a conventional, non-OCG compiler, a simple benchmark program with 65,535 interrupts requires 8,650,624 cycles to execute on a PIC32 running at 80 MHz with two wait states. The same program compiled by an OCG compiler takes only 6,356,898 cycles, or 26.5% fewer. In effect, the OCG compiler frees about a quarter of the CPU's cycles for other work in this program. A more interrupt-intensive program could see an even larger performance improvement.
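The benchmark figures can be checked with a few lines of arithmetic. The helper names below are hypothetical; the inputs are the cycle and interrupt counts quoted above.

```c
#include <stdint.h>

/* Percentage reduction in cycles going from 'before' to 'after'.
 * Assumes after <= before and before != 0. */
double percent_fewer_cycles(uint64_t before, uint64_t after)
{
    return 100.0 * (double)(before - after) / (double)before;
}

/* Average number of cycles saved per interrupt over the whole run.
 * Assumes after <= before and interrupts != 0. */
double cycles_saved_per_interrupt(uint64_t before, uint64_t after,
                                  uint32_t interrupts)
{
    return (double)(before - after) / (double)interrupts;
}
```

For 8,650,624 and 6,356,898 cycles the reduction is about 26.5%, and spread over 65,535 interrupts the difference works out to roughly 35 cycles saved per interrupt on average.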
As embedded programs become more sensor-driven and interrupt intensive, minimizing interrupt overhead becomes an important component of getting the best performance possible from the PIC32 microcontroller. Part of this responsibility falls on the software engineer, who should take care to keep interrupts as small as possible. In addition, care should be taken to select a compiler with the intelligence to minimize the number of registers to be saved so interrupt latency is lower and CPU cycles are conserved for other computation.
The large number of registers on the PIC32 provides a substantial opportunity for boosting CPU performance, because function parameters and other data that are stored in SRAM on smaller MCUs can be kept in registers, which have a zero-cycle access penalty. Efficiently exploiting the PIC32's 32 registers can reduce the number of load/store cycles, freeing them up for computation and potentially improving processor throughput.
How much the compiler "knows" about the register usage of a called function plays a big role in how efficiently the compiler exploits the PIC32's registers. Conventional compilation technology relies on rigid, static calling conventions that define specific CPU registers for use across calls. All functions in the program must adhere to the same calling convention.
The calling convention in most PIC32 compilers specifies a fixed set of four registers for passing parameters to functions. They cannot be used for anything else. If the function requires more than four parameters, the extra parameters are passed on the stack in SRAM, even if other non-reserved registers are available.
In a sample program, with function calls nested three deep and each function having six parameters, a non-OCG PIC32 compiler uses four registers for passing parameters between functions, as required by the calling convention. It frequently moves parameters between the four registers and the stack. This data shuffling consumes 144 bytes of stack space in SRAM and generates 476 instructions to move data between the registers and the stack, which take 118 CPU instruction cycles to execute every time the call sequence runs.
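A program of the shape described might look like the following sketch. It is not the author's actual benchmark: the function names `level1`, `level2` and `level3` and the arithmetic inside them are hypothetical, chosen only to give three nested calls with six parameters each.

```c
#include <stdint.h>

/* Innermost call: consumes all six parameters. */
int32_t level3(int32_t a, int32_t b, int32_t c,
               int32_t d, int32_t e, int32_t f)
{
    return a + b + c + d + e + f;
}

/* Middle call: under a four-register convention, the fifth and sixth
 * arguments would be passed on the stack rather than in registers. */
int32_t level2(int32_t a, int32_t b, int32_t c,
               int32_t d, int32_t e, int32_t f)
{
    return level3(a + 1, b + 1, c + 1, d + 1, e + 1, f + 1);
}

/* Outermost call: starts the chain of three nested calls. */
int32_t level1(int32_t a, int32_t b, int32_t c,
               int32_t d, int32_t e, int32_t f)
{
    return level2(a * 2, b * 2, c * 2, d * 2, e * 2, f * 2);
}
```

Calling `level1(1, 2, 3, 4, 5, 6)` returns 48. In a realistic project the three functions would sit in separate modules, which is where a per-module compiler loses sight of how many registers the callee really needs.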
Rather than relying on static calling conventions to allocate data between an unknown number of available registers, an OCG-based compiler defers object code generation until a view of the whole program is available. Based on this global view of the complete program, the code generator employs optimization techniques that fully optimize register coverage. Since the compiler knows which registers are available at every point in the program, it can allocate the registers dynamically, based on the resources that are actually available at the time.
The PIC32 has such a large register set that it is usually possible to completely avoid using the stack for function parameters. For the sample program, the OCG compiler determined that 18 registers were available for the nested functions at this point in the program. Accordingly, it allocated all the parameters to registers and used the stack only for the return address, cutting stack usage to only 16 bytes of SRAM. Because the parameters remained in the registers as long as they were needed, about 30% fewer instructions were required (336 instead of 476), and the number of instruction cycles required to execute fell from 118 to only 80, about a third fewer.
Recall that about 30% of a RISC CPU's cycles are typically spent on load/store instructions. If the smaller contexts and more flexible register coverage provided by an OCG compiler can cut that share by one third (to 20%), the number of cycles available for actual processing will increase by about 14% (from 70% to 80% of cycles). At 80 MHz, this is equivalent to increasing the PIC32's substantial 125 DMIPS capability to roughly 143 DMIPS. In addition, since fewer instructions are generated to move data, the code size is smaller and the amount of SRAM required for the stack is also smaller, potentially allowing the use of a less expensive microcontroller with smaller Flash and SRAM memories.
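The arithmetic behind this estimate can be sketched as follows. The function names are hypothetical, and the model is deliberately simple: it assumes, as the paragraph above does, that every cycle not spent on load/store overhead is available for useful computation.

```c
/* Fraction of cycles left for computation, given the fraction
 * spent on load/store instructions. */
double useful_fraction(double load_store_fraction)
{
    return 1.0 - load_store_fraction;
}

/* Relative gain in useful cycles when the load/store share falls
 * from 'before' to 'after' (both expressed as fractions of 1.0). */
double throughput_gain(double before, double after)
{
    return useful_fraction(after) / useful_fraction(before);
}

/* Effective DMIPS after applying a gain to a DMIPS/MHz rating. */
double scaled_dmips(double dmips_per_mhz, double mhz, double gain)
{
    return dmips_per_mhz * mhz * gain;
}
```

With load/store shares of 0.30 and 0.20, `throughput_gain` is 0.8 / 0.7, or about 1.14. Applied to 1.56 DMIPS/MHz at 80 MHz (124.8 DMIPS), that gives a little under 143 DMIPS.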
The compilation methodology used can make a significant contribution to code density and execution speed. More importantly, however, OCG technology allows embedded C programs to be written without the use of architecture-specific extensions, maintaining compile-and-go portability.
by Clyde Stubbs, Founder and CEO, Hi-Tech Software