Photo by Beth Macdonald on Unsplash

C Bare Metal Programming on STM32F103 — Booting up

Mattia Maldini
20 min read · Jul 24, 2022

In the embedded software development field, bare metal refers to a minimalist approach that rejects all kinds of SDKs and tooling that make the developer’s life easier.
No IDE, no SDK, no preconfigured project template for the MCU in front of you; just you, an editor, a compiler and the cold, hard hardware (no pun intended).

Why would anyone do that?

Well, if your objective is to get things done quickly and effectively, there is really no reason not to rely on solid tools (solid is paramount here — many MCU manufacturers provide unreliable IDEs); there is, however, an educational benefit in doing everything from scratch.
It leads to a deeper understanding of our work, and that is worth a couple of days of effort.
Last but not least, it is satisfying and fun.

For this tutorial I’m going to assume a reasonably competent level of C programming and a basic understanding of compiler commands.
At the same time, I’ll do my best to explain the inner workings of the hardware as if you had never heard of such a thing as a computer.

Let’s get into it.

The target

The subject of this tutorial is the inexpensive and easy to obtain Blue Pill development board.
It sports an ARM microcontroller, the STM32F103 — that, or a knockoff clone that is good enough for our purposes — with a whopping 128KB of Flash memory and 20KB of RAM.

You can grab one on multiple online platforms that also sell the required tools to interface with it (see this other tutorial for details on how to actually load the firmware).

As this is more of a conceptual effort, the only feature of the Blue Pill that I’m going to use is the on-board green LED attached to PC13.
To show signs of life and prove that our work is functional, that’s more than enough.

The language

C, a timeless classic. We could (and will) put a twist on this exercise, but not now.

The tools

For our task, grab:

  • any editor of your choice. I like a mix of VSCode and NeoVim, but anything that can read and write a C source file is fair game.
  • an ARM C compiler. I’m going to use arm-none-eabi-gcc; llvm would work just as well, but it is less commonly used.
  • an external tool to load firmware onto the Blue Pill and the related software. Again, see this article for an in-depth guide on flashing.

That’s it, nothing else.

The project

Without any template project or tool that creates it for us, how do we get started?
Well, it is supposed to be a C example, so create a file named `main.c` with a main function in it:

int main(void) {
    return 0;
}

This, of course, does nothing.
The return value itself is only there to please the compiler, which may complain if main has any other signature.

Before moving onto compilation we have to consider the problem of how to tell the CPU to execute this code.
Come to think of it, how does any processor know to begin the execution of a C program from main? What does it even mean to “execute” code?

The foundations

First of all, let’s lay down some core concepts. Be aware that what follows is a gross oversimplification.

When we give commands to the machine, we are controlling two main elements:

  1. The processor. The CPU. The mind of the operation. The C compiler translates our code into machine instructions that said mind understands and can execute.
  2. The memory. An endless (not really; barely more than 100KB in this case) string of bytes that the processor manipulates to carry on its calculations.
    It is addressable, which means that given an address (nothing more than a number) we can ask to read and/or write the contents of the memory cell it refers to — with some restrictions.

While a fairly complex construct, we are not really interested in the details of the CPU: suffice to say that it can execute instructions and interact with the memory.
The memory itself can be divided into three more categories:

  1. Registers: very fast, ephemeral and expensive memory. The processor has only a handful of those and it uses them for the fastest computations.
  2. RAM: fast and more abundant memory (20KB). It can be read and written fairly quickly, but the contents do not survive a power off.
  3. FLASH: available in great quantities but slow (128KB). What is written on it stays there potentially forever, even while the beast slumbers.
    In fact, not all parts of Flash memory can even be written, but that’s another story.

Note: as the STM32F103 is a 32-bit device, each memory cell (or “word”) is 32 bits wide (4 bytes).
Every register has this size, and RAM and FLASH memory can be read and written with this as the base unit.

So, there is a known set of instructions (the ARM instruction set) that the CPU can understand, and they all revolve around basic arithmetic operations and reading and writing these 3 kinds of memory.
FLASH memory contains our program (as a sequence of instructions) and RAM holds the data that it is processing.
Executing our program means feeding instructions to the CPU and manipulating memory until we get the desired result.

Specifically, there are two registers that help in this task: the Stack Pointer (SP) and the Program Counter (PC).
Both are interpreted as addresses in memory. While SP points to the area of RAM we are currently working on, PC indicates the instruction currently being executed.
Upon completing every instruction (unless the instruction itself says otherwise) the PC increases, moving on to the next one.

Now, back to how to tell the little computer to run the main function.
Given what we just said it should be reasonable to say that we must set the two aforementioned registers, SP and PC, to values coherent with the start of the main function.

Easier said than done.
Our program can manipulate the registers, but we need the registers to have a specific value for our program to start.
It’s a chicken-and-egg problem, and it must be solved by none other than the hardware itself.

First of all: what happens when the Blue Pill is powered on?
To really understand the effects of the flow of electricity inside the circuit of our creature we must delve deep into the sacred texts: namely, the programming manual.
This part is the most boring, so I’m going to skip to the interesting bits.

Document PM0056, STM32F10xxx/20xxx/21xxx/L1xxxx Cortex®-M3 programming manual; chapter 2 (The Cortex-M3 Processor), section 3 (Exception Model), subsection 2 (Exception Types).
The Reset exception.

Reset is invoked on power up or a warm reset. The exception model
treats reset as a special form of exception. When reset is asserted,
the operation of the processor stops, potentially at any point in an
instruction. When reset is deasserted, execution restarts from the
address provided by the reset entry in the vector table. […]

So, when the CPU awakens, a Reset exception is asserted.
Broadly speaking, an exception is an event that causes the processor to “change the context of execution”.
Some registers change value, PC and SP among them.
The act of powering on leads to a “Reset” exception, in which case “execution restarts from the address provided by the reset entry in the vector table”.

What is the vector table, you ask? Chapter 2, section 3, subsection 4:

The vector table contains the reset value of the stack pointer, and the start addresses, also
called exception vectors, for all exception handlers. […]
The least-significant bit of each vector must be
1, indicating that the exception handler is Thumb code. […]
On system reset, the vector table is fixed at address 0x00000000.

So when an exception arises, the corresponding entry in the vector table is loaded into the PC register as the address of the code to execute.
Additionally, the first entry in the vector table is the initial value assigned to the SP register.

The vector table contains a total of 83 entries, but we are only interested in those first two.
The “Stack” in “Stack Pointer register” is the area of memory where our program will store the data it’s working on.
As the need for RAM grows, so does the stack; in this particular family of devices the stack grows towards lower addresses.
For this reason, the first entry in the vector table (and the SP register with it) should be initialized to the last address of available RAM, so that the software can make full use of it as it counts downward.

The second entry should be the pointer to the main function.
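
To make those two entries more concrete, here is a minimal sketch in plain C that can run on any desktop machine. It assumes that RAM starts at 0x20000000 and is 20KB long (values from the device memory map); the names `SKETCH_RAM_ORIGIN`, `SKETCH_RAM_LENGTH`, `vector_table_head` and `make_vector_table_head` are made up for this illustration, and so is the handler address used in the usage note.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed RAM layout of the STM32F103 (hypothetical names). */
#define SKETCH_RAM_ORIGIN 0x20000000u
#define SKETCH_RAM_LENGTH (20u * 1024u)

/* The first two entries of the vector table. */
struct vector_table_head {
    uint32_t initial_sp;   /* entry 0: value loaded into SP on reset */
    uint32_t reset_vector; /* entry 1: address loaded into PC on reset */
};

/* Build the two entries for a handler located at handler_address. */
static struct vector_table_head make_vector_table_head(uint32_t handler_address) {
    struct vector_table_head head;

    /* Instructions are halfword aligned, so bit 0 of the address is free. */
    assert((handler_address & 1u) == 0u);

    /* The stack grows downward, so SP starts at the end of RAM. */
    head.initial_sp = SKETCH_RAM_ORIGIN + SKETCH_RAM_LENGTH;
    /* Bit 0 set to 1 marks the handler as Thumb code. */
    head.reset_vector = handler_address | 1u;
    return head;
}
```

Calling `make_vector_table_head(0x0800014C)` with that invented address yields 0x20005000 for the stack pointer and 0x0800014D for the reset entry. On the real device nobody calls a function like this, of course: the two values simply have to be sitting in memory when the chip wakes up.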

Those two steps are enough to get our program rolling, and they amount to writing two specific memory cells with specific values.
The issue is that this should be done before any of our code is executed. How?

Memory maps

Imagine for a minute you already have a working, compiled binary for the Blue Pill.
The next step would be to load it onto the device. How? See here. But let’s talk about the why for now.

The area of memory that is both permanent and writable on the STM32F103 is the Flash memory, a region that starts at address 0x08000000 (see the reference manual, section 3.3.3) and extends for 128KB.
Our compiled binary will reside there.

We just saw that to start the execution a couple of special values must be written at the beginning of the vector table; but that is at address 0x00000000!
We cannot write that — not before our program is running.

It turns out that it doesn’t matter. Depending on the state of certain pins and registers at boot, the device maps the memory that starts at 0 to a different region.
This is useful to execute different code that can, for example, be used to upload a new binary (see for example the UART bootloader path of the flashing tutorial).

The default option is the following (excerpt from section 3.4 of the reference manual):

Boot from main Flash memory: the main Flash memory is aliased in the boot memory
space (0x0000 0000), but still accessible from its original memory space (0x800 0000).
In other words, the Flash memory contents can be accessed starting from address
0x0000 0000 or 0x800 0000.

Basically, the memory area that starts at address 0x00000000 and the one starting at 0x08000000 (which is the flash memory where our program is stored) are seen as the same.
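
The aliasing boils down to a fixed offset between the two views of the same bytes. A tiny host-side sketch, assuming the default "boot from main Flash" configuration and a 128KB Flash; `SKETCH_FLASH_ORIGIN`, `SKETCH_FLASH_LENGTH` and `boot_alias_to_flash` are hypothetical names used only here:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed Flash layout (hypothetical names). */
#define SKETCH_FLASH_ORIGIN 0x08000000u
#define SKETCH_FLASH_LENGTH (128u * 1024u)

/* Translate an address in the boot alias (starting at 0) to the
   address of the same byte in the main Flash region. */
static uint32_t boot_alias_to_flash(uint32_t alias_address) {
    /* Only the first 128KB of the alias are backed by Flash in this sketch. */
    assert(alias_address < SKETCH_FLASH_LENGTH);
    return SKETCH_FLASH_ORIGIN + alias_address;
}
```

So the initial stack pointer, read by the hardware at address 0x00000000, is the word stored at 0x08000000, and the reset entry at 0x00000004 is the word stored at 0x08000004.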

This means that by placing the two desired values at the very beginning of our binary, the first two entries of the vector table can be controlled.
That’s great, but how?

The linker script

Usually the memory layout of the compiled binary is handled by some tool or predefined configuration that knows the final target better than the developer.
Here, we are alone; it falls on us to mold the compilation results to fit the device’s necessities. This is done via an explicit linker script.

A linker script is (you guessed it) a script meant for the linker. The linker is the part of the compiler toolchain that stitches together the intermediate compilation results into a coherent binary.

It understands a particular syntax, the study of which is beyond the scope of this document.
Every time a linking phase takes place there is a linker script involved; most of the time it is just hidden from the developer, and that’s okay.
Usually manipulating the memory configuration beyond the default case is not needed, but here we don’t have a “default” case; rather, we are building towards it.

Moreover, the linker script I’m about to show you is very, very simple. Just follow along intuitively.

First things first, outline how the memory is organized:


MEMORY
{
    FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 128K
    RAM (rwx) : ORIGIN = 0x20000000, LENGTH = 20K
}
__reset_stack_pointer = ORIGIN(RAM) + LENGTH(RAM);

Here we declare two memory regions, FLASH and RAM. The names are arbitrary and have no specific value beyond what we give them.
The addresses and sizes should follow the actual resources of the device; the letters inside the parentheses indicate the capabilities of the corresponding region: r is for read, w is for write and x is for execute.

Outside of the MEMORY command we declare a variable, __reset_stack_pointer, pointing to the end of RAM: its ORIGIN plus its LENGTH, which is the address right after the last available byte (the processor decrements SP before storing anything, so nothing is ever written past the end).
It will be useful in a while, when we place it as the first entry of the vector table.
We could have used the result (0x20005000) directly, but abstracting and avoiding repetition is always good practice.

Then there is the SECTIONS command, where we actually specify the binary layout:

SECTIONS
{
    .text : {
        /* Set the initial address to 0 */
        . = 0;
        LONG(__reset_stack_pointer);
        LONG(main | 1);
        /* The whole interrupt table is 332 bytes long. Advance to that position. */
        . = 332;
        /* And here comes the rest of the code */
        *(.text*)
    } > FLASH /* Put this in the flash memory region */
}

A binary executable is made of sections that contain different types of data.
Those sections can contain all kinds of symbols that are found in the C source files.
Right now the only one we care about is the text section, the one where the instructions are stored.

In a linker script the dot symbol corresponds to the current address; assigning the value 0 to it means rewinding the address to the beginning of the section.

Then we place a couple of LONGs (i.e. 32 bit integers): the first one with the value of the initial stack pointer (the one we declared previously) and the second one as the address of the main function.

The latter also needs the least significant bit set to 1 (“The least-significant bit of each vector must be 1, indicating that the exception handler is Thumb code.”), hence the bitwise or operation.

As a whole the vector table contains 83 entries, so it must not end before 83*4 = 332 bytes. We enforce this limit by putting the number 332 into the dot, thus setting the current address to that value.

These are all very specific directives that build the first two entries of the vector table just like we required.
After the table is prepared, we simply add every symbol that belongs to the text section (i.e. the code) with the wildcard *(.text*), then we place this whole map in the Flash memory region.

As a final result we get a binary that starts with exactly the two values we talked about, exactly at the starting position for the vector table.
In this way the MCU will boot up with the correct values to start with.
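
If you want to convince yourself that a binary image really starts with those two values, a sanity check could look like the following host-side sketch. It assumes the image is little-endian (as the STM32F103 is), that RAM spans 0x20000000 to 0x20005000 and that Flash spans 0x08000000 to 0x08020000; `read_le32` and `image_header_looks_valid` are names invented for this example, not part of any tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit word from a byte buffer. */
static uint32_t read_le32(const uint8_t *bytes) {
    return (uint32_t)bytes[0] | ((uint32_t)bytes[1] << 8) |
           ((uint32_t)bytes[2] << 16) | ((uint32_t)bytes[3] << 24);
}

/* Return 1 if the image starts with a plausible initial stack pointer
   and reset entry, 0 otherwise. */
static int image_header_looks_valid(const uint8_t *image, size_t length) {
    assert(image != NULL);
    if (length < 8) {
        return 0; /* Too short to hold the first two entries. */
    }

    uint32_t initial_sp = read_le32(image);
    uint32_t reset_vector = read_le32(image + 4);

    /* The stack pointer must land in RAM (the end address is allowed). */
    int sp_ok = initial_sp > 0x20000000u && initial_sp <= 0x20005000u;
    /* The reset entry must have the Thumb bit set... */
    int thumb_ok = (reset_vector & 1u) == 1u;
    /* ...and must point somewhere inside the Flash region. */
    int in_flash = reset_vector >= 0x08000000u && reset_vector < 0x08020000u;

    return sp_ok && thumb_ok && in_flash;
}
```

For instance, an image beginning with the bytes 00 50 00 20 4D 01 00 08 encodes a stack pointer of 0x20005000 and a reset entry of 0x0800014D (an invented address), and passes the check. Clear the lowest bit of the second word and it fails.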

Now that the configuration is handled, all that’s left is to actually compile the project.

Compilation

I will skip talking about a proper build system and simply list the three commands that are required to compile our basic program.
The compiler I’m going to use is arm-none-eabi-gcc. It’s available as an official package in all major distributions.

One may also wish to use llvm instead, but not me, not today. Today we keep it classic.

First things first, compile the only source file (main.c) into an object file.
We don’t need the linker script for now; this step is just a matter of translation into binary instructions:

arm-none-eabi-gcc -o main.o -c -g -nostdlib -mcpu=cortex-m3 -mthumb main.c

Assuming you have a basic understanding of the compilation process, the only atypical options are:

- -nostdlib commands the compiler to avoid linking a standard library.
This excludes all the usually available functions like malloc and free, which would require a specific implementation anyway.

- -mcpu=cortex-m3 specifies the target architecture (a Cortex M3 microcontroller).

- -mthumb requires the generated code to execute in Thumb state, the only one the Cortex-M3 supports.
The Thumb instruction set is overall more compact than its ARM counterpart, as most instructions require only 16 bits.

The intermediate output main.o will contain instructions that are compiled from the source code but not in the required memory configuration.
To fix that, in the next step we mention the custom linker script:

arm-none-eabi-gcc -o application.elf -Wl,-Tmemory.ld -nostartfiles main.o

Again, a couple of specific flags:

- even if we are technically invoking gcc, -Wl prefaces flags that are to be passed to the linker. In this case it’s just -Tmemory.ld, or the linker script to use.

- -nostartfiles cuts off any startup file that normally constitutes the first entry point for userspace programs; we spent the last chapter ensuring the application will start properly, anything else would just be a hindrance.

The end result is application.elf, which is almost what we need. The Executable and Linkable Format (ELF) is not something that can directly be written into memory.
Suffice to say it contains more information than is necessary; a third party tool (like gdb) could make use of it, but to get something that can simply be loaded onto the Flash memory we need an extra step that passes through the objcopy utility.

arm-none-eabi-objcopy -O binary application.elf application.bin

objcopy converts the more sophisticated application.elf to a brutish application.bin. This will do! Will it?

Well, it depends. It can be loaded onto the device and it will certainly execute the main function, but that’s empty.
It will just return — and since there’s nothing to return to, the processor will probably get stuck in some kind of error state. Unfortunate.

Fear not, for this was the most tedious part; now comes the actual programming. We will breathe the spark of life into the board.

Blink

The Blue Pill development board doesn’t have much of an arsenal of its own.
The most readily available tool to make apparent that some code is running is the on board green LED, so that’s the direction we want to take.

Peripherals

Outside of the processor proper and the memory banks, any MCU integrates a varying amount of peripheral modules.
Those are connected to the physical pins that are exposed from the integrated circuit and allow the developer to control them to send or receive electrical signals for a wide array of purposes.

Perhaps the simplest of them all is the GPIO, or General Purpose Input Output.
In its output form this basic peripheral can decide whether the voltage level found on the tip of the desired pin is high or low, which typically means 3.3V or 0V, respectively.

While not much, it allows us to drive a very basic functionality like flashing an LED.
On the Blue Pill we can find one built-in and connected to the PC13 GPIO.

Peripherals are controlled by the developer through appropriately defined registers; those however are different from the processor registers we talked about a while ago: peripheral registers are memory mapped.
This means that there are certain memory locations that can be written and/or read to control the peripheral. Greatly simplified, imagine setting the output level of the GPIO by writing 0 or 1 in a memory cell.

Controlling a GPIO

Broadly speaking, controlling the output of a GPIO on an STM32 microcontroller is a three step process:
1. The port where the GPIO is located must be enabled as a whole (in this case, PORTC).
2. The GPIO must be configured as an output pin.
3. The output pin can be forced to a high or low level.

Enabling any peripheral on the STM32F103 is done by writing to the Reset and Clock Control (RCC for short) group of registers.
To find out where the interesting memory sections can be found you should look at the memory map, section 3.3 of the Family Reference Manual.

The RCC entry in that memory map shows that said register group can be found starting at address 0x40021000.
Memory mapped registers are not actually part of the memory banks, but behave as such for the purposes of our program.
The number 0x40021000 can be treated as a memory address, assigned to a pointer and used to read and write the contents of the register.
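
Here is a minimal sketch of that idea in C. There is no register at 0x40021000 on a desktop machine, so when trying it on the host an ordinary variable has to stand in for the register; `register_write` and `register_read` are hypothetical helpers written for this example (the volatile qualifier they use is discussed further down).

```c
#include <assert.h>
#include <stdint.h>

/* Write a 32-bit value to the memory-mapped register at the given address. */
static void register_write(uintptr_t address, uint32_t value) {
    /* Registers are 32-bit words, so the address must be word aligned. */
    assert(address % sizeof(uint32_t) == 0);
    *(volatile uint32_t *)address = value;
}

/* Read the 32-bit contents of the memory-mapped register at the given address. */
static uint32_t register_read(uintptr_t address) {
    assert(address % sizeof(uint32_t) == 0);
    return *(volatile uint32_t *)address;
}
```

On the Blue Pill the address would be a constant such as 0x40021000; on the host you can pass `(uintptr_t)&some_uint32_variable` instead and observe that writes through the pointer end up in the variable.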

That’s not actually the specific register we need, but merely where the group of RCC registers begins.
The register that enables the PORTC is the APB2 peripheral clock enable register (or RCC_APB2ENR).
According to its description in the reference manual, its address offset (relative to the start of the group) is 0x18, placing it at the 0x40021018 memory location.

As for all memory locations on a 32-bit processor, registers are organized as 32-bit words.
The RCC_APB2ENR register manages multiple peripherals, with a separate enable bit for each.
The one that controls PORTC is the IOPCEN, positioned at bit number 4 (the fifth, starting from 0).
Its function is quite simple: when the bit is 0 the clock of I/O port C is disabled, when it is 1 the clock is enabled.

To avoid writing directly arbitrary numerical values in our code we could use a few support definitions. It’s finally time to pick up `main.c` again; add the following directives at the top:

#include <stdint.h>

#define RCC_BASE 0x40021000
#define RCC_APB2ENR_REGISTER (*(volatile uint32_t *)(RCC_BASE + 0x18))
#define RCC_APB2ENR_IOPCEN (1 << 4)

stdint.h is a convenient header — available even in this bare metal environment — that defines explicitly sized types like `uint32_t`.

First, we define RCC_BASE as the starting address of the RCC register group.
Then, the address of the register we are actually interested in, which can be written as RCC_BASE + 0x18 for clarity.

There is some more nuance to it: by prepending an asterisk and a cast to a pointer type, we can use RCC_APB2ENR_REGISTER as if it were a variable and directly assign or read values from it.

The volatile keyword is not strictly necessary in an unoptimized build like this one, but it prevents some real problems.
When processing a program the compiler can decide (always depending on the flags) to optimize some operations.
For example, if we were to read from the same memory location twice without ever writing anything else in it, the compiler could cut off the second memory access to improve performance.

While normally this is harmless, peripheral registers can behave unexpectedly.
They can change contents on their own in reaction to hardware events, and these optimizations might interfere with that.
volatile tells the compiler that the corresponding variable or pointer contents can change erratically, so the code that accesses it should not be tampered with.
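
A typical situation where this matters is polling: reading the same location over and over until the hardware changes it. The sketch below shows the pattern with a hypothetical helper, `wait_for_bits`, which is not used anywhere else in this tutorial; on the host a plain variable plays the part of the status register.

```c
#include <assert.h>
#include <stdint.h>

/* Poll a register until any bit of mask is set, giving up after
   max_polls reads. Returns 1 if a bit was observed, 0 on timeout. */
static int wait_for_bits(const volatile uint32_t *reg, uint32_t mask,
                         uint32_t max_polls) {
    assert(reg != 0);
    for (uint32_t i = 0; i < max_polls; ++i) {
        /* Because reg points to volatile data, the compiler must perform
           a real read on every iteration instead of reusing the first one. */
        if ((*reg & mask) != 0u) {
            return 1;
        }
    }
    return 0;
}
```

Without the volatile qualifier an optimizing compiler would be entitled to read the location once and then spin on the stale value.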

Finally, RCC_APB2ENR_IOPCEN is a number value with just the bit 4 active.

With these defines we are ready to do something in the `main` function:

int main(void) {
    // Enable port C clock gate.
    RCC_APB2ENR_REGISTER |= RCC_APB2ENR_IOPCEN;
}

This assignment sets the fifth bit of the RCC_APB2ENR register to 1, enabling the PORTC peripheral.

Instead of simply writing the value we use a bitwise or with the original value not to mess with the previous configuration — which is the reset value here, but may change as we enable more and more peripherals.
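
The difference between a plain assignment and a read-modify-write can be tried out on the host with a variable standing in for the register. In this sketch `SKETCH_IOPCEN` and `register_set_bits` are hypothetical names, and the initial value in the usage note is simply a pretend "some other peripheral is already enabled" state.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_IOPCEN (1u << 4) /* bit 4, the same position as IOPCEN */

/* Set the bits in mask while preserving every other bit of the register. */
static void register_set_bits(volatile uint32_t *reg, uint32_t mask) {
    assert(reg != 0);
    *reg |= mask;
}
```

Starting from a register value of 0x00000004 (bit 2 already set), `register_set_bits(&reg, SKETCH_IOPCEN)` leaves 0x00000014: both bits are active. A plain `reg = SKETCH_IOPCEN` would have produced 0x00000010, silently switching off whatever bit 2 was controlling.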

Configuring the PC13 pin as an output is similar.
The corresponding memory group starts at address 0x40011000, as shown in the memory map.

There are two registers that control IO direction: each pin needs 4 bits and each port manages 16 pins, so the total is 64 bits, covered by two registers, GPIOC_CRL and GPIOC_CRH.
Since the LED is connected to PC13 it is controlled by GPIOC_CRH, specifically by bits 20–23.

There is an instance of this register for each GPIO port, so let’s abstract a bit more.
The macro that calculates the address of the CRH register will depend on a parameter, the starting base address for the required GPIO port.

#define GPIO_PORTC_BASE 0x40011000
#define GPIO_CRH_REGISTER(x) (*(volatile uint32_t *)((x) + 0x4))
#define GPIO_CHR_MODE_MASK(x) (0x3 << (((x) - 8) * 4))
#define GPIO_CHR_MODE_OUTPUT(x) (0x1 << (((x) - 8) * 4))
#define GPIO_BLINK_PORT GPIO_PORTC_BASE
#define GPIO_BLINK_NUM 13

A bit of bitfield arithmetic is required but everything should be pretty clear.
For the sake of cleaner code we also define two macros for the target port and GPIO number.

To configure PC13 as an output the following two instructions are sufficient:

// Configure GPIO C pin 13 as output.
GPIO_CRH_REGISTER(GPIO_BLINK_PORT) &= ~(GPIO_CHR_MODE_MASK(GPIO_BLINK_NUM));
GPIO_CRH_REGISTER(GPIO_BLINK_PORT) |= GPIO_CHR_MODE_OUTPUT(GPIO_BLINK_NUM);

The first assignment clears the 2 bits controlling the mode and makes way for the second one to set the desired value.
Again, it’s good manners not to touch bits unrelated to our direct goal.
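
To see what the bitfield arithmetic does to the whole register, the two steps can be replayed on a plain number. This is a host-side sketch: `crh_field_shift` and `crh_with_mode` are hypothetical helpers, and the starting value 0x44444444 used in the usage note is the reset value of the register as I read it in the reference manual (treat it as an assumption and check your own copy).

```c
#include <assert.h>
#include <stdint.h>

/* Bit position of the 4-bit configuration field of a pin inside CRH.
   CRH only covers pins 8 to 15. */
static uint32_t crh_field_shift(uint32_t pin) {
    assert(pin >= 8u && pin <= 15u);
    return (pin - 8u) * 4u;
}

/* Return crh with the 2 mode bits of the given pin replaced by mode (0..3),
   leaving every other bit untouched. */
static uint32_t crh_with_mode(uint32_t crh, uint32_t pin, uint32_t mode) {
    assert(mode <= 3u);
    uint32_t shift = crh_field_shift(pin);

    crh &= ~(UINT32_C(0x3) << shift); /* clear the 2 mode bits */
    crh |= mode << shift;             /* set the requested mode */
    return crh;
}
```

For pin 13 the shift is 20, so `crh_with_mode(0x44444444, 13, 0x1)` returns 0x44544444: only the hexadecimal digit that belongs to PC13 changes, from 4 to 5.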

By default the output pin will be set to a low state, and that happens to turn on the LED.
The main function is not supposed to return in our environment, so add an (empty) infinite loop after the GPIO initialization.
If you recompile the project with the three commands I mentioned and flash it on the board you should see the LED shining a fixed bright green!

This is a sign of life, but we didn’t actually manipulate the peripheral beyond configuring it.
To do that, we need to access the ODR register. Compared to the previous registers it is quite simple: the nth bit corresponds to the state of the nth output pin.

As we did before, define a few helpful macros:

#define GPIO_ODR_REGISTER(x) (*(volatile uint32_t *)((x) + 0xC))
#define GPIO_ODR_PIN(x) (1 << (x))

Then we alternately turn the LED on and off by switching the output pin low and high.

To achieve a perceptible state change one would need to keep track of time, but there is no such thing in this bare environment.
The STM32F103 has timer peripherals available, but let’s keep this very simple: to let some time pass by just let a for loop run for a fixed number of iterations.
It’s not precise, but works sufficiently well.

    for (;;) {
        // Set the output bit.
        GPIO_ODR_REGISTER(GPIO_BLINK_PORT) |= GPIO_ODR_PIN(GPIO_BLINK_NUM);
        for (uint32_t i = 0; i < 400000; ++i) {
            __asm__ volatile("nop");
        }
        // Reset it again.
        GPIO_ODR_REGISTER(GPIO_BLINK_PORT) &= ~GPIO_ODR_PIN(GPIO_BLINK_NUM);
        for (uint32_t i = 0; i < 100000; ++i) {
            __asm__ volatile("nop");
        }
    }

I put different time periods for on and off states to give a better feeling of “heartbeat”.
The delay loops are filled with NOP operations: those are special machine instructions that waste a CPU cycle without doing anything (from No OPeration).
The volatile keyword, as for the registers, prevents the compiler from optimizing those instructions out.
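
The effect of the two ODR assignments can also be checked away from the board, with a variable in place of the register. `odr_pin_mask` and `odr_write_pin` are hypothetical helpers for this sketch only; the firmware itself keeps using the macros defined earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with only the bit of the given pin set (a port has pins 0 to 15). */
static uint32_t odr_pin_mask(uint32_t pin) {
    assert(pin <= 15u);
    return UINT32_C(1) << pin;
}

/* Drive a pin high (level != 0) or low (level == 0) in an ODR-like
   register, without touching the other pins. */
static void odr_write_pin(volatile uint32_t *odr, uint32_t pin, int level) {
    assert(odr != 0);
    if (level) {
        *odr |= odr_pin_mask(pin);
    } else {
        *odr &= ~odr_pin_mask(pin);
    }
}
```

Starting from 0, driving pin 13 high gives 0x2000 and driving it low again gives 0. Remember that on the Blue Pill the LED lights up when the pin is low, so "high" here means "LED off".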

The end result for main.c should look something like this:


#include <stdint.h>

#define RCC_BASE 0x40021000
#define RCC_APB2ENR_REGISTER (*(volatile uint32_t *)(RCC_BASE + 0x18))
#define RCC_APB2ENR_IOPCEN (1 << 4)
#define GPIO_PORTC_BASE 0x40011000
#define GPIO_CRH_REGISTER(x) (*(volatile uint32_t *)((x) + 0x4))
#define GPIO_CHR_MODE_MASK(x) (0x3 << (((x) - 8) * 4))
#define GPIO_CHR_MODE_OUTPUT(x) (0x1 << (((x) - 8) * 4))
#define GPIO_ODR_REGISTER(x) (*(volatile uint32_t *)((x) + 0xC))
#define GPIO_ODR_PIN(x) (1 << (x))
#define GPIO_BLINK_PORT GPIO_PORTC_BASE
#define GPIO_BLINK_NUM 13

int main(void) {
    // Enable port C clock gate.
    RCC_APB2ENR_REGISTER |= RCC_APB2ENR_IOPCEN;
    // Configure GPIO C pin 13 as output.
    GPIO_CRH_REGISTER(GPIO_BLINK_PORT) &= ~(GPIO_CHR_MODE_MASK(GPIO_BLINK_NUM));
    GPIO_CRH_REGISTER(GPIO_BLINK_PORT) |= GPIO_CHR_MODE_OUTPUT(GPIO_BLINK_NUM);
    for (;;) {
        // Set the output bit.
        GPIO_ODR_REGISTER(GPIO_BLINK_PORT) |= GPIO_ODR_PIN(GPIO_BLINK_NUM);
        for (uint32_t i = 0; i < 400000; ++i) {
            __asm__ volatile("nop");
        }
        // Reset it again.
        GPIO_ODR_REGISTER(GPIO_BLINK_PORT) &= ~GPIO_ODR_PIN(GPIO_BLINK_NUM);
        for (uint32_t i = 0; i < 100000; ++i) {
            __asm__ volatile("nop");
        }
    }
    return 0;
}

To create a binary image the required commands are as follows:

arm-none-eabi-gcc -o main.o -c -g -nostdlib -mcpu=cortex-m3 -mthumb main.c
arm-none-eabi-gcc -o application.elf -Wl,-Tmemory.ld -nostartfiles main.o
arm-none-eabi-objcopy -O binary application.elf application.bin

The flashing process depends on the tools at your disposal: again, see this other tutorial for a few different paths.

If everything is pieced together correctly you will see the LED blink like it’s Christmas.

The blinking speed can be tweaked by modifying the loop counters — empirically but effectively.

Conclusion

That’s about it for now.

We went from zero to embedded development with just a couple of text files.
The hard part is really to study and understand the nuances of the target that are explained in thousands of pages of the Friendly manuals, but working with no crutches should bring a certain satisfaction.

Not that you should avoid supportive tools and libraries just for kicks, but now you know that if the need arose you would be autonomous enough to bootstrap your project from nothing, and that’s something.

There are improvements and more or less important details that should be covered, including (but not limited to):

  • A proper build system.
  • Preparations for a correct runtime environment for C code.
  • More elegant ways to populate the reset vector.
  • A more convenient approach to memory mapped peripheral control.
  • Proper time keeping.
  • More programming languages!

but I think we’ve done enough for one tutorial.
I’ll probably expand on this topic in the future. Blinking LEDs is fun.

Mattia Maldini

Computer Science Master from Alma Mater Studiorum, Bologna; interested in a wide range of topics, from functional programming to embedded systems.