Writing an autograd engine in C, by hand

June 18, 2026

I've known the theory behind neural network training for years. The chain rule, backpropagation, gradient descent: I could explain all of it on a whiteboard. But at some point I realized I had never actually implemented it. I had used PyTorch, trusted loss.backward(), and never once written the thing that fills in the gradients myself.

So I sat down and built one. The result is microcrad, a small scalar-valued automatic differentiation engine written in C, with a tiny neural network on top. It's a re-implementation of Andrej Karpathy's micrograd, which is in Python.

I want to walk through how it works and, more interestingly, through the part the Python original gets to ignore entirely: who owns the memory.

Two constraints

Before the code, the rules I set for myself, because they shaped everything that follows.

First, I wrote it in C. I hadn't used C seriously since university, and that was the point: I wanted the friction. A high-level language lets you stay vague about what's actually happening; C does not. Salvatore Sanfilippo's C programming course is what finally nudged me back to the language.

Second, no AI assistance for any of the source code. If I was going to claim I understood backpropagation, I wanted to have typed every line of it myself. That's a slow way to work in 2026, and that was also the point.

This is a learning project, not a framework. It's scalar, not tensor-based, so it's slow and won't train anything serious. It exists to be read.

The whole thing is one idea repeated

The engine operates on scalars, not tensors. Every number that takes part in a computation is a node in a graph, every operation records how it was produced, and a single backward pass walks the graph in reverse to compute the derivative of the output with respect to every input. No vectorization, no GPU, no clever tricks. Just the chain rule, applied one scalar at a time. There's nowhere to hide, which is exactly why it's good for learning.

The fundamental type is Value. It wraps a single double and, when it's the result of an operation, remembers the operands it came from:

typedef struct Value {
    uint32_t ref_count;   /* How many references point at this Value. */
    uint32_t n_prevs;     /* How many operands produced this Value. */
    double data;          /* The scalar this node holds. */
    double extra_data;    /* Operation parameter (e.g. the exponent in pow). */
    struct Value **prev;  /* The operands (previous nodes in the graph). */
    int32_t op_code;      /* Which operation produced this Value. */
    uint32_t magic;       /* Debug canary for some invalid or stale pointers. */
    double grad;          /* dLoss/dThisValue, filled in by backward. */
} Value;

A leaf node, an input or a weight, has no operands. A node produced by an operation points back at the operands it depends on through prev. Follow those pointers from any node and you walk a directed acyclic graph: the computation graph.

The smallest program that computes something and its gradient looks like this:

Value *a = value_create_leaf(2.0);
Value *b = value_create_leaf(3.0);
Value *c = value_mul(a, b);   /* c = a * b = 6 */

value_backward(c);            /* fills in every grad field */

printf("dc/da = %f\n", a->grad);   /* 3.000000  (== b) */
printf("dc/db = %f\n", b->grad);   /* 2.000000  (== a) */

The operations I implemented are addition, multiplication, power (with a constant exponent), exp, log, and ReLU. That's a deliberately small set, but it's exactly enough to build a ReLU network with a mean-squared-error or negative-log-likelihood loss. I didn't write subtraction or division as separate operations: subtraction is addition with a negated operand, and division is multiplication by a reciprocal. Keeping the primitive set small keeps the backward pass small too.

The backward pass

value_backward does the same two-step dance as the original micrograd.

First it builds a topological ordering of the graph with a depth-first traversal, so that every node comes after all the nodes it depends on. Then it seeds the output's gradient to 1 and walks the sorted list in reverse, and at each node it pushes gradient onto that node's operands according to the local derivative of the operation that produced it.

In code, the reverse walk is just a switch over the operation codes:

v->grad = 1;
for (size_t i = topo->size-1; i != 0; i--) {
    Value *node = topo->data[i];
    switch (node->op_code) {
        case ADD_OP_CODE:
            for (uint32_t j = 0; j < node->n_prevs; j++)
                node->prev[j]->grad += node->grad;
            break;
        case MUL_OP_CODE:
            for (uint32_t j = 0; j < node->n_prevs; j++)
                node->prev[j]->grad += node->grad * node->prev[1-j]->data;
            break;
        /* ... pow, exp, log, relu ... */
    }
}

The detail worth pausing on is +=. Gradients accumulate. If a Value is used in more than one place in the graph, it correctly receives the sum of the gradients flowing back through each path. That single += is doing the multivariable chain rule for you. It's also why the engine doesn't reset gradients itself: when you train in a loop you have to zero them yourself before each backward pass, which both training examples do explicitly.

The part Python hides

Here is where C stopped being an aesthetic choice and started teaching me something.

In Python, you build the graph, call backward, and let the garbage collector clean up whenever it feels like it. You never think about it. In C there is no garbage collector, and a computation graph is a tangle of shared pointers. The same weight Value can be an operand of thousands of result nodes. The same intermediate result can feed several downstream operations. Freeing that by hand, node by node, is exactly the kind of thing that produces either leaks or double-frees.

I solved it the way a lot of long-lived C programs do: reference counting.

void value_retain(Value *v);    /* take a reference: ref_count++ */
void value_release(Value *v);   /* drop a reference: ref_count-- */

The rules are small. Every Value is born with a reference count of 1, owned by whoever created it. Each operation retains its operands, meaning a result node co-owns the values it was computed from. And value_release decrements the count; when it hits zero, the node frees itself and releases its own operands first, which may free them in turn, and so on down the graph:

void value_release(Value *v) {
    if (v == NULL) return;
    if (v->magic != VALID) { /* log: stale pointer */ return; }
    v->ref_count--;
    if (v->ref_count > 0) return;
    v->magic = INVALID;
    for (uint32_t i = 0; i < v->n_prevs; i++)
        value_release(v->prev[i]);
    free(v->prev);
    free(v);
}

That recursion is the whole trick. You almost never free a graph node by node. You release the root (the loss, or the output of a forward pass) and the reference counts cascade downward, freeing exactly the nodes that nothing else still holds. The weights survive, because the network structure also holds a reference to them. The pure intermediates, held only by the result you just released, disappear.

It's a genuinely satisfying property once it clicks: a graph that frees itself, and sharing that is both free and correct. A weight used in ten thousand multiplications is just retained ten thousand times, and it's freed neither too early nor too late.

The honest cost is that you have to balance every single reference by hand. Every value_create, every value_retain, every operation's implicit retain of its operands has to be matched by exactly one value_release. Forget one and you leak; do one too many and you free memory that's still in use. This is why the training examples in the repo look so verbose in their error paths: they're being scrupulous about releasing everything they allocated, even when an allocation halfway through fails. That verbosity isn't boilerplate for its own sake. It's the price of leak-free C, and seeing it laid out is part of what the project is meant to teach.

The other surprise was that operations co-own their operands, so you can't reason about a single Value's lifetime in isolation. After c = value_add(a, b), the nodes a and b stay alive because c holds them, even if you've released your own references. That's what makes the recursive free work, but it took me a couple of confusing debugging sessions to internalize that I had to think about the whole graph, not one node.

To make those sessions less confusing, every Value carries a magic field, set to a known constant when it's created and poisoned right before the node is freed. value_retain and value_release check it. If I ever touched a stale pointer, I'd often get a clear complaint instead of silent corruption. It's a debug canary, not a correctness guarantee, and it caught a lot of my own mistakes.

A couple of small design choices I enjoyed

The topological sort needs to avoid visiting a shared node twice, so it needs a set keyed on node identity. I wrote a SimpleSet that keys on the pointer's address, rendered as a fixed-width uppercase hex string, and stores those strings in a trie with 17 children per node (16 hex digits plus a sentinel slot for the end of the key). It only supports insertion and membership, which is all the sort needs. It's almost certainly over-engineered for the job a hash set would have done, but I enjoyed writing it, and that counts for something in a project like this.

Error handling is two small macros. MC_LOG_ERR logs a message with file, line, and function. MC_GUARD(cond, label, msg) logs and jumps to a cleanup label if a condition holds. The result is the classic single-cleanup-label goto pattern for unwinding partial allocations, which I'd read about for years and never actually had a reason to use until now.

The neural network

On top of the engine sits a deliberately plain feed-forward network. A Neuron computes relu(w·x + b). A Layer is a group of neurons sharing the same input. An MLP chains layers. Because every parameter is a Value, a forward pass automatically builds a computation graph you can backpropagate through, with no extra machinery.

One simplification worth flagging: every neuron applies a ReLU, including the ones in the output layer. That means the network's outputs are always non-negative, which constrains what it can represent. It's why the toy example targets y = max(0, x1 + x2), a function that's itself non-negative. That example trains a 2-8-1 network with mean-squared error and full-batch gradient descent, and it's the smallest complete program in the repo: create a model, build a graph, backpropagate, update parameters, run inference. There's also an MNIST example, but I'm upfront in the repo that it's a structural demonstration of wiring the engine to a real dataset, not a practical training recipe. A scalar, ReLU-only engine isn't going to train a real classifier.

What it is and isn't

It's about 1,350 lines of C. No dependencies beyond a C compiler, the standard library, and libm. MIT licensed, driven by a Makefile, with a standalone test suite per component. It is not a production autograd package, not a practical deep-learning framework, and not built for numerical robustness or large datasets. It's a thing to read.

If you've ever used .backward() without knowing what it does, or you write C and want to tell me what I got wrong about the ownership model, the code is here: github.com/oraziorillo/microcrad. I'd genuinely like to hear it. The most instructive part of the whole exercise was the memory, and I suspect there are people who'd have done it more elegantly.