Noodle is a lightweight CNN-style inference library designed for microcontrollers and other memory-constrained systems. Its primary design principle is streaming-based execution: instead of storing all intermediate tensors in RAM, Noodle can read inputs and weights from external storage and write intermediate activations back to storage. This approach allows the peak memory footprint to remain small and predictable.
This documentation describes the public API, as well as the core invariants—data layouts, file formats, and buffer requirements—needed to use Noodle correctly and safely.
What Noodle is (and is NOT)
Noodle is:
- A compact set of C/C++ functions covering convolution, activation, pooling, flatten, and fully-connected style pipelines.
- Designed for memory-constrained environments, with APIs that support file-backed I/O to avoid large W×W×C allocations.
- Backend-agnostic at the call site: filesystem operations are routed through a small abstraction layer (see Filesystem backend layer).
Noodle is NOT:
- A training framework (no automatic differentiation or optimizers).
- A dynamic tensor runtime with graph scheduling.
- A replacement for highly optimized vendor DSP or accelerator libraries; its focus is clarity, portability, and pedagogical transparency.
Key design concepts
Streaming-first memory model
A common failure mode in embedded machine learning is that models work in desktop environments but exceed the memory limits of microcontrollers. This is largely due to tensor sizes scaling as W×W×C.
Noodle addresses this issue through a streaming execution model:
- Inputs may reside on external storage.
- Each layer consumes an input tensor (or file).
- The output is written to a new file.
- Only small working buffers are kept in RAM.
This design explains why multiple function variants exist, such as file → file, memory → file, and file → memory operations.
Data layout conventions
Noodle uses explicit and consistent layout conventions to maintain predictability across platforms.
CHW file layout (feature maps on storage): each channel is stored as a complete W×W plane, with planes concatenated channel by channel.
HWC-flatten layout (common for in-memory outputs): channels are interleaved per pixel, so the C values belonging to one pixel are stored contiguously.
When mixing streaming and in-memory operations, ensure that the appropriate layout conversions are applied (for example, flattening or reordering).
Quick start
1) Select a filesystem backend (compile time)
Enable a backend using build flags. The exact macros may vary by platform. None means no external file storage is needed.
- SD via SdFat:
NOODLE_USE_SDFAT
- FFat:
NOODLE_USE_FFAT
- LittleFS:
NOODLE_USE_LITTLEFS
- None:
NOODLE_USE_NONE
See: noodle_config and Filesystem backend layer.
2) Provide working buffers
Noodle relies on a small number of reusable temporary buffers, typically two, that are shared across layer calls. This approach stabilizes peak memory usage, but it introduces several constraints:
- Buffers must be allocated and sized appropriately.
- Calls are not re-entrant.
- Concurrent use from multiple threads is not supported without isolation.
In RTOS-based systems, treat Noodle as a single-threaded worker unless separate instances and buffers are provided.
3) Construct a processing pipeline
A typical streaming pipeline follows this pattern:

uint16_t noodle_conv_float(const char *in_fn, uint16_t n_inputs, uint16_t n_outputs, const char *out_fn, uint16_t W, const Conv &conv, const Pool &pool, CBFPtr progress_cb)
File→File 2D conv with FLOAT input feature maps. Definition: noodle.cpp:637

uint16_t noodle_flat(const char *in_fn, float *output, uint16_t V, uint16_t n_filters)
File→Memory flatten. Definition: noodle.cpp:898

uint16_t noodle_fcn(const int8_t *input, uint16_t n_inputs, uint16_t n_outputs, const char *out_fn, const FCNFile &fcn, CBFPtr progress_cb)
Memory→File fully connected layer. Definition: noodle.cpp:937
In this model, external storage effectively serves as activation memory, enabling inference on devices with very limited RAM.
Training and parameter export workflow
Noodle is an inference-only library. Model training is performed using standard deep learning frameworks such as Keras or PyTorch, after which the learned parameters are exported into a format suitable for embedded deployment.
A typical workflow consists of the following steps:
- Model design and training
- Define a compact convolutional neural network (e.g., LeNet-style or MobileNet-style).
- Train the model using Keras or PyTorch on a desktop or server environment.
- Validate accuracy and adjust the architecture as needed.
- Parameter extraction
- After training, extract the learned weights and biases from each layer.
- Convert them into a simple, deterministic layout suitable for embedded inference.
- Typical parameters include:
- Weight matrices and bias vectors for convolution and fully-connected layers (depthwise convolution layers have weights only, with NO bias vectors)
- Batch-normalization parameters (gamma, beta, mean, variance)
- Export to Noodle-compatible format
- A lightweight model exporter script is provided to convert trained parameters into Noodle-compatible files.
- The exporter generates:
- .h header files containing C-style arrays for in-memory inference, and
- .txt files containing raw numeric parameters for streaming-based execution.
- Each layer is exported as a separate file (for example, w01.h, bn01.txt, etc.), following Noodle’s layout conventions.
This dual-format export allows the same trained model to be deployed in:
- memory-resident mode (using .h arrays), or
- streaming mode (using .txt files on SD or flash storage).
- Embedded deployment
- Copy exported parameter files to the target platform (flash, SD card, or filesystem).
- Build the inference pipeline using Noodle APIs.
- Run inference on the microcontroller.
By exposing the parameter extraction and deployment steps, Noodle provides a transparent, end-to-end TinyML workflow rather than a black-box runtime.
API overview
The documentation is organized into the following modules:
- noodle_api — Core public API (layers, activations, pooling, flatten, etc.)
- Filesystem backend layer — Filesystem abstraction layer (SdFat, FFat, LittleFS)
- noodle_config — Compile-time configuration options and defaults
Limitations and considerations
- Several helpers rely on shared global state (file handles and temporary buffers).
Concurrent or re-entrant usage is not supported by default.
- Some path utilities may use static internal buffers and are not thread-safe.
- Certain APIs assume fixed dimensional structures.
In particular, feature maps are assumed to be square (W×W) in many parts of the API. Rectangular grids are not currently supported.
- Convolution padding is symmetric only.
Current convolution routines assume uniform padding on all sides (for example, one pixel for a 3×3 kernel).
Asymmetric or framework-specific padding modes are not supported.
- Pooling operations use valid regions only.
Pooling layers do not apply padding. The pooling window is applied only to fully valid regions of the input feature map.
- Storage latency can dominate.
Many microcontrollers are compute-capable but I/O-bound.
Glossary
- CHW : Channel-first planar storage; channels are stored as full planes.
- HWC-flatten : Pixel-major interleaved channels, commonly used before dense layers.
- Streaming : Moving tensors through external storage to reduce peak RAM usage.
- Cin / Cout : Input and output channel counts.
- in-variable : Parameters / inputs / outputs that are stored in variables (variable mnemonic).
- in-file : Parameters / inputs / outputs that are stored in files (file mnemonic).