Anomaly Detection on ESP32

Preface

MLPerf Tiny anomaly detection model

In this section, we implement anomaly detection, one of the benchmarks in MLPerf Tiny. The network is a fully connected auto-encoder (AEFC): it consists entirely of dense layers with ReLU activations, except for the final layer, which is linear.

| Layer # | Type | Input dim | Output dim | Activation | Extracted weights and biases (weight file, bias file) |
|---|---|---|---|---|---|
| 1 | Dense | 640 | 128 | ReLU | w01.txt, w02.txt |
| 2 | Dense | 128 | 128 | ReLU | w03.txt, w04.txt |
| 3 | Dense | 128 | 128 | ReLU | w05.txt, w06.txt |
| 4 | Dense | 128 | 128 | ReLU | w07.txt, w08.txt |
| 5 | Dense | 128 | 8 | ReLU | w09.txt, w10.txt |
| 6 | Dense | 8 | 128 | ReLU | w11.txt, w12.txt |
| 7 | Dense | 128 | 128 | ReLU | w13.txt, w14.txt |
| 8 | Dense | 128 | 128 | ReLU | w15.txt, w16.txt |
| 9 | Dense | 128 | 128 | ReLU | w17.txt, w18.txt |
| 10 | Dense | 128 | 640 | None (Linear) | w19.txt, w20.txt |
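
To get a feel for how much storage the ten layers need, the shapes in the table can be summed up directly. The following is a minimal sketch; the names DenseShape, kLayers, count_params and params_bytes_fp32 are illustrative and not part of the original project. The byte count assumes raw 4-byte float32 values, so the .txt files themselves will be larger.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Layer shapes copied from the table above: {input dim, output dim}.
struct DenseShape {
    uint32_t in;
    uint32_t out;
};

static const DenseShape kLayers[] = {
    {640, 128}, {128, 128}, {128, 128}, {128, 128}, {128, 8},
    {8, 128},   {128, 128}, {128, 128}, {128, 128}, {128, 640},
};

// Weights (in * out) plus biases (out), summed over all layers.
uint32_t count_params() {
    uint32_t total = 0;
    for (std::size_t i = 0; i < sizeof(kLayers) / sizeof(kLayers[0]); ++i) {
        total += kLayers[i].in * kLayers[i].out + kLayers[i].out;
    }
    return total;
}

// Size of all parameters if each one is stored as a 4-byte float32.
uint32_t params_bytes_fp32() {
    return count_params() * 4u;
}
```

With these shapes the sum comes to 265,864 parameters, or 1,063,456 bytes (about 1 MiB) as raw float32.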

Extracting the weights and biases

The repository provides weights and biases that we can use directly. For this purpose, we selected the trained network stored in ad01_fp32.tflite.

Generating .txt from .wav files

For our benchmark test, we apply the same preprocessing pipeline used during training (as found in this part of the repository). However, we run the test offline and with less data. The ESP32 never processes raw audio; it only consumes precomputed feature vectors stored as text files. Each input .wav file is converted into multiple fixed-length feature vectors using the following steps:

  • Load WAV file: the audio file is loaded at its native sampling rate (no resampling). The WAV file is 11 seconds long and yields 342 spectrogram frames.
  • Feature extraction: a log-mel spectrogram is computed using the same parameters defined in baseline.yaml:
    • n_mels = 128
    • frames = 5
    • n_fft = 1024
    • hop_length = 512
    • power = 2.0
  • Temporal cropping: only the central portion of the spectrogram is kept: frames 50 to 250 → 200 frames total.
  • Sliding window segmentation: a sliding window of length frames = 5 is applied across the cropped spectrogram, producing: 200 − 5 + 1 = 196 feature vectors per WAV file.
  • Flattening and storage: each window is flattened into a 1-D vector of size: inputDim = n_mels × frames = 128 × 5 = 640 and stored as a float32 text file:
<wav_name>_part000.txt
<wav_name>_part001.txt
…
<wav_name>_part195.txt
  • Take 5 parts from an anomaly set and name them anom1.txt to anom5.txt.
  • Take 5 parts from a normal set and name them norm1.txt to norm5.txt.

These .txt files represent the actual inputs to the auto-encoder and match exactly the data format used during training and evaluation in the original baseline implementation.
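
The cropping, sliding-window and flattening steps above can be sketched as follows. This is an illustration rather than the code used to produce the files: the function name make_windows is hypothetical, the spectrogram is assumed to be stored mel-major (spec[m][t]), and the order of values inside each flattened vector (all mel bands of the first frame, then the second frame, and so on) is an assumption that should be checked against the baseline preprocessing.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Parameters taken from the text above.
constexpr std::size_t N_MELS = 128;
constexpr std::size_t FRAMES = 5;
constexpr std::size_t CROP_BEGIN = 50;  // first spectrogram frame that is kept
constexpr std::size_t CROP_END = 250;   // one past the last frame that is kept

// spec[m][t] is mel band m at spectrogram frame t (assumed layout).
// Returns one flattened vector of n_mels * frames values per window position.
// Assumed order inside a vector: frame-major (all mel bands of frame 0 of the
// window, then all mel bands of frame 1, ...).
std::vector<std::vector<float>> make_windows(
    const std::vector<std::vector<float>>& spec,
    std::size_t crop_begin, std::size_t crop_end, std::size_t frames) {
    assert(!spec.empty() && frames > 0);
    assert(crop_begin < crop_end && crop_end <= spec[0].size());
    assert(crop_end - crop_begin >= frames);

    const std::size_t n_mels = spec.size();
    const std::size_t n_windows = (crop_end - crop_begin) - frames + 1;

    std::vector<std::vector<float>> windows;
    windows.reserve(n_windows);
    for (std::size_t w = 0; w < n_windows; ++w) {
        std::vector<float> v(n_mels * frames);
        for (std::size_t t = 0; t < frames; ++t) {
            for (std::size_t m = 0; m < n_mels; ++m) {
                v[t * n_mels + m] = spec[m][crop_begin + w + t];
            }
        }
        windows.push_back(std::move(v));
    }
    return windows;
}
```

Called as make_windows(spec, CROP_BEGIN, CROP_END, FRAMES) on a 128 × 342 spectrogram, this yields 196 vectors of 640 values each, matching the counts given above.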

ESP32 inference workflow

On the ESP32, the inference procedure is as follows:

  • For a given WAV sample, all corresponding *_partXXX.txt files are loaded sequentially from SD card (or FFAT).
  • Each .txt file is read into a float[640] input buffer.
  • The input vector is passed through the auto-encoder implemented using Noodle, producing a reconstructed output vector of the same size.
  • The mean squared reconstruction error (MSE) between input and output is computed for that window.
  • Errors are accumulated across all 196 windows.
  • The final anomaly score for the WAV file is computed as the average reconstruction error: score = mean(MSE_part_0 … MSE_part_195)

We will only retain the final scalar score and discard the individually reconstructed vectors immediately to minimize memory usage.
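
The per-window error and the running score described above could look roughly like this. It is a sketch under the assumption that plain C++ is acceptable on the target; window_mse and ScoreAccumulator are illustrative names, not functions provided by Noodle.

```cpp
#include <cassert>
#include <cstddef>

// Mean squared error between an input window x and its reconstruction y.
float window_mse(const float* x, const float* y, std::size_t n) {
    assert(n > 0);
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = x[i] - y[i];
        acc += d * d;
    }
    return acc / static_cast<float>(n);
}

// Running mean of the per-window errors. Only a sum and a count are kept,
// so the reconstructed vectors can be overwritten by the next window.
struct ScoreAccumulator {
    double sum = 0.0;
    std::size_t count = 0;

    void add(float mse) {
        sum += mse;
        ++count;
    }

    // Average reconstruction error over all windows added so far.
    float score() const {
        return count ? static_cast<float>(sum / static_cast<double>(count)) : 0.0f;
    }
};
```

For a full WAV file, add() would be called once per window (196 times) and score() read once at the end.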

Hardware

For this benchmark, we use an ESP32-S3-N16R8 (16 MB flash, 8 MB PSRAM), which gives us plenty of space in the flash.

Testing scenario

  • Apply 5 parts from an anomaly dataset ➜ 5/196 of a full WAV.
  • Apply 5 parts from a normal dataset ➜ 5/196 of a full WAV.
  • Each part is float32[640] values (2560 bytes).
  • The ESP32 reports the MSE and the elapsed time in microseconds (us) for each part.

Code on the ESP side

static constexpr uint16_t INPUT_DIM = 640;
static constexpr uint16_t HIDDEN_DIM = 128;
static constexpr uint16_t BOTTLENECK_DIM = 8;

// Ping-pong buffers
static float BUF1[INPUT_DIM];
static float BUF2[INPUT_DIM];

// Copy of input for MSE
static float X0[INPUT_DIM];

FCNFile L1;  
L1.weight_fn  = "/w01.txt"; L1.bias_fn  = "/w02.txt"; L1.act  = ACT_RELU;

FCNFile L2;  
L2.weight_fn  = "/w03.txt"; L2.bias_fn  = "/w04.txt"; L2.act  = ACT_RELU;

FCNFile L3;  
L3.weight_fn  = "/w05.txt"; L3.bias_fn  = "/w06.txt"; L3.act  = ACT_RELU;

FCNFile L4;  
L4.weight_fn  = "/w07.txt"; L4.bias_fn  = "/w08.txt"; L4.act  = ACT_RELU;

FCNFile L5;  
L5.weight_fn  = "/w09.txt"; L5.bias_fn  = "/w10.txt"; L5.act  = ACT_RELU;

FCNFile L6;  
L6.weight_fn  = "/w11.txt"; L6.bias_fn  = "/w12.txt"; L6.act  = ACT_RELU;

FCNFile L7;  
L7.weight_fn  = "/w13.txt"; L7.bias_fn  = "/w14.txt"; L7.act  = ACT_RELU;

FCNFile L8;  
L8.weight_fn  = "/w15.txt"; L8.bias_fn  = "/w16.txt"; L8.act  = ACT_RELU;

FCNFile L9;  
L9.weight_fn  = "/w17.txt"; L9.bias_fn  = "/w18.txt"; L9.act  = ACT_RELU;

FCNFile L10; 
L10.weight_fn = "/w19.txt"; L10.bias_fn = "/w20.txt"; L10.act = ACT_NONE;

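// BUF1 holds the input window on entry; V tracks the current vector length.
uint16_t V = INPUT_DIM;

// 640 -> 128 -> 128 -> 128 -> 128 -> 8 -> 128 -> 128 -> 128 -> 128 -> 640
// The layers alternate between the two buffers, so after layer 10 the
// reconstruction is back in BUF1 (hence the separate input copy in X0).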
V = noodle_fcn(BUF1, V, HIDDEN_DIM,      BUF2, L1,  NULL);
V = noodle_fcn(BUF2, V, HIDDEN_DIM,      BUF1, L2,  NULL);
V = noodle_fcn(BUF1, V, HIDDEN_DIM,      BUF2, L3,  NULL);
V = noodle_fcn(BUF2, V, HIDDEN_DIM,      BUF1, L4,  NULL);
V = noodle_fcn(BUF1, V, BOTTLENECK_DIM,  BUF2, L5,  NULL);
V = noodle_fcn(BUF2, V, HIDDEN_DIM,      BUF1, L6,  NULL);
V = noodle_fcn(BUF1, V, HIDDEN_DIM,      BUF2, L7,  NULL);
V = noodle_fcn(BUF2, V, HIDDEN_DIM,      BUF1, L8,  NULL);
V = noodle_fcn(BUF1, V, HIDDEN_DIM,      BUF2, L9,  NULL);
V = noodle_fcn(BUF2, V, INPUT_DIM,       BUF1, L10, NULL);
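
For readers who want to see what a single dense layer computes, here is a small reference implementation in plain C++. It is not Noodle's implementation and makes no claim about how noodle_fcn works internally: dense_ref is a hypothetical helper, and the row-major [out][in] weight layout is an assumption that may not match the order of values in the w*.txt files.

```cpp
#include <cassert>
#include <cstddef>

// Reference dense layer: y[o] = act(b[o] + sum_i W[o][i] * x[i]).
// W is assumed to be row-major with shape [n_out][n_in].
// If relu is true, negative outputs are clamped to zero; otherwise the
// layer is linear.
void dense_ref(const float* x, std::size_t n_in,
               const float* W, const float* b,
               float* y, std::size_t n_out, bool relu) {
    // Every output needs the whole input, so the layer cannot run in place.
    // This is why two ping-pong buffers are used.
    assert(x != y);
    for (std::size_t o = 0; o < n_out; ++o) {
        float acc = b[o];
        for (std::size_t i = 0; i < n_in; ++i) {
            acc += W[o * n_in + i] * x[i];
        }
        y[o] = (relu && acc < 0.0f) ? 0.0f : acc;
    }
}
```

Chaining ten such calls while swapping the input and output buffers reproduces the 640 -> ... -> 640 structure of the listing above, with only two buffers of 640 floats live at any time.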

Compare ESP32 and Google Colab

ESP32

=== anom set ===
anom1: mse=9.17803478 us=23812318
anom2: mse=8.465662 us=23812164
anom3: mse=8.66184902 us=23812145
anom4: mse=9.07823086 us=23812145
anom5: mse=8.99418163 us=23812144
Mean anom MSE = 8.87559128

=== norm set ===
norm1: mse=10.8290482 us=23812112
norm2: mse=11.2695885 us=23812183
norm3: mse=11.530509 us=23812167
norm4: mse=11.6987247 us=23812152
norm5: mse=11.0928583 us=23812148
Mean norm MSE = 11.2841454

DONE (processed anom1..5 + norm1..5)

Google Colab

Input : input_1 shape= [  1 640] dtype= <class 'numpy.float32'>
Output: Identity shape= [  1 640] dtype= <class 'numpy.float32'>

--- ANOM ---
[TFLite] /content/sample_data/anom1.txt: mse=9.17908482
[TFLite] /content/sample_data/anom2.txt: mse=8.46575602
[TFLite] /content/sample_data/anom3.txt: mse=8.6624254
[TFLite] /content/sample_data/anom4.txt: mse=9.07872793
[TFLite] /content/sample_data/anom5.txt: mse=8.99621678

--- NORM ---
[TFLite] /content/sample_data/norm1.txt: mse=10.82921
[TFLite] /content/sample_data/norm2.txt: mse=11.2699071
[TFLite] /content/sample_data/norm3.txt: mse=11.5304345
[TFLite] /content/sample_data/norm4.txt: mse=11.6980121
[TFLite] /content/sample_data/norm5.txt: mse=11.0927933

--- SUMMARY ---
norm mean=11.2840714 std=0.344865398 min=10.82921 max=11.6980121
anom mean=8.87644219 std=0.300551306 min=8.46575602 max=9.17908482
anom/norm mean ratio = 0.786634706

Done.

Inference parity: ESP32 vs Colab

Anomaly set

| Sample | ESP32 MSE | TFLite MSE | Δ (abs) |
|---|---|---|---|
| anom1.txt | 9.1780 | 9.1791 | ~0.001 |
| anom2.txt | 8.4657 | 8.4658 | ~0.0001 |
| anom3.txt | 8.6618 | 8.6624 | ~0.0006 |
| anom4.txt | 9.0782 | 9.0787 | ~0.0005 |
| anom5.txt | 8.9942 | 8.9962 | ~0.002 |

Normal set

| Sample | ESP32 MSE | TFLite MSE | Δ (abs) |
|---|---|---|---|
| norm1.txt | 10.82905 | 10.82921 | ≈ 0.00016 |
| norm2.txt | 11.26959 | 11.26991 | ≈ 0.00032 |
| norm3.txt | 11.53051 | 11.53043 | ≈ 0.00007 |
| norm4.txt | 11.69872 | 11.69801 | ≈ 0.00071 |
| norm5.txt | 11.09286 | 11.09279 | ≈ 0.00007 |
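
The parity check can also be done programmatically. The sketch below uses the per-part MSE values copied from the ESP32 and Colab logs above; compare_scores and Parity are illustrative names, and the helper assumes that no reference value is zero.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Per-part MSE values copied from the two logs above
// (anom1..anom5 followed by norm1..norm5).
static const double kEsp32Mse[10] = {
    9.17803478, 8.465662,   8.66184902, 9.07823086, 8.99418163,
    10.8290482, 11.2695885, 11.530509,  11.6987247, 11.0928583};
static const double kTfliteMse[10] = {
    9.17908482, 8.46575602, 8.6624254,  9.07872793, 8.99621678,
    10.82921,   11.2699071, 11.5304345, 11.6980121, 11.0927933};

struct Parity {
    double max_abs;  // largest |device - reference|
    double max_rel;  // largest |device - reference| / |reference|
};

// Compares two lists of scores and returns the worst-case differences.
Parity compare_scores(const double* device, const double* reference,
                      std::size_t n) {
    assert(n > 0);
    Parity p = {0.0, 0.0};
    for (std::size_t i = 0; i < n; ++i) {
        assert(reference[i] != 0.0);
        const double abs_diff = std::fabs(device[i] - reference[i]);
        const double rel_diff = abs_diff / std::fabs(reference[i]);
        if (abs_diff > p.max_abs) p.max_abs = abs_diff;
        if (rel_diff > p.max_rel) p.max_rel = rel_diff;
    }
    return p;
}
```

For the ten parts logged here, compare_scores(kEsp32Mse, kTfliteMse, 10) reports a largest absolute difference of about 0.002 (anom5), which is roughly 0.02 % of the MSE value.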