The Brain: Liquid Neural Networks
The core of Golem is a Neural Circuit Policy (NCP) utilizing Closed-form Continuous-time (CfC) cells. With the introduction of multi-modal sensor fusion, the brain can dynamically scale its perception across visual, spatial (depth), auditory, and thermal domains.
```mermaid
flowchart TD
    subgraph Engine [ViZDoom Extraction Buffers]
        O_vis["RGB Visual (3x64x64)"]
        O_dep["Depth Buffer (1x64x64)"]
        O_thm["Thermal Labels (1x64x64)"]
        O_aud["Raw Stereo Audio (2xN)"]
    end
    subgraph VisCortex [Visual Cortex]
        Concat_Vis{"Concat Channels"}
        VisIn["Input (4x64x64)"]
        VisCNN["D x Conv2d(k=4, s=2, p=1) + ReLU"]
        VisFlat["Flatten"]
        V_t["Latent Vector V(t)"]
    end
    subgraph ThmCortex [Thermal Cortex]
        ThmIn["Input (1x64x64)"]
        ThmCNN["D x Conv2d(k=4, s=2, p=1) + ReLU"]
        ThmFlat["Flatten"]
        T_t["Latent Vector T(t)"]
    end
    subgraph AudCortex [Auditory Cortex]
        DSP["DSP: MelSpectrogram & AmplitudeToDB"]
        Mel["2D Spectrogram (2 x H_mels x W_time)"]
        AudCNN["3 x Conv2d(k=3, s=2, p=1) + ReLU"]
        AudPool["AdaptiveAvgPool2d((1, 1))"]
        AudFlat["Flatten"]
        A_t["Latent Vector A(t)"]
    end
    subgraph Core [Liquid Core & Motor Head]
        Fusion{"Concatenate ⊕"}
        I_t["Multi-Modal Input I(t)"]
        CfC["Closed-form Continuous-time (CfC) Cell (hidden = working_memory)"]
        hx_in[/"Previous State x(t-1)"/]
        hx_out[/"Next State x(t)"/]
        Linear["Linear Layer (n_actions)"]
        Sigmoid["Sigmoid Activation"]
        Y_t[/"Action Probabilities y(t)"/]
    end
    %% Visual Flow
    O_vis --> Concat_Vis
    O_dep --> Concat_Vis
    Concat_Vis --> VisIn
    VisIn --> VisCNN --> VisFlat --> V_t
    %% Thermal Flow
    O_thm --> ThmIn
    ThmIn --> ThmCNN --> ThmFlat --> T_t
    %% Auditory Flow
    O_aud --> DSP --> Mel --> AudCNN --> AudPool --> AudFlat --> A_t
    %% Fusion
    V_t --> Fusion
    T_t --> Fusion
    A_t --> Fusion
    %% State and Output
    Fusion --> I_t
    I_t --> CfC
    hx_in -.-> CfC
    CfC -.-> hx_out
    CfC --> Linear
    Linear --> Sigmoid --> Y_t
```
1. Visual Cortex (CNN)
The input observation \(o_t\) is first processed by a Convolutional Neural Network (CNN) to extract spatial features. This hierarchy reduces the high-dimensional pixel space into a flattened, latent feature vector \(V(t)\).
The architecture scales dynamically based on the configured cortical_depth (\(D\)) and the active sensors. Given an input tensor of \(C\times64\times64\) (where \(C=3\) for standard RGB, or \(C=4\) if the stereoscopic depth buffer is enabled), sequential convolutions (kernel size 4, stride 2, and padding 1 to avoid dropping spatial data at the edges) coupled with ReLU activations compress the spatial manifold. Each convolutional layer halves the spatial dimensions (\(H\) and \(W\)) while doubling the feature channels (\(C\)). For example, a depth of \(D=4\) compresses the feature maps into a highly dense representation, outputting a flattened latent feature vector \(V(t)\in\mathbb{R}^{1024}\).
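As a concrete sketch of this scaling rule, the stack below builds \(D\) stride-2 convolutions. The starting channel width of 8 is an illustrative assumption chosen so that \(D=4\) on a \(4\times64\times64\) input reproduces the 1024-dimensional latent quoted above; it is not necessarily the exact width used in brain.py.

```python
import torch
import torch.nn as nn

def build_visual_cortex(in_channels, depth, base_channels=8):
    """Stack `depth` stride-2 convolutions: each halves H and W and doubles the channels."""
    layers, c_in, c_out = [], in_channels, base_channels
    for _ in range(depth):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1), nn.ReLU()]
        c_in, c_out = c_out, c_out * 2
    layers.append(nn.Flatten())
    return nn.Sequential(*layers)

# D=4 on a 4x64x64 input: spatial 64 -> 32 -> 16 -> 8 -> 4, channels 8 -> 16 -> 32 -> 64
cortex = build_visual_cortex(in_channels=4, depth=4)
v_t = cortex(torch.zeros(1, 4, 64, 64))   # flattened latent: 64 * 4 * 4 = 1024
```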
2. Auditory Cortex (2D CNN & Mel Spectrograms)
If the audio sensor is enabled, Golem expands its phenomenology by processing the raw, high-frequency stereo audio buffer from the engine.
To ensure network stability and prevent gradient explosion, the raw audio arrays are first strictly normalized (zero-mean, unit-variance) during the extraction phase. During data loading and live inference, the normalized 1D waveforms are converted into dense 2D time-frequency tensors (scaled to decibels) using torchaudio transforms.
This transformation lets the network process audio as a spatial map via a Short-Time Fourier Transform (STFT). The resulting Mel Spectrogram is routed through a parallel 2D Convolutional Neural Network (nn.Conv2d). By leveraging spatial locality, this architecture aligns acoustic recognition with the existing visual processing hierarchy, enabling the model to recognize the "visual" shape of acoustic cues (such as a monster's growl or a plasma rifle firing) while naturally compressing high-frequency acoustic noise.
This pathway consists of sequential 2D convolutions (kernel size 3, stride 2, padding 1) and an AdaptiveAvgPool2d((1, 1)) layer to extract the final auditory features regardless of the variable temporal width (\(W_{time}\)) generated by the STFT hop length. This outputs a fixed-size latent audio vector \(A(t)\).
3. Thermal Cortex (Parallel 2D CNN)
If the thermal sensor is enabled, Golem utilizes ViZDoom's semantic segmentation labels_buffer to decouple spatial navigation from active enemy detection. This isolates dynamic entities (monsters, projectiles, items) from the static background geometry, projecting them as a binary "thermal" mask that severely reduces the visual noise the model must parse during combat.
The extracted binary mask is resized to \(1\times64\times64\) utilizing nearest-neighbor interpolation to prevent edge anti-aliasing artifacts, and routed through an isolated, parallel 2D Convolutional Neural Network (nn.Conv2d). This pathway scales identically to the Visual Cortex based on the cortical_depth (\(D\)), but initiates with a specialized filter width. It utilizes sequential convolutions (kernel size 4, stride 2, padding 1) and ReLU activations, starting at 16 output channels and doubling at each layer, allowing the network to learn independent dynamic entity-tracking filters without interference from static environmental textures.
This pathway compresses the binary mask into a latent thermal vector \(T(t)\in\mathbb{R}^{512}\) (at \(D=4\)).
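The mask-preparation step can be sketched as follows; the native labels_buffer resolution (120x160) and the entity IDs are illustrative stand-ins:

```python
import torch
import torch.nn.functional as F

# A stand-in ViZDoom labels_buffer at native resolution: object IDs per pixel (0 = background).
labels = torch.randint(0, 5, (1, 1, 120, 160), dtype=torch.uint8)

# Binary "thermal" mask: 1.0 wherever a dynamic entity (monster, projectile, item) is present.
mask = (labels > 0).float()

# Nearest-neighbor resize keeps the mask strictly binary (no anti-aliased gray edges).
thermal = F.interpolate(mask, size=(64, 64), mode="nearest")
```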
Sensor Fusion Concatenation
If multiple modalities are active, their respective flattened feature vectors are dynamically concatenated:

\[
I(t) = V(t) \oplus A(t) \oplus T(t)
\]

where \(I(t)\in\mathbb{R}^{W_f}\) is the final, unified multi-modal representation fed into the liquid core, inactive modalities are simply omitted from the concatenation, and \(W_f\) is the combined flat size of all active cortices.
To guarantee architectural stability and prevent tensor shape mismatches, \(W_f\) is no longer calculated via brittle, hardcoded algebra. Instead, it is resolved dynamically at initialization: the network constructs a set of zero-tensors (torch.zeros()) matching the configured sensory dimensions and passes them through the respective convolutional pathways inside a torch.no_grad() block. The resulting flattened vectors are measured to definitively compute \(W_f\), seamlessly supporting arbitrary changes to cortical_depth or kernel properties. For example, combining the aforementioned visual (\(1024\)) and thermal (\(512\)) cortices yields a unified \(W_f=1536\) input vector.
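The zero-tensor probing trick can be sketched as follows; the two toy pathways here are stand-ins for the real cortices, not their actual layer widths:

```python
import torch
import torch.nn as nn

def probe_flat_size(pathway, input_shape):
    """Measure a cortex's flattened output size by pushing a zero-tensor through it."""
    with torch.no_grad():
        dummy = torch.zeros(1, *input_shape)      # matches the configured sensory dims
        return pathway(dummy).flatten(1).shape[1]

# Toy single-layer pathways (illustrative widths only).
visual = nn.Sequential(nn.Conv2d(4, 8, 4, 2, 1), nn.ReLU(), nn.Flatten())
thermal = nn.Sequential(nn.Conv2d(1, 16, 4, 2, 1), nn.ReLU(), nn.Flatten())

# W_f is simply the sum over active cortices -- no hardcoded shape algebra.
w_f = probe_flat_size(visual, (4, 64, 64)) + probe_flat_size(thermal, (1, 64, 64))
```

Because the size is measured rather than derived, any change to cortical_depth, kernel size, or padding is absorbed automatically.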
4. Liquid Core (CfC) & State Persistence
Standard Recurrent Neural Networks (RNNs) update their hidden state via discrete, uniform steps. In contrast, Liquid Time-Constant (LTC) networks model the hidden state \(x(t)\) as a continuous-time dynamical system of Ordinary Differential Equations (ODEs) responding to a continuous flow of time:

\[
\frac{dx(t)}{dt} = -\left[\frac{1}{\tau} + f\big(x(t), I(t), t, \theta\big)\right] x(t) + f\big(x(t), I(t), t, \theta\big)\, A
\]

This ODE dictates that the neural network \(f\) not only determines the derivative of the hidden state but also serves as an input-dependent, varying time-constant. This enables the network to dynamically adjust its "memory horizon," allowing specific neurons to adapt their coupling sensitivity in real time.
However, solving this ODE numerically during live gameplay introduces severe computational latency, as traditional numerical solvers (like Runge-Kutta) require multiple iterative evaluations per time step. To meet the real-time inference budget imposed by the ViZDoom engine, Golem utilizes the Closed-form Continuous-time (CfC) approximation (Hasani et al., 2022).
This formulation bypasses the numerical solver entirely by approximating the integral with a tight, closed-form gating mechanism:

\[
x(t) = \sigma\big({-f(x, I;\, \theta_f)\, t}\big) \odot g(x, I;\, \theta_g) + \Big[1 - \sigma\big({-f(x, I;\, \theta_f)\, t}\big)\Big] \odot h(x, I;\, \theta_h)
\]

where \(f\), \(g\), and \(h\) represent distinct neural network branches parameterizing the state flow, and \(\odot\) denotes the Hadamard product. The exponential decay of the true solution is approximated via the sigmoid activation \(\sigma\). This closed-form solution guarantees stable, bounded dynamics and accelerates training and inference by one to five orders of magnitude compared to strict ODE-based counterparts.
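A toy, single-branch rendering of this gate is sketched below. The production CfC cell has considerably more structure (backbone layers, per-neuron time constants); this only demonstrates the gating identity itself:

```python
import torch
import torch.nn as nn

class TinyCfC(nn.Module):
    """Minimal closed-form gate: x(t) = sigma(-f*t) * g + (1 - sigma(-f*t)) * h."""
    def __init__(self, in_dim, hidden):
        super().__init__()
        self.f = nn.Linear(in_dim + hidden, hidden)  # time-constant branch
        self.g = nn.Linear(in_dim + hidden, hidden)  # short-horizon target state
        self.h = nn.Linear(in_dim + hidden, hidden)  # long-horizon target state

    def forward(self, i_t, x_prev, t=1.0):
        z = torch.cat([i_t, x_prev], dim=-1)
        gate = torch.sigmoid(-self.f(z) * t)         # sigmoid replaces the exponential decay
        return gate * torch.tanh(self.g(z)) + (1 - gate) * torch.tanh(self.h(z))

cell = TinyCfC(in_dim=16, hidden=8)
x = cell(torch.zeros(1, 16), torch.zeros(1, 8))      # one closed-form state update, no solver
```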
The "Amnesia" Constraint (Stateful Inference)
Because the underlying differential mathematics assume a continuous temporal flow, the network must accumulate evidence to build action potential. During asynchronous live gameplay (inference), the engine feeds the active cortices discrete buffers. The hidden state must be explicitly captured and recursively fed back into the network on the subsequent frame. Failing to persist this state across the deployment loop lobotomizes the network 35 times a second, preventing the CfC activation threshold from ever being reached.
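The fix is simply to thread the hidden state through the deployment loop. The sketch below uses a GRUCell as a stand-in for the CfC core purely to show the state-persistence pattern; the dimensions are illustrative:

```python
import torch
import torch.nn as nn

cell = nn.GRUCell(16, 8)      # stand-in recurrent core; the real brain uses a CfC cell

# WRONG: re-zeroing hx every frame "lobotomizes" the core 35 times a second.
# RIGHT: capture the returned state and feed it back on the next frame.
hx = torch.zeros(1, 8)
for _ in range(35):                       # one simulated second of gameplay at 35 fps
    obs = torch.randn(1, 16)              # stand-in for the fused latent I(t)
    hx = cell(obs, hx)                    # persist: x(t-1) -> x(t)
```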
5. Motor Cortex (Linear Head)
The liquid hidden state \(x(t)\in\mathbb{R}^{W_m}\) (where \(W_m\) is the dynamically configured working_memory, e.g., 64 or 128) is projected to the dynamic action space via a final linear transformation. To accommodate the variable action supersets defined by the active profile \(\rho\), the output weight matrix dynamically scales its dimensionality \(n_\rho\in\{8, 9, 10\}\):

\[
\mathbf{z}_t = W_{\text{out}}\, x(t) + \mathbf{b}, \qquad W_{\text{out}} \in \mathbb{R}^{n_\rho \times W_m}
\]

This produces raw logits \(\mathbf{z}_t\), which are subsequently passed through a continuous Sigmoid activation function to yield the final predicted probabilities for the multi-label Bernoulli distribution:

\[
\mathbf{y}_t = \sigma(\mathbf{z}_t) = \frac{1}{1 + e^{-\mathbf{z}_t}}
\]
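In code, the motor head reduces to a linear projection followed by an element-wise sigmoid; the \(W_m\) and \(n_\rho\) values below are illustrative picks from the ranges given above:

```python
import torch
import torch.nn as nn

W_M, N_ACTIONS = 64, 9                    # working_memory size and |action superset|
motor = nn.Linear(W_M, N_ACTIONS)

x_t = torch.randn(1, W_M)                 # liquid hidden state x(t)
logits = motor(x_t)                       # raw z_t
probs = torch.sigmoid(logits)             # independent per-action Bernoulli probabilities
actions = probs > 0.5                     # multi-label: several buttons may fire at once
```

Unlike a softmax head, each action probability is independent, so the agent can, e.g., move and shoot simultaneously.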
API Reference
Because the architecture is fully dynamic, the DoomLiquidNet class constructs its layers on-the-fly based on the active app.yaml configuration profile, the selected sensor fusion modalities, and the active Digital Signal Processing (DSP) hyperparameters.
Bases: Module
A continuous-time neural network for visual processing and temporal sequential decision-making.
This network acts as the agent's brain. It processes raw pixel buffers through a Convolutional Neural Network (Visual Cortex) to extract spatial features, which are then fed into a Closed-form Continuous-time (CfC) recurrent network (Liquid Core). The CfC core manages the agent's temporal state using differential equation approximations, allowing it to handle variable time-steps without requiring expensive ODE solvers.
It supports multi-modal sensor fusion, seamlessly integrating spatial depth, auditory spectrograms, and thermal semantic segmentation masks into a unified latent representation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `n_actions` | `int` | The number of output actions for the Motor Cortex head. | *required* |
| `cortical_depth` | `int` | The number of convolutional layers to generate. Each layer halves the spatial dimensions and doubles the feature channels. | `2` |
| `working_memory` | `int` | The number of hidden units in the CfC liquid core, representing the capacity of the agent's temporal memory. | `64` |
| `sensors` | `SensorsConfig` | Booleans mapping which multi-modal networks to enable (e.g., visual, depth, audio, thermal). | `None` |
| `dsp_config` | `DSPConfig` | Signal processing parameters for audio initialization. | `None` |
Source code in app/models/brain.py
forward(x_vis, x_aud=None, x_thm=None, hx=None)
Performs a forward pass through the visual, auditory, and thermal cortices, and the liquid core.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x_vis` | `Tensor` | A batched sequence of visual frames. | *required* |
| `x_aud` | `Tensor` | Raw 1D waveforms of shape `(Batch, Time, Stereo_Channels, Audio_Length)`. | `None` |
| `x_thm` | `Tensor` | A batched sequence of binary thermal masks. | `None` |
| `hx` | `Tensor` | The previous hidden state of the liquid core. | `None` |
Returns:

| Name | Type | Description |
|---|---|---|
| `tuple` | `tuple` | A tuple containing the unnormalized action logits (`Tensor`) and the next hidden state of the liquid core (`Tensor`). |
Source code in app/models/brain.py