The Brain: Liquid Neural Networks
The core of Golem is a Neural Circuit Policy (NCP) utilizing Closed-form Continuous-time (CfC) cells. With the introduction of multi-modal sensor fusion, the brain can dynamically scale its perception across visual, spatial (depth), auditory, and thermal domains.
```mermaid
flowchart TD
    subgraph Engine [ViZDoom Extraction Buffers]
        O_vis["RGB Visual (3x64x64)"]
        O_dep["Depth Buffer (1x64x64)"]
        O_thm["Thermal Labels (1x64x64)"]
        O_aud["Raw Stereo Audio (2xN)"]
    end
    subgraph VisCortex [Visual Cortex]
        Concat_Vis{"Concat Channels"}
        VisIn["Input (4x64x64)"]
        VisCNN["D x Conv2d(k=4, s=2, p=1) + ReLU"]
        VisFlat["Flatten"]
        V_t["Latent Vector V(t)"]
    end
    subgraph ThmCortex [Thermal Cortex]
        ThmIn["Input (1x64x64)"]
        ThmCNN["D x Conv2d(k=4, s=2, p=1) + ReLU"]
        ThmFlat["Flatten"]
        T_t["Latent Vector T(t)"]
    end
    subgraph AudCortex [Auditory Cortex]
        DSP["DSP: MelSpectrogram & AmplitudeToDB"]
        Mel["2D Spectrogram (2 x H_mels x W_time)"]
        AudCNN["3 x Conv2d(k=3, s=2, p=1) + ReLU"]
        AudPool["AdaptiveAvgPool2d((1, 1))"]
        AudFlat["Flatten"]
        A_t["Latent Vector A(t)"]
    end
    subgraph Core [Liquid Core & Motor Head]
        Fusion{"Concatenate ⊕"}
        I_t["Multi-Modal Input I(t)"]
        CfC["Closed-form Continuous-time (CfC) Cell (hidden = working_memory)"]
        hx_in[/"Previous State x(t-1)"/]
        hx_out[/"Next State x(t)"/]
        Linear["Linear Layer (n_actions)"]
        Sigmoid["Sigmoid Activation"]
        Y_t[/"Action Probabilities y(t)"/]
    end
    %% Visual Flow
    O_vis --> Concat_Vis
    O_dep --> Concat_Vis
    Concat_Vis --> VisIn
    VisIn --> VisCNN --> VisFlat --> V_t
    %% Thermal Flow
    O_thm --> ThmIn
    ThmIn --> ThmCNN --> ThmFlat --> T_t
    %% Auditory Flow
    O_aud --> DSP --> Mel --> AudCNN --> AudPool --> AudFlat --> A_t
    %% Fusion
    V_t --> Fusion
    T_t --> Fusion
    A_t --> Fusion
    %% State and Output
    Fusion --> I_t
    I_t --> CfC
    hx_in -.-> CfC
    CfC -.-> hx_out
    CfC --> Linear
    Linear --> Sigmoid --> Y_t
```
1. Visual Cortex (CNN)
The input observation \(o_t\) is first processed by a Convolutional Neural Network (CNN) to extract spatial features. This hierarchy reduces the high-dimensional pixel space into a flattened, latent feature vector \(V(t)\).
The architecture scales dynamically based on the configured cortical_depth (\(D\)) and the active sensors. Given an input tensor of \(C\times64\times64\) (where \(C=3\) for standard RGB, or \(C=4\) if the stereoscopic depth buffer is enabled), sequential convolutions (kernel size 4, stride 2, and padding 1 to avoid dropping spatial data at the edges) coupled with ReLU activations compress the spatial manifold. Each convolutional layer halves the spatial dimensions (\(H\) and \(W\)) while doubling the feature channels (\(C\)). For example, a depth of \(D=4\) compresses the feature maps into a highly dense representation, outputting a flattened latent feature vector \(V(t)\in\mathbb{R}^{1024}\).
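As a concrete sketch of this scaling rule, the stack below builds \(D\) stride-2 convolutions. The starting channel width of 8 is an illustrative assumption chosen so that \(D=4\) on a \(4\times64\times64\) input reproduces the 1024-dimensional latent quoted above; it is not necessarily the exact width used in brain.py.

```python
import torch
import torch.nn as nn

def build_visual_cortex(in_channels, depth, base_channels=8):
    """Stack `depth` stride-2 convolutions: each halves H and W and doubles the channels."""
    layers, c_in, c_out = [], in_channels, base_channels
    for _ in range(depth):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1), nn.ReLU()]
        c_in, c_out = c_out, c_out * 2
    layers.append(nn.Flatten())
    return nn.Sequential(*layers)

# D=4 on a 4x64x64 input: spatial 64 -> 32 -> 16 -> 8 -> 4, channels 8 -> 16 -> 32 -> 64
cortex = build_visual_cortex(in_channels=4, depth=4)
v_t = cortex(torch.zeros(1, 4, 64, 64))   # flattened latent: 64 * 4 * 4 = 1024
```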
2. Auditory Cortex (2D CNN & Mel Spectrograms)
If the audio sensor is enabled, Golem expands its phenomenology by processing the raw, high-frequency stereo audio buffer from the engine.
To ensure network stability and prevent gradient explosion, the raw audio arrays are first strictly normalized (zero-mean, unit-variance) during the extraction phase. During data loading and live inference, the normalized 1D waveforms are converted into dense 2D time-frequency tensors (scaled to decibels) using torchaudio transforms.
This transformation lets the network process audio as a spatial map via a Short-Time Fourier Transform (STFT). The resulting Mel Spectrogram is routed through a parallel 2D Convolutional Neural Network (nn.Conv2d). By leveraging spatial locality, this architecture aligns acoustic recognition with the existing visual processing hierarchy, enabling the model to recognize the "visual" shape of acoustic cues (such as a monster's growl or a plasma rifle firing) while naturally compressing high-frequency acoustic noise.
This pathway consists of sequential 2D convolutions (kernel size 3, stride 2, padding 1) and an AdaptiveAvgPool2d((1, 1)) layer to extract the final auditory features regardless of the variable temporal width (\(W_{time}\)) generated by the STFT hop length. This outputs a fixed-size latent audio vector \(A(t)\).
3. Thermal Cortex (Parallel 2D CNN)
If the thermal sensor is enabled, Golem utilizes ViZDoom's semantic segmentation labels_buffer to decouple spatial navigation from active enemy detection. This isolates dynamic entities (monsters, projectiles, items) from the static background geometry, projecting them as a binary "thermal" mask that severely reduces the visual noise the model must parse during combat.
The extracted binary mask is resized to \(1\times64\times64\) utilizing nearest-neighbor interpolation to prevent edge anti-aliasing artifacts, and routed through an isolated, parallel 2D Convolutional Neural Network (nn.Conv2d). This pathway scales identically to the Visual Cortex based on the cortical_depth (\(D\)), but initiates with a specialized filter width. It utilizes sequential convolutions (kernel size 4, stride 2, padding 1) and ReLU activations, starting at 16 output channels and doubling at each layer, allowing the network to learn independent dynamic entity-tracking filters without interference from static environmental textures.
This pathway compresses the binary mask into a latent thermal vector \(T(t)\in\mathbb{R}^{512}\) (at \(D=4\)).
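The mask-preparation step can be sketched as follows; the native labels_buffer resolution (120x160) and the entity IDs are illustrative stand-ins:

```python
import torch
import torch.nn.functional as F

# A stand-in ViZDoom labels_buffer at native resolution: object IDs per pixel (0 = background).
labels = torch.randint(0, 5, (1, 1, 120, 160), dtype=torch.uint8)

# Binary "thermal" mask: 1.0 wherever a dynamic entity (monster, projectile, item) is present.
mask = (labels > 0).float()

# Nearest-neighbor resize keeps the mask strictly binary (no anti-aliased gray edges).
thermal = F.interpolate(mask, size=(64, 64), mode="nearest")
```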
Sensor Fusion Concatenation
If multiple modalities are active, their respective flattened feature vectors are dynamically concatenated:

\[
I(t) = V(t) \oplus A(t) \oplus T(t)
\]

where \(I(t)\in\mathbb{R}^{W_f}\) is the final, unified multi-modal representation fed into the liquid core, inactive modalities are simply omitted from the concatenation, and \(W_f\) is the combined flat size of all active cortices.
To guarantee architectural stability and prevent tensor shape mismatches, \(W_f\) is no longer calculated via brittle, hardcoded algebra. Instead, it is resolved dynamically at initialization: the network constructs a set of zero-tensors (torch.zeros()) matching the configured sensory dimensions and passes them through the respective convolutional pathways inside a torch.no_grad() block. The resulting flattened vectors are measured to definitively compute \(W_f\), seamlessly supporting arbitrary changes to cortical_depth or kernel properties. For example, combining the aforementioned visual (\(1024\)) and thermal (\(512\)) cortices yields a unified \(W_f=1536\) input vector.
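The zero-tensor probing trick can be sketched as follows; the two toy pathways here are stand-ins for the real cortices, not their actual layer widths:

```python
import torch
import torch.nn as nn

def probe_flat_size(pathway, input_shape):
    """Measure a cortex's flattened output size by pushing a zero-tensor through it."""
    with torch.no_grad():
        dummy = torch.zeros(1, *input_shape)      # matches the configured sensory dims
        return pathway(dummy).flatten(1).shape[1]

# Toy single-layer pathways (illustrative widths only).
visual = nn.Sequential(nn.Conv2d(4, 8, 4, 2, 1), nn.ReLU(), nn.Flatten())
thermal = nn.Sequential(nn.Conv2d(1, 16, 4, 2, 1), nn.ReLU(), nn.Flatten())

# W_f is simply the sum over active cortices -- no hardcoded shape algebra.
w_f = probe_flat_size(visual, (4, 64, 64)) + probe_flat_size(thermal, (1, 64, 64))
```

Because the size is measured rather than derived, any change to cortical_depth, kernel size, or padding is absorbed automatically.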
4. Liquid Core (CfC) & State Persistence
Standard Recurrent Neural Networks (RNNs) update their hidden state via discrete, uniform steps. In contrast, Liquid Time-Constant (LTC) networks model the hidden state \(x(t)\) as a continuous-time dynamical system of Ordinary Differential Equations (ODEs) responding to a continuous flow of time:

\[
\frac{dx(t)}{dt} = -\left[\frac{1}{\tau} + f\big(x(t), I(t), t, \theta\big)\right] x(t) + f\big(x(t), I(t), t, \theta\big)\, A
\]

This ODE dictates that the neural network \(f\) not only determines the derivative of the hidden state but also serves as an input-dependent, varying time-constant. This enables the network to dynamically adjust its "memory horizon," allowing specific neurons to adapt their coupling sensitivity in real time.
However, solving this ODE numerically during live gameplay introduces severe computational latency, as traditional numerical solvers (like Runge-Kutta) require multiple iterative evaluations per time step. To meet the real-time inference budget imposed by the ViZDoom engine, Golem utilizes the Closed-form Continuous-time (CfC) approximation (Hasani et al., 2022).
This formulation bypasses the numerical solver entirely by approximating the integral with a tight, closed-form gating mechanism:

\[
x(t) = \sigma\big({-f(x, I;\, \theta_f)\, t}\big) \odot g(x, I;\, \theta_g) + \Big[1 - \sigma\big({-f(x, I;\, \theta_f)\, t}\big)\Big] \odot h(x, I;\, \theta_h)
\]

where \(f\), \(g\), and \(h\) represent distinct neural network branches parameterizing the state flow, and \(\odot\) denotes the Hadamard product. The exponential decay of the true solution is approximated via the sigmoid activation \(\sigma\). This closed-form solution guarantees stable, bounded dynamics and accelerates training and inference by one to five orders of magnitude compared to strict ODE-based counterparts.
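A toy, single-branch rendering of this gate is sketched below. The production CfC cell has considerably more structure (backbone layers, per-neuron time constants); this only demonstrates the gating identity itself:

```python
import torch
import torch.nn as nn

class TinyCfC(nn.Module):
    """Minimal closed-form gate: x(t) = sigma(-f*t) * g + (1 - sigma(-f*t)) * h."""
    def __init__(self, in_dim, hidden):
        super().__init__()
        self.f = nn.Linear(in_dim + hidden, hidden)  # time-constant branch
        self.g = nn.Linear(in_dim + hidden, hidden)  # short-horizon target state
        self.h = nn.Linear(in_dim + hidden, hidden)  # long-horizon target state

    def forward(self, i_t, x_prev, t=1.0):
        z = torch.cat([i_t, x_prev], dim=-1)
        gate = torch.sigmoid(-self.f(z) * t)         # sigmoid replaces the exponential decay
        return gate * torch.tanh(self.g(z)) + (1 - gate) * torch.tanh(self.h(z))

cell = TinyCfC(in_dim=16, hidden=8)
x = cell(torch.zeros(1, 16), torch.zeros(1, 8))      # one closed-form state update, no solver
```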
The "Amnesia" Constraint (Stateful Inference)
Because the underlying differential mathematics assume a continuous temporal flow, the network must accumulate evidence to build action potential. During asynchronous live gameplay (inference), the engine feeds the active cortices discrete buffers. The hidden state must be explicitly captured and recursively fed back into the network on the subsequent frame. Failing to persist this state across the deployment loop lobotomizes the network 35 times a second, preventing the CfC activation threshold from ever being reached.
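The fix is simply to thread the hidden state through the deployment loop. The sketch below uses a GRUCell as a stand-in for the CfC core purely to show the state-persistence pattern; the dimensions are illustrative:

```python
import torch
import torch.nn as nn

cell = nn.GRUCell(16, 8)      # stand-in recurrent core; the real brain uses a CfC cell

# WRONG: re-zeroing hx every frame "lobotomizes" the core 35 times a second.
# RIGHT: capture the returned state and feed it back on the next frame.
hx = torch.zeros(1, 8)
for _ in range(35):                       # one simulated second of gameplay at 35 fps
    obs = torch.randn(1, 16)              # stand-in for the fused latent I(t)
    hx = cell(obs, hx)                    # persist: x(t-1) -> x(t)
```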
5. Motor Cortex (Linear Head)
The liquid hidden state \(x(t)\in\mathbb{R}^{W_m}\) (where \(W_m\) is the dynamically configured working_memory, e.g., 64 or 128) is projected to the dynamic action space via a final linear transformation. To accommodate the variable action supersets defined by the active profile \(\rho\), the output weight matrix dynamically scales its dimensionality \(n_\rho\in\{8, 9, 10\}\):

\[
\mathbf{z}_t = W_{\text{out}}\, x(t) + \mathbf{b}, \qquad W_{\text{out}} \in \mathbb{R}^{n_\rho \times W_m}
\]

This produces raw logits \(\mathbf{z}_t\), which are subsequently passed through a continuous Sigmoid activation function to yield the final predicted probabilities for the multi-label Bernoulli distribution:

\[
\mathbf{y}_t = \sigma(\mathbf{z}_t) = \frac{1}{1 + e^{-\mathbf{z}_t}}
\]
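In code, the motor head reduces to a linear projection followed by an element-wise sigmoid; the \(W_m\) and \(n_\rho\) values below are illustrative picks from the ranges given above:

```python
import torch
import torch.nn as nn

W_M, N_ACTIONS = 64, 9                    # working_memory size and |action superset|
motor = nn.Linear(W_M, N_ACTIONS)

x_t = torch.randn(1, W_M)                 # liquid hidden state x(t)
logits = motor(x_t)                       # raw z_t
probs = torch.sigmoid(logits)             # independent per-action Bernoulli probabilities
actions = probs > 0.5                     # multi-label: several buttons may fire at once
```

Unlike a softmax head, each action probability is independent, so the agent can, e.g., move and shoot simultaneously.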
API Reference
Because the architecture is fully dynamic, the DoomLiquidNet class constructs its layers on-the-fly based on the active app.yaml configuration profile, the selected sensor fusion modalities, and the active Digital Signal Processing (DSP) hyperparameters.
Bases: Module
A continuous-time neural network for visual processing and temporal sequential decision-making.
This network acts as the agent's brain. It processes raw pixel buffers through a Convolutional Neural Network (Visual Cortex) to extract spatial features, which are then fed into a Closed-form Continuous-time (CfC) recurrent network (Liquid Core). The CfC core manages the agent's temporal state using differential equation approximations, allowing it to handle variable time-steps without requiring expensive ODE solvers.
It supports multi-modal sensor fusion, seamlessly integrating spatial depth, auditory spectrograms, and thermal semantic segmentation masks into a unified latent representation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `n_actions` | `int` | The number of output actions for the Motor Cortex head. | *required* |
| `cortical_depth` | `int` | The number of convolutional layers to generate. Each layer halves the spatial dimensions and doubles the feature channels. | `2` |
| `working_memory` | `int` | The number of hidden units in the CfC liquid core, representing the capacity of the agent's temporal memory. | `64` |
| `sensors` | `SensorsConfig` | Booleans mapping which multi-modal networks to enable (e.g., visual, depth, audio, thermal). | `None` |
| `dsp_config` | `DSPConfig` | Signal processing parameters for audio initialization. | `None` |
Source code in app/models/brain.py
forward(x_vis, x_aud=None, x_thm=None, hx=None)
Performs a forward pass through the visual, auditory, and thermal cortices, and the liquid core.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x_vis` | `Tensor` | A batched sequence of visual frames. | *required* |
| `x_aud` | `Tensor` | Raw 1D waveforms of shape `(Batch, Time, Stereo_Channels, Audio_Length)`. | `None` |
| `x_thm` | `Tensor` | A batched sequence of binary thermal masks. | `None` |
| `hx` | `Tensor` | The previous hidden state of the liquid core. | `None` |
Returns:

| Name | Type | Description |
|---|---|---|
| `tuple` | `tuple` | A tuple containing the unnormalized action logits (`Tensor`) and the next hidden state of the liquid core (`Tensor`). |
Source code in app/models/brain.py