I Tested AI Music Models on Progressive House Lead Layering — Here's What Actually Worked

April 30, 2026 · 27 min read

AI & Software Engineer · Music Producer (EyeMad)

There is a very specific kind of tension that builds in a progressive house track — the kind that grabs your chest somewhere around bar 28 of the break, when the filter opens halfway and the reverb tail of the lead swells. Producers like Alesso, Martin Garrix, and Axwell /\ Ingrosso have turned that tension into a craft. I have spent years studying it, replicating it, and eventually producing it myself under my alias EyeMad.

But recently, as an AI engineer, I started asking a different question: could an AI music generation model — without fine-tuning, using only prompt engineering — accurately describe, architect, and guide the creation of those layered progressive house lead sounds that define the genre's signature drops?

This article is my attempt to answer that question. It is long, technical, and deeply personal. It lives at the intersection of two things I love most.

🎵 First: What Are We Actually Talking About?

Before we go deep into AI territory, let me establish what "progressive house lead layering" actually means from a production standpoint. Let's anchor the conversation in real music.

Progressive House Playlist — My Vision for the Genre

🎛️ The Anatomy of a Progressive House Lead Layer Stack

A "lead" in progressive house is not a single synth sound. It is a composite texture assembled from multiple individual layers, each serving a distinct psychoacoustic role. When done right, you cannot perceive the individual components — only the unified, enormous result.

Here is the typical architecture:

Layer 1 — The Foundation Saw (Sub-Octave)

This layer sits one octave below the main melody. It is usually a clean or lightly saturated sawtooth running through a low-pass filter set to around 2–4 kHz with moderate resonance. Its job is to add weight and body without muddying the mid-range. In Sylenth1 (the go-to synth for this era of progressive house), this means:

OSC: Sawtooth, 1 voice, no detune
Filter: LP24 at ~2.5 kHz, resonance 15–25%
Envelope: Fast attack (2ms), sustain at 100%, medium release (~300ms)
Volume: Sitting ~6–9 dB below the main layer

Layer 2 — The Supersaw Core

This is the centerpiece. The supersaw is a waveform made of multiple detuned sawtooth oscillators summed together. Pioneered by the Roland JP-8000 in the late 1990s, the supersaw achieved its defining status in trance and progressive house because of the natural chorus effect created by oscillator beating.

Key parameters:

Unison voices: 7–8 (odd numbers tend to produce more centered stereo images)
Detune amount: 0.15–0.35 semitones (measured in cents: 15–35 cents total spread)
Filter cutoff: ~5–8 kHz, LP24 mode
Filter resonance: 0–8% (high resonance in the main layer fights the mix)
Envelope: Very fast attack (~1ms), full sustain, medium release

The "width" of the supersaw is controlled by the detune spread. Too narrow and it sounds thin; too wide and it loses pitch definition — particularly destructive when live audiences are hearing the song through a festival PA system.
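To make the detune numbers concrete, here is a minimal sketch of how a total spread in cents turns into per-voice frequencies. It assumes the voices are spaced evenly across the spread, which is a simplification: real supersaw implementations often use their own spacing curves. The function names are hypothetical.

```python
import math


def unison_frequencies(base_hz: float, voices: int = 7,
                       spread_cents: float = 25.0) -> list[float]:
    """Return one frequency per unison voice, spaced evenly in cents.

    Assumption: linear spacing across the total spread. Hardware and
    plugin supersaws often use their own (non-linear) spacing curves.
    """
    if voices < 1:
        raise ValueError("voices must be at least 1")
    if voices == 1:
        return [base_hz]
    step = spread_cents / (voices - 1)
    offsets = [-spread_cents / 2 + i * step for i in range(voices)]
    # 1200 cents per octave, so a cent offset c scales frequency by 2^(c/1200).
    return [base_hz * 2 ** (c / 1200) for c in offsets]


def spread_in_cents(freqs: list[float]) -> float:
    """Distance in cents between the lowest and highest voice."""
    return 1200 * math.log2(max(freqs) / min(freqs))


# 7 voices around A4 with a 25-cent total spread.
stack = unison_frequencies(440.0, voices=7, spread_cents=25.0)
```

With an odd voice count the middle voice sits exactly on the played pitch, which is one reason odd counts tend to feel more centered.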

Layer 3 — The Noise/Air Component

A surprisingly critical layer that most listeners never consciously hear: a narrow-band noise burst, or high-passed white noise run through a band-pass filter centered somewhere between 8 and 16 kHz. This adds perceived brightness and "air" that makes the lead cut through a loud festival mix.

In Serum (the modern standard), this is typically the noise oscillator, sometimes blended with a sawtooth, run through a band-pass filter, with the cutoff modulated by an LFO synced to a quarter note to create a subtle shimmer.

Layer 4 — The Pluck/Attack Transient

The attack transient of a sound carries much of what makes a note feel distinct and locatable in time. To make the melody of a progressive house lead feel defined and punchy (especially on compressed festival sound systems), producers add a short pluck layer on top:

A triangle or sine wave oscillator with very fast decay (~80–150ms)
Sometimes routed through a short reverb to smear the transient slightly
Pitched exactly to the melody, often one or two octaves up

This layer fires on every note but has almost no sustain — it exists purely to telegraph the pitch at the moment of each note onset.

Layer 5 — The Pitch-Shifted Harmony

The most ear-catching component of the Alesso/Garrix formula is the presence of a parallel harmony layer — usually a fifth or a minor third above the melody, running at ~10–15 dB lower volume. This is what creates the "massive" quality of the sound without requiring the main layer to be louder.

Parallel fifths are a production shortcut: they are acoustically ambiguous (a fifth does not imply a major or minor third), which means the harmony layer does not fight the chord progression underneath.
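The harmony layer is simple enough to express as arithmetic: transpose every melody note by a fixed interval and scale the level down. A small sketch under those assumptions; the function name and the three-note motif are hypothetical, and notes are given as MIDI note numbers.

```python
def parallel_harmony(melody_midi: list[int], interval_semitones: int = 7,
                     level_db: float = -12.0) -> tuple[list[int], float]:
    """Transpose a melody by a fixed interval and return a linear gain.

    interval_semitones=7 is a perfect fifth; 3 would be a minor third.
    """
    harmony = [note + interval_semitones for note in melody_midi]
    gain = 10 ** (level_db / 20)  # dB to linear amplitude
    return harmony, gain


# Hypothetical F minor motif (F4, Ab4, C5) as MIDI note numbers.
harmony_notes, harmony_gain = parallel_harmony([65, 68, 72])
```

At -12 dB the harmony sits at roughly a quarter of the main layer's amplitude, which is why it reads as size rather than as a second melody.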

The Sidechain — The Pulse That Makes It All Breathe

Every layer above is run through a sidechain compressor keyed to the kick drum. The result is the characteristic pumping/breathing quality of all progressive house — the lead "ducks" on every kick hit, which creates the illusion of rhythmic space even when the mix is fully saturated.

The sidechain parameters in these tracks are typically:

Attack: 1–5ms (fast enough to catch the kick transient)
Release: 100–250ms (controls the "pump" shape and feel)
Ratio: 4:1 to 8:1
Threshold: Set so the lead ducks roughly 6–9 dB on each kick
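The shape these parameters describe can be sketched as a gain curve over time. This is a deliberately simplified model, assuming a linear ramp down during the attack and a linear ramp back up during the release; real compressors follow smoother curves, and `duck_gain_db` is a hypothetical helper, not any plugin's algorithm.

```python
def duck_gain_db(t_ms: float, depth_db: float = 7.0,
                 attack_ms: float = 2.0, release_ms: float = 150.0) -> float:
    """Gain change in dB (zero or negative) at t_ms after a kick hit.

    Simplified model: linear ramp down to -depth_db during the attack,
    then a linear ramp back to 0 dB during the release.
    """
    if t_ms < 0:
        return 0.0
    if t_ms < attack_ms:
        return -depth_db * (t_ms / attack_ms)
    recovered = (t_ms - attack_ms) / release_ms
    return -depth_db * max(0.0, 1.0 - recovered)


# Sample the curve every 25 ms over the first 200 ms after the kick.
curve = [(t, duck_gain_db(t)) for t in range(0, 201, 25)]
```

Plotting `curve` for a few release values is a quick way to see how the release time alone changes the feel of the pump.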

This combination — layered supersaws, harmonic transient clarity, sidechain breathing — is what listeners feel as much as they hear.

🔬 Why This Is Hard to Replicate with AI

Now let's put on the AI engineering hat.

The Problem Is Not Generation — It's Architecture

Current AI music systems, whether commercial services like Suno and Udio or Meta's open-source AudioCraft, can generate music that sounds broadly "progressive house." They have clearly ingested enough of the genre to mimic its surface texture. But there is a fundamental difference between sounding like progressive house and being architecturally accurate in the lead layering.

The challenge is hierarchical:

Global coherence: Does the track have a coherent structure (intro, build, drop, break, drop)?
Drop energy accuracy: Is the drop sonically heavier than the build? By how much?
Layer interdependence: Does the sidechain pumping match the kick pattern? Does the sub-layer octave relationship to the main layer hold across the melody?
Spectral occupancy: Does each layer occupy its intended frequency band without masking another layer?
Temporal micro-dynamics: Is the attack transient of the pluck layer phase-aligned with the main layer attack?

Current auto-regressive audio models generate audio token by token, one short codec frame after another, which gives them a limited "look-ahead" for global musical structure. (Diffusion models work differently, refining a whole segment at once rather than stepping through it frame by frame.) A model generating audio at 44.1 kHz with 50ms audio frames is making ~20 decisions per second, and the relationships between those decisions at the 32-bar scale are difficult to encode in the model's context window.
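To put the 32-bar scale in numbers, here is a small sketch of the arithmetic. It assumes 4/4 time and takes the 50ms frame size from the paragraph above as given; actual codecs use their own frame rates, and the function names are hypothetical.

```python
def decisions_per_second(frame_ms: float) -> float:
    """How many frames a model emits per second of audio."""
    return 1000.0 / frame_ms


def frames_in_section(bars: int, bpm: float, frame_ms: float,
                      beats_per_bar: int = 4) -> int:
    """Number of frames needed to cover a section of the arrangement."""
    seconds = bars * beats_per_bar * 60.0 / bpm
    return round(seconds * 1000.0 / frame_ms)


# A 32-bar phrase at 128 BPM lasts exactly 60 seconds.
frames_32_bars = frames_in_section(bars=32, bpm=128, frame_ms=50)
```

At these settings, a relationship between the start and end of a 32-bar phrase has to survive across 1,200 consecutive frames.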

The Annotation Problem

Another fundamental challenge: the training data is not annotated at the layer level.

A model trained on finished audio files (even high-quality ones) cannot directly observe the production structure underneath. It sees the composite waveform — not the five layers that compose it. Without layer-level supervision, the model cannot learn "this is a supersaw core layer" versus "this is the pluck transient layer."

This is analogous to training a language model only on printed books, with no sentence-level parsing, and asking it to understand dependency grammar. It might learn statistical correlations that capture some of the structure, but it cannot explicitly represent the underlying architecture.

Fine-tuned models with stem separation pre-processing (using tools like Demucs or Spleeter on training data) could partially address this — but that is a significant data engineering effort, and most commercial systems have not gone that route for this specific genre.

🧠 Prompt Engineering as a Compensator

This is the crux of my investigation: can we compensate for the lack of architectural understanding through carefully structured prompts?

My hypothesis was: yes, but only if the prompt mirrors the internal structure of the production decision tree — i.e., if we describe the music the way a producer thinks about it, not the way a listener experiences it.

Approach 1: The Naive Prompt (Baseline)

Generate a progressive house track with a massive lead synth 
drop like Alesso or Martin Garrix.

Result: The models I tested (via API, Suno v4 and Udio's latest model) produced tracks that were tonally recognizable as progressive house but lacked:

Clean spectral separation between the layers
Accurate sidechain depth (the pumping was either absent or over-exaggerated)
Harmonic lead precision — the "melody" had a synthesizer quality but without the layered dimensionality

The energy was there. The architecture was not.

Approach 2: The Layer Stack Prompt

I restructured the prompt to mirror the production architecture:

Generate a progressive house track at 128 BPM in the key of F minor.

LEAD ARCHITECTURE (apply these as simultaneous layers):
1. Foundation layer: Heavily filtered sawtooth, 1 octave below melody, 
   LP filter at 2.5kHz, minimal resonance, lower in volume than main layer
2. Supersaw core: 7-voice unison sawtooth, 25 cents total detune spread, 
   LP filter at 6kHz, full sustain envelope
3. Pluck transient: Short-decay triangle wave on every melody note, 
   80ms decay, adds pitch definition
4. Parallel harmony: A perfect fifth above the melody at -12dB, 
   same timbre as the supersaw core
5. Air layer: High-passed noise or bright shimmer above 8kHz

SIDECHAIN: All lead layers sidechain-compressed by kick, 
7dB of ducking, 150ms release time.

ARRANGEMENT: 16-bar build with filter sweep, then 16-bar drop. 
The drop should feel like a physical impact.

Result: Noticeably better. The layering complexity improved, the sidechain behavior was more accurate, and the drop energy differential was greater. However, the phase relationships between layers were still imprecise — the pluck transient was not consistently co-triggered with the main layer note-ons.
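One way to keep a layer-stack prompt consistent between iterations is to generate it from data instead of editing the text by hand. A minimal sketch, where `Layer` and `build_layer_stack_prompt` are hypothetical names for illustration; how the resulting text is submitted to a particular service is out of scope here.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    description: str


def build_layer_stack_prompt(bpm: int, key: str, layers: list[Layer],
                             duck_db: float, release_ms: int) -> str:
    """Render a layer-stack prompt in the same shape as Approach 2."""
    lines = [
        f"Generate a progressive house track at {bpm} BPM in the key of {key}.",
        "",
        "LEAD ARCHITECTURE (apply these as simultaneous layers):",
    ]
    for index, layer in enumerate(layers, start=1):
        lines.append(f"{index}. {layer.name}: {layer.description}")
    lines += [
        "",
        "SIDECHAIN: All lead layers sidechain-compressed by kick, "
        f"{duck_db:g}dB of ducking, {release_ms}ms release time.",
    ]
    return "\n".join(lines)


prompt = build_layer_stack_prompt(
    bpm=128,
    key="F minor",
    layers=[
        Layer("Foundation layer", "Heavily filtered sawtooth, 1 octave below melody"),
        Layer("Supersaw core", "7-voice unison sawtooth, 25 cents total detune spread"),
        Layer("Pluck transient", "Short-decay triangle wave on every melody note"),
    ],
    duck_db=7,
    release_ms=150,
)
```

Changing one parameter and regenerating the text makes it easier to attribute a difference in the output to a single change in the prompt.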

Approach 3: Chain-of-Thought Musical Reasoning

Inspired by chain-of-thought prompting techniques from NLP, I added an explicit "reasoning step" before the generation instruction:

Before generating the audio, describe to yourself the following:
1. What frequency range does each lead layer occupy?
2. At what BPM subdivisions does the sidechain compress?
3. How does the energy envelope of the build differ from the drop?
4. Which layer carries the melodic information for a listener on 
   a festival system with limited high-frequency response?

Now, using that internal model, generate a progressive house track where:
[Layer Stack Prompt from Approach 2]

This approach produced the most architecturally coherent outputs. The model's "internal model" (even if simulated through the reasoning step) appeared to better constrain the downstream generation.

Key insight: Chain-of-thought prompting works for music generation for the same reason it works for mathematics — it forces the model to decompose the problem before executing on it. Musical architecture is a hierarchical constraint satisfaction problem, and CoT creates an approximate solving path.

Approach 4: Reference-Anchored Prompting

The fourth technique I explored was anchoring the prompt to specific, well-known tracks as perceptual references:

Generate a progressive house track that matches the drop architecture of 
"Calling (Lose My Mind)" by Sebastian Ingrosso and Alesso — specifically:
- The same density of the supersaw texture (7-8 voice unison)
- The same sidechain pumping character (moderate pump, not aggressive)
- The same high-frequency air and brightness
- The same octave separation between the bass layer and the lead

The new track should be in F# minor at 126 BPM, original melody.

Result: This approach produced the highest stylistic accuracy, because the model could draw on its implicit knowledge of those tracks as reference points. However, it is also the highest-risk approach from a copyright/training data perspective — the model may be more directly interpolating from the source material rather than generalizing the structural principles.

📐 Technical Deep Dive: Spectral Layer Mapping

Let me go even deeper for the producers reading this.

Frequency Band Allocation

In a well-constructed progressive house lead stack, each layer occupies a distinct region of the spectrum:

Layer	Primary Frequency Range	Role
Sub-bass kick	40–80 Hz	Rhythmic foundation
Foundation saw	80–500 Hz	Body and weight
Supersaw core	200–8,000 Hz	Central texture
Pluck transient	1,000–5,000 Hz	Pitch definition
Harmony fifth	300–6,000 Hz	Width and space
Air/noise	8,000–18,000 Hz	Brightness and air

The critical observation here is that the supersaw core spans the widest range — intentionally, because it is the carrier for the melody. But its filter is set to allow the pluck transient to exist in the upper mid-range without being masked.

This is why simply generating a "loud synth" fails: it is not about volume, it is about spectral architecture — each layer having a defined, non-masking frequency zone.
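The primary ranges in the table are not disjoint: the supersaw core deliberately spans most of the spectrum, so "non-masking" is about where each layer dominates, not about hard boundaries. A quick sketch for listing which lead layers share spectrum, using the table's values (the kick row is left out, and the helper names are hypothetical):

```python
# Primary ranges in Hz for the lead layers, copied from the table above.
BANDS_HZ = {
    "foundation_saw": (80, 500),
    "supersaw_core": (200, 8000),
    "pluck_transient": (1000, 5000),
    "harmony_fifth": (300, 6000),
    "air_noise": (8000, 18000),
}


def overlap_hz(a: tuple[int, int], b: tuple[int, int]):
    """Return the shared (low, high) range of two bands, or None."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    return (low, high) if low < high else None


def overlapping_pairs(bands: dict[str, tuple[int, int]]) -> dict:
    """Map each pair of layers that share spectrum to the shared range."""
    names = list(bands)
    shared = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            common = overlap_hz(bands[first], bands[second])
            if common is not None:
                shared[(first, second)] = common
    return shared


pairs = overlapping_pairs(BANDS_HZ)
```

The overlapping pairs are the places where level, filtering, and envelope choices have to do the separating that frequency alone does not.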

Stereo Field Distribution

The stereo image of a progressive house drop is not random:

Mono: Sub-bass, kick, bass line, lead melody (center-weighted)
Wide stereo: Supersaw unison spread, reverb tails, delay repeats
Mid-side: The core melody stays in the mid (mono-compatible), while the width comes from the unison detune

This mono-compatibility is critical for festival PA systems, where the sub-bass is almost always mono (single subwoofer array on the center axis). A producer who places the low-frequency lead content in the sides risks losing it entirely at most festivals.

When I described this mono/stereo allocation explicitly in my prompts, the generated audio showed meaningfully better translation to different listening environments.
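The mid/side idea behind mono compatibility can be shown with plain sample arithmetic. In this sketch, mid is the average of left and right and side is half their difference, which is the standard definition; the function names and the two-sample signal are hypothetical.

```python
def to_mid_side(left: list[float], right: list[float]):
    """Split a stereo signal into mid (sum) and side (difference) parts."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def from_mid_side(mid: list[float], side: list[float]):
    """Rebuild left and right channels from mid and side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# First sample is side-only (opposite polarity), second is mid-only.
left_in, right_in = [0.5, 0.25], [-0.5, 0.25]
mid, side = to_mid_side(left_in, right_in)
# A mono fold-down keeps only `mid`, so the side-only sample vanishes.
```

Anything that lives only in `side` disappears when the signal is summed to mono, which is why the core melody belongs in the mid channel.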

Envelope Timing Mathematics

The sidechain "pump" in progressive house is not just an aesthetic choice — it is a tempo-synchronized event. At 128 BPM:

Quarter note = 468.75ms
The sidechain release should complete within ~70-80% of the quarter note: ~330–375ms
This ensures the lead is fully restored before the next kick hit

Setting the release shorter (< 200ms) creates an aggressive "EDM" pump. Setting it longer (> 400ms) causes the lead to still be ducked when the next kick fires — resulting in a muddy, low-energy perception even at high loudness.
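The tempo arithmetic above is easy to recompute for any BPM. A minimal sketch, assuming the 70–80% recovery window from this section; the function names are hypothetical.

```python
def quarter_note_ms(bpm: float) -> float:
    """Duration of one beat in milliseconds."""
    return 60_000.0 / bpm


def recovery_window_ms(bpm: float, low_fraction: float = 0.70,
                       high_fraction: float = 0.80) -> tuple[float, float]:
    """Window in which the lead should be fully restored after a kick."""
    beat = quarter_note_ms(bpm)
    return beat * low_fraction, beat * high_fraction


beat_128 = quarter_note_ms(128)
window_128 = recovery_window_ms(128)
```

Because the window scales with the beat length, a value tuned at 128 BPM needs to stretch slightly when the track drops to 126 BPM.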

This mathematical precision is exactly the kind of relationship that is difficult for auto-regressive audio models to maintain globally across a 4-minute track. A diffusion model generating the entire track holistically might handle this better, but current diffusion models still lag behind auto-regressive approaches in audio quality for complex, structured music.

🔧 My Prompt Engineering Framework for Layer Accuracy

After many iterations, here is the framework I found most effective for guiding commercial AI music generation toward progressive house lead accuracy. I call it the SALT framework (Spectral, Attack, Layer, Timing):

S — Spectral Description

Always specify frequency ranges explicitly for each element. Do not leave them implicit. Models have been trained on text that describes music emotionally ("big," "massive," "energetic") rather than spectrally. Compensate by adding the technical language.

The lead should have:
- Deep, filtered weight in the 200-500Hz range
- Core harmonic density from 500Hz to 5kHz  
- Brilliant, airy extension above 8kHz
The low-end below 200Hz should be clean and reserved for kick and bass.

A — Attack Character

Describe the perceptual onset of every main element. "Attack" is not just a synthesizer ADSR parameter — it is how the listener experiences the beginning of each musical event.

Each note of the melody should begin with a sharp, bright transient 
(as if plucked), followed immediately by a sustained supersaw texture. 
The first 50-100ms of each note should be brighter and punchier than 
the sustained portion.

L — Layer Interdependence

Explicitly state which layers should react to which other layers. The sidechain relationship is the most critical.

Every lead layer should duck by approximately 7-8dB every time the 
kick drum hits. The recovery should be smooth and complete within 
350ms. This creates a rhythmic "breathing" quality.

T — Timing Anchors

Give explicit BPM-synchronized timing references for all dynamic events.

BPM: 128. 
The build begins at bar 1 and reaches its filter-sweep apex at bar 14.
The drop begins at bar 17 with full supersaw density.
The break begins at bar 33 and strips back to bass and kick only.

When all four components are present in a single prompt, I consistently observed:

Better spectral separation between elements
More accurate sidechain timing
More convincing drop/build contrast

🎨 Layer Timbral Vocabulary: The LTV Method

The SALT framework handles structure — where things sit in the mix, when they happen, and how they relate to each other. But there is a second dimension that is equally important and almost always missing from AI music prompts: timbre.

Timbre is the characteristic "color" or "texture" of a sound that persists regardless of pitch or volume. It is what makes a supersaw sound like a supersaw and not a piano. And it is almost impossible to describe accurately without a dedicated vocabulary.

I developed the Layer Timbral Vocabulary (LTV) method specifically to address this gap. The core principle is: for each layer in the stack, write a timbral descriptor block using a standardized schema. The schema has five fields:

LAYER NAME:
  texture: [1–3 adjectives describing the physical sensation of the sound]
  brightness: [where on the spectrum the sound "lives" — dark / warm / neutral / bright / brilliant]
  movement: [how the sound changes over time — static / breathing / evolving / aggressive]
  density: [how "thick" or "sparse" the sound is — sparse / moderate / dense / massive]
  edge: [how defined the attack and note boundaries are — soft / medium / sharp / percussive]
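Because four of the five fields take a fixed set of values, the schema lends itself to a small validated data structure. A sketch under that assumption; the `LayerDescriptor` class is a hypothetical helper for illustration, and the example values are the supersaw descriptors used later in this article.

```python
from dataclasses import dataclass

BRIGHTNESS = ("dark", "warm", "neutral", "bright", "brilliant")
MOVEMENT = ("static", "breathing", "evolving", "aggressive")
DENSITY = ("sparse", "moderate", "dense", "massive")
EDGE = ("soft", "medium", "sharp", "percussive")


@dataclass
class LayerDescriptor:
    name: str
    texture: list[str]
    brightness: str
    movement: str
    density: str
    edge: str

    def __post_init__(self) -> None:
        # The schema allows one to three texture adjectives.
        if not 1 <= len(self.texture) <= 3:
            raise ValueError("texture takes 1-3 adjectives")
        checks = (("brightness", BRIGHTNESS), ("movement", MOVEMENT),
                  ("density", DENSITY), ("edge", EDGE))
        for attr, allowed in checks:
            value = getattr(self, attr)
            if value not in allowed:
                raise ValueError(f"{attr}={value!r} is not one of {allowed}")

    def render(self) -> str:
        """Format the descriptor as a prompt-ready text block."""
        return "\n".join([
            f"LAYER: {self.name}",
            f"  texture: {', '.join(self.texture)}",
            f"  brightness: {self.brightness}",
            f"  movement: {self.movement}",
            f"  density: {self.density}",
            f"  edge: {self.edge}",
        ])


supersaw = LayerDescriptor(
    name="supersaw_core",
    texture=["rich", "glassy", "chest-filling"],
    brightness="bright",
    movement="breathing",
    density="massive",
    edge="sharp",
)
block = supersaw.render()
```

Validating up front catches typos such as "brillant" before they reach a prompt, where they would fail silently.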

Why These Five Dimensions?

They map directly to the psychoacoustic parameters a listener processes in the first ~200ms of hearing a sound:

Texture → spectral complexity (harmonic content, noise component, modulation)
Brightness → spectral centroid (where the dominant energy lives)
Movement → temporal modulation (LFOs, filters, tremolo, vibrato)
Density → unison count, reverb wash, polyphonic weight
Edge → envelope attack slope and transient sharpness

These are not DAW parameters. They are perceptual primitives that a language model can map to synthesis decisions, because they match the vocabulary of the training data it has consumed — synthesizer reviews, production tutorials, sound design guides.

LTV Schema Applied to the "Calling" Drop

Here is the LTV descriptor block for the lead stack in "Calling (Lose My Mind)" as I would write it in a prompt:

LAYER: foundation_saw
  texture: warm, woody, slightly hollow
  brightness: dark
  movement: static (no modulation)
  density: sparse
  edge: medium (notes have slight softness at onset)

LAYER: supersaw_core
  texture: rich, glassy, chest-filling
  brightness: bright
  movement: breathing (sidechain pumping creates slow rhythmic swell)
  density: massive (7–8 detuned voices)
  edge: sharp (instant onset, full sustain)

LAYER: pluck_transient
  texture: crisp, dry, percussive
  brightness: brilliant (mostly upper midrange energy)
  movement: static (decays immediately, no sustain modulation)
  density: sparse
  edge: percussive (attack defines the entire sound)

LAYER: parallel_harmony
  texture: glassy, smooth, slightly ghostly
  brightness: bright
  movement: breathing (same sidechain as supersaw_core)
  density: moderate
  edge: sharp

LAYER: air_layer
  texture: airy, silky, weightless
  brightness: brilliant (exclusively 8kHz and above)
  movement: evolving (subtle shimmer LFO at quarter-note rate)
  density: sparse
  edge: soft (no defined transient, continuous presence)

Combining LTV with SALT: The Full Descriptor Prompt

When you merge the SALT architectural framework with LTV timbral descriptors, you get the most complete prompt structure I have found for this genre. Here is a full worked example targeting a Sebastian Ingrosso & Alesso "Calling (Lose My Mind)"-style drop, with the cold, melodic, euphoric progressive house sound that also defines tracks like Axwell /\ Ingrosso's "Sun Is Shining" and Swedish House Mafia's "Leave the World Behind":

TRACK PARAMETERS:
  BPM: 126
  Key: A minor
  Style: Melodic progressive house (Sebastian Ingrosso & Alesso / Axwell style,
         circa 2012–2013) — cold, euphoric, emotionally driven

SALT FRAMEWORK:
  Spectral: Sub below 200Hz is clean and reserved for kick+bass.
            Lead occupies 200Hz–16kHz, distributed across layers as below.
  Attack: Every melody note starts with a defined, bright transient
          (pluck character), immediately followed by warm, sustained supersaw texture.
          The melody is the emotional centrepiece — it must be clear and singable.
  Layer: All lead layers sidechained to kick, 7dB ducking, 350ms release
         (moderate pump — the lead recovers fully before the next kick,
         preserving melodic continuity and emotional arc).
  Timing: Build bars 1–15, filter sweeps from closed to fully open,
          emotional tension building through a rising melodic motif.
          Drop bar 17, full layer density, no filter, melody centred and dominant.
          Break bar 33, strip back to a sparse melodic element (piano or arp)
          and kick only — maximum emotional contrast before the second drop.

LAYER TIMBRAL VOCABULARY:
  foundation_saw:
    texture: warm, woody, slightly hollow
    brightness: dark
    movement: static
    density: sparse
    edge: medium

  supersaw_core:
    texture: rich, glassy, chest-filling
    brightness: bright
    movement: breathing (sidechain)
    density: massive
    edge: sharp

  pluck_transient:
    texture: crisp, dry, percussive
    brightness: brilliant
    movement: static
    density: sparse
    edge: percussive

  harmony_fifth:
    texture: glassy, smooth, slightly ghostly
    brightness: bright
    movement: breathing (sidechain)
    density: moderate
    edge: sharp

  noise_air:
    texture: airy, silky, weightless
    brightness: brilliant
    movement: evolving (quarter-note shimmer)
    density: sparse
    edge: soft

ARRANGEMENT INSTRUCTION:
  The drop should feel euphoric and emotionally lifted — spacious and
  longing rather than just physically heavy. The glassy supersaw texture
  and the high-frequency air layer should create a sense of openness that
  pulls the listener upward. The melody must be the emotional centrepiece:
  clear, singable, and dominant in the mix at all times. The moderate
  sidechain pump (7dB, 350ms) preserves melodic continuity across kick
  hits, keeping the emotional arc flowing throughout the entire drop section.

Timbral Vocabulary Reference Table

To make LTV reusable across any prompt, here is a reference table of tested descriptor values and what synthesis parameter each maps to:

Dimension	Descriptor	Approximate Synthesis Meaning
texture	warm	Strong even harmonics, low spectral centroid, no harsh peaks
texture	glassy	Clean upper harmonics, minimal noise, thin modulation
texture	gritty	Odd harmonics dominant, slight overdrive/saturation
texture	silky	Smooth filter rolloff, no resonance peak
texture	metallic	FM-like inharmonic partials, bell-like ring
texture	woolly	Heavy low-pass filtering, slightly muddy, warm
brightness	dark	Spectral centroid below 1kHz
brightness	warm	Centroid 1–3kHz, no harsh peaks
brightness	neutral	Centroid 3–5kHz, balanced spectrum
brightness	bright	Centroid 5–8kHz, presence peak
brightness	brilliant	Significant energy above 8kHz, airy top-end
movement	static	No LFO or modulation on main tonal parameters
movement	breathing	Rhythmically synced amplitude modulation (sidechain)
movement	evolving	Slow (>1 beat) filter or pitch drift
movement	aggressive	Fast, sharp modulation (tremolo, hard filter LFO)
density	sparse	1–2 oscillator voices, minimal reverb
density	moderate	3–4 unison voices or medium reverb wash
density	dense	5–6 unison voices or long reverb tail
density	massive	7–8+ unison voices, wide stereo field
edge	soft	Slow attack (>20ms), notes blend together
edge	medium	Attack 5–20ms, notes distinct but smooth
edge	sharp	Attack <5ms, very defined note boundaries
edge	percussive	Attack <2ms, sound is mostly transient
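If you already know a patch's settings, the table can be read in the other direction to pick descriptors. A rough sketch using the thresholds above; the function names are hypothetical, the density mapping ignores the reverb half of each row, and treating "brilliant" as a centroid above 8kHz is an approximation of "significant energy above 8kHz".

```python
def edge_from_attack_ms(attack_ms: float) -> str:
    """Pick an edge descriptor from an envelope attack time."""
    if attack_ms < 2:
        return "percussive"
    if attack_ms < 5:
        return "sharp"
    if attack_ms <= 20:
        return "medium"
    return "soft"


def density_from_voices(voices: int) -> str:
    """Pick a density descriptor from a unison voice count."""
    if voices <= 2:
        return "sparse"
    if voices <= 4:
        return "moderate"
    if voices <= 6:
        return "dense"
    return "massive"


def brightness_from_centroid_hz(centroid_hz: float) -> str:
    """Pick a brightness descriptor from a spectral centroid estimate."""
    if centroid_hz < 1000:
        return "dark"
    if centroid_hz < 3000:
        return "warm"
    if centroid_hz < 5000:
        return "neutral"
    if centroid_hz <= 8000:
        return "bright"
    return "brilliant"


# A 7-voice patch with a 1 ms attack and a centroid around 6 kHz.
supersaw_words = (density_from_voices(7), edge_from_attack_ms(1.0),
                  brightness_from_centroid_hz(6000))
```

The thresholds are perceptual rules of thumb, so treat the output as a starting vocabulary to adjust by ear.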

Why This Works

The LTV method works because it translates producer intent into a language that sits at the intersection of music journalism, synthesis documentation, and perceptual audio description — all of which are heavily represented in a large language model's training data.

When a model reads "texture: rich, glassy, chest-filling / brightness: bright / density: massive", it can draw on thousands of Serum preset descriptions, synthesizer magazines, and sound design tutorials that use exactly this vocabulary to describe supersaw patches. The model has the knowledge. LTV is the key that unlocks it.

Compare this to a model reading "make a big synth sound" — it has no anchor in the space of synthesis decisions. LTV gives it coordinates.

🌡️ Honest Assessment: Where AI Still Falls Short

Even with the SALT framework, there are areas where commercial models consistently fail:

1. Phase Coherence Across Layers

When you layer sounds in a DAW, you can phase-align them precisely. A model generating audio holistically cannot guarantee that Layer 1 and Layer 5 are phase-coherent at the note onsets. This creates subtle comb-filtering artifacts in the composite sound — particularly audible on headphones, where stereo imaging is most precise.
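For intuition about where comb-filtering lands, consider the simplest case: the same signal summed with a copy of itself delayed by a fixed time. Cancellation then occurs at odd multiples of 1 / (2 × delay). The sketch below assumes that idealized identical-signal case, which real layers with different timbres only approximate; the function name is hypothetical.

```python
def comb_notch_frequencies(delay_ms: float, count: int = 3) -> list[float]:
    """First `count` cancellation frequencies (Hz) for a signal summed
    with an equal-level copy of itself delayed by delay_ms.
    """
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    # Notches fall at odd multiples of 1 / (2 * delay).
    return [(2 * k + 1) * 1000.0 / (2.0 * delay_ms) for k in range(count)]


# A half-millisecond misalignment between two copies of a layer.
notches = comb_notch_frequencies(0.5)
```

Even a misalignment of half a millisecond puts the first notch at 1 kHz, squarely in the range where the lead carries its melody.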

2. Sidechain Consistency Over Time

The sidechain pump should be completely consistent from bar 1 to bar 128. Human-produced tracks achieve this through routing: one compressor, keyed to the kick, applied globally to the lead bus. AI-generated tracks show variable pump depth across the timeline, particularly when the model "re-imagines" the arrangement partway through.

3. Global Harmonic Locking

In "Heroes (We Could Be)" and "Calling," the harmony of the lead stack is locked to the chord progression underneath for the entire drop section. Every chord change triggers a global re-voicing of every layer simultaneously. This is a compositional constraint that spans tens of seconds, far beyond the temporal resolution at which current audio models make moment-to-moment decisions.

4. Sub-Bass / Lead Relationship

The sub-bass and lead melody must play the same note (in different octaves) at all times. This is a hard constraint. Deviations cause the low-end to "fight" the lead, creating a muddy, undefined bottom. Models trained on multi-instrument audio sometimes allow these to diverge, because statistically it is more common for bass and melody to be harmonically related but not identical.
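If you have note data for both parts (for example from a MIDI transcription of the output), the constraint described here reduces to a pitch-class comparison. A minimal sketch, assuming both parts are given as aligned lists of MIDI note numbers; the function name and the example notes are hypothetical.

```python
def octave_locked(bass_midi: list[int], lead_midi: list[int]) -> list[bool]:
    """For each step, True when bass and lead share a pitch class,
    i.e. they are the same note in different octaves.
    """
    return [(lead - bass) % 12 == 0
            for bass, lead in zip(bass_midi, lead_midi)]


# Step 1: F1 under F3 (locked). Step 2: Ab1 under G3 (diverged).
locked = octave_locked([29, 32], [53, 55])
```

Any `False` in the result marks a step where the low end and the lead have drifted apart.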

🧪 Comparative Results

Here is a structured comparison of outputs across the approaches:

Technique	Sidechain Accuracy	Spectral Separation	Drop Impact	Layer Complexity	Timbral Accuracy
Naive prompt	Low	Poor	Medium	Low	Poor
Layer Stack Prompt	Medium	Good	High	Medium	Medium
Chain-of-Thought	High	Good	High	High	Medium
Reference-Anchored	High	Very Good	Very High	High	Good
SALT Framework	High	Very Good	Very High	High	Good
SALT + LTV	High	Very Good	Very High	Very High	Very Good

The key takeaway: the more the prompt mirrors the internal production architecture, the more accurately the model generates that architecture. This is not surprising in hindsight — it aligns with the general principle that models perform best when the prompt's format matches the format of relevant training data. A music production tutorial or DAW template would describe a track in exactly the language of the SALT framework.

💡 Bigger Implications: What This Tells Us About AI Music

This investigation convinced me of something broader: the bottleneck in AI music quality is not model capacity, it is prompt vocabulary.

Most users of AI music tools describe music the way listeners describe it — "make it sound like [artist]" or "something energetic for the drop." This listener-vocabulary prompt produces listener-quality output: it sounds approximately right, but lacks the architectural precision of a produced track.

When you switch to producer-vocabulary prompting — describing spectral occupancy, envelope timing, layer interdependence — the output quality improves dramatically. The underlying model has the capacity. It just needs to be activated through the right framing.

This is not fundamentally different from how the same phenomenon plays out in code generation:

Listener equivalent: "make a login function"
Producer equivalent: "implement JWT authentication with RS256 signing, 15-minute access token expiry, sliding refresh tokens persisted in Redis with namespace isolation, and PKCE flow for public clients"

In both cases, the precision of the vocabulary in the prompt correlates directly with the precision of the output.

The Fine-Tuning Question

Could a specialized fine-tuned model do this better than prompt engineering alone? Almost certainly yes — and here I am specifically talking about fine-tuning open-source music generation models, not proprietary text-based LLMs.

The most relevant ecosystem right now is Meta's AudioCraft, which includes:

MusicGen — a transformer-based conditional music generation model that can be fine-tuned on your own audio dataset
AudioGen — focused on environmental and sound-effect generation
EnCodec — the neural audio codec underpinning both, which compresses audio into discrete tokens that the transformer operates on

Fine-tuning MusicGen on a curated progressive house dataset — stems, loops, or full tracks annotated with genre-specific metadata — would give you a model that natively understands the spectral vocabulary of the genre. Instead of prompting a general-purpose model to approximate layering logic, you would be sampling from a distribution that was explicitly trained on it.

The tradeoff is real though: you need a labeled audio dataset at the production-layer level (not just raw tracks), GPU infrastructure to run the fine-tuning job, and ongoing retraining as production styles evolve. AudioCraft's training scripts are open and well-documented, but the data curation is the hard part. Getting clean, properly tagged stems — kick, bass, lead layers separated and annotated — is a significant effort that most solo producers cannot sustain.

That said, for anyone building a music-generation product or a studio-grade AI tool, fine-tuning an open-source model like MusicGen is the right path. The SALT framework for prompt engineering remains the practical starting point for creative exploration, closing a meaningful quality gap without any model infrastructure — but a domain-adapted AudioCraft model is where the production ceiling sits.

🎯 Practical Takeaways

If you're a music producer exploring AI tools, or an AI engineer building music generation systems:

Describe architecture, not feeling. Replace "massive drop" with specific layer specifications.
Specify frequency bands explicitly. Models do not default to spectral precision; you have to ask for it.
Use BPM-synchronized timing. State exactly when events happen in bar numbers or milliseconds.
Declare sidechain relationships explicitly. It is the single most impactful technique for making AI progressive house sound authentic.
Use chain-of-thought for complex arrangement decisions. Ask the model to reason about structure before generating.
Reference real tracks for style anchoring. Providing exact track references activates the model's implicit genre knowledge.
Apply the LTV method for each layer. Describe texture, brightness, movement, density, and edge for every element in the stack. This closes the timbral gap that architecture prompts alone cannot address.

🎶 Closing Thoughts

Progressive house lead layering is, in some ways, the perfect test case for AI music generation. It is technically precise enough to reveal the gaps in model architecture, yet popular enough to have significant training data representation. It requires global structural coherence (the arrangement) and local signal precision (the layer phase relationships) simultaneously — a combination that exposes the weaknesses of current auto-regressive audio systems.

What surprised me most in this investigation was not where the AI failed — it was how close the SALT framework got to bridging the gap. A year ago, I would not have believed that a text prompt could meaningfully guide the spectral architecture of a synthesizer layer stack. Now I think the ceiling of what prompt engineering can achieve in this domain is still being discovered.

The fact that I can sit down at a DAW to produce a progressive house track and simultaneously design prompt frameworks that help AI understand the same production — that is the kind of cross-domain insight that I find genuinely exciting. Two passions, one engineering problem, a lot still to figure out.

The drop hits differently when you understand every layer that builds it. 🔊

Eduardo J. Barrios García is a Software & AI Engineer and electronic music producer (EyeMad). This article reflects personal experimentation and informal findings, not peer-reviewed research.
