
I Tested AI Music Models on Progressive House Lead Layering — Here's What Actually Worked

· 27 min read
Eduardo J. Barrios
AI & Software Engineer · Music Producer (EyeMad)

There is a very specific kind of tension that builds in a progressive house track — the kind that grabs your chest somewhere around bar 28 of the break, when the filter opens halfway and the reverb tail of the lead swells. Producers like Alesso, Martin Garrix, and Axwell /\ Ingrosso have turned that tension into a craft. I have spent years studying it, replicating it, and eventually producing it myself under my alias EyeMad.

But recently, as an AI engineer, I started asking a different question: could an AI music generation model — without fine-tuning, using only prompt engineering — accurately describe, architect, and guide the creation of those layered progressive house lead sounds that define the genre's signature drops?

This article is my attempt to answer that question. It is long, technical, and deeply personal. It lives at the intersection of the two things I love most.


🎵 First: What Are We Actually Talking About?

Before we go deep into AI territory, let me establish what "progressive house lead layering" actually means from a production standpoint. Let's anchor the conversation in real music.

Progressive House Playlist — My Vision for the Genre


🎛️ The Anatomy of a Progressive House Lead Layer Stack

A "lead" in progressive house is not a single synth sound. It is a composite texture assembled from multiple individual layers, each serving a distinct psychoacoustic role. When done right, you cannot perceive the individual components — only the unified, enormous result.

Here is the typical architecture:

Layer 1 — The Foundation Saw (Sub-Octave)

This layer sits one octave below the main melody. It is usually a clean or lightly saturated sawtooth running through a low-pass filter set to around 2–4 kHz with moderate resonance. Its job is to add weight and body without muddying the mid-range. In Sylenth1 (the go-to synth for this era of progressive house), this means:

  • OSC: Sawtooth, 1 voice, no detune
  • Filter: LP24 at ~2.5 kHz, resonance 15–25%
  • Envelope: Fast attack (2ms), sustain at 100%, medium release (~300ms)
  • Volume: Sitting ~6–9 dB below the main layer

Layer 2 — The Supersaw Core

This is the centerpiece. The supersaw is a waveform made of multiple detuned sawtooth oscillators summed together. Pioneered by the Roland JP-8000 in the late 1990s, the supersaw achieved its defining status in trance and progressive house because of the natural chorus effect created by oscillator beating.

Key parameters:

  • Unison voices: 7–8 (odd numbers tend to produce more centered stereo images)
  • Detune amount: 0.15–0.35 semitones (measured in cents: 15–35 cents total spread)
  • Filter cutoff: ~5–8 kHz, LP24 mode
  • Filter resonance: 0–8% (high resonance in the main layer fights the mix)
  • Envelope: Very fast attack (~1ms), full sustain, medium release

The "width" of the supersaw is controlled by the detune spread. Too narrow and it sounds thin; too wide and it loses pitch definition — particularly destructive when live audiences are hearing the song through a festival PA system.
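To put numbers on the detune spread, here is a minimal Python sketch that distributes unison voices evenly across a total width in cents and converts each offset to a frequency. The function name `supersaw_frequencies` and the even spacing are assumptions made for illustration; hardware and software supersaws generally use their own spacing curves.

```python
def supersaw_frequencies(base_hz: float, voices: int = 7,
                         spread_cents: float = 25.0) -> list:
    """Return one frequency per unison voice, spread evenly around base_hz.

    spread_cents is the total width from the lowest to the highest voice.
    Even spacing is an assumption made for illustration only.
    """
    if voices < 1:
        raise ValueError("voices must be at least 1")
    if voices == 1:
        return [base_hz]
    step = spread_cents / (voices - 1)
    offsets = [-spread_cents / 2 + i * step for i in range(voices)]
    # 100 cents = 1 semitone, 1200 cents = 1 octave
    return [base_hz * 2 ** (cents / 1200) for cents in offsets]


if __name__ == "__main__":
    for freq in supersaw_frequencies(440.0):
        print(f"{freq:.2f} Hz")
```

With an odd voice count, the middle voice lands on the written pitch and the remaining voices sit symmetrically around it.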

Layer 3 — The Noise/Air Component

A surprisingly critical layer that most listeners never consciously hear: a narrow-band noise burst, or high-passed white noise run through a band-pass filter centered somewhere between 8 and 16 kHz. This adds the perceived brightness and "air" that make the lead cut through a loud festival mix.

In Serum (the modern standard), this is typically the noise oscillator, sometimes blended with a sawtooth, run through a band-pass filter, with the cutoff modulated by an LFO synced to a quarter note to create a subtle shimmer.

Layer 4 — The Pluck/Attack Transient

Listeners rely heavily on the attack transient to pick out where each note starts and what it is. To make the melody of a progressive house lead feel defined and punchy (especially on compressed festival sound systems), producers add a short pluck layer on top:

  • A triangle or sine wave oscillator with very fast decay (~80–150ms)
  • Sometimes routed through a short reverb to smear the transient slightly
  • Pitched exactly to the melody, often one or two octaves up

This layer fires on every note but has almost no sustain — it exists purely to telegraph the pitch at the moment of each note onset.

Layer 5 — The Pitch-Shifted Harmony

The most ear-catching component of the Alesso/Garrix formula is the presence of a parallel harmony layer — usually a fifth or a minor third above the melody, running at ~10–15 dB lower volume. This is what creates the "massive" quality of the sound without requiring the main layer to be louder.

Parallel fifths are a production shortcut: they are harmonically ambiguous (a bare fifth implies neither a major nor a minor third), which means the harmony layer does not fight the chord progression underneath.
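The interval and level figures for the harmony layer translate to simple ratios. The sketch below assumes twelve-tone equal temperament; the helper names and the example melody note are illustrative and not taken from any particular synth.

```python
def interval_ratio(semitones: float) -> float:
    """Frequency ratio for an interval in twelve-tone equal temperament."""
    return 2 ** (semitones / 12)


def db_to_gain(db: float) -> float:
    """Convert a level change in decibels to a linear amplitude factor."""
    return 10 ** (db / 20)


if __name__ == "__main__":
    melody_hz = 440.0  # A4, an arbitrary example note
    fifth_hz = melody_hz * interval_ratio(7)        # perfect fifth above
    minor_third_hz = melody_hz * interval_ratio(3)  # minor third above
    print(f"fifth: {fifth_hz:.1f} Hz, minor third: {minor_third_hz:.1f} Hz")
    print(f"-12 dB is a linear gain of {db_to_gain(-12):.3f}")
```

A layer sitting at -12 dB has roughly a quarter of the main layer's amplitude, which is why it thickens the sound without competing for headroom.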

The Sidechain — The Pulse That Makes It All Breathe

Every layer above is run through a sidechain compressor keyed to the kick drum. The result is the characteristic pumping/breathing quality of all progressive house — the lead "ducks" on every kick hit, which creates the illusion of rhythmic space even when the mix is fully saturated.

The sidechain parameters in these tracks are typically:

  • Attack: 1–5ms (fast enough to catch the kick transient)
  • Release: 100–250ms (controls the "pump" shape and feel)
  • Ratio: 4:1 to 8:1
  • Threshold: Set so the lead ducks roughly 6–9 dB on each kick
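The ratio and ducking-depth figures are tied together by basic compressor arithmetic. This sketch assumes an idealized hard-knee compressor and ignores attack and release smoothing; the function names are hypothetical.

```python
def gain_reduction_db(overshoot_db: float, ratio: float) -> float:
    """Gain reduction for a key signal that exceeds the threshold by overshoot_db.

    Hard-knee model: above the threshold, every `ratio` dB of input
    produces 1 dB of output.
    """
    if ratio <= 1:
        raise ValueError("ratio must be greater than 1")
    if overshoot_db <= 0:
        return 0.0  # key signal is below the threshold, nothing is ducked
    return overshoot_db * (1 - 1 / ratio)


def overshoot_for_reduction(target_db: float, ratio: float) -> float:
    """How far above the threshold the kick must peak to duck by target_db."""
    if ratio <= 1:
        raise ValueError("ratio must be greater than 1")
    return target_db / (1 - 1 / ratio)


if __name__ == "__main__":
    for ratio in (4.0, 8.0):
        needed = overshoot_for_reduction(7.0, ratio)
        print(f"{ratio:.0f}:1 needs the kick {needed:.1f} dB over the threshold")
```

At 4:1 the kick has to exceed the threshold by about 9.3 dB to produce 7 dB of ducking; at 8:1 it needs 8 dB.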

This combination — layered supersaws, harmonic transient clarity, sidechain breathing — is what listeners feel as much as they hear.


🔬 Why This Is Hard to Replicate with AI

Now let's put on the AI engineering hat.

The Problem Is Not Generation — It's Architecture

Current AI music systems such as Suno, Udio, and Meta's open-source AudioCraft can generate music that sounds broadly "progressive house." They have clearly ingested enough of the genre to mimic its surface texture. But there is a fundamental difference between sounding like progressive house and being architecturally accurate in the lead layering.

The challenge is hierarchical:

  1. Global coherence: Does the track have a coherent structure (intro, build, drop, break, drop)?
  2. Drop energy accuracy: Is the drop sonically heavier than the build? By how much?
  3. Layer interdependence: Does the sidechain pumping match the kick pattern? Does the sub-layer octave relationship to the main layer hold across the melody?
  4. Spectral occupancy: Does each layer occupy its intended frequency band without masking another layer?
  5. Temporal micro-dynamics: Is the attack transient of the pluck layer phase-aligned with the main layer attack?

Current auto-regressive audio models generate audio token by token or frame by frame, which gives them limited "look-ahead" for global musical structure. (Diffusion models work differently: they refine a whole segment at once rather than stepping through it in time.) An auto-regressive model generating 44.1 kHz audio in 50 ms frames is making ~20 decisions per second, and the relationships between those decisions at the 32-bar scale are difficult to hold in the model's context window.
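As a rough sense of scale, the sketch below counts how many frames a model would emit across a build-plus-drop section. It takes the 50 ms frame size from the paragraph above and assumes 4/4 time; real codecs use different frame rates, so treat the numbers as illustrative.

```python
def section_seconds(bars: int, bpm: float, beats_per_bar: int = 4) -> float:
    """Duration of a section in seconds (4/4 time assumed by default)."""
    return bars * beats_per_bar * 60.0 / bpm


def frames_for_bars(bars: int, bpm: float, frame_ms: float = 50.0) -> float:
    """Number of fixed-size frames needed to cover the section."""
    return section_seconds(bars, bpm) * 1000.0 / frame_ms


if __name__ == "__main__":
    print(section_seconds(32, 128))  # length of 32 bars at 128 BPM
    print(frames_for_bars(32, 128))  # frames the model must keep consistent
```

At 128 BPM, 32 bars last exactly 60 seconds, which is 1,200 frames at 50 ms each. Every one of those frames has to agree with the others about where the kick lands.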

The Annotation Problem

Another fundamental challenge: the training data is not annotated at the layer level.

A model trained on finished audio files (even high-quality ones) cannot directly observe the production structure underneath. It sees the composite waveform — not the five layers that compose it. Without layer-level supervision, the model cannot learn "this is a supersaw core layer" versus "this is the pluck transient layer."

This is analogous to training a language model only on printed books, with no sentence-level parsing, and asking it to understand dependency grammar. It might learn statistical correlations that capture some of the structure, but it cannot explicitly represent the underlying architecture.

Fine-tuned models with stem separation pre-processing (using tools like Demucs or Spleeter on training data) could partially address this — but that is a significant data engineering effort, and most commercial systems have not gone that route for this specific genre.


🧠 Prompt Engineering as a Compensator

This is the crux of my investigation: can we compensate for the lack of architectural understanding through carefully structured prompts?

My hypothesis was: yes, but only if the prompt mirrors the internal structure of the production decision tree — i.e., if we describe the music the way a producer thinks about it, not the way a listener experiences it.

Approach 1: The Naive Prompt (Baseline)

Generate a progressive house track with a massive lead synth 
drop like Alesso or Martin Garrix.

Result: The models I tested (Suno v4 and Udio's latest model, both via API) produced tracks that were tonally recognizable as progressive house but lacked:

  • Clean spectral separation between the layers
  • Accurate sidechain depth (the pumping was either absent or over-exaggerated)
  • Harmonic lead precision — the "melody" had a synthesizer quality but without the layered dimensionality

The energy was there. The architecture was not.

Approach 2: The Layer Stack Prompt

I restructured the prompt to mirror the production architecture:

Generate a progressive house track at 128 BPM in the key of F minor.

LEAD ARCHITECTURE (apply these as simultaneous layers):
1. Foundation layer: Heavily filtered sawtooth, 1 octave below melody,
LP filter at 2.5kHz, minimal resonance, lower in volume than main layer
2. Supersaw core: 7-voice unison sawtooth, 25 cents total detune spread,
LP filter at 6kHz, full sustain envelope
3. Pluck transient: Short-decay triangle wave on every melody note,
80ms decay, adds pitch definition
4. Parallel harmony: A perfect fifth above the melody at -12dB,
same timbre as the supersaw core
5. Air layer: High-passed noise or bright shimmer above 8kHz

SIDECHAIN: All lead layers sidechain-compressed by kick,
7dB of ducking, 150ms release time.

ARRANGEMENT: 16-bar build with filter sweep, then 16-bar drop.
The drop should feel like a physical impact.

Result: Noticeably better. The layering complexity improved, the sidechain behavior was more accurate, and the drop energy differential was greater. However, the phase relationships between layers were still imprecise — the pluck transient was not consistently co-triggered with the main layer note-ons.

Approach 3: Chain-of-Thought Musical Reasoning

Inspired by chain-of-thought prompting techniques from NLP, I added an explicit "reasoning step" before the generation instruction:

Before generating the audio, describe to yourself the following:
1. What frequency range does each lead layer occupy?
2. At what BPM subdivisions does the sidechain compress?
3. How does the energy envelope of the build differ from the drop?
4. Which layer carries the melodic information for a listener on
a festival system with limited high-frequency response?

Now, using that internal model, generate a progressive house track where:
[Layer Stack Prompt from Approach 2]

This approach produced the most architecturally coherent outputs. The model's "internal model" (even if simulated through the reasoning step) appeared to better constrain the downstream generation.

Key insight: Chain-of-thought prompting appears to help music generation for much the same reason it helps with mathematics: it forces the model to decompose the problem before executing on it. Musical architecture is a hierarchical constraint satisfaction problem, and CoT creates an approximate solving path.

Approach 4: Reference-Anchored Prompting

The fourth technique I explored was anchoring the prompt to specific, well-known tracks as perceptual references:

Generate a progressive house track that matches the drop architecture of 
"Calling (Lose My Mind)" by Sebastian Ingrosso and Alesso — specifically:
- The same density of the supersaw texture (7-8 voice unison)
- The same sidechain pumping character (moderate pump, not aggressive)
- The same high-frequency air and brightness
- The same octave separation between the bass layer and the lead

The new track should be in F# minor at 126 BPM, original melody.

Result: This approach produced the highest stylistic accuracy, because the model could draw on its implicit knowledge of those tracks as reference points. However, it is also the highest-risk approach from a copyright/training data perspective — the model may be more directly interpolating from the source material rather than generalizing the structural principles.


📐 Technical Deep Dive: Spectral Layer Mapping

Let me go even deeper for the producers reading this.

Frequency Band Allocation

In a well-constructed progressive house lead stack, each layer occupies a distinct region of the spectrum:

| Layer | Primary Frequency Range | Role |
| --- | --- | --- |
| Sub-bass kick | 40–80 Hz | Rhythmic foundation |
| Foundation saw | 80–500 Hz | Body and weight |
| Supersaw core | 200–8,000 Hz | Central texture |
| Pluck transient | 1,000–5,000 Hz | Pitch definition |
| Harmony fifth | 300–6,000 Hz | Width and space |
| Air/noise | 8,000–18,000 Hz | Brightness and air |

The critical observation here is that the supersaw core spans the widest range — intentionally, because it is the carrier for the melody. But its filter is set to allow the pluck transient to exist in the upper mid-range without being masked.

This is why simply generating a "loud synth" fails: it is not about volume, it is about spectral architecture — each layer having a defined, non-masking frequency zone.
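One way to sanity-check a band plan like the one above is to measure how much spectrum two layers share. The sketch below copies the table's ranges into a dictionary and reports overlap in octaves; the names `BANDS_HZ` and `overlap_octaves` are made up for this example.

```python
import math

# Frequency ranges copied from the band allocation table (Hz)
BANDS_HZ = {
    "sub_bass_kick": (40, 80),
    "foundation_saw": (80, 500),
    "supersaw_core": (200, 8000),
    "pluck_transient": (1000, 5000),
    "harmony_fifth": (300, 6000),
    "air_noise": (8000, 18000),
}


def overlap_octaves(band_a: tuple, band_b: tuple) -> float:
    """Width of the shared region between two bands, in octaves (0 if none)."""
    low = max(band_a[0], band_b[0])
    high = min(band_a[1], band_b[1])
    return math.log2(high / low) if high > low else 0.0


if __name__ == "__main__":
    names = list(BANDS_HZ)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = overlap_octaves(BANDS_HZ[first], BANDS_HZ[second])
            if shared > 0:
                print(f"{first} / {second}: {shared:.2f} octaves shared")
```

Shared octaves are not automatically a problem, but they mark where level, filtering, or stereo placement has to do the separating.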

Stereo Field Distribution

The stereo image of a progressive house drop is not random:

  • Mono: Sub-bass, kick, bass line, lead melody (center-weighted)
  • Wide stereo: Supersaw unison spread, reverb tails, delay repeats
  • Mid-side: The core melody stays in the mid (mono-compatible), while the width comes from the unison detune

This mono-compatibility is critical for festival PA systems, where the sub-bass is almost always mono (single subwoofer array on the center axis). A producer who places the low-frequency lead content in the sides risks losing it on most festival rigs.

When I described this mono/stereo allocation explicitly in my prompts, the generated audio showed meaningfully better translation to different listening environments.
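Mid-side encoding makes the mono-compatibility point easy to check. In this minimal sketch (plain Python lists standing in for audio buffers, helper names hypothetical), anything that lives only in the side channel disappears when left and right are summed to mono.

```python
def to_mid_side(left: list, right: list) -> tuple:
    """Split a stereo pair into mid (L+R)/2 and side (L-R)/2 signals."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def mono_sum(left: list, right: list) -> list:
    """What a mono playback system reproduces: the mid signal only."""
    return [(l + r) / 2 for l, r in zip(left, right)]


if __name__ == "__main__":
    # Content placed purely in the sides: right is the inverse of left
    left = [0.3, -0.2, 0.5]
    right = [-0.3, 0.2, -0.5]
    mid, side = to_mid_side(left, right)
    print("mid:", mid)    # all zeros
    print("side:", side)  # the whole signal
    print("mono:", mono_sum(left, right))
```

This is why keeping the melody in the mid channel and letting only the unison detune spread feed the sides survives a mono fold-down.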

Envelope Timing Mathematics

The sidechain "pump" in progressive house is not just an aesthetic choice — it is a tempo-synchronized event. At 128 BPM:

  • Quarter note = 468.75ms
  • The sidechain release should complete within ~70-80% of the quarter note: ~330–375ms
  • This ensures the lead is fully restored before the next kick hit

Setting the release shorter (< 200ms) creates an aggressive "EDM" pump. Setting it longer (> 400ms) causes the lead to still be ducked when the next kick fires — resulting in a muddy, low-energy perception even at high loudness.
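The timing arithmetic generalizes to any tempo. A small sketch, assuming the 70 to 80% window described above; the function names are illustrative.

```python
def quarter_note_ms(bpm: float) -> float:
    """Length of one quarter note in milliseconds."""
    return 60_000.0 / bpm


def release_window_ms(bpm: float, low: float = 0.70, high: float = 0.80) -> tuple:
    """Sidechain release range as a fraction of the quarter note."""
    quarter = quarter_note_ms(bpm)
    return quarter * low, quarter * high


if __name__ == "__main__":
    for bpm in (126, 128):
        shortest, longest = release_window_ms(bpm)
        print(f"{bpm} BPM: quarter {quarter_note_ms(bpm):.2f} ms, "
              f"release {shortest:.0f} to {longest:.0f} ms")
```

At 126 BPM the quarter note stretches to about 476 ms, so the same window becomes roughly 333 to 381 ms.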

This mathematical precision is exactly the kind of relationship that is difficult for auto-regressive audio models to maintain globally across a 4-minute track. A diffusion model generating the entire track holistically might handle this better, but current diffusion models still lag behind auto-regressive approaches in audio quality for complex, structured music.


🔧 My Prompt Engineering Framework for Layer Accuracy

After many iterations, here is the framework I found most effective for guiding commercial AI music generation toward progressive house lead accuracy. I call it the SALT framework (Spectral, Attack, Layer, Timing):

S — Spectral Description

Always specify frequency ranges explicitly for each element. Do not leave them implicit. Models have been trained on text that describes music emotionally ("big," "massive," "energetic") rather than spectrally. Compensate by adding the technical language.

The lead should have:
- Deep, filtered weight in the 200-500Hz range
- Core harmonic density from 500Hz to 5kHz
- Brilliant, airy extension above 8kHz
The low-end below 200Hz should be clean and reserved for kick and bass.

A — Attack Character

Describe the perceptual onset of every main element. "Attack" is not just a synthesizer ADSR parameter — it is how the listener experiences the beginning of each musical event.

Each note of the melody should begin with a sharp, bright transient 
(as if plucked), followed immediately by a sustained supersaw texture.
The first 50-100ms of each note should be brighter and punchier than
the sustained portion.

L — Layer Interdependence

Explicitly state which layers should react to which other layers. The sidechain relationship is the most critical.

Every lead layer should duck by approximately 7-8dB every time the 
kick drum hits. The recovery should be smooth and complete within
350ms. This creates a rhythmic "breathing" quality.

T — Timing Anchors

Give explicit BPM-synchronized timing references for all dynamic events.

BPM: 128. 
The build begins at bar 1 and reaches its filter-sweep apex at bar 14.
The drop begins at bar 17 with full supersaw density.
The break begins at bar 33 and strips back to bass and kick only.

When all four components are present in a single prompt, I consistently observed:

  • Better spectral separation between elements
  • More accurate sidechain timing
  • More convincing drop/build contrast
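If you assemble SALT prompts often, a small helper keeps all four sections present and in a fixed order. This is a sketch of one possible structure, not part of any music tool's API; the class name `SaltPrompt` and the section labels are my own choices.

```python
from dataclasses import dataclass


@dataclass
class SaltPrompt:
    """Holds the four SALT sections as free-text blocks."""
    spectral: str
    attack: str
    layer: str
    timing: str

    def render(self) -> str:
        """Join the sections into one prompt, refusing to skip any of them."""
        sections = [
            ("SPECTRAL", self.spectral),
            ("ATTACK", self.attack),
            ("LAYER", self.layer),
            ("TIMING", self.timing),
        ]
        missing = [name for name, text in sections if not text.strip()]
        if missing:
            raise ValueError(f"missing SALT sections: {', '.join(missing)}")
        return "\n\n".join(f"{name}:\n{text.strip()}" for name, text in sections)


if __name__ == "__main__":
    prompt = SaltPrompt(
        spectral="Low-end below 200Hz is clean and reserved for kick and bass.",
        attack="Each melody note begins with a sharp, bright transient.",
        layer="Every lead layer ducks 7-8dB on each kick, recovering within 350ms.",
        timing="BPM: 128. Drop begins at bar 17 with full supersaw density.",
    )
    print(prompt.render())
```

Raising an error on an empty section is deliberate: the observations above apply when all four components are present in the same prompt.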

🎨 Layer Timbral Vocabulary: The LTV Method

The SALT framework handles structure — where things sit in the mix, when they happen, and how they relate to each other. But there is a second dimension that is equally important and almost always missing from AI music prompts: timbre.

Timbre is the characteristic "color" or "texture" of a sound that persists regardless of pitch or volume. It is what makes a supersaw sound like a supersaw and not a piano. And it is almost impossible to describe accurately without a dedicated vocabulary.

I developed the Layer Timbral Vocabulary (LTV) method specifically to address this gap. The core principle is: for each layer in the stack, write a timbral descriptor block using a standardized schema. The schema has five fields:

LAYER NAME:
texture: [1–3 adjectives describing the physical sensation of the sound]
brightness: [where on the spectrum the sound "lives" — dark / warm / neutral / bright / brilliant]
movement: [how the sound changes over time — static / breathing / evolving / aggressive]
density: [how "thick" or "sparse" the sound is — sparse / moderate / dense / massive]
edge: [how defined the attack and note boundaries are — soft / medium / sharp / percussive]

Why These Five Dimensions?

They map directly to the psychoacoustic parameters a listener processes in the first ~200ms of hearing a sound:

  • Texture → spectral complexity (harmonic content, noise component, modulation)
  • Brightness → spectral centroid (where the dominant energy lives)
  • Movement → temporal modulation (LFOs, filters, tremolo, vibrato)
  • Density → unison count, reverb wash, polyphonic weight
  • Edge → envelope attack slope and transient sharpness

These are not DAW parameters. They are perceptual primitives that a language model can map to synthesis decisions, because they match the vocabulary of the training data it has consumed — synthesizer reviews, production tutorials, sound design guides.
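The five-field schema can be sketched as a small validated record, so that a typo such as "brite" fails early instead of silently weakening a prompt. The class and constant names below are hypothetical; the allowed values are the ones listed in the schema.

```python
from dataclasses import dataclass

# Allowed values per field, taken from the LTV schema
ALLOWED = {
    "brightness": ("dark", "warm", "neutral", "bright", "brilliant"),
    "movement": ("static", "breathing", "evolving", "aggressive"),
    "density": ("sparse", "moderate", "dense", "massive"),
    "edge": ("soft", "medium", "sharp", "percussive"),
}


@dataclass(frozen=True)
class LayerDescriptor:
    name: str
    texture: tuple  # 1 to 3 free-form adjectives
    brightness: str
    movement: str
    density: str
    edge: str

    def __post_init__(self):
        if not 1 <= len(self.texture) <= 3:
            raise ValueError("texture takes 1 to 3 adjectives")
        for field_name, allowed in ALLOWED.items():
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(f"{field_name}={value!r} not in {allowed}")

    def render(self) -> str:
        """Format the descriptor as a prompt-ready text block."""
        lines = [f"LAYER: {self.name}", f"texture: {', '.join(self.texture)}"]
        lines += [f"{field}: {getattr(self, field)}" for field in ALLOWED]
        return "\n".join(lines)


if __name__ == "__main__":
    core = LayerDescriptor("supersaw_core", ("rich", "glassy", "chest-filling"),
                           "bright", "breathing", "massive", "sharp")
    print(core.render())
```

This version accepts only the bare descriptor values. Parenthetical notes such as "static (no modulation)" would need an extra free-text field.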

LTV Schema Applied to the "Calling" Drop

Here is the LTV descriptor block for the lead stack in "Calling (Lose My Mind)" as I would write it in a prompt:

LAYER: foundation_saw
texture: warm, woody, slightly hollow
brightness: dark
movement: static (no modulation)
density: sparse
edge: medium (notes have slight softness at onset)

LAYER: supersaw_core
texture: rich, glassy, chest-filling
brightness: bright
movement: breathing (sidechain pumping creates slow rhythmic swell)
density: massive (7–8 detuned voices)
edge: sharp (instant onset, full sustain)

LAYER: pluck_transient
texture: crisp, dry, percussive
brightness: brilliant (mostly upper midrange energy)
movement: static (decays immediately, no sustain modulation)
density: sparse
edge: percussive (attack defines the entire sound)

LAYER: parallel_harmony
texture: glassy, smooth, slightly ghostly
brightness: bright
movement: breathing (same sidechain as supersaw_core)
density: moderate
edge: sharp

LAYER: air_layer
texture: airy, silky, weightless
brightness: brilliant (exclusively 8kHz and above)
movement: evolving (subtle shimmer LFO at quarter-note rate)
density: sparse
edge: soft (no defined transient, continuous presence)

Combining LTV with SALT: The Full Descriptor Prompt

When you merge the SALT architectural framework with LTV timbral descriptors, you get the most complete prompt structure I have found for this genre. Here is a full worked example targeting a Sebastian Ingrosso & Alesso "Calling (Lose My Mind)"-style drop, with the cold, melodic, euphoric progressive house sound that also defines tracks like Axwell /\ Ingrosso's "Sun Is Shining" and Swedish House Mafia's "Leave the World Behind":

TRACK PARAMETERS:
BPM: 126
Key: A minor
Style: Melodic progressive house (Sebastian Ingrosso & Alesso / Axwell style,
circa 2012–2013) — cold, euphoric, emotionally driven

SALT FRAMEWORK:
Spectral: Sub below 200Hz is clean and reserved for kick+bass.
Lead occupies 200Hz–16kHz, distributed across layers as below.
Attack: Every melody note starts with a defined, bright transient
(pluck character), immediately followed by warm, sustained supersaw texture.
The melody is the emotional centrepiece — it must be clear and singable.
Layer: All lead layers sidechained to kick, 7dB ducking, 350ms release
(moderate pump — the lead recovers fully before the next kick,
preserving melodic continuity and emotional arc).
Timing: Build bars 1–15, filter sweeps from closed to fully open,
emotional tension building through a rising melodic motif.
Drop bar 17, full layer density, no filter, melody centred and dominant.
Break bar 33, strip back to a sparse melodic element (piano or arp)
and kick only — maximum emotional contrast before the second drop.

LAYER TIMBRAL VOCABULARY:
foundation_saw:
texture: warm, woody, slightly hollow
brightness: dark
movement: static
density: sparse
edge: medium

supersaw_core:
texture: rich, glassy, chest-filling
brightness: bright
movement: breathing (sidechain)
density: massive
edge: sharp

pluck_transient:
texture: crisp, dry, percussive
brightness: brilliant
movement: static
density: sparse
edge: percussive

harmony_fifth:
texture: glassy, smooth, slightly ghostly
brightness: bright
movement: breathing (sidechain)
density: moderate
edge: sharp

noise_air:
texture: airy, silky, weightless
brightness: brilliant
movement: evolving (quarter-note shimmer)
density: sparse
edge: soft

ARRANGEMENT INSTRUCTION:
The drop should feel euphoric and emotionally lifted — spacious and
longing rather than just physically heavy. The glassy supersaw texture
and the high-frequency air layer should create a sense of openness that
pulls the listener upward. The melody must be the emotional centrepiece:
clear, singable, and dominant in the mix at all times. The moderate
sidechain pump (7dB, 350ms) preserves melodic continuity across kick
hits, keeping the emotional arc flowing throughout the entire drop section.

Timbral Vocabulary Reference Table

To make LTV reusable across any prompt, here is a reference table of tested descriptor values and what synthesis parameter each maps to:

| Dimension | Descriptor | Approximate Synthesis Meaning |
| --- | --- | --- |
| texture | warm | Strong even harmonics, low spectral centroid, no harsh peaks |
| texture | glassy | Clean upper harmonics, minimal noise, thin modulation |
| texture | gritty | Odd harmonics dominant, slight overdrive/saturation |
| texture | silky | Smooth filter rolloff, no resonance peak |
| texture | metallic | FM-like inharmonic partials, bell-like ring |
| texture | woolly | Heavy low-pass filtering, slightly muddy, warm |
| brightness | dark | Spectral centroid below 1kHz |
| brightness | warm | Centroid 1–3kHz, no harsh peaks |
| brightness | neutral | Centroid 3–5kHz, balanced spectrum |
| brightness | bright | Centroid 5–8kHz, presence peak |
| brightness | brilliant | Significant energy above 8kHz, airy top-end |
| movement | static | No LFO or modulation on main tonal parameters |
| movement | breathing | Rhythmically synced amplitude modulation (sidechain) |
| movement | evolving | Slow (>1 beat) filter or pitch drift |
| movement | aggressive | Fast, sharp modulation (tremolo, hard filter LFO) |
| density | sparse | 1–2 oscillator voices, minimal reverb |
| density | moderate | 3–4 unison voices or medium reverb wash |
| density | dense | 5–6 unison voices or long reverb tail |
| density | massive | 7–8+ unison voices, wide stereo field |
| edge | soft | Slow attack (>20ms), notes blend together |
| edge | medium | Attack 5–20ms, notes distinct but smooth |
| edge | sharp | Attack <5ms, very defined note boundaries |
| edge | percussive | Attack <2ms, sound is mostly transient |
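Going the other way, from synth settings to LTV descriptors, can be sketched as a pair of lookup functions. The thresholds are copied from the table above. Two choices are assumptions of this sketch: the overlap between "sharp" (<5ms) and "percussive" (<2ms) is resolved in favor of the faster class, and density is judged from voice count alone, ignoring reverb.

```python
def edge_for_attack(attack_ms: float) -> str:
    """Map an envelope attack time to an LTV edge descriptor."""
    if attack_ms < 2:
        return "percussive"
    if attack_ms < 5:
        return "sharp"
    if attack_ms <= 20:
        return "medium"
    return "soft"


def density_for_voices(voices: int) -> str:
    """Map a unison voice count to an LTV density descriptor."""
    if voices < 1:
        raise ValueError("voices must be at least 1")
    if voices <= 2:
        return "sparse"
    if voices <= 4:
        return "moderate"
    if voices <= 6:
        return "dense"
    return "massive"


if __name__ == "__main__":
    # A 7-voice supersaw with a ~1ms attack
    print(density_for_voices(7), edge_for_attack(1.0))
```

A helper like this keeps the descriptors in a prompt consistent with the patch you actually have in mind.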

Why This Works

The LTV method works because it translates producer intent into a language that sits at the intersection of music journalism, synthesis documentation, and perceptual audio description — all of which are heavily represented in a large language model's training data.

When a model reads "texture: rich, glassy, chest-filling / brightness: bright / density: massive", it can draw on thousands of Serum preset descriptions, synthesizer magazines, and sound design tutorials that use exactly this vocabulary to describe supersaw patches. The model has the knowledge. LTV is the key that unlocks it.

Compare this to a model reading "make a big synth sound" — it has no anchor in the space of synthesis decisions. LTV gives it coordinates.


🌡️ Honest Assessment: Where AI Still Falls Short

Even with the SALT framework, there are areas where commercial models consistently fail:

1. Phase Coherence Across Layers

When you layer sounds in a DAW, you can phase-align them precisely. A model generating audio holistically cannot guarantee that Layer 1 and Layer 5 are phase-coherent at the note onsets. This creates subtle comb-filtering artifacts in the composite sound — particularly audible on headphones, where stereo imaging is most precise.
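The comb-filtering effect has a simple idealized form: summing a signal with an equal-level copy delayed by τ cancels frequencies at odd multiples of 1/(2τ). The sketch below lists those notch frequencies. It assumes two identical layers, so real stacks with different timbres will show shallower, less regular notches.

```python
def comb_notches_hz(delay_ms: float, max_hz: float = 20_000.0) -> list:
    """Notch frequencies produced by summing a signal with a delayed copy.

    Idealized case: both copies are identical and equally loud.
    Notches fall at (2k + 1) / (2 * delay) for k = 0, 1, 2, ...
    """
    if delay_ms <= 0:
        return []  # perfectly aligned layers produce no comb filtering
    delay_s = delay_ms / 1000.0
    notches = []
    k = 0
    while True:
        freq = (2 * k + 1) / (2 * delay_s)
        if freq > max_hz:
            break
        notches.append(freq)
        k += 1
    return notches


if __name__ == "__main__":
    for freq in comb_notches_hz(1.0, max_hz=3000.0):
        print(f"{freq:.0f} Hz")
```

A misalignment of just 1 ms puts the first notches at 500 Hz, 1.5 kHz, and 2.5 kHz, squarely inside the supersaw core's range.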

2. Sidechain Consistency Over Time

The sidechain pump should be completely consistent from bar 1 to bar 128. Human-produced tracks achieve this through routing: one compressor, keyed to the kick, applied globally to the lead bus. AI-generated tracks show variable pump depth across the timeline, particularly when the model "re-imagines" the arrangement partway through.

3. Global Harmonic Locking

In "Heroes (We Could Be)" and "Calling," the harmony of the lead stack is locked to the chord progression underneath for the entire drop section. Every chord change triggers a global re-voicing of every layer simultaneously. This is a compositional constraint that spans tens of seconds, far beyond the temporal resolution at which current audio models make moment-to-moment decisions.

4. Sub-Bass / Lead Relationship

The sub-bass and lead melody must play the same note (in different octaves) at all times. This is a hard constraint. Deviations cause the low-end to "fight" the lead, creating a muddy, undefined bottom. Models trained on multi-instrument audio sometimes allow these to diverge, because statistically it is more common for bass and melody to be harmonically related but not identical.


🧪 Comparative Results

Here is a structured comparison of outputs across the approaches:

| Technique | Sidechain Accuracy | Spectral Separation | Drop Impact | Layer Complexity | Timbral Accuracy |
| --- | --- | --- | --- | --- | --- |
| Naive prompt | Low | Poor | Medium | Low | Poor |
| Layer Stack Prompt | Medium | Good | High | Medium | Medium |
| Chain-of-Thought | High | Good | High | High | Medium |
| Reference-Anchored | High | Very Good | Very High | High | Good |
| SALT Framework | High | Very Good | Very High | High | Good |
| SALT + LTV | High | Very Good | Very High | Very High | Very Good |

The key takeaway: the more the prompt mirrors the internal production architecture, the more accurately the model generates that architecture. This is not surprising in hindsight — it aligns with the general principle that models perform best when the prompt's format matches the format of relevant training data. A music production tutorial or DAW template would describe a track in exactly the language of the SALT framework.


💡 Bigger Implications: What This Tells Us About AI Music

This investigation convinced me of something broader: the bottleneck in AI music quality is not model capacity, it is prompt vocabulary.

Most users of AI music tools describe music the way listeners describe it — "make it sound like [artist]" or "something energetic for the drop." This listener-vocabulary prompt produces listener-quality output: it sounds approximately right, but lacks the architectural precision of a produced track.

When you switch to producer-vocabulary prompting — describing spectral occupancy, envelope timing, layer interdependence — the output quality improves dramatically. The underlying model has the capacity. It just needs to be activated through the right framing.

This is not fundamentally different from how the same phenomenon plays out in code generation:

  • Listener equivalent: "make a login function"
  • Producer equivalent: "implement JWT authentication with RS256 signing, 15-minute access token expiry, sliding refresh tokens persisted in Redis with namespace isolation, and PKCE flow for public clients"

In both cases, the precision of the vocabulary in the prompt correlates directly with the precision of the output.

The Fine-Tuning Question

Could a specialized fine-tuned model do this better than prompt engineering alone? Almost certainly yes — and here I am specifically talking about fine-tuning open-source music generation models, not proprietary text-based LLMs.

The most relevant ecosystem right now is Meta's AudioCraft, which includes:

  • MusicGen — a transformer-based conditional music generation model that can be fine-tuned on your own audio dataset
  • AudioGen — focused on environmental and sound-effect generation
  • EnCodec — the neural audio codec underpinning both, which compresses audio into discrete tokens that the transformer operates on

Fine-tuning MusicGen on a curated progressive house dataset — stems, loops, or full tracks annotated with genre-specific metadata — would give you a model that natively understands the spectral vocabulary of the genre. Instead of prompting a general-purpose model to approximate layering logic, you would be sampling from a distribution that was explicitly trained on it.

The tradeoff is real though: you need a labeled audio dataset at the production-layer level (not just raw tracks), GPU infrastructure to run the fine-tuning job, and ongoing retraining as production styles evolve. AudioCraft's training scripts are open and well-documented, but the data curation is the hard part. Getting clean, properly tagged stems — kick, bass, lead layers separated and annotated — is a significant effort that most solo producers cannot sustain.

That said, for anyone building a music-generation product or a studio-grade AI tool, fine-tuning an open-source model like MusicGen is the right path. The SALT framework for prompt engineering remains the practical starting point for creative exploration, closing a meaningful quality gap without any model infrastructure — but a domain-adapted AudioCraft model is where the production ceiling sits.


🎯 Practical Takeaways

If you're a music producer exploring AI tools, or an AI engineer building music generation systems:

  1. Describe architecture, not feeling. Replace "massive drop" with specific layer specifications.
  2. Specify frequency bands explicitly. Models do not default to spectral precision; you have to ask for it.
  3. Use BPM-synchronized timing. State exactly when events happen in bar numbers or milliseconds.
  4. Declare sidechain relationships explicitly. It is the single most impactful technique for making AI progressive house sound authentic.
  5. Use chain-of-thought for complex arrangement decisions. Ask the model to reason about structure before generating.
  6. Reference real tracks for style anchoring. Providing exact track references activates the model's implicit genre knowledge.
  7. Apply the LTV method for each layer. Describe texture, brightness, movement, density, and edge for every element in the stack. This closes the timbral gap that architecture prompts alone cannot address.

🎶 Closing Thoughts

Progressive house lead layering is, in some ways, the perfect test case for AI music generation. It is technically precise enough to reveal the gaps in model architecture, yet popular enough to have significant training data representation. It requires global structural coherence (the arrangement) and local signal precision (the layer phase relationships) simultaneously — a combination that exposes the weaknesses of current auto-regressive audio systems.

What surprised me most in this investigation was not where the AI failed — it was how close the SALT framework got to bridging the gap. A year ago, I would not have believed that a text prompt could meaningfully guide the spectral architecture of a synthesizer layer stack. Now I think the ceiling of what prompt engineering can achieve in this domain is still being discovered.

The fact that I can sit down at a DAW to produce a progressive house track and simultaneously design prompt frameworks that help AI understand the same production — that is the kind of cross-domain insight that I find genuinely exciting. Two passions, one engineering problem, a lot still to figure out.

The drop hits differently when you understand every layer that builds it. 🔊


Eduardo J. Barrios García is a Software & AI Engineer and electronic music producer (EyeMad). This article reflects personal experimentation and informal findings, not peer-reviewed research.