ACE-Step on Apple Silicon: The Install Guide That Actually Works

Most ACE-Step tutorials tell you to run the official macOS launcher script (start_gradio_ui_macos.sh) and walk away. On a 25 GB M1 Max, that script’s default LM choice will crash the process before the first generation finishes. The fix is a one-line edit to the script’s config block, and nobody flags it.

ACE-Step 1.5 is a free, open-source AI music generator with Apache 2.0 weights. It runs on your Mac. No credits, no monthly fee, no usage caps, full commercial rights. The catch is that it ships with a model loadout sized for a 24 GB CUDA card, not Apple’s unified memory, and the difference matters.

This guide covers the install, the memory trap, and a verified prompt recipe for hip-hop instrumentals — the genre I tested most heavily on the local install.

Why Run ACE-Step Locally on a Mac

Three reasons local beats cloud here.

Cost. Suno’s Pro plan is $10/month for 2,500 credits. Udio’s Pro is $30/month for 4,800 credits. ACE-Step on your own hardware is $0/month forever, with the option to publish, sell, or sync the output without a license review.

Privacy. No prompts leave the machine. For commercial sound design or client work where the brief is confidential, that matters.

Determinism. Same seed, same prompt, same output. Cloud providers swap model versions silently. A track you can re-generate two months from now is a track you can iterate on.

The trade is generation time. On an M1 Max with the configuration below, a 60-second clip takes about two minutes once the models are loaded. That’s slower than Suno’s 30-second turnaround, but it’s the price for owning the pipeline.

Hardware Check Before You Install

Per the official INSTALL.md, ACE-Step needs:

Python 3.11–3.12 (stable, not pre-release)
~10 GB disk for the core models
Apple Silicon M1 or newer (the MLX backend requires arm64)
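
The three requirements above can be checked before cloning anything. This is a minimal sketch using only the Python standard library; the function names are made up for illustration, the 10 GB threshold is the rough figure quoted above, and it does not detect pre-release Python builds.

```python
import platform
import shutil
import sys

MIN_PY, MAX_PY = (3, 11), (3, 12)  # supported range quoted from INSTALL.md
MIN_FREE_GB = 10                   # approximate size of the core models


def check_prereqs(machine, py_version, free_bytes):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if machine != "arm64":
        problems.append(f"MLX backend needs arm64, found {machine}")
    if not (MIN_PY <= tuple(py_version[:2]) <= MAX_PY):
        problems.append(
            f"Python {py_version[0]}.{py_version[1]} is outside 3.11-3.12"
        )
    if free_bytes < MIN_FREE_GB * 10**9:
        problems.append(
            f"only {free_bytes / 10**9:.1f} GB free, need about {MIN_FREE_GB} GB"
        )
    return problems


def check_this_machine(path="."):
    """Gather the real values for the current machine and check them."""
    return check_prereqs(
        platform.machine(),
        sys.version_info,
        shutil.disk_usage(path).free,
    )
```

Calling check_this_machine() from the directory where you plan to clone the repo returns an empty list when all three conditions hold. A Python process running under Rosetta reports x86_64, so a failure on the first check can also mean the interpreter itself is an Intel build.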

The repo’s “Which Model Should I Choose?” table is written for discrete CUDA cards. On Apple’s unified memory architecture, the practical ceiling is different. Tested on a 25 GB M1 Max:

Configuration	Result on 25 GB M1 Max
`acestep-v15-xl-turbo` + `acestep-5Hz-lm-0.6B`	Runs clean
`acestep-v15-xl-turbo` + `acestep-5Hz-lm-1.7B`	OOM (peak ~42 GiB allocated across MLX + PyTorch fallback)
`acestep-v15-base` + `acestep-5Hz-lm-1.7B`	Runs (smaller DiT compensates)

The first row is the recommended Mac config for hip-hop and other groove-locked genres. The second row pairs the XL DiT with the 1.7B LM that the default macOS launcher script loads, and that LM choice is the failure mode this article exists to prevent.

Install Steps from the Repo

The official install path uses uv, the Astral package manager. Run these in order.

1. Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

2. Clone and sync dependencies

git clone https://github.com/ACE-Step/ACE-Step-1.5.git
cd ACE-Step-1.5
uv sync

uv sync reads pyproject.toml, resolves the dependency graph, and creates a .venv in the repo root. First-time sync downloads about 3 GB of wheels.

3. Launch with the MLX backend

ACE-Step ships a Mac-specific launcher that sets the backend correctly:

chmod +x start_gradio_ui_macos.sh start_api_server_macos.sh
./start_gradio_ui_macos.sh

This script auto-sets ACESTEP_LM_BACKEND=mlx and passes --backend mlx. MLX is Apple’s native ML framework. It runs on the GPU through Metal and works directly in the unified memory pool, with no copies between CPU and GPU memory. In practice it is the only configuration worth running on a Mac.

The Gradio UI opens at http://localhost:7860. Models download from Hugging Face on first generation, not at launch, so expect a one-time ~10 GB pull when you fire the first prompt.
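
If the browser tab stays blank, it helps to know whether anything is listening on the port at all. A small sketch, assuming the default host and port from the URL above; the function name is illustrative and it only tests for a TCP listener, not for a healthy UI.

```python
import socket


def ui_is_listening(host="localhost", port=7860, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

A False result right after launch usually just means the server is still starting, so retry after a few seconds before digging into the logs.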

The 0.6B vs 1.7B LM Trap

Here’s the part that breaks installs.

The default start_gradio_ui_macos.sh config block looks like this:

CONFIG_PATH="--config_path acestep-v15-turbo"
LM_MODEL_PATH="--lm_model_path acestep-5Hz-lm-1.7B"

The DiT (acestep-v15-turbo) is the audio diffusion model. The LM (acestep-5Hz-lm-1.7B) is the language model that handles caption parsing and chain-of-thought metadata. On a 24 GB CUDA card with vLLM, this combination is fine. On an M1 Max with MLX, peak allocation across the MLX path and the PyTorch fallback will hit ~42 GiB and crash.

The fix is one edit. Open start_gradio_ui_macos.sh and change the LM line:

LM_MODEL_PATH="--lm_model_path acestep-5Hz-lm-0.6B"

The 0.6B model has roughly a third of the parameters. It parses captions slightly less accurately on long-tail genres (west-coast g-funk, drum-and-bass subgenres), and it runs without OOM on every M-series Mac with 16 GB or more.
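
The edit can also be scripted, which is handy if you re-clone the repo often. The sketch below assumes the launcher contains exactly one LM_MODEL_PATH line in the form quoted above; the helper names are mine, not part of ACE-Step, and the script keeps a .bak copy before writing.

```python
import re
from pathlib import Path

# Matches the LM line as quoted in this guide, allowing leading indentation.
LM_LINE = re.compile(
    r'^([ \t]*LM_MODEL_PATH="--lm_model_path )[^"]*(")', re.MULTILINE
)


def set_lm_model(script_text, model="acestep-5Hz-lm-0.6B"):
    """Return script_text with the LM_MODEL_PATH value swapped to `model`."""
    patched, count = LM_LINE.subn(
        lambda m: m.group(1) + model + m.group(2), script_text
    )
    if count != 1:
        raise ValueError(f"expected one LM_MODEL_PATH line, found {count}")
    return patched


def patch_launcher(path="start_gradio_ui_macos.sh"):
    """Back up the launcher next to itself, then rewrite it in place."""
    script = Path(path)
    original = script.read_text()
    script.with_suffix(script.suffix + ".bak").write_text(original)
    script.write_text(set_lm_model(original))
```

Run patch_launcher() from the repo root. It raises instead of guessing when the line is missing or appears more than once, which is the signal to open the file and edit it by hand.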

If you have a 32 GB or 64 GB M2/M3/M4, the 1.7B may fit — test it once with a short generation and watch Activity Monitor’s memory pressure. If you see swap activity above 1 GB, drop to 0.6B.
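
Activity Monitor works, but the same number is available from the terminal through sysctl vm.swapusage. The sketch below parses that command's output and applies the 1 GB rule from the paragraph above. The helper names are illustrative, and the figure is system-wide swap rather than ACE-Step's share, so compare a reading taken before the test generation with one taken during it.

```python
import re
import subprocess

# sysctl prints sizes with a K, M, or G suffix (binary units); convert to GiB.
_TO_GIB = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0}


def swap_used_gib(sysctl_output):
    """Parse `sysctl vm.swapusage` output and return the swap in use, in GiB."""
    match = re.search(r"used = ([\d.]+)([KMG])", sysctl_output)
    if match is None:
        raise ValueError(f"unrecognized vm.swapusage output: {sysctl_output!r}")
    return float(match.group(1)) * _TO_GIB[match.group(2)]


def read_swap_used_gib():
    """Query the live value. macOS only."""
    result = subprocess.run(
        ["sysctl", "vm.swapusage"], capture_output=True, text=True, check=True
    )
    return swap_used_gib(result.stdout)


def pick_lm(swap_growth_gib, threshold_gib=1.0):
    """Apply the rule above: more than about 1 GB of new swap means use 0.6B."""
    if swap_growth_gib > threshold_gib:
        return "acestep-5Hz-lm-0.6B"
    return "acestep-5Hz-lm-1.7B"
```

A typical check is before = read_swap_used_gib(), start a short generation, then pick_lm(read_swap_used_gib() - before) while it runs.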

For most hip-hop, lo-fi, trap, boom-bap, and pop generations, the 0.6B is the right model on Apple Silicon. The accuracy delta only shows up on rare genre combinations the LM has weak training coverage for.

The Verified Hip-Hop Instrumental Prompt Recipe

ACE-Step’s defaults are optimized for songs with vocals. Hip-hop instrumentals need a different prompt shape, and the docs bury this in a tutorial nobody reads.

After running about 200 test generations across boom-bap, trap, and lo-fi at 75–145 BPM, this is the recipe that produced clearly-on-genre keepers most consistently:

Caption (prose, 50-90 words, no BPM in the text):
classic boom-bap hip-hop track with a steady drum-machine groove,
crisp snare on 2 and 4, prominent funky bassline, atmospheric
synth pads in the background, jazzy chord progressions, an
introspective and contemplative mood, late-night studio
atmosphere, dusty textures throughout

Lyrics field: [inst]

Inference parameters:
  config_path: acestep-v15-xl-turbo (or acestep-v15-turbo for low memory)
  lm_model_path: acestep-5Hz-lm-0.6B
  thinking: false
  use_cot_metas: false
  use_cot_caption: false
  use_cot_language: false
  shift: 3.0
  inference_steps: 8
  infer_method: sde
  duration: 30 (groove-locked genres) or 60 (genres with intro/build/drop)
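
Keeping the recipe in a file makes it easier to repeat a session weeks later. The sketch below collects the settings listed above into a dictionary and saves it as JSON. The parameter names are the ones from the recipe; the "caption" and "lyrics" keys and both helper names are my own labels, and how you feed the values in (UI fields or an API body) is not covered here.

```python
import json

# Caption text copied verbatim from the recipe above.
CAPTION = (
    "classic boom-bap hip-hop track with a steady drum-machine groove, "
    "crisp snare on 2 and 4, prominent funky bassline, atmospheric "
    "synth pads in the background, jazzy chord progressions, an "
    "introspective and contemplative mood, late-night studio "
    "atmosphere, dusty textures throughout"
)


def build_recipe(caption=CAPTION, duration=30, xl=True):
    """Collect the recipe's settings into one dictionary."""
    return {
        # "caption" and "lyrics" are assumed key names for the two text fields.
        "caption": caption,
        "lyrics": "[inst]",
        "config_path": "acestep-v15-xl-turbo" if xl else "acestep-v15-turbo",
        "lm_model_path": "acestep-5Hz-lm-0.6B",
        "thinking": False,
        "use_cot_metas": False,
        "use_cot_caption": False,
        "use_cot_language": False,
        "shift": 3.0,
        "inference_steps": 8,
        "infer_method": "sde",
        # 30 for groove-locked genres, 60 for genres with intro/build/drop.
        "duration": duration,
    }


def save_preset(path, recipe):
    """Write the recipe to a JSON file so a session can be repeated later."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(recipe, fh, indent=2)
```

For example, save_preset("boom_bap.json", build_recipe(duration=30)) writes a preset you can keep next to the rendered audio.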

Four things make this work:

Prose, not tags. Comma-separated tag-style captions activate the wrong attention pattern. Every shipped example JSON in the repo uses prose. Match what the model trained on.

No BPM in the caption. The repo’s Tutorial.md is explicit: tempo is a soft hint set via metadata, not a number you write in the description. Repeating it in the caption fights the LM’s metadata pass.

[inst] as the entire lyrics field. Not a song-structure skeleton, not blank, not “instrumental track no vocals.” Just [inst]. Better still: pass instrumental: true in the API call to override lyrics entirely.

Kill the LM with thinking=false. The LM is tuned for full songs with vocals and is the unreliable middle stage for instrumentals. Bypassing it sends your caption straight to the DiT, which makes the output more literal and more on-genre. This is the single highest-value tweak.

The shift=3.0 parameter is the second-highest-value tweak. Per the official Tutorial.md, it gives a “clearer, richer timbre” on hip-hop. Default turbo is jointly distilled across shifts 1, 2, and 3, so bumping to 3.0 at inference time moves the output toward the cleaner end of that range.
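
Three of the rules above are mechanical enough to check before you spend two minutes on a render: no tempo in the caption, [inst] as the whole lyrics field, and a caption length inside the recipe's stated 50-90 word range. A rough sketch, with an illustrative function name; it makes no attempt to judge prose versus tags, which still needs a human read.

```python
import re

# Catches "90 BPM", "90bpm", a bare "bpm", and "beats per minute".
TEMPO_PATTERN = re.compile(r"(?<![a-z])bpm\b|beats per minute", re.IGNORECASE)


def lint_caption(caption, lyrics, min_words=50, max_words=90):
    """Return a list of rule violations; an empty list means the input passes."""
    problems = []
    if TEMPO_PATTERN.search(caption):
        problems.append("caption mentions tempo; set it via metadata instead")
    if lyrics.strip() != "[inst]":
        problems.append("lyrics field should be exactly [inst]")
    word_count = len(caption.split())
    if not min_words <= word_count <= max_words:
        problems.append(
            f"caption has {word_count} words, outside {min_words}-{max_words}"
        )
    return problems
```

The word bounds are parameters because the range is a guideline from this recipe rather than a limit the model enforces, so loosen them if shorter captions work for your genre.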

When ACE-Step Beats Suno or Udio

ACE-Step is not a Suno replacement. It is the right tool for a specific job.

Use case	Better tool
One-off song with vocals for social posting	Suno or Udio
Reproducible drum and instrumental loops for a sample pack	ACE-Step (deterministic seeds + free unlimited generations)
Client work where the prompt is confidential	ACE-Step (local, nothing leaves the machine)
Long-form ambient/scoring with custom LoRAs	ACE-Step (fine-tunable, weights are yours)
You need a track in 30 seconds and don’t care about ownership	Suno’s free tier

The decision rule: if you’re going to generate the same prompt-shape ten or more times — sample packs, scoring beds, branded sonic identities — ACE-Step pays for itself in week one. If you’re generating one-off finished songs, Suno or Udio’s hosted UX wins.

For everything in between, the install is a weekend.

Start Generating Local AI Music

The Apple Silicon install is one uv sync and one config edit away from running. Free, commercial-use, no subscription. The 0.6B LM is the move on M1 Max; revisit the 1.7B once you’ve upgraded to 32+ GB unified memory.

If the install isn’t the path for you this week, Studio AI’s hosted music generator runs the same class of model in the browser with no setup.

Frequently Asked Questions

Does ACE-Step run on Intel Macs?

No. ACE-Step’s MLX backend requires Apple Silicon (arm64). Intel Macs would have to use the PyTorch CPU backend, which the repo explicitly recommends against for inference and forbids for training. If you’re on an Intel Mac, use Studio AI or Suno’s hosted free tier.

How much disk space does ACE-Step need on a Mac?

Around 10 GB for the core model bundle (VAE, Qwen3-Embedding-0.6B, acestep-v15-turbo, and one LM). Add ~6 GB if you also pull acestep-v15-xl-turbo for higher-quality output, and another ~1 GB for the 0.6B LM if you’re swapping off the default 1.7B. Per the official INSTALL.md, models download to ./checkpoints on first run.
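
The figures above add up to about 17 GB if you pull everything. The sketch below does that arithmetic and measures what is actually on disk. The sizes are the approximate numbers from this answer, the function names are illustrative, and the directory walk follows symlinks, so a cache that links files may be counted twice.

```python
from pathlib import Path

CORE_GB = 10      # VAE, Qwen3-Embedding-0.6B, acestep-v15-turbo, one LM
XL_TURBO_GB = 6   # optional acestep-v15-xl-turbo
SMALL_LM_GB = 1   # optional acestep-5Hz-lm-0.6B alongside the default 1.7B


def disk_budget_gb(with_xl=False, with_small_lm=False):
    """Estimate the download size in GB for the chosen set of models."""
    total = CORE_GB
    if with_xl:
        total += XL_TURBO_GB
    if with_small_lm:
        total += SMALL_LM_GB
    return total


def dir_size_gb(path="./checkpoints"):
    """Sum the sizes of all files under path, in GB. Missing dir counts as 0."""
    root = Path(path)
    if not root.is_dir():
        return 0.0
    total = sum(f.stat().st_size for f in root.rglob("*") if f.is_file())
    return total / 10**9
```

Comparing dir_size_gb() against disk_budget_gb(with_xl=True, with_small_lm=True) after the first generation shows whether a download stopped partway.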

Why does the default macOS launcher OOM on my M1 Max?

The default start_gradio_ui_macos.sh loads acestep-5Hz-lm-1.7B, which is sized for a 24 GB discrete GPU. On Apple’s unified memory, the MLX path plus the PyTorch fallback peak around 42 GiB during initialization. Edit the script’s LM_MODEL_PATH line to acestep-5Hz-lm-0.6B and the install runs clean on any 16+ GB M-series Mac.

Can I sell music made with ACE-Step?

Yes. ACE-Step’s weights are released under Apache 2.0, and the generated audio carries no per-track license restriction from the model authors. You own what you generate. Note that platform-side rules still apply — DistroKid, Spotify, and YouTube each have their own AI-music disclosure policies independent of the model license.

What’s the difference between the turbo, base, and SFT models?

Turbo is distilled for speed (~8 inference steps, no CFG). Base is the foundation DiT — slower, accepts CFG guidance scale. SFT is the supervised-fine-tuned variant — slower still, but the highest fidelity for final keepers. On a Mac, start with turbo for iteration and only switch to SFT once you have a prompt worth the extra wait. The XL variants (4B parameters) need ≥16 GB unified memory to run without paging.