
Programming languages were invented for humans. I wanted to see if AI still needs them.

Thoughts and ideas expressed here are my own and do not reflect the views of my employer.


Here's a thought that's been sitting with me for a while:

Every programming language, every compiler, every build system — all of it exists because humans need to read and write source code. The machine doesn't care about Python or C or Rust. It cares about bytes. We invented the entire toolchain as a bridge between human intent and machine execution.

LLMs don't need that bridge.

An LLM doesn't need readable syntax. It doesn't need a language that's easy to reason about. It just needs a way to express intent at whatever level of abstraction gets the job done. And the job — ultimately — is bytes that run on a processor.

So I asked: how feasible is it today to go from natural language directly to a native binary, skipping the high-level compiler entirely?

BinaryVibes is my answer to that question. Less a finished product, more a working experiment.


The test

I wanted to know two things:

  1. Can an LLM reliably generate working x86_64 assembly for non-trivial programs — not just hello world, but real things like HTTP requests, file I/O, and GUI dialogs?
  2. Where does it break down, and what does it need to succeed?

The pipeline I built:

Your description in plain English
  → LLM outputs x86_64 assembly
  → Keystone assembler converts to machine code bytes
  → custom PE/ELF/Mach-O builder wires up OS imports
  → native executable

No C. No Python. No build system. The LLM skips straight to assembly, and the rest is mechanical.
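The last two steps are more mechanical than they sound. As a minimal sketch of the ELF path (not the project's actual builder): take machine-code bytes — hand-assembled below for illustration, where the real pipeline gets them from Keystone — and wrap them in an ELF header so the kernel will load them.

```python
import os
import struct

# Machine code for exit(42), hand-assembled for illustration
# (the real pipeline gets these bytes from Keystone):
#   mov eax, 60   ; Linux sys_exit
#   mov edi, 42   ; exit status
#   syscall
code = bytes.fromhex("b83c000000" "bf2a000000" "0f05")

base = 0x400000               # virtual load address
entry = base + 64 + 56        # code starts right after the two headers

# ELF64 file header: 64-bit, little-endian, x86_64 executable,
# one program header, no section headers.
ehdr = struct.pack(
    "<16sHHIQQQIHHHHHH",
    b"\x7fELF\x02\x01\x01" + b"\x00" * 9,  # magic + identification
    2, 0x3E, 1,                            # ET_EXEC, EM_X86_64, EV_CURRENT
    entry,                                 # e_entry
    64, 0, 0,                              # e_phoff, e_shoff, e_flags
    64, 56, 1, 0, 0, 0,                    # header sizes and counts
)

# One PT_LOAD segment mapping the whole file read+execute.
filesz = 64 + 56 + len(code)
phdr = struct.pack("<IIQQQQQQ", 1, 5, 0, base, base, filesz, filesz, 0x1000)

with open("tiny", "wb") as f:
    f.write(ehdr + phdr + code)
os.chmod("tiny", 0o755)
```

On an x86_64 Linux box, `./tiny` runs and exits with status 42 — a 132-byte executable, no libc, no linker. The PE and Mach-O builders are the same idea with more bookkeeping, mostly around import tables.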


What I found

The LLM knows what to do. It struggles with how.

For anything involving Windows API calls — HTTP, file handles, console output — the LLM consistently gets the logic right and the mechanics wrong. Wrong register state after an API call. Stack not 16-byte aligned before a function. That kind of thing. Subtle, hard to debug, completely reproducible.

The fix was pre-baking 14 helper routines — tested, correct assembly for the common operations — and telling the LLM to call them by name instead of implementing them from scratch. __bv_http_get. __bv_print_str. __bv_msgbox. Once those existed, reliability jumped dramatically.
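The wiring for that is mostly string plumbing. A hedged sketch of the idea — the routine bodies and linking logic below are illustrative stand-ins (Linux syscalls rather than the real Windows API helpers), not the project's actual table:

```python
# Table of pre-verified helper routines. These two are simplified Linux
# examples; the real helpers wrap Windows API calls and handle the
# register-saving and stack-alignment mechanics the LLM gets wrong.
HELPERS = {
    "__bv_exit": (
        "__bv_exit:\n"
        "    mov eax, 60        ; sys_exit, status already in edi\n"
        "    syscall\n"
    ),
    "__bv_print_str": (
        "__bv_print_str:\n"
        "    mov eax, 1         ; sys_write(rdi=fd, rsi=buf, rdx=len)\n"
        "    syscall\n"
        "    ret\n"
    ),
}

def link_helpers(llm_asm: str) -> str:
    """Append the body of every helper the generated code references."""
    used = [name for name in HELPERS if name in llm_asm]
    return llm_asm + "\n" + "\n".join(HELPERS[n] for n in used)
```

The prompt then only has to say "call `__bv_print_str` to print" instead of explaining the calling convention, and the assembled output always contains a known-good implementation.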

The self-correcting loop mattered more than I expected.

Assembly errors are precise — Keystone tells you exactly which line is wrong. I feed that back to the LLM and it fixes it almost every time. Runtime crashes are harder, but sending the exit code and stdout back as context works surprisingly well too. The LLM sees what it generated, sees how it failed, and produces something better.
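The loop itself is small. A sketch under stated assumptions — `generate` and `assemble` stand in for the LLM call and Keystone, and the error format is invented, not the tool's real output:

```python
class AsmError(Exception):
    """Carries the assembler's line number and message back to the LLM."""
    def __init__(self, line: int, msg: str):
        super().__init__(f"line {line}: {msg}")
        self.line, self.msg = line, msg

def build_with_feedback(prompt, generate, assemble, max_attempts=3):
    """generate(prompt, feedback) -> asm text; assemble(asm) -> bytes."""
    feedback = None
    for _ in range(max_attempts):
        asm = generate(prompt, feedback)
        try:
            return assemble(asm)
        except AsmError as err:
            # Show the model exactly what it wrote and where it broke.
            feedback = f"{err}\n--- previous attempt ---\n{asm}"
    raise RuntimeError(f"no valid assembly after {max_attempts} attempts")

# Stub "LLM" that fixes its mistake once it sees the error:
def fake_generate(prompt, feedback):
    return "mov rax, 60" if feedback else "mov rax, oops"

def fake_assemble(asm):
    if "oops" in asm:
        raise AsmError(1, "unknown token 'oops'")
    return b"\x48\xc7\xc0\x3c\x00\x00\x00"  # encoding of mov rax, 60

machine_code = build_with_feedback("exit cleanly", fake_generate, fake_assemble)
```

The runtime-crash case is the same loop with a different `feedback` string: exit code and captured stdout instead of an assembler message.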

For programs in the 2–4KB range, it works.


What it can build today

All of these work on the first attempt:

  • "print hello world" → 2KB console app
  • "count down from 5 to 1 with 1 second pauses" → 3KB timer
  • "read USERNAME env var and print it" → 3KB env reader
  • "open input.txt and print its contents" → 3KB file reader
  • "copy source.txt to dest.txt, show a dialog when done" → 3KB GUI util
  • "fetch weather for Seattle from wttr.in and print it" → 4KB HTTP app
  • "fetch weather for Seattle, London, and Tokyo and print each" → 4KB multi-city
  • "fetch weather, write a styled HTML page, open it in the browser" → 4KB dashboard

These are real, standalone native executables. No runtime. No dependencies. Send them to anyone.


Try it yourself

git clone https://github.com/bryhaw/BinaryVibes
cd BinaryVibes
pip install -e ".[dev]"

# Uses your existing GitHub Copilot subscription — no separate API key
gh auth login

bv build "fetch weather for Seattle and print it" -O weather.exe
.\weather.exe

Add --run-verify to have it run the binary and self-correct on crashes:

bv build "show computer name and process ID" -O sysinfo.exe --run-verify

Cross-compile with one flag:

bv build "hello world" --format elf   -O hello    # Linux
bv build "hello world" --format macho -O hello    # macOS

More examples live in examples/ in the repo.


Where this goes

This is a capability test, not a production tool. But what it tells me is that the feasibility question has a real answer: yes, LLMs can do this today, with guardrails. The helpers, the feedback loop, the format builders — those are the guardrails.

What's missing is scale. Right now it handles programs in the 2–4KB range with a fixed set of pre-wired API imports. Expanding that surface area — more APIs, ARM64, richer intent parsing — is the obvious next step.

But the core thesis holds. The bridge we built for humans — source code, compilers, build systems — is not a permanent feature of how software gets made. It's an artifact of who was doing the writing.

GitHub →