GCN Assembler for AMD GPUs

An assembler/compiler for AMD's Graphics Core Next assembly language

by Ryan Scott White — Feb 17, 2015 (updated April 2016)

71,916 views on CodeProject(2015-2023)

Introduction

This GCN GPU assembly compiler converts human-readable assembly into binary machine code for AMD GCN GPUs. Assembly is a readable abstraction of machine code; most assembly statements map directly to single hardware instructions. Performance-critical software or code that needs special hardware features often uses assembly. A typical example is Bitcoin mining: the workload is compute-bound and benefits from small gains, and assembly can access instructions and options that are otherwise unavailable. Assembly also lets you combine instructions creatively. The trade-offs are maintainability and portability; see Disadvantages section below.

The project includes three Visual Studio projects that wrap one another:

Asm4GCN Assembler — converts assembly statement blocks to binary; manages variables, labels, and registers; reports warnings and errors.

OpenCLwithGCN — injects assembled binaries into dummy OpenCL kernels and runs the program.

Asm4GCNGUI — a Windows editor with syntax highlighting and code completion, useful for experimenting and for demonstrating OpenCLwithGCN.

Background

On NVIDIA, CUDA supports inline PTX via asm, which provides a PTX-level "almost assembly" experience. In OpenCL for AMD, there is no straightforward way to inline GCN assembly.

The ideal would be an asm function in OpenCL. Some users have partially made this work, but without variable in/out passing it is not useful, and register clobbering is a risk.

I pursued inline assembly by generating a dummy kernel of the right size and capturing used registers. Reliability was poor, so I pivoted: a cl::program can mix OpenCL __kernel and GCN __asm4GCN kernels. Inline assembly infrastructure remains half-built; capturing registers for dynamic dummy kernels is still the blocker.

A related Windows app is HetPas by Realhet. It produces ELF images loadable into OpenCL and runs GCN assembly kernels with Pascal-like host code. It works well; I struggled with Pascal, which motivated this project.

Asm4GCN Assembler Sub-Project

The solution has three projects: Asm4GCN (assembler), OpenCLwithGCN, and Asm4GcnGUI. The Asm4GCN core converts assembly to raw binary.

Example: s_mov_b32 v1, s0 → 7E020200

Beyond opcode encoding, the assembler handles labels, jumps, variables, and register management. The following flow abstracts the core passes (simplified):

Assembly Flow - Pass 1

1. Input a block of GCN assembly.

2. Iterate line by line.

3. Strip whitespace and comments.

4. Record label headers like myLabel:.

5. Process #s_pool, #v_pool, and #define.

6. Apply #define replacements.

7. Split multiple statements by ;.

8. For each statement on the line:

9. Tokenize: V_Mov v2, v4 → {"V_Mov","v2","v4"}.

10. Handle variable allocate/free.

11. Substitute variable names with assigned registers.

12. Parse literals: hex, bin, octal, scientific.

13. If a label is referenced, defer final encoding.

14. Encode instruction to binary if fully resolved.

15. Append to the instruction list.

Assembly Flow - Pass 2 (Resolve unknown instruction sizes)

16. Collect instructions with unresolved size.

17. Iterate them.

18. If min/max jump distances agree, size is fixed.

19. Set byte size accordingly.

20. If even the farthest jump fits 4 bytes, fix to 4.

21. Mark as 4-byte op.

22. Repeat until all sizes are resolved.

23. Encode remaining label-based instructions.

24. Concatenate 4- and 8-byte ops into one byte array.

25. Return the binary and metadata (register usage, etc.).

Labels

Branches may encode to 32 or 64 bits depending on jump distance. Distances depend on sizes of intervening instructions, but those sizes depend on whether intervening branches are 32 or 64 bits. To break the cycle, compute lower and upper bounds: assume unresolved ops are 32-bit for a shortest path, and 64-bit for a longest path. Whenever both bounds agree on 32 or 64 bits, lock that size. Iterate over unresolved branches until all sizes stabilize, then substitute concrete label distances.

A simpler but less efficient fallback is always using the 64-bit form when unsure.

Label features:

Labels can stand anywhere constants can. s_mov s2, MyLabel moves the byte distance to MyLabel into s2. Beware instructions that cannot take two constants (e.g., s_sub s4, MyLabelA, MyLabelB).

Duplicate label names are allowed; the nearest matching label is used. This eases copy-paste, but a warning is emitted to catch unintended jumps.

Variables

Assembly beyond ~20 lines quickly becomes hard to track. Variables solve two problems: naming intent and safe register reuse.

Example:

v4i myVar1, myVar2, mySum;

                        v_mov_b32 myVar1, 10;

                        v_mov_b32 myVar2, 20;

                        v_add_i32 mySum, myVar1, myVar2

                        free myVar1, myVar2

The variable system has three parts: declaration, use, and freeing.

Declaring a Variable

Use short 3-character types: first char is space (S=scalar, V=vector), middle is size in bytes, last is intended data kind (F/I/U/B). On declaration the allocator reserves the first free register(s) from the allowed pool.

Examples:

v4i _my_int_vector;

                        v8u myUnsignedLongVector;

                        s8f myDoubleScalar1, myDoubleScalar2;  // multiple per line

                        s4u myAddrForcedToS10 s10;             // force s10 (e.g., incoming params)

                        v8f myDoubleForcedTo2 v[2:3];          // force v[2:3] or just v2

First character:	Description
`s` or `v`	selects the register file.

Size (middle, updated):

Byte size controls how many consecutive registers are reserved and their alignment. Sizes 1/2/4 still occupy one dword register; 8 uses two consecutive registers, aligned by 2; 16 uses four, aligned by 4; 32 uses eight, aligned by 4.

Size	Regs used	Alignment
1 byte	1	any
2 byte	1	any
4 byte	1	any
8 byte	2	begin on reg divisible by 2
16 byte	4	begin on reg divisible by 4
32 byte	8	begin on reg divisible by 4

Data kind (last):

F float, I signed int, U unsigned, B bits (catch-all). Currently informational; future checks could warn on mismatched ops.

Forcing Register Numbers

Two ways to pin a declaration to specific registers:

1. Fixed register: v4u myLaneID v0

2. Copy from an earlier variable (optionally with an index): s4u myNewVar myPast16SizedVar[3]

Use fixed registers early in kernels to avoid allocator conflicts. Reusing a past variable's register lets you "rename" without a move, reduces peak registers, and avoids extra instructions.

Inline Variable Declarations

You can declare inline at the point of first use:

Without inline:

v4u vLocalSize

                        v_mov_b32     vLocalSize, localSize

                        v_mul_i32_i24 vLocalSize, groupId, vLocalSize 

                        v4u localSizeIdx

                        v_add_i32     localSizeIdx, vcc, laneId, vLocalSize 

                        v4u vGlobalID

                        v_add_i32     vGlobalID, vcc, baseGlobalId, localSizeIdx

                        v4u vGlobalOffset

                        v_lshlrev_b32 vGlobalOffset, 2, vGlobalID

With inline:

v_mov_b32     v4u vLocalSize, localSize

                        v_mul_i32_i24     vLocalSize, groupId, vLocalSize

                        v_add_i32     v4u localSizeIdx, vcc, laneId, vLocalSize 

                        v_add_i32     v4u vGlobalID, vcc, baseGlobalId, localSizeIdx

                        v_lshlrev_b32 v4u vGlobalOffset, 2, vGlobalID

Since variables auto-free at last use, the allocator can recycle within the same instruction. Example: the register backing localSizeIdx can be reused for vGlobalID on the same line if lifetimes allow.

Using a Variable

Use variable names instead of raw registers. The assembler looks up the binding and substitutes the concrete s/v register(s):
v_add_i32 v3, v4, myInt; → v_add_i32 v3, v4, v7.

Variable Indexing

For multi-register variables, access specific lanes with [index]. Example: adding two 64-bit values:

v8i myInt1, myInt2

                        ... // assign values

                        v_add_i32 myInt1[0], vcc, myInt1[0], myInt2[0]  // low 32

                        v_add_i32 myInt1[1], vcc, myInt1[1], myInt2[1]  // high 32

Free a Variable

Prefer automatic freeing for cleaner code and better register pressure.

Automatic:

A variable is freed at its last use. Lifetimes are tracked, and same-instruction recycling is possible.

Manual:

Use free name1, name2 to extend or end lifetimes explicitly.

v4u myVar1, myVar2

                        ...

                        v_add_u32 myVar1, myVar1, myVar2   // myVar2 last used → auto-freed here

                        ...

                        free myVar1                        // explicitly free later

Manual freeing helps when:

Using GPR indexing (v_movrels / v_movreld), which does not reference the register directly and would otherwise look unused.

You jump back to earlier code that expects a variable still live. Manually delaying free preserves the binding.

Flexible Constants

Instructions that support inline literals accept many forms: decimal, hex, octal, binary, scientific, and labels (distance in bytes).

Examples:

s_add_i32    s3, s4, 12

                        s_add_i32    s3, s4, -12

                        s_min_u32    s5, s6, 0xabcd

                        s_min_u32    s5, s6, 10e2

                        s_min_u32    s5, s6, -10e2

                        s_mov_b32    s4, 2.

                        s_mov_b32    s4, -20.0

                        s_mov_b32    s4, .5

                        s_mov_b32    s4, -.5

                        s_mov_b32    s4, 343.432

                        s_mov_b32    s4, 3.4e4

                        s_mov_b32    s4, -34.4e-4

                        s_mov_b32    s4, 0o7654

                        s_mov_b32    s4, 0b0011111111

#define Support

C-style #define is supported via textual substitution. Use distinctive names (often braced with underscores) to avoid accidental matches. Macro parameters are supported.

#define _hw3_ Hello World!

                        #define _hw1_(opt0) Hello opt0 World!

                        #define _hw2_(opt0,opt1) Hello opt0 World from opt1!

#s_pool and #v_pool

Constrain the allocator to specific registers with #S_POOL and #V_POOL. Place near the top for clarity.

#S_POOL s22, s23, s24, s27, s29, s30, s31, s33, s34, s35, s36, s37

                        #V_POOL v11, v12, v13, v14, v15, v17, v19, v20, v21, v23, v24, v25

Multiple Statements per Line and Semicolons

Most instructions and commands support multiple statements per line, separated by ;. #v_pool, #s_pool, and #define must be on their own lines.

s8u myScalar; v8u myVector; v_add_i32 myVector, myVector, myScalar   // ok

                        v_add_i32 v0, v1, v2                                                 // ok

                        v_add_i32 v0, v1, v2;                                                // ok

                        v_add_i32 v0, v1, v2; #define _myDef_ 12345678                       // not ok

Asm4GCN Project Files

Approximate line counts in parentheses:

GcnISA.cs (1756) — ISA data: instruction tables, register aliases, enums.

Encoder.cs (1404) — Static GcnParcer class; one method per encoding format; converts a statement to opcode bits.

GcnBlock.cs (679) — GcnBlock class; converts a block to byte[].

DataStructs.cs (50) — GcnStmt, Define, AsmVar, etc.

Labels.cs (109) — Label tracking and jump distances.

ParseOperand.cs (346) — Parses operands and validates datatypes; hex/octal/bin to constants.

Program.cs (208) — Command-line entry; library use preferred.

RegPool.cs (356) — Register allocator and pools.

RegUsage.cs (111) — Usage counters and maximums; helps estimate inline needs.

Log.cs (122) — Logging to StringBuilder or console.

Tools.cs (76) — Small helpers like IsBetween().

TestInput.txt (166) — Examples and tests.

Variables.cs (409) — Variable handling; owns RegPool and RegUsage.

Point of Interest

Inline Assembly

A possible approach:

1. Locate and extract inline Asm4GCN blocks.

2. Assemble them in "usage only" mode to get byte size and s/v register counts at max-usage points.

3. Replace inline text in the OpenCL kernel with generated dummy OpenCL that matches byte size, register usage, and params, fenced with barriers to prevent reordering.

4. Compile the dummy kernel; record which registers the driver used.

5. Re-assemble the inline block, now constraining allocation via #S_POOL and #V_POOL to those registers.

6. Find and patch the dummy binary in the program binary with the final assembly.

Status: (3) partially works (FillerKernelAttempts.cl: DummyFillerCode()), but sizing and counts are imperfect. (6) not finished; the dynamic dummy's starting bytes were hard to locate reliably.

Smart Register Packing/Reservations

Naive "first free" allocation wastes space. Think of lifetimes as rectangles: width is register width (1,2,4,8,… regs), height is lifetime in instructions. The goal is to place rectangles to minimize total width (peak registers), respecting alignment (2-wide on even starts; 4-wide on multiples of 4).

                        R0  R1  R2  R3  R4  R5  R6  R7  R8  R9

                        Inst 1  A   A

                        Inst 2  A   A   B       C   C   C   C

                        Inst 3  A   A   B       C   C   C   C   D   D

                        Inst 4              B       C   C   C   C   D   D

                        Inst 5              B                               D   D

                        Inst 6  E       B           F   F

                        Inst 7  E                       F   F

                        Inst 8  E       G   G       F   F

Differences from classic rectangle/bin packing: lifetimes are mostly fixed vertically; widths are powers of two; and alignment constraints apply. A simple greedy heuristic works well in practice: place each block into the smallest gap it fits (perfect fits preferred), scored by wasted space. Asm4GCN uses this style of heuristic rather than pure "first fit."

OpenCLwithGCN Sub-Project

OpenCLwithGCN makes Asm4GCN usable from C#. It replaces Asm4GCN kernels with dummy OpenCL kernels, assembles, then patches the dummy binary with the Asm4GCN binary. OpenCL __kernel and Asm4GCN __asm4GCN kernels can coexist in one cl::program.

Features of OpenCLwithGCN

Built-in Text Template Engine

Assembly benefits from programmable text (e.g., unrolling). A simple C#-based template engine runs before assembly. Use [[ ... ]] blocks; print values with [[=expr]]. More details: CodeProject Template Engine Article

Example

                        [[for(int i=3; i<7; i++) {]]

                          v_mov_b32 v[[=i]], v[[=i+4]]

                        [[}]]

Expands to:

v_mov_b32 v3, v7

                        v_mov_b32 v4, v8

                        v_mov_b32 v5, v9

Mixed OpenCL and Asm4GCN kernels in one cl::program

Use both __kernel and __asm4GCN in the same program. Streams let you compose them. Note: early versions supported only a single __asm4GCN; see Limitations section.

Using OpenCLwithGCN

From Example1.cs:

Compose the source:

string source = 
                            @"

                        __asm4GCN myAsmFunc ( float*, float* )

                        {

                          #define _32Float_    0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT]

                          s_buffer_load_dword  s0, s[4:7], 0x04

                          s_buffer_load_dword  s1, s[4:7], 0x18

                          s_waitcnt            lgkmcnt(0)

                          s_min_u32            s0, s0, 0x0000ffff

                          s_buffer_load_dword  s4, s[8:11], 0x00

                          v_mov_b32            v1, s0

                          v_mul_i32_i24        v1, s12, v1

                          v_add_i32            v0, vcc, v0, v1

                          v_add_i32            v0, vcc, s1, v0

                          v_lshlrev_b32        v0, 2, v0

                          s_load_dwordx4       s[12:15], s[2:3], 0x60

                          s_waitcnt            lgkmcnt(0)

                          v_add_i32            v1, vcc, s4, v0

                          tbuffer_load_format_x v1, v1, s[12:15], _32Float_

                          s_buffer_load_dword  s0, s[8:11], 0x04

                          s_load_dwordx4       s[4:7], s[2:3], 0x68

                          s_waitcnt            lgkmcnt(0)

                          v_add_i32            v0, vcc, s0, v0

                          s_waitcnt            vmcnt(0)

                          v_add_f32            v1, v1, v1

                          tbuffer_store_format_x v1, v0, s[4:7], _32Float_

                          s_endpgm

                        };

                        __kernel void myOpenClFunc ( __global float* cl_input, __global float* cl_output )

                        {

                          size_t i = get_global_id(0);

                          cl_output[i] = cl_input[i] + cl_input[i];

                        }; ";

Compile and keep the default environment:

OpenClWithGCN gprog = new OpenClWithGCN();

                        OpenClEnvironment env = gprog.env;

                        bool success = gprog.GcnCompile(source, out log);

Create the kernel:

Kernel kernel = env.program.CreateKernel("myAsmFunc");

Allocate buffers and fill data:

Mem cl_input = env.context.CreateBuffer(MemoryFlags.ReadOnly, dataSz);

                        Mem cl_output = env.context.CreateBuffer(MemoryFlags.WriteOnly, dataSz);

                        // random data

                        var random = new Random();

                        const int count = 1024 * 1024;

                        const int dataSz = count * sizeof(float);

                        float[] data = (from i in Enumerable.Range(0, count)

                                       select (float)random.NextDouble()).ToArray();

                        env.cmdQueue.EnqueueWriteBuffer(cl_input, true, 0, dataSz, data);

Set args and enqueue:

                        kernel.Arguments[0].SetValue(cl_input);

                        kernel.Arguments[1].SetValue(cl_output);

                        env.cmdQueue.EnqeueNDRangeKernel(kernel, count, 256);

                        env.cmdQueue.Finish();

                        float[] results = new float[count];

                        env.cmdQueue.EnqueueReadBufferAndWait(cl_output, results, dataSz);

OpenCLwithGCN Project Files

OpenClEnvironment.cs (97) — OpenCL environment classes.

OpenClWithGCN.cs (615) — Core logic.

TextTemplate.cs (105) — Single static Expand() for template transformation.

Asm4GcnGUI Windows Interface Sub-Project

A small GUI with syntax highlighting, code completion, and an output panel. Useful for quick kernel tests and teaching. The three panes are: C# host code, GCN assembly code, and compiler output.

The GUI keeps GCN assembly in a separate tab for highlighting. Internally, that text is emitted into a .cs file as:

namespace GCN_NS {

                          static class Code {

                            public const string DevCode = "...";

                          }

                        }

The output panel shows assembler messages first, then any C# compile errors. Program output goes to a console window.

Features of the GUI Interface

Directly runnable in Visual Studio — Saves files in a VS-friendly layout. Add references and build.

Syntax highlighting — Using Pavel Torgashov's FastColoredTextBox; helps readability and catches typos.

Code completion — Assembly autocompletion for faster authoring.

Separate host and device panes — Clean separation for language-specific highlighting and organization.

General Topics on GPU Assembly

This section covers GPU assembly broadly.

The Good & the Bad of Programming in GPU Assembly

Advantages

Potentially faster and smaller kernels in expert hands; 2–4× speedups are plausible for hot spots.

Access to instructions/options and special registers unavailable at higher levels.

Understanding ISA details improves OpenCL/CUDA coding.

Humans can sometimes pack data/control more cleverly than a compiler.

Disadvantages

Slower to write than OpenCL/CUDA.

Harder to maintain and read at scale.

Tied to GPU generations; driver updates can break binaries depending on how they're loaded.

OpenCL kernels are portable; assembly is GCN-specific.

More bug-prone; compilers avoid common pitfalls.

Beating a modern compiler on complex kernels is not guaranteed.

Rule of thumb: keep assembly small and surgical. 1–50 lines for critical paths is manageable; compilers regain advantage as complexity grows.

Preload Register Values

Some registers are preloaded when a GCN kernel is launched. These may be driver-dependent.

Reg	Name
s[2:3]	UAV Table Pointer
s[2:3]+0x60	base_resource_const1 (#T)
s[2:3]+0x68	base_resource_const2 (#T)
s[4:7]	Imm Const Buffer 0
s[4:7]+0x00	Grid Size
s[4:7]+0x04	Local Size
s[4:7]+0x18	Base Global ID
s[8:11]	Imm Const Buffer 1
s[8:11]+0x00	param1 offset
s[8:11]+0x04	param2 offset
s[8:11]+0x08	param3 offset
s12	Group ID
v0	Local ID

GCN Assembly Code Writing Tips

Plan ahead. Prototype in a higher-level language first to validate logic and shape the assembly.

Know the ISA. Study AMD's GCN ISA manual for helpful instructions and options.

Unroll to reduce branches. Fewer jumps often means fewer stalls.

Fit in I-cache. Keep kernels within the CU's shared instruction cache (~32 KB, ≈4–8k instructions).

Minimize registers. More live registers reduce occupancy. Move computations earlier or later to shorten lifetimes.

Practice. Explore, iterate, and read community posts and articles.

Original lifetimes: A,B created at 10, first used at 20 (C=A+B), C used at 30
Usage = (2 vars * (20-10)) + (1 var * (30-20)) = 30
Move C=A+B to 11 → (2*(11-10)) + (1*(30-11)) = 21 → one fewer live register from 11–20.

Limitations

No support for 3rd-gen "Volcanic Islands" GPUs (Radeon R9 280, Fury, Nano).

Sensitive to some driver versions (see System Requirements below).

Only AMD GCN 1.0/1.1 GPUs.

Future Wish List

Allow reuse of the same variable name safely.

Support GCN generation 3.

More OpenCL 2.0 support.

Broader driver compatibility.

Inline assembly via robust dummy kernel generation.

Friendlier infix syntax, e.g., varA = varB * varC mapped to the right opcode via types.

Type-aware warnings (I/U/F/B) for mismatched ops.

Videos

Tutorial video(s) are available; some parts may be out of date.

YouTube search for Asm4GCN videos

System Requirements

Download notes: This project requires specific hardware. See System Requirements section below.

AMD GPU with GCN 1.0 or 1.1. 3rd-gen (Volcanic Islands: Radeon R9 280, Fury, Nano) not supported.

Supported AMD drivers: 13.251, 14.501, 15.200, 15.201, 15.300, 16.150. Others may not work.

History

Other GCN Assemblers

cmingcnasm — Minimal GCN assembler in C for GCC/Linux by Sylvain Bertrand. Links: GitHub, GoogleCode

gcnasm — Open-source GCN assembler in C by Daniel Bali. Efficient low-level C, very fast. Links: GitHub, OpenWall

HetPas Assembler (Windows) — By Realhet. Full, feature-rich Windows assembler with Pascal-like host code. Requires disabling Data Execution Prevention; see instructions on the site. Updated frequently; now includes variables. Website

A Special Thanks To

AMD for publishing GCN ISA manuals.

Daniel Bali for an excellent open-source GCN assembler that informed this work.

Derek Gerstmann for a clear OpenCL example; adapted to C# with NOpenCL as the GUI default.

Pavel Torgashov for FastColoredTextBox and Autocomplete menu.

Realhet for HetPas and many insightful posts on the AMD forums and WordPress.

Tunnel Vision Laboratories for the NOpenCL wrapper by Sam Harwell.