CudaPAD - CUDA PTX/SASS Assembly Viewer

What is CudaPAD?

Paste your CUDA C++ code in the left window and cleaned-up assembly will show up on the right side. Whenever the code in the source window changes, CudaPAD automatically starts re-compiling (with an indicator) and displays the updated assembly on the right. In the top menu, the compiler options can be tinkered with to see how they impact the assembly.

It also has some visual helpers. It does a "diff" on the assembly with each change so the user can see exactly what changed. Matching the source code in the left window to the assembly on the right can be a daunting task, so I added little visual lines that connect the two. Colorful highlighting makes the code and assembly easier to read. There is also register highlighting: click on a register and every instance of it is highlighted in the assembly window. Finally, raw assembly is pretty messy, so CudaPAD strips out the junk for you.

It also has some tools. Clicking on a compile error shows right where the issue is in the source code, and right-clicking an error does a Google search on it.

Introduction

What is PTX or SASS anyway? nVidia's PTX is an intermediate language for nVidia GPUs. It is close to pure GPU assembly (SASS) but slightly abstracted. PTX is less tied to specific hardware or a hardware generation, which makes it more useful than assembly in most cases. One item it abstracts is physical register numbers, which makes it easier to use than raw assembly. Each PTX instruction is usually translated into one or more actual SASS hardware instructions. SASS is hardcore assembly. It is what the GPU actually runs and is directly translated into machine code. Viewing SASS code is more difficult but it does show exactly what the GPU will do. As mentioned, SASS works with physical registers directly, so there is more control over where values are stored, but it is one more item the programmer needs to keep track of, which makes SASS more difficult to work with.

Often when programming in CUDA, there is a need to view what a kernel's PTX/SASS might look like and CudaPAD helps with this. There might be a need to view PTX/SASS for debugging, understanding what's happening, to squeeze a little more performance out of a kernel, or just for curiosity. To use the application, simply type or paste a kernel in the left panel and then the right panel will display the corresponding disassembly information. Visual informational aids like visual CUDA-to-PTX code matching lines, PTX cleanup, WinDiff, and quick register highlighting are built-in to help make the PTX easy to follow. Other on-the-fly information is also displayed like register counts, memory usage, and error information.

With any piece of code, there are often several ways to perform the same thing. Sometimes, just modifying a line or two will lead to different machine instructions with better register and memory usage. Have fun and make some changes to a kernel in the left window and watch how the PTX/SASS changes on the right.

Just as a quick note: CudaPAD does not run any code. CudaPAD is only for viewing PTX, SASS, and register/memory usage.

Background

Like most of my projects, this one grew out of a personal need. For some algorithms I develop, GPU efficiency is essential. One way to help with this is by understanding the low-level mechanics and making any necessary adjustments. Before creating this app, I would often get into a loop where I would write a performance-critical kernel and then view the PTX/SASS over and over using command line tools. Doing this repetitively was time-consuming, so I decided to build a quick C# app that would automate the process.

It started out as a simple app that would take a kernel in the left window and then output the PTX to the right side window. This was accomplished by basically running the same command line tools as before, mainly nvcc.exe, but now in an automated fashion in the background. I got carried away however and within a short period of time, I started adding several features including automatic re-compiling, WinDiff, visual code line markers, compile errors, and register/memory usage.

AMD used to have a similar tool for Brook+, and this gave me the idea of the two-window app back in 2009 when I first built it. The tool had a left window where a Brook+ kernel could be entered and a right window where the assembly was output. A button could be clicked to update the output window. AMD has had a couple of these tools over the years, but they have since been replaced by AMD's CodeXL.

AMD's CodeXL and nVidia's NSight have since replaced many tools like these; however, CudaPAD still has its place for quick, on-the-fly viewing of low-level assembly and experimentation. Both CodeXL and NSight are professional-grade free tools and are a must-have for GPU developers.

Using CudaPAD

Requirements

CudaPAD is simple to use. But before running it, make sure these system requirements are met:

Visual Studio 2017/2019 (Express/Community editions are okay)

nVidia's CUDA 10

A dedicated GPU is not required since we are only compiling code and not running anything.

If the requirements are met, then simply launch the executable. When CudaPAD loads, it will have a sample kernel. The sample provides a quick place to start playing around or even a starting framework for a new kernel. Whenever the kernel on the left is edited, it will update the PTX or SASS on the right. If there is a compile error, it will show that near the bottom.

There are several features that can be enabled/disabled. All are on by default (also see Features section).

PTX/SASS View Modes

Change the drop-down textbox between PTX, SASS or SOURCE views.

PTX view

Shows the PTX intermediate language output of the kernel. PTX is close to SASS hardware instructions but is slightly higher level and less tied to a particular GPU generation. Usually a PTX instruction translates directly to one SASS instruction; however, sometimes a single PTX instruction becomes multiple SASS instructions.

SASS view

These are true assembly instructions, the kind that are executed directly on the GPU. Less visual information is supplied when viewing SASS than when viewing PTX. For example, the visual code lines are not shown.

Raw code view

This view is mostly for debugging CudaPAD itself. Behind the covers, this app does not re-compile after every change. It only re-compiles when the code itself is modified, not when only comments or whitespace change. The raw code is a stripped-down version of the real code with those items removed. I added this because I did not want it to keep compiling while I was adding/editing comments or adding/removing whitespace. That would not be resource friendly and would also throw off the WinDiff feature.
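As a rough illustration of this idea (not CudaPAD's actual C# code), here is a minimal Python sketch that strips comments and blank lines and then compares the result against the previous version. The names normalize_source and needs_recompile are hypothetical. It is deliberately conservative: spacing inside a line is left alone, so a change there still counts as a code change.

```python
import re

# Matches string/char literals (kept as-is) or comments (dropped).
_TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'      # double-quoted string literal
    r"|'(?:\\.|[^'\\])*'"     # character literal
    r"|//[^\n]*"              # line comment
    r"|/\*.*?\*/",            # block comment
    re.DOTALL,
)

def normalize_source(text):
    """Return the source with comments, indentation and blank lines removed."""
    def drop_comments(match):
        token = match.group(0)
        # Comments start with '/'; literals start with a quote and are kept.
        return " " if token.startswith("/") else token

    without_comments = _TOKEN.sub(drop_comments, text)
    lines = (line.strip() for line in without_comments.splitlines())
    return "\n".join(line for line in lines if line)

def needs_recompile(previous, current):
    """True only when something other than comments/blank space changed."""
    return normalize_source(previous) != normalize_source(current)
```

With this in place, adding a comment or a blank line leaves the normalized text identical, so no compile is triggered, while changing a constant does trigger one.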

In the background, CudaPAD simply compiles the kernels with the CUDA tools. The CUDA compiler (nvcc), in turn, calls a host C++ compiler such as the one that ships with Visual Studio. So to run CudaPAD, CUDA needs to be installed along with, most likely, a C++ compiler like Visual Studio.

Enabling/Disabling Features

Disabling the auto-compile is useful for making multiple changes before a compile. This can help show the changes in the diff (differencing) output over several changes. To do a manual compile, just click the green 'start' in the top right corner.

Features

Visual Code Lines

These lines match up the CUDA source code to the PTX output. They help the programmer quickly identify what CUDA code matches up with what PTX. This function can be enabled or disabled by clicking the lines icon in the top of the PTX window.

Auto Assembly Refresh

When needed, the application automatically re-generates the PTX code. It does not do this on every text change in the source window, only when something that affects the output changes. Items that do not impact the output, such as comments and whitespace, are stripped from the source text before it is compared. The Auto Update function can be enabled or disabled by clicking the auto update icon at the top of the PTX window.

Tip: When Auto PTX is turned off, a recompile can be forced by clicking "Ready".

Built-in Diff Utility

Each time the PTX output changes, a differencing algorithm runs automatically and marks what changed. I decided to put the diff notes inside comments so that they do not affect the runnability of the code if the user wants to copy and paste it. I came up with a system of using // style comments on deleted lines and a /*new*/ comment on new lines. The // comment disables the entire line while the /*new*/ comment does not.
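To make the comment scheme concrete, here is a small Python sketch that merges an old and a new listing into one annotated output. It uses Python's standard difflib module rather than the Myers-based Diff.cs that CudaPAD uses, so the exact line pairing may differ. The function name annotate_diff and the sample PTX lines are made up for illustration.

```python
import difflib

def annotate_diff(old_lines, new_lines):
    """Merge two listings, marking removed and added lines inside comments."""
    out = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            # Removed lines are commented out, so they are inert if copied.
            out.extend("//" + line for line in old_lines[i1:i2])
        if tag in ("insert", "replace"):
            # Added lines keep working; the marker is only a leading comment.
            out.extend("/*new*/ " + line for line in new_lines[j1:j2])
        if tag == "equal":
            out.extend(new_lines[j1:j2])
    return out

old = ["ld.param.u64 rd1, [p];", "add.s32 r2, r1, 1;", "ret;"]
new = ["ld.param.u64 rd1, [p];", "add.s32 r2, r1, 2;", "ret;"]
print("\n".join(annotate_diff(old, new)))
```

Because both markers are valid comments, the annotated listing can still be copied and pasted as working code.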

Tip: To view differencing over several changes, disable the auto compile, make the changes, then click "Ready" to force a new re-compile.

Single-Click Multiple Highlighting (new in 2016)

Just click on any register or word in the PTX window and it will highlight all instances of that item. Click on another and it will highlight those as well with a different color. Click on any highlighted item and it will un-highlight all instances of that item. With just three clicks the following can be achieved:

Syntax Highlighting and Output Formatting

The ScintillaNET textbox control by Jacob Slusser has some convenient text highlighting abilities that visually help when viewing code. Originally, this started out as a plain textbox, then moved to another 3rd party control and then finally to the ScintillaNET control. This results in more colorful and cleaner looking code.

Besides the text highlighting, the text in the output window is formatted so it's a little cleaner. Things like compiler information and header information are removed:

Remove unneeded comments

Remove unneeded id: comments

Remove empty "//" comments

Shorten __cudaparam_

Shorten labels

Remove .loc 15 lines (e.g. ".loc 3 3431 3")

Remove "%" in front of registers (New as of Jan. 2016)

Remove "// Inline" lines (New as of Jan. 2016)

Remove .file 1 "C:\\....." (New as of Jan. 2016)

Example of highlighted and cleaned up output formatting is as follows:

Note: Since 2009, nVidia has done a good job on cleaning up the PTX output. Labels are cleaner, padding has been added to make the registers line up and more.
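A few of the cleanup rules listed above can be sketched in a handful of lines. The following Python version is only an approximation written for this article, not CudaPAD's real cleanup code. It covers the .loc, .file, "// Inline", empty "//" and "%" rules; the function name clean_ptx and the sample lines in the usage are hypothetical.

```python
import re

# Debug directives to drop. The trailing \s keeps ".local" from matching ".loc".
_DEBUG_DIRECTIVE = re.compile(r"\.(loc|file)\s")

def clean_ptx(ptx):
    """Apply a subset of the output cleanup rules to raw PTX text."""
    cleaned = []
    for line in ptx.splitlines():
        stripped = line.strip()
        if _DEBUG_DIRECTIVE.match(stripped):
            continue                      # .loc / .file lines
        if stripped.startswith("// Inline"):
            continue                      # inlining notes
        if stripped == "//":
            continue                      # empty comments
        # Drop the '%' prefix on names such as %r1, %rd4 or %tid.x.
        line = re.sub(r"%(?=[A-Za-z_])", "", line)
        cleaned.append(line)
    return "\n".join(cleaned)
```

Note the small trap in the first rule: a naive "starts with .loc" test would also delete .local declarations, which is why the pattern requires whitespace after the directive name.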

Online Error/Warning Search

Often when running across an error, it is helpful to do a quick online search. I found I was often opening a browser and then copying and pasting the error into a search box. This was not efficient so I added a search online function. At the time, I think this was one of the first of its kind but since it was released in 2009, I have seen other IDEs have this.
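The search function boils down to URL-encoding the error text and handing it to the browser. A minimal Python sketch follows, assuming a Google search URL; the function name error_search_url is made up for this example.

```python
from urllib.parse import quote_plus

def error_search_url(message):
    """Build a web search URL for a compiler error or warning message."""
    return "https://www.google.com/search?q=" + quote_plus(message.strip())

print(error_search_url('identifier "x" is undefined'))
```

The resulting URL could then be opened with Python's standard webbrowser.open(url). In practice it also helps to drop machine-specific parts of the message, such as local file paths, before searching.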

Under the Hood

Let's take a look at how this application works. I will present what happens when the left window is edited. This triggers a recompile and then updates the right PTX/SASS window. Here it is in steps:

Step-by-Step Process

Step	Description
1	The user enters CUDA code in the left window.
2	The textbox change kicks off a short term timer. If the user types more text before that timer finishes, then the timer is reset. This system prevents the compile process from firing on every keystroke and lets the user finish typing before it automatically starts.
3	When the timer completes an event is raised. In this event, we check to see if there were any changes that would require a re-compile. Obviously, if a user is just editing some comments or adding/removing whitespace, then we don't need to recompile. If there are no "code" changes, then we stop here. In the dropdown box, CODE can be selected to see what this cleaned up code looks like.
4	We save the CUDA textbox to a file. This will be needed later when the CUDA compiler compiles it.
5	We then clear any lines on the screen as we are going to draw new ones soon.
6	We then call a batch file that does most of the compiling. This batch file is generated based on the options selected in CudaPAD.
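The keystroke timer in step 2 is commonly called a debounce. Here is a minimal Python sketch of the pattern using threading.Timer. CudaPAD itself is a C# app and presumably uses a UI timer, so this only illustrates the reset-on-every-keystroke behavior; the class name Debouncer is hypothetical.

```python
import threading

class Debouncer:
    """Run callback once, `delay` seconds after the most recent trigger()."""

    def __init__(self, delay, callback):
        self._delay = delay
        self._callback = callback
        self._timer = None
        self._lock = threading.Lock()

    def trigger(self):
        """Call on every text change; restarts the countdown each time."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()      # a newer keystroke resets the timer
            self._timer = threading.Timer(self._delay, self._callback)
            self._timer.daemon = True
            self._timer.start()
```

Calling trigger() five times in quick succession results in a single callback, fired only after the user has paused for the full delay.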

Batch File Process:

Perform cleanup in the temp folder from the last compile

Calls nVidia's CUDA compiler with options:

nvcc.exe -keep -cubin --generate-line-info ...

This command compiles the CUDA file into a cubin file (device code). The -keep option retains the intermediate files, including the PTX, and --generate-line-info adds source line numbers to the output so that the visual code lines can be drawn.

If SASS is selected from the dropdown, then we run CuObjDump.exe to disassemble the cubin device file into SASS code.

Lastly, we capture any output messages from these commands to info.txt.
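The batch file steps above can be approximated in a short Python sketch. It only uses the options mentioned in this article (-keep, -cubin, --generate-line-info) plus nvcc's -o and cuobjdump's -sass. The file names kernel.cu and kernel.cubin, the function names, and the extra_options parameter are assumptions for illustration; the real batch file is generated by CudaPAD from the selected options.

```python
import subprocess

def build_commands(source="kernel.cu", cubin="kernel.cubin",
                   want_sass=False, extra_options=()):
    """Return the command lines to run, in order, as argument lists."""
    nvcc = ["nvcc", "-keep", "-cubin", "--generate-line-info",
            *extra_options, "-o", cubin, source]
    commands = [nvcc]
    if want_sass:
        # Disassemble the device code into SASS.
        commands.append(["cuobjdump", "-sass", cubin])
    return commands

def run_and_log(commands, log_path="info.txt"):
    """Run each command and capture all of its output in one log file."""
    with open(log_path, "w") as log:
        for cmd in commands:
            result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                    stderr=subprocess.STDOUT, text=True)
            log.write(result.stdout)
            if result.returncode != 0:
                break                     # stop at the first failing tool
```

Note that run_and_log raises FileNotFoundError if the CUDA tools are not on the PATH, which matches the requirement that CUDA be installed.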

Step	Description
7	Next, we fill the info textbox that has the registers and memory utilization information.
8	Next, any errors/warnings are captured from the rtcof.dat file and are then formatted and placed in the error window.
9	We then grab the text from the outputted data.ptx (from nvcc.exe) and compare it to the PTX already in the window using a diff algorithm.
10	Next, we store the position of the scrollbars and caret location for the PTX/SASS window.
11	Next, we grab the line information from the PTX and store that. The line numbers will be needed later to draw the connecting lines.
12	Do some cleanup on the PTX to make it look nice.
13	Next, we draw the visual code lines using textbox dimensions, scroll positions, indentation and line numbers.
14	Finally, we restore the scroll positions and caret location.
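Step 11 relies on the .loc directives that nvcc emits when line information is requested. Each one carries a file index, a source line and a column, and applies to the instructions that follow it. The Python sketch below shows one way to turn them into a source-line to PTX-line map. It is an approximation and not the app's own code; the function name map_source_to_ptx and the sample PTX in the usage are hypothetical.

```python
import re
from collections import defaultdict

# .loc <file index> <source line> <column>
_LOC = re.compile(r"^\s*\.loc\s+(\d+)\s+(\d+)\s+(\d+)")

def map_source_to_ptx(ptx_lines):
    """Return ({source line: [PTX line indices]}, PTX without .loc lines)."""
    mapping = defaultdict(list)
    kept = []                 # what will be displayed; indices refer to this
    current = None            # source line of the most recent .loc
    for line in ptx_lines:
        match = _LOC.match(line)
        if match:
            current = int(match.group(2))
            continue
        stripped = line.strip()
        if current is not None and stripped and not stripped.startswith("//"):
            mapping[current].append(len(kept))
        kept.append(line)
    return dict(mapping), kept
```

The indices refer to the listing with the .loc lines already removed, so they can be used directly as line numbers in the output window when drawing the connecting lines.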

Advantages of Viewing PTX/SASS

Here are some advantages of viewing PTX:

Curiosity

Software Debugging

Performance Optimization

Changing a line or two often produces different results. When a kernel needs some performance optimization, toying with different ways of doing the same thing can produce more efficient code. One example that comes to mind: I found that when using a union, the PTX would always end up using local memory. This was a while ago so it might not be true anymore, but here is the example:

.local .align 4 .b8 someLocMem[4];
....
st.local.s32      [someLocMem], someIntReg;    // very expensive
ld.local.f32      someFloatReg, [someLocMem];  // very expensive

However, when using something like:

int strangeInt = *(int*) &somefloat;

the output looks like this:

mov.b32 someIntReg, someFloatReg;

This is easily spotted in CudaPAD because of the quick feedback and visual markers.

Dead Code Detection

Just as a word of caution, try not to go optimization crazy. Optimization has its place for particular functions that get run often; however, it can make code less readable, awkward, and more difficult to maintain. Also, time should only be spent on code where a performance increase would have a large impact. There is much more on this subject that I will not get into.

Similar Programs

Godbolt Compiler Explorer

In 2019, I discovered a project very similar to this one. Matt Godbolt created a tool called Compiler Explorer (originally GCC Explorer). Some people now call this process "Godbolting" code, and I use that term because it is the one everyone knows these days. Matt's program, created around 2012, is web-based but shares many features with CudaPAD:

Both have...

Enter your code on the left window, and the ASM shows up on the right.

Automatically starts re-compiling each time a change is made in the left window.

Indicates a status that it is working in the top right.

Menus at the top to select different compiler options to compare.

Does a "diff" on the output to see what changed.

Both worked on C++ though one was CUDA C++ and the other GCC C++.

Has visuals that help the user match up the source on the left to the assembly on the right.

Used to optimize code, view what happens, and see what happens with different compiler options.

Uses colorful text highlighting.

Cleans up the assembly output code (removes assembly junk to clean it up).

When clicking a warning/error in the compiler output window it shows the user directly where the line is.

Allow users to highlight all matching registers to see all the matches quickly. (Note: added Jan. 2016 by me)

A tool in the ATI/AMD Brook+ GPGPU toolkit in 2008

There is not much about this online anymore, but I did find a screenshot. As I recall, it was a modified C++ language for GPU development along with a set of tools created by a university. One of the tools was a small app where a user could enter code on the left side and press a button in the middle to update the assembly on the right side. This is where I got my idea from. It did not, however, have highlighting, compiler options (as I recall), automated updates, visual lines to match up the code on the left to the assembly on the right, diff, cleanup (as I recall), etc.

Videos (updated in 2016)

Below is a quick tutorial video. The sub-menu options did not show properly in the video but I explain what I am clicking on so hopefully you can still follow along.

CudaPAD won a poster spot at the 2016 GPU Technology Conference. Even better than that it was also selected as one of the top 20! At the conference, I presented it to an audience of around 100-150 people on April 4th, 2016.

Points of Interest

I had a little fun creating this. This is probably why so much time was put into this.

Getting the code lines to work was exciting for me. I believe the visual code lines might have been one of the first of their kind when I built this in 2009 but I am not sure. This was a wild idea I had and I was not sure if I could get it working. Drawing moving lines on the screen is not that easy as I found out as there always seemed to be some side effects. Drawing the spline was the easy part but all the miscellaneous stuff like cleaning it up was more difficult. Another difficult part was calculating the location in the text box. The textbox line height and line number must be known for each spline drawn. I'm not a graphics developer so I am just happy to get it to work! The visual lines turned out better than expected and are fun to play with.
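To give a feel for the geometry involved, here is a small Python sketch that computes the four control points of one connector from a line number, the line height and the scroll position, and evaluates a point on the curve. A cubic Bezier with horizontal tangents is my assumption here, since the exact spline type is not stated. All function and parameter names are hypothetical.

```python
def line_center_y(line_index, line_height, first_visible_line):
    """Pixel y of the middle of a text line, relative to the top of the box."""
    return (line_index - first_visible_line) * line_height + line_height / 2

def connector_points(src_line, ptx_line, line_height,
                     src_first_visible, ptx_first_visible, left_x, right_x):
    """Control points of a curve from a source line to a PTX line."""
    y0 = line_center_y(src_line, line_height, src_first_visible)
    y1 = line_center_y(ptx_line, line_height, ptx_first_visible)
    mid_x = (left_x + right_x) / 2
    # Both inner points sit at mid_x so the curve leaves and arrives flat.
    return [(left_x, y0), (mid_x, y0), (mid_x, y1), (right_x, y1)]

def bezier_point(points, t):
    """Evaluate the cubic Bezier defined by four points at t in [0, 1]."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = points
    a, b = (1 - t) ** 3, 3 * (1 - t) ** 2 * t
    c, d = 3 * (1 - t) * t ** 2, t ** 3
    return (a * x0 + b * x1 + c * x2 + d * x3,
            a * y0 + b * y1 + c * y2 + d * y3)
```

Since each window scrolls independently, the two ends use their own first-visible-line value, and every connector has to be recomputed whenever either side scrolls.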

Wish List

At the time, I dreamed up many different "line" ideas to help break down the assembly but none of the others have been implemented yet:

Note: These other features have NOT been added to CudaPAD (at least not at this time)

Draw curved lines that show jumps

Register Impact Tracing

Register Usage Timeline

Here are some wish list items I have that may or may not be added in the future:

Isolate the implementation code from interface code using the bridge pattern. While the GUI and code are somewhat split into different files right now, they are not really separable. It's often good practice to split this up.

Add the ability to execute the code for timing purposes. Right now PTX can be visually looked at but not benchmarked.

Add a per-line register usage counter. Basically what this would require is to keep track of how many variables are being used on each PTX line. A GPU has a fixed number of registers and knowing where the register pressure is highest can help programmers balance their code. This is something I added into my AMD GPU compiler, ASM4GCN, but have not added it here.

Add jump lines to the PTX so one can easily see where jump statements land.
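As a thought experiment for the per-line register usage counter, here is a rough Python sketch. It treats a register as live from its first to its last appearance in the listing and counts how many are live on each line. This ignores control flow such as loops and branches, and it counts PTX virtual registers, which the assembler later maps to physical ones, so the numbers are only a hint of where pressure is highest. The function name and the register-name pattern are assumptions.

```python
import re

# Virtual register names such as r1, rd2, f3, fd4, p5, rs6 (with or without %).
# The lookbehind skips type suffixes like ".f32" and parts of longer names.
_REG = re.compile(r"(?<![\w.])%?(?:rd|rs|fd|r|f|p)\d+\b")

def register_pressure(ptx_lines):
    """Estimate the number of live virtual registers on each PTX line."""
    first, last = {}, {}
    for i, line in enumerate(ptx_lines):
        code = line.split("//", 1)[0]          # ignore trailing comments
        for reg in _REG.findall(code):
            reg = reg.lstrip("%")
            first.setdefault(reg, i)
            last[reg] = i
    return [sum(1 for r in first if first[r] <= i <= last[r])
            for i in range(len(ptx_lines))]

ptx = [
    "ld.param.u32 r1, [k_param_0];",
    "add.s32 r2, r1, 1;",
    "mul.lo.s32 r3, r2, r1;",
    "st.global.u32 [rd1], r3;",
]
print(register_pressure(ptx))
```

A real implementation would need proper liveness analysis over the control flow graph and would weight 64-bit registers differently from 32-bit ones.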

A Special Thanks to...

Diff functionality - This is a nice drop-in C# file that provides quality diff functionality. Originally created by Eugene Myers in 1986; Converted into C# by Matthias Hertenstein. The mostly un-edited source is in the file Diff.cs.

ScintillaNET - This nice tool provides the text highlighting for this project. It is a Windows Forms control, wrapper, and bindings for the versatile Scintilla source code editing component. It really adds a lot of life to this project.

nVidia - In 2016, CudaPAD won a poster spot at the 2016 GPU Technology Conference. Moreover, it was selected as an honorable mention (top 20). I presented it to an audience of around 100-150 people on a super large projector screen. It was a wonderful experience, one of the best I have ever had.

History

Date	Update
Dec 2008	Initially built and it has remained mostly unchanged since 2020.
Aug 2009	Built Cudapad.com website as a group project while at CSUEB - it remained up one year until it expired.
Jan 2013	Changed the code textbox to use ScintillaNET for better syntax highlighting
Nov 2014	Updated for nVidia CUDA 6.0/6.5
June 2015	Code released to the public; changed to MIT License; updated for CUDA 6.5/7.0
Jan 2016	Added a single-click multiple highlighting search feature; Updated for CUDA 7.0/7.5.
Jan 2017	Verified okay with CUDA 8.0
Jun 2019	Updated for CUDA 10 and Visual Studio 2017/2019
