The Dynamic Cluster Compiler (DCC)

Published: 2/16/2023

Overview

The Dynamic Cluster Compiler (DCC) allows you to compile your program across a cluster of servers, or nodes, rather than being forced to compile it on your local machine or a single remote host. This potentially gives you access to more compute resources than you would otherwise have on a single machine, which in turn can yield faster compile and build times, assuming, of course, that your code base is large enough to benefit from a distributed compiler and that the initial overhead is recouped by the more parallelized compilation across multiple servers. The compiler works with any programming language that can be compiled with a makefile.

Requirements

Nodes

Each node that will be doing the source code compilation MUST have the following:

  • SSHFS installed
  • Swapped its public SSH key with the server where the source code lives
  • Accessible via Ansible
  • The compiler required to compile the given project
    • Ex. gcc, g++, javac, clang, cargo, gccgo, etc.

Source Code

Each project that will compile via DCC MUST contain the following files:

  • distributed.cfg
    • This file is required to be in the root of the directory.
  • inventory.ini
  • mnt.cfg
  • distributed_makefile

The source code must also be centrally located and accessible via SSH. There is no restriction stating that this has to be on a dedicated machine. For example, one of the nodes in the cluster could house it, the driver machine running DCC could house it, or it could be located on a centrally located NAS (Network Attached Storage). As long as the machine it is housed on can be accessed via SSH, you are good to go.

NOTE: Examples for each of these configs can be found in the examples directory of the GitHub repo.
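
To give a rough feel for what one of these looks like, here is a minimal, hypothetical distributed.cfg. The only variable confirmed in this post is fileTypes; the other key names are placeholders I made up for illustration, so defer to the examples in the repo for the real options.

# Hypothetical distributed.cfg sketch -- aside from fileTypes,
# these key names are placeholders, not DCC's actual options
fileTypes = .c                        # which files count as source code
inventory = inventory.ini             # path to the Ansible inventory
mnt = mnt.cfg                         # path to the mount config
makefile = distributed_makefile       # path to the distributed makefile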

Driver Machine

The driver machine could be one of the nodes in the cluster, the storage server, or any other machine. The only requirement is that it has the DCC program and Ansible installed on it. In cases where Ansible connects over SSH (i.e., every platform except possibly Windows), the driver machine must also have exchanged SSH public keys with each node in the cluster for automatic login.
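
If you haven't exchanged keys before, the standard OpenSSH tooling handles it; the user and host below are placeholders:

# Generate a key pair on the driver machine if it doesn't have one
ssh-keygen -t ed25519

# Copy the driver's public key to a node so it can log in without a password
ssh-copy-id ansible@192.168.1.11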

How it works

Prepping the source code

The Dynamic Cluster Compiler starts by parsing the distributed.cfg file, which must be in the root of the given directory. You can satisfy this either by running the compiler in your current directory with a distributed.cfg file present, or by passing a command-line argument pointing to a directory that contains the distributed.cfg file.

The compiler then creates folders for each node in the cluster. It pulls this information from the inventory.ini file. By default, the compiler will look for that file in the root of the directory, just like the distributed.cfg file. However, where the compiler looks for the inventory.ini file can be configured in the distributed.cfg file. The compiler parses the inventory.ini file and pulls out the worker nodes as well as the master node. It then checks to make sure each host is online via a ping check. If a host fails the ping check, it is assumed to be down, and a folder will not be created for it. If the master node fails the ping check, the compiler will exit.
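
As a sketch, a minimal inventory might look like the following. The group names here are my own assumption; all DCC needs is to be able to tell the master node apart from the workers.

# Hypothetical inventory.ini -- the group names are assumptions
[master]
192.168.1.10

[workers]
192.168.1.11
192.168.1.12
192.168.1.13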

Once the folders are created, the compiler sifts through all of your source code and evenly distributes it amongst the folders it created for each node. Which files count as source code is determined by the fileTypes variable in the distributed.cfg file.
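
The distribution itself is simple in concept. Here is a short Python sketch of the idea, not DCC's actual implementation, that deals matching files out round-robin (it glosses over details like name collisions and preserving subdirectory structure):

import shutil
from pathlib import Path

def distribute(project_root, nodes, file_types):
    """Deal the project's source files out evenly, one folder per node."""
    root = Path(project_root)
    folders = [root / node for node in nodes]
    for folder in folders:
        folder.mkdir(exist_ok=True)

    # Only files whose extension is listed in fileTypes count as source code;
    # skip anything already sitting inside a node folder
    sources = [p for p in sorted(root.rglob("*"))
               if p.suffix in file_types and not any(n in p.parts for n in nodes)]

    for i, src in enumerate(sources):
        shutil.copy(src, folders[i % len(folders)])

distribute("/mnt/Code/Project", ["192.168.1.11", "192.168.1.12"], {".c"})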

Compiling across a cluster

After the source code is distributed, DCC kicks off an Ansible playbook to orchestrate the compile process across all the nodes. Using information found in the mnt.cfg file, it mounts a network share on each of the nodes. Like the other files mentioned, the compiler looks for mnt.cfg in the root of the directory by default, but this location can be configured in the distributed.cfg file. Once the share is mounted, each node compiles its assigned source code before the master node links everything together to create the executable.
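
I won't reproduce mnt.cfg's real format here, but conceptually it just has to carry enough information for an SSHFS mount, something along these lines (the field names are placeholders):

# Hypothetical mnt.cfg sketch -- field names are placeholders
remoteHost = storage.local
remotePath = /Code/Project
mountPoint = /mnt/Code/Project

A mount built from those values boils down to an ordinary SSHFS invocation, e.g. sshfs user@storage.local:/Code/Project /mnt/Code/Project.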

Custom makefiles

Your standard makefile will not work with this compiler; you will have to modify it slightly for your code to compile correctly. This is because a makefile's purpose is to compile all of your code to produce your executable. The problem is that a single node can't compile all the code, because 1. that defeats the purpose and 2. its folder only contains its share of the code, as we discussed earlier. This means you have to modify your makefile to have a rule that compiles only the source code in its given directory, even if that means not compiling all of the source files. In the GitHub repository for DCC, there are examples of projects that compile with DCC. In those examples, you'll be able to see how a standard makefile compares to a distributed one.

The most notable changes are having two variables to track the output files required to build your application, and having separate rules for compiling only the source code in the current directory and for building the executable. Regarding the two variables, the first must include all of the files required to build your application, while the second should contain only the files that will be generated from the current directory. This is because the source code is distributed, and you only want nodes compiling the code assigned to them, i.e., only the code in their own folder. At the same time, you need to make sure that when the master node gathers all the output files to build the application, every required file is accounted for.

By default, the compiler will look for your distributed_makefile in the root of the directory, but where the compiler looks can be configured in the distributed.cfg file. This makefile is copied into the folder of every node that will do work.

To better illustrate this, let's take a look at a side-by-side comparison of a standard makefile vs. a distributed one. In this case, we'll use a modified version of my Universal Countdown server as an example.

# Standard makefile

TARGET = timeRemainingServer
CC = gcc
CFLAGS = -Os
LIBS = -lpthread

GENERAL_UTILS_HEADERDIR = headers/general_utils
SERVER_UTILS_HEADERDIR = headers/server_utils

SRCDIR = src
GENERAL_UTILS_SRCDIR = $(SRCDIR)/general_utils
SERVER_UTILS_SRCDIR = $(SRCDIR)/server_utils

INCLUDES = -I $(GENERAL_UTILS_HEADERDIR) -I $(SERVER_UTILS_HEADERDIR)

OBJDIR = obj
BINDIR = bin

.PHONY: default all clean debug

default: $(BINDIR)/$(TARGET)
all: default

debug: CFLAGS = -g -Wall
debug: $(BINDIR)/$(TARGET)

SRCS = $(wildcard $(SRCDIR)/*.c $(GENERAL_UTILS_SRCDIR)/*.c $(SERVER_UTILS_SRCDIR)/*.c)
OBJECTS = $(patsubst %.c, $(OBJDIR)/%.o, $(SRCS))
HEADERS = $(wildcard $(GENERAL_UTILS_HEADERDIR)/*.h $(SERVER_UTILS_HEADERDIR)/*.h)

$(OBJDIR)/%.o: %.c $(HEADERS)
  @mkdir -p $(@D)
  $(CC) $(CFLAGS) $(INCLUDES) -c $< -o $@

$(BINDIR)/$(TARGET): $(OBJECTS)
  @mkdir -p $(@D)
  $(CC) $(OBJECTS) $(CFLAGS) $(LIBS) -o $@
  @cp configs/server.cfg $(BINDIR)

clean:
  -rm -rf $(OBJDIR) $(BINDIR)

# Distributed makefile

TARGET = timeRemainingServer
CC = gcc
CFLAGS = -Os
LIBS = -lpthread

ROOT = ..
GENERAL_UTILS_HEADERDIR = $(ROOT)/headers/general_utils
SERVER_UTILS_HEADERDIR = $(ROOT)/headers/server_utils

SRCDIR = src
GENERAL_UTILS_SRCDIR = $(SRCDIR)/general_utils
SERVER_UTILS_SRCDIR = $(SRCDIR)/server_utils

OBJDIR = $(ROOT)/obj
BINDIR = $(ROOT)/bin

.PHONY: default all clean executable build

# Rule to build the executable
executable: $(BINDIR)/$(TARGET)
default: executable
all: default

# Having the includes directory, or directories,
# is pivotal in ensuring your code compiles correctly.
# Since we will be moving source files all over the place,
# if you do not specify where the includes are,
# your compilations and builds will fail
INCLUDES = -I $(GENERAL_UTILS_HEADERDIR) -I $(SERVER_UTILS_HEADERDIR)

ROOTSRCS = $(wildcard $(ROOT)/$(SRCDIR)/*.c $(ROOT)/$(GENERAL_UTILS_SRCDIR)/*.c \
    $(ROOT)/$(SERVER_UTILS_SRCDIR)/*.c)

SRCS = $(wildcard $(SRCDIR)/*.c  $(GENERAL_UTILS_SRCDIR)/*.c $(SERVER_UTILS_SRCDIR)/*.c)
# Get only the object files that will be generated
# by compiling the current directory
DIRFILES = $(patsubst %.c, $(OBJDIR)/%.o, $(SRCS))
# Get all the object files required to build the
# executable
OBJECTS = $(patsubst $(ROOT)/%.c, $(OBJDIR)/%.o, $(ROOTSRCS))
HEADERS = $(wildcard $(GENERAL_UTILS_HEADERDIR)/*.h $(SERVER_UTILS_HEADERDIR)/*.h)

$(OBJDIR)/%.o: %.c $(HEADERS)
  @mkdir -p $(@D)
  $(CC) $(CFLAGS) $(INCLUDES) -c $< -o $@

.PRECIOUS: $(BINDIR)/$(TARGET) $(OBJECTS) $(DIRFILES)

$(BINDIR)/$(TARGET): $(OBJECTS)
  @mkdir -p $(@D)
  $(CC) $(OBJECTS) $(CFLAGS) $(LIBS) -o $@
  @cp $(ROOT)/configs/server.cfg $(BINDIR)

# Rule to only compile the source files in the worker
# nodes directory
build: $(DIRFILES)

clean:
  -rm -rf $(OBJDIR) $(BINDIR)

Figure 1. Comparison of a standard makefile vs. a distributed one

Here, you can see they are both very similar. The first key difference is that the distributed_makefile has $(ROOT) prefixes. This is done so that 1. all of the object files required to create the final executable can be identified, and 2. the compiler can include any header files from the larger project that it requires.

Another distinction is that instead of a single SRCS variable, we have ROOTSRCS and SRCS variables, as well as DIRFILES and OBJECTS variables. As the names imply, ROOTSRCS contains all of the source files for the entire project, while SRCS only contains the source files in the current node's directory. Similarly, OBJECTS contains all of the object files required to build the executable, while DIRFILES only contains the object files that will be produced from the source files in the current node's directory.

Lastly, you'll notice we have a new rule in our distributed_makefile called build. This is what we will call on each of our nodes to compile only the source code files in its directory.
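
The exact invocation is handled by the Ansible playbook, but conceptually it boils down to something like the following on each machine (the rule names come straight from the makefile above; the -f flag just tells make which file to use):

# On each worker node, from inside its assigned folder
make -f distributed_makefile build

# On the master node, once the workers have finished
make -f distributed_makefile executable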

Visualizing DCC

That was a lot of information to take in, so let's walk through an example, with some visuals, of how DCC works in a real-world scenario. Consider the image below:

Figure 2. Example scenario for using DCC

In this example, we assume that we have some project we want to compile with DCC on a centralized storage server, and the root of the project is located at Code/Project. We also see that we have a cluster of servers we want to use to compile our project using DCC.

Figure 3. Example scenario for using DCC during and after compiling

To begin, we assume the developer has mounted the source code directory on their local machine at /mnt/Code/Project. To start the compile process, the developer will run DCC, passing in the directory of the source code as an argument: ./dcc.py /mnt/Code/Project. DCC will then find the distributed.cfg file in the root of the /mnt/Code/Project directory and use it to begin the compiling process. We can see that DCC created multiple new folders whose names match the IP addresses of the servers in our cluster. We also see that it added source code, as well as a makefile, to each of those folders.

Once it has distributed all the source code amongst the folders, it then kicks off the Ansible playbook to tell the cluster to compile the code. The Ansible playbook first mounts the Code/Project directory on each of the servers in our cluster using SSHFS. For this to work, we assume each server has SSHFS installed and that its Ansible user has copied its public SSH key to the storage server, so it can log in automatically and mount the folder, per the requirements. With the Code/Project directory mounted, each server then navigates to the directory corresponding to its IP address and runs the makefile to compile the source code assigned to it. It writes all of its resulting object files to the project directory's root so that all of the object files needed to build the executable are in one place.
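
The net effect on each worker, using the paths from this example (the user name and IP address are placeholders), is roughly:

# Mount the share from the storage server
sshfs ansible@storage:/Code/Project /mnt/Code/Project

# Enter the folder matching this node's IP and compile its share
cd /mnt/Code/Project/192.168.1.11
make -f distributed_makefile build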

After each node has finished compiling its assigned source code, the designated master node takes all of the generated object files and builds/links them together to form the final executable.

Lastly, each node runs a "clean up" phase: it removes its corresponding folder from the project directory and unmounts the share, since neither is needed anymore.

Final thoughts

The question now has to be asked: do I recommend you use DCC to compile your programs? In short, no, at least not in the vast majority of cases, for a few reasons. The first, and most obvious, is overhead, especially on Windows clusters. In my experience, using Ansible with Windows is very slow, so the overhead accrued using a cluster of Windows nodes will be significantly higher than that of, say, macOS or GNU/Linux clusters. And regardless of the cluster you have, there will be significant overhead because Ansible is orchestrating the process. This means that unless your project is absolutely massive, using DCC will just slow down your compile times, not speed them up.

The next problem is that you have to have the infrastructure at your disposal in order to take advantage of DCC. If you don't have a cluster of computers lying around, or if your cluster isn't as powerful as a single machine you already own, you obviously won't get anything out of DCC. Sure, you could have a cluster of five Raspberry Pis or some other low-end machines you play around with Kubernetes on, but if your main dev machine outperforms your entire Pi cluster, it's obviously not worth using.

Now, all of that being said, if you have a massive code base and a cluster of high-performance servers/computers at your disposal, then yeah, it's worth using since you could see some serious improvements in compile times by taking advantage of all that extra horsepower. Alternatively, if you just want to have the cool factor and bragging rights of knowing you used a cluster of servers to compile your basic Hello World program, then by all means knock yourself out. I can attest that that is an incredible feeling.

If you are interested in checking out the compiler and/or some examples using it, you can view them on my GitHub.