Experimenting with the Synth Tool

Introduction
Overview
Background
Synth Tweaks
Verilog Full Adder
Bluespec Full Adder
Verilog Wrapped with Bluespec
Next Time

Introduction

In a series of upcoming posts, I will be presenting worked Bluespec and Verilog examples of different adders for eventual use in my RISC-V processor project. I’ll be using these adders both to replace existing adders and as components in future functional units like my integer multiplier or floating point unit.

Before all that, I need to perform some tests and set up some infrastructure. It’s no good to blindly implement components, so I spend this post experimenting with synth to identify quirks and see how it interacts with Bluespec and Verilog when we involve wrappers, which are required to import Verilog into Bluespec.

I also tweak synth to accept Verilog directly, which will be helpful to evaluate Verilog components in the same way I evaluate Bluespec components. Some upcoming posts will see whether we actually get any performance gains from implementing modules in Verilog rather than Bluespec.

This post also serves as a visual walkthrough of using the Minispec synth tool. There’s sparse documentation anywhere on its use, so I figured I may as well write some here.

Overview

I begin by discussing some tweaks I made to my fork of Daniel Sanchez’s synth tool for Minispec. These tweaks enable the rest of this post.

Then, I demonstrate the use of synth on a Verilog implementation of a full adder. I also show an example using boolean and bitwise operators where quirks in our downstream synthesis tools can create suboptimal circuits, so we should take synthesis results with a grain of salt. Because a full adder creates only a simple circuit, I also include gate-level logic circuit visualizations created using synth with several cell libraries.

I also demonstrate the use of synth on Bluespec implementations of full adders, including showing the resulting Verilog files from compilation and some strange properties that emerge when we nest Bluespec wrappers, including losing and gaining efficiency in the resulting circuits.

Afterward, I demonstrate Bluespec’s ability to directly use Verilog implementations in Bluespec designs, which will be helpful if we find Verilog implementations to be more efficient than our Bluespec ones. However, I found no performance difference with simple circuits like full adders, so that would require more testing with more complex circuits to see whether implementing in Verilog is worth the trouble. We’ll explore these things and more next time.

Background

For the past couple weeks, I’ve slowed down on technical blogging because I’ve been practicing my Verilog with the wonderful exercises on HDLBits. I’m starting to exhaust their Verilog material, so it’s about time to apply what I’ve learned. With all this practice, I’m now able to do two things:

I can now inspect and understand the .v files that result from compiling my .bsv files. Simple Bluespec modules can give us legible Verilog. With complex modules, it takes more effort but can be done, especially when side-by-side with the Bluespec source code.
I can now write .v files directly and import them as IP blocks into my Bluespec designs through the import "BVI" feature. This works best for simple modules that are done more efficiently in Verilog.

I like Bluespec for its high-level constructs and abstractions. One common criticism of the language is that the Verilog outputted by the Bluespec compiler might not be performant enough to supplant writing Verilog by hand. The trade-off is acceptable for complex top-level modules that can’t be prototyped quickly in Verilog, but in small, reusable components like adders and FIFOs, it can make sense to go lower in abstraction.

(In this post, I found no evidence with the simple full adder example that Bluespec produces any less performant circuitry than Verilog. It’s too soon to draw conclusions on this front, since we’d need more complex modules.)

This is especially the case when the optimizing compiler isn’t mature enough. There was probably a point in history when C compilers didn’t produce performant enough assembly for developers to program exclusively in C. Bluespec may very well be at that point right now with producing performant Verilog. In an ideal world, the Bluespec compiler should be able to automatically make the same optimizations a human designer would.

To understand how our Bluespec turns into Verilog, we can refer to the BSC User Guide. People interested in greater detail should check out the chapter “Verilog back end” and especially the subsection “Bluespec to Verilog mapping”, which describes how .bsv files are transformed into Verilog .v files.

You can also read the chapter “Embedding RTL in a BSV design” in the BSV Reference Guide where they discuss importing Verilog modules into Bluespec for use in the Verilog backend. As per the User Guide, the Bluesim backend is currently incapable of using Verilog directly. When we import, we’d need to use Verilog simulators or write Bluespec implementations for simulation in Bluesim. This makes it a little less convenient to import Verilog when we use Bluesim for simulation, like I currently do.

Synth Tweaks

The synth synthesis tool we use from Daniel Sanchez’s Minispec compiles our Bluespec .bsv files into Verilog .v files, then does a bunch of processing with yosys and ABC to determine our area and critical-path delay.

It’s a nicely designed tool, but I need to make a series of tweaks to make it work better for my purposes. The main change is that I’d like to be able to synthesize Verilog files directly, but I also make a bug fix and a cosmetic change. You can see my modified version on my fork on GitHub. I don’t know how widely applicable my changes are, so I don’t plan on making a pull request.

Accepting Verilog Inputs

The synth tool was built to consume Minispec and Bluespec, but internally it compiles both into Verilog .v files for synthesis with the downstream tools yosys and ABC. There might be established tools for generating area and delay numbers for Verilog designs, but I like Minispec’s synth, and I have trouble finding off-the-shelf synthesis tools. (I suspect many of them are proprietary.)

I modified synth to be able to accept Verilog modules directly for synthesis. It’s just a matter of being able to skip the Minispec/Bluespec compilation step of the synth tool and using the Verilog .v files directly.

It’s also a matter of moving the .v files in the current directory into the synthDir so that they can be consumed as needed by other modules (specified by the .use files). This is especially important for Bluespec import "BVI" statements because the .v files from the compilation will assume that the imported .v files will be available for synthesis.
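Here’s a minimal Python sketch of those two steps. It is not the actual code in synth: the helper names `is_verilog_input`, `stage_verilog_sources`, and `demo` are hypothetical, synthDir is modeled as an ordinary directory, and I copy the files rather than move them so the originals stay in place.

```python
import shutil
import tempfile
from pathlib import Path


def is_verilog_input(source):
    """True when the input is already Verilog, so the compile step can be skipped."""
    return Path(source).suffix.lower() == ".v"


def stage_verilog_sources(work_dir, synth_dir):
    """Copy every .v file in work_dir into synth_dir; return the copied file names."""
    work_dir, synth_dir = Path(work_dir), Path(synth_dir)
    synth_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for v_file in sorted(work_dir.glob("*.v")):
        shutil.copy2(v_file, synth_dir / v_file.name)
        copied.append(v_file.name)
    return copied


def demo():
    """Run the staging step in a throwaway directory (file names are illustrative)."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        (work / "FullAdder.v").write_text("module FullAdder(input CLK); endmodule\n")
        (work / "Top.bsv").write_text("// Bluespec source, left alone\n")
        copied = stage_verilog_sources(work, work / "synthDir")
        staged = sorted(p.name for p in (work / "synthDir").iterdir())
    return copied, staged
```

Calling `demo()` stages only the .v file, which is the behavior an import "BVI" design would rely on.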

When we eventually do Verilog simulation, we’ll also need to ensure that our .v files are moved to build for simulation.

Alternatives

When I was thinking about how to measure the performance of both Bluespec and Verilog modules, I briefly considered using the wrapper-only route. I wouldn’t need to modify the synth tool as long as all my Verilog modules were presented as Bluespec modules.

I decided that it would be a little too roundabout to need to wrap all my Verilog modules in Bluespec just to synthesize them. I may want to synthesize separately even before importing these modules into a Bluespec design. It’s not much trouble, but it requires writing a bit of boilerplate.

It wasn’t so hard to modify the synth program. It’s written in Python, so I just needed to read through it and figure out what to change.

Buffer Configurations

I had already tweaked my installation of synth during my processor project. During the step where the program synthesizes with three buffer configurations, one of them would suddenly require much, much, much more computation than the other two. It’s no problem for small designs, but it would take so much computation for synthesizing my L1 caches that the synthesis would crash.

To locate the issue, I looked at the different output logs from synth to see where the tool was stalling. I found that synth generates several configurations and selects the best one. synth would crash because one of these configurations would stall.

Debug output showing two normal critical paths and one very large critical path. There are 6 different outputs because the tool tries 3 buffer configurations with each of the -O0 (`ox`) and -O1 (`ob`) optimization parameters.

I “fixed” the issue by making synth skip the configuration prone to stalling. I don’t know whether it’s a true fix because it might result in worse generated circuits for some designs. I checked it makes no difference for my full adder implementations.

SVG Tweaks

I also adjusted the color scheme of the svg generator to output dark mode circuit visualizations, just because that’s what I use for everything, including this blog.

If I were submitting a pull request, I would want to make it configurable from the command line. But because I only ever need dark mode, I just changed the color values in the svg file in synth.

Verilog Full Adder

In this section, I wanted to test out my changes to synth by synthesizing a simple full adder module written in Verilog.

I also run a little experiment with using different operators. Below, I choose to use boolean operators (e.g., &&, ||) even though I could use bitwise operators (e.g., &, |). I explain more soon.

The synth tool customarily requires every module to accept a CLK, which can remain unused.

module FullAdder(
    input CLK, a, b, c_in,
    output reg sum, c_out  // reg, because both are assigned inside the always block
    );
    always @(*) begin  // generally I would prefer always_comb in SystemVerilog
        sum = a ^ b ^ c_in;
        c_out = (a&&b) || (a&&c_in) || (b&&c_in);
    end
endmodule
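Before synthesizing, it can be reassuring to confirm that these two expressions really form a full adder. The following is a small Python model I’m adding for illustration (the function names are my own, not part of synth); it mirrors the Verilog expressions on single bits and compares them against ordinary addition for all eight input combinations.

```python
from itertools import product


def full_adder(a, b, c_in):
    """Mirror the Verilog expressions for single-bit inputs (each 0 or 1)."""
    s = a ^ b ^ c_in
    c_out = int((a and b) or (a and c_in) or (b and c_in))
    return s, c_out


def check_full_adder():
    """Exhaustively compare the model against integer addition."""
    for a, b, c_in in product((0, 1), repeat=3):
        s, c_out = full_adder(a, b, c_in)
        # The 2-bit value {c_out, sum} must equal the arithmetic sum of the inputs.
        assert 2 * c_out + s == a + b + c_in
    return True
```

With only three inputs, the exhaustive check is cheap enough to rerun after every edit to the expressions.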

With my above tweak, I can run synth FullAdder.v FullAdder to generate synthesis logs.

Basic Cell Library

Synthesizing FullAdder from file FullAdder.v as a Verilog module.
Synthesizing circuit with std cell library = basic, O1, target delay = 1 ps

Gates: 14
Area: 10.11 um^2
Critical-path delay: 51.75 ps (not including setup time of endpoint flip-flop)

Critical path: b -> sum
               Gate/port   Fanout        Gate delay (ps)  Cumulative delay (ps) 
               ---------   ------        ---------------  --------------------- 
                       b        3                    7.6                    7.6 
                   NAND2        3                   14.3                   21.9 
                     INV        1                    8.4                   30.3 
                    NOR2        1                    6.1                   36.4 
                   NAND2        1                    8.6                   45.0 
                   NAND2        1                    6.7                   51.7 
                     sum        0                    0.0                   51.7 

Area breakdown:
               Gate type    Gates       Area/gate (um^2)       Area/type (um^2)
               ---------    -----       ----------------       ----------------
                     INV        4                  0.532                  2.128
                   NAND2        8                  0.798                  6.384
                    NOR2        2                  0.798                  1.596
                   Total       14                                        10.108
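To make the critical-path table easier to read, here’s a short Python sketch that rebuilds the cumulative column from the per-gate delays in the log above. The data is transcribed from that log; the function name is my own. The per-gate figures are printed to one decimal place, which is presumably why the table ends at 51.7 ps while the headline figure is 51.75 ps.

```python
# (gate or port, gate delay in ps), transcribed from the basic-library log
critical_path = [
    ("b", 7.6),
    ("NAND2", 14.3),
    ("INV", 8.4),
    ("NOR2", 6.1),
    ("NAND2", 8.6),
    ("NAND2", 6.7),
    ("sum", 0.0),
]


def cumulative_delays(path):
    """Return (name, running total in ps) for each stage along the path."""
    total = 0.0
    rows = []
    for name, delay in path:
        total += delay
        rows.append((name, round(total, 1)))
    return rows


for name, total in cumulative_delays(critical_path):
    print(f"{name:>6} {total:6.1f}")
```

Each row’s cumulative delay is just the previous row’s total plus the current gate delay, so the last row gives the delay from b to sum.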

The synth tool includes an svg diagram visualizer for circuits made with the standard (basic) cell library. We get that by using the -v flag, e.g., synth FullAdder.v FullAdder -v.

Let’s see what this looks like.

SVG diagram of the Verilog full adder with std cell library

Notice the synthesis mostly uses INV, NAND2 and a couple NOR2 gates, whereas a textbook full adder might only use XOR2, AND2, and an OR2. Modern physical design (or at least the kind that they teach in schools) preferentially uses NAND gates because they result in an overall cheaper circuit.

Boolean Quirks

By accident, I noticed there’s a quirk that happens when I use bitwise versus boolean operators. I think it must be an issue with the downstream optimization because semantically, it shouldn’t matter whether we’re using boolean operators or bitwise operators when each operand is a single bit. Indeed, we’ll see later that the downstream gate placement can vary unpredictably.

We get a different circuit when we use c_out = (a&b) | (a&c_in) | (b&c_in);, even if semantically we should get the same thing.
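To back up the claim that the two forms mean the same thing for single-bit operands, here’s a small Python check (my own illustration, with hypothetical function names). Python’s `and`/`or` stand in for Verilog’s `&&`/`||`, and `&`/`|` stand in for the bitwise operators.

```python
from itertools import product


def carry_boolean(a, b, c_in):
    """Carry written with logical operators, as in the first version."""
    return int((a and b) or (a and c_in) or (b and c_in))


def carry_bitwise(a, b, c_in):
    """Carry written with bitwise operators, as in the second version."""
    return (a & b) | (a & c_in) | (b & c_in)


def carries_agree():
    """True when both forms match on every single-bit input combination."""
    return all(
        carry_boolean(a, b, c) == carry_bitwise(a, b, c)
        for a, b, c in product((0, 1), repeat=3)
    )


# The operators only diverge on multi-bit operands: 0b10 and 0b01 are both
# nonzero (logically true), yet they share no set bits.
multi_bit_logical = int(bool(0b10) and bool(0b01))  # 1
multi_bit_bitwise = 0b10 & 0b01                     # 0
```

Since every operand here is one bit wide, any difference in the synthesized circuit has to come from how the tools process the expression, not from its meaning.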

SVG of a Verilog full adder circuit with a couple extra gates.

It’s technically up to the engineer whether this circuit is better or worse. It results in 16 rather than 14 gates, but we shave off about a third of a picosecond of delay. I would probably go with the original 14-gate circuit, since the 16-gate circuit is only 0.7% faster (51.4 ps vs 51.7 ps) but about 16% larger (11.704 um^2 vs 10.108 um^2).
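The trade-off can be worked out directly from the two logs. This is just arithmetic on the reported figures; the variable names are my own.

```python
def percent_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Figures reported by synth for the two circuits
delay_14, area_14 = 51.75, 10.108  # boolean operators, 14 gates
delay_16, area_16 = 51.39, 11.704  # bitwise operators, 16 gates

delay_saved_ps = delay_14 - delay_16                   # about 0.36 ps
delay_gain_pct = -percent_change(delay_16, delay_14)   # about 0.7 %
area_cost_pct = percent_change(area_16, area_14)       # about 15.8 %

print(f"{delay_saved_ps:.2f} ps faster ({delay_gain_pct:.1f}%), {area_cost_pct:.1f}% more area")
```

Paying roughly 16% more area for well under 1% of delay is a poor trade for a building block that gets instantiated many times.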

Critical-path delay: 51.39 ps (not including setup time of endpoint flip-flop)
  Gate/port   Fanout        Gate delay (ps)  Cumulative delay (ps) 
  ---------   ------        ---------------  --------------------- 
          a        4                    9.8                    9.8 
      NAND2        2                   12.2                   22.0 
        INV        1                    7.7                   29.7 
       NOR2        1                    6.3                   36.0 
      NAND2        1                    8.6                   44.6 
      NAND2        1                    6.8                   51.4 
        sum        0                    0.0                   51.4 

  Gate type    Gates       Area/gate (um^2)       Area/type (um^2)
  ---------    -----       ----------------       ----------------
        INV        4                  0.532                  2.128
      NAND2       10                  0.798                  7.980
       NOR2        2                  0.798                  1.596
      Total       16                                        11.704

In some cases, we can use the --retime flag with synth to re-generate a more efficient and logically equivalent circuit. For whatever reason, it didn’t work with this one.

Extended Cell Library

We can also get different results with different cell libraries. I generally stick with basic, but there’s no reason why we can’t use the other ones. They just give us different gates. The main difference with this library for the full adder is that we gain access to NAND3 gates, which we use for c_out.

I synthesize using the -l option with a cell library name, e.g., synth FullAdder.v FullAdder -l extended -v. I trimmed the following log for conciseness.

[Extended]
Critical-path delay: 49.98 ps (not including setup time of endpoint flip-flop)
  Gate type    Gates       Area/gate (um^2)       Area/type (um^2)
  ---------    -----       ----------------       ----------------
        INV        3                  0.532                  1.596
      NAND2        8                  0.798                  6.384
      NAND3        2                  1.064                  2.128
      Total       13                                        10.108

SVG diagram of the Verilog full adder with the extended cell library

Multisize Cell Library

Here, we use a few different gates other than NAND2, but we still stick mostly with NAND2.

[Multisize]
Critical-path delay: 48.84 ps (not including setup time of endpoint flip-flop)
  Gate type    Gates       Area/gate (um^2)       Area/type (um^2)
  ---------    -----       ----------------       ----------------
     INV_X1        1                  0.532                  0.532
   NAND2_X1        5                  0.798                  3.990
   NAND3_X1        1                  1.064                  1.064
     OR2_X2        1                  1.330                  1.330
   XNOR2_X1        1                  1.596                  1.596
      Total        9                                         8.512

SVG diagram of the Verilog full adder with the multisize cell library

Full Cell Library

We can synthesize with a more diverse full cell library, but synth doesn’t currently support generating circuit diagrams for it. It’s probably just a matter of adding in the svg components for all the different gates.

[Full]
Critical-path delay: 47.63 ps (not including setup time of endpoint flip-flop)
  Gate type    Gates       Area/gate (um^2)       Area/type (um^2)
  ---------    -----       ----------------       ----------------
    AND2_X1        1                  1.064                  1.064
     INV_X1        1                  0.532                  0.532
   NAND2_X1        2                  0.798                  1.596
   NAND3_X1        1                  1.064                  1.064
    NOR2_X1        1                  0.798                  0.798
   OAI21_X1        2                  1.064                  2.128
     OR2_X2        1                  1.330                  1.330
      Total        9                                         8.512
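To compare the four cell libraries side by side, here’s a short Python sketch that recomputes the gate counts and total areas from the breakdowns above. The numbers are transcribed from the logs; the dictionary layout and function name are my own.

```python
# cell name -> (gate count, area per gate in um^2), transcribed from the logs
libraries = {
    "basic": {
        "delay_ps": 51.75,
        "cells": {"INV": (4, 0.532), "NAND2": (8, 0.798), "NOR2": (2, 0.798)},
    },
    "extended": {
        "delay_ps": 49.98,
        "cells": {"INV": (3, 0.532), "NAND2": (8, 0.798), "NAND3": (2, 1.064)},
    },
    "multisize": {
        "delay_ps": 48.84,
        "cells": {
            "INV_X1": (1, 0.532), "NAND2_X1": (5, 0.798), "NAND3_X1": (1, 1.064),
            "OR2_X2": (1, 1.330), "XNOR2_X1": (1, 1.596),
        },
    },
    "full": {
        "delay_ps": 47.63,
        "cells": {
            "AND2_X1": (1, 1.064), "INV_X1": (1, 0.532), "NAND2_X1": (2, 0.798),
            "NAND3_X1": (1, 1.064), "NOR2_X1": (1, 0.798), "OAI21_X1": (2, 1.064),
            "OR2_X2": (1, 1.330),
        },
    },
}


def summarize(lib):
    """Return (gate count, total area in um^2, delay in ps) for one library."""
    gates = sum(count for count, _ in lib["cells"].values())
    area = round(sum(count * per_gate for count, per_gate in lib["cells"].values()), 3)
    return gates, area, lib["delay_ps"]


for name, lib in libraries.items():
    gates, area, delay = summarize(lib)
    print(f"{name:>9}: {gates:2d} gates, {area:6.3f} um^2, {delay:5.2f} ps")
```

The recomputed totals match the logs: multisize and full both land on 9 gates and 8.512 um^2, and delay falls with each richer library.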

Bluespec Full Adder

In this section, I wanted to synthesize a simple Bluespec full adder and inspect the resulting Verilog files and synthesis outputs. I also wanted to test whether the choice in boolean or bitwise operators made a difference in the resulting circuit like it did for the Verilog full adder.

Implementing in Bluespec gives us some more design choices. Bluespec’s richer type system distinguishes between booleans Bool and bits Bit#(1). Typically, I would prefer the bitwise implementation because semantically, the bits of a full adder generally represent parts of larger bit vector operands and sums.

But like in the above Verilog case, there may be performance implications in our downstream tools for using boolean versus bitwise operators. Until such a time that the performance quirk gets optimized out, I need to weigh the trade-offs between a more performant circuit with the boolean implementation, versus semantic accuracy with the bitwise implementation.

It may even turn out that it’s easier to work with the bitwise implementation, or that the quirk only appears when we’re synthesizing the full adder directly and not as a component. Because it’s only two gates, I’m leaning toward using the bitwise implementation for future components. In this section, we test both.

Switching between boolean and bitwise in Bluespec is a little trickier than in Verilog because I need to not only change the operators, but also the types. If you want the bitwise implementation, just replace Bool with Bit#(1) and the operators !=, &&, and || with ^, &, and |.

typedef struct {
    Bool sum;
    Bool c_out;
} FullAdderResult deriving (Bits, Eq);

interface FullAdder;
    method FullAdderResult exec(Bool a, Bool b, Bool c_in);
endinterface

(* synthesize, always_enabled, no_default_reset *)
module mkFullAdder(FullAdder);
    method FullAdderResult exec(Bool a, Bool b, Bool c_in);
        return FullAdderResult {
            sum : a != b != c_in,  // no logical xor
            c_out : (a&&b) || (a&&c_in) || (b&&c_in)
        };
    endmethod
endmodule

For such a simple design, the Bluespec compiles to circuits identical to those of the corresponding (bitwise or boolean) Verilog implementations, so I don’t bother reproducing the synthesis logs.

There are some minor differences in the visualizations:

The ordering of the operands (doesn’t matter in a full adder),
The {sum, c_out} are bused into a 2-bit output, and
If we don’t include no_default_reset and always_enabled attributes, there would be an unused RST_N and RDY_exec driver on the visualization.
- In the following visualizations, I omitted the attributes, so they don’t correspond exactly with the above excerpt. So, imagine there’s only the synthesize attribute.

Notice that the operands are prefixed with exec. That’s because this whole circuit corresponds to the exec method of the module. We’d have a different looking circuit if we had other methods or rules to synthesize.

Bitwise Implementation

SVG diagram of bitwise Bluespec full adder

Boolean Implementation

SVG diagram of boolean Bluespec full adder

You may also notice the unused RDY_exec. We can remove it by adding the always_enabled attribute next to the synthesize attribute, and it’ll be gone. It wouldn’t change the resulting circuit’s delay or area, since the unused RDY_exec signal gets optimized out anyway.

We could further remove the unused CLK and RST_N ports with the attributes no_default_clock and no_default_reset. We won’t remove the clock since the synth tool requires a clock port to synthesize a module. But there’s no reason why we can’t remove the RST_N.

I add the no_default_reset and always_enabled attributes into the Bluespec excerpt above, but I’ve kept the drivers in the visualizations so you can see what I’m talking about.

Resulting Verilog Files

For the above visualizations, I didn’t add any attributes other than synthesize. To generate the following Verilog, I added the always_enabled, no_default_reset attributes (just like the Bluespec excerpt above).

These Verilog files are generated by the Bluespec compiler for use in downstream tools like synth, other Verilog synthesis tools, or Verilog simulators.

Note that I present these files in the reverse order of the visualizations above.

Boolean Implementation

The compiled Verilog for such a simple circuit as the boolean implementation of the full adder is very legible, though it uses Verilog 1995 style declaration. The calculation of the carry also uses a boolean simplification.

module mkFullAdder(CLK,

		   exec_a,
		   exec_b,
		   exec_c_in,
		   exec);
  input  CLK;

  // value method exec
  input  exec_a;
  input  exec_b;
  input  exec_c_in;
  output [1 : 0] exec;

  // signals for module outputs
  wire [1 : 0] exec;

  // value method exec
  assign exec =
	     { (exec_a != exec_b) != exec_c_in,
	       exec_a && (exec_b || exec_c_in) || exec_b && exec_c_in } ;
endmodule  // mkFullAdder (boolean implementation)
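The simplification is just the distributive law: the compiler factors exec_a out of the first two product terms, which saves one AND. As a quick check of my own (the function names are hypothetical), the following Python compares the carry as written in the Bluespec source against the carry as emitted in the Verilog.

```python
from itertools import product


def carry_as_written(a, b, c_in):
    """(a && b) || (a && c_in) || (b && c_in): three ANDs, two ORs."""
    return (a and b) or (a and c_in) or (b and c_in)


def carry_as_compiled(a, b, c_in):
    """exec_a && (exec_b || exec_c_in) || exec_b && exec_c_in: two ANDs, two ORs."""
    return (a and (b or c_in)) or (b and c_in)


def simplification_is_sound():
    """True when both forms agree on all eight input combinations."""
    return all(
        carry_as_written(a, b, c) == carry_as_compiled(a, b, c)
        for a, b, c in product((False, True), repeat=3)
    )
```

Python’s `and` binds tighter than `or`, just as `&&` binds tighter than `||` in Verilog, so the second function reads the same way as the generated expression.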

Bitwise Implementation

Unfortunately, the bitwise implementation doesn’t result in as legible a Verilog file. The compiler makes liberal use of internal signals and wire instantiations.

There’s no boolean simplification like above. I would’ve originally guessed the lack of simplification is why the design costs more gates, but we saw earlier that this happens even when we write directly in Verilog, and we’ll see later that we sometimes regain efficiency with some strange wrapping.

module mkFullAdder(CLK,

		   exec_a,
		   exec_b,
		   exec_c_in,
		   exec);
  input  CLK;

  // value method exec
  input  exec_a;
  input  exec_b;
  input  exec_c_in;
  output [1 : 0] exec;

  // signals for module outputs
  wire [1 : 0] exec;

  // remaining internal signals
  wire x__h20, x__h37, x__h40, x__h52, x__h54, y__h53, y__h55;

  // value method exec
  assign exec = { x__h20, x__h40 } ;

  // remaining internal signals
  assign x__h20 = x__h37 ^ exec_c_in ;
  assign x__h37 = exec_a ^ exec_b ;
  assign x__h40 = x__h52 | y__h53 ;
  assign x__h52 = x__h54 | y__h55 ;
  assign x__h54 = exec_a & exec_b ;
  assign y__h53 = exec_b & exec_c_in ;
  assign y__h55 = exec_a & exec_c_in ;
endmodule  // mkFullAdder (bitwise implementation)

Wrappers around Bluespec

In Bluespec, we can wrap a module’s implementation in another module. It looks like this:

(* synthesize *)
module mkFullAdderWrapper(FullAdder);
    FullAdder _adder <- mkFullAdder;
    return _adder;
endmodule

The underlying Verilog instantiates the inner module and connects its ports with the external module’s ports. It’s all done in wires, so we might expect no difference in the resulting circuit.

In this section, I investigate whether there’s any overhead in synthesizing wrapped Bluespec. For thoroughness, I check nested wrappers too, like when we wrap a wrapper.

Losing Efficiency

When I experimented using synth, I saw using a wrapper can (but might not) affect the resulting circuit. Wrapping our boolean implementation gives us a 16-gate circuit (like with the bitwise implementation) instead of our original 14-gate circuit.

We might chalk this up to overhead from wrapping, but we shouldn’t be getting any overhead from just connecting wires.

It must have to do with the downstream tools. Similar to the boolean versus bitwise case, there’s something preventing the synthesis tool from optimizing the resulting gate placements.

module mkFullAdderWrapper(CLK,
			  RST_N,

			  exec_a,
			  exec_b,
			  exec_c_in,
			  exec,
			  RDY_exec);
  input  CLK;
  input  RST_N;

  // value method exec
  input  exec_a;
  input  exec_b;
  input  exec_c_in;
  output [1 : 0] exec;
  output RDY_exec;

  // signals for module outputs
  wire [1 : 0] exec;
  wire RDY_exec;

  // ports of submodule _unnamed_
  wire [1 : 0] _unnamed_$exec;
  wire _unnamed_$exec_a, _unnamed_$exec_b, _unnamed_$exec_c_in;

  // value method exec
  assign exec = _unnamed_$exec ;
  assign RDY_exec = 1'd1 ;

  // submodule _unnamed_
  mkFullAdder _unnamed_(.CLK(CLK),
			.exec_a(_unnamed_$exec_a),
			.exec_b(_unnamed_$exec_b),
			.exec_c_in(_unnamed_$exec_c_in),
			.exec(_unnamed_$exec));

  // submodule _unnamed_
  assign _unnamed_$exec_a = exec_a ;
  assign _unnamed_$exec_b = exec_b ;
  assign _unnamed_$exec_c_in = exec_c_in ;
endmodule  // mkFullAdderWrapper

I also tried adding a second layer of wrapper. If the first wrapper reduced performance (for unknown reasons), maybe a second wrapper would reduce performance even more.

(* synthesize *)
module mkFullAdderWrapper2(FullAdder);
    FullAdder _adder <- mkFullAdderWrapper;
    return _adder;
endmodule

But we didn’t lose performance! The resulting circuit is back to 14-gate, which is the same as the unwrapped boolean implementation.

At first, I found that wrapping three times gets us the 16-gate, and wrapping four times gets us the 14-gate. There was a cycle of gaining and losing performance, even when the Verilog for each layer of wrapper was practically identical to the last.

(When I went back to verify, the results changed, which I soon discuss.)

Gaining Efficiency

I ran the same wrapper experiment with the bitwise implementation. If synth gave us the 16-gate for bitwise, maybe we’d get 16-gate no matter the wrapper.

Surprisingly, adding a wrapper actually gave us the 14-gate circuit. The tool was telling us that our full adder was more performant with a wrapper. Adding more wrappers resulted in several 14-gate, and one 16-gate. There didn’t seem to be any pattern.

This is only if we don’t specify no_default_reset; otherwise they’re all 16-gate. (Don’t ask me why.)

Nondeterminism

The day after, I found that each arrangement of wrappers didn’t necessarily result in the same circuit as the day before. I don’t believe I really changed anything, so I wonder if it’s a nondeterministic bug.

It’s interesting that the mere action of adding more wrappers can be enough to massage the synthesis tool into giving us the more efficient 14-gate circuit. It shows that the downstream bug isn’t just restricted to the kind of operator you use.

The main takeaway is that we should be wary about how much stock we put into our synthesis numbers. Even for a circuit as simple as a full adder, there seems to be inefficient gate placement. For much more complex designs, we should consider the synthesis numbers to be only approximate, at least until we secure more sophisticated downstream synthesis tools.

Verilog Wrapped with Bluespec

Wrapping Bluespec modules in Bluespec can be useful, but the real use comes with wrapping other languages in Bluespec.

Bluespec offers support for bindings between Bluespec modules and Verilog modules (going down in abstraction, at the cost of productivity) or Bluespec functions and C functions (going up in abstraction, at the cost of performance).

For us, I’m focusing on wrapping Verilog because it might allow us to write more performant components to use in our Bluespec, like adders.

According to the BSC User Guide:

Using the import "BVI" syntax, a designer can specify that the implementation of a particular BSV module is an RTL (Verilog or VHDL) module, as described in the BSV Reference Guide. The module is treated exactly as if it were originally written in BSV and then converted to hardware by the compiler, but instead of the .v file being generated by the compiler, it was supplied independently of any BSV code. It may have been written by hand or supplied by a vendor as an IP, etc.

The main thing I’d like to see is whether the synthesis of a Bluespec-wrapped Verilog module is identical to a Verilog module synthesized directly. Given the above description, it should be, since it’s exactly what we practiced by playing with synth and Bluespec-wrapped Bluespec.

Let’s take our Verilog full adder and wrap it in Bluespec. Remember that each of the boolean implementations, in Verilog and in Bluespec, resulted in a 14-gate circuit. But with the capriciousness of the downstream synthesis, I would accept a 16-gate circuit too. This is especially true because we got 16-gate circuits from wrapping implementations that would’ve given us 14-gate circuits.

An import "BVI" statement also requires us to declare the mappings between the Bluespec interface and the Verilog ports. I’ve modified my Verilog full adder to output {sum, c_out} as a single reg [1:0] to be consistent with my Bluespec exec method, which packs the two values together. In Bluespec, FullAdderResult is a struct, but we implicitly pack/unpack to bits as necessary when we’re working with foreign modules.

module FullAdderVerilog(
    input CLK, a, b, c_in,
    output reg [1:0] out
    );
    always @(*) begin
        out[1] = a ^ b ^ c_in;
        out[0] = (a&&b) || (a&&c_in) || (b&&c_in);
    end
endmodule

import "BVI" FullAdderVerilog =
module mkFullAdderVerilog(FullAdder);
    method out exec(a, b, c_in);
endmodule
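The packing convention is easy to get backwards, so here’s a tiny Python model of it. It assumes what the excerpts above show: sum sits in bit 1 and c_out in bit 0, matching both out[1]/out[0] in the hand-written Verilog and the { sum, c_out } concatenation in the compiler’s output. The function names are my own.

```python
def pack_result(sum_bit, c_out):
    """Pack the two result bits as {sum, c_out}: sum in bit 1, c_out in bit 0."""
    return (sum_bit << 1) | c_out


def unpack_result(bits):
    """Recover (sum, c_out) from the packed 2-bit value."""
    return (bits >> 1) & 1, bits & 1


def round_trips():
    """True when unpacking a packed value returns the original bits."""
    return all(
        unpack_result(pack_result(s, c)) == (s, c)
        for s in (0, 1)
        for c in (0, 1)
    )
```

If the Verilog module assigned the bits in the opposite order, the Bluespec side would silently read the carry as the sum, so this is worth checking whenever a struct crosses the import boundary.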

We can’t directly synthesize foreign modules, but we can wrap them and synthesize the wrapper.

(* synthesize *)
module mkFullAdderVerilogWrapper(FullAdder);
    FullAdder _adder <- mkFullAdderVerilog;
    return _adder;
endmodule

After synthesis, I found there’s no overhead to wrapping the Verilog, but the same quirks from wrapping Bluespec reappeared to give us either 14-gate or 16-gate circuits. We should be good to go in terms of embedding Verilog into our Bluespec designs.

The main drawback of this is that while importing Verilog is fine for using the Verilog backend for Bluespec (e.g., to run simulations with Verilog tools), it doesn’t work for using the Bluesim backend, which requires all modules to be implemented in Bluespec and compiled into .ba files. We would need to either re-implement the Verilog modules in Bluespec with conditional compilation, find a Verilog simulator, or not use Verilog implementations at all.

If using the Bluespec-recommended method of conditional compilation, we need to be extra careful that our Verilog implementation of a module is cycle-equivalent to our Bluespec implementation of that same module. Otherwise, we may run into trouble with correctness when we simulate with Bluesim and find our results to be different than our results in, say, Vivado. However, I think whatever can be implemented in Verilog can usually be implemented more easily in Bluespec.

If it turns out that implementing in Verilog gives us no benefit over implementing in Bluespec, then I might just stick with Bluespec implementations for use in Bluesim. The full adder example gave no evidence of greater overhead in Bluespec, so at least it’s clean enough for simple modules.

Next Time

This time, we tweaked synth to work better for our goals, and we did some investigation on the interplay between Bluespec, Verilog, and the synth tool.

Next time, we can see about implementing adders in both Bluespec and Verilog, which synth allows us to quantitatively evaluate. For correctness, we’ll check against the built-in + operator (it looks like Bluespec’s + just wraps around Verilog’s +) as we implement a simple ripple-carry adder and several types of carry-lookahead adder.
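As a preview of that correctness check, here’s a software-only Python sketch, not the Bluespec or Verilog I’ll actually write: it chains full adders into a ripple-carry adder and compares every result against the built-in + operator. The function names and the 4-bit default width are arbitrary choices for illustration.

```python
def full_adder(a, b, c_in):
    """Single-bit full adder: returns (sum, c_out)."""
    return a ^ b ^ c_in, (a & b) | (a & c_in) | (b & c_in)


def ripple_carry_add(x, y, width):
    """Add two width-bit integers one bit at a time, passing the carry along."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


def check_against_plus(width=4):
    """Exhaustively compare the ripple-carry adder with the built-in + operator."""
    for x in range(1 << width):
        for y in range(1 << width):
            result, carry = ripple_carry_add(x, y, width)
            # The carry-out is the extra top bit of the full-precision sum.
            assert (carry << width) | result == x + y
    return True
```

The hardware testbenches would follow the same idea: drive both the hand-built adder and + with the same operands and compare the outputs.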

As we implement adders, I’ll continue to evaluate synthesis differences between Bluespec and Verilog. If performance permits, we might end up not actually needing to use any Verilog implementations in our processor, allowing us to maintain a strictly Bluespec code base.

We’ll see about using these adders later on in our multiplication unit and in other places.