tinygrad-notes

Recall our dot product example,

from tinygrad import Tensor

a = Tensor([1,2,3,4])
b = Tensor([5,6,7,8])
a = a.dot(b)
a.numpy()

I want to illustrate one of the common optimization that made the generated kernel code go from this (when run with NOOPT=1 DEBUG=5 python script.py):

kernel void r_4(device int* data0, const device int* data1, const device int* data2, uint3 gid [[threadgroup_position_in_grid]], uint3 lid [[thread_position_in_threadgroup]]) {
  int acc0 = 0;
  for (int ridx0 = 0; ridx0 < 4; ridx0++) {
    int val0 = *(data1+ridx0);
    int val1 = *(data2+ridx0);
    acc0 = ((val0*val1)+acc0);
  }
  *(data0+0) = acc0;
}

to this (when run with DEBUG=5 python script.py):

kernel void r_4(device int* data0, const device int* data1, const device int* data2, uint3 gid [[threadgroup_position_in_grid]], uint3 lid [[thread_position_in_threadgroup]]) {
  int acc0 = 0;
  int val0 = *(data1+0);
  int val1 = *(data1+1);
  int val2 = *(data1+2);
  int val3 = *(data1+3);
  int val4 = *(data2+0);
  int val5 = *(data2+1);
  int val6 = *(data2+2);
  int val7 = *(data2+3);
  *(data0+0) = ((val3*val7)+(val2*val6)+(val1*val5)+(val0*val4)+acc0);
}

The NOOPT flag tells tinygrad not to do any optimization, so we get a plain loop construct, whereas by default it will see that having the loop is not not necessary for a plain addition of only 4 elements.

If you recall, the operation is converted into the IR UOps before being converted into GPU code. In fact, the IR is where the optimization takes place. When we run the script without optimization, this is the IR:

UOps.DEFINE_GLOBAL  : ptr.dtypes.int            []                               (0, 'data0', True)
UOps.DEFINE_GLOBAL  : ptr.dtypes.int            []                               (1, 'data1', False)
UOps.DEFINE_GLOBAL  : ptr.dtypes.int            []                               (2, 'data2', False)
UOps.DEFINE_ACC     : dtypes.int                []                               0
UOps.CONST          : dtypes.int                []                               0
UOps.CONST          : dtypes.int                []                               4
UOps.LOOP           : dtypes.int                [4, 5]                           None
UOps.LOAD           : dtypes.int                [1, 6]                           None
UOps.LOAD           : dtypes.int                [2, 6]                           None
UOps.ALU            : dtypes.int                [7, 8]                           BinaryOps.MUL
UOps.ALU            : dtypes.int                [9, 3]                           BinaryOps.ADD
UOps.PHI            : dtypes.int                [3, 10, 6]                       None
UOps.ENDLOOP        :                           [6]                              None
UOps.STORE          :                           [0, 4, 11]                       None

You can find more details on what they mean in this and this guide, but the key here is we see the LOOP and ENDLOOP instruction. In the second version, however, we don’t see it:

UOps.DEFINE_GLOBAL  : ptr.dtypes.float          []                               (0, 'data0', True)
UOps.DEFINE_GLOBAL  : ptr.dtypes.float          []                               (1, 'data1', False)
UOps.DEFINE_GLOBAL  : ptr.dtypes.float          []                               (2, 'data2', False)
UOps.DEFINE_ACC     : dtypes.float              []                               0.0
UOps.CONST          : dtypes.int                []                               0
UOps.CONST          : dtypes.int                []                               3
UOps.LOOP           : dtypes.int                [4, 5]                           None
UOps.LOAD           : dtypes.float              [1, 6]                           None
UOps.LOAD           : dtypes.float              [2, 6]                           None
UOps.ALU            : dtypes.float              [7, 8]                           BinaryOps.MUL
UOps.ALU            : dtypes.float              [9, 3]                           BinaryOps.ADD
UOps.PHI            : dtypes.float              [3, 10, 6]                       None
UOps.ENDLOOP        :                           [6]                              None
UOps.STORE          :                           [0, 4, 11]                       None

The uops are generated via the .linearize() method invocation in the Linearizer instance, in fact, we can construct this directly:

from tinygrad import Tensor
from tinygrad.engine.schedule import create_schedule
from tinygrad.codegen.linearizer import Linearizer
a = Tensor([1,2,3,4]).realize()
b = Tensor([5,6,7,8]).realize()
c = a.dot(b)
s = create_schedule([c.lazydata])[0]
k = Linearizer(*s.ast)
k.linearize()
k.uops.print()

We have to add .realize() to the input to have their data ready in this simplified version which doesn’t have the helper code that handle them. Run this script directly without any flags python script.py, and you get the same uops output:

UOps.DEFINE_GLOBAL  : ptr.dtypes.float          []                               (0, 'data0', True)
UOps.DEFINE_GLOBAL  : ptr.dtypes.float          []                               (1, 'data1', False)
UOps.DEFINE_GLOBAL  : ptr.dtypes.float          []                               (2, 'data2', False)
UOps.DEFINE_ACC     : dtypes.float              []                               0.0
UOps.CONST          : dtypes.int                []                               0
UOps.CONST          : dtypes.int                []                               3
UOps.LOOP           : dtypes.int                [4, 5]                           None
UOps.LOAD           : dtypes.float              [1, 6]                           None
UOps.LOAD           : dtypes.float              [2, 6]                           None
UOps.ALU            : dtypes.float              [7, 8]                           BinaryOps.MUL
UOps.ALU            : dtypes.float              [9, 3]                           BinaryOps.ADD
UOps.PHI            : dtypes.float              [3, 10, 6]                       None
UOps.ENDLOOP        :                           [6]                              None
UOps.STORE          :                           [0, 4, 11]                       None

Now if you want to see the generated kernel code on Metal (this part was briefly covered in here and here):

from tinygrad.renderer.cstyle import uops_to_cstyle, MetalLanguage
print(uops_to_cstyle(MetalLanguage(), 'codegen', k.uops.uops))

The trick to unroll the loop is called upcast:

k = Linearizer(*s.ast)
k.upcast() # ---> insert this
k.linearize()

and the generated uops and kernel code become:

UOps.DEFINE_GLOBAL  : ptr.dtypes.float          []                               (0, 'data0', True)
UOps.DEFINE_GLOBAL  : ptr.dtypes.float          []                               (1, 'data1', False)
UOps.DEFINE_GLOBAL  : ptr.dtypes.float          []                               (2, 'data2', False)
UOps.DEFINE_ACC     : dtypes.float              []                               0.0
UOps.CONST          : dtypes.int                []                               0
UOps.LOAD           : dtypes.float              [1, 4]                           None
UOps.CONST          : dtypes.int                []                               1
UOps.LOAD           : dtypes.float              [1, 6]                           None
UOps.CONST          : dtypes.int                []                               2
UOps.LOAD           : dtypes.float              [1, 8]                           None
UOps.LOAD           : dtypes.float              [2, 4]                           None
UOps.LOAD           : dtypes.float              [2, 6]                           None
UOps.LOAD           : dtypes.float              [2, 8]                           None
UOps.ALU            : dtypes.float              [5, 10]                          BinaryOps.MUL
UOps.ALU            : dtypes.float              [7, 11]                          BinaryOps.MUL
UOps.ALU            : dtypes.float              [9, 12]                          BinaryOps.MUL
UOps.ALU            : dtypes.float              [13, 3]                          BinaryOps.ADD
UOps.ALU            : dtypes.float              [14, 16]                         BinaryOps.ADD
UOps.ALU            : dtypes.float              [15, 17]                         BinaryOps.ADD
UOps.STORE          :                           [0, 4, 18]                       None

kernel void codegen(device int* data0, const device int* data1, const device int* data2, uint3 gid [[threadgroup_position_in_grid]], uint3 lid [[thread_position_in_threadgroup]]) {
  int acc0 = 0;
  int val0 = *(data1+0);
  int val1 = *(data1+1);
  int val2 = *(data1+2);
  int val3 = *(data1+3);
  int val4 = *(data2+0);
  int val5 = *(data2+1);
  int val6 = *(data2+2);
  int val7 = *(data2+3);
  *(data0+0) = ((val3*val7)+(val2*val6)+(val1*val5)+(val0*val4)+acc0);
}

How the internal works

So how does it work? The upcast method increments an attribute called .upcasted and when the linearize method is called, it checks for that attribute and manipulate the uops.

  def upcast(self):
    check(self.full_shape[-1] != 1, "can't upcast a dimension with size 1")
    self.upcasted += 1

The controller is the upcasted attribute, you can actually increase the number of elements to a crazy amount and see that if you just call upcast(), it will unroll. A heuristic check for the size of such a loop must be checked before you call upcast().

Let’s see the relevant portion of linearize() that acts on this attribute. Recall that we have a reduce op SUM, to combine the multiplication result (you may want to refer to the scheduleitem post for more details). And the part of the linearize that handles it is here:

    fake_reduce_idxs: List[Variable] = []
    for reduceop in [self.reduceop] if self.reduceop is not None else []:
      acc,loaded_buffers,fake_reduce_idxs,local_idxs,upcast_idxs = self.render_reduceop(reduceop,loaded_buffers,global_idxs,local_idxs,upcast_idxs)

Up until this point, our uops contain only the DEFINE_GLOBALs:

The render_reuceop method will complete the entire loop construct, you can see that after its execution, our uops will have everything except the STORE:

Let’s look inside render_reduceop:

  def render_reduceop(self, reduceop: LazyOp, loaded_buffers:Dict[Union[MemBuffer, ConstBuffer, LocalBuffer], List[UOp]], \
                      global_idxs, local_idxs, upcast_idxs):
    acc = self.global_load(out_buf, global_idxs+local_idxs+fake_reduce_idxs+upcast_idxs, self.get_reduce_acc(reduceop))
    loop_ctx = self.render_loop(reduce_idxs)

    loaded_buffers = {}
    for i, b in enumerate(self.bufs):
      if b in self.earlybufs:
        _loaded_buffer = self.global_load(
          self.bufs.index(self.local_alias[i]) if i in self.local_alias else i,
          global_idxs+local_idxs+reduce_idxs+full_upcast_idxs
        )
        loaded_buffers[b] = _loaded_buffer

    self.ast_parse(reduceop, acc, self.acc_offsets(self.full_buf_index), loaded_buffers, do_reduce=True, loop_ctx=loop_ctx)

The snippet is pseudocode but it outlines the major events that happen in the method. We first define an accumulator, which will add the “DEFINE_ACC” to list of uops. Then we call render_loop that actually inserts the “LOOP” uop. After that we enumerate through the buffers and find the one that’s in “earlybufs”. What is earlybufs? It was defined when the linearizer was constructed:

    self.earlybufs = [x.arg for x in self.reduceop.lazyops if x.op in BufferOps] if self.reduceop else []

It is essentially the buffers that serve as input to the SUM operation, in which case, the two vectors we pass in, and are defined as data1 and data2 in the form of “DEFINE_GLOBALS”. So the for loop will add two “LOAD” uops, with their input coming from the “LOOP” and “DEFINE_GLOBAL” uops.

By this point, our uops look like this, note the breakpoint is at self.ast_parse.

The ast_parse method will take the reduce op, our sum operation, and walk down the AST tree to start inserting ALU ops:

  def ast_parse(self, x:LazyOp, acc: List[UOp], offs:Optional[List[int]], loaded_buffers:Dict[Union[MemBuffer, ConstBuffer, LocalBuffer], List[UOp]], do_reduce=False, loop_ctx=tuple(), cache=None) -> List[UOp]: # noqa: E501
    values = [self.ast_parse(v, acc, offs, loaded_buffers, loop_ctx=loop_ctx, cache=cache) for v in x.src]
    ops = {ReduceOps.SUM:BinaryOps.ADD, ReduceOps.MAX:BinaryOps.MAX}
    if x.op in ops:
      #...
    else:
      ret = [self.uops.add(UOps.ALU, dtypes.bool if x.op in {BinaryOps.CMPLT, BinaryOps.CMPEQ} else val[-1].dtype, val, x.op) for val in zip(*values)]
    return ret

Here I outlined the pseudocode for a base case (note that it is a recursion), you can see that if it’s an ADD operation, then an ALU node will be inserted.

The interesting part now comes when we have called the upcast() method. Let’s see how things change.

We start at the same for loop as before

    for reduceop in [self.reduceop] if self.reduceop is not None else []:
      acc,loaded_buffers,fake_reduce_idxs,local_idxs,upcast_idxs = self.render_reduceop(reduceop,loaded_buffers,global_idxs,local_idxs,upcast_idxs)

reduceop will be the same, with the identical AST for this SUM operation. The difference is in the upcast_idxs, where previously it was empty, now it has the value of [NumNode(0)].

  def render_reduceop(self, reduceop: LazyOp, loaded_buffers:Dict[Union[MemBuffer, ConstBuffer, LocalBuffer], List[UOp]], \
                      global_idxs, local_idxs, upcast_idxs):
    # define indicies
    full_upcast_idxs = [Variable(f"_uidx{i}", 0, s-1) for i, s in enumerate(self.full_shape[self.shape_len-self.upcasted:])]
    reduce_idxs = [Variable(f"ridx{i}", 0, self.full_shape[i]-1) for i in range(self.first_reduce+self.group_for_reduces, self.shape_len-self.upcasted)]  # noqa: E501
    fake_reduce_idxs = [x*0 for x in reduce_idxs]

In our previous run, the full_upcast_idxs would be empty, and the reduce_idxs would have one element. Now it is the opposite, full_upcast_idxs has the value [Variable('_uidx0', 0, 2)] and reduce_idxs is empty. What the content mean will become clearer later, but do note that the length of it controls whether the loop form or not. Having an empty reduce_idxs will not result in any “LOOP” uop being added.

In the next step, the buffers are enumerated the same way, except full_upcast_idxs have content now:

      loaded_buffers = {}
      for i, b in enumerate(self.bufs):
        if b in self.earlybufs:
          _loaded_buffer = self.global_load(
            self.bufs.index(self.local_alias[i]) if i in self.local_alias else i,
            global_idxs+local_idxs+reduce_idxs+full_upcast_idxs
          )
          loaded_buffers[b] = _loaded_buffer

I didn’t go into details of what global_load does in the loop version, because it is conceptually straightforward. In each loop iteration, we need to access the data from input buffers, by indexing into the current loop index. However, in the upcasted version, the access pattern is more complex, and the entrypoint to that is by passing in the full_upcast_idxs and global_load will branch into a different execution path. Let me outline the major step of execution in global_load:

  def global_load(self, i:int, idxs:List[Node], acc=None, barrier:Optional[UOp]=None) -> List[UOp]:
    e_idxs, e_valids = expand_node(g_idx, expand_vars), expand_node(g_valid, expand_vars)
    for idx, valid, rep_idx in zip(e_idxs, e_valids, iter_idxs(expand_vars)):
      buf_uop = self.buf_uops[i]
      assert buf_uop is not None, f"buffer {i} wasn't UOped"
      rendered_idx = idx.render(self.render_ops, self)
      valid_tuple = (valid.render(self.render_ops, self), self.const(invalid_value, localtype)) if valid.min == 0 else tuple()
      self.load_cache[key] = self.uops.add(UOps.LOAD, localtype, (buf_uop, rendered_idx) + valid_tuple + ((barrier,) if barrier else ()))

We can work backward to see that in the first version, the for loop is executed only once, so we get a single LOAD. Now, we have to unroll this loop so all of the content has a LOAD, our input has 3 elements, so the for loop must be executed three times. You can see that the length of e_idxs is in fact 3:

And the number also tells you that for each iteration, the idx is the position of the input array that you should index into, so we will get the data at position 0, 1 and 2.

If you look at the argument passed to self.uops.add(UOps.LOAD, ...), you should see that there’s a rendered_idx, this would be the input to the LOAD operation, and the rendered_idx is in fact the instance of the uop. It is fetched by the idx.render() method:

  def render(self, ops=None, ctx=None) -> Any:
    if ops is None: ops = render_python
    assert self.__class__ in (Variable, NumNode) or self.min != self.max
    return ops[type(self)](self, ops, ctx)

This is its definition, ops is a list of predefined instruction for how to render certain stuff, but the important thing is ctx, which is our linearizer instance, and it contains all its existing uops on the .uops attribute. If you look at one of the render instruction, you can see that it is indeed indexing into this attribute:

    NumNode: lambda self, ops, ctx: ctx.const(self.b)

ctx.const is a method on the linearizer:

  def const(self, b:ConstType, dtype:DType=dtypes.int32, insert_before=None) -> UOp:
    return self.uops.add(UOps.CONST, dtype, tuple(), b, insert_before=insert_before)

and there we get the CONST uops. Iterate it three times, we will have written all the LOAD ops, and by the time we get to ast_parse, our uops look like this:

We have six LOADs, corresponding to the six elements in the input array. And three CONST, corresponding to the three index positions (0, 1, 2) to access the element from. And the AST is now more straightforward, you simply traverse down the tree, and their input will lead you to the correct uop and things will just keep appending to the list.

The key takeaway here is that the loop unrolling is controlled by setting the upcasted attribute. Then, the reduce_idxs and full_upcasted_idxs will be determined accordingly, which control whether a LOOP is constructed or not, and how to access the element in the inputs.