Each uop represent an operation in tinygrad’s intermediate representation, also known as the graph. They are inserted as follows:
from tinygrad.codegen.uops import UOpGraph, UOps
g = UOpGraph()
c0 = g.add(UOps.CONST, dtypes.int, arg=0)
and can be rendered into target platform’s code as such:
s = uops_to_cstyle(MetalLanguage(), 'tester', g)
The main way to modify the graph is by calling the add
method with the
following arguments:
dtype
: this specifies the data type. E.g. dtypes.int
, dtypes.float
vin
: this specifies the input to the UOp. For example, in a multiplication
operation, the vin
would be a tuple containing the two input operands.arg
: this is the argument to the UOp. Its content is specific to each
uop and is utilized during the code generation process. For example,
a global variable may specify a tuple, with one of the elements being
the name of the variable, then the code generation process will extract
that element and use it to generate the variable name.Uops.CONST declares a constant variable. There are two required parameters
in add
: dtype
and arg
.
Example:
c0 = g.add(UOps.CONST, dtypes.int, arg=10)
c0
can then be used as inputs for other UOp.
UOps.DEFINE_GLOBAL declares a global variable. It is used as the parameter list for the function.
arg
: a three element tuple:
0
: the index of the parameter in the parameter list1
: the name of the parameter2
: whether the parameter is mutablevin
: omitted or an empty tupleFor example, when declaring a parameter that will be passed to the kernel:
c1 = g.add(UOps.DEFINE_GLOBAL, dtype=dtypes.int, vin=(), arg=(1, "data0", True)
As its name suggests, they set up loop.
LOOP:
vin
:
0
: The start value of the loop (must be CONST UOp)1
: end value (must be CONST UOp)ENDLOOP:
vin
:
0
: The loop UOp
Example:c0 = g.add(UOps.CONST, dtypes.int, arg=0)
c1 = g.add(UOps.CONST, dtypes.int, arg=10)
loop = g.add(UOps.LOOP, dtype=dtypes.int, vin=(c0, c1))
endloop = g.add(UOps.ENDLOOP, vin=(loop,))
The rendered loop looks like this:
for (int ridx0 = 0; ridx0 < 10; ridx0++) {
}
STORE is for writing value to the output, which comes in the form of a parameter passed to the kernel function.
dtype
: Nonevin
: Nonearg
: Three values must be UOp instance
0
: the UOp for the output1
: the index position in the output to store the value in2
: the value to storeExample
c1 = g.add(UOps.DEFINE_GLOBAL, dtype=dtypes.int, vin=(), arg=(0, "data0", True))
c2 = g.add(UOps.CONST, dtype=dtypes.int, arg=0)
c3 = g.add(UOps.CONST, dtype=dtypes.int, arg=10)
store = g.add(UOps.STORE, vin=(c1, c2, c3))
and it will render:
kernel void tester(constant int& data0, uint3 gid [[threadgroup_position_in_grid]], uint3 lid [[thread_position_in_threadgroup]]) {
*(data0+0) = 10;
}
This allows for indexing a value from the input given the offset
vin
:
0
: The input value1
: The offsetExample (see the ALU example for the generated code):
input_value = g.add(UOps.DEFINE_GLOBAL, dtype=dtypes.int, vin=(), arg=(2, "data2", False))
position = g.add(UOps.CONST, dtype=dtypes.int, arg=0)
loaded = g.add(UOps.LOAD, dtype=dtypes.int, vin=(input_value, position))
ALU is for arithmetic, logical, and bitwise operations.
vin
:
0
: the zeroth operand1
: the first operandarg
:
0
: the operation typeIt is usually used in conjunction with other ops, for example, to load the first element from two input arrays and add them together:
c1 = g.add(UOps.DEFINE_GLOBAL, dtype=dtypes.int, vin=(), arg=(0, "data0", True))
x1 = g.add(UOps.DEFINE_GLOBAL, dtype=dtypes.int, vin=(), arg=(1, "data1", False))
x2 = g.add(UOps.DEFINE_GLOBAL, dtype=dtypes.int, vin=(), arg=(2, "data2", False))
pos_input = g.add(UOps.CONST, dtype=dtypes.int, arg=0)
x1_loaded = g.add(UOps.LOAD, dtype=dtypes.int, vin=(x1, pos_input))
x2_loaded = g.add(UOps.LOAD, dtype=dtypes.int, vin=(x2, pos_input))
c4 = g.add(UOps.ALU, dtype=dtypes.int, vin=(x1_loaded, x2_loaded), arg=BinaryOps.ADD)
pos = g.add(UOps.CONST, dtype=dtypes.int, arg=0)
store = g.add(UOps.STORE, vin=(c1, pos, c4))
GPU kernels are usually executed in SIMT fashion, meaning each thread will need to identify itself among all the other threads, such that it can fetch the correct data. In the ALU example above, we are explicitly fetching the zeroth element via the CONST UOp, but we might want to declare a UOp that fetches element based on the threadID.
arg
:
0
: incremental index among all the special uop1
: name of the index2
: Upper limit (exclusive)Example:
position = g.add(UOps.SPECIAL, dtype=dtypes.int, arg=(0, "gidx0", 10))
This means the thread is launched in a group containing ten threads, and each thread will get the value by iterating from 0 to 10 (exclusive). We can now modify the ALU example:
c1 = g.add(UOps.DEFINE_GLOBAL, dtype=dtypes.int, vin=(), arg=(0, "data0", True))
x1 = g.add(UOps.DEFINE_GLOBAL, dtype=dtypes.int, vin=(), arg=(1, "data1", False))
x2 = g.add(UOps.DEFINE_GLOBAL, dtype=dtypes.int, vin=(), arg=(2, "data2", False))
pos_input = g.add(UOps.SPECIAL, dtype=dtypes.int, arg=(0, "gidx0", 10))
x1_loaded = g.add(UOps.LOAD, dtype=dtypes.int, vin=(x1, pos_input))
x2_loaded = g.add(UOps.LOAD, dtype=dtypes.int, vin=(x2, pos_input))
c4 = g.add(UOps.ALU, dtype=dtypes.int, vin=(x1_loaded, x2_loaded), arg=BinaryOps.ADD)
pos = g.add(UOps.CONST, dtype=dtypes.int, arg=0)
store = g.add(UOps.STORE, vin=(c1, pos, c4))
and the generated code becomes:
kernel void tester(constant int& data0, constant int& data1, constant int& data2, uint3 gid [[threadgroup_position_in_grid]], uint3 lid [[thread_position_in_threadgroup]]) {
int gidx0 = gid.x; /* 10 */
int val0 = *(data1+gidx0);
int val1 = *(data2+gidx0);
*(data0+0) = (val0+val1);
}