View on GitHub | View on Website
A series of tutorials/study-notes that help you understand the internals of tinygrad and equip you to start contributing to it.
Recent updates:
Computer algebra study notes: these may not be directly related to tinygrad, but many of tinygrad's optimizations rely on computer algebra, so feel free to check them out.
Fundamentals (best read in order):
Miscellaneous topics:
1. ~~Shapetracker allows for zero-cost movement ops~~
1. ~~How dimension merging works~~
1. Loop unrolling (upcast) and the underlying Symbolic library
Tinygrad is a deep learning framework akin to PyTorch, XLA, and ArrayFire, but it sets itself apart by being more user-friendly, faster, and by making fewer assumptions about the specifics of your hardware.
Like PyTorch, tinygrad offers a user-friendly frontend, and it makes model training and inference more efficient by using lazy evaluation on the GPU. This approach compiles your model into highly optimized GPU code that can extend across multiple devices, saving both time and money.
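To make the idea of lazy evaluation concrete, here is a minimal sketch in plain Python. It is not tinygrad code: the `LazyValue` class, its `realize` method, and the `executed` log are hypothetical names invented for this illustration. Operations only record a node in an expression graph, and arithmetic runs when the result is explicitly requested.

```python
executed = []  # hypothetical log of every op that actually runs


class LazyValue:
    """Toy lazy value: operators build a graph instead of computing."""

    def __init__(self, op, srcs=(), value=None):
        self.op, self.srcs, self.value = op, srcs, value

    @staticmethod
    def const(x):
        return LazyValue("const", value=float(x))

    def __add__(self, other):
        return LazyValue("add", (self, other))  # record the op, do no math

    def __mul__(self, other):
        return LazyValue("mul", (self, other))  # record the op, do no math

    def realize(self):
        """Walk the graph and compute the result on demand."""
        if self.op == "const":
            return self.value
        a, b = (src.realize() for src in self.srcs)
        executed.append(self.op)
        return a + b if self.op == "add" else a * b


x = LazyValue.const(2)
y = LazyValue.const(3)
z = (x + y) * y                # builds a graph; nothing has run yet
ops_before = list(executed)    # still empty at this point
result = z.realize()           # the computation happens here
print(ops_before, result, executed)
```

Because the whole graph is known before anything runs, a real framework has the chance to optimize it as a unit (for example by fusing operations) before generating device code; this toy version simply evaluates it recursively.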
Moreover, tinygrad separates the machine learning software from the computing hardware. Many ML frameworks are designed primarily for CUDA, which assumes you will run on Nvidia GPUs. That assumption can make it hard to move to other hardware later. Numerous GPU manufacturers are advancing rapidly and pricing competitively to offer comparable computing power at lower cost, so keeping your software stack hardware-agnostic is an essential way to future-proof it.
This is where tinygrad truly shines. Our approach is to compile machine learning models into a highly optimized Intermediate Representation (IR), which we then translate directly into GPU-specific instructions. Our goal is to drill down to the lowest possible level of instruction: PTX for Nvidia, KFD for AMD, and Metal for Apple devices. Targeting the foundational layers of the stack improves compatibility across hardware platforms and unlocks significant performance improvements. It also makes the system more stable and reduces ongoing maintenance effort.
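The "one IR, many backends" idea can be sketched as follows. This is a toy under stated assumptions, not tinygrad's actual IR or code generator: the tuple-based `IR`, `render_c`, and `render_cuda` are hypothetical, and the sketch emits C and CUDA source text for readability rather than the lower-level targets named above.

```python
# Hypothetical toy IR: (destination, operation, left operand, right operand).
IR = [
    ("t0", "mul", "a[i]", "b[i]"),
    ("t1", "add", "t0", "c[i]"),
]
OPERATORS = {"add": "+", "mul": "*"}
PARAMS = "float* out, const float* a, const float* b, const float* c, int n"


def render_body(ir, indent):
    """Turn each IR instruction into one C-like statement."""
    lines = [
        f"{indent}float {dest} = {lhs} {OPERATORS[op]} {rhs};"
        for dest, op, lhs, rhs in ir
    ]
    lines.append(f"{indent}out[i] = {ir[-1][0]};")  # store the last result
    return "\n".join(lines)


def render_c(ir):
    """CPU-style backend: a plain loop over all elements."""
    return (
        f"void kernel({PARAMS}) {{\n"
        f"  for (int i = 0; i < n; i++) {{\n"
        f"{render_body(ir, '    ')}\n"
        f"  }}\n}}"
    )


def render_cuda(ir):
    """GPU-style backend: one thread per element instead of a loop."""
    return (
        f"__global__ void kernel({PARAMS}) {{\n"
        f"  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        f"  if (i < n) {{\n"
        f"{render_body(ir, '    ')}\n"
        f"  }}\n}}"
    )


c_source = render_c(IR)
cuda_source = render_cuda(IR)
print(c_source)
print(cuda_source)
```

Both renderers share `render_body`, so the computation is described once and only the hardware-specific wrapper differs. That is the sense in which an IR decouples the model from the device it eventually runs on.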