.. note::

    This tutorial is incomplete.

.. _tutorial-kernels:

===========================
Custom Kernels (Incomplete)
===========================

This tutorial is first going to discuss compiling C++ into a Python module, then using C++ to do PyTorch computations, and finally using CUDA to do PyTorch computations.

In most any program you care to write, a small part of the code will make up the overwhelming majority of the runtime. The idea behind megastep is that you can write *almost* all of your environment in PyTorch, and then write the small, majority-of-the-runtime bit in CUDA.

While megastep's :func:`~megastep.cuda.render` and :func:`~megastep.cuda.physics` calls make up the slow bits of the environments I've been prone to write, it's not likely they cover all of your use cases. In fact, if you're reading this it probably means you've decided that they *don't* cover your use cases. So this tutorial is about writing your own.

There is not much in this tutorial that isn't in `the official PyTorch extension tutorial <https://pytorch.org/tutorials/advanced/cpp_extension.html>`_. If you find yourself confused about something written here, you can get another perspective on it there. However, that tutorial spends a lot of time discussing things like gradients that aren't as interesting to us.

Prerequisites
*************

TODO-DOCS Explain the prerequisites

While I usually do my Python development in a Jupyter notebook, when messing with C++ I'd recommend running most of your tests from the terminal. In a notebook, a failed compilation can sometimes be silently 'covered' by torch loading an old version of your module, and that way madness lies. Better to run things in a terminal a la

.. code-block:: shell

    python -c "print('hello world')"

and never have to worry about restarting the kernel after every compilation cycle.

Turning C++ into Python
***********************

For our first trick, we're going to send data from Python to C++, do some computation in C++, and then get the result back in Python.

Now make yourself a ``wrappers.cpp`` file in your working directory with the following strange incantations:

.. code-block:: cpp

    #include <torch/extension.h>

    int addone(int x) {
        return x + 1;
    }

    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
        m.def("addone", &addone);
    }

Let's work through this.

* ``#include <torch/extension.h>``: This header pulls in a lot of PyTorch's `C++ API <https://pytorch.org/cppdocs/>`_, but more importantly it pulls in `pybind <https://pybind11.readthedocs.io/>`_. Pybind is, in a word, magic. It lets you package C++ code up into Python modules, and goes a long way towards automating the conversion of Python objects into C++ types and vice versa.
* ``addone``: Next we define a function that we'd like to call from Python.
* ``PYBIND11_MODULE``: Then we invoke `pybind's module creation macro <https://pybind11.readthedocs.io/en/stable/basics.html>`_. It takes the name of the module (``TORCH_EXTENSION_NAME``, which evaluates to a specific torch-provided name) and provides a variable - ``m`` - that'll be used to identify which bits of C++ need to be hooked up to Python.
* ``m.def``: Finally, we specify the address of the thing we want to call from Python - ``&addone`` - and we specify the name that thing should be known by on the Python side - ``"addone"``.
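``m.def`` will also take a few extra arguments beyond the address and the name, and they're worth knowing about early. As a sketch - the ``py`` alias and the docstring text below are my own additions, not part of megastep - here's the same module with a docstring and a named argument attached:

.. code-block:: cpp

    #include <torch/extension.h>

    namespace py = pybind11;

    int addone(int x) {
        return x + 1;
    }

    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
        // The extra arguments attach a docstring and a named argument, so
        // help(...) and keyword calls both work on the Python side.
        m.def("addone", &addone, "Add one to x.", py::arg("x"));
    }

With that in place, ``help`` on the bound function shows the docstring, and ``addone(x=1)`` works as a keyword call, just like a native Python function.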
Now, the Python side. Make a ``compiler.py`` file in the same directory containing::

    import torch.utils.cpp_extension
    import sysconfig

    [torch_libdir] = torch.utils.cpp_extension.library_paths()
    python_libdir = sysconfig.get_config_var('LIBDIR')
    libpython_ver = sysconfig.get_config_var('LDVERSION')

    cuda = torch.utils.cpp_extension.load(
        name='testkernels',
        sources=['wrappers.cpp'],
        extra_cflags=['-std=c++17'],
        extra_cuda_cflags=['-std=c++14', '-lineinfo', '--use_fast_math'],
        extra_ldflags=[
            f'-lpython{libpython_ver}', '-ltorch', '-ltorch_python', '-lc10_cuda', '-lc10',
            f'-L{torch_libdir}', f'-Wl,-rpath,{torch_libdir}',
            f'-L{python_libdir}', f'-Wl,-rpath,{python_libdir}'])

Almost all of this is boilerplate C++ compilation voodoo; the only really important bits to note are the name - which is what our new C++ module will be added to the import system under - and the list of source files. I explain the rest of the options :ref:`below <switches>` if you're interested, but frankly you can skip reading it until such time as compilation is giving you trouble.

With this file defined, we can test things out! Find yourself a terminal and run

>>> from compiler import *
>>> two = cuda.addone(1)
>>> print(two)
2

It should hang for a while as it compiles in the background, then print 2! If it does, congrats - you're sending data over to C++, doing some computation, and getting the result back in Python!

If for some reason it *doesn't* work, the first thing to do is to add a ``verbose=True`` arg to the ``load()`` call. That'll give you much more detailed debugging information, and hopefully let you ID the problem.

Adding In PyTorch
*****************

For our next trick, let's do the same again with a PyTorch tensor rather than a simple integer. All we need to do is update our ``addone`` function to take and return tensors rather than ints:

.. code-block:: cpp

    using TT = at::Tensor;

    TT addone(TT x) {
        return x + 1;
    }

The ``at::Tensor`` type we're using here is PyTorch's basic tensor type. It's going to show up all over the place in our code, which is why we're aliasing it to ``TT``.

This time, test it with

>>> import torch
>>> from compiler import *
>>> one = torch.as_tensor(1)
>>> two = cuda.addone(one)
>>> print(two)
tensor(2)

If that works, hooray again - you're sending a tensor to C++, doing some computation, and getting it back in Python!

All the Way to CUDA
*******************
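As a rough sketch of how this goes: the kernel itself lives in a separate ``.cu`` file, which ``load()`` will route through the CUDA compiler automatically once you add it to ``sources``. Everything below - the ``kernels.cu`` filename, the ``addone_cuda`` wrapper, the launch configuration - is a hypothetical minimal version for illustration, not megastep's actual kernels:

.. code-block:: cuda

    // kernels.cu - a hypothetical, minimal CUDA version of addone.
    #include <torch/extension.h>

    __global__ void addone_kernel(int* x, int n) {
        // One thread per element; threads past the end do nothing.
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) {
            x[i] = x[i] + 1;
        }
    }

    at::Tensor addone_cuda(at::Tensor x) {
        // Fail loudly if we're handed something the kernel can't handle.
        TORCH_CHECK(x.is_cuda(), "x must be a CUDA tensor");
        TORCH_CHECK(x.scalar_type() == at::kInt, "x must be an int32 tensor");

        // Work on a copy so the kernel's in-place writes don't clobber the input.
        auto y = x.contiguous().clone();
        int n = static_cast<int>(y.numel());
        const int threads = 256;
        const int blocks = (n + threads - 1)/threads;
        if (n > 0) {
            addone_kernel<<<blocks, threads>>>(y.data_ptr<int>(), n);
        }
        return y;
    }

To wire it up, you'd add ``'kernels.cu'`` to the ``sources`` list in ``compiler.py``, forward-declare ``addone_cuda`` in ``wrappers.cpp``, and bind it with ``m.def`` exactly as before.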
.. _switches:

Compilation Switches
********************

TODO: Check how minimal these compilation switches actually are.

To save some scrolling, here's the compilation snippet from earlier::

    import torch.utils.cpp_extension
    import sysconfig

    [torch_libdir] = torch.utils.cpp_extension.library_paths()
    python_libdir = sysconfig.get_config_var('LIBDIR')
    libpython_ver = sysconfig.get_config_var('LDVERSION')

    cuda = torch.utils.cpp_extension.load(
        name='testkernels',
        sources=['wrappers.cpp'],
        extra_cflags=['-std=c++17'],
        extra_cuda_cflags=['-std=c++14', '-lineinfo', '--use_fast_math'],
        extra_ldflags=[
            f'-lpython{libpython_ver}', '-ltorch', '-ltorch_python', '-lc10_cuda', '-lc10',
            f'-L{torch_libdir}', f'-Wl,-rpath,{torch_libdir}',
            f'-L{python_libdir}', f'-Wl,-rpath,{python_libdir}'])

And the notes:

* ``[torch_libdir]``: Find the path to the directory of Torch C++ libraries we need to link against.
* ``python_libdir``: Find the path to the directory of Python C libraries we need to link against.
* ``libpython_ver``: We specifically want the Python C library corresponding to the version of Python we're running right now.
* ``cuda = torch.utils.cpp_extension.load``: We're going to get torch to compile our C++ code for us, link it against a bunch of libraries, and then stuff the resulting module into the ``cuda`` variable.
* ``name='testkernels'``: Our library is going to be loaded into Python as the 'testkernels' library. That is, as well as it being the ``cuda`` variable, we can also access our C++ code through ``import testkernels``.
* ``sources``: This is the list of files to compile; in our case, just our ``wrappers.cpp``.
* ``extra_cflags``: Here we say we want the C++ side of things compiled as C++17 code. C++ has come a long way in the last few years, and compiling against a modern standard makes for a much more pleasant time writing C++.
* ``extra_cuda_cflags``: And here we say we want the CUDA side of things compiled as C++14 code. Not quite as nice as C++17, but the best the CUDA compiler supported at the time I wrote this. We also chuck in the ``-lineinfo`` switch, which gives us more useful debugging information when things go wrong, and the ``--use_fast_math`` switch, which lets the CUDA compiler use faster - but slightly less accurate - maths.
* ``extra_ldflags``: And finally, we list off all the libraries that need to be included when linking the compiled code. The ``-l`` switches name specific libraries; the ``-L`` switches give the directories the linker should search for those libraries at link time, and the ``-Wl,-rpath`` switches embed those same directories into the compiled module so the dynamic loader can find the libraries again at runtime.

TODO-DOCS finish the kernels tutorial