Performance Limitations of CircuitPython's DisplayIO Graphics
For my next project with the WaveShare Round LCD I want to create an animation of Dr Strange's Eye of Agamotto. Here is my first attempt in JS (looks more like the comics than the movies). Not bad, I think.
When it animates, the eyelids come down. I built them with quadratic curves that change over time using Canvas. Canvas is fast enough to make the animation very smooth. Once I ported it to CircuitPython on an embedded device, however, well... it's pretty damn slow.
Animation in CircuitPython
For my first attempt I captured the JS animation as a video, scaled it down to 240 x 240, then converted it to an animated gif. On the device the gif renders at 2 or 3 fps. A far cry from the smooth 60fps of my browser.
For my second attempt I exported a background image of the disk, drew the eye on top, then animated the eyelids as rectangles that move from the top and bottom to the center. It was still rough, but seemed to be maybe 6 or 7 fps.
What's going on here? While this device is slow, it's still a 133 MHz processor. I had far faster animation back on my old Pentium, even when running Java code instead of optimized C. The problem must be in how the display is updated. After spending the next 8 hours researching how these sorts of displays work, and how CircuitPython exposes them to the programmer, I've come to this conclusion: tiny TFT displays are just slow by nature, even with raw C code, but there are ways we can speed them up. That's what the rest of this post is about.
How CircuitPython Does Graphics
CircuitPython can be quite fast because all of the performance-sensitive work is implemented in optimized C code. This includes drawing. The displayio system lets you use bitmaps with indexed colors to save memory, which means copying bitmaps back and forth can be fast. Drawing shapes can be almost as fast as C because we are just setting pixels in an in-memory bitmap, and of course things like vertical lines and rectangles can be optimized using memcpy routines. Actually drawing to the screen is a different story. It can be very slow.
No matter how you draw into the framebuffer, that buffer has to be uploaded to the display as a bitmap. Most of the little hobbyist displays use a serial bus called SPI. Looking at the source of the driver for my board, the only custom part was the init code. All uploading goes through a standard FourWire connection, which is a type of SPI connection.
I’m using a 240x240 16-bit color screen connected over SPI. It has no indexed modes or accelerated drawing routines (as far as I can tell none of these hobbyist screens do). This means a full screen refresh requires sending two bytes for every pixel on the screen, over a serial port, one bit at a time. The DisplayIO routines are smart and try to use batches, but even if you set the entire screen to a single color, that is still a lot of data to transfer.
240 x 240 x 2 bytes per pixel equals 115,200 bytes per screen, or 921,600 bits per frame. For 30fps that’s over 27 million bits per second with zero overhead. In practice I’m guessing we get half of the bus's nominal rate. A lot of SPI ports only run at 10 or 20 MHz, so getting a consistent 30fps just isn’t going to happen. That’s just the nature of SPI. Desktop computer screens use protocols that can send more than one bit at a time, and operate much faster than 10 MHz. SPI just can't.
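To make the arithmetic concrete, here is a small sketch in plain Python. The function names are my own, and the "ideal" figure assumes every SPI clock cycle carries one bit of pixel data with no command or protocol overhead, which real hardware won't achieve.

```python
# Back-of-the-envelope SPI bandwidth math for a 240x240, 16-bit color panel.
WIDTH = 240
HEIGHT = 240
BYTES_PER_PIXEL = 2  # 16-bit color

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
frame_bits = frame_bytes * 8


def required_bits_per_second(fps):
    """Raw pixel bits per second needed for a full-screen refresh at `fps`."""
    return frame_bits * fps


def ideal_fps(spi_hz):
    """Upper bound on full-screen fps if every SPI clock carried a pixel bit."""
    return spi_hz / frame_bits


print(frame_bytes, "bytes per frame,", frame_bits, "bits per frame")
print(required_bits_per_second(30), "bits per second needed for 30 fps")
for mhz in (10, 20):
    print(mhz, "MHz ->", round(ideal_fps(mhz * 1000000), 1), "fps at best")
```

Even with zero overhead, a 20 MHz bus tops out below 22 full frames per second, which is why 30fps is out of reach.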
There is one saving grace, however. These screens have an internal memory buffer and support partial refreshes. This means you can update rectangular subsections of the screen containing just the changes between the last frame and the next. It won’t help for full screen animation, but for typical GUI work where only a few things change we can make it quite performant. We just have to go back to rendering algorithms from the 90s (like the Java Swing toolkit that I worked on for 10 years). This is in fact what displayio does.
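As a rough illustration of why partial refreshes help, here is a minimal dirty-rectangle sketch in plain Python. This is not displayio's actual implementation, just the general idea: when a sprite moves, only the rectangle covering its old and new positions needs to be resent. The rectangle format (x0, y0, x1, y1) with exclusive right and bottom edges, and all function names, are my own choices for the example.

```python
def union(a, b):
    """Smallest rectangle (x0, y0, x1, y1) covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def area(r):
    """Number of pixels in rectangle r (x1 and y1 are exclusive)."""
    return (r[2] - r[0]) * (r[3] - r[1])


def dirty_rect(old_bounds, new_bounds):
    """Region to resend when a sprite moves from old_bounds to new_bounds."""
    return union(old_bounds, new_bounds)


# A 20x20 sprite moving 4 pixels to the right on a 240x240 screen.
old = (100, 100, 120, 120)
new = (104, 100, 124, 120)
dirty = dirty_rect(old, new)
full_screen = 240 * 240

print("dirty region:", dirty)
print("pixels to send:", area(dirty), "instead of", full_screen)
print("that is", full_screen // area(dirty), "times less data")
```

In this example the update touches 480 pixels rather than 57,600, so the bus carries 120 times less data for that frame.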
The inability to do smooth full screen refreshes does mean certain types of effects, like slideshow animations or fading the entire view to black, are impossible. (Well, maybe we could animate the backlight?)
Earlier I said there are no acceleration commands, like RLE encoding or bitblts, or indexed color palette swaps. This is true. However, this particular chip does have vertical scroll offsets, which might make it possible to do smooth vertical scrolling of at least part of the screen. I’ll have to look into that later.
So to start with, the only perf improvement we can make is to set the SPI bus to the fastest possible speed. In the sample MicroPython code that comes with this screen I *think* they are setting the speed to 100 MHz, but the C code seems to use 40 MHz and the Arduino code 66 MHz. 🤷 Switching from 20 MHz to 100 MHz seems to speed up the animation I’m working on, but really I need a proper benchmark to measure it.
I created a little script to redraw a fullscreen bitmap over and over, with different SPI speeds. At 1 MHz I’m getting 690ms per frame, or about 1.45fps. As I increase the speed the fps improves linearly until it tops out at 64 MHz with a frame time of 144ms, or just under 7fps. Setting the speed any higher has no effect. Multiplying the size of a frame in bits by the fps gives me about 6.4 Mbps, which sounds correct for an uncompressed video stream. This seems to be the max we can get.
Incidentally, if I leave off the speed setting it seems to default to 20 MHz. If I leave off the target framerate of 60 it drops to 3.66fps. If I set the target framerate to 10 (which is still higher than our max) it drops to 5.38fps. Setting the target to 60 lets us get back to our max fps of 7. So I’m not sure what exactly the target framerate is doing, but it seems to have some inaccuracies. Perhaps it’s allowing for more GCs? Setting it to 100 actually seems to increase the fps slightly to 7.09. Setting it to 1000 increases it some more, but only to 7.3. So clearly there is some tradeoff here, but if it’s impacting battery or heat then it’s not worth it for that tiny speed improvement.
Now let’s try filling just part of the screen and see if we can get speedups.
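For reference, the timing part of a benchmark like this can be sketched as follows. This is not the actual script from the repo. It is a minimal version in plain Python where `refresh` is a hypothetical callable standing in for "redraw the bitmap and push it to the display", so the sketch runs anywhere. The helper names are my own.

```python
import time


def measure_fps(refresh, frames=50):
    """Call refresh() `frames` times and return (ms_per_frame, fps)."""
    start = time.monotonic()
    for _ in range(frames):
        refresh()
    elapsed = time.monotonic() - start
    ms_per_frame = elapsed * 1000 / frames
    fps = frames / elapsed if elapsed > 0 else float("inf")
    return ms_per_frame, fps


def throughput_bps(fps, width=240, height=240, bits_per_pixel=16):
    """Pixel data delivered to the screen, in bits per second."""
    return fps * width * height * bits_per_pixel


def fake_refresh():
    """Stand-in for a real full-screen redraw (assumption for this sketch)."""
    time.sleep(0.001)


ms, fps = measure_fps(fake_refresh, frames=20)
print("%.1f ms/frame, %.1f fps" % (ms, fps))
print("%.2f Mbps at 7 fps" % (throughput_bps(7) / 1000000))
```

On the device, I'd expect `refresh` to be something that changes the bitmap and then asks the display to refresh with the chosen target framerate, with automatic refreshing turned off so each frame is counted exactly once. That part depends on your display setup, so treat it as an assumption.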
240x240 = 7.09 fps
240x120 = 13.89 fps
120x120 = 25.48 fps
Sure enough: halving the pixels roughly doubles the frame rate. Cutting them to a quarter gives us close to a 4x speedup. So 25fps seems pretty good, right? Yes and no. The numbers say it is pushing that many frames per second, but visually it does not seem to actually be rendering that fast. In fact there is a lot of tearing. Recording with my slow-mo camera shows that it takes about 70-100ms to render a frame, or in the range of 10-14fps, even though we are sending significantly more frame data than that. Technically we could refresh a 64x64 rectangle at over 60fps, assuming we had no extra overhead, but that’s only about a 14th of the pixels. And visually it doesn’t look that fast, plus processing that much data in code for something like a video would be insane.
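Here is a quick sanity check of that scaling in plain Python. It assumes the fps is limited purely by pixel throughput, takes the measured 7.09fps full-screen number as the baseline, and predicts what smaller rectangles should achieve. The function name is my own, and the model ignores per-frame overhead, which is probably why the measured numbers fall a little short.

```python
FULL_W, FULL_H = 240, 240
FULL_SCREEN_FPS = 7.09  # measured full-screen result from above

# Pixels per second the whole pipeline managed at full screen.
pixels_per_second = FULL_SCREEN_FPS * FULL_W * FULL_H


def predicted_fps(width, height):
    """Expected fps for a width x height region if throughput is the only limit."""
    return pixels_per_second / (width * height)


measured = {(240, 240): 7.09, (240, 120): 13.89, (120, 120): 25.48}
for (w, h), fps in measured.items():
    print("%dx%d: predicted %.2f, measured %.2f" % (w, h, predicted_fps(w, h), fps))

share = 100 * 64 * 64 / (FULL_W * FULL_H)
print("64x64: predicted %.1f fps, %.1f%% of the pixels" % (predicted_fps(64, 64), share))
```

The model predicts about 14.2fps for the half screen and 28.4fps for the quarter screen, slightly above what I measured, so there is some fixed cost per frame on top of the raw pixel transfer.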
Still, this does give us something to work with. A full screen refresh is around 7fps, but if we only want to refresh part of the screen at a time then we can get 15 to 30fps fairly easily.
Let’s try rotating several circles. I can get a calculated fps of over 30 fairly consistently. In fact, if I set the target framerate to 20 and the speed to 200 MHz I get an extremely consistent 32fps. Something must be syncing up nicely here. I'm not doing anything clever in my code. I created some circles and move them every frame. DisplayIO must be doing some dirty tracking underneath and only sending the smallest number of changed pixels to the screen. That's how we get 30fps. But again, visually it does not seem as smooth and there is some tearing. I suspect the refresh rate of the physical LCD is lower than the rate at which we can pump out frames. It also looks a little choppy because my circles are restricted to integer coordinates. I can't move a circle to be on the edge of two pixels and get anti-aliased drawing like I would on HTML canvas. But for what I'm building I can live with these constraints.
So what have we learned?
- All of my research taught me that the SPI bus is the bottleneck, and there's no way to get more than 7fps for a full screen refresh. However, partial screen refreshes can be quite fast and the displayio API was designed to enable this. They've done a great job. I don't think I could do better without hardware changes.
- Make sure you set the SPI bus to its max speed. The default may be much lower than the component is capable of.
- Design your graphics and animation around updating the fewest number of pixels.
- Use the backbuffer to your advantage. You can make complex graphics as long as the number of pixels changed per frame is small. Cool particle effects should be possible with this technique.
Next time I'll show you some graphics examples I've been working on, as well as how to access the touch events to make a little painting app. All of the source code for this project, including my performance testing, is in this GitHub repo.