EVE3 - Asynchronous display-list updates

darkjezter · August 12, 2019, 08:57:47 PM

Hello all!

I'm in the process of evaluating and integrating the EVE3 into a commercial handheld remote for robotics and came up with some neat tricks for doing partial screen updates. I'm here to share what I've come up with, and am curious to find out if anyone is doing anything similar.

In a nutshell, I'm using RAM_G to store blocks of DL commands, which are assembled into a display list using CMD_APPEND. Nothing too fancy so far. However, because I still want to take advantage of the coprocessor's widgets, these display list blocks are assembled in RAM_DL and then transferred to RAM_G using CMD_MEMCPY. This allows display lists in RAM_G to be updated, and because the final display is assembled from segmented blocks, blocks can change in size without too much difficulty.

The CMD_APPENDs to assemble the final display list are built in RAM_G during the block construction described above. Combined, these work out quite well, as the coprocessor can feed its own FIFO from RAM_G with some care.

Now for the black magic that makes this method asynchronous. There is no way to know ahead of time exactly how many commands will appear in each display list block, so ultimately the block must be built and its size retrieved from REG_CMD_DL, and this has to happen before we can build the CMD_MEMCPY that moves the DL block to RAM_G. However, the coprocessor can also modify commands in its FIFO, and this is essentially the trick:

Tell coprocessor what to display in the DL block
Tell coprocessor to copy the resulting block length, modifying the next instruction
Tell coprocessor to copy that many bytes to a set destination in RAM_G

By having the coprocessor modify the contents of its own FIFO, this can all be set up in advance with the host MCU only busy while setting up the request. Even if the coprocessor has to wait for a page-flip before it can build a new DL block, the only constraint in this case is available room in the FIFO.
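The three steps above can be sketched as a host-side routine that stages the command words before they are written to the FIFO. This is only a sketch under stated assumptions: the helper name `queue_block_capture`, its parameters and the staging buffer are hypothetical, the widget commands are assumed to be already encoded, and the host is assumed to know the FIFO byte offset at which the first word will land. It also assumes, as the trick itself does, that the coprocessor reads a command's arguments only after the previous command has completed. The addresses and opcodes follow the FT81x/BT81x memory map and should be checked against your own headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RAM_DL         0x300000u  /* display list memory */
#define REG_CMD_DL     0x302100u  /* current write offset into RAM_DL */
#define RAM_CMD        0x308000u  /* coprocessor command FIFO */
#define CMD_FIFO_SIZE  4096u
#define CMD_DLSTART    0xFFFFFF00u
#define CMD_MEMCPY     0xFFFFFF1Du

/* Hypothetical helper: stage the commands that build one DL block and
 * copy it to RAM_G without the host ever reading the block size.
 *   out          host-side staging buffer for the command words
 *   fifo_off     FIFO byte offset where out[0] will be written
 *   widget       already-encoded coprocessor commands for this block
 *   widget_words number of 32-bit words in widget
 *   dst          RAM_G address of the block's slot
 * Returns the number of words staged. */
size_t queue_block_capture(uint32_t *out, uint32_t fifo_off,
                           const uint32_t *widget, size_t widget_words,
                           uint32_t dst)
{
    size_t n = 0;

    assert(fifo_off % 4u == 0u && fifo_off < CMD_FIFO_SIZE);

    /* Step 1: build the block in RAM_DL; CMD_DLSTART resets REG_CMD_DL. */
    out[n++] = CMD_DLSTART;
    for (size_t i = 0; i < widget_words; i++)
        out[n++] = widget[i];

    /* Step 2: copy the 4-byte block length from REG_CMD_DL into the
     * num field of the next command. This command occupies words
     * n..n+3 and the next one n+4..n+7, so the field is word n+7.
     * An aligned 4-byte field never straddles the FIFO wrap point. */
    uint32_t num_field = (fifo_off + 4u * (uint32_t)(n + 7)) % CMD_FIFO_SIZE;
    out[n++] = CMD_MEMCPY;
    out[n++] = RAM_CMD + num_field;   /* dest: inside the FIFO itself */
    out[n++] = REG_CMD_DL;            /* src: block length in bytes */
    out[n++] = 4u;

    /* Step 3: copy "that many bytes" from RAM_DL to the slot in RAM_G. */
    out[n++] = CMD_MEMCPY;
    out[n++] = dst;
    out[n++] = RAM_DL;
    out[n++] = 0u;                    /* placeholder, patched by step 2 */

    return n;
}
```

The host writes the staged words to the FIFO as usual and never waits for the block size to come back.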

I'd love to hear any thoughts, and am more than willing to share more specific detail to anyone interested.

Cheers!

Jesse

Rudolph · August 14, 2019, 07:35:11 PM

Hrmm, you know that the display-list is double buffered and only gets used to actually display anything if you tell it to?

You can use the co-processor to generate as many fragments of display lists as you like and take as much time as you like to copy these around.
The display list is always updated asynchronously.

https://github.com/RudolphRiedel/FT800-FT813/blob/4.x/example_projects/EVE_Test_SAMC21_FT813_EVE2-35G/tft.c

This is just my ugly test-code and it is a bit outdated.
But I always have a function initStaticBackground() in which I define the portions of the screen that do not change, and I append this at the start when building the display list.
In my last project I expanded this to three segments, two additional ones to switch between two screens.

BRT Community · August 15, 2019, 10:15:16 AM

Hi,

Yes, the Append is very useful in reducing SPI traffic when you have many parts of the screen which don't change.

The display list in RAM_DL always defines the final display and is also double buffered.

As you mentioned Rudolph, you can append together several building blocks to make up the screen by creating the display list, storing it whilst noting the location and size, and then appending it in. Note that the total size must still fit in the 8K available.
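A minimal sketch of that bookkeeping, assuming each building block has already been stored in RAM_G and its location and size noted by the host. The `struct snippet` type and the `emit_appends` helper are hypothetical names used only for illustration; the sketch stages one CMD_APPEND per block and refuses to do so if the blocks plus the directly written part would not fit in the 8K of RAM_DL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CMD_APPEND   0xFFFFFF1Eu
#define RAM_DL_SIZE  8192u

/* Hypothetical record of one stored building block. */
struct snippet {
    uint32_t addr;  /* location in RAM_G */
    uint32_t size;  /* length in bytes, a multiple of 4 */
};

/* Stage one CMD_APPEND per snippet into out.
 * dynamic_bytes is the display list space used by everything that is
 * not appended (commands sent directly for this frame).
 * Returns the number of words staged, or 0 if the total would not fit. */
size_t emit_appends(uint32_t *out, const struct snippet *s, size_t count,
                    uint32_t dynamic_bytes)
{
    uint32_t total = dynamic_bytes;
    size_t n = 0;

    for (size_t i = 0; i < count; i++) {
        assert(s[i].size % 4u == 0u);  /* CMD_APPEND copies whole words */
        total += s[i].size;
    }
    if (total > RAM_DL_SIZE)
        return 0;

    for (size_t i = 0; i < count; i++) {
        out[n++] = CMD_APPEND;
        out[n++] = s[i].addr;
        out[n++] = s[i].size;
    }
    return n;
}
```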

We have some additional examples and documents below which may be useful:

Principles of using Append
https://brtchip.com/wp-content/uploads/Support/Documentation/Application_Notes/ICs/EVE/AN_340_FT800_Optimising-screen-updates-with-Macro-and-Append.pdf

Example of putting together several static sections
https://brtchip.com/wp-content/uploads/Support/Documentation/Application_Notes/Modules/EVE/AN_356-FT800-Interfacing-I2C-Sensor-to-VM800P.pdf

Simple example of append - see section 9
https://brtchip.com/wp-content/uploads/Support/Documentation/Application_Notes/ICs/EVE/BRT_AN_014_FT81X_Simple_PIC_Library_Examples.pdf

Some of our sample apps also use this technique
https://brtchip.com/softwareexamples-eve/

Best Regards,
BRT Community

darkjezter · August 15, 2019, 08:56:14 PM

Before going down this road, the first thing I did was examine the RAM_DL behavior around swap. It was interesting, but suggested that much of our legacy code would require a full rewrite to migrate to EVE. Yes, the display updates are asynchronous already, but building display lists in RAM_G using the co-processor, and subsequently appending them for display, is not possible without this technique. At least, not as far as I can tell.

Co-processor widgets can only be built in RAM_DL, and if a swap is pending then those widget commands will wait in the FIFO until the RAM_DL becomes available. This implies that new display list block sizes may not be known for up to a full frame interval, so instead of having the MCU pull these block sizes from the EVE, the EVE builds the series of append commands right in RAM_G, which it can then copy into the FIFO and execute without added support by the MCU. This approach was preferable to either reverse engineering the list sizes produced by the co-processor, or ignoring the widgets and building the display lists directly, which would increase the amount of SPI traffic.

What this allows is separating the display list into an arbitrary number of independently updated blocks. From the MCU side, the necessary blocks are updated, and then queued for assembly and display. The only data the MCU needs to read out to set this all up is REG_CMDB_WRITE; the rest is manipulated by the EVE as it executes the contents of the FIFO.

The only issue I've encountered with this approach is that CMD_MEMxxx commands do not wrap around within the command FIFO like SPI transfers do. That is not entirely unexpected, I'm sure, but it does require some care when telling the EVE to manipulate the contents of its own FIFO.
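One way to handle the wraparound is to split any copy that would run past the end of the 4 KiB FIFO into two linear copies. The sketch below is an assumption about how that could look, not the code used in this project: `emit_copy_into_fifo` is a hypothetical helper that stages one or two CMD_MEMCPY commands on the host, and it does not check that the destination range is free of commands that are still waiting to execute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RAM_CMD        0x308000u  /* coprocessor command FIFO */
#define CMD_FIFO_SIZE  4096u
#define CMD_MEMCPY     0xFFFFFF1Du

/* Stage the CMD_MEMCPY commands needed to copy len bytes from src
 * (for example a prepared command block in RAM_G) into the FIFO,
 * starting at byte offset fifo_off. CMD_MEMCPY works on linear
 * addresses only, so a range that crosses the end of the FIFO is
 * split into a tail part and a part that restarts at offset 0.
 * Returns the number of words staged (0, 4 or 8). */
size_t emit_copy_into_fifo(uint32_t *out, uint32_t fifo_off,
                           uint32_t src, uint32_t len)
{
    size_t n = 0;

    assert(fifo_off < CMD_FIFO_SIZE && len <= CMD_FIFO_SIZE);

    uint32_t to_end = CMD_FIFO_SIZE - fifo_off;      /* bytes before wrap */
    uint32_t first = (len < to_end) ? len : to_end;  /* linear part */

    if (first > 0u) {
        out[n++] = CMD_MEMCPY;
        out[n++] = RAM_CMD + fifo_off;
        out[n++] = src;
        out[n++] = first;
    }
    if (len > first) {
        out[n++] = CMD_MEMCPY;
        out[n++] = RAM_CMD;            /* wrapped part starts at offset 0 */
        out[n++] = src + first;
        out[n++] = len - first;
    }
    return n;
}
```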

While I have found the doc on macro and append, and did look at the PIC library examples, nowhere did I see any examples of using the EVE's co-processor commands to manipulate the command FIFO. Effectively, this technique relies on building the append list in RAM_G, self-modifying code fed into the FIFO, and feeding the FIFO with the contents of RAM_G. If there are any examples of these 3 techniques being used anywhere in the doc, I'd love to see them.

The two pushes that led me to explore this approach were porting legacy UI code, and keeping the MCU available to handle the real-time control of our robotic platform. Using the described technique, the EVE has become an asset, allowing for simpler migration and better runtime performance than we had with the black and white LCDs we're replacing.

It's a success story in progress, and I guess I just wanted to share.

BRT Community · August 16, 2019, 02:58:31 PM

Hi,

Thanks for your additional explanation and it's good to hear that you're having good success with EVE.

I'm not sure if we have any examples which store the co-processor commands (rather than the resulting display list commands) but we'll look further into this.

Best Regards, BRT Community

Rudolph · August 16, 2019, 05:11:33 PM

I still have no idea what "issue" you are trying to fix in the first place.

Given that the resulting display list is within 8kB - and you always have to make sure it does - I can think of no issue, even when you try to rebuild snippets for the display list over and over.

Even if you send the commands that build your display-list from snippets thru the co-processor and immediately afterwards issue more commands thru the co-processor to update the snippets, it does not matter, since all commands are executed sequentially anyway.
You need to make sure that you are not overloading the FIFO, but again, you always need to take care of that, and since the co-processor is rather fast when processing the FIFO you really would need to put some effort into racing it.

So I guess you could write display-list snippets directly to RAM_G, but not only is this a bad idea on its own, you could easily check if the co-processor is still busy with the display list.
Since you must not try to swap the display list faster than it is displayed, you can easily refresh the snippets in the 17ms you are not refreshing.

The argument that a display list could be without an update for a whole whopping frame is void when you put the updating of the snippets right before building a new list every 17+ ms.

And on top, why do you refresh your snippets anyways?
Every time you do, you could as well send the exact same commands thru the co-processor instead of using cmd_append afterwards. The SPI traffic sure is a little lower, but only for the frames in which you do not refresh that snippet.
It more or less only makes the amount of SPI traffic unpredictable, and you need to make sure that your program is still running as it is supposed to in the worst case.

Saving memory could be raised as reason to refresh the snippets.
But first off the display-list is only 8kB long.
How many times that do you need in snippets to make it a problem?
And since for example a button uses a different amount of memory if displayed normal or flat, you would either allocate the maximum amount for the snippet or implement some memory management.
Investing a little memory to pre-calculate two different snippets would be easier in this case.
And avoiding cmd_button in favour of two button images would make both snippets significantly smaller.
Plus since we are talking about EVE3 you could as well pre-calculate the snippets, put them in the FLASH and then use cmd_appendf to place the snippets from flash in the display-list.

I am afraid that you spent a whole lot of time solving something that is not an issue in the first place.

darkjezter · August 20, 2019, 07:39:11 PM

Hey Rudolph, first off I'd like to thank you for engaging, whether or not this trick is necessary or even useful.

The main problem this is meant to solve is prediction of the size of a snippet produced by a series of co-processor commands, or in my case, by higher-level API calls which make use of widgets plus our own display elements. In the source code you linked earlier, you're building a snippet once during initialization, and getting the size upon completion. This works well enough for static snippets, but generalizing this approach to snippets that are updated between frames requires synchronization between the EVE and MCU.

Writing the snippets direct to RAM_G as you mentioned would avoid the problem of RAM_DL contention, but as you point out, introduce a new problem of ensuring that the snippets are not modified during the CMD_APPEND calls. Though, this problem is simply mitigated by double-buffering the snippets on update. Moot point though, as this is not an option provided by the co-processor.

Pre-calculating multiple states of snippets, and appending the desired one based on MCU state, is the closest to what I set out to accomplish. The main issue here is that the API I'm building has more than two states per button, as we also allow non-touch navigation using a d-pad. With a highlight state added to support the d-pad, the number of per-button states increases from two to four, and quickly becomes impractical should other states be added. In our use-case, we also use different colors/blinking to indicate other state through the appearance of the buttons.

Re-issuing all the draw commands each frame where the screen is updated is probably the safest workaround. I'll admit, the main reason I didn't pursue this method is that we have some screens where very little changes per frame, and others where most of the screen content is dynamic (mostly text, but not all). Again, the goal is to provide a consistent API that handles these two cases in a uniform way. Also, in the case of text, the most SPI traffic efficient way to update a large number of text fields is using CMD_TEXT, provided the text strings are more than a few characters. Obviously, this is no longer true if the screen content is static and can be loaded to flash ahead of time.

With regards to allocating enough space for worst case size, that's exactly what I'm doing. But due to the desire to save these snippets to RAM, and the structure of the append command, the actual size of the snippet is still required. It is required both for the copy from RAM_DL to RAM_G, and for the corresponding CMD_APPEND.

The primary advantage of this approach is the number of snippets can be arbitrarily large, with no impact on the amount of SPI traffic. In essence, it allows arbitrary granularity on how many elements are updated as a batch or snippet. Managing worst case performance can then be tackled by splitting larger snippets into smaller ones, updated less often. The commands sent by the MCU do not change regardless of how many snippets are created.

Perhaps your constraints aren't the same as the ones I'm managing in my own work?

For what it's worth, developing this trick and dealing with the FIFO wraparound didn't take much time at all.

Cheers!

darkjezter · August 21, 2019, 03:59:27 PM

To add... you're right about the frame interval and time to update the display. 17ms is more than enough time to rebuild the entire display; in my own setup, an entire 8K list could be sent in about 2ms without even using the co-processor widgets. However, doing so would occupy the MCU for that duration, and due to the realtime monitoring and control of our robotics platform this is a significant time slice, which would be taken up by code ideally 'hidden' in the UI framework. There are other ways that cost can be managed, but as EVE is the new player in our platform, I sought a solution that could be isolated to the code handling EVE interaction.

After getting this solution together and getting some actual time measurements along with your criticism, I'm comfortable acknowledging that the chip is fast enough that this trick probably isn't necessary for most projects, especially for a project starting from zero with EVE integrated. For our purposes, I wanted to build a library that will be integrated into several real-time projects, allow us to provide a consistent UI, and do so while these implementation details have as small an impact on those projects as possible.

Time to update the display aside, this also leads to code where all display snippets are built and tracked by the EVE. The MCU doesn't need to know what the resulting sizes are, and aside from minor state updates and touchscreen events, virtually all data flow is one-directional, with no EVE->MCU flow of data that may require waiting. Maybe it's a small thing, but, IMO it makes for a much cleaner implementation, and one that lends itself better to being later adapted to buffering on the MCU side.

BRT Community · August 22, 2019, 10:29:33 AM

One thing to add regarding buttons is that we find using bitmap icons is often a good approach where you have lots of buttons. If your icons are all the same size, using bitmap cells (where you create a large bitmap image which is one button wide and many buttons high) is a really handy way of doing this with minimal display list content, plus you can give the icons your own desired appearance. You then place the icon with VERTEX2II, which includes the handle and the specific cell number you want to place, or with the CELL instruction followed by VERTEX2F.

There are several ways to implement this including:

Having one image per icon and use conditional code to do one of the following when a variable indicates the button should be 'pressed'

- using the bitmap transform to re-size slightly
- add a slightly larger rectangle behind to give a glow round the edges
- set COLOR_RGB before the icon is placed and re-color to indicate pressed

Having more than one image per icon which you select using the cell number using similar conditional code:

TAG_MASK(1);
BEGIN(BITMAPS);
COLOR_RGB(255, 255, 255);  // white displays following bitmaps in their original colours
TAG(Button_0);
if (Button0_Pressed == false)  // Button0_Pressed is a local Boolean in your code
    VERTEX2II(100, 100, 5, 0);  // display cell 0, which is your button 0 unpressed
else
    VERTEX2II(100, 100, 5, 1);  // display cell 1, which is your button 0 pressed

With most of these methods you can also keep the display list size the same regardless of button state.
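To illustrate why the display list size stays the same, here is a minimal sketch of the two placement options mentioned above, written as raw 32-bit display list words. The function names are hypothetical, the handle (5) and position (100, 100) are taken from the example above, and VERTEX2F is assumed to use its default 1/16 pixel units. The bit layouts follow the FT81x display list encoding and should be checked against your own library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BITMAPS 1u  /* primitive number for BEGIN */

uint32_t dl_begin(uint32_t prim)       { return (31u << 24) | (prim & 15u); }
uint32_t dl_bitmap_handle(uint32_t h)  { return (5u << 24) | (h & 31u); }
uint32_t dl_cell(uint32_t cell)        { return (6u << 24) | (cell & 127u); }

/* VERTEX2II: x and y in whole pixels (0..511), handle 0..31, cell 0..127 */
uint32_t dl_vertex2ii(uint32_t x, uint32_t y, uint32_t handle, uint32_t cell)
{
    return (2u << 30) | ((x & 511u) << 21) | ((y & 511u) << 12) |
           ((handle & 31u) << 7) | (cell & 127u);
}

/* VERTEX2F: x and y in 1/16 pixel units by default */
uint32_t dl_vertex2f(int32_t x, int32_t y)
{
    return (1u << 30) | (((uint32_t)x & 32767u) << 15) |
           ((uint32_t)y & 32767u);
}

/* Option 1: the cell is part of the VERTEX2II word. */
size_t button_vertex2ii(uint32_t *out, bool pressed)
{
    size_t n = 0;
    out[n++] = dl_begin(BITMAPS);
    out[n++] = dl_vertex2ii(100u, 100u, 5u, pressed ? 1u : 0u);
    return n;
}

/* Option 2: select handle and cell first, then place with VERTEX2F. */
size_t button_cell_vertex2f(uint32_t *out, bool pressed)
{
    size_t n = 0;
    out[n++] = dl_bitmap_handle(5u);
    out[n++] = dl_cell(pressed ? 1u : 0u);
    out[n++] = dl_begin(BITMAPS);
    out[n++] = dl_vertex2f(100 * 16, 100 * 16);
    return n;
}
```

In both variants only one word differs between the pressed and unpressed state, so a snippet holding the button can be a fixed size.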

As you mentioned, it is often quite efficient to re-send the entire list especially when using SPI burst writes when something on the screen changes if you have a lot of dynamic objects. For anyone looking to use the append in the original way (with RAM_DL content rather than commands) then you may also find using icons rather than buttons along with using multiple appended sections as smaller building blocks helps to minimise variation in length of the DL.

Thanks for your inputs Rudolph and Jesse, I'm sure your experiences here will help other users too,

Best Regards,
BRT Community

Rudolph · August 22, 2019, 06:26:40 PM

>Perhaps your constraints aren't the same as the ones I'm managing in my own work?

Most certainly, I do not even have the same constraints from one project to another. :-)

>in my own setup, an entire 8K list could be sent in about 2ms

This implies that you either use more than the 30MHz allowed on the SPI or that you use dual or quad SPI.
A raw transfer of 8k at 30MHz without overhead would take 2.2ms.

In either case this implies that you are using a higher clocked 32 bit controller.
And if that is true - why not use DMA?

I am using plain single-line SPI transfers (at least for now), with the ATSAMC21 currently "restricted" to 12MHz since I have a problem with the MISO line at 24MHz and did not bother to set up another clock-source to allow for something in between.

In my last more "complex" GUI I had a single page with 26 buttons in 6 different shapes.
Each button had three states, off, active and touched, indicated by different colors for off/active and different images for touch/not touched.
The buttons were using images with 2 sub-images each.
The texts for the 12 main buttons were done with an image with 13 sub-images and which "text" is shown can actually be configured on the fly by CAN messages.
What I am using in total as buttons has 39 states so to speak.

In addition there were a couple of graphical elements, like 12 indicator "leds", 6 moving "pointers", two logos, several lines and rectangles, some plain text, another graphic to indicate something by changing color and a text-output status line.

The result was 1124 bytes to be sent over SPI, resulting in a 2872 byte long display list, and composing the list from a precalculated snippet for the static parts plus everything else that could be changed takes 667µs - all with an empty status line.
The 1124 bytes are sent with DMA in one large block which takes about 750µs.
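The transfer times quoted in this post follow from simple arithmetic on the raw bit count, ignoring addressing and any gaps between bytes. A minimal sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Raw SPI transfer time in microseconds: bits divided by bits per
 * microsecond. lanes is 1 for single, 2 for dual, 4 for quad SPI.
 * Protocol overhead and inter-byte gaps are not included. */
double spi_transfer_us(uint32_t bytes, double clock_mhz, unsigned lanes)
{
    assert(clock_mhz > 0.0 && lanes > 0u);
    return (double)bytes * 8.0 / (clock_mhz * (double)lanes);
}
```

With these assumptions, `spi_transfer_us(8192, 30.0, 1)` gives roughly 2185 µs, which is the 2.2ms figure for a full 8k list at 30MHz, and `spi_transfer_us(1124, 12.0, 1)` gives roughly 749 µs, in line with the 750µs measured for the 1124 byte block at 12MHz.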

Just for reference, when you place 26 button widgets in EVE Screen Editor and nothing else, the RAM_DL is filled by 75%.

The attached image is from a way simpler GUI and only to show that I use a burst-mode, this is a single large transfer, followed by a small one to write REG_CMD_WRITE.

Yes, I could drive this further with splitting this up into snippets.
And since part of my optimisations was to sort thru the various elements in order to avoid sending commands like COLOR_RGB over and over again, using snippets would make the code more readable at the cost of a longer display list.
But my goal was to get below 1ms for the display refresh function that is called once every 20ms.
And I got there.

Next time I may pick up on the idea to use a whole lot more different snippets.
Just as a guess, I assume it would take 16 commands per button, should be less, that would be 64 bytes.
6 * 39 * 64 = 14976
Yes, why not, in my last project I used 3x 4k from the top for three snippets and 5056 bytes from the bottom for a single .xfont file.
This left 1031232 bytes of SRAM unused since all my images and the UTF-8 font were used directly from the external FLASH of the BT815.
I rather spend 64k on snippets than refreshing these.
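The memory figures above can be checked with a few lines of arithmetic. This is only a sketch with hypothetical helper names, assuming the 1 MiB of RAM_G available on the BT815:

```c
#include <stdint.h>

#define RAM_G_SIZE (1024u * 1024u)  /* 1 MiB of graphics RAM */

/* Bytes needed if every shape/state combination gets its own snippet. */
uint32_t snippet_budget(uint32_t shapes, uint32_t states,
                        uint32_t bytes_per_snippet)
{
    return shapes * states * bytes_per_snippet;
}

/* RAM_G left over after reserving snippet slots and other data. */
uint32_t ram_g_unused(uint32_t slots, uint32_t slot_bytes,
                      uint32_t other_bytes)
{
    return RAM_G_SIZE - slots * slot_bytes - other_bytes;
}
```

`snippet_budget(6, 39, 64)` evaluates to 14976, and `ram_g_unused(3, 4096, 5056)` to 1031232, matching the numbers in the post.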
