Saturday, May 29, 2010

Embracing the ARM

Last year, I worked on a project for visomat inc. which involved interfacing to an LED matrix display so that it could display video. We had latency problems, and at one stage we decided to try DMX instead of Ethernet as the interface technology, hoping that we'd get better results that way. The idea was to feed the LED matrix, which consisted of 12 power supplies/controllers, through 12 DMX interfaces to make sure that the data arrived at the LEDs synchronously.

DMX is an asynchronous serial protocol running at 250 kbit/s. I decided to try using an AVR microcontroller running at 16 MHz and have it bit-bang the data out of a parallel port. For each bit, 64 machine cycles would be available, which should be enough to set the port bits and also grab incoming data from the USB interface that we wanted to use to talk to the host.
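The 64-cycle figure falls straight out of the bit rate; a minimal sketch of the arithmetic (function name is mine, not from the firmware):

```c
#include <assert.h>

/* DMX runs at 250 kbit/s, so every bit lasts 4 microseconds. On a CPU
   that executes roughly one instruction per clock, the cycle budget per
   bit is simply clock / bitrate. */
unsigned long cycles_per_dmx_bit(unsigned long clock_hz)
{
    const unsigned long dmx_bitrate = 250000UL; /* 250 kbit/s */
    return clock_hz / dmx_bitrate;
}
```

At the AVR's 16 MHz, this yields the 64 cycles mentioned above.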

I built a prototype of the interface and we ran some promising experiments with it, but in the end we found that the visual latency problems were not fixed. The problem really lay in how the LED hardware worked and could not be fixed on the feeding side. The prototype went onto the shelf and we stuck with Ethernet for that project.

A few weeks ago, I discussed the project with a friend of mine and we thought that it'd be cool to turn the 16-port DMX interface into a product. We found someone who wanted to use it in an OEM setting, and I made bringing the hardware to product level my weekend project.

Goodbye AVR, hello ARM

The AVR-based solution proved to be problematic, though. Even though the CPU has enough processing headroom, the lack of RAM requires that the data be fed into the interface "just in time". Any variance in the USB input data stream could result in visible hiccups in the DMX streams. Getting it to work seemed possible, but would have required too much work in the end. Thus, I decided to turn to more capable hardware, which would also give the DMX interface network connectivity and more local intelligence.

As the new platform, I chose the ARM9-based Eddy-S4M board produced by the Korean manufacturer SystemBase. These boards are distributed by trenz electronic, which I've found to be quick and reliable in the past.

Linux is the standard operating system on the Eddy-S4M. I'm not a huge Linux fan, so I spent some time looking for an alternative operating system, but neither eCos, FreeRTOS, nor RTEMS seemed to have been ported to the AT91SAM9260 CPU with a free development system. As I wanted to get on with application development, I decided not to port or write an operating system myself. Running on the bare hardware was not an option either, as I needed a working TCP stack in order to interface to the host.

SystemBase calls their Linux port "lemonix". It appears to be a relatively plain linux-2.6.20 with real-time patches, BusyBox and a bunch of C applications to control startup and configuration. The userland and kernel source code is available from SystemBase after registration, but the kernel tree that they make available is incomplete and needs to be augmented with some files from a Google group before it can be built. Getting the cross-compilation environment to run on my Ubuntu 9.10 box worked flawlessly, and after removing the whole Eddy application stack and most of the SystemBase drivers from the file tree, I ended up with a relatively sane Linux that I could cross-compile and install on the Eddy hardware through tftp.

Making it fast

Back at the DMX interface, I was now faced with a system that gave me about 400 machine cycles for each DMX bit sent. That should be plenty of time to get the bits out to the ports. The bit timing would be achieved by a timer interrupt every 4 microseconds. Even with interrupt handling overhead, this should leave enough headroom for TCP processing.

The ARM architecture specifies two kinds of interrupts. The normal, vectored interrupt system (IRQ) is used by Linux and mapped to its own, portable interrupt architecture, ignoring the vectoring facilities that the hardware provides. The second, fast interrupt system (FIQ) is not used by Linux, and it seemed like a good fit for my requirements. The FIQ is interesting because when the FIQ handler is entered, the CPU automatically switches six registers to a distinct bank. These six registers can then be used directly, without the need to save or restore them on the stack. Even though the FIQ is not used by Linux itself, ARM-Linux provides an interface so that drivers can use it. FIQ support was missing in lemonix, but it was trivial to backport.

Writing the FIQ handler in ARM assembly was straightforward. The driver bottom half set up the FIQ register set and enabled the timer interrupt; the FIQ handler set the port bits and incremented the pointers accordingly.
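The framing the handler has to produce for each DMX slot is plain 8N2 UART framing, just generated in software. A userland sketch of the per-slot bit sequence (names are mine, not from the driver):

```c
#include <stdint.h>

/* Encode one DMX slot as 8N2 serial framing: one start bit (line low),
   eight data bits least-significant-bit first, two stop bits (line high).
   Fills bits[] with 0/1 line levels and returns the bit count (11). */
int dmx_encode_slot(uint8_t byte, uint8_t bits[11])
{
    int n = 0;
    bits[n++] = 0;                      /* start bit: line driven low */
    for (int i = 0; i < 8; i++)
        bits[n++] = (byte >> i) & 1;    /* data bits, LSB first */
    bits[n++] = 1;                      /* first stop bit: line high */
    bits[n++] = 1;                      /* second stop bit */
    return n;
}
```

At 250 kbit/s, each of these 11 bits corresponds to one 4-microsecond timer tick, so the FIQ handler writes one entry of such a sequence to the port per invocation.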

Reducing jitter

When looking at the output ports with the logic analyzer, though, I could see variations in the bit edges on the order of one microsecond, which is beyond what the DMX receivers would be able to tolerate. These variations were visible even when the FIQ handler just set the port bits without writing any real data. The source of this jitter, as it turned out, was virtual address translation. The ARM9 CPU includes an MMU, and all software, including interrupt handlers, uses virtual addressing, even to access I/O ports. The virtual address maps are stored in SDRAM due to their size, and the hardware automatically traverses these maps if an address cannot be found in the translation lookaside buffer (TLB) in the MMU. So in the FIQ handler, when the output port address was not present in the TLB, the MMU would access the SDRAM, and SDRAM random access is rather slow.

Thus, I had to change the FIQ handler so that it never accesses the SDRAM, either directly (by reading the data buffer) or indirectly (by causing TLB misses). The AT91SAM9260 CPU has two internal SRAMs that are accessible in two machine cycles, and I use one of those as the data buffer. To prevent TLB misses, individual TLB entries can be locked down so that they are never evicted by the MMU automatically. My driver therefore locks down the TLB entries for the I/O and SRAM buffer addresses that the FIQ handler accesses.

Reducing jitter even more

Even with the TLB entries locked down, I still saw some jitter in the leading edges of the DMX bytes. The cause was variation in interrupt response time, which depends on the instruction being interrupted. The variations were below one microsecond, but that was long enough to throw the serial decoding routine in my logic analyzer out of sync. Wanting to play it safe, I decided to remove that jitter as well by synchronizing on the timer value inside the FIQ handler: instead of just banging out the next bit as fast as possible, the FIQ handler now waits until the free-running timer that triggered the interrupt reaches a certain value. That way, the effective FIQ response time is made mostly constant.
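The effect of that synchronization can be illustrated with a small simulation (a sketch only; the real handler re-reads the hardware timer's counter register, and the names here are mine):

```c
#include <stdint.h>

/* Simulated free-running timer, incremented once per "machine cycle". */
static uint32_t timer;

/* Model of the FIQ handler: it is entered entry_delay cycles after the
   timer interrupt fired (the response-time jitter), then spins until the
   timer reaches sync_point before driving the port. Returns the cycle at
   which the output edge actually happens. */
uint32_t fiq_edge_cycle(uint32_t entry_delay, uint32_t sync_point)
{
    timer = entry_delay;        /* handler entered late by entry_delay cycles */
    while (timer < sync_point)  /* wait for the free-running timer */
        timer++;                /* stands in for re-reading the counter */
    return timer;               /* port bits are written here */
}
```

As long as the worst-case entry delay stays below the chosen sync point, the output edge always lands on the same cycle, whatever the jitter at handler entry was.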

Wow, 2010!

In the last few years, I have been using various AVR CPUs for my embedded projects. They are great CPUs, easy to use and fast, and the advent of LUFA and Teensy made my life a lot easier, as I was freed from USB serial dongles and driver installation. But being able to process serious amounts of data is nice, too, and this is where 8 bits just don't suffice. ARM-based boards are cheap nowadays, and I'm looking forward to embedding JavaScript or maybe even Common Lisp in one of my future projects. When speed is needed, I can always fall back to C and assembler.