Embedded Firmware — [hard] Lessons learned

Over the years I’ve developed a number of embedded systems across many industries, and have learned a lot of painful lessons along the way.  This document is an attempt to capture some of those, in the hopes others can avoid the same pitfalls on their path.

Failure to propagate an error-code up the call chain usually means lots of painful debugging and ouija board use in the future.

In my early days, if a function would fail, I’d just return “failed” to the caller.  Things had gone sour, so why care about the failure since we were toast now.  Then I actually had to debug large and complex systems, and more so, create robust systems that would either work around the fault or attempt to recover.  After that, I learned that every function should return a unique error code upon failure, and callers should propagate that up the call chain until it was printed/logged/displayed somewhere that a human could see.

Most systems don’t have more than 2^15 ways a function can fail, so it can be as simple as returning a negative 16-bit value to indicate failure.  The cost is very low.

By propagating the unique error code up the chain, I can instantly find where the error occurred no matter how deep in the drivers it happened, rather than having to guess where that ERR_TIMEOUT failure was triggered.
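
A minimal sketch of the idea in C.  All of the names and code values here are made up for illustration, and the low-level driver is a stand-in with a test hook rather than real hardware access.  The point is only that each failure site gets its own negative 16-bit code and every caller passes it up untouched.

```c
#include <stdint.h>

/* Hypothetical codes: one unique negative value per failure site. */
typedef int16_t err_t;
#define ERR_OK                  0
#define ERR_I2C_ADDR_NACK    (-101)  /* i2c_write_reg(): address byte not ACKed */
#define ERR_I2C_DATA_TIMEOUT (-102)  /* i2c_write_reg(): timed out sending data */

/* Test hook standing in for real hardware behavior. */
err_t sim_bus_fault = ERR_OK;

/* Stand-in for a low-level driver. A real one would return the code
 * belonging to whichever step actually failed. */
err_t i2c_write_reg(uint8_t addr, uint8_t reg, uint8_t value)
{
    (void)addr; (void)reg; (void)value;
    return sim_bus_fault;
}

err_t sensor_init(void)
{
    err_t rc = i2c_write_reg(0x68, 0x6B, 0x00);
    if (rc < 0) return rc;   /* pass the original code up, don't flatten it */
    return ERR_OK;
}

err_t system_init(void)
{
    err_t rc = sensor_init();
    if (rc < 0) return rc;   /* same rule at every level */
    return ERR_OK;
}
```

Whatever finally prints or logs the result of system_init() then shows -102 rather than a generic "failed", which points at exactly one line in one driver.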

Anything but the most low-level communications link can fail, and sometimes even those fail – be ready for it and never assume success.

For the most part EEPROMs live forever and don’t wear out, and we often assume that once a write to a memory location completes we’re good to go.  Most code will be successful with this approach, especially when we’ve done the analysis showing that with one write per boot cycle, and 100,000 boot cycles, we’ll never come close to the wear limit.  Seems fine.

But what happens when there’s a bug in your development code and you happen to spin in a loop continually writing a memory cell?  Oops.  Now that part is unexpectedly worn out.

Or maybe you write an I2C register, but failed to disable a write-protect beforehand by accident or due to a race condition and that critical write was ignored.   Oops.  You’ll never know unless you read it back.
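
Here is a small sketch of write-then-read-back, assuming a byte-wide non-volatile memory.  The memory and its write-protect flag are simulated with a plain array so the example stands alone; nvm_write_byte(), nvm_read_byte(), and the error code are hypothetical names, not any particular part’s API.

```c
#include <stdint.h>

#define ERR_OK            0
#define ERR_NVM_VERIFY (-301)   /* hypothetical code: read-back mismatch */

#define SIM_MEM_SIZE 16u

/* Simulated part: an array plus a write-protect flag that silently
 * swallows writes, the way a protected device would. */
uint8_t sim_mem[SIM_MEM_SIZE];
int     sim_write_protect = 0;

void nvm_write_byte(uint8_t addr, uint8_t val)
{
    if (!sim_write_protect) sim_mem[addr % SIM_MEM_SIZE] = val;
}

uint8_t nvm_read_byte(uint8_t addr)
{
    return sim_mem[addr % SIM_MEM_SIZE];
}

/* Write one byte and confirm it stuck. Skips the write entirely when the
 * cell already holds the value, which also saves wear. */
int nvm_write_verified(uint8_t addr, uint8_t val)
{
    if (nvm_read_byte(addr) == val) return ERR_OK;
    nvm_write_byte(addr, val);
    return (nvm_read_byte(addr) == val) ? ERR_OK : ERR_NVM_VERIFY;
}
```

The skip-if-unchanged check is a design choice: it costs one extra read but means a runaway loop rewriting the same value never touches the cell.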

And of course there are all the single-bit errors that can happen due to transients on that lowly RS232 serial channel when a solenoid engages that happens to have its power wire adjacent to the serial wires.

Power-ups are messy

If only when we power up our systems, everything would come up clean and ready to go.  But some parts are sleepy and wake up slowly.  Or get grumpy when the slew rate of our power rails isn’t what they expected, or some pin toggles during their power up setting them off like someone who needs coffee before conversation in the morning.

So, on power up, always make sure that peripheral is actually ready to talk.  Often they have their own POST to get through, and might not be fully responsive initially.  If you don’t check, then maybe your power-up initialization works fine until you change the boot-up timing, and then unexpectedly something fails.

When parts have a way to do a reset, it’s a good practice to invoke that reset when the driver initializes.  That way you know it’s clean, and the part is in a defined state.  And that way if your system needs to re-initialize, you can call that driver again and again and it will be immune to the previous peripheral state.
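
One way to sketch "reset it, then wait until it is actually ready" in C.  The periph_ops_t structure, the error code, and the simulated part are all assumptions for illustration; a real driver would wire these hooks to its own reset line or command, status register, and delay routine.

```c
#include <stdint.h>
#include <stdbool.h>

#define ERR_OK                  0
#define ERR_PERIPH_NOT_READY (-401)   /* hypothetical code */

/* Hooks a driver supplies for its part (hypothetical interface). */
typedef struct {
    void (*reset)(void);           /* put the part in a defined state */
    bool (*is_ready)(void);        /* true once the part answers      */
    void (*delay_ms)(uint16_t ms);
} periph_ops_t;

/* Always reset first, then poll for readiness with a bounded wait. */
int periph_init(const periph_ops_t *ops, uint16_t timeout_ms)
{
    ops->reset();
    for (uint32_t waited = 0; waited <= timeout_ms; waited++) {
        if (ops->is_ready()) return ERR_OK;
        ops->delay_ms(1);
    }
    return ERR_PERIPH_NOT_READY;
}

/* Simulated part: not ready for the first N polls after each reset. */
int sim_polls_needed = 3;
static int sim_polls_left;

static void sim_reset(void) { sim_polls_left = sim_polls_needed; }
static bool sim_is_ready(void)
{
    if (sim_polls_left > 0) { sim_polls_left--; return false; }
    return true;
}
static void sim_delay_ms(uint16_t ms) { (void)ms; }

const periph_ops_t sim_ops = { sim_reset, sim_is_ready, sim_delay_ms };
```

Because periph_init() resets before it polls, calling it a second time gives the same result no matter what state the part was left in.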

Floating pins create nasty bugs

So many times I’ve seen designs with floating GPIOs, rationalized with the argument, “We’ll initialize that pin on power-up, so it’s okay that it floats.  It’s just for a short time, so it doesn’t matter.”  This has burned countless designs again and again.

Systems must be stable and deterministic in reset.  Non-negotiable.

From a firmware perspective, you should never need to race to initialize pins on power-up, and it’s good to test adding delays in power up to expose race conditions and unstable boot conditions.  Sometimes this can manifest as excessive power consumption during boot, and by adding a delay it’ll help the hardware team find it.

Beware of assumed long term perfection

It’s common for some SPI peripherals to just emit endless streams of data formatted into packets of a fixed length.  Thus once you start the streaming, you just pack every N bytes into a structure and you’re good to go for thousands, millions, billions(?) of packets.  You’re synchronized, right?  Yeah, kinda.

I’ve seen systems like this often purr like well-oiled machines for countless streams, but it all hinges on not skipping a beat.  Ever.  But even a human heart sometimes just skips a beat, and the same for embedded systems – sometimes we miss a byte, or someone messes up sending it, or it gets corrupted.  The point is, it’s essential to not assume long synchronization, and periodically confirm that all is well.

This could be an IMU, for example, that streams out a packet of 6 axes of 2-byte data (12 bytes) whenever read.  Or an AFE reading out sensor channels in a block.  In both of these cases I’ve seen a need to watch for loss of synchronization, and to have a detection/recovery mechanism in place.  In one-way fiber-optic communication systems, this can mean forward-error-correction mechanisms and key-frames, for example.
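
A sketch of one possible detection/recovery scheme, under a big assumption: that the stream carries a sync byte and a checksum around each 12-byte payload.  Many real parts don’t frame their data this way, so the framing, the names, and the simple additive checksum are all hypothetical.  What it illustrates is checking every packet and falling back to hunting for sync instead of trusting the byte count forever.

```c
#include <stdint.h>
#include <stdbool.h>

#define SYNC_BYTE   0xA5u   /* assumed start-of-packet marker        */
#define PAYLOAD_LEN 12u     /* 6 axes x 2 bytes, as in the IMU case  */

typedef enum { FR_HUNT = 0, FR_PAYLOAD, FR_CHECK } framer_state_t;

typedef struct {
    framer_state_t state;
    uint8_t  buf[PAYLOAD_LEN];
    uint8_t  count;          /* payload bytes collected so far */
    uint8_t  sum;            /* running 8-bit additive checksum */
    uint32_t good_packets;
    uint32_t sync_losses;    /* worth logging and reporting     */
} framer_t;

/* Feed one received byte. Returns true when f->buf holds a packet whose
 * checksum matched. A zero-initialized framer_t starts in FR_HUNT. */
bool framer_feed(framer_t *f, uint8_t byte)
{
    switch (f->state) {
    case FR_HUNT:
        if (byte == SYNC_BYTE) {
            f->count = 0;
            f->sum = 0;
            f->state = FR_PAYLOAD;
        }
        return false;
    case FR_PAYLOAD:
        f->buf[f->count++] = byte;
        f->sum = (uint8_t)(f->sum + byte);
        if (f->count == PAYLOAD_LEN) f->state = FR_CHECK;
        return false;
    case FR_CHECK:
        f->state = FR_HUNT;              /* either way, look for next sync */
        if (byte == f->sum) {
            f->good_packets++;
            return true;
        }
        f->sync_losses++;                /* dropped or corrupted byte */
        return false;
    }
    return false;
}
```

With this scheme a single dropped byte costs at most a couple of packets before the parser locks back on, and the sync_losses counter shows how often it happens.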

That lowly UART is your friend

Any system of consequence I’ve done has included a UART that is dedicated to the firmware team to do with what they want.  Yes, there are lots of fancy JTAG debuggers, but getting detailed debug info barked out that simple TxD channel can be immeasurably valuable.

On most SoCs it takes just a couple register writes to enable a simple busy-wait UART transmitter, and at 115.2k-baud, it’s an acceptable time to wait for a byte to clock out.  Later in the boot process a nice interrupt-driven driver can be turned on, but it’s helpful to have a dirt-simple way to send debug data immediately in the boot process.
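
For scale, at 115,200 baud with the usual 8N1 framing each byte is 10 bits on the wire, so about 87 µs per character.  A busy-wait transmitter can be sketched as below.  The register layout and the TX-empty bit are invented for illustration; on a real SoC they come from the datasheet, and the baud-rate and enable writes would happen before this is first used.

```c
#include <stdint.h>

/* Hypothetical UART register block. On real silicon this would be a
 * pointer to a fixed address taken from the datasheet. */
typedef struct {
    volatile uint8_t status;   /* bit 0: transmit holding register empty (assumed) */
    volatile uint8_t txdata;   /* write a byte here to send it (assumed)           */
} uart_regs_t;

#define UART_STATUS_TX_EMPTY 0x01u

/* Spin until the transmitter can take a byte, then hand it over. */
void uart_putc_blocking(uart_regs_t *uart, char c)
{
    while ((uart->status & UART_STATUS_TX_EMPTY) == 0) {
        /* busy-wait: no interrupts, no buffers, nothing to go wrong */
    }
    uart->txdata = (uint8_t)c;
}

void uart_puts_blocking(uart_regs_t *uart, const char *s)
{
    while (*s) uart_putc_blocking(uart, *s++);
}
```

There is deliberately no timeout in the spin loop here; whether one is worth adding depends on whether a stuck UART should be allowed to hang boot.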

POST data is essential.  Having each driver broadcast out configuration information makes it trivial to see when a system boots up in an unexpected state.

Find your neighbors on power-up

In POST, I have saved myself a lot of debugging time by scanning each I2C bus to see which addresses respond.  It’s reasonably fast to do, and showing every responding I2C address in the POST UART stream can help debug a mis-configuration or missing component.  I generally also display the value read from the address, as well as what peripheral I expected to be at that address.
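
A sketch of such a scan, assuming some probe function exists that reports whether a 7-bit address ACKs.  The probe signature, the expected-device table, and the simulated bus are hypothetical, and printf stands in for the POST UART.  It scans 0x08 through 0x77, skipping the address ranges the I2C specification reserves.  Printing a value read from each device is left out to keep it short.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical hook: returns true if a device ACKs this 7-bit address. */
typedef bool (*i2c_probe_fn)(uint8_t addr7);

typedef struct {
    uint8_t     addr;   /* 7-bit address we expect to be populated */
    const char *name;   /* what should be there                    */
} expected_dev_t;

/* Scan the bus, report every responder next to what was expected there,
 * and return how many expected devices failed to answer. */
int i2c_post_scan(i2c_probe_fn probe, const expected_dev_t *expected,
                  int n_expected)
{
    int missing = 0;
    for (uint8_t addr = 0x08; addr <= 0x77; addr++) {
        const char *name = NULL;
        for (int i = 0; i < n_expected; i++) {
            if (expected[i].addr == addr) name = expected[i].name;
        }
        if (probe(addr)) {
            printf("I2C 0x%02X: ACK      (%s)\n", (unsigned)addr,
                   name ? name : "unexpected");
        } else if (name) {
            printf("I2C 0x%02X: MISSING  (%s)\n", (unsigned)addr, name);
            missing++;
        }
    }
    return missing;
}

/* Simulated bus for illustration: only 0x3C and 0x50 respond. */
bool sim_probe(uint8_t addr7) { return addr7 == 0x3C || addr7 == 0x50; }

const expected_dev_t sim_expected[] = {
    { 0x50, "EEPROM" },
    { 0x68, "IMU"    },
};
```

With the simulated bus, the scan reports 0x3C as an unexpected responder, 0x50 as the expected EEPROM, and 0x68 as a missing IMU.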

Remember the movie Memento

Most SoCs have one or more registers that try to explain what just happened on boot.  Did I just have a watchdog reset?  Software reset?  Hardware reset?  Brown-out?  Read those registers and display them in your POST UART stream.  And log them later.
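
A small sketch of turning a reset-cause register into something readable for the POST stream.  The bit assignments below are placeholders, not any specific part’s layout; check the datasheet for the real register and whether the flags need to be cleared after reading.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Placeholder bit layout for illustration only. */
#define RST_POWER_ON  (1u << 0)
#define RST_EXTERNAL  (1u << 1)
#define RST_BROWN_OUT (1u << 2)
#define RST_WATCHDOG  (1u << 3)

/* Write a string such as "POR|BOR" into out and return out.
 * More than one flag can be set, so all of them are listed. */
const char *reset_cause_describe(uint8_t flags, char *out, size_t out_len)
{
    static const struct { uint8_t mask; const char *name; } causes[] = {
        { RST_POWER_ON,  "POR" },
        { RST_EXTERNAL,  "EXT" },
        { RST_BROWN_OUT, "BOR" },
        { RST_WATCHDOG,  "WDT" },
    };
    size_t used = 0;

    if (out_len == 0) return out;
    out[0] = '\0';
    for (size_t i = 0; i < sizeof causes / sizeof causes[0]; i++) {
        if ((flags & causes[i].mask) && used < out_len) {
            int n = snprintf(out + used, out_len - used, "%s%s",
                             used ? "|" : "", causes[i].name);
            if (n < 0) break;
            used += (size_t)n;
        }
    }
    if (used == 0) snprintf(out, out_len, "UNKNOWN");
    return out;
}
```

Reading the register once, as early in boot as possible, and keeping the raw value alongside the decoded string makes it easy to log later.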

That POST TxD stream is ephemeral, but useful – don’t lose it

The serial output stream on power-up can be super useful, but it’s fleeting.  And sometimes while debugging, you want to know what happened at power up but that was long ago.  Solution?  Store it.  Write every byte sent out the UART to a RAM buffer so you can replay it later.  That buffer should be just large enough to hold all the POST messages, and then once boot has completed stop writing to it.  If resources are too tight to keep it, just keep it as long as you can before you reallocate that RAM.  Or after boot, write it out to some non-volatile memory that’s sitting idle.
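
A minimal sketch of that capture buffer.  The buffer size and function names are assumptions, and the actual UART transmit is left as a comment so the example stands alone.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define POST_LOG_SIZE 256u   /* assumed: just big enough for the POST text */

static char   post_log[POST_LOG_SIZE];
static size_t post_log_len;
static bool   post_log_frozen;

/* Single choke point for debug output: send the byte and, until boot is
 * declared finished, keep a copy in RAM. */
void debug_putc(char c)
{
    /* the real UART transmit of c would go here */
    if (!post_log_frozen && post_log_len < POST_LOG_SIZE) {
        post_log[post_log_len++] = c;
    }
}

void debug_puts(const char *s)
{
    while (*s) debug_putc(*s++);
}

/* Call once boot has completed; later output is no longer captured. */
void post_log_freeze(void)
{
    post_log_frozen = true;
}

/* Copy the captured POST text out for replay or for writing to
 * non-volatile storage. Returns the number of bytes copied. */
size_t post_log_copy(char *dst, size_t dst_len)
{
    size_t n = (post_log_len < dst_len) ? post_log_len : dst_len;
    for (size_t i = 0; i < n; i++) dst[i] = post_log[i];
    return n;
}
```

If the POST text overflows the buffer, this version keeps the earliest bytes and drops the rest, which preserves the start of boot.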

It’s good to know your history

Most embedded systems have non-volatile storage, which often includes things like MAC addresses, serial numbers, configuration details, and calibration constants.  In a perfect world, we knew at the beginning everything we’d need to store, but reality is otherwise so this table of non-volatile details evolves and grows.  But you don’t want to be saddled with legacy format decisions, so it’s been greatly helpful to track the previous FW version to help migration.  Thus when a new FW version is loaded, it sees that the version changed and executes the appropriate migration code.  New code isn’t hindered by old decisions, and configuration details aren’t lost.
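
One way this can look in code, as a sketch: the stored record carries the version that last wrote it, and on boot the new firmware walks it forward one step at a time.  The record layout, field names, and version numbers are invented for illustration.

```c
#include <stdint.h>

/* Hypothetical non-volatile record. Fields were added over time. */
typedef struct {
    uint16_t stored_version;  /* version of the FW that last wrote this */
    uint8_t  mac[6];          /* present since version 1                */
    int16_t  cal_offset;      /* added in version 2                     */
    uint16_t cal_gain_q8;     /* added in version 3; 256 means 1.0      */
} nvm_config_t;

#define NVM_VERSION_CURRENT 3u

/* Bring an older record up to the current layout, one version per step.
 * Returns the number of steps applied, or -1 if the record comes from an
 * unknown or newer version. The caller writes the record back if > 0. */
int nvm_migrate(nvm_config_t *cfg)
{
    int steps = 0;

    if (cfg->stored_version > NVM_VERSION_CURRENT) return -1;

    while (cfg->stored_version < NVM_VERSION_CURRENT) {
        switch (cfg->stored_version) {
        case 1:  cfg->cal_offset  = 0;   break;  /* 1 -> 2: default offset */
        case 2:  cfg->cal_gain_q8 = 256; break;  /* 2 -> 3: unity gain     */
        default: return -1;                      /* no known path forward  */
        }
        cfg->stored_version++;
        steps++;
    }
    return steps;
}
```

Stepping one version at a time means each release only has to add a single new case, and a unit that skipped several releases still ends up in the right place.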

Think OOPy

Long story short, keep the configuration and calibration data close to where it’s needed.  This generally means keeping this data on device in non-volatile memory, rather than in some off-module table.  This saves elaborate systems of tracking, storing, transporting, deploying such data.  It’s just written on the device during manufacturing/calibration and it stays with the module, tucked close to the code that actually needs it.

I’ve seen countless man-years spent on trying to develop/maintain systems that try to keep minimal data on embedded systems and rely on matching that to off-module databases.

Making EEPROM non-volatile on YUN

The Arduino YUN is one of my favorite prototyping platforms for Wi-Fi connected microcontroller applications, but for unclear reasons the originators felt that by default the EEPROM contents should be lost when reprogramming code images.  Let’s fix that.

I typically use EEPROM to store board-specific identification, configuration, and calibration information — things I’d rather not lose when updating the code.  And if you only update using the USB connection you’ll be fine.  But leverage one of the great features of the YUN — updates over Wi-Fi within the IDE — and say goodbye to your EEPROM.

It’s an easy fix, though, and you just need to ssh into the board, change two bytes, and EEPROM will then be preserved.

You’ll need to have your YUN connected to Wi-Fi, and you’ll need the IP address.  To find it, either use the IDE to select the Wi-Fi connected port, or use a Bonjour browser such as this one for Windows.

Once you have an IP address, ssh into the board using a tool such as PuTTY.  Then navigate to /usr/bin and edit run-avrdude (I use vi), changing the 0xD8 after hfuse to 0xD0 in both avrdude command lines.  Save the file and your Wi-Fi code updates will no longer clear out your EEPROM settings.

root@MyYunName:/usr/bin# cd /usr/bin
root@MyYunName:/usr/bin# vi run-avrdude
#!/bin/sh
echo 1 > /sys/class/gpio/gpio21/value
avrdude  -q -q -c linuxgpio -C /etc/avrdude.conf -p m32u4 -U efuse:r:/tmp/efuse:d
read EFUSE < /tmp/efuse
rm -f /tmp/efuse
if [ "x$EFUSE" = "x203" ] # 203 = 0xCB
then
        avrdude -c linuxgpio -C /etc/avrdude.conf -p m32u4 -U lfuse:w:0xFF:m -U hfuse:w:0xD0:m -U efuse:w:0xCB:m -Uflash:w:$1:i $2
else
        avrdude -c linuxgpio -C /etc/avrdude.conf -p m32u4 -U lfuse:w:0xFF:m -U hfuse:w:0xD0:m -U efuse:w:0xFB:m -Uflash:w:$1:i $2
fi
echo 0 > /sys/class/gpio/gpio21/value
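
Why 0xD8 becomes 0xD0: as I read the ATmega32U4 datasheet, bit 3 of the high fuse byte is EESAVE, and AVR fuse bits are active-low, so clearing that bit tells a chip erase to leave the EEPROM alone.  The helper below is just that arithmetic spelled out; the function name is made up, and it’s worth confirming the bit position against the datasheet for your part.

```c
#include <stdint.h>

/* High fuse bit 3 = EESAVE on the ATmega32U4 (per my reading of the
 * datasheet). Fuse bits are active-low: 0 means "programmed". */
#define HFUSE_EESAVE_BIT 3u

/* Return the high fuse value with EESAVE programmed (bit cleared), so
 * EEPROM contents survive a chip erase. */
uint8_t hfuse_preserve_eeprom(uint8_t hfuse)
{
    return (uint8_t)(hfuse & ~(1u << HFUSE_EESAVE_BIT));
}
```

Applied to the stock value, 0xD8 (1101 1000) becomes 0xD0 (1101 0000), which is the edit made in run-avrdude.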

High resolution run-time delays on ATmega (Arduino) platforms

I recently encountered a need to dynamically generate high resolution time delays on an Arduino platform (YUN, ATmega), and this thread is to share the solution in case it helps others.
This gives run-time adjustable delays from 625ns up to 96.25us, in steps of 62.5ns.
If others have superior solutions, I’m all ears.
/*--------------------------------------------------------------*/
/* Delay loop functions so we can do single cycle programmable  */
/* delays (62.5ns resolution).                                  */
/*--------------------------------------------------------------*/

// Assumed definition (the original listing does not show it): a single
// NOP takes one CPU cycle, which is 62.5ns at 16MHz.
#define DELAY_63NS __asm__ __volatile__ ("nop")

byte tune_delay=20;                  // nominal starting value
unsigned long freq;

// Each function loops 'ticks' times (6 cycles = 375ns per pass) and then
// adds 0 to 5 extra single-cycle NOPs to fill in the steps between passes.
void Delay_Plus0(byte ticks)
{
  for (; ticks; ticks--) DELAY_63NS; // 375ns/tick
}

void Delay_Plus1(byte ticks)
{
  for (; ticks; ticks--) DELAY_63NS; // 375ns/tick
  DELAY_63NS;
}

void Delay_Plus2(byte ticks)
{
  for (; ticks; ticks--) DELAY_63NS; // 375ns/tick
  DELAY_63NS; DELAY_63NS;
}

void Delay_Plus3(byte ticks)
{
  for (; ticks; ticks--) DELAY_63NS; // 375ns/tick
  DELAY_63NS; DELAY_63NS; DELAY_63NS;
}

void Delay_Plus4(byte ticks)
{
  for (; ticks; ticks--) DELAY_63NS; // 375ns/tick
  DELAY_63NS; DELAY_63NS; DELAY_63NS; DELAY_63NS;
}

void Delay_Plus5(byte ticks)
{
  for (; ticks; ticks--) DELAY_63NS; // 375ns/tick
  DELAY_63NS; DELAY_63NS; DELAY_63NS; DELAY_63NS; DELAY_63NS;
}



// ---------------- Setup for changing the delay:
// (fragment: place inside the function that configures the delay)

    byte tune_delay_div;
    void (*delay_func_ptr)(byte);
    switch (tune_delay%6) {          // remainder picks the number of extra NOPs
      case 0 : delay_func_ptr = &Delay_Plus0; break;
      case 1 : delay_func_ptr = &Delay_Plus1; break;
      case 2 : delay_func_ptr = &Delay_Plus2; break;
      case 3 : delay_func_ptr = &Delay_Plus3; break;
      case 4 : delay_func_ptr = &Delay_Plus4; break;
      case 5 : delay_func_ptr = &Delay_Plus5; break;
    }
    tune_delay_div = tune_delay/6;   // quotient is the loop count

// ---------------- Using the delay in the timing critical section


      (*delay_func_ptr)(tune_delay_div); // (625 + tune_delay*62.5) ns

You know what they say about people with big hands?

They need big mice.

In this era of making everything smaller, some things aren’t getting smaller.  Us.

For me it’s always been a challenge to find a mouse that fits well, as most mice out there are too small.  Years ago there was the Whale Mouse, but alas, that can only be found in dark alleys and surplus stores now, and even it wasn’t perfect, lacking side buttons.

The rest of this story is a quick review of larger mice that can be found in active production, and what I found to be the winners in the group.


Google SearchNinja tips

I often dismiss this type of article as a collection of “secrets” that aren’t actually so secret.  Like that exit on I-80 for Secret Town Road – how secret can that be?

Well, this article has some great Google Search tricks that are quite useful.

 

My two top favorites:

  • Image search by uploading an image. Wow. Very cool. Upload an image and the Goog will find images that are similar. Looking for a higher resolution of an image you already have? Use this. Like your image of the horned frog but want to find it in a different pose? Use this.
  • Block sites from your searches…always. Have a grudge against Amazon.com? Put them on your list of sites to never search against and you don’t have to see them again.



Tufte-isms: Great words from a man of great graphs

Engineering is a marvelous thing.  We get the joy of creating the amazing things that make life better and often more interesting, or at least more entertaining at times.  We don’t live in a vacuum, though, we the engineers, and as such we need to be good communicators.

Enter Edward Tufte.

If you’ve not been to one of his presentations, I highly recommend you put one on your schedule — he is the master evangelist of the power of quality graphics.  His point being that the data, no matter how good, is worthless unless you can convey the messages within.

What brings him up in my mind today, is an entertaining article from IEEE Spectrum in which the author talks about not the graphs of Tufte, but rather the words of Tufte.

Worth a read here.


Sometimes it *is* all about the blinky light

This is a great story of an amazingly innovative solution to a difficult and expensive problem.



CES 2012 – Armageddon Year?

According to the Mayans the calendar comes to an end this year.  Does this mean the world ends?  Probably not.  But maybe they were thinking of CES.


NTSB final report on C310 N5225J (Doug Bourn) — Just the facts, ma’am

Earlier this week the NTSB released its final report on the crash of N5225J, a Cessna 310R that crashed into East Palo Alto in February of 2010 when on an instrument departure from KPAO.


What does the FU in IFU really stand for?*

I really had to wonder this a few days ago as I took a new smoke alarm out of its box because I was replacing the roof on the house. (Yes, California is a funny place.)

Replacing a smoke alarm seemed easy enough. What could it take? Just a couple screws, a battery, and I’d be In Like Flint.

Maybe.

Here’s the trick question of the day: “Which comes with more instructions: a $9 smoke alarm or a $400 iPhone?” Right, it’s the smoke alarm. Feast your eyes on this:

On the left are the instructions for the smoke alarm (all in English), and on the right, the iPhone’s. Okay, yes, I admit that I didn’t unfold the iPhone instruction packet, but there wouldn’t be much there even if I had.

My question is, “WHY?!”

Really. Do we really need this many instructions for a smoke alarm? Well, my guess is that the lawyers say, “Yes.” Every possible misuse had to be accounted for and warned against. Every accident that ever occurred where a smoke alarm might not have done its job perfectly resulted in another sentence, another picture, another set of guidelines to achieve a successful smoke alarm experience.

And the iPhone? Actually, I’ve never read that little booklet, and my experience is incredible. I’m scared to think how good it might be if I actually read the instructions.

* [Feel Useless]
