Over several months in late '08 and early '09, I struggled with
a serious problem in an app I was coding for the Atmel ATmega1284P
MCU. The symptoms were corrupted RAM, as if I was getting a stack
overflow or a runaway pointer.
The application in question was a BASIC interpreter, quite large and
complex. Had I only been developing on the '1284p, I would have
written off the problem as a code issue. But I was porting the
same application to the ATmega128 and the AT90CAN128 simultaneously,
and neither of those ports showed these issues.
I began scaling back the size of the app, trying to pin down where the
corruption was happening. Eventually, I reduced the system to the
following minimum set capable of showing the problem:
- ATmega1284P with external 20 MHz crystal
- Solid, well-behaved power source (5 VDC bench supply)
- LED and limiting resistor on a port line
- Port line RX0 (PD0) configured as an input
- Application to blink the LED using a subroutine call!
- All interrupts in the application are disabled
Here is the entire program for demonstrating the problem:
#include <avr/io.h>
#include <stdint.h>

#define TESTPIN 0

void toggle(void);

int main(void)
{
    volatile uint16_t n;

    DDRB = DDRB | (1<<TESTPIN);
    while (1)
    {
        for (n=0; n<20000; n++) ;
        toggle();
    }
}

void toggle(void)
{
    PORTB = PORTB ^ (1<<TESTPIN);
}
With the above setup running, the LED blinked at the expected
rate. I used a fairly slow rate; anything will work.
All that matters is that you be able to spot a difference if the app
hits the tall weeds because of code runaway following a stack
corruption.
The corruption can be triggered by injecting sharp-edged pulses onto
port line RX0. The simplest way to do this is to hook up an
RS-232 level-shifter wired to a PC's serial port, but any other source
of pulses will work. I can cause the corruption by briefly touching RX0
with a bare wire connected to ground. When the corruption
occurs, the LED issues at least one blink of incorrect length as the
code runs away, then eventually lands on the reset vector or on main()
and restarts.
This problem is not unknown in the Atmel world, by the way. The
Uzebox group has noticed similar behavior with the ATmega644P; check
here.
I described what I was seeing to the AVRFreaks list (an excellent
community resource!) and asked for
help. Most of the responders were hung up on the RX0 line and the
fact that the corruption coincided with serial traffic from the PC; I
got lots of responses about baud rate and USART setup. One
respondent, however, understood what I was describing and provided the
fix. Ossi suggested putting a 1K resistor in series with RX0, and
a 100 pF capacitor from RX0 to ground. I followed his suggestion,
using a 220 pF cap instead. All signs of the RAM corruption
disappeared. If I modified the code to use RX0 as a USART, serial
data from the PC was received properly.
Based on the Uzebox link above, I believe this corruption involves
noise injected onto the internal RAM bus and is somehow
related to the picoPower version of the '1284p device. I
contacted Atmel about this issue and was told that the '1284p devices I
used were pre-Production and that the target date for Production devices
was Nov '09. After a couple more emails to Atmel about this
subject, I was informed that there was a new target date for Production
devices, now Feb '10!
Note that the problem does NOT appear at 16 MHz! This only seems
to be a problem when you push the '1284p close to its maximum rated
clock speed.
I further suspect that this corruption could be induced by the
application, if the app used PD0 as an output and created pulses on
PD0. Note that I haven't tested this theory, but that would be a
bear to track down if it happened! :-)