Atmel, TCP/IP, GNU

Some time ago I ported Interniche's TCP/IP stack to the Atmel AT91RM9200 using the GNU compiler. The Atmel chip, an ARM9-based SoC, was hosted on a Cogent OEM single-board computer: the CSB337, shown below on a CSB300 base board. Though it is still available through various outlets, mainly for evaluation purposes, the board itself should be considered obsolete. The GNU compiler was a commercial variant: the ARM-ELF targeted tool-chain from the multi-target MicroCross Visual X distribution.

[Image: CSB337 on a CSB300 base board]

Some pertinent documentation and links:

Manual  The Atmel AT91RM9200 chip
Manual  The Cogent CSB337 board
Manual  Ed Sutter's MicroMonitor at MicroCross
Datasheet  The Intel LXT971 PHY on the CSB337
Interniche Cogent Atmel Microcross

The port did not go as smoothly as it should have, owing to the absence of a JTAG debugger. The CSB337 comes with the MicroMonitor bootloader, and while this is a very nice and nifty facility, including TFTP and a tiny file system, it was also the only means of actually getting code onto the board and running it.

Of course I ran into several problems getting this up and running. Anyone interested can contact me for details on the port, but what I want to post here is the separate MAC/PHY driver, which may be of use to others. You'll find the download here.

Also included in this ZIP is the file acksum.s: an optimized GNU assembler routine that computes the checksum of IP frames, based on the assembler example in section 4.2 of RFC 1071. It allows data to be aligned on a 16bit boundary without running the processor into an alignment exception. The code is shown below.

   .section .text

.global acksum

.code 32

@ This algorithm performs the sum 32 bits at a time (the native word length of the
@ ARM core; 4 octets/bytes) and 'unrolls' the summing loop with 16 replications.
@ This way, each pass through the loop handles a chunk of up to 64 octets/bytes.
@ Additional logic is needed to handle the situation where the byte-count is not a
@ multiple of 4. Extra code is present to allow for 16bit data alignment.

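@ C prototype is:
@ unsigned short acksum(void * ptr, unsigned count);
@ The return value is the folded 16bit sum; it is NOT inverted here.
@ register usage:
@ r0 - data pointer (parm 1), also holds the return value
@ r1 - count of 16bit words (parm 2)
@ r3 - scratch pointer to data to sum
@ r4 - scratch word count
@ r5, r6, r7, r9 - scratch
@ r8 - checksum accumulator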

acksum:

stmfd sp!, {r3-r9} @ save local registers

@ Let r8 be the checksum accumulator. Clear it to start with

mov r8, #0 @ r8 is checksum accumulator, initialized to zero

@ If the count of 16bit words to sum is zero, then bail out with a zero result

cmp r1, #0 @ is r1 zero?
beq done_sum @ we're done if so..

@ We first need to solve an alignment issue. When handling 32bit entities, their
@ addresses MUST be 32bit aligned. This code intends to also handle data starting
@ on a 16bit alignment. If the data pointer IS divisible by 4, we can simply
@ proceed. If the 16bit word count is ODD in this case, we can sum the remaining
@ 16bit word AFTER completing the loop.
@ If the data pointer is NOT divisible by 4 and the 16bit word count is ODD, we
@ can solve both matters by doing the first 16 bit sum _before_ the loop.
@ If we have an EVEN 16 bit word count in this case, we need to do a 16bit word
@ sum both before and after the loop.

tst r0, #3 @ is r0 divisible by 4 (are bit0 and bit1 both clear)?
beq candivby4 @ jump to candivby4 if it is divisible by 4
tst r0, #1 @ r0 is NOT divisible by 4, but is it divisible by 2?
beq candivby2 @ jump to candivby2 if it is divisible by 2
mov r0, #0 @ bail out of this routine with 0 if the data is not even
b done_sum @ 16bit aligned, preferring this to simply crashing..

candivby2:
sub r0, r0, #2 @ Bump down r0 for 2 bytes
ldr r9, [r0] @ get 32bit word; first 16bit data word is in the high part (little-endian)
mov r9, r9, lsr #16 @ shift right, then left again, to get rid of the low 16bit word
mov r8, r9, lsl #16 @ and place the result in the accumulator register r8
add r0, r0, #4 @ Bump up r0 for 4 bytes; is now 32bit aligned
sub r1, r1, #1 @ Decrement word16 count in r1

candivby4: @ continue operation; end of alignment correction

@ The second parameter, arriving in r1, is intended to be a count of 16bit words.
@ We need to double this value to get a count of 8bit bytes (octets). We use r6 to
@ store the result. Note that this byte count will always be an even number.

add r6, r1, r1 @ r6 gets the 8bit byte count: r6=r1+r1 = 2xr1

@ We're going to try to handle data 32 bits (4 bytes) at a time with 16 summations
@ per loop. If our byte-count is less than 64, we need to jump somewhere into the middle
@ of the table of subsequent summations. If we have more than 64 bytes, we'll handle
@ all complete 64 byte chunks after first handling any partial chunk, again by jumping
@ somewhere in the middle of the table. For such a jump, we need to take into account
@ that the byte-count, though always even, does not need to be a multiple of 4.
@ To handle the latter case, a remaining 16bit word will be summed separately after
@ ending the loop.
@
@ The number of bytes that can be handled per loop is 0, 4, 8, ..., 60, 64.
@ Such a sequence follows from 4*(byte-count div 4)
@ The number of required full loops is (byte-count div 64)
@ The partial chunk contains (byte-count mod 64) bytes, rounded down to a multiple of 4
@
@ To get the number of bytes to handle in the first partial loop:
@ Let r7 get 4 * ((bytecount mod 64) div 4)
@ The mod 64 can be obtained by ANDing with 63 while the 4x(.. div 4) boils
@ down to clearing the two least significant bits

mvn r9, #3 @ load NOT of 3 (which is -4) to get FFFFFFFC into r9
and r6, r6, r9 @ r6=r6 AND r9; this clears the 2 lsb bits of r6
and r7, r6, #0x03F @ r7 gets r6 AND 0x3F (63d);

@ 64 minus the value just calculated must be subtracted from the data pointer in r0
@ to obtain a shadow pointer - for which we will use r3 - on which the sum table
@ can be applied after jumping into it.

mov r5, #64 @ load 64 into r5
sub r7, r5, r7 @ r7 gets r5-r7 -> 64 - remainder
sub r3, r0, r7 @ r3 gets adjusted version of the data pointer from r0

@ Mask out the bits of the 16bit word count that are covered by the partial chunk
@ (bits 1-4, i.e. the word count mod 32 except for the lowest bit)

mvn r9, #0x1E @ load NOT of 0x1E
and r1, r1, r9 @ mask out bits 0x1E
mov r4, r1 @ r4 = 32 per full 64 byte chunk, plus 1 for a trailing 16bit word

@ Figure out how far into the table to jump. r7 holds the number of data bytes to
@ skip. Each summation handles 4 data bytes with two instructions (8 bytes of code),
@ so the offset to apply to the program counter pc is twice r7, minus 4 because
@ pc reads as this instruction's address + 8 while the table starts at + 4

adds r7, r7, r7 @ double the skip count, clear carry
sub r7, r7, #4 @ correct for pc reading 8 bytes ahead
add pc, pc, r7 @ jump into accum. loop

@ Table of additions to sum up to 64 bytes from [r3]

next64:
ldr r9, [r3] @ fetch next 4 bytes to r9
adds r8, r8, r9 @ first add (w/o carry) to sum in r8
ldr r9, [r3, #4] @ repeat for a total of 16 native 32bit words: 64 bytes
adcs r8, r8, r9
ldr r9, [r3, #8]
adcs r8, r8, r9
ldr r9, [r3, #12]
adcs r8, r8, r9
ldr r9, [r3, #16]
adcs r8, r8, r9
ldr r9, [r3, #20]
adcs r8, r8, r9
ldr r9, [r3, #24]
adcs r8, r8, r9
ldr r9, [r3, #28]
adcs r8, r8, r9
ldr r9, [r3, #32]
adcs r8, r8, r9
ldr r9, [r3, #36]
adcs r8, r8, r9
ldr r9, [r3, #40]
adcs r8, r8, r9
ldr r9, [r3, #44]
adcs r8, r8, r9
ldr r9, [r3, #48]
adcs r8, r8, r9
ldr r9, [r3, #52]
adcs r8, r8, r9
ldr r9, [r3, #56]
adcs r8, r8, r9
ldr r9, [r3, #60]
adcs r8, r8, r9

@ Between 0 and 16 32bit words have now been added; there can be
@ a carry that must be wrapped in..

adc r8, r8, #0 @ add carry, if any

@ We can finish up if the count of 16bit words is now zero or one

cmp r4, #0 @ check for count == 0
beq fold_sumw @ if so, done looping; go fold result
cmp r4, #1 @ check for count == 1
beq sum_halfword @ if so, go sum the last 16 bits

@ Else loop back through another 64 byte block of sum data

add r3, r3, #64 @ bump pointer 64 bytes
sub r4, r4, #32 @ decrement count of 16bit words
b next64 @ handle next chunk

@ Sum remaining 16bit word, if such data remained after looping

sum_halfword:
mov r9, #0 @ clear high word for final data
ldrh r9, [r3, #64] @ get final 16 bits at [r3+64]
adds r8, r8, r9 @ add to sum
adc r8, r8, #0 @ add carry, if any

@ Fold upper 16 bits back into lower 16 bits;
@ while (sum>>16) sum=(sum and 0xffff)+(sum>>16);

fold_sumw:
mov r9, r8, lsr#16 @ copy high 16 bits to r9
mov r8, r8, lsl#16 @ clear upper 16 bits in r8 ...
mov r8, r8, lsr#16 @ ... by double shifting
add r8, r9, r8 @ add the two 16bit words
mov r7, r8, lsr#16 @ get new hi bits into r7 low
cmp r7, #0 @ was there a carry?
bne fold_sumw @ branch back to wrap in carry bit(s)

@ We're done..

done_sum:
mov r0, r8 @ return the sum from r8 in r0
ldmfd sp!, {r3-r9} @ restore regs
mov pc, lr @ return

.end
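
For cross-checking the assembly above on a host machine, a plain C reference can be handy. What follows is only a minimal sketch and not part of the download: the name acksum_ref is made up here, and it assumes the same contract as the prototype in the listing (a count of 16bit words, returning the folded sum without inverting it). It follows the same idea of summing 32 bits at a time and wrapping each carry back in, but uses memcpy for the loads so that any alignment is acceptable. On a little-endian target it should produce the same value as acksum, though that comparison has not been run here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reference routine (not in the ZIP): returns the folded,
 * non-inverted 16bit ones' complement sum of `count` 16bit words. */
uint16_t acksum_ref(const void *ptr, unsigned count)
{
    const unsigned char *p = ptr;
    uint32_t sum = 0;

    assert(ptr != NULL || count == 0);

    /* Sum 32 bits at a time; wrap each carry back in, as ADCS/ADC do. */
    while (count >= 2) {
        uint32_t w;
        memcpy(&w, p, sizeof w);   /* alignment-safe 32bit load */
        sum += w;
        if (sum < w)               /* unsigned wrap-around means carry out */
            sum++;                 /* end-around carry */
        p += 4;
        count -= 2;
    }

    /* A remaining odd 16bit word is added separately. */
    if (count == 1) {
        uint16_t h;
        memcpy(&h, p, sizeof h);
        sum += h;
        if (sum < h)
            sum++;
    }

    /* Fold the upper 16 bits into the lower 16 until nothing carries. */
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);

    return (uint16_t)sum;
}
```

Like the assembly, this returns the sum itself; the value that goes into an IP header is the bitwise inverse of it. Because the loads use the host's byte order, the result should be compared or stored as two bytes in memory rather than as a number when moving between little-endian and big-endian machines.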

Filed under code & stuff – Published 2006 Aug 13 – Modified 2008 May 11 – Permalink
