Atmel, TCP/IP, GNU
Some time ago I ported the Interniche TCP/IP stack to the Atmel AT91RM9200 using the GNU compiler. The Atmel chip, an ARM9-based SoC, was hosted on a Cogent OEM single-board computer: the CSB337, shown below on a CSB300 base board. Although it is still available through various outlets, mainly for evaluation purposes, the board itself should be considered obsolete. The GNU compiler was a commercial variant: the ARM-ELF targeted tool-chain from the multi-target MicroCross Visual X distribution.
Some pertinent documentation and links:
| The Atmel AT91RM9200 chip | |
| The Cogent CSB337 board | |
| Ed Sutter's MicroMonitor at MicroCross | |
| The Intel LXT971 PHY on the CSB337 | |
The port did not go as smoothly as it should have, due to the absence of a JTAG debugger. The CSB337 comes with the MicroMonitor bootloader, and while this is a very nice and nifty facility, including TFTP and a tiny file system, it was also the only way to actually get code onto the board and run it.
Of course I ran into several problems getting this up and running. Anyone interested can contact me for details on the port, but what I want to post here is the separate MAC/PHY driver, which may be of use to others. You'll find the download here.
Also included in this ZIP is a routine in the file acksum.s: an optimized GNU assembler routine that computes the checksum of IP frames, based on the assembler example in section 4.2 of RFC 1071. It allows data to be aligned on a 16-bit boundary without running the processor into an alignment exception. The code is shown below.
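Before the listing, it may help to pin down what the routine is expected to return. The following is a minimal C sketch, not part of the ZIP, of a straightforward 16-bit one's complement sum with the same prototype shape as acksum. The name ref_cksum is made up for illustration. Judging from the listing, acksum returns the folded sum without complementing it, and the sketch does the same; it reads the data in native byte order and makes no alignment assumptions, so it could serve as a host-side cross-check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reference routine (the name is an assumption, not taken from
 * the port). Sums `count` 16-bit words starting at `ptr` with end-around
 * carry and returns the folded, uncomplemented 16-bit result. */
unsigned short ref_cksum(const void *ptr, unsigned count)
{
    const unsigned char *p = ptr;
    uint64_t sum = 0;               /* wide accumulator: cannot overflow here */

    assert(ptr != NULL || count == 0);

    while (count--) {
        uint16_t w;
        memcpy(&w, p, sizeof w);    /* native-order load, safe at any alignment */
        sum += w;
        p += 2;
    }

    /* Fold the carries back in until only 16 bits remain. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (unsigned short)sum;
}
```

For the eight bytes 00 01 f2 03 f4 f5 f6 f7, the numerical example used in RFC 1071, the two bytes of the result in memory order are dd f2, whatever the byte order of the machine. The assembler listing itself follows.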
.section .text
.global acksum
.code 32
@ This algorithm performs the sum 32 bits at a time (the native word length of the
@ ARM core; 4 octets/bytes) and 'unrolls' the summing loop with 16 replications.
@ This way, the resulting loop handles chunks of up to 64 octets/bytes.
@ Additional logic is needed to handle the situation where the byte-count is not a
@ multiple of 4. Extra code is present to allow for 16bit data alignment.
@ C prototype is:
@ unsigned short acksum(void * ptr, unsigned count);
@ register usage:
@ r0 - data pointer (parm 1)
@ r1 - count of 16bit words (parm 2)
@ r3 - scratch pointer to data to sum
@ r4 - scratch word count
@ r8 - checksum accumulator
acksum:
stmfd sp!, {r3-r9} @ save local registers
@ Let r8 be the checksum accumulator. Clear it to start with
mov r8, #0 @ r8 is checksum accumulator, initialized to zero
@ If the count of 16bit words to sum is zero, then bail out with zero result
cmp r1, #0 @ is r1 zero?
beq done_sum @ we're done if so..
@ We first need to solve an alignment issue. When handling 32bit entities their
@ addresses MUST be 32bit aligned. This code intends to also handle data starting
@ on a 16bit alignment. If the data pointer IS divisible by 4, we can simply
@ proceed. If the 16bit word count is ODD in this case, we can sum the remaining
@ 16bit word AFTER completing the loop.
@ If the data pointer is NOT divisible by 4 and the 16bit word count is ODD, we
@ can solve both matters by doing the first 16 bit sum _before_ the loop.
@ If we have an EVEN 16 bit word count in this case, we need to do a 16bit word
@ sum both before and after the loop.
tst r0, #3 @ is r0 divisible by 4 (are bit0 and bit1 both clear)?
beq candivby4 @ jump to candivby4 if it is divisible by 4
tst r0, #1 @ r0 NOT divisible by 4, but is r0 divisible by 2?
beq candivby2 @ jump to candivby2 if it is divisible by 2
mov r0, #0 @ bail out of this routine with 0 if data is not even
b done_sum @ 16bit aligned, preferring this to simply crashing..
candivby2:
sub r0, r0, #2 @ Bump down r0 for 2 bytes
ldr r9, [r0] @ get 32bit word, with first 16bit data word in high part
mov r9, r9, lsr #16 @ get rid of the low 16bit word by double shifting
mov r8, r9, lsl #16 @ and place result in the accumulator register r8
add r0, r0, #4 @ Bump up r0 for 4 bytes; is now 32bit aligned
sub r1, r1, #1 @ Decrement word16 count in r1
candivby4: @ continue operation; end of alignment correction
@ The second parameter, arriving in r1, is intended to be a count of 16bit words.
@ We need to double this value to get a count of 8bit bytes (octets). We use r6 to
@ store the result. Note that this byte count will always be an even number.
add r6, r1, r1 @ r6 gets the 8bit byte count: r6=r1+r1 = 2xr1
@ We're going to try to handle data 32 bits (4 bytes) at a time with 16 summations
@ per loop. If our byte-count is less than 64, we need to jump somewhere into the middle
@ of the table of subsequent summations. If we have more than 64 bytes, we'll handle
@ all complete 64 byte chunks after first handling any partial chunk, again by jumping
@ somewhere in the middle of the table. For such a jump, we need to take into account
@ that the byte-count, though always even, does not need to be a multiple of 4.
@ To handle the latter case, a remaining 16bit word will be summed separately after
@ ending the loop.
@
@ The number of bytes that can be handled per loop is 0, 4, 8, ..., 60, 64.
@ Such a sequence follows from 4*(byte-count div 4)
@ The number of required loops is (byte-count div 64)
@ The partial chunk contains (byte-count mod 64) bytes
@
@ To get the number of bytes to handle in the first partial loop;
@ Let r7 get 4 * ((bytecount mod 64) div 4)
@ The mod64 can be obtained by ANDing with 63 while the 4x(.. div 4) boils
@ down to clearing the two least significant bits
mvn r9, #3 @ load NOT of 3 (which is -4) to get FFFFFFFC into r9
and r6, r6, r9 @ r6=r6 AND r9; this clears the 2 lsb bits of r6
and r7, r6, #0x03F @ r7 gets r6 AND 0x3F (63d);
@ 64 minus the value just calculated must be subtracted from the data pointer in r0
@ to obtain a shadow pointer - for which we will use r3 - on which the sum table
@ can be applied after jumping into it.
mov r5, #64 @ load 64 into r5
sub r7, r5, r7 @ r7 gets r5-r7 -> 64 - remainder
sub r3, r0, r7 @ r3 gets adjusted version of the data pointer from r0
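@ Clear bits 1-4 (0x1E) of the 16bit word count; the partial chunk takes care of
@ those words. Bit 0 (a trailing odd 16bit word) and the bits counting complete
@ 64 byte chunks (32 words each) are kept.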
mvn r9, #0x1E @ load NOT of 0x1E
and r1, r1, r9 @ mask out bits 0x1E
mov r4, r1 @ r4 contains the count of remaining 16bit words
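@ Figure out how far into the table to jump. Each summation of 4 data bytes takes
@ two instructions (8 bytes of code), so the offset to apply to the program counter
@ pc is twice the number of data bytes to skip (now in r7). From that we subtract 4:
@ in ARM state pc reads as the address of the current instruction plus 8, while the
@ table at next64 starts only 4 bytes after the add instruction.
adds r7, r7, r7 @ double the skip count; this also clears carry
sub r7, r7, #4 @ subtract 4 to compensate for the pc read-ahead
add pc, pc, r7 @ jump into accum. loop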
@ Table of additions to sum up to 64 bytes from [r3]
next64:
ldr r9, [r3] @ fetch next 4 bytes to r9
adds r8, r8, r9 @ first add (w/o carry) to sum in r8
ldr r9, [r3, #4] @ repeat for a total of 16 native 32bit words: 64 bytes
adcs r8, r8, r9
ldr r9, [r3, #8]
adcs r8, r8, r9
ldr r9, [r3, #12]
adcs r8, r8, r9
ldr r9, [r3, #16]
adcs r8, r8, r9
ldr r9, [r3, #20]
adcs r8, r8, r9
ldr r9, [r3, #24]
adcs r8, r8, r9
ldr r9, [r3, #28]
adcs r8, r8, r9
ldr r9, [r3, #32]
adcs r8, r8, r9
ldr r9, [r3, #36]
adcs r8, r8, r9
ldr r9, [r3, #40]
adcs r8, r8, r9
ldr r9, [r3, #44]
adcs r8, r8, r9
ldr r9, [r3, #48]
adcs r8, r8, r9
ldr r9, [r3, #52]
adcs r8, r8, r9
ldr r9, [r3, #56]
adcs r8, r8, r9
ldr r9, [r3, #60]
adcs r8, r8, r9
@ A number of 32bit words (0-16) are now added; there can be
@ a carry that must be wrapped in..
adc r8, r8, #0 @ add carry, if any
@ We can finish up if the count of 16bit words is now zero or one
cmp r4, #0 @ check for count == 0
beq fold_sumw @ if so, done looping; go fold result
cmp r4, #1 @ check for count == 1
beq sum_halfword @ if so, go sum the last 16 bits
@ Else loop back through another 64 byte block of sum data
add r3, r3, #64 @ bump pointer 64 bytes
sub r4, r4, #32 @ decrement count of 16bit words
b next64 @ handle next chunk
@ Sum remaining 16bit word, if such data remained after looping
sum_halfword:
mov r9, #0 @ clear high word for final data
ldrh r9, [r3, #64] @ get final 16 bits at [r3+64]
adds r8, r8, r9 @ add to sum
adc r8, r8, #0 @ add carry, if any
@ Fold upper 16 bits back into lower 16 bits;
@ while (sum>>16) sum=(sum and 0xffff)+(sum>>16);
fold_sumw:
mov r9, r8, lsr#16 @ copy high 16 bits to r9
mov r8, r8, lsl#16 @ clear upper 16 bits in r8 ...
mov r8, r8, lsr#16 @ ... by double shifting
add r8, r9, r8 @ add the two 16bit words
mov r7, r8, lsr#16 @ get new hi bits into r7 low
cmp r7, #0 @ was there a carry?
bne fold_sumw @ branch back to wrap in carry bit(s)
@ We're done..
done_sum:
mov r0, r8 @ copy the sum from r8 to r0, the return value register
ldmfd sp!, {r3-r9} @ restore regs
mov pc, lr @ return
.end
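The jump into the summation table is probably the least obvious part of the listing. As a rough aid, the C sketch below mirrors the arithmetic the routine performs on r6, r7 and r4 for a given count of 16bit words (the count as it stands after any alignment correction at candivby2). The struct and function names are hypothetical and exist only for this illustration; the comments point at the assembler instructions each line is meant to correspond to.

```c
#include <assert.h>

/* Hypothetical helper type, for illustration only. */
struct acksum_plan {
    unsigned partial_bytes;  /* bytes summed on the first pass through the table */
    unsigned full_chunks;    /* further complete 64-byte passes */
    unsigned tail_halfword;  /* 1 if a 16-bit word is left for sum_halfword */
    unsigned branch_offset;  /* value added to pc by "add pc, pc, r7" */
};

/* Mirrors the register arithmetic between candivby4 and next64. */
struct acksum_plan plan_acksum(unsigned words16)
{
    struct acksum_plan p;
    unsigned bytes = 2u * words16;           /* add r6, r1, r1 */
    unsigned whole = bytes & ~3u;            /* and r6, r6, r9   (r9 = ~3) */
    unsigned skip;

    p.partial_bytes = whole & 0x3Fu;         /* and r7, r6, #0x3F */
    skip = 64u - p.partial_bytes;            /* sub r7, r5, r7   (r5 = 64) */
    p.full_chunks   = words16 >> 5;          /* r4 = r1 & ~0x1E, minus 32 per pass */
    p.tail_halfword = words16 & 1u;          /* bit 0 of r4 */

    /* Two instructions (8 code bytes) per 4 data bytes, minus 4 because pc
     * reads as "current instruction + 8" and next64 sits 4 bytes ahead. */
    p.branch_offset = 2u * skip - 4u;        /* adds r7, r7, r7 ; sub r7, r7, #4 */

    /* Sanity check: the three parts together cover every byte exactly once. */
    assert(p.partial_bytes + 64u * p.full_chunks + 2u * p.tail_halfword == bytes);
    return p;
}
```

For a plain 20 byte IP header (10 words), this gives a partial chunk of 20 bytes, no full chunks, no trailing halfword and a branch offset of 84, which lands on the load at offset 44 in the table. With a count of zero the offset is 124, which skips the whole table and lands on the adc that wraps in the carry.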