How does the processor differentiate between signed and unsigned numbers?


Data types are a high-level concept familiar to compilers and programmers. To the hardware, however, data is only a stream of bits. Although data types are too high-level for the hardware to understand, they play a crucial role in making programs behave as intended and in avoiding security hazards.

In a language like C, when we define (int x) and (char c), the former typically occupies 4 bytes of memory and is meant to store numerical data, while the latter is meant to store a single-byte character. Both types can be signed or unsigned. Computers represent the negative counterpart of a positive number by its two's complement. The interesting feature of this representation is that the processor needs no extra circuitry for negative numbers: the same adder circuit, for example, adds both positive and negative numbers. To the processor there is only a sequence of bits to operate on.
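
To make the "same circuit" point concrete, here is a minimal sketch in C. The helper names add_bytes and as_signed are made up for this illustration; the sketch assumes 8-bit bytes and models the adder as plain binary addition that keeps only the low 8 bits.

```c
#include <stdint.h>

/* Add two bytes the way an 8-bit adder would: plain binary addition,
   keeping only the low 8 bits of the result. The function knows
   nothing about signs. (Hypothetical helper for illustration.) */
uint8_t add_bytes(uint8_t x, uint8_t y)
{
    return (uint8_t)(x + y);
}

/* Read a byte's bit pattern as a two's-complement signed value.
   Patterns 0x80..0xFF stand for -128..-1. */
int as_signed(uint8_t bits)
{
    return bits < 128 ? (int)bits : (int)bits - 256;
}
```

For example, add_bytes(0xFF, 0x01) gives 0x00. Read as unsigned, that is 255 + 1 wrapping around to 0; read as signed, it is -1 + 1 = 0. The addition itself is identical in both readings; only the interpretation of the bits differs.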

Despite the name “character”, char variables also hold numerical values internally (which are mapped to characters only when they are displayed on your screen). For the sake of simplicity we use the char type here because it is only one byte long. The concept extends to the other integer types.
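
As a small sketch of the idea that a char is just a number, the two helpers below (char_code and next_char, names invented here) do ordinary integer arithmetic on a character. The values mentioned afterwards assume an ASCII-compatible character set.

```c
/* Return the numeric code stored in a char.
   (Hypothetical helper for illustration.) */
int char_code(char c)
{
    return (int)c;
}

/* Return the character whose code is one greater than c's,
   using ordinary integer arithmetic. */
char next_char(char c)
{
    return (char)(c + 1);
}
```

On an ASCII system char_code('A') is 65 and next_char('A') is 'B'.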

Now look at the code below:

#include <stdio.h>

char          a ;
unsigned char b ;
int         tmp ;
int main(){
    a   = 255;   /* does not fit in a signed char; stored as the byte 0xFF */
    b   = 255;
    tmp = -1 ;
    if (a==tmp)
        printf("a equals -1\n");

    if (b==tmp)
        printf("b equals -1\n");
    else
        printf("but b is %d\n",b);
}

After compiling and running the program we get this output:

a equals -1
but b is 255

The system treats “a” as equal to -1, while “b” is 255. At first glance this may seem strange, because we stored 255 in both of them. But as you will shortly see, this is exactly how it must be.

This is the x86-64 disassembly of main (Intel syntax):

<+0>:	push   rbp
<+1>:	mov    rbp,rsp
<+4>:	mov    BYTE PTR [rip+0x20048b],0xff            # 0x6009dc a
<+11>:	mov    BYTE PTR [rip+0x20047c],0xff            # 0x6009d4 b
<+18>:	mov    DWORD PTR [rip+0x200476],0xffffffff     # 0x6009d8 tmp
<+28>:	movzx  eax,BYTE PTR [rip+0x200473]             # 0x6009dc a
<+35>:	movsx  edx,al
<+38>:	mov    eax,DWORD PTR [rip+0x200466]            # 0x6009d8 tmp
<+44>:	cmp    edx,eax
<+46>:	jne    0x400580
<+48>:	mov    edi,0x400644
<+53>:	call   0x400410
<+58>:	movzx  eax,BYTE PTR [rip+0x20044d]             # 0x6009d4 b
<+65>:	movzx  edx,al
<+68>:	mov    eax,DWORD PTR [rip+0x200448]            # 0x6009d8 tmp
<+74>:	cmp    edx,eax
<+76>:	jne    0x4005a0
<+78>:	mov    edi,0x400650
<+83>:	call   0x400410
<+88>:	jmp    0x4005bb
<+90>:	movzx  eax,BYTE PTR [rip+0x20042d]             # 0x6009d4 b
<+97>:	movzx  eax,al
<+100>:	mov    esi,eax
<+102>:	mov    edi,0x40065c
<+107>:	mov    eax,0x0
<+112>:	call   0x400420
<+117>:	pop    rbp
<+118>:	ret

Both “char” and “unsigned char” are 1 byte long. At main+4 and main+11 (the two mov BYTE PTR instructions) you can see that the same value, 0xFF, is stored for both a and b. The comparison is done through the registers EAX and EDX: EAX holds the value of “tmp”, which is minus one, and EDX holds the value of “a”. We compare “a” and “b” with the variable “tmp” rather than with the constant -1 because, given a constant, the compiler can evaluate the comparison at compile time and, in pursuit of efficiency, remove the code related to it.
C promotes a char operand to int before comparing it with an int, so the one-byte value of “a” has to be widened to the 4 bytes of EDX. First “a” is loaded into AL, the lowest byte of EAX. Now here is the interesting part: at main+35 the compiler inserted a movsx instruction, which copies AL into EDX with sign extension. AL ends up in the lowest byte of EDX and the remaining 24 bits are filled with copies of AL's sign bit. So at the time of comparison EDX equals 0xFFFFFFFF and is compared with “tmp”, which also has the value 0xFFFFFFFF. This means comparing -1 with -1, so the condition is true.
For “b” everything is done similarly, except for the step that copies AL into EDX. This time movzx edx,al is used (at main+65) in place of movsx. It avoids producing a negative number by zeroing all the remaining bits of EDX, so the same byte 0xFF is now represented differently from when it was treated as a negative number. This time 0x000000FF (255 in decimal) is compared with 0xFFFFFFFF and the condition is false. The processor simply compares the two registers bit by bit; as you saw, it is the compiler that bears the tricky part, representing the two numbers in two different ways with nothing to work with but the bits in memory.
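
The two widening steps can be sketched in C at the bit level. The functions zero_extend8 and sign_extend8 are hypothetical names chosen for this illustration; they model what movzx and movsx do to a single byte and are not how the compiler itself implements the conversion.

```c
#include <stdint.h>

/* Zero extension, as movzx does: the byte lands in the low 8 bits
   and the upper 24 bits become 0. (Hypothetical helper.) */
uint32_t zero_extend8(uint8_t byte)
{
    return (uint32_t)byte;
}

/* Sign extension, as movsx does: the byte lands in the low 8 bits
   and the upper 24 bits are filled with copies of bit 7. */
uint32_t sign_extend8(uint8_t byte)
{
    uint32_t wide = byte;
    if (byte & 0x80u)          /* sign bit set: fill upper bits with 1s */
        wide |= 0xFFFFFF00u;
    return wide;
}
```

With the byte 0xFF, zero_extend8 returns 0x000000FF and sign_extend8 returns 0xFFFFFFFF, which are exactly the two EDX values compared against tmp in the program. For a byte with a clear sign bit, such as 0x7F, both functions return the same result.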
You can see how important it is to use the correct types when writing programs. Also be careful about overflows. A single signed byte can store values from -128 to +127. This is why we wrote 255 into the variable “a” but the program treated it as -1 (plain char is signed with this compiler; the C standard leaves that choice to the implementation). Variable “b”, on the other hand, got its maximum possible value, 255, and the system treated it as exactly what we wanted.
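
One way to guard against this kind of overflow is to check a value against the limits from <limits.h> before storing it. The sketch below uses made-up helper names (plain_char_is_signed, fits_signed_char, fits_unsigned_char) and assumes the usual 8-bit char, where SCHAR_MIN is -128, SCHAR_MAX is 127 and UCHAR_MAX is 255.

```c
#include <limits.h>

/* Return 1 if plain char is a signed type with this compiler,
   0 if it is unsigned. (Hypothetical helper for illustration.) */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Return 1 if value can be stored in a signed char unchanged. */
int fits_signed_char(int value)
{
    return value >= SCHAR_MIN && value <= SCHAR_MAX;
}

/* Return 1 if value can be stored in an unsigned char unchanged. */
int fits_unsigned_char(int value)
{
    return value >= 0 && value <= UCHAR_MAX;
}
```

Under that assumption fits_signed_char(255) is 0 while fits_unsigned_char(255) is 1, which matches what happened to “a” and “b” in the program.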

Iran University of Science and Technology
Department of Computer Engineering
