DEPARTMENT OF COMPUTER SCIENCE, THE UNIVERSITY OF QUEENSLAND

The dcc Decompiler

Please note new URL: http://www.it.uq.edu.au/groups/csm/dcc.html


Index


dcc

The dcc decompiler decompiles .exe files from the (i386, DOS) platform to C programs. The final C program contains assembler code for any subroutines that are not possible to be decompiled at a higher level than assembler.

The analysis performed by dcc is based on traditional compiler optimization techniques and graph theory. The former is capable of eliminating registers and intermediate instructions to reconstruct high-level statements; the later is capable of determining the control structures in each subroutine.

The structure of a decompiler resembles that of a compiler: a front-, middle-, and back-end which perform separate tasks. The front-end is a machine-language dependent module that reads in machine code for a particular machine and transforms it into an intermediate, machine-independent representation of the program. The middle-end (aka the Universal Decompiling Machine or UDM) is a machine and language independent module that performs the core of the decompiling analysis: data flow and control flow analysis. Finally, the back-end is high-level language dependent and generates code for the program (C in the case of dcc).

In practice, several programs are used with the decompiler to create the high-level program. These programs aid in the detection of compiler and library signatures, hence augmenting the readability of programs and eliminating compiler start-up and library routines from the decompilation analysis.



Example of Decompilation

We illustrate the decompilation of a fibonacci program (see Figure 4). Figure 1 illustrates the relevant machine code of this binary. No library or compiler start up code is included. Figure 2 presents the disassembly of the binary program. All calls to library routines were detected by dccSign (the signature matcher), and thus not included in the analysis. Figure 3 is the final output from dcc. This C program can be compared with the original C program in Figure 4.


	 55 8B EC 83 EC 04 56 57 1E B8 94 00 50 9A 
   0E 00 3C 17 59 59 16 8D 46 FC 50 1E B8 B1 00 50 
   9A 07 00 F0 17 83 C4 08 BE 01 00 EB 3B 1E B8 B4
   00 50 9A 0E 00 3C 17 59 59 16 8D 46 FE 50 1E B8
   C3 00 50 9A 07 00 F0 17 83 C4 08 FF 76 FE 9A 7C
   00 3B 16 59 8B F8 57 FF 76 FE 1E B8 C6 00 50 9A
   0E 00 3C 17 83 C4 08 46 3B 76 FC 7E C0 33 C0 50
   9A 0A 00 49 16 59 5F 5E 8B E5 5D CB 55 8B EC 56
   8B 76 06 83 FE 02 7E 1E 8B C6 48 50 0E E8 EC FF
   59 50 8B C6 05 FE FF 50 0E E8 E0 FF 59 8B D0 58
   03 C2 EB 07 EB 05 B8 01 00 EB 00 5E 5D CB       

Figure 1 - Machine Code for Fibonacci.exe


		proc_1  PROC  FAR                        
000 00053C 55                  PUSH           bp         
001 00053D 8BEC                MOV            bp, sp      
002 00053F 56                  PUSH           si          
003 000540 8B7606              MOV            si, [bp+6]  
004 000543 83FE02              CMP            si, 2       
005 000546 7E1E                JLE            L1          
006 000548 8BC6                MOV            ax, si      
007 00054A 48                  DEC            ax          
008 00054B 50                  PUSH           ax          
009 00054C 0E                  PUSH           cs          
010 00054D E8ECFF              CALL  near ptr proc_1      
011 000550 59                  POP            cx          
012 000551 50                  PUSH           ax          
013 000552 8BC6                MOV            ax, si      
014 000554 05FEFF              ADD            ax, 0FFFEh  
015 000557 50                  PUSH           ax          
016 000558 0E                  PUSH           cs          
017 000559 E8E0FF              CALL  near ptr proc_1      
018 00055C 59                  POP            cx          
019 00055D 8BD0                MOV            dx, ax      
020 00055F 58                  POP            ax          
021 000560 03C2                ADD            ax, dx      
023 00056B 5E             L2:  POP            si          
024 00056C 5D                  POP            bp          
025 00056D CB                  RETF                       
026 000566 B80100         L1:  MOV            ax, 1       
027 000569 EB00                JMP            L2          
		proc_1  ENDP                              
							  
		main  PROC  FAR                           
000 0004C2 55                  PUSH           bp          
001 0004C3 8BEC                MOV            bp, sp      
002 0004C5 83EC04              SUB            sp, 4       
003 0004C8 56                  PUSH           si          
004 0004C9 57                  PUSH           di          
005 0004CA 1E                  PUSH           ds          
006 0004CB B89400              MOV            ax, 94h     
007 0004CE 50                  PUSH           ax          
008 0004CF 9A0E004D01          CALL   far ptr printf      
009 0004D4 59                  POP            cx          
010 0004D5 59                  POP            cx          
011 0004D6 16                  PUSH           ss          
012 0004D7 8D46FC              LEA            ax, [bp-4]  
013 0004DA 50                  PUSH           ax          
014 0004DB 1E                  PUSH           ds          
015 0004DC B8B100              MOV            ax, 0B1h    
016 0004DF 50                  PUSH           ax          
017 0004E0 9A07000102          CALL   far ptr scanf       
018 0004E5 83C408              ADD            sp, 8       
019 0004E8 BE0100              MOV            si, 1       
021 000528 3B76FC         L3:  CMP            si, [bp-4]  
022 00052B 7EC0                JLE            L4          
023 00052D 33C0                XOR            ax, ax      
024 00052F 50                  PUSH           ax          
025 000530 9A0A005A00          CALL   far ptr exit        
026 000535 59                  POP            cx          
027 000536 5F                  POP            di          
028 000537 5E                  POP            si          
029 000538 8BE5                MOV            sp, bp      
030 00053A 5D                  POP            bp          
031 00053B CB                  RETF                       
032 0004ED 1E             L4:  PUSH           ds          
033 0004EE B8B400              MOV            ax, 0B4h    
034 0004F1 50                  PUSH           ax          
035 0004F2 9A0E004D01          CALL   far ptr printf      
036 0004F7 59                  POP            cx          
037 0004F8 59                  POP            cx          
038 0004F9 16                  PUSH           ss          
039 0004FA 8D46FE              LEA            ax, [bp-2]  
040 0004FD 50                  PUSH           ax          
041 0004FE 1E                  PUSH           ds          
042 0004FF B8C300              MOV            ax, 0C3h    
043 000502 50                  PUSH           ax          
044 000503 9A07000102          CALL   far ptr scanf       
045 000508 83C408              ADD            sp, 8       
046 00050B FF76FE              PUSH  word ptr [bp-2]      
047 00050E 9A7C004C00          CALL   far ptr proc_1      
048 000513 59                  POP            cx          
049 000514 8BF8                MOV            di, ax      
050 000516 57                  PUSH           di          
051 000517 FF76FE              PUSH  word ptr [bp-2]      
052 00051A 1E                  PUSH           ds          
053 00051B B8C600              MOV            ax, 0C6h    
054 00051E 50                  PUSH           ax          
055 00051F 9A0E004D01          CALL   far ptr printf      
056 000524 83C408              ADD            sp, 8       
057 000527 46                  INC            si          
058                            JMP            L3         ;Synthetic inst 
		main  ENDP                                

Figure 2 - Code produced by the Disassembler


/*                                                            
 * Input file	: fibo.exe                                    
 * File type	: EXE                                         
 */                                                           
							      
int proc_1 (int arg0)                                         
/* Takes 2 bytes of parameters.                               
 * High-level language prologue code.                         
 * C calling convention.                                      
 */                                                           
{                                                             
int loc1;                                                     
int loc2; /* ax */                                            
							      
    loc1 = arg0;                                              
    if (loc1 > 2) {                                           
	loc2 = (proc_1 ((loc1 - 1)) + proc_1 ((loc1 + 0xFFFE)));  
    }                                                         
    else {                                                    
	loc2 = 1;                                             
    }                                                         
    return (loc2);                                            
}                                                             
							      
							      
void main ()                                                  
/* Takes no parameters.                                       
 * High-level language prologue code.                         
 */                                                           
{                                                             
int loc1;                                                     
int loc2;                                                    
int loc3;                                                     
int loc4;                                                     
							      
    printf ("Input number of iterations: ");                  
    scanf ("%d", &loc1);                                      
    loc3 = 1;                                                 
    while ((loc3 <= loc1)) {                                  
	printf ("Input number: ");                            
	scanf ("%d", &loc2);                                  
	loc4 = proc_1 (loc2);                                 
	printf ("fibonacci(%d) = %u\n", loc2, loc4);          
	loc3 = (loc3 + 1);                                    
    } /* end of while */                                      
    exit (0);                                                 
}                                                             

Figure 3 - Code produced by dcc in C


#include <stdio.h>                                            
							      
int main()                                                    
{ int i, numtimes, number;                                    
  unsigned value, fib();                                      
							      
   printf("Input number of iterations: ");                    
   scanf ("%d", &numtimes);                                   
   for (i = 1; i <= numtimes; i++)                            
   {                                                          
      printf ("Input number: ");                              
      scanf ("%d", &number);                                  
      value = fib(number);                                    
      printf("fibonacci(%d) = %u\n", number, value);          
   }                                                          
   exit(0);                                                   
}                                                             
							      
unsigned fib(x) 		/* compute fibonacci number recursively */
int x;                                                        
{                                                             
   if (x > 2)                                                 
      return (fib(x - 1) + fib(x - 2));                       
   else                                                       
      return (1);                                             
}                                                             

Figure 4 - Initial C Program



PhD Thesis

C Cifuentes. Reverse Compilation Techniques, Queensland University of Technology, PhD thesis. July 1994. (474 Kb compressed postscript file). Also available in compressed dvi format (365 Kb).

ABSTRACT

Techniques for writing reverse compilers or decompilers are presented in this thesis. These techniques are based on compiler and optimization theory, and are applied to decompilation in a unique way; these techniques have never before been published.

A decompiler is composed of several phases which are grouped into modules dependent on language or machine features. The front-end is a machine dependent module that parses the binary program, analyzes the semantics of the instructions in the program, and generates an intermediate low-level representation of the program, as well as a control flow graph of each subroutine. The universal decompiling machine is a language and machine independent module that analyzes the low-level intermediate code and transforms it into a high-level representation available in any high-level language, and analyzes the structure of the control flow graph(s) and transform them into graphs that make use of high-level control structures. Finally, the back-end is a target language dependent module that generates code for the target language.

Decompilation is a process that involves the use of tools to load the binary program into memory, parse or disassemble such a program, and decompile or analyze the program to generate a high-level language program. This process benefits from compiler and library signatures to recognize particular compilers and library subroutines. Whenever a compiler signature is recognized in the binary program, all compiler start-up and library subroutines are not decompiled; in the former case, the routines are eliminated from the final target program and the entry point to the main program is used for the decompiler analysis, in the latter case the subroutines are replaced by their library name.

The presented techniques were implemented in a prototype decompiler for the Intel i80286 architecture running under the DOS operating system, dcc, which produces target C programs for source .exe or .com files. Sample decompiled programs, comparisons against the initial high-level language program, and an analysis of results is presented in Chapter 9.

Chapter 1 gives an introduction to decompilation from a compiler point of view, Chapter 2 gives an overview of the history of decompilation since its appearance in the early 1960s, Chapter 3 presents the relations between the static binary code of the source binary program and the actions performed at run-time to implement the program, Chapter 4 describes the phases of the front-end module, Chapter 5 defines data optimization techniques to analyze the intermediate code and transform it into a higher-representation, Chapter 6 defines control structure transformation techniques to analyze the structure of the control flow graph and transform it into a graph of high-level control structures, Chapter 7 describes the back-end module, Chapter 8 presents the decompilation tool programs, Chapter 9 gives an overview of the implementation of dcc and the results obtained, and Chapter 10 gives the conclusions and future work of this research.

The techniques presented in this thesis expand on earlier work described in the literature. Previous work in decompilation did not document on the interprocedural register analysis required to determine register arguments and register return values, the analysis required to eliminate stack-related instructions (i.e. push and pop), or the structuring of a generic set of control structures. Innovative work done for this research is described in Chapters 5, 6, and 8. Chapter 5, Sections 5.2 and 5.4 illustrate and describe nine different types of optimizations that transform the low-level intermediate code into a high-level representation. These optimizations take into account condition codes, subroutine calls (i.e. interprocedural analysis) and register spilling, eliminating all low-level features of the intermediate instructions (such as condition codes and registers) and introducing the high-level concept of expressions into the intermediate representation. Chapter 6, Sections 6.2 and 6.6 illustrate and describe algorithms to structure different types of loops and conditional, including multi-way branch conditionals (e.g. case statements). Previous work in this area has concentrated in the structuring of loops, few papers attempt to structure 2-way conditional branches, no work on multi-way conditional branches is described in the literature. This thesis presents a complete method for structuring all types of structures based on a predetermined, generic set of high-level control structures. A criterion for determining the generic set of control structures is given in Chapter 6, Section 6.4. Chapter 8 describes all tools used to decompile programs, the most important tool is the signature generator (Section 8.2) which is used to determine compiler and library signatures in architectures that have an operating system that do not share libraries, such as the DOS operating system.



Related Publications

The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.



Future Work - A Retargetable Decompiler

Creation of a retargetable decompiler based on the dcc work and the New Jersey Machine-Code toolkit (which currently supports SPARC, MIPS, Intel and PowerPC binary code).

Testing of the generality of the data flow analysis stage of the decompiler in Intel, MIPS, SPARC and PowerPC architectures.

Improvement of the structuring algorithm for control flow graphs.

Incorporation of better data-type detection.

This project will be integrated with the retargetable binary translation work in due course. No current updates are being done on this project.



The dcc Decompiler

A complete distribution of dcc is available by anonymous ftp from ftp.it.uq.edu.au/pub/CSM/dcc. For Unix users, the gzip'd tar'd file dcc.tar.gz contains the whole distribution. For PC users, the individual .zip, .sig and .dat files are located in the abovementioned area. Read the readme file for a description of what is included in the distribution and installation instructions. If you do not have the tar and/or pkunzip programs, contact your system's administrator.

Restrictions Our ftp server is restricted to valid DNS entries only. Hence, if your computer does not have a valid DNS entry, you won't get ftp access. Contact your systems administrator if this is the case; we cannot help you in this case .

Support Please note that the authors are not currently working on this project and therefore cannot support any changes required on dcc. Source code is provided "as is". Read the documentation first.


Last updated: February 21 1997
cristina@it.uq.edu.au

This page: http://www.it.uq.edu.au/groups/csm/dcc.html