This article explains all the steps needed to write a C++ program which dynamically generates encryption algorithms in x86 assembly code.
Off-the-shelf encryption algorithms
There is a wide variety of encryption algorithms available. The advantage of these algorithms is that they are heavily studied and their strengths and weaknesses are known. The first type are block cipher algorithms (data is encrypted in blocks of fixed size), including:
There are also stream cipher algorithms (which encrypt data one byte at a time), such as the popular RC4 algorithm. Most encryption algorithms are symmetric, which means that the same key is used for both encrypting and decrypting data. However, there are a number of asymmetric encryption functions, based on public key infrastructure. This group of algorithms includes:
The encryption component of these algorithms uses a public key, while the decryption component requires a private key. As the name implies, public keys can be safely published on the Internet (like the keys in the PGP encryption system). Unlike symmetric encryption algorithms, where a leaked key means anyone can read the message, asymmetric encryption algorithms ensure that only the holder of the private key can decrypt the message. Since the public and private keys are mathematically related, it is theoretically possible to “crack” a public key and obtain the private key, however there are no known practical ways of doing this. In the case of RSA, cracking the public key involves splitting it into its two prime factors (RSA public keys are the product of two very large prime numbers). This factoring process currently takes an enormous time. The RSA company even organized a public challenge to break RSA keys, where the largest prize was $US200,000 for cracking a 2048-bit key.
All the above algorithms are extensively documented, and implementations can easily be found for many programming languages.
Polymorphic engines
Polymorphism has several meanings in the software world. In this case, we are referring to unique, dynamically-generated code (i.e. computer instructions). This technique has its roots in computer viruses. Already in the days of MS-DOS, some computer viruses were encrypted by their creators in order to evade detection by antivirus software. Of course, the viruses needed to contain decryption routines, and these were quickly added to virus signature databases. Soon, virus creators developed decryption algorithms whose code was uniquely generated every time, which allowed viruses to be created which could not be detected using static signatures. The first recorded polymorphic engines date back to the year 1990.
In time, polymorphic algorithms evolved and became ever more sophisticated, in order to make it as difficult as possible for antivirus programs to analyze viruses and run them in an emulated environment. Some well-known algorithms included:
- KME – Kewl Mutation Engine (by z0mbie and Vecna) – an engine which requires no input data and writes the decrypted code to the stack. Rather than running a decryption algorithm on some encrypted data, the decryption routine is specially constructed to generate the original data every time. This means there is no buffer of encrypted data to be found.
- MMXE – MultiMedia eXtensions Engine (by Billy Belcebu) – a polymorphic engine which generates MMX code
- TUAREG – Tameless Unpredictable Anarchic Relentless Encryption Generator (by Mental Driller) – introduces nonlinear data decryption and many innovative methods into the polymorphic engine
- PPE-II – Prizzy Polymorphic Engine (by Prizzy) – a polymorphic engine which exploits MMX and FPU code, as well as intentional brute-force calculations in its code, so as to slow down operation of emulation strategies used by antivirus software
In most cases, this type of encryption was used in executable file infectors. Due to the level of complexity of polymorphic engines, and the necessity of an in-depth understanding of assembly language for their creation, these days polymorphic engines are rarely used (one exception to this is Virut which, I'd like to point out, is strongly suspected of being Polish in origin!)
However, polymorphic engines have found another use: software protection systems, or exe-protectors, such as PELock, ASProtect, EnigmaProtector, Themida, and Obsidium. They are used to make it more difficult for an application's code to be analyzed by crackers (i.e. people who try to break software protection), and to make it harder for automated unpackers to be created (i.e. tools which remove software protection). If you possess any shareware program which is protected by an exe-protector, there is a 99% chance that it uses a polymorphic encryption engine.
Our own polymorphic engine
In this article, we will cover all the steps necessary to create a simple polymorphic engine, which will serve to encrypt any supplied data, and generate a unique decryption procedure, with the encrypted data embedded within it.
To create this engine, we'll use the C++ language and the AsmJit library, which allows dynamic generation of assembly code. AsmJit makes it possible to create assembly code at the C++ level, as if you were writing it by hand. It is possible to create 32- or 64-bit instructions, and even pseudoinstructions. Thanks to this, the library can be used to target code for 32- and 64-bit environments. In our case we'll stick with 32-bit code.
DWORD DecryptionProc(PDWORD lpdwOutput)
{
// randomly generated encryption key
DWORD dwDecryptionKey = 0xA39383D;
// pointer to the encrypted data, which will be located
// at the end of the decryption function
PDWORD lpdwInput = reinterpret_cast<PDWORD>(&cEncryptedData);
// main decryption loop, composed of
// randomly selected encryption instructions
for (DWORD i = 0; i < NUM_BLOCKS; i++)
{
DWORD dwInputBlock = lpdwInput[i];
dwInputBlock ^= 0x453BC;
dwInputBlock += dwDecryptionKey;
...
lpdwOutput[i] = dwInputBlock;
}
// return the size of the encrypted block
// (in bytes)
return DECRYPTED_DATA_SIZE;
// the encrypted data, stored at the end
// of the function
BYTE cEncryptedData[] = { 0xAB, 0xBA... };
}
Our decryption function will take just one parameter: a pointer to the output buffer where the decrypted data will go. The function will return the size of the decrypted data. Sound simple? Then let's get to work!
Random register selection
Our decryption function will use the standard 32-bit processor registers, and will obtain its one parameter from the stack using the stack frame pointer in the EBP
register. The registers used for decryption, for holding a pointer to the encrypted data, and for holding a pointer to the output buffer, will be selected randomly each time the decryption routine is generated.
///////////////////////////////////////////////////////////
//
// random register selection
//
///////////////////////////////////////////////////////////
void CMutagenSPE::RandomizeRegisters()
{
// set random registers
AsmJit::GPReg cRegsGeneral[] = { eax, ecx, ebx, edx, esi, edi };
// shuffle the order of registers in the array
mixup_array(cRegsGeneral, _countof(cRegsGeneral));
// the register which will contain
// a pointer to the encrypted data
regSrc = cRegsGeneral[0];
// the register which will contain
// a pointer to the output buffer
// (supplied as a function parameter)
regDst = cRegsGeneral[1];
// the register which will contain
// the size of the encrypted data buffer
regSize = cRegsGeneral[2];
// the register which will contain
// the decryption key
regKey = cRegsGeneral[3];
// the register which will contain
// the current data value and on which
// the decryption instructions will operate
regData = cRegsGeneral[4];
// set the register whose values will be
// preserved across function invocations
AsmJit::GPReg cRegsSafe[] = { esi, edi, ebx };
// shuffle the order of the registers in the array
mixup_array(cRegsSafe, _countof(cRegsSafe));
regSafe1 = cRegsSafe[0];
regSafe2 = cRegsSafe[1];
regSafe3 = cRegsSafe[2];
}
This random choice of registers will contribute to the generated code being different each time.
Prologue of the decryption function
The prologue of the decryption function is simply the first section of the function, which contains some standard elements like setting up the stack frame and preserving registers. The stack frame is a special region reserved on the stack for holding local variables (if the function needs any) as well as holding the parameters supplied to the function. In our case the one parameter will be a pointer to the buffer where we will store the decrypted data.
///////////////////////////////////////////////////////////
//
// generate the prologue of the decryption function
//
///////////////////////////////////////////////////////////
void CMutagenSPE::GeneratePrologue()
{
// function prologue
// first the original value of EBP is saved
// so we can use EBP to refer to the stack frame
if (rnd_bin() == 0)
{
a.push(ebp);
a.mov(ebp,esp);
}
else
{
// equivalent to the instructions
// push ebp
// mov ebp,esp
a.enter(imm(0), imm(0));
}
// if our function is called using the stdcall
// convention, and modifies ESI, EDI, or EBX,
// they must be saved at the beginning of
// the function and restored at the end
a.push(regSafe1);
a.push(regSafe2);
a.push(regSafe3);
// load the pointer to the output buffer
// into our randomly-selected register regDst
// (this is the only parameter to the function,
// passed on the stack)
a.mov(regDst, dword_ptr(ebp, 0x08 + (4 * 0)));
}
Address of the encrypted data
Our dynamically generated code can be located anywhere in memory and launched from there (assuming the memory region has the appropriate executable flags set). In such cases we cannot refer to parts of the function using absolute memory addresses, because we simply don't know where the function will reside. The encrypted data is located just after the end of the decryption routine, and we don't know its address in advance either. We will need to determine its address at runtime, through the use of relative addressing.
We will employ the delta offset technique for this purpose. It depends on the use of the call
assembly instruction, normally used to call a function. We take advantage of the fact that this instruction stores a return address on the stack before jumping to a chosen memory address.
Decryptor proc near
; preserve the original value of EBP
push ebp
; this puts the return address on the stack
call delta_offset
delta_offset:
; pop the return address into EBP
; it will be the address of the label 'delta_offset'
pop ebp
; load the true address of var_1 by using the
; relative address with reference to delta_offset
lea eax,[ebp + (var_1 - delta_offset)]
; verify that our delta offset address is
; correct; return TRUE / FALSE
cmp dword ptr[eax],0DEADC0DEh
sete al
movzx eax,al
; restore the value of EBP
pop ebp
ret
; this data value is placed directly after
; the function
var_1 dd 0DEADC0DEh
Decryptor endp
The return address generated by call delta_offset
will refer to the next instruction, in this case pop ebp
. Once we know the memory address of this instruction, we can work out the address of anything else in our function. In our case, we work out the location of our "encrypted" data by adding the number of bytes between the delta_offset:
label and the end of the function.
An interesting alternative for calculating a delta offset address involves the use of the FPU instruction fnstenv
, which stores the environment state of the FPU. Among other things, this contains the memory location of the most recently executed FPU instruction. Like call
, this allows us to obtain the location of an instruction in our function and we can use a similar delta offset calculation to refer to data stored after the function.
DeltaOffsetFPUTest proc uses esi
; local buffer for FPU environment
local fpEnvironment[32]:byte
; initialize FPU
finit
; execute any FPU instruction (e.g. fld1
; fldz, fldln2, fldlg2 etc.)
delta_offset:
fldpi
; save the FPU environment state
lea eax,fpEnvironment
fnstenv byte ptr[eax]
; read the address of the last executed
; FPU instruction – it should point to the
; address of the fldpi instruction, that is,
; the address of the delta_offset label
mov esi,dword ptr[eax+12]
; remove 1 register from the FPU stack
; (after fldpi)
fistp dword ptr[esp-4]
; get the address of var_1 by using its
; address relative to delta_offset
lea eax,[esi + (var_1 - delta_offset)]
; verify that our delta offset address is
; correct; return TRUE / FALSE
cmp dword ptr[eax],0ABBAh
sete al
movzx eax,al
ret
; value written just after the function body
var_1 dd 0ABBAh
DeltaOffsetFPUTest endp
This approach to the delta offset calculation is sometimes used in exploits, e.g. in the Metasploit framework. A similar technique is also used in debugger detection, where the fnstenv
instruction will reveal that the most recently executed FPU instruction is not part of our program, but refers to some other FPU instruction, which is directly or indirectly executed by the debugger. In this way, it is possible to discover tracing of our code by debuggers including OllyDbg.
Delta offset calculations can rouse the suspicions of antivirus programs, since normal applications do not have a need for this kind of thing. The combination of call
and pop r32
can cause the application to be suspected of containing malware. In order to prevent this, we must generate extra code between these two instructions, and utilize a different means of getting values from the stack.
///////////////////////////////////////////////////////////
//
// generate delta offset
//
///////////////////////////////////////////////////////////
void CMutagenSPE::GenerateDeltaOffset()
{
// generate code which will allow us to
// obtain a pointer to the encrypted data
// at the end of the decryption function
// decryption_function:
// ...
// call delta_offset
// mov eax,1 | xor eax,eax ; \
// leave ; > unused instructions
// ret 4 ; /
// delta_offset:
// pop regSrc
// add regSrc, (encrypted_data-delta_offset +
// ... + size of the unused instructions)
// ret 4
// db 0CCh, 0CCh...
// encrypted_data:
// db 0ABh, 0BBh, 083h...
// create the delta_offset label
lblDeltaOffset = a.newLabel();
// generate 'call delta_offset'
a.call(lblDeltaOffset);
sysint_t posUnusedCodeStart = a.getOffset();
// in order to avoid getting flagged by
// antivirus software, we avoid the typical
// delta offset construction, i.e. call + pop,
// by inserting some unused instructions in
// between, in our case a sequence that looks
// like the normal code which returns from
// a function
if (rnd_bin() == 0)
{
a.mov(eax, imm(1));
}
else
{
a.xor_(eax,eax);
}
a.leave();
a.ret(1 * sizeof(DWORD));
// calculate the size of the unused code,
// i.e. the difference between the current
// position and the beginning of the
// unused code
dwUnusedCodeSize = static_cast<DWORD>(a.getOffset() - posUnusedCodeStart);
// put the label "delta_offset:" here
a.bind(lblDeltaOffset);
posDeltaOffset = a.getOffset();
// instead of the pop instruction, we will
// use a different method of reading the
// stack, to avoid rousing the suspicions of
// antivirus programs
//a.pop(regSrc);
a.mov(regSrc, dword_ptr(esp));
a.add(esp, imm(sizeof(DWORD)));
// the address of the label "delta_offset:"
// will now be in the regSrc register;
// we need to adjust this by the size of
// the remainder of the function (which we
// don't know and will have to update later)
// for now we use the value 987654321 to
// ensure AsmJit generates the long form of
// the "add" instruction
a.add(regSrc, imm(987654321));
// save the position of the previous DWORD
// so that we can later update it to contain
// the length of the remainder of the function
posSrcPtr = a.getOffset() - sizeof(DWORD);
}
At the moment, we don't know the size of the remainder of the decryption function, which is why we will need to update this value later, once we have generated all the other instructions of the decryption function.
Encryption
The encryption of the data will occur in 4-byte blocks. The data can be smaller than this (it will be rounded up to a multiple of 4). Pseudoinstructions will be randomly generated for the encryption process, and they will later be used to generate the decryption code.
///////////////////////////////////////////////////////////
//
// generate the encryption keys, encryption instructions,
// and finally encrypt the input data
//
///////////////////////////////////////////////////////////
void CMutagenSPE::EncryptInputBuffer(PBYTE lpInputBuffer, \
DWORD dwInputBuffer, \
DWORD dwMinInstr, \
DWORD dwMaxInstr)
{
// generate an encryption key
dwEncryptionKey = rnd_dword();
// round up the size of the input buffer
DWORD dwAlignedSize = align_dword(dwInputBuffer);
// number of blocks to encrypt
// divide the size of the input data
// into blocks of 4 bytes (DWORDs)
dwEncryptedBlocks = dwAlignedSize / sizeof(DWORD);
PDWORD lpdwInputBuffer = reinterpret_cast<PDWORD>(lpInputBuffer);
// allocate memory for the output data
// (the size will be rounded to the
// block size)
di_valloc(&diEncryptedData, dwAlignedSize);
PDWORD lpdwOutputBuffer = reinterpret_cast<PDWORD>(diEncryptedData.lpPtr);
// randomly select the number of encryption instructions
dwCryptOpsCount = rnd_range(dwMinInstr, dwMaxInstr);
// allocate memory for an array which will
// record information about the sequence of
// encryption instructions
di_valloc(&diCryptOps, dwCryptOpsCount * sizeof(SPE_CRYPT_OP));
// set up a direct pointer to this table
// in a helper variable
lpcoCryptOps = reinterpret_cast<P_SPE_CRYPT_OP>(diCryptOps.lpPtr);
// generate encryption instructions and their type
for (DWORD i = 0; i < dwCryptOpsCount; i++)
{
// will the instruction perform an operation
// combining regData and regKey?
lpcoCryptOps[i].bCryptWithReg = rnd_bool();
// the register we are operating on
lpcoCryptOps[i].regDst = regData;
// if the instruction doesn't use the regKey
// register, generate a random key which
// will be used in the operation
if (lpcoCryptOps[i].bCryptWithReg == FALSE)
{
lpcoCryptOps[i].dwCryptValue = rnd_dword();
}
else
{
lpcoCryptOps[i].regSrc = regKey;
}
// randomly choose the type of encryption instruction
lpcoCryptOps[i].cCryptOp = static_cast<BYTE>(rnd_range(SPE_CRYPT_OP_ADD, SPE_CRYPT_OP_NEG));
}
// encrypt the input data according to the
// instructions we have just generated
for (DWORD i = 0, dwInitialEncryptionKey = dwEncryptionKey; \
i < dwEncryptedBlocks; i++)
{
// take the next block for encryption
DWORD dwInputBlock = lpdwInputBuffer[i];
// encryption loop: executes the sequence of
// encryption instructions on the data block
for (DWORD j = 0, dwCurrentEncryptionKey; j < dwCryptOpsCount; j++)
{
if (lpcoCryptOps[j].bCryptWithReg == FALSE)
{
dwCurrentEncryptionKey = lpcoCryptOps[j].dwCryptValue;
}
else
{
dwCurrentEncryptionKey = dwInitialEncryptionKey;
}
// depending on the encryption operation,
// perform the appropriate modification
// of the data block
switch(lpcoCryptOps[j].cCryptOp)
{
case SPE_CRYPT_OP_ADD:
dwInputBlock += dwCurrentEncryptionKey;
break;
case SPE_CRYPT_OP_SUB:
dwInputBlock -= dwCurrentEncryptionKey;
break;
case SPE_CRYPT_OP_XOR:
dwInputBlock ^= dwCurrentEncryptionKey;
break;
case SPE_CRYPT_OP_NOT:
dwInputBlock = ~dwInputBlock;
break;
case SPE_CRYPT_OP_NEG:
dwInputBlock = 0L - dwInputBlock;
break;
}
}
// store the encrypted block in the buffer
lpdwOutputBuffer[i] = dwInputBlock;
}
}
Setting up the decryption keys
Our algorithm will use a randomly generated encryption key, but to prevent the key appearing directly in the code, we will encrypt it with a second, random key. At the start of the decryption function we will load the encrypted key into regKey
, and then decrypt it.
///////////////////////////////////////////////////////////
//
// set up the keys which will be used to decrypt the data
//
///////////////////////////////////////////////////////////
void CMutagenSPE::SetupDecryptionKeys()
{
// set up a decryption key in the regKey
// register, which will itself be encrypted
DWORD dwKeyModifier = rnd_dword();
// randomly generate instructions to set up
// the decryption key
switch(rnd_max(2))
{
// mov regKey,dwKey - dwMod
// add regKey,dwMod
case 0:
a.mov(regKey, imm(dwEncryptionKey - dwKeyModifier));
a.add(regKey, imm(dwKeyModifier));
break;
// mov regKey,dwKey + dwMod
// sub regKey,dwMod
case 1:
a.mov(regKey, imm(dwEncryptionKey + dwKeyModifier));
a.sub(regKey, imm(dwKeyModifier));
break;
// mov regKey,dwKey ^ dwMod
// xor regKey,dwMod
case 2:
a.mov(regKey, imm(dwEncryptionKey ^ dwKeyModifier));
a.xor_(regKey, imm(dwKeyModifier));
break;
}
}
Decryption
Using the pseudoinstructions we generated earlier and which we used to encrypt the data, we will now generate instructions which perform the encryption process in reverse, and place them in a loop that repeats this process on each input block. Before the loop we will initialize the register we have called regSize
with the number of blocks which need to be decrypted.
///////////////////////////////////////////////////////////
//
// generate the decryption code (for the main decryption loop)
//
///////////////////////////////////////////////////////////
void CMutagenSPE::GenerateDecryption()
{
// set up the size of the encrypted data
// (in blocks)
a.mov(regSize, imm(dwEncryptedBlocks));
// create a label for the start of the
// decryption loop
Label lblDecryptionLoop = a.newLabel();
a.bind(lblDecryptionLoop);
// read the data referred to by the
// regSrc register
a.mov(regData, dword_ptr(regSrc));
// build the decryption code by generating each
// decryption instruction in turn (reversing the
// order and the operations that were used for
// encryption!)
for (DWORD i = dwCryptOpsCount - 1; i != -1L; i--)
{
// encryption was done either with the key
// in register regKey, or a constant value,
// so depending on this we need to generate
// the appropriate decryption instructions
if (lpcoCryptOps[i].bCryptWithReg == FALSE)
{
DWORD dwDecryptionKey = lpcoCryptOps[i].dwCryptValue;
switch(lpcoCryptOps[i].cCryptOp)
{
case SPE_CRYPT_OP_ADD:
a.sub(lpcoCryptOps[i].regDst, imm(dwDecryptionKey));
break;
case SPE_CRYPT_OP_SUB:
a.add(lpcoCryptOps[i].regDst, imm(dwDecryptionKey));
break;
case SPE_CRYPT_OP_XOR:
a.xor_(lpcoCryptOps[i].regDst, imm(dwDecryptionKey));
break;
case SPE_CRYPT_OP_NOT:
a.not_(lpcoCryptOps[i].regDst);
break;
case SPE_CRYPT_OP_NEG:
a.neg(lpcoCryptOps[i].regDst);
break;
}
}
else
{
switch(lpcoCryptOps[i].cCryptOp)
{
case SPE_CRYPT_OP_ADD:
a.sub(lpcoCryptOps[i].regDst, lpcoCryptOps[i].regSrc);
break;
case SPE_CRYPT_OP_SUB:
a.add(lpcoCryptOps[i].regDst, lpcoCryptOps[i].regSrc);
break;
case SPE_CRYPT_OP_XOR:
a.xor_(lpcoCryptOps[i].regDst, lpcoCryptOps[i].regSrc);
break;
case SPE_CRYPT_OP_NOT:
a.not_(lpcoCryptOps[i].regDst);
break;
case SPE_CRYPT_OP_NEG:
a.neg(lpcoCryptOps[i].regDst);
break;
}
}
}
// write the decrypted block to the output
// buffer
a.mov(dword_ptr(regDst), regData);
// update the pointers to the input and ouput
// buffers to point to the next block
a.add(regSrc, imm(sizeof(DWORD)));
a.add(regDst, imm(sizeof(DWORD)));
// decrement the loop counter (the number of
// blocks remaining to decrypt)
a.dec(regSize);
// check if the loop is finished
// if not, jump to the start
a.jne(lblDecryptionLoop);
}
The loop takes successive encrypted blocks of data from the buffer at the end of the function, then carries out all the decryption instructions on each block, and writes the result to the output buffer. After decrypting a data block, the pointers to the encrypted data and the decrypted data buffer are updated, and the loop counter which indicates the number of blocks remaining is decremented. If it has not reached 0, the loop is repeated.
Setting final register values
Our decryption function will return a value of type DWORD
(a 32-bit value corresponding to unsigned int
), which indicates the size of the decrypted data. This value will be returned in the EAX
register.
///////////////////////////////////////////////////////////
//
// set up output registers, including the function return value
//
///////////////////////////////////////////////////////////
void CMutagenSPE::SetupOutputRegisters(SPE_OUTPUT_REGS *regOutput, DWORD dwCount)
{
// if there are no output registers to
// set up, return
if ((regOutput == NULL) || (dwCount == 0))
{
return;
}
// shuffle the order in which the registers
// will be set up
mixup_array(regOutput, dwCount);
// generate instructions to set up the
// output registers
// mov r32, imm32
for (DWORD i = 0; i < dwCount; i++)
{
a.mov(regOutput[i].regDst, imm(regOutput[i].dwValue));
}
}
Our polymorphic engine allows any set of output registers to be defined (that is, we are not limited to setting the value returned in EAX
). This makes it possible for the function to output extra values.
Function epilogue
The epilogue, in other words, the last portion of the function, is where the original value of the EBP
register is restored, and if necessary, the sensitive registers ESI EDI EBX
, whose state must be preserved across function calls according to the stdcall convention.
///////////////////////////////////////////////////////////
//
// generate epilogue of the decryption function
//
///////////////////////////////////////////////////////////
void CMutagenSPE::GenerateEpilogue(DWORD dwParamCount)
{
// restore the original values of
// registers ESI EDI EBX
a.pop(regSafe3);
a.pop(regSafe2);
a.pop(regSafe1);
// restore the value of EBP
if (rnd_bin() == 0)
{
a.leave();
}
else
{
// equivalent to "leave"
a.mov(esp,ebp);
a.pop(ebp);
}
// return to the code which called
// our function; additionally adjust
// the stack by the size of the passed
// parameters (by stdcall convention)
a.ret(imm(dwParamCount * sizeof(DWORD)));
}
Alignment
Instructions which access memory are faster if the data they read or write is aligned, i.e. divisible by the size of the block of memory being accessed. For instance, our encrypted data is accessed in 32-bit blocks (since we are using 32-bit registers), which means memory addresses should be divisible by 4. This is due to processor cache effects. If memory addresses are correctly aligned, the data can be stored in the fast memory attached to the processor known as L1 cache. Non-aligned addresses will be read from the slower L2 cache or directly from the computer's RAM. For small blocks of data, this does not create a noticeable difference, but for large buffers the performance reduction can be significant and should be taken into account.
///////////////////////////////////////////////////////////
//
// align the size of the decryption function
// to the specified granularity
//
///////////////////////////////////////////////////////////
void CMutagenSPE::AlignDecryptorBody(DWORD dwAlignment)
{
// take the current size of the code
DWORD dwCurrentSize = a.getCodeSize();
// find the number of bytes that would
// align the size to a multiple of the
// supplied size (e.g. 4)
DWORD dwAlignmentSize = align_bytes(dwCurrentSize, dwAlignment) - dwCurrentSize;
// check if any alignment is required
if (dwAlignmentSize == 0)
{
return;
}
// add padding instructions (int3 or nop)
if (rnd_bin() == 0)
{
while (dwAlignmentSize--) a.int3();
}
else
{
while (dwAlignmentSize--) a.nop();
}
}
Alignment between functions is normally achieved with the 1-byte instruction nop
(no-operation) or int3
(an interrupt for the purpose of trapping into a debugger). Alignment is also performed for the beginning of loops, using instructions which have the same effect as nop
but have a longer binary encoding, such as lea eax,[eax*8+eax+00000000]
. In our case we won't bother with this level of optimization.
Fixing up delta offset addresses
Having generated the entire code of the decryption function, we can now correct any relative addresses used earlier, which in our case is just the address of the encrypted data.
///////////////////////////////////////////////////////////
//
// correct all instructions making use of
// addressing relative to the delta offset
//
///////////////////////////////////////////////////////////
void CMutagenSPE::UpdateDeltaOffsetAddressing()
{
DWORD dwAdjustSize = static_cast<DWORD>(a.getOffset() - posDeltaOffset);
// correct the instruction which sets up
// a pointer to the encrypted data block
// at the end of the decryption function
//
// this pointer is loaded into the regSrc
// register, and must be updated by the
// size of the remainder of the function
// after the delta_offset label
a.setDWordAt(posSrcPtr, dwAdjustSize + dwUnusedCodeSize);
}
Appending the encrypted data
Now that the entire code of the function has been generated, we can write out the encrypted data block, so it will follow the code of the function.
///////////////////////////////////////////////////////////
//
// append the encrypted data to the end of the code
// of the decryption function
//
///////////////////////////////////////////////////////////
void CMutagenSPE::AppendEncryptedData()
{
PDWORD lpdwEncryptedData = reinterpret_cast<PDWORD>(diEncryptedData.lpPtr);
// place the encrypted data buffer
// at the end of the decryption function
// (in 4-byte blocks)
for (DWORD i = 0; i < dwEncryptedBlocks; i++)
{
a._emitDWord(lpdwEncryptedData[i]);
}
}
Our entire polymorphic engine
Now that you have seen the successive elements of the operation of our polymorphic engine, it's time to bring it all together. I have included the class header below, followed by the implementation of the main function which generates the polymorphic code.
#include "mutagen.h"
class CMutagenSPE : private CMutagen
{
public:
CMutagenSPE(void);
~CMutagenSPE(void);
// the main function which performs the
// encryption and generates the polymorphic
// code
CMutagen::erCodes PolySPE(PBYTE lpInputBuffer, \
DWORD dwInputBuffer, \
PBYTE *lpOutputBuffer, \
PDWORD lpdwOutputSize);
private:
// a structure describing the values of
// the output registers
typedef struct _SPE_OUTPUT_REGS {
// target register
AsmJit::GPReg regDst;
// value to write in this register
DWORD dwValue;
} SPE_OUTPUT_REGS, *P_SPE_OUTPUT_REGS;
// description of an encryption operation
typedef struct _SPE_CRYPT_OP {
// TRUE if the operation is performed
// on two registers; FALSE if it is
// performed between the target register
// and the value in dwCryptValue
BOOL bCryptWithReg;
AsmJit::GPReg regDst;
AsmJit::GPReg regSrc;
// encryption operation
BYTE cCryptOp;
// encryption value
DWORD dwCryptValue;
} SPE_CRYPT_OP, *P_SPE_CRYPT_OP;
enum
{
SPE_CRYPT_OP_ADD = 0,
SPE_CRYPT_OP_SUB,
SPE_CRYPT_OP_XOR,
SPE_CRYPT_OP_NOT,
SPE_CRYPT_OP_NEG,
};
// buffer with the encryption operations
DATA_ITEM diCryptOps;
// pointer to the table of encryption
// operations
P_SPE_CRYPT_OP lpcoCryptOps;
// count of encryption operations
DWORD dwCryptOpsCount;
// pointer to the encrypted data block
DATA_ITEM diEncryptedData;
// number of blocks of encrypted data
DWORD dwEncryptedBlocks;
// encryption key
DWORD dwEncryptionKey;
// AsmJit Assembler instance
Assembler a;
// the register which will store a pointer
// to the data which is to be decrypted
AsmJit::GPReg regSrc;
// the register which will store a pointer
// to the output buffer
AsmJit::GPReg regDst;
// the register which hold the size of the
// encrypted data
AsmJit::GPReg regSize;
// the register with the encryption key
AsmJit::GPReg regKey;
// the register on which the decryption
// instructions will operate
AsmJit::GPReg regData;
// the preserved registers (ESI EDI EBX in random order)
AsmJit::GPReg regSafe1, regSafe2, regSafe3;
// the delta_offset label
Label lblDeltaOffset;
// the position of the delta offset
sysint_t posDeltaOffset;
// the relative address of the encrypted data
sysint_t posSrcPtr;
// the size of the unused code between delta
// offset and the instructions which get that
// value from the stack
DWORD dwUnusedCodeSize;
// helper methods
void RandomizeRegisters();
void GeneratePrologue();
void GenerateDeltaOffset();
void EncryptInputBuffer(PBYTE lpInputBuffer, \
DWORD dwInputBuffer, \
DWORD dwMinInstr, \
DWORD dwMaxInstr);
void SetupDecryptionKeys();
void GenerateDecryption();
void SetupOutputRegisters(SPE_OUTPUT_REGS *regOutput, \
DWORD dwCount);
void GenerateEpilogue(DWORD dwParamCount);
void AlignDecryptorBody(DWORD dwAlignment);
void AppendEncryptedData();
void UpdateDeltaOffsetAddressing();
};
///////////////////////////////////////////////////////////
//
// main function - encrypts data and generates polymorphic
// decryptor code
//
///////////////////////////////////////////////////////////
CMutagen::erCodes CMutagenSPE::PolySPE(PBYTE lpInputBuffer, \
DWORD dwInputBuffer, \
PBYTE *lpOutputBuffer, \
PDWORD lpdwOutputSize)
{
///////////////////////////////////////////////////////////
//
// check input parameters
//
///////////////////////////////////////////////////////////
if ( (lpInputBuffer == NULL) || (dwInputBuffer == 0) || \
(lpOutputBuffer == NULL) || (lpdwOutputSize == NULL) )
{
return CMutagen::MUTAGEN_ERR_PARAMS;
}
// randomly select registers
RandomizeRegisters();
///////////////////////////////////////////////////////////
//
// generate polymorphic function code
//
///////////////////////////////////////////////////////////
// generate function prologue
GeneratePrologue();
// set up relative addressing through the delta offset
// technique
GenerateDeltaOffset();
// encrypt the input data, generate encryption keys
// the additional parameters set the lower and upper
// limits on the number of encryption instructions
// which will be generated (there is no limit to this
// number, you can specify numbers in the thousands,
// but be aware that this will make the output code
// quite large)
EncryptInputBuffer(lpInputBuffer, dwInputBuffer, 3, 5);
// generate code to set up keys for decryption
SetupDecryptionKeys();
// generate decryption code
GenerateDecryption();
// set up the values of the output registers
SPE_OUTPUT_REGS regOutput[] = { { eax, dwInputBuffer } };
SetupOutputRegisters(regOutput, _countof(regOutput));
// generate function epilogue
GenerateEpilogue(1L);
// align the size of the function to a multiple
// of 4 or 16
AlignDecryptorBody(rnd_bin() == 0 ? 4L : 16L);
// fix up any instructions that use delta offset addressing
UpdateDeltaOffsetAddressing();
// place the encrypted data at the end of the function
AppendEncryptedData();
///////////////////////////////////////////////////////////
//
// free resources
//
///////////////////////////////////////////////////////////
// free the encrypted data buffer
di_vfree(&diEncryptedData);
// free the array of encryption pseudoinstructions
di_vfree(&diCryptOps);
///////////////////////////////////////////////////////////
//
// copy the polymorphic code to the output buffer
//
///////////////////////////////////////////////////////////
DWORD dwOutputSize = a.getCodeSize();
// assemble the code of the polymorphic function
// (this resolves jumps and labels)
PVOID lpPolymorphicCode = a.make();
// this struct describes the allocated memory block
DATA_ITEM diOutput;
// allocate memory (with execute permissions) for the
// output buffer
di_valloc(&diOutput, dwOutputSize);
// check that allocation was successful
if (diOutput.lpPtr != NULL)
{
// copy the generated code of the decryption function
memcpy(diOutput.lpPtr, lpPolymorphicCode, dwOutputSize);
// provide the output buffer and code size to
// this function's caller
*lpOutputBuffer = diOutput.lpPtr;
*lpdwOutputSize = dwOutputSize;
MemoryManager::getGlobal()->free(lpPolymorphicCode);
}
else
{
MemoryManager::getGlobal()->free(lpPolymorphicCode);
return CMutagen::MUTAGEN_ERR_MEMORY;
}
///////////////////////////////////////////////////////////
//
// function exit
//
///////////////////////////////////////////////////////////
return CMutagen::MUTAGEN_ERR_SUCCESS;
}
Test
In order to test out the code, we will encrypt a text string with our engine and then call the generated code of the decryption function. It's important to remember that the memory where the code is found should be allocated with the correct executable flags. Otherwise, the operating system's protection mechanisms, like e.g. Windows' DEP (Data Execution Prevention) feature will throw an exception if you attempt to call the function.
#include <conio.h>
#include "mutagen\mutagen.h"
#include "mutagen\mutagen_spe.h"
// declare the prototype of the decryption procedure
typedef DWORD(__stdcall *DecryptionProc)(PVOID);
int __cdecl main()
{
// input data (in this case a simple string,
// although it could be any data buffer)
char szHelloWorld[] = "Hello world!";
// create an instance of the polymorphic
// engine
CMutagenSPE *speEngine = new CMutagenSPE();
// a pointer to the generated decryption
// function will be placed here
PBYTE lpcDecryptionProc = NULL;
// the size of the decryption code (and
// its encrypted payload) will go here
DWORD dwDecryptionProcSize = 0;
// encrypt the input data and dynamically
// generate a decryption function
speEngine->PolySPE(reinterpret_cast<PBYTE>(szHelloWorld), \
sizeof(szHelloWorld), \
&lpcDecryptionProc, \
&dwDecryptionProcSize);
// write the generated function to disk
FILE *hFile = fopen("polymorphic_code.bin", "wb");
if (hFile != NULL)
{
fwrite(lpDecryptionProc, dwDecryptionProcSize, 1, hFile);
fclose(hFile);
}
// cast the function pointer to the right type
DecryptionProc lpDecryptionProc = reinterpret_cast<DecryptionProc>(lpcDecryptionProc);
// the output buffer for the decrypted data
char szOutputBuffer[128] = { 0xCC };
// call the decryption function via its
// function pointer
DWORD dwOutputSize = lpDecryptionProc(szOutputBuffer);
// display the decrypted text - if everything
// went correctly this will show "Hello world!"
printf(szOutputBuffer);
return 0;
}
What does the generated code look like?
The generated code will look different each time. This is the main goal of a polymorphic engine, after all! I have included two different decryption functions below, both generated by our engine from the same input data.
; beginning of decryption function (prologue)
enter 0, 0
; save sensitive registers (stdcall)
push esi
push ebx
push edi
; get the parameter passed to this function
mov ecx, [ebp+8]
call delta_offset
; unused instructions
xor eax, eax
leave
retn 4
delta_offset:
; EBX will contain the address of the label delta_offset
mov ebx, [esp]
add esp, 4
; correct the address so that it points to the encrypted data
add ebx, 51h
; set up the decryption key
mov edx, 918D3D1Bh
add edx, 78170F2Ah
; number of blocks to decrypt
mov eax, 4
decryption_loop:
; take one block of encrypted data
mov esi, [ebx]
; decryption instructions (random)
xor esi, 138E6781h
sub esi, edx
not esi
xor esi, edx
; write the decrypted block to the output buffer
mov [ecx], esi
add ebx, 4
add ecx, 4
dec eax
jnz short decryption_loop
; set up values returned by the function
; namely the size of the decrypted data
mov eax, 0Dh
; function epilogue and return to caller
pop edi
pop ebx
pop esi
leave
retn 4
; align function to 16 byte boundary (using NOP instructions)
db 5 dup(90h)
; encrypted data
db 0B6h, 044h, 052h, 0B0h, 09Bh, 087h, 05Eh, 0B1h
db 08Ch, 04Bh, 06Ah, 0F3h, 07Eh, 0ACh, 0B5h, 057h
; beginning of decryption function (prologue)
push ebp
mov ebp, esp
; save sensitive registers (stdcall)
push edi
push ebx
push esi
; get the parameter passed to this function
mov esi, [ebp+8]
call delta_offset
; unused instructions
xor eax, eax
leave
retn 4
delta_offset:
; ECX will contain the address of the label delta_offset
mov ecx, [esp]
add esp, 4
; correct the address so that it points to the encrypted data
add ecx, 4Ah
; set up the decryption key
mov ebx, 76F71EBFh
sub ebx, 50531439h
; number of blocks to decrypt
mov edi, 4
; decryption loop
decryption_loop:
; take one block of encrypted data
mov eax, [ecx]
; decryption instructions (random)
neg eax
sub eax, 4B1A7C17h
not eax
xor eax, ebx
neg eax
; write the decrypted block to the output buffer
mov [esi], eax
add ecx, 4
add esi, 4
dec edi
jnz short decryption_loop
; set up values returned by the function
; namely the size of the decrypted data
mov eax, 0Dh
; function epilogue and return to caller
pop esi
pop ebx
pop edi
leave
retn 4
; align function to 16 byte boundary (using int3 instructions)
db 5 dup(0CCh)
; encrypted data
db 028h, 014h, 01Dh, 06Ah, 001h, 059h, 012h, 06Bh
db 0F2h, 01Ch, 025h, 0ADh, 070h, 0C2h, 07Ch, 0CAh
What next?
Our polymorphic engine is pretty simple at the moment. We could add additional features, such as:
- generating spurious instructions, known as junks – these are simply blocks of code which make the code more difficult to analyze in a debugger or disassembler
- generating white noise – that is, instructions which don't affect the operation of the function (e.g. adding a value to a random register and then taking it away again)
- generating equivalent instructions (code mutations) in various forms, joined by random comparisons and jumps
- generating additional helper functions, e.g. returning values which were originally put in place by code in the main decryption function
- changes to the calling convention used, e.g. using cdecl instead of stdcall
- adding jumps to the code so the instructions are not executed linearly
- and, most advanced – multilayer encryption, that is, generated code which in turn contains another layer of decryptor
This can all be taken quite a long way, especially making use of a library like AsmJit, which greatly simplifies the dynamic generation of assembly code, including the use of all current extensions of x86 and x64 processors.