Creating a simple JIT compiler in C++

When we design a programming language, we generally have to decide whether it will be compiled or interpreted. Compilation has the obvious benefit of producing code that is often orders of magnitude faster than its purely interpreted counterpart. On the other hand, interpreters tend to be easier to implement, since we can write them in a high-level programming language without knowing platform-specific details such as the machine code, the ABI and the executable format. Interpretation also allows for easy introspection, dynamic typing, eval statements et cætera. JIT (just-in-time) compilation combines the benefits of both: when code is compiled as needed, we retain the flexibility of interpreted languages along with the speed of compiled code.

In this article we will write a simple JIT compiler for parametrized mathematical expressions, targeting the x86-64 platform, in C++.

Reverse Polish notation

(1 + 1) * (2 / 3) in standard notation corresponds to 1 1 + 2 3 / * in RPN

Expressions in the reverse Polish notation are more natural for a computer than the standard infix notation, since they are evaluated using a stack, which is a ubiquitous data structure. The other reason I chose RPN is that I have a special affinity for it, having designed a programming language based on it. The evaluation algorithm is very simple:

  1. Read the tokens (numbers, operators) one by one
    1. If the token is a number, put it on the stack
    2. If the token is a (binary) operator, pop the two topmost elements off the stack, perform the operation on them and push the result back onto the stack
  2. If the expression is well formed, the stack will contain only one element which will be the result

A simple evaluator for parametrized RPN expressions can thus be written like this (we could use std::stack, but we deliberately avoid heap allocation for now):

#include <array>
#include <cctype>
#include <cstdlib>
#include <stdexcept>

using namespace std;

double rpn(char const* expression, double x)
{
    array<double, 128> stk;
    int depth = 0;

    for (; *expression; ++expression)
    {
        if (depth >= (int)stk.size())
            throw runtime_error("Too much nesting");

        if (isdigit(*expression) || (*expression == '.') || 
            (*expression == '-' && isdigit(*(expression + 1))))
        {
            // A number literal - parse it and push it onto the stack.
            char* new_expression;
            stk[depth] = strtod(expression, &new_expression);
            expression = new_expression;
            ++depth;
        }
        else if (*expression == 'x')
        {
            // The parameter.
            stk[depth] = x;
            ++depth;
        }
        else if (*expression > 32) // any other printable character
        {
            // A binary operator - combine the two topmost elements.
            if (depth < 2)
                throw runtime_error("Invalid expression");

            switch (*expression) {
                case '+': stk[depth - 2] += stk[depth - 1]; break; 
                case '*': stk[depth - 2] *= stk[depth - 1]; break; 
                case '-': stk[depth - 2] -= stk[depth - 1]; break; 
                case '/': stk[depth - 2] /= stk[depth - 1]; break; 
                default: ;
            }
            --depth;
        }

        // strtod may have moved the pointer to the terminating '\0'.
        if (*expression == '\0')
            break;
    }
    
    return stk[0];
}

Straightforward enough. The problem is that this evaluator is relatively slow, since it has to parse the expression string every time it evaluates the expression. For example, on my laptop (Intel Core i7 2640M @ 2.8 GHz) it takes around 9.5 seconds to evaluate an expression with 100,000,000 tokens. Of course this can be made significantly faster by various means, for example by first parsing the string into bytecode, but we will skip this step and go directly to machine code compilation.

Executing arbitrary machine code

Ahead-of-time compilation is usually performed in three stages. First the compiler translates each compilation unit into machine code, producing object files. Then a linker combines the object files and statically linked libraries into an executable file, which is saved to disk. When we want to execute it, a program loader loads the executable into memory, sets up the stack space, initializes environment data and registers, and then transfers execution to the program's entry point (such as the main() function in C/C++).

JIT compilers skip the intermediary stage of linking and load the compiled program directly into memory. We will be constructing the machine code somewhere, so let us define – for convenience's sake – a helper class that allows us to easily inject opcodes and immediate values of arbitrary size (8, 16, 32 or 64 bits):

#include <cstdint>
#include <cstring>
#include <vector>

using namespace std;

class code_vector : public vector<uint8_t>
{
public:
    /*
        This enables us to push an arbitrary type as an immediate 
        value to the stream.
    */
    template<typename T>
    void push_value(iterator whither, T what)
    {
        auto position = whither - begin();
        insert(whither, sizeof(T), 0x00);
        // memcpy rather than a reinterpret_cast write, since the
        // destination is not necessarily suitably aligned for T.
        memcpy(&(*this)[position], &what, sizeof(T));
    }
};
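
A toy usage example – this is how one could emit a 32-bit mov (0xB8 being the opcode of mov eax, imm32):

code_vector code;
code.push_back(0xB8);                      // mov eax, imm32
code.push_value<uint32_t>(code.end(), 42); // the immediate operand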

Due to security precautions on modern OSes, we cannot execute arbitrary code in memory. The code has to reside in a memory page that does not have the NX bit set. Fortunately, it is very easy to allocate an executable chunk of memory. On Windows, we would call something like:

void* ptr = VirtualAlloc(0, size, MEM_COMMIT, PAGE_EXECUTE_READWRITE);

and on Unix:

void* ptr = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC, 
                 MAP_PRIVATE | MAP_ANON, -1, 0);

Before we continue, let me stress that what we just did is dangerous. We don't want the memory to be both writable and executable at the same time, as this opens up a rather large attack surface for buffer overrun attacks. Therefore, we will first write everything we want and only then mark the memory executable, which is not any harder.

Even though we are leaving the safety and comfort of C++ by writing machine code, we would still like a nice way of wrapping our functionality. We will write an x_function class with an interface similar to std::function's.

First the memory allocation, protection and deallocation routines:

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#endif

template<typename> class x_function;

template<typename R, typename... ArgTypes>
class x_function<R(ArgTypes...)>
{
    bool executable_;
    size_t size_;
    void * data_;

public:
    void set_executable(bool executable)
    {
        if (executable_ != executable)
        {
            #ifdef _WIN32
            DWORD old_protect;
            VirtualProtect(
                data_, size_,
                executable ? PAGE_EXECUTE_READ : PAGE_READWRITE, 
                &old_protect
            );
            #else
            mprotect(data_, size_,
                PROT_READ | (executable ? PROT_EXEC : PROT_WRITE));
            #endif
            executable_ = executable;
        }
    }

    x_function(size_t size) :
        executable_(false),
        size_(size)
    {
        if (size == 0) { data_ = nullptr; return; }
        #ifdef _WIN32
        data_ = VirtualAlloc(0, size, MEM_COMMIT, PAGE_READWRITE);
        #else
        data_ = mmap(NULL, size,
            PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
        #endif
    }

    x_function(void* data, size_t size) :
        x_function(size)
    {
        memcpy(data_, data, size);
        set_executable(true);
    }

    ~x_function()
    {
        #ifdef _WIN32
        VirtualFree(data_, 0, MEM_RELEASE);
        #else
        munmap(data_, size_);
        #endif
    }
    
/* ... */

Next we implement basic copy and move semantics. For demonstration purposes, we will not be overly concerned with how efficiently memory management is handled. For now, let the memory be actually copied when the object is copied, but in reality some form of reference counting is more reasonable, since the executable code will not be changing much once it is committed to memory.

/* ... */

    void swap(x_function& other)
    {
        using std::swap;
        swap(executable_, other.executable_);
        swap(size_, other.size_);
        swap(data_, other.data_);
    }

    x_function() : x_function(0) {}
    x_function(const x_function& other) : x_function()
    {
        x_function copy(other.size_);
        memcpy(copy.data_, other.data_, other.size_);
        copy.set_executable(other.executable_);

        swap(copy);
    }

    x_function(x_function&& other) : x_function() { swap(other); }
    x_function& operator=(x_function other)
    {
        swap(other);
        return *this;
    }

    x_function(code_vector::iterator begin, code_vector::iterator end) :
        x_function(&*begin, end - begin) { }

/* ... */

And now the most interesting part of our class – the operator():

/* ... */

    template<typename... RArgTypes>
    R operator()(RArgTypes&&... args) const
    {
        return reinterpret_cast<R(*)(ArgTypes...)>(data_)(
            forward<RArgTypes>(args)...);
    }
};

Since it may look slightly esoteric on the surface, let us break it down:

  1. First it casts the data_ pointer to a pointer to a function of the type the x_function was instantiated with. E.g. if we have x_function<int(int)>, it will cast the pointer to the int (*)(int) pointer type.
  2. Then it invokes the function referenced by that pointer with the parameters provided to it. The rest of the clutter is just for perfect forwarding, made possible by rvalue references and variadic templates.

This is it! Now we can write our first function. The remainder of this article will involve x86-64 machine code and assembly, with which I admit I don't have a tremendous amount of experience. So if you spot any errors or inelegance, please let me know, so I can improve the article.

“Actual” C++ closures

Simply put, closures are functions that "capture" values from their environment – here, a function returning another function whose behaviour depends on the captured argument. Since C++11, we can create closures in an elegant manner with lambdas:

function<int(int)> adder(int n) 
{
   return [n] (int m) { return n + m; };
}

auto f = adder(42);
cout << f(10) << endl  // 52
     << f(-1);         // 41

Usually, closures are implemented by storing the captured value somewhere and accessing it later. Using our class we can create "actual closures", i.e. functions whose machine code itself depends on the captured parameter. Without further ado, the "actual closure" equivalent is:

x_function<int64_t(int64_t)> real_adder(int64_t value)
{
    code_vector code;
    code.insert(code.end(), { 0x48, 0xB8 });  // movabs rax, imm64
    code.push_value(code.end(), value);       // the captured value
#ifdef _WIN32
    code.insert(code.end(), { 0x48, 0x01, 0xC8, 0xC3 });  // add rax, rcx; ret
#else
    code.insert(code.end(), { 0x48, 0x01, 0xF8, 0xC3 });  // add rax, rdi; ret
#endif
    return x_function<int64_t(int64_t)>(code.begin(), code.end());
}

auto f = real_adder(42);
cout << f(10) << endl  // 52
     << f(-1);         // 41

Here is the disassembly of the code it produces (Windows version, reconstructed from the opcodes above):

movabs rax, 42     # the hard-coded value
add    rax, rcx    # rdi instead of rcx on Unix
ret

The function does exactly what the disassembly above shows: it moves the hard-coded value into the rax register, adds the contents of rcx (or rdi on Unix) to it and returns. Granted, it is not as elegant as the lambda version above (and it shouldn't be any faster), but it serves to illustrate the principle we will extend for our RPN compiler.

Calling conventions

We have seen that there was some conditional compilation in the previous snippet. The reason for this is the different calling conventions on Windows and Unix-like operating systems.

A calling convention is a standard specifying how functions interact at the machine code level in an interoperable way. It varies greatly between system architectures, and even among different OSes on the same architecture. It specifies how parameters are passed when a function is called and who (the caller or the callee) takes care of cleaning up the stack. The 32-bit x86 architecture knows a plethora of calling conventions (cdecl, stdcall, pascal, fastcall, …), but on x86-64 there is generally just one (with minor differences among OSes). Since the JIT we are building targets x86-64 on Windows and Linux, let us highlight the important points:

  1. The caller cleans up the stack
  2. Some integer and pointer (reference) parameters are passed in registers, the rest on the stack
    1. On Windows, the first 4 parameters in rcx, rdx, r8 and r9
    2. On Unix, the first 6 parameters in rdi, rsi, rdx, rcx, r8 and r9
  3. An integer/pointer return value is always passed in rax
  4. Floating point parameters are passed in SSE registers, the rest on the stack:
    1. On Windows, the first 4 parameters in xmm0, xmm1, xmm2 and xmm3
    2. On Unix, the first 8 parameters in xmm0-xmm7
  5. A floating point return value is always passed in xmm0
  6. On Windows, certain xmm registers must be preserved across function calls (xmm6-xmm15; the code below conservatively saves xmm5-xmm7, the ones it clobbers)

Full details on x86-64 calling conventions can be found here (for Windows) and in the System V ABI (for Unix).
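
For a taste of the floating point rules (points 4 and 5), here is a hypothetical real_scaler – my own illustration, not part of the original program – that bakes a multiplicative factor into the code. The first floating point parameter and the return value both live in xmm0 on either ABI, and the scratch registers rax and xmm1 are volatile on both, so no conditional compilation is needed:

x_function<double(double)> real_scaler(double factor)
{
    code_vector code;
    code.insert(code.end(), { 0x48, 0xB8 });                    // movabs rax, imm64
    code.push_value(code.end(), factor);                        // the bit pattern of 'factor'
    code.insert(code.end(), { 0x66, 0x48, 0x0F, 0x6E, 0xC8 }); // movq   xmm1, rax
    code.insert(code.end(), { 0xF2, 0x0F, 0x59, 0xC1 });       // mulsd  xmm0, xmm1
    code.push_back(0xC3);                                       // ret
    return x_function<double(double)>(code.begin(), code.end());
}

auto g = real_scaler(2.5);
cout << g(4.0); // 10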

Large data types are usually passed by reference. We can just as easily handle that, for example like this (this presumes the Windows calling convention):

x_function<void(int64_t&)> real_adder_by_ref(int64_t value)
{
    code_vector code;
    code.insert(code.end(), { 0x48, 0xB8 });
    code.push_value(code.end(), value);
    code.insert(code.end(), { 0x48, 0x01, 0x01, 0xC3 });
    /*
        movabs rax, %(value)
        add    QWORD PTR [rcx],rax  
        ret
    */
    return x_function<void(int64_t&)>(code.begin(), code.end());
}

int64_t x = 10;
real_adder_by_ref(42)(x);
cout << x; // 52

We can use C++ references or pointers interchangeably – under the hood they correspond to exactly the same machine code but in my opinion references have a cleaner syntax.

Compiling RPN expressions to machine code

Now we have everything we need to create a compiler for RPN expressions. We define a function rpn_compile(char const*) that takes an RPN expression string (parametrized with x) and returns a function mapping x to the evaluated expression's value. The machine code will be a simple sequence of loads and operations without any jumps (since there is no need for them). We have 8 registers xmm0-xmm7 at our disposal, and we will try to compute the expression without using the main stack; in case the nesting level is too high and the register slots do not suffice, we will store the intermediate values there.

First we define some macros (in the form of lambdas) for commonly used x86-64 instructions. These inject their respective opcodes into our code_vector:

x_function<double(double)> rpn_compile(char const* expression) 
{
    code_vector code;
    vector<double> literals;
    int depth = 0; // current height of the register stack
    
    // Some macros for commonly used instructions.
    auto xmm = [](uint8_t n1, uint8_t n2) -> uint8_t 
    { return 0xC0 + 8 * n1 + n2; };

    auto pop_xmm = [&](uint8_t whither) { 
        code.insert(code.end(), { 0xF3, 0x0F, 0x6F, 
          (uint8_t)(0x04 + whither * 8), 0x24, 0x48, 0x83, 0xC4, 0x10 }); 
    };

    auto push_xmm = [&](uint8_t whence) { 
        code.insert(code.end(), { 0x48, 0x83, 0xEC, 0x10, 
          0xF3, 0x0F, 0x7F, (uint8_t)(0x04 + whence * 8), 0x24 }); 
    };

    auto operation = [&](uint8_t op, uint8_t n1, uint8_t n2) { 
        code.insert(code.end(), { 0xF2, 0x0F, op, xmm(n1, n2) });
    };

    auto movapd_xmm = [&](uint8_t n1, uint8_t n2) {
        code.insert(code.end(), { 0x66, 0x0F, 0x28, xmm(n1, n2) });
    };

    auto load_xmm = [&](uint8_t n, int32_t offset) {
        code.insert(code.end(), { 0x66, 0x0F, 0x6F, 
          (uint8_t)(0x81 + n * 8) }); 
        code.push_value<int32_t>(code.end(), 16 * offset); 
    };
/* ... */    

Their meaning is as follows:

Macro       Meaning
xmm         Computes the ModRM byte denoting the register pair xmm{n1}, xmm{n2}
pop_xmm     Pops a floating point value off the main stack into xmm{n}
push_xmm    Pushes the floating point value from xmm{n} onto the main stack
operation   Performs one of the four arithmetic operations on xmm{n1}, xmm{n2}
movapd_xmm  Copies the contents of xmm{n2} into xmm{n1}
load_xmm    Loads a literal from the given offset in the data section into xmm{n}

pop_xmm and push_xmm are implemented by manually in-/decrementing the stack pointer and copying the value.
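
In assembly, the injected byte sequences correspond to (xmmN being the chosen register):

/*
    push_xmm(N):
        sub    rsp, 16
        movdqu [rsp], xmmN

    pop_xmm(N):
        movdqu xmmN, [rsp]
        add    rsp, 16
*/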

Let us examine how the constant values will be stored. The xmmN registers are 128-bit (16-byte) registers containing two packed double-precision floating point values. We will store the constant values separately, after the executable code, and load them from there when needed. To this end, we store a pointer to the beginning of the data section in the rcx register at the start of the routine (using a RIP-relative lea) and then load the data from offsets relative to it – for example:

# Function entry point
000000A0614C0000  lea         rcx, [rip + 0x59]  # We store the beginning of data
...
000000A0614C0022  movdqa      xmm1,xmmword ptr [rcx]     # We move the 1st constant to xmm1
000000A0614C002A  movdqa      xmm2,xmmword ptr [rcx+10h] # We move the 2nd constant to xmm2
... 
000000A0614C0055 ret # End of the function
...

# Beginning of data section 
000000A0614C0060  00 00
000000A0614C0062  00 00
000000A0614C0064  00 00
000000A0614C0066  F0 3F
000000A0614C0068  00 00
000000A0614C006A  00 00
000000A0614C006C  00 00
000000A0614C006E  00 00

The constants have to be aligned on a 16-byte boundary so that we can load them with movdqa, which is more efficient than its unaligned counterpart movdqu. We will inject the lea instruction once we have generated the rest of the code, since we have to know the total code length.

As I mentioned in the previous section, the Windows ABI requires us to preserve the higher xmm registers, so we push the ones we will clobber onto the stack, to be restored when we are done:

/* ... */ 
    #ifdef _WIN32
    push_xmm(5); push_xmm(6); push_xmm(7);
    #endif
/* ... */ 

The rest of the code is a straightforward modification of the algorithm we had earlier. At each point we keep track of the current depth; as long as we are fewer than 6 levels high, we just shuffle the registers around and compute the operations on them. If the depth exceeds 6 (we need xmm6 and xmm7 for the calculation itself, so we cannot use them as part of the register stack), we push or pop the values onto/from the main stack.

If we encounter a literal, we push a load instruction with the appropriate offset (a multiple of 16) into the code. The actual literal is stored separately, to be inserted into the data section later:

/* ... */
    for (; *expression; ++expression)
    {
        if ( isdigit(*expression) || (*expression == '.') || 
            (*expression == '-' && isdigit (*(expression+1)) ))
        {
            char* new_expression;
            literals.push_back(strtod(expression, &new_expression));
            expression = new_expression;

            // We copy the data from the literal table to the appropriate 
            // register
            load_xmm(min(depth + 1, 6), literals.size() - 1);
            
            if (depth + 1 >= 6)
                push_xmm(6);
            
            ++depth;
        }
/* ... */

By the calling convention, we have received the parameter x in the xmm0 register. So in our case the register stack begins at xmm1, and when we want to use the parameter, we just copy it to the appropriate place.

/* ... */
        else if (*expression == 'x')
        {
            // The parameter is already in this register, so we just 
            // copy/push it.
            if (depth + 1 >= 6)
                push_xmm(0);
            else
                movapd_xmm(depth + 1, 0); 

            ++depth;
        }	
/* ... */

If we read an operator, we push the appropriate arithmetic instruction:

/* ... */
        else if(*expression > 32)
        {
            // If we have fewer than 2 operands on the stack, the 
            // expression is malformed.
            if (depth < 2)
                throw runtime_error("Invalid expression");

            if (depth >= 6)
                pop_xmm(6);

            if (depth > 6)
                pop_xmm(7);
            
            // Perform the operation in the correct registers
            int tgt_reg = min(depth - 1, 6);
            int src_reg = min(depth, 7);

            switch (*expression) {
                case '+': 
                    operation(0x58, tgt_reg, src_reg); 
                    // addsd xmm{tgt}, xmm{src}
                    break; 
                case '*': 
                    operation(0x59, tgt_reg, src_reg); 
                    // mulsd xmm{tgt}, xmm{src}
                    break; 
                case '-': 
                    operation(0x5C, tgt_reg, src_reg); 
                    // subsd xmm{tgt}, xmm{src}
                    break; 
                case '/': 
                    operation(0x5E, tgt_reg, src_reg); 
                    // divsd xmm{tgt}, xmm{src}
                    break; 
                default:;
            }

            // If the register stack is full, we push onto the main stack.
            if (depth > 6)
                push_xmm(6);

            --depth;
        }
        
        // If strtod moved the pointer to the end.
        if (*expression == '\0')
            break;
    }
/* ... */    

When the code above finishes execution, the result is stored in xmm1. So we move it into xmm0 (as specified by the calling convention), restore the registers that we saved earlier and return from the function:

/* ... */
    // If there is too little or too much left on the stack.
    if (depth != 1)
        throw runtime_error("Invalid expression");

    // The return value is passed by xmm0. Now we no longer need 
    // to hold onto the value for x.
    movapd_xmm(0, 1);

    #ifdef _WIN32
    // We restore the XMM5-7 register state.
    pop_xmm(7); pop_xmm(6); pop_xmm(5);
    #endif

    code.push_back( 0xc3 ); // ret
/* ... */

That’s it! The only thing left to do is to insert the constant values and calculate the offset.

/* ... */

    // +7 since we are going to insert a lea instruction 
    int32_t executable_size = code.size() + 7; 

    // Align on the 16-byte boundary:
    executable_size = 15 + executable_size - (executable_size - 1) % 16; 

    code.insert(code.begin(), { 0x48, 0x8D, 0x0D }); 
    code.push_value<int32_t>(code.begin() + 3, executable_size - 7);
    code.insert(code.end(), executable_size - code.size(), 0x00);

    /*
        We place all the floating point literals AFTER the code.
    */
    for (double val : literals)
    {
     	// The registers are 128-bit but we are only
     	// interested in the lower half
        code.push_value<double>(code.end(), val);
        code.push_value<double>(code.end(), 0);
    }

    return x_function<double(double)>(code.begin(), code.end());
}

We can now compile expressions and later evaluate them for a given parameter:

auto reciprocal = rpn_compile("1 x /"); 

for (double x = -1; x <= 1; x += 0.001)
{
    cout << "1/" << x << " = " << reciprocal(x) << endl;
}

In this concrete case, the expression compiles to the machine code seen below:

000000F41BEB0000 lea         rcx,[0F41BEB0060h]  

# Save the register values 
000000F41BEB0007 sub         rsp,10h 
000000F41BEB000B movdqu      xmmword ptr [rsp],xmm5  
000000F41BEB0010 sub         rsp,10h  
000000F41BEB0014 movdqu      xmmword ptr [rsp],xmm6  
000000F41BEB0019 sub         rsp,10h  
000000F41BEB001D movdqu      xmmword ptr [rsp],xmm7  

# Actual calculation
000000F41BEB0022 movdqa      xmm1,xmmword ptr [rcx]  
000000F41BEB002A movapd      xmm2,xmm0  
000000F41BEB002E divsd       xmm1,xmm2  
000000F41BEB0032 movapd      xmm0,xmm1  

# Restore the register values
000000F41BEB0036 movdqu      xmm7,xmmword ptr [rsp]  
000000F41BEB003B add         rsp,10h  
000000F41BEB003F movdqu      xmm6,xmmword ptr [rsp]  
000000F41BEB0044 add         rsp,10h  
000000F41BEB0048 movdqu      xmm5,xmmword ptr [rsp]  
000000F41BEB004D add         rsp,10h  
000000F41BEB0051 ret

# The rest is data. The constant 1.0 resides at address 0F41BEB0060

Summing up

Of course we now want to know just how much was gained by the extra effort of compiling the function. The benchmark I mentioned before is the output of a small routine that generates random expressions of a specified length. It turns out that the speed-up is quite significant; for an expression with 100,000,000 tokens (compiled with Intel C++ Compiler 15):

Generation of a random expression    42.4 s
Evaluation (interpreted)              9.62 s
Compilation                          16.9 s
Evaluation (compiled)                 0.113 s

This is a speed-up of almost 8500%. Of course, this is not a completely fair comparison – both the interpreted and the compiled calculation could be optimized further, and evaluating a random expression 100,000,000 tokens long is hardly a decisive benchmark of real-world performance. But the point stands – compilation can yield significant performance improvements, and with enough abstraction around the scary parts it is not even that hard to do. There are projects, such as the AsmJit library, that provide a very nice way to build optimized functions at runtime, and they can be used even in programs that are not programming language interpreters.

The whole program is available here. Feel free to suggest possible improvements.

Implementing a variant type from scratch in C++

When programming in C++ we sometimes need heterogeneous containers that can hold more than one type of value (though only one at a time). If our container only needs to contain basic types and Plain Old Data Structures, the solution is simple:

union simple_variant_1 {
   int as_integer;
   bool as_bool;
   float as_float;
};

If we want to know which type is stored inside, we can make a tagged union, something along the lines of:

struct simple_variant_2 {
   enum {t_int, t_bool, t_float} type_id;
   union {
      int as_integer;
      bool as_bool;
      float as_float;
   };
};

From the efficiency standpoint, this is probably an almost optimal solution. A union will only consume as much space as the largest of its member types. If we try to access the as_float member when the actual value stored is an int, we will see implementation-dependent gibberish, but that's OK, because we know we are accessing the wrong type. Likewise, basic types do not need to be initialized or finalized – they can safely contain gibberish, and after we are done using them, they don't need to free memory, clean up global state, etc.
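
For example, the tag lets the calling code check which member is alive before reading it:

simple_variant_2 v;
v.type_id = simple_variant_2::t_float;
v.as_float = 3.14f;

if (v.type_id == simple_variant_2::t_float)
    cout << v.as_float; // safe: the tag says the float member is active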

The problem arises, of course, when we want to use "smart" objects, such as std::string, std::vector, … in our union. These types implement custom constructors, copy and move operations, destructors, … The internal structure of std::string, for example, contains a pointer to the string data. If this structure is not properly initialized, it will point to a random location in memory, causing a segmentation fault the first time we access it. If it is not finalized, the memory it allocated will not be freed, causing memory leaks.

This is exactly the reason why types with non-trivial members were disallowed in C++ unions. When the union is created, there is simply no way of knowing which of its members should be initialized, and when it goes out of scope, the compiler cannot know which type to invoke the destructor on. In C++11 the rules have been relaxed, but this requires us to take the initializing/finalizing part into our own hands.

This is how a variant type implementation can look. We use the rarely-seen placement new syntax, which decouples the memory acquisition and object initialization that the new operator normally bundles together, performing only the initialization part on memory we already have. In the same manner, we have to call the destructor manually, because delete would also attempt to deallocate the memory.

struct variant {
   enum variant_type {t_int, t_string};
   variant_type type_;
   union {
      int as_integer;
      string as_string;
   };
   variant(variant_type type) : type_(type) {
       if (type == t_string)
           new (&as_string) string();
       else
           new (&as_integer) int();
   }
   ~variant() {
       if (type_ == t_string)
           as_string.~string();
       else
           as_integer.~int();
   }
};

Of course, in the above example we notice that new (&as_integer) int() and as_integer.~int() are completely redundant, because integers need neither initialization nor finalization – their constructors and destructors are no-ops. They are included for consistency, looking ahead to the point where we use templates for generic variant types.

Note: In fact the above code does not even compile on GCC 4.8 and Clang 3.4 – an explicit destructor call such as as_integer.~int() is not valid for a built-in type named directly – but when we refer to int through a generic template parameter, everything works as expected.

The above code works just fine, but this approach does not lend itself to implementing generic templated types. What we would like is something that can be used like this:

variant<int, float, string> var;
var.as<string>() = "Hello world!";
cout << var.as<string>(); // Prints Hello World
var.as<int>() = 42;
cout << var.as<int>(); // Prints 42
var.as<float>() = 3.14;
cout << var.as<float>(); // Prints 3.14

In the following section we will build the fully-templated variant type from scratch.

Basic construction

We begin the construction by considering the basic layout of our variant type. The union approach works well when we have a hard-coded set of types to choose from, but not so much when we have a templated type.

We start by setting the basic layout to something like this:

struct variant {
   type type_id;
   char data[enough];
};

This of course raises the question of "How much is enough?". Since every object needs to fit into the container, it must be at least as big as the largest of the member types.
This is the time to introduce a helper object and some variadic template magic:

template<typename F, typename... Ts>
struct variant_helper {
  static const size_t size = 
    sizeof(F) > variant_helper<Ts...>::size ? sizeof(F) : variant_helper<Ts...>::size;
};

template<typename F>
struct variant_helper<F>  {
	static const size_t size = sizeof(F);
};

We will use the helper object for everything that will require a template specialization for the case of one single member type so as to avoid the code duplication in the main variant class.

This allows us to define:

template<size_t size>
class raw_data { char data[size]; };

template<typename... Ts>
struct variant {
  typedef variant_helper<Ts...> helper;

  size_t type_id;
  raw_data<helper::size> data;
};

We use the wrapper class raw_data to work around the awkwardness and inconsistency of C-style arrays. It also allows us to easily change the definition if we are programming for a platform where, say, a byte is not an octet.

Edit (10/11/2013): as some have pointed out, this implementation of raw_data will lead to alignment issues on certain platforms. This can be alleviated by declaring the container type as a union of all the variant member types – C++11 provides a very nice container type that works with variadic templates, which we can use directly in this code without any other changes. We can also use aligned_storage. The full source code has been updated.
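
A minimal sketch of the aligned-storage route, assuming C++11 <type_traits> (the member keeps the name data used above):

#include <type_traits>

template<typename... Ts>
struct variant {
    /* ... */
    // Large enough and suitably aligned for every one of Ts...
    typename std::aligned_union<0, Ts...>::type data;
};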

Type identification

We need a way to create a mapping between member types and the values stored in the type_id variable. We will use C++'s very handy RTTI utilities. This requires the program to be compiled with RTTI data; in theory it would be possible to implement this without RTTI, but we would have to manually create type traits for every type we would like to use, like this:

template<typename T> struct type_traits;
template<> struct type_traits<int> { static const size_t id = 0; };
template<> struct type_traits<string> { static const size_t id = 1; };
template<> struct type_traits<float> { static const size_t id = 2; };
/* ... */

This is useful in some cases, for example if we want to have precise control over the representation of the stored type ID, but very cumbersome when writing a general-purpose container. If we take a look at std::type_info, we see that typeid(T).hash_code() returns a size_t value that is the same for all references to one type. (Strictly speaking, uniqueness across different types is not guaranteed by the standard, but in practice the values are distinct, which makes it suitable for our purposes.)

It is also possible to have the types identified sequentially with some template magic, but we would rather have global type IDs, to allow for compatibility between variants instantiated with different sets of member types in a possible future expansion.
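
For illustration, such sequential numbering could look like this (a sketch that is not used in the rest of the article):

// The ID of T is its position within the pack Ts...
template<typename T, typename F, typename... Ts>
struct position { static const size_t value = 1 + position<T, Ts...>::value; };

template<typename T, typename... Ts>
struct position<T, T, Ts...> { static const size_t value = 0; };

// position<int, float, int>::value == 1

Such IDs are only stable within a single instantiation, which is exactly why we prefer the global hash_code() approach here.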

Destructor

We will now look at the implementation of the destructor. The destructor has to go through every member type and check whether its ID matches the ID stored in the object. If it does, it has to cast the data to that member type and invoke its destructor.
We create an external routine called destroy in the helper class, so that we don't have to specialize the main template:

template<typename F, typename... Ts>
struct variant_helper {
  /* ... */
  inline static void destroy(size_t id, void * data) 
  {
    if (id == typeid(F).hash_code())
      reinterpret_cast<F*>(data)->~F();
    else
      variant_helper<Ts...>::destroy(id, data);
  }
  /* ... */
};

template<typename F>
struct variant_helper<F>  {
  /* ... */
  inline static void destroy(size_t id, void * data) 
  {
    if (id == typeid(F).hash_code())
      reinterpret_cast<F*>(data)->~F();		
  }
  /* ... */
};

template<typename... Ts>
struct variant {
private:
  typedef variant_helper<Ts...> helper;

  size_t type_id;
  raw_data<helper::size> data;
public:	
  /* ... */
  ~variant() { helper::destroy(type_id, &data); }
  /* ... */
};

Initialization of objects

Contrary to Boost.Variant's never-empty guarantee, our variant type will allow an uninitialized state – the "empty container" – because it greatly simplifies matters. Ergo, we only need to provide the default constructor:

template<typename... Ts>
struct variant {
  static inline size_t invalid_type() {
    return typeid(void).hash_code();
  }
  /* ... */
  variant() : type_id(invalid_type()) {	}
};

Despite what we outlined in the introduction, we will have two functions to access the member types – set() and get().
set() will perfectly forward all of its parameters directly to T's constructor, whereas get() will throw an exception if we try to access a type that is not currently contained within.

template<typename... Ts>
struct variant {
  /* ... */
  template<typename T, typename... Args>
  void set(Args&&... args)
  {
    // We must first destroy what is currently contained within.
    helper::destroy(type_id, &data);		

    new (&data) T(std::forward<Args>(args)...);
    type_id = typeid(T).hash_code();
  }

  template<typename T>
  T& get()
  {
    if (type_id == typeid(T).hash_code())
      return *reinterpret_cast<T*>(&data);
    else
      throw std::bad_cast();
  } 	
  /* ... */
};
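
With these in place, the example from the introduction looks like this:

variant<int, string> var;
var.set<string>("Hello world!");
cout << var.get<string>(); // Prints Hello world!
var.set<int>(42);          // Destroys the string first
cout << var.get<int>();    // Prints 42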

Copy and move semantics

At this point the variant type is already usable, but its functionality is severely limited, since it does not implement move/copy semantics, so we cannot use it in STL containers. Actually, it is worse: the implicitly defined copy constructor will just copy the contents of data, so when both copies fall out of scope, the object contained therein will be finalized twice, which will probably lead to a crash. It is therefore imperative to either implement custom constructors/operators or mark the object as non-copyable/non-movable.

The move constructor is very simple, since we only need to steal the resources and invalidate the old instance:

variant(variant<Ts...>&& old) : type_id(old.type_id), data(old.data)
{
  old.type_id = invalid_type();
} 

The copy constructor involves some more work, since it has to invoke the suitable copy constructor of the member type of the object contained in the original container.

template<typename F, typename... Ts>
struct variant_helper {
  /* ... */
  inline static void copy(size_t old_t, const void * old_v, void * new_v)
  {
    if (old_t == typeid(F).hash_code())
      new (new_v) F(*reinterpret_cast<const F*>(old_v));
    else
      variant_helper<Ts...>::copy(old_t, old_v, new_v);		
  }
};

template<typename F>
struct variant_helper<F>  {
  /* ... */
  inline static void copy(size_t old_t, const void * old_v, void * new_v)
  {
    if (old_t == typeid(F).hash_code())
      new (new_v) F(*reinterpret_cast<const F*>(old_v));
  }
};

template<typename... Ts>
struct variant {
  /* ... */
  variant(const variant<Ts...>& old) : type_id(old.type_id)
  {
    // We need to do a "deep" copy, that is - invoke the copy 
    // constructor of the contained type.
    helper::copy(old.type_id, &old.data, &data);
  }
  /* ... */
};

The last things we need are the corresponding operator= implementations – everything quite standard – and some helper functions:

variant<Ts...>& operator= (variant<Ts...>&& old)
{
  // We destroy our current contents, steal the data and invalidate 
  // the old value.
  helper::destroy(type_id, &data);
  data = old.data;
  type_id = old.type_id;
  old.type_id = invalid_type();
	
  return *this;
}

variant<Ts...>& operator= (const variant<Ts...>& old)
{
  // Copy-and-swap: construct a copy and exchange our contents with it.
  variant<Ts...> copy(old);
  std::swap(data, copy.data);
  std::swap(type_id, copy.type_id);
		
  return *this;
}	

// ----------------------------------------- //

template<typename T>
bool is() const {
  return (type_id == typeid(T).hash_code());
}

bool valid() const {
  return (type_id != invalid_type());
}

Conclusion

This implementation does not use dynamic memory management/heap storage, so it should be quite efficient. Its goal is to be simple to understand, but it may not be as robust as heavy-duty implementations such as Boost.Variant. It uses fancy C++11 features, so it will probably not work on older compilers. The full source code is available here.

Fixing the brightness fluctuation problem on HP laptops with AMD graphics

After installing Windows 8 on my laptop for the second time, I decided not to install the AMD driver pack from the HP website, since the driver installed by Windows Update worked perfectly and, as you can imagine, I am not particularly fond of the oft-included management software that hogs system resources. Everything worked well until I discovered that when the computer is not plugged in, the screen brightness fluctuates widely – not following the ambient light, as one would expect, but the current content of the screen. Counterintuitively, white windows in the foreground made the screen turn brighter and vice versa, which made the problem even more visible.

I sifted through the power options (since this only happened when the computer was not plugged in), but the adaptive brightness options were already disabled. I remembered having similar problems on my previous installation of Windows, and as a quick googling revealed, this behaviour is caused by an ingenious feature of the AMD drivers called Vari-Bright™, which can easily be turned off using AMD's Catalyst Control Center – which I chose not to install.

I figured that since the Catalyst Control Center is just a GUI frontend, the settings must be stored somewhere else, presumably in the registry. A bit of Google, a bit of trial and error and a couple of restarts later, the brightness stayed constant.

I added a DWORD value PP_VariBrightFeatureEnable set to 0x00000000 to the following key:
HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\Class\{4d36e968-e325-11ce-bfc1-08002be10318}\0000

There is no need to install CCC just for this :)

How to calculate Jordan’s normal form (the hard way)

  1. First thing we do is find out the spectrum of \(A\), that is, its eigenvalues. We shall mark them as \(\text{sp}A=\{\lambda_1, \lambda_2, \lambda_3, \cdots, \lambda_n\}\). We do this however we can, usually by finding the zeros of the characteristic polynomial, that is solving the equation \(\det(A-\lambda I) = 0\) for \(\lambda\)
  2. In the next step we will be examining the sequence of generalized eigenspaces (i.e. \(\ker(A-\lambda I)^n\) for a given \(n\)) for each \(\lambda \in \text{sp}A\). $$\begin{gather} \ker(A-\lambda I) \\ \ker(A-\lambda I)^2 \\ \ker(A-\lambda I)^3 \\ \vdots \end{gather}$$ We calculate each kernel by simply solving the equation $$(A-\lambda I)^n
    \begin{bmatrix}
    x_1 \\
    x_2 \\
    x_3 \\
    \vdots \\
    x_n
    \end{bmatrix} = 0$$ Eventually, two consecutive kernels in the sequence will be equal – \(\ker(A-\lambda I)^k = \ker(A-\lambda I)^{k+1}\)
  3. When we have found the first such kernel, we take a look at the one preceding it – \(\ker(A-\lambda I)^{k-1}\). We must now choose a vector \(x_0\) that is in \(\ker(A-\lambda I)^k\) but not in \(\ker(A-\lambda I)^{k-1}\). For example if $$\ker(A-\lambda I)^{k-1} = \mathcal{L}\left( \begin{bmatrix}
    1 \\
    1 \\
    0
    \end{bmatrix},\begin{bmatrix}
    1 \\
    0 \\
    0
    \end{bmatrix} \right)$$ whereas \(\ker(A-\lambda I)^k=\mathbb{R}^3\), we can choose $$x_0=\begin{bmatrix}
    0 \\
    0 \\
    1
    \end{bmatrix}$$
  4. We now consider a sequence: $$\begin{gather} x_1 = (A-\lambda I) x_0 \\ x_2 = (A-\lambda I) x_1 \\ \vdots \end{gather} $$ Eventually one vector will be zero, $$x_n = 0$$
  5. Concatenate the vectors up to \(x_{n-1}\) together from right to left $$A_{\lambda} = [ x_{n-1}, x_{n-2}, \ldots, x_1, x_0 ] $$
  6. Repeat the above steps for every eigenvalue of \(A\). Then concatenate the matrices \(A_{\lambda}\) corresponding to each eigenvalue together. $$P = [ A_{\lambda_1}, A_{\lambda_2}, \ldots, A_{\lambda_n} ] $$ This is the so-called Jordan basis, the change-of-basis matrix we need to calculate the Jordan form. The order of \(A_{\lambda}\) matrices does not matter since Jordan normal form is only unique up to a permutation of Jordan blocks.
  7. We need to calculate the inverse of \(P\), usually by Gaussian elimination.
  8. We calculate the Jordan form by \(J = P^{-1} A P\).
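
As a quick sanity check of the procedure, take \(A = \begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix}\). Its only eigenvalue is \(\lambda = 2\), with \(\ker(A-2I) = \mathcal{L}\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right)\), while \((A-2I)^2 = 0\) gives \(\ker(A-2I)^2 = \mathbb{R}^2\), so \(k = 2\). Choosing \(x_0 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}\) (in the larger kernel but not the smaller one), we get \(x_1 = (A-2I)x_0 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\) and \(x_2 = 0\). Hence \(P = [x_1, x_0] = I\) and \(J = P^{-1}AP = \begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix}\) – a single Jordan block, as expected.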

Comparing BGP routing tables for missing routes

If we are connected to the internet via a single ISP, we will most likely have a default route set up towards one of their access routers. However, if we want to multi-home, whether for reliability or load-balancing reasons, the most straightforward way is to set up BGP peering with our upstream providers. For example, if we connect to three upstream ISPs – A, B and C – there is no point in sending packets destined for one of B's (or its clients') IP addresses to A or C. In order to make informed decisions on where to send packets, we will usually need to receive a full BGP feed, so as to see the "whole internet" in terms of routing tables.

In the ideal world, we would see all the IP prefixes that are advertised in the internet, via all of our peers, leaving us to choose the best path, based on the AS path length and local preference. This is not always the case, though. Some routes will be missing in BGP feeds received from our upstream providers. That is – we will have some routes in our routing tables that are advertised only by some of our peers. These can be our other peers’ internal networks (no-advertise/no-export), in which case we don’t really care. If these routes, however, represent actual parts of the internet that should be reachable, this can be a cause for concern.

For example, if we only have two upstream providers and only one of them advertises a route to a certain network and the connection to that ISP is temporarily lost, that network will be unreachable to us.

The problem

If routes that should be present are missing, there is little that can be done other than escalating the issue to the problematic ISP. In order to do so, we must first identify the missing routes. At the time of writing, the global IPv4 routing table has more than 400,000 entries, and the number is expected to grow. Therefore, we need a way to work with large collections of routes efficiently. The other problem is that due to our peers' different route summarization rules, routes may be split into different parts by different peers, so we cannot simply run diff over our routing table dumps.

We would like to have a tool that performs a binary difference operation on two sets of IP address ranges (subnets) in an efficient manner. Ideally, the tool would also be able to perform other set operations, such as union or intersection, so that we can, for instance, merge several ISPs' advertised routes prior to comparing them with a single ISP's route table.

What we want to achieve, in short: given the sets of routes received from two peers, find the address space that is covered by one but not by the other.

Analysis of the problem

While making a tool like this, we would like to avoid pitfalls such as using expensive binary trees, even though they may at first seem the most versatile and flexible way of working with these datasets. Since the IP address space is essentially a binary tree in itself, it is tempting to model our internal representation of the routing table on it.

A quick calculation tells us that even if we limit ourselves to networks with a maximum prefix length of /24, we would need $$\sum _{n=0}^{24} 2^n = 2^{25}-1 = 33554431$$ nodes, or 384 MiB at 12 bytes per node (2×32-bit pointers, 1×32-bit address). This is workable for IPv4, but completely infeasible for IPv6 routes. We would like to allow for at least a maximum prefix length of /48, yielding
$$\sum _{n=0}^{48} 2^n = 2^{49}-1 = 562949953421311$$ nodes, or 16 PiB at 32 bytes per node (2×64-bit pointers, 1×128-bit address). Clearly impossible to hold in RAM by today's standards.

Fortunately, constructing a large binary tree can be completely avoided when performing this task and can in fact be done in \(O(n)\) when the input routing tables are already sorted and usually \(O(n \log n)\) when they are not, depending on the sorting algorithm.

Algorithm

The algorithm used for the comparison is very simple and, as stated above, runs in linear time. Suppose we compare two routing tables, A and B.

  1. Iterate through all the entries in both routing tables and mark the start and the end of each subnet (the all-zeros and all-ones addresses of the subnet). Insert both of them into a single array, noting their type (startA, startB, endA, endB)
  2. Sort the above array by IP number only.
  3. Iterate through the array, keeping two counters, say \(c_A\) and \(c_B\). Each time a startX marker is encountered, increase \(c_X\), and each time an endX marker is encountered, decrease it.
  4. In the regions where \(c_A > 0 \wedge c_B = 0\), we have found an IP range that is in A but not in B.

The algorithm works well even with overlapping subnets, since the counter will simply be larger than one in that case. But even though the algorithm is simple, there are a few things to consider:

  • If we want to aggregate the missing subnets for clarity, we should perform the action (print out the subnet, …) when the counter drops back to zero.
  • If we do this, we should be careful with neighbouring subnets, since the counter may drop to zero on the boundary but rise again immediately afterwards.
  • We should be careful when dealing with single-address subnets (/32 or /128), since the start and the end point are going to be the same.

The algorithm works for other binary set operations as well. If we want to produce the union of two sets (\(A \cup B\)), we mark the regions where \(c_A > 0 \vee c_B > 0\), and for the intersection (\(A \cap B\)) we use the \(c_A > 0 \wedge c_B > 0\) condition.
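
A minimal C++ sketch of this sweep – my reconstruction, not the actual BgpCompare source; IPv4 only, with routes already parsed into (network, prefix length) pairs:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

// A subnet boundary marker. 64-bit addresses keep the exclusive end of
// 255.255.255.255/32 from overflowing.
struct boundary {
    uint64_t ip;     // address value
    int      table;  // 0 = A, 1 = B
    int      delta;  // +1 at a subnet's start, -1 one past its end
};

// Prints the aggregated address ranges present in A but missing from B.
void print_a_minus_b(const std::vector<std::pair<uint32_t, int>>& a,
                     const std::vector<std::pair<uint32_t, int>>& b)
{
    std::vector<boundary> bounds;
    auto add_routes = [&](const std::vector<std::pair<uint32_t, int>>& routes, int table) {
        for (const auto& net : routes) {
            uint64_t size = uint64_t(1) << (32 - net.second);
            bounds.push_back({ net.first, table, +1 });
            bounds.push_back({ net.first + size, table, -1 });
        }
    };
    add_routes(a, 0);
    add_routes(b, 1);

    // Sort by IP number only; at equal addresses process starts before
    // ends, so that neighbouring subnets merge instead of splitting.
    std::sort(bounds.begin(), bounds.end(),
        [](const boundary& l, const boundary& r) {
            return l.ip != r.ip ? l.ip < r.ip : l.delta > r.delta;
        });

    long c[2] = { 0, 0 };
    uint64_t range_start = 0;
    for (const auto& m : bounds) {
        bool was_in = (c[0] > 0 && c[1] == 0);
        c[m.table] += m.delta;
        bool now_in = (c[0] > 0 && c[1] == 0);
        if (!was_in && now_in)
            range_start = m.ip;
        else if (was_in && !now_in && m.ip > range_start)
            std::cout << range_start << " - " << m.ip - 1 << "\n";
    }
}

Changing the two predicate lines yields the union or the intersection instead.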

Summing-up

I implemented the above algorithm, based on this Stack Overflow answer, in a tool for analysing routing tables called BgpCompare. It reads IP subnets from two input files and outputs the result of a set operation to standard output. It uses regular expressions to capture IP addresses and prefix lengths, so it can work with routing table dumps from a wide variety of routing platforms, simply by providing a new regular expression.

BgpCompare is free and open-source software, licensed under the LGPL. It is written in C++11 and comes with a handy header-only library for IPv4/IPv6 address manipulation (parsing, textual representation, arithmetic, …).

The latest version, with full source code along with Windows and Linux binaries, is available here (1.02 MiB).

Why NAT has no place in our networks

Let's imagine a hypothetical situation. We have a problem with our internet connection, so we call our ISP's tech support. We wait a couple of minutes to get a free operator, then spend fifteen minutes describing the layout of our local network and another twenty minutes debugging a complex problem (intermittent packet loss when the link is not saturated, for example). Finally, the problem seems to be resolved and the operator concludes the call with: "Should there be any more problems, do not hesitate to contact us again."

Half an hour later, the problem reappears. We have no other option than to call the ISP's switchboard again, wait for a free operator – who will invariably be someone else – describe the problem once again and go through another debugging session, repeating the answers to the same questions and so on.

This is the frustration that web services "feel" each time you connect to them via a NAT gateway. In a situation like the one above, we would like to call the previous operator directly, but because we do not know the operator's direct phone number – if there even is one – we cannot. Even though we were given permission to do so!

The central point of this article is the old adage that NAT does not have anything to do with security. Ladies and gentlemen, I present you this:
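
In iptables terms, the rule set might read something like this (a sketch; lan0 and wan0 are placeholder interface names):

iptables -A FORWARD -i lan0 -o wan0 -j ACCEPT
iptables -A FORWARD -i wan0 -o lan0 -m conntrack --ctstate ESTABLISHED -j ACCEPT
iptables -A FORWARD -i wan0 -o lan0 -m conntrack --ctstate RELATED -j ACCEPT
iptables -A FORWARD -i wan0 -o lan0 -j DROP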

Even if you use public IP addresses and are thereby connected "directly to the internet", just these 4 rules on a stateful firewall should give your network the same degree of protection as a typical consumer-grade NAT implementation. A stateful firewall is called stateful because it maintains internal state, which it uses to decide what to do with an incoming packet. Let us break these four rules down:

  1. The first rule accepts any packet that goes from the internal network to the internet. While doing so, the firewall also remembers the connection parameters, such as TCP sequence numbers and the source and destination IP-port pairs, …
  2. The second rule lets through any packet coming from the internet directed at the internal network IF its connection parameters match one of those recorded while packets were going out – that is, only if the connection was previously established from the inside.
  3. The third rule lets in any packet that is somehow related to one of the established connections. This rule is not strictly necessary, but it helps with certain applications, such as active-mode FTP, that need connections to be established from the outside.
  4. The fourth rule drops anything else coming from the internet, such as SYN packets directed towards our internal servers, services running on our workstations, …

Of course, I will never say that these four rules guarantee network security, as they most certainly do not. Network-level security and firewalling are complex topics, and much work is needed to render a network fully secure. However, this is the configuration used in many networks, residential and corporate, as it allows clients full access to the internet while preventing any servers within from serving outside clients. It is a rather restrictive default, but exceptions can easily be made by adding more rules to the firewall.

If you deploy NAT in your network, you will most likely have the above rules configured in the firewall. This is one of the reasons why people equate deploying NAT with adding security to the network. The actual reason is practical: NAT could not work without a stateful firewall, since it has to keep track of established sessions in order to know where to forward incoming packets. NAT cannot function without a stateful firewall, whereas a stateful firewall can and will function without NAT.

Not using NAT allows us to configure much more fine-grained security at the network perimeter. Returning to the introductory anecdote: we mentioned that we were given permission to follow up on the issue. The "phone firewall" in that case could be configured to let calls from our number through to the operator's direct phone number for a specific period of time, after which the access could be blocked. Because a NAT-like mechanism is deployed, we cannot do this – the operator does not even have a direct number.

Appropriate usage of NAT

One may ask when using NAT is in fact appropriate. Apart from stateful firewalling – which is not really a characteristic of NAT at all, but more of a side product – NAT mechanisms accomplish two more important things:

  • conservation of public IP address space, and
  • hiding the details of the internal network (individual computers’ IP addresses, subnets, …)

NAT was invented primarily to address the first bullet point. Beforehand, people were already using stateful firewalls, but in the years that followed, NAT became so prevalent that many people simply could not imagine running their networks on public IPs. The sad truth is that due to IPv4 address exhaustion, public IPs everywhere are simply not an option anymore. We will have to live with NAT for as long as we have IPv4; the only things we want to avoid are double NAT and similar atrocities that complicate the network unmanageably and prevent certain applications from functioning at all.

That being said, we should be aware that IPv6 offers so much address space that the first bullet point simply does not apply. We will never have to use masquerading to hide multiple IPv6 addresses behind one single address, because there is such an abundance of them. The second bullet point still lurks, though, leaving people who are deploying IPv6 in their networks anxious, since their computers are not "hidden" anymore but rather present themselves to the whole internet with a unique address.

This is a typical case of the security through obscurity fallacy. Knowing an internal telephone number does not help me at all, because I cannot reach the person unless the operator lets me through when I call the switchboard. Apart from that, there are already mechanisms in place that provide higher anonymity, such as IPv6 Privacy Extensions.

While the classical NAT masquerading we are familiar with has no place in the IPv6 world, schemes such as NAT66, which provide a one-to-one mapping between public addresses and unique local (ULA) addresses, are marginally useful, especially to facilitate renumbering when a company changes its ISP. Still, deploying any kind of middlebox that changes IP addresses in packets brings increased complexity and breaks compatibility with applications that require direct connectivity, such as IPsec, while its positive effects are usually negligible.

The verdict: while NAT is unavoidable in the IPv4 world, we must not fall into temptation and deploy it on IPv6 networks, as the disadvantages of NAT schemes far outweigh their advantages. IPv6 allows the internet to function as it was originally conceived. As the proverb goes: when all you have is a hammer, everything starts to look like a nail. We should not apply the logic we have become used to from IPv4 networks, as it will only cause us trouble further down the line.

Generating faux words for a specific language

Generating faux words – words that look and sound like they could be part of a language – is an interesting problem. Different languages have variously complex phonetic and orthographic rules, but the factors that contribute towards identifying a word with a specific language often include:

  • Balance of consonants and vowels,
  • Well known prefixes (un-, dé-, …) and suffixes (-tion, -ment, …),
  • Capitalization and word length (think German :-) ),
  • Diacritics and special symbols (apostrophes, hyphens, …).

Some languages' rules regarding the look and feel of words are also more flexible than others'. Languages with a long history of borrowing, such as English, typically have more relaxed rules, so a made-up word is more likely to be perceived as being part of them. Of course, languages one knows well are harder to replicate than languages one is just vaguely familiar with. To a non-speaker like me, Japanese seems trivial to generate faux words for – just concatenate simple syllables of one consonant and a vowel, perhaps preferring a y in the second part of the word. For example: nagibutaya, a word I've just made up, looks very Japanese to me.

First attempt

I set out to create a computer program that would generate faux words for the Slovene language, first by upgrading the simple algorithm I used for Japanese. I tried adding more rules, such as allowing consonants that can indeed stand together without a vowel in between (a famous example in Slovene is čmrlj, a word written without any vowels, meaning bumblebee). I also accounted for a few exceptions to those rules. The program worked, but the results were poor: either the words were unpronounceable or they were not very Slovenian-sounding.

Second attempt

I chose an alternative approach for the second attempt. Instead of trying to figure out the rules of the language's word forms myself, I let the program figure them out. I made a program in C++ that analyses a lexicon and constructs new words on the basis of the information gathered from it.

I chose a very simple metric. For each pair of letters in the alphabet, the program calculates the probability of one following the other in a word, as well as the probabilities of each letter being the first or the last one in a word. I used the wordlists provided by WinEdt that can be found here.

Neighbouring letter analysis of the French lexicon. First row signifies the likelihood that a letter starts a word and the first column the likelihood that a letter ends it.

The next part, after analysing the lexicon, was generating new words. Again, a simple algorithm was used: each letter was chosen by a weighted random function based on the frequency table, such as the one visualised above. The results were quite surprising. Even though the algorithm was not sophisticated at all, words such as these (no cherry-picking involved) could pass as French to an unknowing observer:

inéesuqué, posesur, copler, amonct, teucutér, urdonsonta, bédispez, adélitéve, auisttete, fentais, ilèmera, lysses, irngeles, piécaramon

The results for English were slightly less convincing. This can be attributed to the fact that English allows for more different two-letter combinations than French (the visualisation would have many white squares), but they do not always work when combined into a longer word:

voy, cleme, plaonas, abiok, macleses, usodens, poshad, medistets, amubodg, oshongs, detccichre, ngreang, moodery

Third attempt

The last attempt was not a new approach to the problem but merely an upgrade of the algorithm from the second attempt. As we saw with English, a simple frequency map of two-letter combinations is not always enough. A straightforward upgrade is to account for multiple-letter combinations. The metric used in this attempt was the probability of a single letter following a string of n letters in a word. I experimented with various values of n. The results for English were amazing (n = 5):

centringall, unliner, goutflyings, mackinets, handbaggers, thirstlings, spungency, clatternum, ophiolitise, goutineer, brickmaw, prophetised, firmant, acronyism

The caveat of the bigger-n approach is that many of the output words are actual words, or could be by the language's standard word-formation rules, such as compounding. When trying this approach with highly inflected languages, such as Slovene or Latin, there is a chance that very few genuinely new words will be formed, as the program would just permute suffixes.
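
To make the scheme concrete, here is a compact sketch of the order-n analysis and generation – my reconstruction, not the original program – assuming a plain word-per-line lexicon in a hypothetical wordlist.txt (the words must not contain the padding characters '^' and '$'):

#include <fstream>
#include <iostream>
#include <map>
#include <random>
#include <string>

int main()
{
    const int n = 3; // context length
    // For every n-letter context, count which letter follows it;
    // '$' marks the end of a word.
    std::map<std::string, std::map<char, int>> freq;

    std::ifstream lexicon("wordlist.txt");
    std::string word;
    while (lexicon >> word)
    {
        std::string padded = std::string(n, '^') + word + "$";
        for (size_t i = 0; i + n < padded.size(); ++i)
            ++freq[padded.substr(i, n)][padded[i + n]];
    }

    // Generate one word by walking the table with weighted random choices.
    std::mt19937 rng(std::random_device{}());
    std::string context(n, '^'), result;
    for (;;)
    {
        auto& choices = freq[context];
        if (choices.empty()) break;
        int total = 0;
        for (auto& c : choices) total += c.second;
        std::uniform_int_distribution<int> dist(1, total);
        int pick = dist(rng);
        char next = '$';
        for (auto& c : choices)
            if ((pick -= c.second) <= 0) { next = c.first; break; }
        if (next == '$') break;
        result += next;
        context = context.substr(1) + next;
    }
    std::cout << result << "\n";
}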

Conclusion

This mental exercise is a prime example of how statistics on large datasets can greatly simplify complex problems. Even though the program possessed no actual rules of the language, it was able to generate words reasonably faithful to the language’s feel. The same principle is used in machine translation where statistical algorithms significantly outperform those that do extensive and complex parsing, both in terms of fidelity and practicality.

You can download the source code (C++) and the Win32 executable of the project here.

The humble beginnings

After quite a long time of hesitation and indecisiveness, I have finally managed to get it together and start a blog. Despite having been active in the IT field for quite a while, I had never set up a place where I could express myself unconstrainedly, i.e. a personal webpage, mostly because I never really had anything to say to the masses. (Well, I wrote a Windows Longhorn beta-testing blog for a while, but let's just agree to never mention it again, shall we? ☺)

Well, things have changed. I have become more proactive in my fields of interest, and during my technological meanderings I have stumbled upon some interesting problems and figured out some interesting solutions, and it would be a waste to keep it all to myself and a small circle of colleagues. I think that becoming active in the Stack Overflow community helped towards this decision the most, because I have developed a habit of regular written expression, of whatever sort it may be. Hence, during my summer internship at Arnes, the idea of this blog was born, et voilà, here it is. To the more fundamentalist among you: yes, it is powered by WordPress, and no, I am not ashamed. I find it quite exquisite thus far.

The content on this blog will probably revolve around the topics that interest me most. Generally, you should expect some networking (routing software/hardware, IPv6 deployment and developments, …), some programming (C++11, algorithms), and almost certainly some mathematics and linguistics. Occasionally I will (probably) drop in some non-technology, Real Life™ stuff. Maybe.

I guess a promise that I will be posting regularly would be in order here, but I consider such promises a bad omen. I hope it works out ☺.

Cheers, Tibor