If we are connected to the internet via one single ISP, it is most likely that we will have a default route set up to one of their access routers. However, if we want to multi-home, either for reliability or load-balancing reasons, the most straightforward way is to set up BGP peering with our upstream providers. For example, if we connect to three upstream ISPs – A, B and C, there is no point in sending packets destined to one of B’s (or its clients’) IP addresses, to A or C. In order to make informed decisions on where to send packets, we will usually need to receive a full BGP feed so as to see the “whole internet” in terms of routing tables.

In an ideal world, we would see all the IP prefixes that are advertised on the internet via all of our peers, leaving us to choose the best path based on AS path length and local preference. This is not always the case, though. Some routes will be missing from the BGP feeds received from our upstream providers; that is, we will have some routes in our routing tables that are advertised by only some of our peers. These can be our other peers’ internal networks (no-advertise/no-export), in which case we don’t really care. If these routes, however, represent actual parts of the internet that should be reachable, this can be a cause for concern.

For example, if we only have two upstream providers and only one of them advertises a route to a certain network and the connection to that ISP is temporarily lost, that network will be unreachable to us.

## The problem

If routes are missing that should not be, there is little that can be done other than escalating the issue to the problematic ISP. In order to do so, we must first identify the routes that are missing. At the time of writing, the global IPv4 routing table has more than 400,000 entries and the number is expected to grow in the future. Therefore, we need a way to work with large collections of routes efficiently. The other problem is that, because our peers apply different route summarization rules, the same address space may be split into different prefixes by different peers, so we cannot simply run a diff over our routing tables.

We would like to have a tool that performs a binary difference operation on two sets of IP address ranges (subnets) in an efficient manner. Ideally, the tool would also be able to perform other set operations, such as union or intersection, so that we can, for example, merge several ISPs’ advertised routes before comparing them with a single ISP’s route table.

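What we want to achieve is perhaps best shown by an example: if table A contains 10.0.0.0/8 and table B contains only 10.0.0.0/9, then the difference of A and B is 10.128.0.0/9, no matter how either peer has split or summarized its routes.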

## Analysis of the problem

While making a tool like this, we would like to avoid pitfalls such as using expensive binary trees, even though these may at first seem to be the most versatile option and to offer the most flexibility when working with datasets. Since the IP address space is essentially a binary tree in itself, it is tempting to model our internal representation of the route table after it.

A quick calculation tells us that even if we limit ourselves to networks with a maximum prefix length of /24, we would need $$\sum _{n=0}^{24} 2^n = 2^{25}-1 = 33554431$$ nodes or 384 MiB with 12 bytes per node (2×32-bit pointers, 1×32-bit address). This is workable with IPv4, but completely infeasible when dealing with IPv6 routes. We would like to allow for at least the networks with a maximum prefix length of /48, yielding:
$$\sum _{n=0}^{48} 2^n = 2^{49}-1 = 562949953421311$$ nodes or 16 PiB of data with 32 bytes per node (2×64-bit pointers, 1×128-bit address). This is clearly impossible to hold in RAM by today’s standards.
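The arithmetic above can be double-checked with a few lines of Python. This is only a sanity check: the node sizes (12 and 32 bytes) are the ones assumed in the text, and the helper name `full_tree_cost` is made up for illustration.

```python
def full_tree_cost(max_prefix_len, bytes_per_node):
    """Return (node count, total bytes) of a complete binary tree
    with one node per prefix of length 0..max_prefix_len."""
    nodes = sum(2 ** n for n in range(max_prefix_len + 1))  # equals 2**(max_prefix_len + 1) - 1
    return nodes, nodes * bytes_per_node

ipv4_nodes, ipv4_bytes = full_tree_cost(24, 12)
ipv6_nodes, ipv6_bytes = full_tree_cost(48, 32)

print(ipv4_nodes, round(ipv4_bytes / 2 ** 20), "MiB")  # 33554431 384 MiB
print(ipv6_nodes, round(ipv6_bytes / 2 ** 50), "PiB")  # 562949953421311 16 PiB
```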

Fortunately, constructing a large binary tree can be avoided completely. The comparison can in fact be done in $$O(n)$$ when the input routing tables are already sorted and usually in $$O(n \log n)$$ when they are not, depending on the sorting algorithm.

## Algorithm

The algorithm used for comparison is very simple and, as stated above, runs in linear time once the input is sorted. Suppose we compare two routing tables, A and B.

1. Iterate through all the entries in both routing tables and mark the start and the end of each subnet (the all-zeros and all-ones addresses of the subnet). Insert both markers into a single array, noting their type (startA, startB, endA, endB).
2. Sort the above array, ordered only by IP number.
3. Iterate through the array and keep two counters, say $$c_A$$ and $$c_B$$. Each time a startX marker is encountered, increase $$c_X$$; each time an endX marker is encountered, decrease it.
4. If we are in a region where $$c_A > 0 \wedge c_B = 0$$, we have found an IP range that is in A but not in B.
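The four steps can be sketched in Python roughly as follows. This is a minimal illustration, not the code of the actual tool: the function name `difference` is made up, both tables are assumed to be lists of CIDR strings of the same address family, and the end marker is stored as "last address plus one", so that a marker always sits on the first address where a counter changes.

```python
import ipaddress

def difference(table_a, table_b):
    """Return the inclusive integer ranges covered by table_a but not by table_b."""
    markers = []
    for name, table in (("A", table_a), ("B", table_b)):
        for cidr in table:
            net = ipaddress.ip_network(cidr)
            first = int(net.network_address)
            last = int(net.broadcast_address)
            markers.append((first, +1, name))     # startX
            markers.append((last + 1, -1, name))  # endX, one past the last address
    markers.sort(key=lambda m: m[0])              # order by IP number only

    counters = {"A": 0, "B": 0}
    result, open_at = [], None
    i = 0
    while i < len(markers):
        position = markers[i][0]
        # Apply every marker sitting on this address before looking at the counters.
        while i < len(markers) and markers[i][0] == position:
            _, delta, name = markers[i]
            counters[name] += delta
            i += 1
        inside = counters["A"] > 0 and counters["B"] == 0
        if inside and open_at is None:
            open_at = position                       # a region in A but not in B begins
        elif not inside and open_at is not None:
            result.append((open_at, position - 1))   # the region ended just before here
            open_at = None
    return result

for first, last in difference(["10.0.0.0/8"], ["10.0.0.0/9"]):
    print(ipaddress.ip_address(first), "-", ipaddress.ip_address(last))
```

Processing all markers that share an address before testing the counters is a design choice of this sketch; it keeps touching subnets from producing spurious zero-length gaps.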

The algorithm works well even with overlapping subnets, since the counter is simply going to be larger than one in that case. But even though the algorithm is simple, there are a few things to consider:

• If we want to aggregate the missing subnets for clarity, we should perform the action (print out the subnet, …) when the counter drops back to zero.
• If we do this, we should be careful with neighbouring subnets, since the counter may drop to zero on the boundary but rise again immediately afterwards.
• We should be careful when dealing with single-address subnets (/32 or /128), since the start and the end point are going to be the same.
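To make the aggregation caveats concrete, here is a small sketch that merges touching ranges and prints them as CIDR blocks. The helper names `merge_touching` and `to_cidr` are hypothetical, ranges are assumed to be inclusive `(first, last)` integer pairs, and IPv4 is assumed for brevity; `ipaddress.summarize_address_range` is the standard-library call doing the CIDR conversion.

```python
import ipaddress

def merge_touching(ranges):
    """Merge inclusive (first, last) integer ranges that overlap or are directly adjacent."""
    merged = []
    for first, last in sorted(ranges):
        if merged and first <= merged[-1][1] + 1:
            # Overlapping or neighbouring: extend the previous range instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged

def to_cidr(first, last):
    """Express an inclusive IPv4 range as a list of CIDR blocks."""
    blocks = ipaddress.summarize_address_range(
        ipaddress.IPv4Address(first), ipaddress.IPv4Address(last))
    return [str(block) for block in blocks]

base = int(ipaddress.IPv4Address("10.0.0.0"))
halves = [(base, base + 127), (base + 128, base + 255)]  # two neighbouring /25s
for first, last in merge_touching(halves):
    print(to_cidr(first, last))
```

A single-address range, where `first == last`, comes out as a /32 without any special handling.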

The algorithm works for other binary set operations as well. If we want to produce a union of two sets ($$A \cup B$$), we should mark the regions where $$c_A > 0 \vee c_B > 0$$, and for intersections ($$A \cap B$$), we use the $$c_A > 0 \wedge c_B > 0$$ comparison kernel.
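One way to express this is to pass the comparison kernel in as a function. The sketch below works on plain inclusive integer ranges to stay short; `sweep` and `KERNELS` are names invented for this example, not part of any existing tool.

```python
def sweep(a, b, kernel):
    """Combine two lists of inclusive (first, last) integer ranges.

    kernel(c_a, c_b) decides whether an address covered c_a times by A
    and c_b times by B belongs to the result.
    """
    events = [(first, 0, +1) for first, last in a]
    events += [(last + 1, 0, -1) for first, last in a]
    events += [(first, 1, +1) for first, last in b]
    events += [(last + 1, 1, -1) for first, last in b]
    events.sort()

    counts = [0, 0]
    result, open_at = [], None
    i = 0
    while i < len(events):
        position = events[i][0]
        # Apply all events on this position, then evaluate the kernel once.
        while i < len(events) and events[i][0] == position:
            counts[events[i][1]] += events[i][2]
            i += 1
        inside = kernel(counts[0], counts[1])
        if inside and open_at is None:
            open_at = position
        elif not inside and open_at is not None:
            result.append((open_at, position - 1))
            open_at = None
    return result

KERNELS = {
    "difference":   lambda c_a, c_b: c_a > 0 and c_b == 0,
    "union":        lambda c_a, c_b: c_a > 0 or c_b > 0,
    "intersection": lambda c_a, c_b: c_a > 0 and c_b > 0,
}

for name, kernel in KERNELS.items():
    print(name, sweep([(0, 9)], [(5, 14)], kernel))
```

Only the kernel changes between operations; the marker handling and the sweep stay the same.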

## Summing-up

I implemented the above algorithm, based on this Stack Overflow answer, in a tool that can be used to analyse routing tables, called BgpCompare. It reads IP subnets from two input files and then outputs the result of a set operation to standard output. It uses regular expressions to capture IP addresses and prefix lengths, so it can work with a wide variety of different routing platforms’ routing table dumps, just by providing a new regular expression.

BgpCompare is free and open-source software, licensed under the LGPL. It is written in C++11 and comes with a handy header-only library for IPv4/IPv6 address manipulation (parsing, textual representation, arithmetic, …).

The latest version, with full source code along with Windows and Linux binaries, is available here (1.02 MiB).

Let’s imagine a hypothetical situation. We have a problem with our internet connection, so we call our ISP’s tech support. We wait for a couple of minutes to get a free operator, then spend fifteen minutes describing the layout of our local network and another twenty minutes debugging a complex problem (intermittent packet loss when the link is not saturated, for example). Finally, the problem seems to be resolved and the operator concludes the call with: “Should there be any more problems, do not hesitate to contact us again.”

Half an hour later, the problem reappears. We have no option other than to call the ISP’s switchboard again, wait for a free operator, who is invariably going to be someone else, describe the problem once more and go through another debugging session, repeating the answers to the same questions and so on.

This is the frustration that web services “feel” each time you connect to them via a NAT gateway. In a situation like the one above, we would like to be able to call the previous operator directly, but because we do not know the operator’s direct phone number, if there is one, we cannot. Even though we were given permission to do so!

The central point of this article is the old adage that NAT does not have anything to do with security. Ladies and gentlemen, I present to you this:

Even if you use public IP addresses and are thereby connected “directly to the internet”, just these four rules on a stateful firewall should give you the same degree of protection for the network as a typical consumer-grade NAT implementation. A stateful firewall is called stateful because it maintains some internal state that is used to guide the decisions on what to do with an incoming packet. Let’s break these four rules down:

1. The first rule will accept (let through) any packet that goes from the internal network to the internet. While doing so, it will also remember the connection parameters, such as TCP sequence numbers, source and destination IP-port pairs, …
2. The second rule will let through any packet coming from the internet directed to the internal network IF the connection parameters match one of those that were recorded when packets were going out; that is, only if the connection had previously been established from the inside.
3. The third rule will let in any packet that is somehow related to one of the established connections. This rule is not strictly necessary, but it helps with certain applications, such as active-mode FTP, that need connections to be established from the outside.
4. The fourth rule will drop anything else coming from the internet, such as SYN packets directed towards our internal servers, services running on our workstations, …
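The logic of the four rules can be mimicked with a toy model in Python. This is purely illustrative: the class and method names are invented, packets are reduced to address and port tuples, and everything a real firewall tracks beyond that (TCP state, sequence numbers, timeouts) is left out.

```python
class StatefulFirewall:
    """Toy model of the four rules; packets are plain tuples, not real traffic."""

    def __init__(self):
        self.established = set()  # connections opened from the inside
        self.related = set()      # inbound connections announced by an established one

    def outbound(self, src, sport, dst, dport):
        # Rule 1: accept anything leaving the network and remember it.
        self.established.add((src, sport, dst, dport))
        return "ACCEPT"

    def expect_related(self, src, sport, dst, dport):
        # Support for rule 3: a protocol helper (for active-mode FTP, say)
        # would register the inbound connection it expects to see.
        self.related.add((src, sport, dst, dport))

    def inbound(self, src, sport, dst, dport):
        # Rule 2: replies to a connection we opened (addresses and ports mirrored).
        if (dst, dport, src, sport) in self.established:
            return "ACCEPT"
        # Rule 3: connections related to an established one.
        if (src, sport, dst, dport) in self.related:
            return "ACCEPT"
        # Rule 4: everything else coming from the internet is dropped.
        return "DROP"

fw = StatefulFirewall()
fw.outbound("192.0.2.10", 40000, "198.51.100.5", 443)
print(fw.inbound("198.51.100.5", 443, "192.0.2.10", 40000))  # reply to our own connection
print(fw.inbound("203.0.113.9", 12345, "192.0.2.10", 22))    # unsolicited connection attempt
```

Note that no address translation appears anywhere in this model; the protection comes from the state table alone.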

Of course, I will never say that these four rules guarantee network security, as they most certainly do not. Network-level security and firewalling are complex topics and much work is needed to render a network fully secure. However, this is the configuration used in many networks, residential and corporate, as it allows clients full access to the internet while preventing any servers inside from serving outside clients. It is quite a restrictive default, but exceptions can easily be made by adding more rules to the firewall.

If you deploy NAT in your network, you will most likely have the above rules configured in the firewall. This is one of the reasons why people consider deploying NAT as adding security to the network. The actual reason the rules are there is a practical one. If NAT were not combined with a stateful firewall, it could not work, since it has to keep track of the established sessions in order to know where to forward incoming packets. NAT cannot function without a stateful firewall, whereas a stateful firewall can and will function without NAT functionality.
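The dependency of NAT on session state can be illustrated with a toy translation table. Again, this is a simplified sketch under stated assumptions: the class `MasqueradingNat`, its methods and the starting port number are invented for the example, and only source ports are tracked.

```python
class MasqueradingNat:
    """Toy port-translating NAT: many internal hosts share one public address."""

    def __init__(self, public_ip, first_port=50000):
        self.public_ip = public_ip
        self.next_port = first_port
        self.by_internal = {}  # (internal ip, internal port) -> public port
        self.by_public = {}    # public port -> (internal ip, internal port)

    def translate_out(self, src_ip, src_port):
        """Rewrite the source of an outgoing packet, creating state if needed."""
        key = (src_ip, src_port)
        if key not in self.by_internal:
            self.by_internal[key] = self.next_port
            self.by_public[self.next_port] = key
            self.next_port += 1
        return self.public_ip, self.by_internal[key]

    def translate_in(self, dst_port):
        """Find the internal destination of an incoming packet, or None if unknown."""
        return self.by_public.get(dst_port)

nat = MasqueradingNat("203.0.113.1")
print(nat.translate_out("192.168.1.10", 40000))
print(nat.translate_in(50000))  # state exists, so the reply can be forwarded
print(nat.translate_in(50001))  # no state: the gateway has nowhere to send this
```

An incoming packet without a matching entry cannot be delivered at all, which is why the NAT gateway ends up behaving like the stateful firewall described above.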

Not using NAT allows us to configure much more fine-grained security at the network perimeter. If we return to the introductory anecdote, we mentioned that we were given permission to follow up on the issue. The “phone firewall” in that case could be configured to let in calls from our number to the operator’s direct phone number for a specific period of time; afterwards, the access could be blocked. Because a NAT-like mechanism is deployed, we cannot do this: the operator does not even have a direct number.

## Appropriate usage of NAT

One may ask themselves when using NAT is in fact appropriate. Apart from stateful firewalling, which is not really a characteristic of NAT at all but more of a side product, NAT mechanisms accomplish two more important things:

• conservation of public IP address space, and
• hiding the details of the internal network (individual computers’ IP addresses, subnets, …)

NAT was invented primarily to address the first bullet point. People were already using stateful firewalls beforehand, but in the years that followed, NAT became so prevalent that many people simply could not imagine running their networks on public IPs. The sad truth is that, due to IPv4 address exhaustion, public IPs everywhere are simply not an option anymore. We will have to live with NAT for as long as we have IPv4; the only things we want to avoid are double NAT and similar atrocities that complicate the network unmanageably and prevent certain applications from functioning at all.

That being said, we should be aware that IPv6 offers so much address space that the first bullet point is simply not applicable. We will never have to use masquerading to hide multiple IPv6 addresses behind one single address, because there is such an abundance of them. The second bullet point still lurks though, leaving people who are deploying IPv6 in their networks anxious, since their computers are not “hidden” anymore but rather present themselves with a unique address to the whole internet.

This is a typical case of the security through obscurity fallacy. Knowing the internal telephone number does not help me at all, because I cannot reach the person unless the operator puts me through when I call the switchboard. Apart from that, there are already mechanisms in place that help provide a higher degree of anonymity, such as IPv6 Privacy Extensions.

While the classical NAT masquerading that we are familiar with has no place in the IPv6 world, schemes such as NAT66, which provide a one-to-one mapping between public addresses and unique local (ULA) addresses, are marginally useful, especially to facilitate renumbering when a company changes its ISP. Still, deploying any kind of middlebox that changes IP addresses in packets brings about increased complexity and breaks compatibility with certain applications that require direct connectivity, such as IPsec, while its positive effects are usually negligible.
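To show what such a one-to-one mapping amounts to, here is a simplified prefix swap in Python. It is a sketch only: the function name `swap_prefix` and the example prefixes are made up, and the NPTv6 specification (RFC 6296) additionally makes the translation checksum-neutral, which this example does not attempt.

```python
import ipaddress

def swap_prefix(address, from_prefix, to_prefix):
    """Replace the network prefix of an IPv6 address, keeping the remaining bits."""
    addr = ipaddress.IPv6Address(address)
    src = ipaddress.IPv6Network(from_prefix)
    dst = ipaddress.IPv6Network(to_prefix)
    if src.prefixlen != dst.prefixlen or addr not in src:
        raise ValueError("prefix lengths must match and the address must lie in from_prefix")
    host_bits = int(addr) & int(src.hostmask)  # everything to the right of the prefix
    return str(ipaddress.IPv6Address(int(dst.network_address) | host_bits))

# Outbound: ULA source address rewritten to the (documentation) public prefix.
print(swap_prefix("fd12:3456:789a:1::25", "fd12:3456:789a::/48", "2001:db8:1::/48"))
```

Because the mapping is stateless and reversible, changing ISP only means changing the public prefix in one place, which is the renumbering benefit mentioned above.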

The verdict would be that while NAT is unavoidable in the IPv4 world, we must not fall into temptation and deploy it on IPv6 networks, as the disadvantages of NAT schemes far outweigh their advantages. IPv6 allows the internet to function as it was originally conceived. As the proverb goes: When all you have is a hammer, everything starts to look like a nail. We should not apply the same logic that we have become used to when dealing with IPv4 networks, as it will only cause us trouble further down the line.

Generating faux words (words that look and sound like they could be a part of the language) is an interesting problem. Different languages have phonetic and orthographic rules of varying complexity, but the factors that contribute towards identifying a word with a specific language often include:

• Balance of consonants and vowels,
• Well known prefixes (un-, dé-, …) and suffixes (-tion, -ment, …),
• Capitalization and word length (think German),
• Diacritics and special symbols (apostrophes, hyphens, …).

Some languages’ rules regarding the look and feel of words are also more flexible than others’. Languages that have a long history of borrowing, such as English, typically have more relaxed rules, so a made-up word is more likely to be perceived as belonging to them. Of course, languages that the person in question knows well are harder to imitate convincingly than languages one is only vaguely familiar with. To me, as a non-speaker, generating faux words for Japanese seems trivial: just concatenate simple syllables of one consonant and one vowel, and perhaps prefer a y in the second part of the word. For example, nagibutaya, a word I’ve just made up, looks very Japanese to me.
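The naive recipe for Japanese-looking words fits in a few lines of Python. The consonant list, the 50% chance of a y syllable at the end and the restriction of y to ya/yu/yo are my own assumptions for this sketch, not rules taken from the text.

```python
import random

CONSONANTS = "kstnhmrgbd"
VOWELS = "aiueo"

def faux_japanese(syllables=5, rng=random):
    """Concatenate consonant+vowel syllables; sometimes end on a 'y' syllable."""
    parts = [rng.choice(CONSONANTS) + rng.choice(VOWELS) for _ in range(syllables - 1)]
    # Prefer a 'y' towards the end of the word (assumed probability: one half).
    last_consonant = "y" if rng.random() < 0.5 else rng.choice(CONSONANTS)
    last_vowel = rng.choice("auo") if last_consonant == "y" else rng.choice(VOWELS)
    parts.append(last_consonant + last_vowel)
    return "".join(parts)

print([faux_japanese() for _ in range(5)])
```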

## First attempt

I set out to create a computer program that would generate faux words for the Slovene language, first by upgrading the simple algorithm I used for Japanese. I tried adding more rules, such as allowing consonants that can indeed stand together without a vowel in between (a famous example in Slovene is čmrlj, a word written without any vowels, meaning bumblebee). I also accounted for a few exceptions to those rules. The program was working, but the results were poor. Either the words were unpronounceable or they were not very Slovenian-sounding.

## Second attempt

I chose an alternative approach for the second attempt. Instead of trying to figure out the rules of a language’s word formation myself, I let the program figure them out. I made a program in C++ that analyses a lexicon and constructs new words on the basis of the information it gathers from it.

I chose a very simple metric. For each combination of two letters in the alphabet, the program calculates the probability of one following another in a word as well as probabilities of single letters to be the first or the last ones in a word. I used wordlists provided by WinEdt that can be found here.
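A minimal sketch of such a frequency table in Python follows (the original program is in C++; this is not its code). The markers `^` and `$` for the start and end of a word, and the function name `bigram_table`, are conventions chosen for this example.

```python
from collections import Counter, defaultdict

START, END = "^", "$"

def bigram_table(words):
    """Map each letter (or START) to the probabilities of what follows it."""
    counts = defaultdict(Counter)
    for word in words:
        letters = [START] + list(word.lower()) + [END]
        for current, following in zip(letters, letters[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in counts.items()
    }

table = bigram_table(["ab", "ac"])
print(table["a"])  # both 'b' and 'c' follow 'a' half of the time
```

Treating the word boundaries as two extra symbols lets one table hold the first-letter, last-letter and letter-pair probabilities at once.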

The next part, after having analysed the lexicon, was to generate new words. For this, again, a simple algorithm was used. Each letter was chosen with a weighted random function, based on the frequency table, such as the one visualised above. The results were quite surprising. Even though the algorithm was not sophisticated at all, words such as these (no cherry-picking was involved) could pass as French to an unknowing observer:

inéesuqué, posesur, copler, amonct, teucutér, urdonsonta, bédispez, adélitéve, auisttete, fentais, ilèmera, lysses, irngeles, piécaramon

The results for English were slightly less convincing. This can be attributed to the fact that English allows for more different two-letter combinations than French (the visualisation would have many white squares), but they do not always work when combined into a longer word:

voy, cleme, plaonas, abiok, macleses, usodens, poshad, medistets, amubodg, oshongs, detccichre, ngreang, moodery
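The weighted random generation step can be sketched as follows. The small `TABLE` is hand-written purely for illustration and stands in for a table learned from a real lexicon; `^` and `$` are assumed start and end markers, and `max_len` is a safety limit I added.

```python
import random

START, END = "^", "$"

# A tiny hand-written table standing in for one learned from a real lexicon.
TABLE = {
    "^": {"b": 0.5, "l": 0.5},
    "b": {"a": 0.7, "e": 0.3},
    "l": {"a": 0.5, "e": 0.5},
    "a": {"l": 0.4, "b": 0.2, "$": 0.4},
    "e": {"l": 0.3, "b": 0.3, "$": 0.4},
}

def generate_word(table, rng=random, max_len=12):
    """Walk the table from START, picking each next letter by weighted chance."""
    word, current = "", START
    while len(word) < max_len:
        followers = table[current]
        current = rng.choices(list(followers), weights=list(followers.values()))[0]
        if current == END:
            break
        word += current
    return word

print([generate_word(TABLE) for _ in range(5)])
```

Each letter depends only on the one before it, which explains why longer words can drift into combinations that no longer look like the language.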

## Third attempt

The last attempt was not a new approach to the problem but merely an upgrade to the algorithm used in the second attempt. As we have seen in the case of English, a simple frequency map of two-letter combinations is not always enough. A straightforward upgrade would be to account for multiple-letter combinations. The metric used in this attempt was the probability of a single letter following a string of n letters in a word. I experimented with various values for n. The results for English were amazing (n=5):

centringall, unliner, goutflyings, mackinets, handbaggers, thirstlings, spungency, clatternum, ophiolitise, goutineer, brickmaw, prophetised, firmant, acronyism

The caveat of the bigger-n approach is that many of the generated words are actual words, or could be by the language’s standard word formation rules, such as compounding. When trying this approach with highly inflected languages, such as Slovene or Latin, there is a chance that very few genuinely new words will be formed, as the program would just permute suffixes.
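The n-letter variant can be sketched in Python like this. It is an illustration under my own assumptions rather than the original C++ program: `^` and `$` mark word boundaries, contexts shorter than n are used at the start of a word, n must be at least 1, and the three-word lexicon is only there to make the example runnable.

```python
import random
from collections import Counter, defaultdict

def build_model(words, n):
    """Count which letter follows each run of up to n preceding letters."""
    model = defaultdict(Counter)
    for word in words:
        padded = "^" + word + "$"
        for i in range(1, len(padded)):
            context = padded[max(0, i - n):i]
            model[context][padded[i]] += 1
    return model

def generate_word(model, n, rng=random, max_len=20):
    """Extend the word one letter at a time, conditioning on the last n letters."""
    word = "^"
    while len(word) <= max_len:
        followers = model[word[-n:]]
        letter = rng.choices(list(followers), weights=list(followers.values()))[0]
        if letter == "$":
            break
        word += letter
    return word[1:]

lexicon = ["banana", "bandana", "cabana"]
model = build_model(lexicon, 2)
print([generate_word(model, 2) for _ in range(5)])
```

With a large n and a small lexicon the model can only replay its input, which is the caveat described above in miniature.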

## Conclusion

This mental exercise is a prime example of how statistics on large datasets can greatly simplify complex problems. Even though the program possessed no actual rules of the language, it was able to generate words reasonably faithful to the language’s feel. The same principle is used in machine translation where statistical algorithms significantly outperform those that do extensive and complex parsing, both in terms of fidelity and practicality.

You can download the source code (C++) and the Win32 executable of the project here.

After quite a long time of hesitation and indecisiveness, I have finally managed to get it together and start a blog. Despite having been active in the IT field for quite a while, I have never set up a place where I could express myself without constraints, i.e. a personal webpage, mostly because I have never really had anything to say to the masses. (Well, I did write a Windows Longhorn beta testing blog for a while, but let’s just agree to never mention it again, shall we? ☺)

Well, things have changed. I have become more proactive in my fields of interest and during my technological meanderings, I have stumbled upon some interesting problems, figured out some interesting solutions and it would be a waste if I kept it all to myself and a small circle of my colleagues. I think that becoming active in the StackOverflow community helped towards this decision the most, because I have developed a habit of regular written expression, of whatever sort it may be. Hence, during my summer internship at Arnes, the idea of this blog was born et voilà, here it is. To the more fundamentalist of you: Yes, it is powered by WordPress and no, I am not ashamed. I find it quite exquisite thus far.

The content on this blog will probably revolve around topics that interest me most. Generally, you should expect some networking (routing software/hardware, IPv6 deployment and developments, …), some programming (C++11, algorithms), and almost certainly some mathematics and linguistics. Occasionally I will (probably) drop in some non-technology, Real Life™ stuff. Maybe.

I guess that a promise to post regularly would be in order here, but I consider that a bad omen. I hope it works out ☺.

Cheers, Tibor