Monday, April 2, 2012

GCC lossy conversion warnings for C and C++

Have you ever been bitten by the bug where you, for example, return a long long from a function, but set the return type of the function to int (by mistake)? I sure have, and off the top of my head, I can remember this bug costing me a failed solution at least twice at TopCoder. Just to make this more concrete, here's a simple example function that has this problem.

int foo() {
  long long x = 1LL<<42;
  return x;
}

This kind of conversion is actually perfectly legal (for example, by section 4.7 of the C++11 draft n3337), and the result is implementation-defined if the target type (here int) is signed and can't represent the value of the source type (here long long). GCC defines the result to be the value modulo 2^n, where n is the number of bits in the target type; in other words, the lower n bits are kept and the rest is discarded, which makes sense in two's complement (which is what GCC and the large majority of "conventional" systems use). So if int is 32-bit and long long is 64-bit, you get the lower 32 bits of the original value. In this specific case, the result would therefore be zero.
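
For a quick sanity check of the "keep the low bits" behavior, here's a tiny program of my own (it assumes a 32-bit int, a 64-bit long long and two's complement, which is what typical GCC targets use):

#include <cstdio>

int main() {
    long long big = (1LL << 42) + 5;        // the low 32 bits are just 5
    int truncated = big;                    // implementation-defined narrowing conversion
    std::printf("%d\n", truncated);         // prints 5: only the low 32 bits survive
    std::printf("%d\n", (int)(1LL << 42));  // prints 0, as claimed for foo() above
    return 0;
}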

Now, you'd really like to be warned about this kind of thing (at least I would :), but you get it with neither -Wextra (a.k.a. -W) nor -Wall. It turns out you have to explicitly ask for it using -Wconversion. The argument they make about code that intentionally does these kinds of conversions (without casts) is definitely valid, but I will be adding this flag to my default set for sure, at least for TopCoder :).
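
For illustration, compiling the snippet above with the flag enabled (something along the lines of g++ -Wconversion -c foo.cpp) makes GCC point out that the returned value may change; the exact wording of the warning varies between versions. If the conversion really is intentional, an explicit cast both documents that and keeps -Wconversion quiet:

int foo() {
  long long x = 1LL<<42;
  return static_cast<int>(x);  // the truncation is now explicit and intentional
}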

Thursday, March 22, 2012

Fun with bugs

I encountered a fairly interesting bug a few days ago. I will not post the minimal code snippet that reproduces the bug, but rather something similar to the actual code that I had the bug in so that you might try to find the bug for fun (preferably just by looking at the code :). Note however, that the bug I'm going to write about is not really a logic error, but more of a "language issue" (I don't want to be too specific yet because a hint will quickly ruin the fun :).

First, a bit of introduction about what the code does, or is supposed to do. I was implementing an interval tree that would efficiently find the maximal value in an interval of keys. When creating an empty tree (with no values yet inserted), you'd want to initialize all the values to some very small value that I'll call -INF. Basically, the intent of initialization is to set up the M vector that represents the intervals that the tree will be storing directly. So here is the initialization code.

#include <vector>
#include <algorithm>

const int INF = 1000000000;  // "-INF" below is just a very small sentinel value

class IntervalTree {
  public:
  IntervalTree(int maxval): n(maxval), M(2*maxval+1) {
    init(0, 0, n);
  }

  private:
  int n;
  std::vector<int> M;

  int init(int node, int l, int r) {
    while (node >= (int)M.size()) {
      M.push_back(0);
    }
    if (l == r) {
      return (M[node] = -INF);
    } else {
      int mid = l + (r-l)/2;
      return (M[node] = std::max(init(2*node + 1, l, mid),
                                 init(2*node + 2, mid+1, r)));
    }
  }
};

The constructor is pretty minimal. It stores the rightmost interval boundary (the leftmost one is implicitly zero) and tries to initialize the vector that holds the tree nodes. Now, unfortunately, the maximum index of a node in the tree might be bigger than 2n if the tree is not complete, i.e. the number of keys is not a power of two. An example of this case is n=5 when the maximum index of a node is 12 (the node representing the interval [4,4]).

One way to fix this problem would be to use the first larger power of two, but that can potentially waste space. So what I did here was add additional nodes as necessary during the recursive init phase. The remaining code is fairly straightforward: if the interval contains only one key then initialize that leaf node (here you might use A[l] if you were initializing the tree with an initial array of values); otherwise, recursively initialize the left and right halves of the interval and assign this node the maximum of those. Again, in this case it would be OK to just assign M[node] to be -INF directly, but if we were using an initial value array, this is pretty much how you do it.

So, where's the bug? :) This code actually has undefined behavior and is fairly likely to fail spectacularly (with a segfault). If you don't want to try to find the bug, then read on for the answer. Before that, here's a hint - the bug is related to std::vector. Can you spot it now? :)

Here's the explanation. The bug is actually in the line that recursively calls init and assigns the max of the two calls to the tree node. As you might have guessed by now, there are two issues in interplay here. First of all, when you push_back to a std::vector, you will get a reallocation if the previous size of the vector equaled the capacity (i.e., all the preallocated memory was exhausted). When that happens, all pointers and references to the elements of the vector are invalidated, for obvious reasons (meaning you must not use them). This is why you should always be wary of taking pointers (or references) to vector elements, i.e. carefully consider if it is logically possible for the vector to grow which might cause it to get reallocated.

However, one might wonder, where are the pointers or references in this code? The answer is, fairly obviously, vector's operator[], which returns a reference to the element. The thing is, the compiler is free to evaluate M[node] before, between or after the two init calls, i.e. it is free to evaluate the left side of the assignment before the right. Since init uses push_back on the same vector it's assigning to, it might happen that the reference returned by operator[] in M[node] gets invalidated before the assignment goes through it, breaking the program.

The manifestation of this bug was actually quite entertaining. I got a segfault for many executions, but sometimes I got a std::bad_alloc (for minute trees of a dozen or so nodes :). Quite spectacularly, it actually caused my AV to flag the executable as a trojan on one run (though that is, I suspect, not really closely related to this particular bug).

It is fairly easy to fix this problem in this case... just use a temporary to store the max of the recursive calls and then assign it to the appropriate location in the vector. Another solution would be to determine how many nodes there will be in the constructor and avoid having to extend the vector. Both of these solutions are either uglier or more complex than this one, but they are correct while this one is not :). However, I think this bug is generally fairly subtle and I wouldn't be surprised if it "leaked out" (in the abstraction sense) of some libraries that use std::vector internally.
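
Just to make the first fix concrete, here's roughly how the corrected init would look (the same code as above, with a temporary forcing both recursive calls to finish before M is indexed):

  int init(int node, int l, int r) {
    while (node >= (int)M.size()) {
      M.push_back(0);
    }
    if (l == r) {
      return (M[node] = -INF);
    } else {
      int mid = l + (r-l)/2;
      int best = std::max(init(2*node + 1, l, mid),
                          init(2*node + 2, mid+1, r));  // any reallocation happens here
      return (M[node] = best);  // M[node] is evaluated only after both calls are done
    }
  }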

Thursday, March 15, 2012

The order of perfect K-shuffles in a deck of N cards

The problem I'll write about in this post is the following: Given a deck of N cards (with N potentially as big as say 10^12) and an integer K that is a divisor of N, find the minimum positive number of perfect K-shuffles that takes the deck back to its original order. I'll define what I actually mean by a perfect K-shuffle in a second, but I'd just like to stress that this problem is very easy if N is small and K is two and you can find the answer for "normal" decks of cards (say 52 card decks) by googling it.

Two kinds of "perfect K-shuffles" are mentioned in the literature: the in-shuffle and the out-shuffle. This post will deal with the in-shuffle, but the same approach can be used to solve the out-shuffle version. So, in a perfect K-in-shuffle, you take the deck of N cards and split it "perfectly" into K piles with N/K cards each. The first pile contains the topmost N/K cards of the deck, the second pile the next N/K topmost cards and so on.

Here's an example with K=3, N=15 with the cards numbered from 1 to 15 and originally in sorted order from the top of the deck downward. After splitting the deck as described, you'd get the following configuration:

 1  6 11
 2  7 12
 3  8 13
 4  9 14
 5 10 15

To complete the K-in-shuffle, you rebuild the deck from top to bottom, taking one card from each pile, starting from the rightmost pile and moving leftwards. That is, card 11 would end up on the top of the deck, followed by 6 and 1, then followed by 12, 7, and 2 etc. What you get, from top to bottom, is:

11 6 1 12 7 2 13 8 3 14 9 4 15 10 5

By the way, the out-shuffle differs only in that we start taking the cards from the leftmost pile, therefore preserving the position of the "outer cards" (the first and the last card always stay in their original position).

This shuffle (as any other shuffle) is a permutation of N elements. In fact, the numbers written above are the inverse permutation defined by the perfect 3-in-shuffle of a deck of 15 cards. If we invert it, we will get the following permutation:

3 6 9 12 15 2 5 8 11 14 1 4 7 10 13

The problem is basically asking us to find out the order of this permutation, i.e. how many times it has to be applied to get back to the sorted sequence. When a permutation is fairly small, it is easy to find its order by examining its cycle structure - the order is just the LCM of all the cycle lengths. This will in general take O(N) time since we basically want to walk through all the cycles. However, with N large, this is certainly way too slow.
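
For reference, here's a sketch of that O(N) baseline (the names are mine, not from any particular library): given the permutation explicitly, its order is the LCM of its cycle lengths.

#include <vector>

long long gcd_ll(long long a, long long b) {
    return b == 0 ? a : gcd_ll(b, a % b);
}

// p is 0-based; p[i] is the position that position i maps to after one shuffle
long long permutation_order(const std::vector<int>& p) {
    int n = (int)p.size();
    std::vector<bool> seen(n, false);
    long long order = 1;
    for (int i = 0; i < n; ++i) {
        if (seen[i]) continue;
        long long len = 0;                         // walk this cycle and measure it
        for (int j = i; !seen[j]; j = p[j]) {
            seen[j] = true;
            ++len;
        }
        order = order / gcd_ll(order, len) * len;  // order = lcm(order, len)
    }
    return order;
}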

So the idea is to analyze the permutation and try to come up with something better. It turns out (which you might guess by playing around with small examples) that this permutation takes the card from position c to position Kc mod (N+1). It is fairly easy to prove this. Let Q = N/K. Then the general configuration after the first part of the shuffle looks like:

  1 Q+1 ... (K-1)Q+1
  2 Q+2 ... (K-1)Q+2
  .  .   .      .
  .  .   .      .
  .  .   .      .
  Q Q+Q  .  (K-1)Q+Q = N

So the card in the ith row and the jth column (1-based) is c=(j-1)Q + i. When completing the second part of the shuffle, all the (i-1)K cards in the preceding rows are collected before c, as are the K-j cards in the ith row that are to the right of c. That means that any card c=(j-1)Q + i moves to position (i-1)K + (K-j) + 1 = iK - j + 1 (*).

Now consider the number cK = (j-1)QK + iK. We find that iK = cK - (j-1)QK. Plugging that into (*), we get that the position of card c after one shuffle is cK - (j-1)QK - j + 1 = cK - (j-1)QK - (j-1) = cK - (j-1)(QK + 1) = cK (mod QK+1). This completes the proof since QK = N.

One useful sanity check to make here is that since we're numbering the cards from 1 to N (inclusive), we better not get the result that a card moves to position zero. To see that this is indeed impossible, consider the GCD of K and the modulus N+1. Since K is a divisor of N, every prime factor p of K is also a prime factor of N, and p cannot divide two consecutive integers, so it cannot divide N+1. It follows that K is relatively prime to the modulus N+1. Using this fact, the only way cK could be divisible by N+1 (and therefore give zero as the position of the card) is if c were a multiple of N+1, which is impossible since c < N+1.

Now consider the first card. After one shuffle, it will be in position K, after two shuffles in position K^2 mod (N+1), and so on. Eventually it will get back to the top of the deck. The lowest positive number of shuffles that achieves this will be a number e such that K^e = 1 (mod N+1). Since K is relatively prime to N+1, it is a member of the multiplicative group modulo N+1, and e is exactly the order of K in this group. In fact, this number will be the sought order of the shuffle. That is fairly obvious - any card c will after e shuffles end up at position cK^e = c (mod N+1). Note that some cards might cycle faster, but then they will have to cycle several times before the topmost card cycles.
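
Before trusting this, I like to brute-force check such claims on small cases. A throwaway program like the following (just a sketch, not part of any efficient solution) simulates the physical K-in-shuffle and compares the number of shuffles needed with the multiplicative order of K modulo N+1:

#include <cstdio>
#include <vector>

int main() {
    const int K = 3, N = 15, Q = N / K;
    std::vector<int> deck(N);
    for (int i = 0; i < N; ++i) deck[i] = i + 1;  // cards 1..N, top to bottom

    long long shuffles = 0;
    while (true) {
        // one K-in-shuffle: pile j holds old positions j*Q+1..(j+1)*Q;
        // rebuild row by row, taking from the rightmost pile first
        std::vector<int> next(N);
        int pos = 0;
        for (int i = 0; i < Q; ++i)
            for (int j = K - 1; j >= 0; --j)
                next[pos++] = deck[j * Q + i];
        deck.swap(next);
        ++shuffles;

        bool sorted = true;
        for (int i = 0; i < N; ++i) sorted = sorted && (deck[i] == i + 1);
        if (sorted) break;
    }

    long long e = 1;  // multiplicative order of K modulo N+1, computed naively
    for (long long x = K % (N + 1); x != 1; x = x * K % (N + 1)) ++e;

    std::printf("shuffles: %lld, order of K mod N+1: %lld\n", shuffles, e);
    return 0;
}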

I wrote about efficiently computing the order of an element in a multiplicative group in a previous post, and we can do that in O(sqrt(N)) which will be sufficient for N up to around 10^14 in under one second.

Tuesday, February 7, 2012

Efficiently computing the order of an element in a multiplicative group

A few weeks ago, I stumbled upon an algorithmic problem that I was able to reduce to finding the order of a specific member of the multiplicative group modulo N. In this post I'll talk about three different ways of doing that, from the naive way, through the fairly obvious one, to the less obvious way which is also the fastest of the three. I'm writing about this even though I'm not a mathematician because I've been unable to find the proof for the fastest algorithm by searching online, even though it's quite elementary. I'll write about the original problem itself in a separate post because it is interesting in its own right. You can find the code that I used to benchmark these algorithms here, and read the whole story in the rest of this (long) post, if you're interested.

So, the problem we're addressing is the following: given numbers K and N, compute the order of K in the multiplicative group modulo N, i.e. the smallest positive integer e such that K^e = 1 (mod N). This is supposed to work quickly, say in a few hundred milliseconds, for N as big as one billion, i.e. 10^9. It's often instructive to think about these boundaries when looking for an algorithm because we want the simplest algorithm that will be fast enough. With N in the billions, we're usually looking for something that's O(sqrt(N)), possibly with a logarithmic factor. This restriction immediately eliminates the naive algorithm, which would just repeatedly multiply by K in the group until the order was found, and is therefore O(N).

To start off, it's instructive to think about some meaningful restrictions on K. Obviously, K must itself be a member of this group to have an order in it, and that means it has to be relatively prime with N (and obviously smaller than N as well). Furthermore, all the members of this group (call it G_N) have this property and there are phi(N) of them, where phi is Euler's totient function. Not to stray too much off topic, I'll just say that phi(N) can easily be computed in O(sqrt(N)) time by decomposing N into prime factors, so from now on we can assume that we know n = phi(N), the cardinality of our group G_N (and we're still on target in terms of complexity). Here is some code that computes n.

int mod_group_order(int N) {
    int d = 2;
    int n = 1; // this will be the order of the group
    while (d <= N/d) {
        if (N%d == 0) { // d is a prime p
            int num = 1; // accumulate p^k
            do {
                N /= d;
                num *= d;
            } while (N%d == 0);
            n *= (num - num/d); // totient(p^k) = p^k-p^(k-1)
        }
        ++d;
    }
    if (N > 1) { // this is a prime, totient(p) = p - 1
        n *= N - 1;
    }
    return n;
}

Having n, we can now use Lagrange's theorem to get a much faster algorithm for computing the order of an element than the naive algorithm mentioned at the beginning. Lagrange's theorem states that the cardinality of any subgroup of a finite group (such as our G_N) must divide the cardinality of the group. Now, the order of an element K is actually the cardinality of the cyclic group generated by K, which is obviously a subgroup of G_N. Therefore, the order of K must be a divisor of n. Since n = phi(N) is O(N) and we can find all the divisors of n in O(sqrt(n)) by trial division, we have an algorithm that might hit our target. Also note that any multiple of the order of an element also takes that element to one, i.e. K^n will always be 1.

int order_divisors(int N, int K, int n) {
    int up_order = n; // this works for sure
    for (int d=2; d<=n/d; ++d) {
        if (n%d == 0) {
            if (fastexp_mod(K, d, N) == 1) {
                return d; // found it!
            }
            if (d*d<n && fastexp_mod(K, n/d, N)==1) {
                // can't return here
                up_order = n/d;
            }
        }
    }
    return up_order;
}

The order of K can either be less than or equal to sqrt(n), or it can be greater. In the former case, we can just return the smallest exponent that takes K to one as soon as we find it (the early return of d), but in the latter case we can't yet be sure that there are no smaller exponents that will do the same thing, so we only record n/d and go on. The fastexp_mod function raises its first argument to the second argument power modulo the third argument in logarithmic time. Therefore, the complexity of this algorithm is O(sqrt(n)*log(n)). More precisely, this algorithm is O(sqrt(n) + d(n)*log(n)), where d(n) is the number of divisors of n. Obviously, when the order of an element is very small, this algorithm will be very fast, but when it is larger than sqrt(n) it will actually be Theta(sqrt(n) + d(n)*log(n)).
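
The snippets here don't include fastexp_mod itself; here's a minimal sketch of what it might look like, based on how it's called above (the linked program has the actual implementation). It assumes the modulus fits in 32 bits so that the 64-bit intermediate products below cannot overflow.

int fastexp_mod(int base, int exp, int mod) {
    long long result = 1;
    long long b = base % mod;
    while (exp > 0) {
        if (exp & 1) {
            result = result * b % mod;  // include this bit of the exponent
        }
        b = b * b % mod;                // square for the next bit
        exp >>= 1;
    }
    return (int)result;
}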

The third algorithm to consider is based on the prime decomposition of n, i.e. the order of the group. Assume that n = p_0^q_0 * p_1^q_1 * ... * p_m^q_m is the prime decomposition of n. Then the order of K is the product of p_i^(q_i - f(K, p_i, n)) for all i, where f(K, p_i, n) is the largest exponent e_i no greater than q_i such that K^(n/(p_i^e_i)) = 1 (mod N). We will prove this is true in the remainder of this post and then see how this algorithm stacks up against the divisor-based one in terms of efficiency. Before that, here is the code that implements this algorithm (see the whole program for comments).

int f(int p, int lim, int num, int n, int K, int N) { 
    while (lim > 0) {
        if (fastexp_mod(K, n/num, N) == 1) {
            return lim;
        }
        --lim;
        num /= p;
    }
    return 0;
}

int order_primes_iter(int N, int K, int n) {
    int sol = 1;
    int t = n;
    int d = 2;
    while (d <= t/d) {
        if (t%d == 0) {
            int cnt = 0;
            int num = 1;
            do {
                num *= d;
                t /= d;
                ++cnt;
            } while (t%d == 0);
            sol *= fastexp(d, cnt-f(d, cnt, num, n, K, N));
        }
        ++d;
    }
    if (t > 1) {
        sol *= fastexp(t, 1 - f(t, 1, t, n, K, N));
    }
    return sol;
}

So, after having introduced e_i as the value returned by the function f, we can rewrite the formula for the order of K as the product of p_i^(q_i - e_i) for all i, and this is actually n divided by the product of p_i^e_i. We need to prove two things: that this number actually takes K to one in the group and that there is no smaller number that does so.

Theorem: If K^(n/a) = 1 (mod N) and K^(n/b) = 1 (mod N) and a and b are relatively prime, then K^(n/ab) = 1 (mod N).

This theorem would allow us to compose what we know about the individual e_i numbers into the formula we're trying to prove. Let u = K^(n/ab). Then both u^a = 1 (mod N) and u^b = 1 (mod N) by assumption, which means that both a and b are multiples of the order of u. Since a and b are relatively prime, that order must be 1, i.e. u = 1, which completes the proof.

We know by the definition of e_i that n/(p_i^e_i) takes K to one for all i. Using the above theorem, we can compose all the p_i^e_i factors and easily prove by induction that the suggested formula indeed takes K to one. It remains to be proven that no smaller number does so. Assume that there is such a number, and call it L. Since L is the order of K and we just showed that n/(\prod p_i^e_i) takes K to one, L must be a divisor of n/(\prod p_i^e_i). That means we can write L = n/(\prod p_i^g_i), where g_i >= e_i for all i and g_i > e_i for at least one i; choose one such i. Since every multiple of L also takes K to one, consider n/(p_i^g_i). This is obviously a multiple of L and therefore K^(n/(p_i^g_i)) = 1 (mod N). But then, by the definition of e_i as the largest such exponent, we would have g_i <= e_i, which contradicts our choice of g_i. We conclude that n/(\prod p_i^e_i) is indeed the smallest number that takes K to one, i.e. the order of K.

So what is the complexity of this algorithm? It is easy to come up with a loose upper bound, but it would be fairly difficult to compare this with the complexity of the divisor-based algorithm since the time required to execute either algorithm depends heavily on the properties of N and the order we're looking for. Instead, I wrote a program to measure the running time cumulatively for the 200k values of N above 99800000 and random K. This algorithm outperformed the divisor-based one by about a factor of 19 on my computer (it ran for about 1s versus about 19s for the divisor-based algorithm). I also tried some optimizations in the implementation of the algorithm, namely special treatment of two (which allows the main loop to skip even numbers) and precomputing a list of primes, and both of these yield about a 10% improvement. Check out the code if you're interested.
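
To tie the pieces together, a hypothetical driver might look something like this (the values of N and K are arbitrary illustrations, and it assumes the functions above, plus the fastexp helper from the linked code, are in scope):

#include <cstdio>

int main() {
    int N = 1000000;             // 2^6 * 5^6, just an example modulus
    int K = 3;                   // relatively prime to N, so K is a member of G_N
    int n = mod_group_order(N);  // n = phi(N) = 400000
    int e = order_primes_iter(N, K, n);
    std::printf("the order of %d modulo %d is %d\n", K, N, e);
    return 0;
}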

I'd be very interested to know if someone can find a nice way to asymptotically quantify this difference between the algorithms. The runtime of the faster algorithm depends on both the number of prime factors of a number and the exponents of those factors. Therefore, I don't expect it is possible to relate that back to n in a tight bound, but it might be meaningful to compare the amortized cost of these algorithms over some distribution of N and K.

Thanks if you read this far, and sorry about not using LaTeX :).

Saturday, February 4, 2012

Why Vim is great (Exhibit 2: Vim is scriptable)

One great feature of Vim is that it is scriptable, which allows pretty much infinite customizability. I already blogged about a plugin I wrote for speeding up C++ coding, but there are many other far more useful plugins out there. The point is, if you're a programmer and you want to speed up something you're doing often, you can easily do it yourself if an appropriate plugin doesn't already exist or you can't find it. My example above is obviously covered by some excellent plugins, but they are much more heavyweight than what I wanted to do.

Another example, which I encountered recently and for which an existing script (probably) doesn't exist, was reformatting some context-free grammars encoded in LaTeX. I was helping organize all the exam problems from the last several years of our compiler design course into a problem book to give out to students for practice. There were probably more than 50 grammars in these problems, written in several different styles of marking nonterminals, listing rules etc. For example, these were three of the styles we had originally:

S \(\to\) aAbBa | bBaAb
A \(\to\) AaBb | Ba | a
B \(\to\) Ab | b

S \(\to\) A a A b
A \(\to\) A b
A \(\to\) d B
A \(\to\) f
B \(\to\) f

S \(\to\) AB | dC     A \(\to\) aCB | \(\varepsilon\) 
B \(\to\) eS | bC     C \(\to\) c | aB

We obviously wanted to have a uniform style for all these grammars, so I wrote a short script that reads in a grammar in any of the original styles and reformats it. Obviously, you could write a program in any language to do that, but since these grammars were part of a large document, you'd have to delimit them somehow for the external program or copy/paste them around. If you think about it, the problem was an exact fit for an editor because it is part of editing a document. Also, it's not really possible to do this with a macro that doesn't itself contain some code. In any case, writing the script took probably less than 20 minutes; far less than it would have taken me to manually convert all the grammars, not to mention far less error prone.

So how do you write these scripts? Vim has a built-in scripting language (commonly referred to just as Vimscript), which is a dynamic language that feels sort of like Python, but has somewhat clunky syntax. I'm not going to talk about the language very much here because it's covered well elsewhere, for example in a series of IBM DeveloperWorks articles, and in Vim's help by typing :help script.

The other thing to note is that you can actually script Vim using several other (better) languages, namely Python, Ruby, Perl, Scheme and Tcl. You can find great info on these in Vim's help under python, ruby, perl, mzscheme and tcl, respectively. I've only tried the Python interface myself and found it more enjoyable than using Vimscript. However, there are some unfortunate caveats. First, you'll need Vim compiled with the appropriate flag set to enable each interface (for example +python or +python3). This in itself motivates people to write plugins in Vimscript since it is the only language guaranteed to be supported (I read somewhere that Python was supposedly included in all the binaries after Vim 7.0, but I haven't been able to verify that right now). Furthermore, debugging these external scripts is a lot harder as the error messages you get aren't very helpful.

In any case, scriptability is a great feature that opens up a lot of opportunities for increased productivity, and that's what we're all after :). Happy Vimming!