Thursday, March 11, 2010

Why Vim is great (Exhibit 1: macros across multiple buffers)

I'll give a short introduction to my Vim experience in the next paragraph, and introduce the real-life problem I recently solved with this Vim "trick" in the paragraph after that. You can skip to paragraph four if you don't care about any of that and just want to see the "trick" (it should be more or less self-contained).

I've been using Vim on a daily basis for about four years now, and I'm really happy that I took the time to learn it and stuck it out when it was rough. There used to be a class in the first semester at my college that had a lab on Vi basics, but it wasn't really taught in a meaningful way; instead, students were left to "try it out", which resulted in people typing something, "getting stuck" in some mode, and not knowing how to do anything. In the end, everyone (including me) hated it and thought it was an old editor that nobody should ever use. Later, I tried getting into it a few times and finally made it stick. I have by no means mastered it, but I feel pretty comfortable using it, and I hope I can show you a few tricks that are not too basic or well known (I'm hoping to make a series of posts out of this).

I recently started maintaining a site (pro bono :) that was abandoned around 2007. The whole thing is mostly plain HTML with a few PHP scripts here and there. Unfortunately, it is structured in a way that requires a lot of duplicated effort, since most files are referenced in several places. Because updates are not very frequent (about one a week) and there is a lot of existing content that matters, I decided (for now) not to port it to a CMS and to just stick with the old system. I've since developed a few scripts and tools that make the update process pretty simple. The central items of the site are trip reports, each containing some text and a link to a picture gallery. We now use Picasa to host the pictures. Every report includes a link to the appropriate album (or several) at the bottom: a clickable thumbnail of the album cover and a clickable album name below it. Both links open the album in the same window (no target attribute).

I got complaints that once you start browsing through a gallery, there's no easy way to get back to the main site. I hadn't noticed this because I have a habit of opening all links with right click->open in new tab. Unfortunately, Picasa doesn't allow you to add a link back to your site from an album (from searching around, it seems to be a much-requested feature). I figured my best option was to simply add target="_blank" to every gallery link so that it opens in a separate window. And since I'd uploaded two years' worth of content (a lot of reports), there were a lot of links to fix.

So on to Exhibit 1: I have a ton of files that each contain two links to an album hosted on Picasa, and I want those links to start opening in a new window, i.e. I need to add target="_blank" (there are other ways to fix this, but they all require at least some change to the link HTML). I remembered seeing a video of a similar problem a few months ago and how it could be solved with Vim. I downloaded all the files into a directory and opened them all with vim *. Then I started recording a macro into register a (with qa). The goal was to somehow identify the gallery links. Fortunately, this turned out to be really easy since both links start with <a href="http://picasaweb. So I typed a simple substitute command:

:%s#<a \zs\zehref="http://picasaweb#target="_blank" #g

The only nontrivial parts of this command are \zs and \ze, which mark where the matched text starts and ends. Since they sit right next to each other here, the match is empty, so the substitution effectively just inserts the replacement text (the target attribute) at that position.
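For illustration, on a hypothetical link line like this one (the real URLs and anchor contents are longer, of course):

<a href="http://picasaweb.google.com/...">Album name</a>

the command produces

<a target="_blank" href="http://picasaweb.google.com/...">Album name</a>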

And now the "magic" part. After applying this command, I typed :wn to store the changes and move to the next buffer and pressed q to stop recording. At this point, I looked at the number of remaining buffers and applied the macro that many times (50@a). The screen started blinking as Vim was crunching away and after a second or so stopped with the message "Cannot go beyond last file" which is expected due to the :wn command on the last file.

And that was it. The whole thing took me about a minute. Sure, I could have written a script that did the same thing, but it would probably have taken a few more minutes, and, after all, Vim was made for this sort of thing. I tried searching for the video that taught me this (i.e. that mentioned it - once you hear about it, it is very obvious) to give due credit, and managed to find it here (it's by Derek Wyatt). Till next time, happy Vimming :).

Monday, February 22, 2010

Top 25 Most Dangerous Programming Errors and what we're doing about them in education

The other day I stumbled upon this list of the 25 most dangerous programming errors. The list was supposedly compiled from data gathered from 20 different organizations. Of course, that by itself doesn't tell us much and by no means implies that the list is "correct", but it is roughly in line with what is commonly discussed in the computer security literature.

The point of this post is not to discuss these errors in detail, but to look at what we can do about them in CS college education. The common denominator of most of these errors is assuming that user input is safe and friendly, yet I feel this is not talked about nearly enough during education. At the same time, the last two items on the list (broken/risky crypto algorithms and race conditions) are something we devote a lot of time to. Specifically, a large part of the operating systems course is about avoiding race conditions, and we have a specialized course that deals with cryptography. Additionally, there's a post-graduate computer security course that likewise skips the errors in the "user input" category and focuses on cryptography and authentication instead. To make matters worse, the "user input" errors are much easier to understand and deal with than these more "advanced" errors.
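To make the "user input" category concrete, here's a minimal, purely hypothetical C++ example of the kind of bug those entries describe: copying attacker-controlled input into a fixed-size buffer without checking its length.

#include <cstdio>
#include <cstring>

int main(int argc, char **argv) {
    // Hypothetical unchecked-input bug: if argv[1] is longer than
    // 15 characters, strcpy writes past the end of 'name'
    // (undefined behavior, and classically exploitable).
    char name[16];
    if (argc > 1) {
        std::strcpy(name, argv[1]);
        std::printf("Hello, %s\n", name);
    }
    return 0;
}

A bounds-checked copy (e.g. snprintf, or strncpy plus explicit null-termination) or simply using std::string avoids the problem, which is part of why these errors are so much cheaper to teach than the "advanced" ones.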

There are other aspects to this that I won't go into now, but I'm convinced that every person with a CS degree should know about all of these errors, so we need to start covering them in class (this is too important to be left to self-study).

Wednesday, April 22, 2009

Book Review: M. Garey, D. Johnson: Computers and Intractability: A Guide to the Theory of NP-Completeness

Today I'll review the wonderful Computers and Intractability: A Guide to the Theory of NP-Completeness. I first heard about this book three or four years ago while watching Steven Skiena's online video lectures (which I blogged about recently), where he recommends it as the best book on the topic (and actually uses it in class). I've since read it several times and used it as a reference many more, and before going into the details, I'll just say I think it's awesome.

And now for the details :). This is not your standard "read before sleep" book. It's more like a math book (and math books are read with pen and paper in hand :). When I first tried reading it (that was when I had just a basic grasp on the topic), I found it pretty hard going, and had to restart a few times. This is not to say it's not well written; on the contrary, the writing is great. It just took me a while to get into the right mindset for it. If you're a fan of formalisms and Turing machines, you'll have no problems whatsoever. If you're not, but still want to understand this stuff, you'll have to make more of an effort, but I think it's well worth it. The book doesn't force formalisms down your throat for no reason, but develops the theory from the formal to the informal and more intuitive (but still correct) in a very elegant way.

The book is logically divided into three parts. In the first part, you'll learn why complexity matters, along with some basic terminology about Turing machines, languages, encoding schemes, decision problems and "real problems" (and how they all relate to each other), and the all-important notion of a polynomial transformation between problems. Then you'll read about what it means for a problem to be in P and in NP. After this introduction, the authors present Cook's theorem, which is basically the origin of the whole story: the famous satisfiability problem is defined and proved to be NP-complete. What follows are proofs (via polynomial transformations) that the "6 basic" NP-complete problems are indeed NP-complete: 3-satisfiability, 3-dimensional matching, vertex cover, clique, Hamiltonian circuit and partition. These transformations are a bit more involved than most (since you don't have many problems to choose from when you start building up the NP-complete set), as is the proof of Cook's theorem, but I find them mathematically beautiful (and in general, constructing NP-completeness proofs is a nice mind "game", not to mention useful in many cases :). However, on your first reading, you'll probably want to skim these, get the intuition behind them and understand "the big picture".
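As a quick refresher on the central definition (my paraphrase, not a quote from the book): a polynomial transformation from a decision problem \(\Pi_1\) to a decision problem \(\Pi_2\), written \(\Pi_1 \propto \Pi_2\), is a function \(f\) from instances of \(\Pi_1\) to instances of \(\Pi_2\), computable in polynomial time, such that

\[
x \text{ is a yes-instance of } \Pi_1 \iff f(x) \text{ is a yes-instance of } \Pi_2,
\]

and a problem \(\Pi\) is NP-complete if \(\Pi \in \mathrm{NP}\) and \(\Pi' \propto \Pi\) for every \(\Pi' \in \mathrm{NP}\).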

After that, the authors take a step back and try to define three general techniques for constructing NP-completeness proofs (restriction, local replacement and component design) and show a few examples of each. You are then presented with fourteen exercise problems on which to try the techniques out. This ends the first part of the book (about 65 pages).

The second part of the book starts by exploring how the NP-completeness of a problem changes when you constrain or modify it. Then it explains what it means for a problem to be NP-complete in the strong sense, and what Turing reducibility and NP-hardness are. The next chapter gives some basic information about dealing with NP-complete problems through approximation algorithms, but, far more interesting in the context of this book, it shows how the theory of NP-completeness can be used to prove that some problems can't even be approximated well (for certain definitions of "a good approximation" which I won't go into here) in polynomial time, unless P = NP. The second part ends by expanding the complexity hierarchy with some more advanced concepts like co-NP, gamma-reductions and gamma-completeness, #P-completeness, PSPACE, DLOGSPACE, EXPTIME etc. (these are all covered with examples and give solid intuition on a first reading, but need to be studied a bit more carefully for full understanding).

Finally, the third part of the book is a catalog of about 300 NP-complete problems, sorted into several categories. This is a great reference! If you have an NP-complete problem on your hands, chances are it has a name and it's on this list :) Given the age of the book, I'd imagine the current list of known NP-complete problems is somewhat bigger, but this one is comprehensive enough that you can probably at least find a problem close enough to yours to construct a proof yourself.

So to summarize, this is an awesome book. NP-complete problems pop up just about everywhere, and it's important to be able to recognize them (and not waste too much time trying to solve them to optimality quickly). My advice for a reading algorithm: read the first part, skimming over the proofs for intuition, then reread it a month later going through the proofs with pen and paper. Then read the second part (it's not as important so you can skim it or read it for full comprehension, depending on how much you want to get out of it) and skim the catalog to get a feel for which problems are there (so you can use it as a reference). You'll recognize quite a few of them, probably some you've met at work :). Also, even if you're not in Computer Science, you might like this book from an epistemological perspective ;)

Sunday, April 19, 2009

Book Review: Andrew Tanenbaum: Computer Networks (4th edition)

Today I'm going to review the classic Computer Networks (4th Edition) by Andrew Tanenbaum.

I first read a significant portion of this book a few years ago when I took a class in computer networks at the University (a 4th-year graduate course, considered by most to be one of the hardest in the program) that was mostly based on it. Since I'm a big fan of reading books cover to cover (which is, BTW, the ideal way to read this particular book), I took some time over the last few weeks and did just that. Having read most of it before, and dealing with networks quite a lot at work and in my spare time, I read it mostly for review value.

As the title suggests, this is a very general book, and as such it holds real value. After all, in today's world, who can ignore networking? If anything, networks will only become more important, and understanding them is already pretty much mandatory for anyone in ICT. This book gives you a great framework for dealing with all sorts of networks you might come across in the wild. For the most part, it explains how networks are built in layers, with each layer offering services to the layer above it and consuming the services of the layer below it. Two major reference models are covered: the OSI/ISO one and the TCP/IP one. You will also get a solid understanding of just how much politics influences standards bodies, and better understand why some things are not as good as they could be :).

The book then goes through the layers of something I've heard called the "Tanenbaum hybrid model", which is basically what is used in practice (in the Internet, for example). You will learn some basics about the physical layer: the kinds of cables in use and their characteristics, fiber, radio, microwave and satellite transmission, encoding bits as voltages, and so on. After the physical layer, there's a high-level explanation of how the telephone system works. Then comes a detailed tour of the data link layer (which basically provides a direct "wire-like" connection between two machines), covering error detection and correction codes (Hamming codes, CRC), framing and flow control. Here he also discusses Bluetooth, Ethernet (in some detail), WLAN etc. In the network layer chapter, the main focus is on routing, congestion control and internetworking, with a relatively detailed discussion of the network layer in the Internet, where IPv4 and IPv6 are covered (among other things). Then he moves on to the transport layer, which mainly deals with establishing connections, retransmissions and congestion control; here he talks a bit about UDP and TCP, the "main" transport layer protocols in the Internet. Finally, there's a chapter about the application layer that features pretty much everything you've heard of, from DNS and e-mail to the World Wide Web and its protocols, to multimedia.

The book finishes with about 200 pages on network security, which covers your basic cryptography and different protocols in the different layers that implement various measures. There is also a great section on further reading per topic.

One final note. Although this is the 4th edition of the book, it was released in 2002, which is (in computer terms) ancient. To put things in perspective, Wikipedia was started in 2001, YouTube in 2005, social networks didn't really exist in 2002, etc. Still, the things covered in this book haven't changed all that much, so this shouldn't worry you too much.

So to summarize, this is an awesome book that I think everyone in ICT should read. It's OK as a reference, although you'll probably want to consult specialized literature for specific topics (like the relevant RFCs for Internet related stuff). It's pretty expensive, so you might want to get a used copy, but it's worth owning!

Friday, April 17, 2009

Using vector and deque "the right way"

If you're using the STL, chances are you're using vector and deque a lot. They are two of the three sequences defined in section 23.1.1 of the C++ standard (the third one being list, which I won't talk about in this post).

The nice thing about these sequences is that they manage their memory automatically when you use push_back and pop_back, so you don't really have to think about it. Naturally, I did a search to see if this topic has already been beaten to death, and sure enough, I found a great article about it by Andrew Koenig and Barbara Moo of Accelerated C++: Practical Programming by Example (C++ In-Depth Series) fame (also see Koenig lookup).

Instead of repeating everything that's in there, I'll just recap the main point. The C++ standard (in section 23.2.4.1) guarantees amortized constant time insert and erase at the end for vector. This basically means that you can grow a vector from empty to size n in O(n) time using push_back, although some individual push_back calls may take linear time (the ones that cause a reallocation, which consists of allocating a new block of memory, copying the old objects over, destroying them and deallocating the old block). One subtle thing to note here is that reallocation (which should mostly be transparent and not something you worry about) invalidates all pointers and iterators into the sequence.
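Here's a minimal sketch of that pitfall (hypothetical code, but the dangling pointer is exactly what the standard's wording implies):

#include <vector>

int main() {
    std::vector<int> v;
    v.push_back(1);
    int *p = &v[0];           // pointer into the vector's storage
    for (int i = 0; i < 100; ++i)
        v.push_back(i);       // may reallocate, leaving p dangling
    // *p = 42;               // undefined behavior if a reallocation happened
    return 0;
}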

The vector class offers some facilities that let you take a more active role in this memory management. The first is void reserve(size_type n), which makes the vector allocate enough memory for at least n items, so that no reallocation happens (and every push_back runs in constant time) until the size exceeds n. The other is size_type capacity() const, which tells you how many items the currently allocated memory can hold.

The standard doesn't say how this has to be implemented, but from reading around and looking at source code, it seems most implementations grow the vector exponentially, with a factor between 1.5 and 2 (although there might be others). Fortunately, even if you can't access the source of your implementation, you can easily test it with a program like this:

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v;

    // Grow the vector one element at a time and report every capacity change.
    std::vector<int>::size_type old_capacity = v.capacity();
    for (int i = 0; i < 1000; ++i) {
        v.push_back(i);
        const std::vector<int>::size_type new_capacity = v.capacity();
        if (old_capacity != new_capacity) {
            std::printf("%4u %4u\n", (unsigned)v.size(), (unsigned)v.capacity());
            old_capacity = new_capacity;
        }
    }
    std::puts("growth done");

    // Now shrink it back down and see whether the capacity ever decreases.
    for (int i = 0; i < 1000; ++i) {
        v.pop_back();
        const std::vector<int>::size_type new_capacity = v.capacity();
        if (old_capacity != new_capacity) {
            std::printf("%4u %4u\n", (unsigned)v.size(), (unsigned)v.capacity());
            old_capacity = new_capacity;
        }
    }
    std::puts("shrink done");

    return 0;
}


On my machine, this outputs

1 1
2 2
3 4
5 8
9 16
17 32
33 64
65 128
129 256
257 512
513 1024
growth done
shrink done


As you can see, the growth factor here is obviously two. Two other points are of interest. First, the vector starts out with no allocated memory. Second, the memory is not released when the size shrinks; it stays allocated until the object is destroyed. Even clear() or resize() won't release it on typical implementations; the usual way to actually give the memory back is the "swap trick" of swapping the vector with a right-sized temporary. Also, if you call reserve with some amount just before the loop, you'll basically seed the growth sequence with something other than one. The point here is that, by default, you get a lot of small reallocations at the start, which does seem like a waste.
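A quick sketch of that swap trick (the standard pre-C++0x idiom, not something specific to this post):

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v(1000, 42);
    v.resize(10);                 // size drops to 10, capacity stays around 1000
    std::vector<int>(v).swap(v);  // copy the 10 elements into a temporary with a
                                  // "right-sized" allocation, then swap storage
    std::printf("%u %u\n", (unsigned)v.size(), (unsigned)v.capacity());
    return 0;
}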

The standard actually recommends using vector "by default" (section 23.1.1.2). Browsing around Herb Sutter's Guru of the Week, I stumbled upon an entry where he recommends using deque as the default instead. For a bit of background, a deque is a sequence that supports constant time insert and erase operations at both the front and the back (via push/pop front/back). It also supports random access iterators and random access via operator[] and at(); that access is still constant time, but slower than vector's, since a deque is basically implemented as a list of fixed-size memory chunks and each access goes through an extra level of indirection. This works quite well for memory management, as the chunks are small and no existing elements need to be moved when the sequence grows. Still, the point Herb is making is kind of defeated by the timing data he includes in the post, since vector outperforms deque in pretty much everything (as I would expect). Still, deque is a great sequence: it's what stack and queue use by default (for good reason), and it's a great structure for implementing 0-1 BFS, as sketched below.
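Here's a rough sketch of 0-1 BFS with std::deque (the adjacency-list representation, function name and tiny test graph are just my choices for the example): on a graph whose edge weights are all 0 or 1, you get shortest distances without a priority queue by pushing nodes reached over 0-weight edges to the front and nodes reached over 1-weight edges to the back.

#include <cstdio>
#include <deque>
#include <limits>
#include <utility>
#include <vector>

// 0-1 BFS: single-source shortest paths when every edge weight is 0 or 1.
// adj[u] holds (neighbor, weight) pairs.
std::vector<int> zero_one_bfs(const std::vector<std::vector<std::pair<int, int> > > &adj,
                              int source) {
    const int INF = std::numeric_limits<int>::max();
    std::vector<int> dist(adj.size(), INF);
    std::deque<int> dq;
    dist[source] = 0;
    dq.push_back(source);
    while (!dq.empty()) {
        int u = dq.front();
        dq.pop_front();
        for (std::vector<std::pair<int, int> >::size_type j = 0; j < adj[u].size(); ++j) {
            int v = adj[u][j].first;
            int w = adj[u][j].second;   // 0 or 1
            if (dist[u] + w < dist[v]) {
                dist[v] = dist[u] + w;
                // 0-weight edges keep the node at the current distance, so it goes
                // to the front; 1-weight edges go to the back. The deque stays
                // sorted by distance, so no priority queue is needed.
                if (w == 0) dq.push_front(v); else dq.push_back(v);
            }
        }
    }
    return dist;
}

int main() {
    // Tiny example graph: 0 -> 1 (weight 0), 0 -> 2 (weight 1), 1 -> 2 (weight 0).
    std::vector<std::vector<std::pair<int, int> > > adj(3);
    adj[0].push_back(std::make_pair(1, 0));
    adj[0].push_back(std::make_pair(2, 1));
    adj[1].push_back(std::make_pair(2, 0));
    std::vector<int> dist = zero_one_bfs(adj, 0);
    std::printf("%d %d %d\n", dist[0], dist[1], dist[2]);  // prints 0 0 0
    return 0;
}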

Going back to vector, I'm wondering how people use reserve in "real life". Often, I'll have a pretty good idea of how big a vector will end up after a loop, and using reserve in those cases seems like the right thing to do.

std::vector<int> v;
v.reserve(n);
for (int i = 0; i < n; ++i) {
    v.push_back(i);
}

Should be better than

std::vector<int> v(n);
for (int i = 0; i < n; ++i) {
    v[i] = i;
}

because the latter first initializes n ints to zero in the constructor and then assigns over them (and the benefit only grows when the elements are objects with more state and possibly elaborate default constructors).

Now the strange bit. Even though I've been programming in C++ for about five years now, feel I know it really well and consider it my "main language", I've never used it on a multi-person project. In fact, I've never done a project over a few thousand lines in it, although a quick grep over the code on my home machine shows about 95k lines of C++ that I wrote. Due to strange circumstances, I've only had a chance to work on smallish projects (~5k lines) in C#, JavaScript, Python, PHP and Java, and on a medium one (~30k lines) in Java. So I'm asking you, C++ developers: what are your experiences with using vector and deque in real projects? Is reserve used, and how do you estimate the right size? Thanks!