Monday, November 28, 2011

Windows Azure, part 2

When I first checked out Windows Azure, I was glad to find it has root containers. Unfortunately, they are almost unusable for us, since they do not allow subdirectories in blob names (see the docs for that). I learned that the semi-hard way, stumbling upon the very helpful StorageClientException: "The requested URI does not represent any resource on the server." So, although root containers technically exist, with this limitation they are of no use to us.
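A minimal sketch of what bit me, against the 2011 StorageClient API (here `client` is assumed to be an already-authenticated CloudBlobClient; blob names and contents are made up):

```csharp
// The root container is addressed by the special name "$root".
var root = client.GetContainerReference("$root");
root.CreateIfNotExist();

// Works: a flat blob name, served directly off the account root.
root.GetBlobReference("favicon.ico").UploadText("...");

// Fails with StorageClientException ("The requested URI does not
// represent any resource on the server."): blob names in $root
// may not contain '/', so no subdirectory-style naming.
root.GetBlobReference("images/logo.png").UploadText("...");
```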

Also, while the retry policy and timeout handling in the Windows Azure .NET SDK are fine, exception handling is not. Getting a StorageClientException or a StorageServerException is expected, but a raw WebException is not expected at all (I thought that one would be wrapped in a StorageServerException).

Other than that, though, Windows Azure .NET SDK is pretty straightforward and easy to use.

Wednesday, November 23, 2011

Windows Azure first experience

Windows Azure looks to be a fine platform, but the toolset installation could be more streamlined. First, Azure Tools for Visual Studio complained about "Error 0x80070643", which was resolved by installing the Azure SDK, Libraries, and Emulator first. Then the emulator told me that I did not have SQL Server 2008 installed by popping up a helpful message about a "possible security problem, see here", which led me to a page that had nothing to do with the real cause of the problem.
After that, though, everything went smoothly. Since we're mostly interested in the storage side of cloud services, I'll explore Windows Azure storage and will probably write a post or two about it.

Tuesday, November 22, 2011

Basic SSD tuning for MongoDB

We explored several options when using MongoDB on an SSD and came to the following conclusions:
  1. Don't turn off MongoDB journaling unless you are really 100% sure -- the peace of mind it provides after your server restarts non-gracefully is enough to warrant its use even within a replica set.
  2. Use the ext4 file system, mounted with "noatime,data=writeback,nobarrier" (nobarrier didn't make a measurable difference on our workload, but others say it's still a good thing). ext4 is fast when allocating files (see 3. below) and lets you delay file metadata updates (reliability is already covered by the MongoDB journal).
  3. Enable the MongoDB options smallfiles and noprealloc (unless you're writing an application so heavy on inserts that it will push the SSD to its limits). SSDs still cost a lot of money, and if you're installing 120GB or 160GB ones as we do, you don't want five empty databases to occupy a gigabyte of that precious space (we also run with directoryperdb=true, it's handy for management). With smallfiles=true, noprealloc=true works just fine -- the 512MiB files that MongoDB creates are allocated in about 300-600ms even under load, saving you even more space.

Improving robustness for C# MongoDB clients

I wonder if I should publish a set of tools and patches that make it easier to write close-to-zero-downtime-without-users-noticing-that-half-the-servers-are-gone applications. Guess I'll put in a bit of effort to make them better suited for public release, like translating all the documentation comments from Russian to English :)
The basic idea is to provide a side-attached layer that gracefully handles failures and retries an operation if the layer decides it still might succeed. While the idea is simple, it holds up to something as crude as pulling the plug on half of the servers, with users noticing only a slight (a few seconds at most) delay in their page load times.
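The core of such a layer can be sketched in a few lines (this is a minimal, language-neutral illustration, not our actual code; the real thing also inspects the exception to decide whether a retry can still succeed):

```python
import time

def with_retries(operation, attempts=3, delay=0.5, retriable=(IOError,)):
    """Run operation(); on a retriable failure, wait and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retriable:
            if attempt == attempts:
                raise  # out of attempts, let the caller see the failure
            time.sleep(delay)
            delay *= 2  # back off so a rebooting server has time to return

# Usage: a flaky operation that succeeds on the third try.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("server is down")
    return "ok"

print(with_retries(flaky, attempts=5, delay=0.01))  # ok
```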

Replication is never a proper replacement for backups

Today I almost had that special moment that makes you glad you do backups -- I was about to drop a database and noticed that I was on the wrong server less than a second before my finger reached the "Enter" key =) I wonder if MongoDB should have some built-in measures against this, maybe some kind of database/collection setup versioning that prevents actual data loss. Then again, I think that's too much for a general-purpose database; still, schema versioning with the ability to roll back would be a nice feature for something that is already complex and proven robust enough for critical applications, like PostgreSQL. The (previously learned) lesson for today: never count on replication alone for protection, since software failures (both client- and server-side) may still destroy your data.

Friday, November 18, 2011

MongoDB connection affinity

When using MongoDB via the C# driver (this might apply to other drivers as well), if you're queuing modifications (i.e. using SafeMode.False) and expect the first modification to be applied before the second (I was Remove-ing one item and Insert-ing another with the same key), never forget to use the same connection (via MongoDatabase.RequestStart). Otherwise you're in for unpleasant surprises, like unexplained intermittent failures that only occur once your application gets enough load. In hindsight it is obvious that write ordering is not preserved across connections by default, but it wasn't obvious enough when I wrote the initial code for our message queue.
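The fix looks roughly like this with the 1.x C# driver (`server`, `database`, `collection`, `oldKey` and `newItem` are placeholder names, not from our code):

```csharp
// Pin all operations in this scope to a single connection,
// so the server applies them in the order they were sent.
using (server.RequestStart(database))
{
    collection.Remove(Query.EQ("_id", oldKey), SafeMode.False);
    collection.Insert(newItem, SafeMode.False);
}
```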

Thursday, November 17, 2011

Homegrown replication

Recently I was working on improving our homegrown file replication system that we use for redundant image storage. It currently serves tens of millions of files occupying about 4TiB of storage, with about 8-10GiB added per day. Strictly speaking, it's not a "replication system" by itself; it's just a system that delays writes to unavailable targets until they become available again. It turned out to be very efficient and resilient, even in the "we just lost two disks" cases, without any centralized authority.
We considered custom filesystems for both Linux and Windows, but all of them required some kind of central management server (or servers), which we would rather not have. We also considered storing files in MongoDB GridFS, but a simple session with a calculator told us that replacing a node (taking it down, adding another one, syncing before the oplog gets exhausted) would be prohibitive for such volumes and large items; copying a virtual disk image is much simpler and faster than doing it through the database layer. So while the idea of specialized file storage in a database is very appealing, a MongoDB GridFS deployment for terabyte-scale files with an intensive write load requires a considerable amount of preplanning, which defeats the main (well, for me) feature of MongoDB -- simplicity.

Horizontal scaling

Today we finished converting a medium-load (~1.2k requests for dynamic content per second) frontend application (C# + ASP.NET Web Forms + ASP.NET Web Pages) to run on two separate machines. While moving databases around and adding memcached memory was relatively easy, splitting an application that carries state (lots of internal caches) proved a bit hard, even though we already had a plan for proper user affinity (cookies + IP hashing via haproxy). PHP users, for example, do not have the luxury of large-scale persistent in-process state, so they do not plan for it and use external services like memcached, message queues, etc. Our application was written in ASP.NET and used about 70 internal caches when I started working on it. Well, now it runs perfectly on two machines behind NGINX + haproxy, and where there are two, there can be three or more =)

Wednesday, November 16, 2011

NGINX and keepalives to backend

I wonder when NGINX will get keepalive connections to backends. The patch by Maxim Dounin was floating around for about half a year, then it made it into a beta, but it's still in beta now, while our media frontend server is busy creating and destroying several thousand connections to backends instead of reusing existing ones -- only about 20 connections would be needed to serve ~3.5kreq/s (our current load). Well, maybe G-WAN will be better, although AFAIK it doesn't come with a prebuilt proxy cache.
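For reference, once the patched `keepalive` directive is available, the config should look roughly like this (backend addresses and the connection count are examples):

```conf
upstream media_backend {
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
    keepalive 20;  # cache up to 20 idle connections per worker
}

server {
    location / {
        proxy_pass http://media_backend;
        proxy_http_version 1.1;         # keepalive needs HTTP/1.1 upstream
        proxy_set_header Connection ""; # clear the default "Connection: close"
    }
}
```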

Thursday, November 10, 2011

MongoDB as an MQ persistence solution

After reviewing several message queues (ZeroMQ and RabbitMQ looked fine, but ZeroMQ has no durability and is in fact more a protocol than an MQ, and RabbitMQ has that "solutionness" all around it), we decided to roll our own simple MQ on top of MongoDB (multiple publishers/multiple subscribers with implicit configuration). After I completed the implementation, a reference to a nice article about building an MQ on MongoDB popped up on the mongodb-user or mongodb-masters list, so I was reassured that the idea itself was OK. Too bad I found that article so late -- while I had considered capped collections (and rejected them, because write performance isn't that important for our MQ, while persistence is), using tailable cursors just never crossed my mind, although I knew the feature existed. Many thanks to the author =)
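The heart of such an MQ is an atomic "claim" step, so that two subscribers never grab the same message. Here is a minimal sketch with an in-memory list standing in for the MongoDB collection (names are illustrative, not from our actual code); in MongoDB the claim is a single atomic findAndModify that matches `state: "new"` and sets `state: "taken"`:

```python
import threading

class SimpleQueue:
    def __init__(self):
        self._lock = threading.Lock()
        self._messages = []

    def publish(self, payload):
        with self._lock:
            self._messages.append({"payload": payload, "state": "new"})

    def claim(self):
        # Atomically find the oldest new message and mark it taken
        # (the lock plays the role of findAndModify's atomicity).
        with self._lock:
            for msg in self._messages:
                if msg["state"] == "new":
                    msg["state"] = "taken"
                    return msg
        return None

# Usage: two publishes, messages are claimed in FIFO order.
q = SimpleQueue()
q.publish("resize image 42")
q.publish("send email")
first = q.claim()
print(first["payload"])  # resize image 42
```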

Monday, November 7, 2011

Perfect world

First post. I'm going to write only about technical stuff here, no personal things.