Scaling MongoDB on EC2 or should I just switch to DynamoDB?

I currently run my website on a single server with MongoDB. The server has two components: (1) a crawler that runs hourly and appends data to my MongoDB instance, and (2) a website that reads from the crawler index and also writes to a user personalization DB. I am moving to Amazon EC2 so that the web tier can auto-scale: I want to increase the number of web servers as traffic increases. I don't need auto-scaling for my crawler. This poses a challenge for how I use MongoDB. I'm wondering what my best option is, optimizing for:

  • Minimal changes to my code (the code is in Perl)
  • Ability to seamlessly add/remove web servers without worrying about losing data in the DB
  • Low cost

In the short term, the DB will certainly be able to fit in memory across all machines, since it will be under 2 GB. The user personalization DB can't be rebuilt, so it's more important not to lose it, while the index can easily be rebuilt. The current MongoDB crawl index has about 100k entries that are keyed on ~15 different columns. This is built for speed, as I am working on an online dating site (that is searchable in many ways).

I can think of a few options:

  1. Use SimpleDB for the user personalization store, and MongoDB for the index. Have the index replicate across all machines. However, I don't know much about MongoDB replication.
  2. Move everything to SimpleDB
  3. Move everything to DynamoDB

I don't know much about SimpleDB or DynamoDB. Based on articles, it seems like DynamoDB would be a natural choice, but I'm not sure whether it has good Perl support, or whether I can keep all my columns, indexes, etc. Does anyone have experience with this, or any advice?


ANSWERS:


You could host Mongo on a single EC2 server that each of the boxes in the web farm connects to. You can then easily spin up another web instance that uses the same DB box.

We currently have three Mongo servers, as we run a replica set. When we get to the point where we need to scale horizontally with Mongo, we'll spin up some new instances and shard the larger collections.


I currently run my website on a single server with MongoDB.

First off, this is a big red flag. When running in production, it is always recommended to run a replica set with at least three full nodes.

Replication provides automatic redundancy and fail-over.
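
As a rough illustration of what this looks like from the web servers' side: drivers generally accept a connection string that lists every replica set member, so each auto-scaled web instance can be handed the same string. The sketch below only builds such a string. The host names, the database name "dating" and the replica set name "rs0" are made up for the example, and the code is Python rather than the asker's Perl.

```python
from urllib.parse import quote_plus


def replica_set_uri(hosts, database, replica_set):
    """Build a MongoDB connection URI that lists every replica set member."""
    if not hosts:
        raise ValueError("at least one host is required")
    host_list = ",".join(hosts)
    return (
        f"mongodb://{host_list}/{quote_plus(database)}"
        f"?replicaSet={quote_plus(replica_set)}"
    )


# Hypothetical hosts and names, for illustration only.
HOSTS = [
    "db1.example.internal:27017",
    "db2.example.internal:27017",
    "db3.example.internal:27017",
]
uri = replica_set_uri(HOSTS, "dating", "rs0")
print(uri)
```

Because the string names all three members, a web server can still find the current primary after a fail-over without any change to its configuration.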

Ability to seamlessly add/remove web-servers without worry about losing data in the DB

MongoDB supports a concept called sharding. Sharding provides a way to scale horizontally by automatically partitioning data. The partitioning is done via a shard key.

If you plan to use sharding, please read that link very carefully and recognize the limitations. For MongoDB sharding you have to select the correct key that will allow queries to be evenly distributed across the shards.

The current MongoDB crawl index has about 100k entries that are keyed on ~15 different columns.

This is going to be a problem with sharding. Sharding can only scale queries that use the shard key. A query on the shard key can be routed directly to a single machine. A query on a secondary index goes to all machines.

You have 15 different indexes, so basically all of these queries will go to all shards. That will not "auto-scale" very well at all.
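
To make the routing difference concrete, here is a small simulation of the idea, not of MongoDB itself. It assumes a hypothetical three-shard cluster and uses a hash-modulo mapping as a stand-in for MongoDB's real chunk placement; the field names (user_id, city, age) are invented for the example.

```python
import hashlib

SHARD_COUNT = 3  # hypothetical cluster size


def shard_for(key_value):
    """Map a shard-key value to one shard with a stable hash (a stand-in
    for the real chunk lookup a router would perform)."""
    digest = hashlib.md5(str(key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % SHARD_COUNT


def shards_to_query(query, shard_key):
    """Return the shard numbers a router would have to contact."""
    if shard_key in query:
        # The query names the shard key, so it can be sent to one shard.
        return [shard_for(query[shard_key])]
    # No shard key in the query: every shard has to be asked.
    return list(range(SHARD_COUNT))


targeted = shards_to_query({"user_id": 42}, shard_key="user_id")
broadcast = shards_to_query({"city": "Boston", "age": 30}, shard_key="user_id")
print(len(targeted), len(broadcast))
```

With 15 searchable columns and only one possible shard key, most searches fall into the second branch, so adding shards adds work per query rather than spreading it.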


Beware that at the moment EC2 does not have 64-bit small instances, making replication potentially expensive. Because MongoDB memory-maps its files, a 32-bit OS is not advised.
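
The reason memory mapping and 32-bit don't mix is plain address-space arithmetic: a 32-bit process can address at most 2^32 bytes, and the mapped data files have to share that space with everything else in the process. A quick sketch of the arithmetic, plus a check of the pointer width of whatever Python build runs it (the helper name is made up for the example):

```python
import struct

# Pointer width of the running interpreter: 32 or 64 on common builds.
pointer_bits = struct.calcsize("P") * 8

# Total address space of a 32-bit process, in GiB.
address_space_gib = 2 ** 32 / 2 ** 30


def fits_in_32bit_address_space(data_bytes):
    """Upper bound only: real usable space is smaller, since code, stack
    and heap share the same address space as the mapped files."""
    return data_bytes < 2 ** 32


print(pointer_bits, address_space_gib)
```

A data set that is under 2 GB today is already uncomfortably close to that ceiling once it starts to grow.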


I've had very bad experiences with SimpleDB and think it's fundamentally flawed, so I would avoid it.

There is a good white paper on how to set up MongoDB on Amazon EC2:

I suspect setting up MongoDB on EC2 is the fastest solution versus rewriting-for/migrating-to DynamoDB.

Best of luck!


