Sunday, January 24, 2010

Sharding

What is sharding...?

While working for an auction website, somebody got the idea to solve the site’s scaling problems by creating a database server for a group of users and running those servers on cheap Linux boxes. In this scheme the data for User A is stored on one server and the data for User B is stored on another server. It's a federated model. Groups of 500K users are stored together in what are calledshards.

The advantages are:

  • High availability. If one box goes down the others still operate.
  • Faster queries. Smaller amounts of data in each user group mean faster querying.
  • More write bandwidth. With no master database serializing writes you can write in parallel which increases your write throughput. Writing is major bottleneck for many websites.
  • You can do more work. A parallel backend means you can do more work simultaneously. You can handle higher user loads, especially when writing data, because there are parallel paths through your system. You can load balance web servers, which access shards over different network paths, which are processed by separate CPUs, which use separate caches of RAM and separate disk IO paths to process work. Very few bottlenecks limit your work.

How is Sharding different from traditional architectures...?

Sharding is different than traditional database architecture in several important ways; following are the key factors -

Data is denormalized. Traditionally we normalize data. Data are splayed out into anomaly-less tables and then joined back together again when they need to be used. In sharding the data are denormalized. You store together data that are used together.

This doesn't mean you don't also segregate data by type. You can keep a user's profile data separate from their comments, blogs, email, media, etc, but the user profile data would be stored and retrieved as a whole. This is a very fast approach. You just get a blob and store a blob. No joins are needed and it can be written with one disk write.

Data is across many physical instances. Historically database servers are scaled up. You buy bigger machines to get more power. With sharding the data are parallelized and you scale by scaling out. Using this approach you can get massively more work done because it can be done in parallel.

Data is small. The larger a set of data a server handles the harder it is to cash intelligently because you have such a wide diversity of data being accessed. You need huge gobs of RAM that may not even be enough to cache the data when you need it. By isolating data into smaller shards the data you are accessing is more likely to stay in cache.

Smaller sets of data are also easier to backup, restore, and manage.

Data are more highly available. Since the shards are independent a failure in one doesn't cause a failure in another. And if you make each shard operate at 50% capacity it's much easier to upgrade a shard in place. Keeping multiple data copies within a shard also helps with redundancy and making the data more parallelized so more work can be done on the data. You can also setup a shard to have a master-slave or dual master relationship within the shard to avoid a single point of failure within the shard. If one server goes down the other can take over.

It doesn't use replication. Replicating data from a master server to slave servers is a traditional approach to scaling. Data is written to a master server and then replicated to one or more slave servers. At that point read operations can be handled by the slaves, but all writes happen on the master.

Obviously the master becomes the write bottleneck and a single point of failure; and as the load increases the cost of replication increases. Replication costs in CPU, network bandwidth, and disk IO. The slaves fall behind and have stale data. The folks at YouTube had a big problem with replication overhead as they scaled.

Sharding cleanly and elegantly solves the problems with replication.

The most recommended approach to implement database shards is using the Hibernate Shards framework. The said framework offers critical data clustering and support for horizontal partitioning along with standard Hibernate services. Which enable the businesses to keep data in more than one relational database without any add-on complexity whilst building the applications.

Other than Hibernate; shards can also be implemented with any of the following toolkits –

Well, thats all for the starters folks. Hope this was an useful read and has provided enough thoughts for your brains to work quite sometime now...

Tuesday, January 5, 2010

This is a guest post from Larry Dignan of TechRepublic’s sister site ZDNet. You can follow Larry on his ZDNet blog Between the Lines, or subscribe to the RSS feed.

Google has been awarded a U.S. patent for its floating data centers that are powered by waves and cooled by sea water.

The patent award was spotted first by SEO by the Sea. As noted previously, the floating data center idea is quite novel and makes a ton of sense. For Google these floating data centers could be a boon because there are no real estate costs or property taxes.

The offshore data centers would site 3 to 7 miles offshore and float in about 50 to 70 meters of water.



According to the abstract Google was awarded a patent (7,525,207) for:

A system includes a floating platform-mounted computer data center comprising a plurality of computing units, a sea-based electrical generator in electrical connection with the plurality of computing units, and one or more sea-water cooling units for providing cooling to the plurality of computing units.

Inventors were listed as Jimmy Clidaras, David Stiver and William Hamburgen.

The general idea is to move computing power closer to users. The larger question is whether Google will actually deploy these data center barges. Rich Miller at Data Center Knowledge writes:

Does Google have any intention of actually building these floating data centers? Many in the data center community are deeply skeptical about the concept, and find it difficult to believe that Google would ever pursue such a project.

So here’s the interesting precedent: In December 2003 Google applied for a patent for a portable data center in a shipping container, which was awarded in Oct. 2007. At last month’s Efficient Data Center Summit, we learned that Google deployed its first container data center in the fall of 2005, less than two years after filing its patent application.

Hmmm......................... :)