Jankit Shah - Important notes: 03/09

25 March 2009

Flickr Architecture

Information Sources

PHP is popular because it's free, relatively easy to program in, and has a lot of features for producing websites quickly.

http://www.php.net/">PHP (an early document)

SCALABILITY (horizontal or vertical) = ability to easily add capacity to accommodate growth. Capacity doesn’t mean speed.

Planning includes realizing what you have right NOW, and predicting what you’ll need later. Planning (what ?/why ?/when ?)
">Capacity Planning for * Linux, referring to the operating system;
* Apache, the Web server;
* MySQL, the database management system (or database server);
* PHP, the programming language.

The reason LAMP is popular is because all the components are free, in the open-source sense. This means you can horizontally scale your system at a much lower incremental cost as demand increases.

To a large extent LAMP is more an idea than specific set of technologies. It's much like AJAX in the way. Replacing each letter with a different technology doesn't change the spirit of the acronym. To build a website you need an OS (linux), you need a webserver (apache), you need a database (MySQL), and you need a client technology (PHP).

Replace Linux with Windows and you lose some cost flexibility, but you still can build your website. Replace MySQL with Postgress and you can still build your website. The advantage of the LAMP stack is that is there is a lot of expertise and help when using it and all the parts of evolved to work better together.

http://en.wikipedia.org/wiki/LAMP_(software_bundle)">LAMP

Federation at Flickr: Doing Billions of Queries a Day by Dathan Pattishall.

Building Scalable Web Sites by Cal Henderson from Flickr.

Database War Stories #3: Flickr by Tim O'Reilly

Cal Henderson's Talks. A lot of useful PowerPoint presentations.

Platform

# PHP

# MySQL

# Shards

# Memcached for a caching layer.

# Squid in reverse-proxy for html and images.

# Linux (RedHat)

# Smarty for templating

# Perl

# PEAR for XML and Email parsing

# ImageMagick, for image processing

# Java, for the node service

# Apache

# SystemImager for deployment

# Ganglia for distributed system monitoring

# Subcon stores essential system configuration files in a subversion repository for easy deployment to machines in a cluster.

# Cvsup for distributing and updating collections of files across a network.

The Stats

# More than 4 billion queries per day.

# ~35M photos in squid cache (total)

# ~2M photos in squid’s RAM

# ~470M photos, 4 or 5 sizes of each

# 38k req/sec to memcached (12M objects)

# 2 PB raw storage (consumed about ~1.5TB on Sunday

# Over 400,000 photos being added every day

The Architecture

# A pretty picture of Flickr's architecture can be found on this slide . A simple depiction is:
-- Pair of ServerIron's
---- Squid Caches
------ Net App's
---- PHP App Servers
------ Storage Manager
------ Master-master shards
------ Dual Tree Central Database
------ Memcached Cluster
------ Big Search Engine

- The Dual Tree structure is a custom set of changes to MySQL that allows scaling by incrementally adding masters without a ring architecture. This allows cheaper scaling because you need less hardware as compared to master-master setups which always requires double the hardware.
- The central database includes data like the 'users' table, which includes primary user
keys (a few different IDs) and a pointer to which shard a users' data can be found on.

# Use dedicated servers for static content.

# Talks about how to support Unicode.

# Use a share nothing architecture.

# Everything (except photos) are stored in the database.

# Statelessness means they can bounce people around servers and it's easier to make their APIs.

# Scaled at first by replication, but that only helps with reads.

# Create a search farm by replicating the portion of the database they want to search.

# Use horizontal scaling so they just need to add more machines.

# Handle pictures emailed from users by parsing each email is it's delivered in PHP. Email is parsed for any photos.

# Earlier they suffered from Master-Slave lag. Too much load and they had a single point of failure.

# They needed the ability to make live maintenance, repair data, and so forth, without taking the site down.

# Lots of excellent material on capacity planning. Take a look in the Information Sources for more details.

# Went to a federated approach so they can scale far into the future:
- Shards: My data gets stored on my shard, but the record of performing action on your comment, is on your shard. When making a comment on someone else's’ blog
- Global Ring: Its like DNS, you need to know where to go and who controls where you go. Every page view, calculate where your data is, at that moment of time.
- PHP logic to connect to the shards and keep the data consistent (10 lines of code with comments!)

# Shards:
- Slice of the main database
- Active Master-Master Ring Replication: a few drawbacks in MySQL 4.1, as honoring commits in Master-Master. AutoIncrement IDs are automated to keep it Active Active.
- Shard assignments are from a random number for new accounts
- Migration is done from time to time, so you can remove certain power users. Needs to be balanced if you have a lot of photos… 192,000 photos, 700,000 tags, will take about 3-4 minutes. Migration is done manually.

# Clicking a Favorite:
- Pulls the Photo owners Account from Cache, to get the shard location (say on shard-5)
- Pulls my Information from cache, to get my shard location (say on shard-13)
- Starts a “distributed transaction” - to answer the question: Who favorited the photo? What are my favorites?

# Can ask question from any shard, and recover data. Its absolutely redundant.

# To get rid of replication lag…
- every page load, the user is assigned to a bucket
- if host is down, go to next host in the list; if all hosts are down, display an error page. They don’t use persistent connections, they build connections and tear it down. Every page load thus, tests the connection.

# Every users reads and writes are kept in one shard. Notion of replication lag is gone.

# Each server in shard is 50% loaded. Shut down 1/2 the servers in each shard. So 1 server in the shard can take the full load if a server of that shard is down or in maintenance mode. To upgrade you just have to shut down half the shard, upgrade that half, and then repeat the process.

# Periods of time when traffic spikes, they break the 50% rule though. They do something like 6,000-7,000 queries per second. Now, its designed for at most 4,000 queries per second to keep it at 50% load.

# Average queries per page, are 27-35 SQL statements. Favorites counts are real time. API access to the database is all real time. Achieved the real time requirements without any disadvantages.

# Over 36,000 queries per second - running within capacity threshold. Burst of traffic, double 36K/qps.

# Each Shard holds 400K+ users data.
- A lot of data is stored twice. For example, a comment is part of the relation between the commentor and the commentee. Where is the comment stored? How about both places? Transactions are used to prevent out of sync data: open transaction 1, write commands, open transaction 2, write commands, commit 1st transaction if all is well, commit 2nd transaction if 1st committed. but there still a chance for failure when a box goes down during the 1st commit.

# Search:
- Two search back-ends: shards 35k qps on a few shards and Yahoo!’s (proprietary) web search
- Owner’s single tag search or a batch tag change (say, via Organizr) goes to the Shards due to real-time requirements, everything else goes to Yahoo!’s engine (probably about 90% behind the real-time goodness)
- Think of it such that you’ve got Lucene-like search

# Hardware:
- EMT64 w/RHEL4, 16GB RAM
- 6-disk 15K RPM RAID-10.
- Data size is at 12 TB of user metadata (these are not photos, this is just innodb ibdata files - the photos are a lot larger).
- 2U boxes. Each shard has~120GB of data.

# Backup procedure:
- ibbackup on a cron job, that runs across various shards at different times. Hotbackup to a spare.
- Snapshots are taken every night across the entire cluster of databases.
- Writing or deleting several huge backup files at once to a replication filestore can wreck performance on that filestore for the next few hours as it replicates the backup files. Doing this to an in-production photo storage filer is a bad idea.
- However much it costs to keep multiple days of backups of all of your data, it's worth it. Keeping staggered backups is good for when you discover something gone wrong a few days later. something like 1, 2, 10 and 30 day backups.

# Photos are stored on the filer. Upon upload, it processes the photos, gives you different sizes, then its complete. Metadata and points to the filers, are stored in the database.

# Aggregating the data: Very fast, because its a process per shard. Stick it into a table, or recover data from another copy from other users shards.

# max_connections = 400 connections per shard, or 800 connections per server & shard. Plenty of capacity and connections. Thread cache is set to 45, because you don’t have more than 45 users having simultaneous activity.

# Tags:
- Tags do not fit well with traditional normalized RDBMs schema design. Denormalization or heavy caching is the only way to generate a tag cloud in milliseconds for hundreds of millions of tags.
- Some of their data views are calculated offline by dedicated processing clusters which save the results into MySQL because some relationships are so complicated to calculate it would absorb all the database CPU cycles.

# Future Direction:
- Make it faster with real-time BCP, so all data centers can receive writes to the data layer (db, memcache, etc) all at the same time. Everything is active nothing will ever be idle.

Lessons Learned

# Think of your application as more than just a web application. You'll have REST APIs, SOAP APIs, RSS feeds, Atom feeds, etc.

# Go stateless. Statelessness makes for a simpler more robust system that can handle upgrades without flinching.

# Re-architecting your database sucks.

# Capacity plan. Bring capacity planning into the product discussion EARLY. Get buy-in from the $$$ people (and engineering management) that it’s something to watch.

# Start slow. Don’t buy too much equipment just because you’re scared/happy that your site will explode.

# Measure reality. Capacity planning math should be based on real things, not abstract ones.

# Build in logging and metrics. Usage stats are just as important as server stats. Build in custom metrics to measure real-world usage to server-based stats.

# Cache. Caching and RAM is the answer to everything.

# Abstract. Create clear levels of abstraction between database work, business logic, page logic, page mark-up and the presentation layer. This supports quick turn around iterative development.

# Layer. Layering allows developers to create page level logic which designers can use to build the user experience. Designers can ask for page logic as needed. It's a negotiation between the two parties.

# Release frequently. Even every 30 minutes.

# Forget about small efficiencies, about 97% of the time. Premature optimization is the root of all evil.

# Test in production. Build into the architecture mechanisms (config flags, load balancing, etc.) with which you can deploy new hardware easily into (and out of) production.

# Forget benchmarks. Benchmarks are fine for getting a general idea of capabilities, but not for planning. Artificial tests give artificial results, and the time is better used with testing for real.

# Find ceilings.
- What is the maximum something that every server can do ?
- How close are you to that maximum, and how is it trending ?
- MySQL (disk IO ?)
- SQUID (disk IO ? or CPU ?)
- memcached (CPU ? or network ?)

# Be sensitive to the usage patterns for your type of application.
- Do you have event related growth? For example: disaster, news event.
- Flickr gets 20-40% more uploads on first work day of the year than any previous peak the previous year.
- 40-50% more uploads on Sundays than the rest of the week, on average

# Be sensitive to the demands of exponential growth. More users means more content, more content means more connections, more connections mean more usage.

# Plan for peaks. Be able to handle peak loads up and down the stack.

Google Architecture

Information Sources

Video: Building Large Systems at Google

Google Lab: The Google File System

Google Lab: MapReduce: Simplified Data Processing on Large Clusters

Google Lab:

http://labs.google.com/papers/bigtable.html">BigTable.

Video: BigTable: A Distributed Structured Storage System.

Google Lab: The Chubby Lock Service for Loosely-Coupled Distributed Systems.

How Google Works by David Carr in Baseline Magazine.

Google Lab: Interpreting the Data: Parallel Analysis with Sawzall.

Dare Obasonjo's Notes on the scalability conference.

Platform

# Linux

# A large diversity of languages: Python, Java, C++

What's Inside?

The Stats

# Estimated 450,000 low-cost commodity servers in 2006

# In 2005 Google indexed 8 billion web pages. By now, who knows?

# Currently there over 200 GFS clusters at Google. A cluster can have 1000 or even 5000
machines. Pools of tens of thousands of machines retrieve data from GFS clusters that run as large as 5 petabytes of storage. Aggregate read/write throughput can be as high as 40 gigabytes/second across the cluster.

# Currently there are 6000 MapReduce applications at Google and hundreds of new applications are being written each month.

# BigTable scales to store billions of URLs, hundreds of terabytes of satellite imagery, and preferences for hundreds of millions of users.

The Stack

Google visualizes their infrastructure as a three layer stack:

# Products: search, advertising, email, maps, video, chat, blogger

# Distributed Systems Infrastructure: GFS, MapReduce, and BigTable.

# Computing Platforms: a bunch of machines in a bunch of different data centers

# Make sure easy for folks in the company to deploy at a low cost.

# Look at price performance data on a per application basis. Spend more money on hardware to not lose log data, but spend less on other types of data. Having said that, they don't lose data.

Reliable Storage Mechanism with GFS (Google File System)

# Reliable scalable storage is a core need of any application. GFS is their core storage platform.

# Google File System - large distributed log structured file system in which they throw in a lot of data.

# Why build it instead of using something off the shelf? Because they control everything and it's the platform that distinguishes them from everyone else. They required:
- high reliability across data centers
- scalability to thousands of network nodes
- huge read/write bandwidth requirements
- support for large blocks of data which are gigabytes in size.
- efficient distribution of operations across nodes to reduce bottlenecks

# System has master and chunk servers.
- Master servers keep metadata on the various data files. Data are stored in the file system in 64MB chunks. Clients talk to the master servers to perform metadata operations on files and to locate the chunk server that contains the needed they need on disk.
- Chunk servers store the actual data on disk. Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes. Once directed by a master server, a client application retrieves files directly from chunk servers.

# A new application coming on line can use an existing GFS cluster or they can make your own. It would be interesting to understand the provisioning process they use across their data centers.

# Key is enough infrastructure to make sure people have choices for their application. GFS can be tuned to fit individual application needs.

Do Something With the Data Using MapReduce

# Now that you have a good storage system, how do you do anything with so much data? Let's say you have many TBs of data stored across a 1000 machines. Databases don't scale or cost effectively scale to those levels. That's where MapReduce comes in.

# MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to
generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

# Why use MapReduce?
- Nice way to partition tasks across lots of machines.
- Handle machine failure.
- Works across different application types, like search and ads. Almost every application has map reduce type operations. You can precompute useful data, find word counts, sort TBs of data, etc.
- Computation can automatically move closer to the IO source.

# The MapReduce system has three different types of servers.
- The Master server assigns user tasks to map and reduce servers. It also tracks the state of the tasks.
- The Map servers accept user input and performs map operations on them. The results are written to intermediate files
- The Reduce servers accepts intermediate files produced by map servers and performs reduce operation on them.

# For example, you want to count the number of words in all web pages. You would feed all the pages stored on GFS into MapReduce. This would all be happening on 1000s of machines simultaneously and all the coordination, job scheduling, failure handling, and data transport would be done automatically.
- The steps look like: GFS -> Map -> Shuffle -> Reduction -> Store Results back into GFS.
- In MapReduce a map maps one view of data to another, producing a key value pair, which in our example is word and count.
- Shuffling aggregates key types.
- The reductions sums up all the key value pairs and produces the final answer.

# The Google indexing pipeline has about 20 different map reductions. A pipeline looks at data with a whole bunch of records and aggregating keys. A second map-reduce comes a long, takes that result and does something else. And so on.

# Programs can be very small. As little as 20 to 50 lines of code.

# One problem is stragglers. A straggler is a computation that is going slower than others which holds up everyone. Stragglers may happen because of slow IO (say a bad controller) or from a temporary CPU spike. The solution is to run multiple of the same computations and when one is done kill all the rest.

# Data transferred between map and reduce servers is compressed. The idea is that because servers aren't CPU bound it makes sense to spend on data compression and decompression in order to save on bandwidth and I/O.

Storing Structured Data in BigTable

# BigTable is a large scale, fault tolerant, self managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.

# BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries.

# It provides lookup mechanism to access structured data by key. GFS stores opaque data and many applications needs has data with structure.

# Commercial databases simply don't scale to this level and they don't work across 1000s machines.

# By controlling their own low level storage system Google gets more control and leverage to improve their system. For example, if they want features that make cross data center operations easier, they can build it in.

# Machines can be added and deleted while the system is running and the whole system just works.

# Each data item is stored in a cell which can be accessed using a row key, column key, or timestamp.

# Each row is stored in one or more tablets. A tablet is a sequence of 64KB blocks in a data format called SSTable.

# BigTable has three different types of servers:
- The Master servers assign tablets to tablet servers. They track where tablets are located and redistributes tasks as needed.
- The Tablet servers process read/write requests for tablets. They split tablets when they exceed size limits (usually 100MB - 200MB). When a tablet server fails, then a 100 tablet servers each pickup 1 new tablet and the system recovers.
- The Lock servers form a distributed lock service. Operations like opening a tablet for writing, Master aribtration, and access control checking require mutual exclusion.

# A locality group can be used to physically store related bits of data together for better locality of reference.

# Tablets are cached in RAM as much as possible.

Hardware

# When you have a lot of machines how do you build them to be cost efficient and use power efficiently?

# Use ultra cheap commodity hardware and built software on top to handle their death.

# A 1,000-fold computer power increase can be had for a 33 times lower cost if you you use a failure-prone infrastructure rather than an infrastructure built on highly reliable components. You must build reliability on top of unreliability for this strategy to work.

# Linux, in-house rack design, PC class mother boards, low end storage.

# Price per wattage on performance basis isn't getting better. Have huge power and cooling issues.

# Use a mix of collocation and their own data centers.

Misc

# Push changes out quickly rather than wait for QA.

# Libraries are the predominant way of building programs.

# Some are applications are provided as services, like crawling.

# An infrastructure handles versioning of applications so they can be release without a fear of breaking things.

Future Directions for Google

# Support geo-distributed clusters.

# Create a single global namespace for all data. Currently data is segregated by cluster.

# More and better automated migration of data and computation.

# Solve consistency issues that happen when you couple wide area replication with network partitioning (e.g. keeping services up even if a cluster goes offline for maintenance or due to some sort of outage).

Lessons Learned

# Infrastructure can be a competitive advantage. It certainly is for Google. They can roll out new internet services faster, cheaper, and at scale at few others can compete with. Many companies take a completely different approach. Many companies treat infrastructure as an expense. Each group will use completely different technologies and their will be little planning and commonality of how to build systems. Google thinks of themselves as a systems engineering company, which is a very refreshing way to look at building software.

# Spanning multiple data centers is still an unsolved problem. Most websites are in one and at most two data centers. How to fully distribute a website across a set of data centers is, shall we say, tricky.

# Take a look at Hadoop (product) if you don't have the time to rebuild all this infrastructure from scratch yourself. Hadoop is an open source implementation of many of the same ideas presented here.

# An under appreciated advantage of a platform approach is junior developers can quickly and confidently create robust applications on top of the platform. If every project needs to create the same distributed infrastructure wheel you'll run into difficulty because the people who know how to do this are relatively rare.

# Synergy isn't always crap. By making all parts of a system work together an improvement in one helps them all. Improve the file system and everyone benefits immediately and transparently. If every project uses a different file system then there's no continual incremental improvement across the entire stack.

# Build self-managing systems that work without having to take the system down. This allows you to more easily rebalance resources across servers, add more capacity dynamically, bring machines off line, and gracefully handle upgrades.

# Create a Darwinian infrastructure. Perform time consuming operation in parallel and take the winner.

# Don't ignore the Academy. Academia has a lot of good ideas that don't get translated into production environments. Most of what Google has done has prior art, just not prior large scale deployment.

# Consider compression. Compression is a good option when you have a lot of CPU to throw around and limited IO.

YouTube Architecture

Platform

# Apache

# Python

# Linux (SuSe)

# MySQL

# psyco, a dynamic python->C compiler

# lighttpd for video instead of Apache

What's Inside?

The Stats

# Supports the delivery of over 100 million videos per day.

# Founded 2/2005

# 3/2006 30 million video views/day

# 7/2006 100 million video views/day

# 2 sysadmins, 2 scalability software architects

# 2 feature developers, 2 network engineers, 1 DBA

Recipe for handling rapid growth

while (true)
{
identify_and_fix_bottlenecks();
drink();
sleep();
notice_new_bottleneck();
}

This loop runs many times a day.

Web Servers

# NetScalar is used for load balancing and caching static content.

# Run Apache with mod_fast_cgi.

# Requests are routed for handling by a Python application server.

# Application server talks to various databases and other informations sources to get all the data
and formats the html page.

# Can usually scale web tier by adding more machines.

# The Python web code is usually NOT the bottleneck, it spends most of its time blocked on RPCs.

# Python allows rapid flexible development and deployment. This is critical given the competition they face.

# Usually less than 100 ms page service times.

# Use psyco, a dynamic python->C compiler that uses a JIT compiler approach to optimize inner loops.

# For high CPU intensive activities like encryption, they use C extensions.

# Some pre-generated cached HTML for expensive to render blocks.

# Row level caching in the database.

# Fully formed Python objects are cached.

# Some data are calculated and sent to each application so the values are cached in local memory. This is an underused strategy. The fastest cache is in your application server and it doesn't take much time to send precalculated data to all your servers. Just have an agent that watches for changes, precalculates, and sends.

Video Serving

# Costs include bandwidth, hardware, and power consumption.

# Each video hosted by a mini-cluster. Each video is served by more than one machine.

# Using a a cluster means:
- More disks serving content which means more speed.
- Headroom. If a machine goes down others can take over.
- There are online backups.

# Servers use the lighttpd web server for video:
- Apache had too much overhead.
- Uses epoll to wait on multiple fds.
- Switched from single process to multiple process configuration to handle more connections.

# Most popular content is moved to a CDN (content delivery network):
- CDNs replicate content in multiple places. There's a better chance of content being closer to the user, with fewer hops, and content will run over a more friendly network.
- CDN machines mostly serve out of memory because the content is so popular there's little thrashing of content into and out of memory.

# Less popular content (1-20 views per day) uses YouTube servers in various colo sites.
- There's a long tail effect. A video may have a few plays, but lots of videos are being played. Random disks blocks are being accessed.
- Caching doesn't do a lot of good in this scenario, so spending money on more cache may not make sense. This is a very interesting point. If you have a long tail product caching won't always be your performance savior.
- Tune RAID controller and pay attention to other lower level issues to help.
- Tune memory on each machine so there's not too much and not too little.

Serving Video Key Points

# Keep it simple and cheap.

# Keep a simple network path. Not too many devices between content and users. Routers, switches, and other appliances may not be able to keep up with so much load.

# Use commodity hardware. More expensive hardware gets the more expensive everything else gets too (support contracts). You are also less likely find help on the net.

# Use simple common tools. They use most tools build into Linux and layer on top of those.

# Handle random seeks well (SATA, tweaks).

Serving Thumbnails

# Surprisingly difficult to do efficiently.

# There are a like 4 thumbnails for each video so there are a lot more thumbnails than videos.

# Thumbnails are hosted on just a few machines.

# Saw problems associated with serving a lot of small objects:
- Lots of disk seeks and problems with inode caches and page caches at OS level.
- Ran into per directory file limit. Ext3 in particular. Moved to a more hierarchical structure. Recent improvements in the 2.6 kernel may improve Ext3 large directory handling up to 100 times, yet storing lots of files in a file system is still not a good idea.
- A high number of requests/sec as web pages can display 60 thumbnails on page.
- Under such high loads Apache performed badly.
- Used squid (reverse proxy) in front of Apache. This worked for a while, but as load increased performance eventually decreased. Went from 300 requests/second to 20.
- Tried using lighttpd but with a single threaded it stalled. Run into problems with multiprocesses mode because they would each keep a separate cache.
- With so many images setting up a new machine took over 24 hours.
- Rebooting machine took 6-10 hours for cache to warm up to not go to disk.

# To solve all their problems they started using Google's BigTable, a distributed data store:
- Avoids small file problem because it clumps files together.
- Fast, fault tolerant. Assumes its working on a unreliable network.
- Lower latency because it uses a distributed multilevel cache. This cache works across different collocation sites.
- For more information on BigTable take a look at Google Architecture, GoogleTalk Architecture, and BigTable.

Databases

# The Early Years
- Use MySQL to store meta data like users, tags, and descriptions.
- Served data off a monolithic RAID 10 Volume with 10 disks.
- Living off credit cards so they leased hardware. When they needed more hardware to handle load it took a few days to order and get delivered.
- They went through a common evolution: single server, went to a single master with multiple read slaves, then partitioned the database, and then settled on a sharding approach.
- Suffered from replica lag. The master is multi-threaded and runs on a large machine so it can handle a lot of work. Slaves are single threaded and usually run on lesser machines and replication is asynchronous, so the slaves can lag significantly behind the master.
- Updates cause cache misses which goes to disk where slow I/O causes slow replication.
- Using a replicating architecture you need to spend a lot of money for incremental bits of write performance.
- One of their solutions was prioritize traffic by splitting the data into two clusters: a video watch pool and a general cluster. The idea is that people want to watch video so that function should get the most resources. The social networking features of YouTube are less important so they can be routed to a less capable cluster.

# The later years:
- Went to database partitioning.
- Split into shards with users assigned to different shards.
- Spreads writes and reads.
- Much better cache locality which means less IO.
- Resulted in a 30% hardware reduction.
- Reduced replica lag to 0.
- Can now scale database almost arbitrarily.

Data Center Strategy

# Used manage hosting providers at first. Living off credit cards so it was the only way.
# Managed hosting can't scale with you. You can't control hardware or make favorable networking agreements.
# So they went to a colocation arrangement. Now they can customize everything and negotiate their own contracts.
# Use 5 or 6 data centers plus the CDN.
# Videos come out of any data center. Not closest match or anything. If a video is popular enough it will move into the CDN.
# Video bandwidth dependent, not really latency dependent. Can come from any colo.
# For images latency matters, especially when you have 60 images on a page.
# Images are replicated to different data centers using BigTable. Code
looks at different metrics to know who is closest.

Lessons Learned

# Stall for time. Creative and risky tricks can help you cope in the short term while you work out longer term solutions.

# Prioritize. Know what's essential to your service and prioritize your resources and efforts around those priorities.

# Pick your battles. Don't be afraid to outsource some essential services. YouTube uses a CDN to distribute their most popular content. Creating their own network would have taken too long and cost too much. You may have similar opportunities in your system. Take a look at Software as a Service for more ideas.

# Keep it simple! Simplicity allows you to rearchitect more quickly so you can respond to problems. It's true that nobody really knows what simplicity is, but if you aren't afraid to make changes then that's a good sign simplicity is happening.

# Shard. Sharding helps to isolate and constrain storage, CPU, memory, and IO. It's not just about getting more writes performance.

# Constant iteration on bottlenecks:
- Software: DB, caching
- OS: disk I/O
- Hardware: memory, RAID

# You succeed as a team. Have a good cross discipline team that understands the whole system and what's underneath the system. People who can set up printers, machines, install networks, and so on. With a good team all things are possible.

16 March 2009

Productivity

The work study, inclusive of methods study and time study, job evaluation, merit rating, job redesign and financial incentives which were described have been known as Productivity Techniques since several decades. Productivity in this context means a measure of the quantity of output per unit of input. The input could be the man hours spent on producing that output or it could be the number of machine hours spent or the amount of materials consumed(in number, Kg, litre or Rs etc). Basically productivity is known as the ratio between the output and the input.

Productivity = Amount of output /Amount of input

Since, in the earlier days of industrial management, it was considered very critical to control the laborers, the term productivity generally meant labor productivity. The time study, method study, incentives schemes and the like were seen as ways of managing or controlling the labor. The emphasis was on labor or, more exactly, on the laborers. The managers could understand the machines and could always run faster or slower, for a longer period or a shorter period, of course within the capabilities of the particular machine or a set of machines. They thought that they could make the machine as productive as it could possibly be: Controlling the machines was easier-after all those were mechanical things; however humans could not always be trusted to give the desired output. These need to be monitored, standards to be established for them, prescribing a method, and sometimes being enticed to produce more by giving more money for more than the ‘standard’ output. After all, human being in all truthfulness was only an extension of the machine / equipment he was operating in the production shop. Nevertheless, this ‘human machine’ seemed to go berserk voluntarily. It was not a very honest machine. It needed standards of time, method and monetary compensation to put it (him/her) in place. These managerial control methods formed a major part of the techniques for productivity.

This is not to say that in the recent times human productivity has lost its importance. Human input is a major input into any business, management, government or society. Management is, after all, by the people and of the people; what is important is that now it is more and more recognized that management is primarily for the people.

Some of the ratios for labor productivity measurement are as follows:

Workers’ productivity = Workers’ output expressed in standard hours / Number of hours (man hours) worked by the workers

A Worker’s or a group of worker’s productivity = Number of units of output / Number of days taken

Another example of labor productivity = Number of toilets cleaned in a shift / Number of cleaners

A group of workers’ productivity = Tons (or Kg) of output/ Number of workers

Labor productivity = Workers’ Output expressed in Rupee Value / Workers’ Salaries and Wage in Rupees

Productivity, as measured above, represents the efficiency of the labor. These indices show how efficiently labor is being utilized. As indicated earlier, these are the engineering indices of labor productivity. That is, labor productivity would be looked at the same way the machinery productivity is viewed.

Though, there is nothing wrong conceptually with this viewpoint, it is a limited view. These are ‘partial measures’ of productivity. Further additions are required in order to get a complete picture of productivity. Before proceeding to a more comprehensive definition, let us look at the ‘engineering’ or ‘efficiency’ definition of productivity.

What is output?

While productivity is seen as the ratio of output to input, it needs to be understood as to what constitutes the numerator ‘output’. Take the case of a car tire manufacturing company. If during the financial years 1996-97 and 1997-98 its output has been as follows:

Output Year
1996-97 1997-98

No of Tyres produced 16,000 20,000
Life of a Tyre in Km 20,000 15,000
Price of a Tyre in Rupees 2,000 1,600

Has the productivity gone up or down during 1997-98?
Assume that the level of input has seen the same during both the years.
If one looks at the number of tyres produced, the year 1997-98 has shown a 25% [(20,000 – 16,000) ÷ 16,000] increase in productivity.. However, if the output is viewed as tyre-km, then the picture is reversed. The output in the year 1996-97 was 320 million tyre-km as against only 300 million tyre km in the year 1997-98 So, the productivity on this count, has gone down by

320 – 300 / 300 x 100 = 6.66%.

If the output is viewed in monetary terms, then there is no change in the output and it is constant at Rs. 32 million.

Conditions for Good Evaluation

The Point System depends upon (1) the factors used, (2) the weightages given and (3) the degree defined. The job evaluation exercise, by the Point System, can be more reliable and truthful if:

1. The factors are so chosen that they are common to all the jobs and yet bring out the differences between one job and the other. There should not be any overlaps between the factors; each factor should be distinct from the other. Moreover, the analysis should not omit any relevant factors i.e. the factors should be comprehensive.
2. Weightages for factors are a disciplined, group judgment.
3. The degrees or levels within the factors are defined clearly.

Factor Comparison System:

This method combines the features of the Ranking System and Points System. The steps involved are as follows:

1. The ‘factors’ are selected and defined as in the Points System but, there are no degrees.
2. Few ‘key’ jobs (or Benchmark Jobs) are selected and ranked under each factor. A ‘key’ job is one which has no dispute regarding the wage rates. Generally, 10-15 key jobs will suffice.
3. The average wage rate of each key job is now converted into points, by multiplying the wage rate by an arbitrary number (in order to mask the wage rate).
4. For a key job, the points so obtained are allocated under each factor according to its importance for the job. Do this for all the jobs. While doing this do not bother about the ‘ranks’ given in step (2).
5. Based on the allocated points, derive new rankings for the jobs under each factor. Compare the new ranks with the original ranks. Wherever they do not tally, remove those jobs from the list of key jobs.
6. Repeat this process till there is no disagreement between the original and new rankings.
7. Use the key jobs positioned under each factor for comparing and positioning the remaining jobs under each factor. The total points for a job can be easily arrived at by adding the point values for the job under each factor. The points can be reconverted to the pay rate or used to determine the overall hierarchy of jobs.

Example: Let A, B, C, D, E, F and G be the key jobs selected. If the average wage rates for them are Rs 10, 14, 16, 11, 20, 13 and 16 respectively, we convert these into point values by multiplying by an arbitrary number, say 23.

A= 10×23 =230 points
B= 14×23=322 points
C= 16 x 23 =368 points
D = 11x 23= 253 points
E = 20 x23= 460 points
F= 13 x 23 = 299 points
G= 16 x 23 = 368 points

The ranking of the jobs is done under each factor. This is the original ranking.

Next, for a job the point values are distributed under each factor. For Key Job A our distribution would be 60, 85, 25, 40 and 20 points under Mental Requirement, Skills, Working Conditions, Responsibility and Physical Effort factors, respectively. This is our assessment of the relative importance of each factor for Job A. We do this for all the key jobs. Based on the points noted for each key job under a factor, compute the new rankings under that factor. This is followed for all factors.

The same could be expressed in the form of a calibrated scale for each factor. In actual practice, depending upon the total number of jobs being evaluated a few more supplementary jobs may be added to the key jobs so that the scales under each factor are clear and spread out.

The other jobs can then be positioned in the scale under each factor by comparing with the key jobs. It may be mentioned that the key jobs, in addition to being non-disputable, should preferably cover as wide a range of points as possible under each factor. They should include in themselves a wide range of existing pay scales including the highest and lowest paid jobs.

Merits and Demerits of Factors Comparison System:

One of the advantages of the Factor Comparison system is that the evaluation is based on benchmark or key jobs about which there is no dispute. Everything is in comparison to these key jobs which constitute the steps of the rating scales. Moreover, because of the very same fact, the rating scales are tailor made so to say, for an organization.
But what leaves a ring of suspicion about this system is its feature of using monetary points for arriving at these ratings scales. Due to this, many feel that this system may perpetuate existing inequalities thus negating the purpose of the job evaluation.