Some notes on scalability

I have been wanting to write this down forever, if only so I could remind myself of it whenever needed. It is the kind of subject that cuts across most programming languages, frameworks, and systems, as long as you are dealing with large quantities of data. There are certainly more things one can do to let an application scale up easily, but these are the ones I have put in place in the past, and they have worked out nicely so far.

Read/write splitting (master/slave replication)

If possible (i.e. if you are not using some really restrictive database access framework — think Django, as it is right now :)), put mechanisms in place that let you read data from multiple slave DB machines, with some strategy to choose between them (round robin, least recently used, least loaded, etc.). This will allow you to scale up reading from your database.

Streaming directly from the database

This prevents a lot of in-memory data coming from the DB. I came in contact with this particular problem when having to send clients large amounts of processed data from a database. Whenever possible, shift all the work on the data to the database (SQL), and then stream the results to the client through your application. What I mean by this is: keep as little data as possible in memory in your application. For instance, MySQL by default will not stream results to your application; it will actually buffer the rows in memory and then send them to the client. If you have requests for 50 MB worth of data, you can see where your server is going after 30 concurrent requests (hint: the morgue).

Caching

Cache your requests whenever possible. Use an out-of-the-box solution (like Squid, for example) or develop your own strategy to cache data, so that you hit your databases as little as possible.

Data discretization

This one is related to caching. Figure out clever ways of discretizing your data so it is easier to cache. For instance, imagine you are serving data, which people submit to you, back out to everyone else.
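The read/write splitting idea above can be sketched as a small router that sends writes to the master and rotates reads over the slaves round-robin. This is a minimal sketch: the class and host names are made up, and a real version would hand back actual database connections rather than strings.

```python
import itertools

class ReplicaRouter:
    """Route writes to the master, spread reads over slaves round-robin."""

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)  # endless round-robin iterator

    def for_write(self):
        # All writes must go to the master so replication stays consistent.
        return self.master

    def for_read(self):
        # Each read request gets the next slave in the rotation.
        return next(self._slaves)

router = ReplicaRouter("db-master", ["db-slave-1", "db-slave-2"])
router.for_write()  # -> "db-master"
[router.for_read() for _ in range(4)]
# -> ["db-slave-1", "db-slave-2", "db-slave-1", "db-slave-2"]
```

Other strategies (least loaded, least recently used) just mean replacing the `itertools.cycle` with a smarter `for_read`.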
Any time someone submits something, you have new data that could be sent to everyone else. Instead of making this data immediately available, make it discrete: serve batches of data, say, once a day. This will allow you to cache the daily requests easily, with Squid for example.

Hibernate second level cache

For the Java/Spring/Hibernate developers out there: do make use of Hibernate's second level cache, especially for content that does not change that often. It will pay off, especially for web interfaces. For data distribution, though, I would recommend not using Hibernate at all. Think plain JDBC with some different caching solution (as mentioned earlier), streaming the results directly.

Data redundancy where it makes sense

Don't worry about sacrificing disk space on the database cluster in favor of faster query times. This is especially good when searching content in your database. That one extra redundant column in your data table will make the search much faster than using joins all over the place.

These are my thoughts on scalability. I would love to hear about your experiences and solutions in this field.

Posted via email from nocivus' ramblings
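The streaming point above can be sketched with a generator that pulls rows from a DB-API cursor a chunk at a time, so the application never holds the whole result set. Note that with MySQL this alone is not enough: the client library buffers the full result by default, so you would also want a server-side cursor (e.g. MySQLdb's `SSCursor`). The cursor and query here are stand-ins.

```python
def stream_rows(cursor, chunk=1000):
    """Yield rows one at a time instead of fetchall()-ing everything.

    The caller sees a plain iterator; at most `chunk` rows are held in
    memory at once, so 50 MB results no longer pile up in the app server.
    """
    while True:
        rows = cursor.fetchmany(chunk)  # standard DB-API call
        if not rows:
            break
        for row in rows:
            yield row

# Hypothetical usage with any DB-API connection:
# cursor.execute("SELECT id, payload FROM big_table")
# for row in stream_rows(cursor):
#     send_to_client(row)   # write straight to the response stream
```

The same idea in plain JDBC is setting a streaming fetch size on the statement before executing the query.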
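The data discretization idea can be sketched as keying the served batch by day, so every request within the same day maps to one cache entry and only the first request does real work. The dict cache and `fetch` function are stand-ins for Squid and your real data source.

```python
import datetime

cache = {}  # stand-in for Squid or any HTTP cache

def daily_batch(fetch, day=None):
    """Serve one batch per day: a stable daily key keeps absorbing hits."""
    day = day or datetime.date.today()
    key = day.isoformat()  # e.g. "2010-05-17" -- one cache entry per day
    if key not in cache:
        cache[key] = fetch()  # only the first request of the day hits the DB
    return cache[key]

calls = []
def fetch():
    calls.append(1)
    return ["submission-1", "submission-2"]

day = datetime.date(2010, 5, 17)
daily_batch(fetch, day)
daily_batch(fetch, day)
len(calls)  # -> 1: the second request came straight from the cache
```

Coarser or finer batches are just a different key function (week number, hour, etc.) — the trade-off is freshness versus cache hit rate.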