Category Archives: web clips

Notes on Distributed Systems for Young Bloods – Something Similar

I’ve been thinking about the lessons distributed systems engineers learn on the job. A great deal of our instruction is through scars made by mistakes made in production traffic. These scars are useful reminders, sure, but it’d be better to have more engineers with the full count of their fingers.

New systems engineers will find the Fallacies of Distributed Computing and the CAP theorem as part of their self-education. But these are abstract pieces without the direct, actionable advice the inexperienced engineer needs to start moving[1]. It’s surprising how little context new engineers are given when they start out.

Below is a list of some lessons I’ve learned as a distributed systems engineer that are worth being told to a new engineer. Some are subtle, and some are surprising, but none are controversial. This list is for the new distributed systems engineer to guide their thinking about the field they are taking on. It’s not comprehensive, but it’s a good beginning.

The worst characteristic of this list is that it focuses on technical problems with little discussion of social problems an engineer may run into. Since distributed systems require more machines and more capital, their engineers tend to work with more teams and larger organizations. The social stuff is usually the hardest part of any software developer’s job, and, perhaps, especially so with distributed systems development.

Our background, education, and experience bias us towards a technical solution even when a social solution would be more efficient, and more pleasing. Let’s try to correct for that. People are less finicky than computers, even if their interface is a little less standardized.

Alright, here we go.

Distributed systems are different because they fail often. When asked what separates distributed systems from other fields of software engineering, the new engineer often cites latency, believing that’s what makes distributed computation hard.

But they’re wrong. What sets distributed systems engineering apart is the probability of failure and, worse, the probability of partial failure. If a well-formed mutex unlock fails with an error, we can assume the process is unstable and crash it. But the failure of a distributed mutex’s unlock must be built into the lock protocol.

Systems engineers that haven’t worked in distributed computation will come up with ideas like “well, it’ll just send the write to both machines” or “it’ll just keep retrying the write until it succeeds”. These engineers haven’t completely accepted (though they usually intellectually recognize) that networked systems fail more than systems that exist on only a single machine and that failures tend to be partial instead of total. One of the writes may succeed while the other fails, and so now how do we get a consistent view of the data? These partial failures are much harder to reason about.

Switches go down, garbage collection pauses make masters “disappear”, socket writes seem to succeed but have actually failed on the other machine, a slow disk drive on one machines causes a communication protocol in the whole cluster to crawl, and so on. Reading from local memory is simply more stable than reading across a few switches.

Design for failure.

Writing robust distributed systems costs more than writing robust single-machine systems. Creating a robust distributed solution requires more money than a single-machine solution because there are failures that only occur with many machines. Virtual machine and cloud technology make distributed systems engineering cheaper but not as cheap as being able to design, implement, and test on a computer you already own. And there are failure conditions that are difficult to replicate on a single machine. Whether it’s because they only occur on dataset sizes much larger than can be fit on a shared machine, or in the network conditions found in datacenters, distributed systems tend to need actual, not simulated, distribution to flush out their bugs. Simulation is, of course, very useful.

Robust, open source distributed systems are much less common than robust, single-machine systems. The cost of running many machines for long periods of time is a burden on open source communities. Hobbyists and dilettantes are the engines of open source software and they do not have the financial resources available to explore or fix many of the problems a distributed system will have. Hobbyists write open source code for fun in their free time and with machines they already own. It’s much harder to find open source developers who are willing to spin up, maintain, and pay for a bunch of machines.

Some of this slack has been taken up by engineers working for corporate entities. However, the priorities of their organization may not be in line with the priorities of your organization.

While some in the open source community are aware of this problem, it’s not yet solved. This is hard.

Coordination is very hard. Avoid coordinating machines wherever possible. This is often described as “horizontal scalability”. The real trick of horizontal scalability is independence – being able to get data to machines such that communication and consensus between those machines is kept to a minimum. Every time two machines have to agree on something, the service is harder to implement. Information has an upper limit to the speed it can travel, and networked communication is flakier than you think, and your idea of what constitutes consensus is probably wrong. Learning about the Two Generals and Byzantine Generals problems are useful here. (Oh, and Paxos really is very hard to implement; that’s not grumpy old engineers thinking they know better than you.)

If you can fit your problem in memory, it’s probably trivial. To a distributed systems engineer, problems that are local to one machine are easy. Figuring out how to process data quickly is harder when the data is a few switches away instead of a few pointer dereferences away. In a distributed system, the well-worn efficiency tricks documented since the beginning of computer science no longer apply. Plenty of literature and implementations are available for algorithms that run on a single machine because the majority of computation has been done on singular, uncoordinated machines. Significantly fewer exist for distributed systems.

“It’s slow” is the hardest problem you’ll ever debug. “It’s slow” might mean one or more of the number of systems involved in performing a user request is slow. It might mean one or more of the parts of a pipeline of transformations across many machines is slow. “It’s slow” is hard, in part, because the problem statement doesn’t provide many clues to location of the flaw. Partial failures, ones that don’t show up on the graphs you usually look up, are lurking in a dark corner. And, until the degradation becomes very obvious, you won’t receive as many resources (time, money, and tooling) to solve it. Dapper and Zipkin were built for a reason.

Implement backpressure throughout your system. Backpressure is the signaling of failure from a serving system to the requesting system and how the requesting system handles those failures to prevent overloading itself and the serving system. Designing for backpressure means bounding resource utilization during times of overload and times of system failure. This is one of the basic building blocks of creating a robust distributed system.

Common versions include dropping new messages on the floor (and incrementing a metric) if the system’s resources are already over-scheduled, and shipping errors back to users when the system determines it will be unable to finish the request in a given amount of time. Timeouts and exponential back-offs on connections and requests to other systems are also useful.

Without backpressure mechanisms in place, cascading failure or unintentional message loss become likely. When a system is not able to handle the failures of another, it tends to emit failures to another system that depends on it.

Find ways to be partially available. Partial availability is being able to return some results even when parts of your system is failing.

Search is an ideal case to explore here. Search systems trade-off between how good their results are and how long they will keep a user waiting. A typical search system sets a time limit on how long it will search its documents, and, if that time limit expires before all of its documents are searched, it will return whatever results it has gathered. This makes search easier to scale in the face of intermittent slowdowns, and errors because those failures are treated the same as not being able to search all of their documents. The system allows for partial results to be returned to the user and its resilience is increased.

And consider a private messaging feature in a web application. We easily believe that if private messaging is down, the image upload feature should probably keep working. So, consider designing for partial failure in the private messaging service itself. This takes some thought, of course. People are generally more okay with private messaging being down for them (and maybe some other users) than they are with all users having some of their messages go missing. If the service is overloaded or one of its machines are down, failing out just a small fraction of the userbase is preferable to missing data for a larger fraction. Being able to recognize these kinds of trade-offs in partial availability is good to have in your toolbox.

Metrics are the only way to get your job done. Exposing metrics (such as latency percentiles, increasing counters on certain actions, rates of change) is the only way to cross the gap from what you believe your system does in production and what it actually is doing. Knowing how the system’s behavior on day 20 is different from its behavior on day 15 is the difference between successful engineering and failed shamanism. Of course, metrics are necessary to understand problems and behavior, but they are not sufficient to know what to do next.

A diversion into logging. Log files are good to have, but they tend to lie. For example, it’s very common for the logging of a few error classes to take up a large proportion of a space in a log file but, in actuality, occur in a very low proportion of requests. Because logging successes is redundant in most cases (and would blow out the disk in most cases) and because engineers often guess wrong on which kinds of error classes are useful to see, log files get filled up with all sorts of odd bits and bobs. Prefer logging as if someone who has not seen the code will be reading the logs.

I’ve seen a good number of outages extended by another engineer (or myself) over-emphasizing something odd we saw in the log without first checking it against the metrics. I’ve also seen another engineer (or myself) Sherlock-Holmes’ing an entire set of failed behaviors from a handful of log lines. But note: a) we remember those successes because they are so very rare and b) you’re not Sherlock unless the metrics or the experiments back up the story.

Use percentiles, not averages. Percentiles (50th, 99th, 99.9th, 99.99th) are more accurate and informative than averages in the vast majority of distributed systems. Using a mean assumes that the metric under evaluation follows a bell curve but, in practice, this describes very few metrics an engineer cares about. “Average latency” is a commonly reported metric, but I’ve never once seen a distributed system whose latency followed a bell curve. If the metric doesn’t follow a bell curve, the average is meaningless and leads to incorrect decisions and understanding. Avoid the trap by talking in percentiles. Default to percentiles, and you’ll better understand how users really see your system.

Learn to estimate your capacity. You’ll learn how many seconds are in a day because of this. Knowing how many machines you need to perform a task is the difference between a long-lasting system, and one that needs to be replaced 3 months into its job. Or, worse, needs to be replaced before you finish productionizing it.

Consider tweets. How many tweet ids can you fit in memory on a common machine? Well, a typical machine at the end of 2012 has 24 GB of memory, you’ll need an overhead of 4-5 GB for the OS, another couple, at least, to handle requests, and a tweet id is 8 bytes. This is the kind of back of the envelope calculation you’ll find yourself doing. Jeff Dean’s Numbers Everyone Should Know slide is a good expectation-setter.

Feature flags are how infrastructure is rolled out. “Feature flags” are a common way product engineers roll out new features in a system. Feature flags are typically associated with frontend A/B testing where they are used to show a new design or feature to only some of the userbase. But they are a powerful way of replacing infrastructure as well.

Suppose you’re going from a single database to a service that hides the details of a new storage solution. Have the service wrap around the legacy storage, and ramp up writes to it slowly. With backfilling, comparison checks on read (another feature flag), and then slow ramp up of reads (yet another flag), you will have much more confidence and fewer disasters. Too many projects have failed because they went for the “big cutover” or a series of “big cutovers” that were then forced into rollbacks by bugs found too late.

Feature flags sound like a terrible mess of conditionals to a classically trained object-oriented developer or a new engineer with well-intentioned training. And the use of feature flags means accepting that having multiple versions of infrastructure and data is a norm, not an rarity. This is a deep lesson. What works well for single-machine systems sometimes falters in the face of distributed problems.

Feature flags are best understood as a trade-off, trading local complexity (in the code, in one system) for global simplicity and resilience.

Choose id spaces wisely. The space of ids you choose for your system will shape your system.

The more ids required to get to a piece of data, the more options you have in partitioning the data. The fewer ids required to get a piece of data, the easier it is to consume your system’s output.

Consider version 1 of the Twitter API. All operations to get, create, and delete tweets were done with respect to a single numeric id for each tweet. The tweet id is a simple 64-bit number that is not connected to any other piece of data. As the number of tweets goes up, it becomes clear that creating user tweet timelines and the timeline of other user’s subscriptions may be efficiently constructed if all of the tweets by the same user were stored on the same machine.

But the public API requires every tweet be addressable by just the tweet id. To partition tweets by user, a lookup service would have to be constructed. One that knows what user owns which tweet id. Doable, if necessary, but with a non-trivial cost.

An alternative API could have required the user id in any tweet look up and, initially, simply used the tweet id for storage until user-partitioned storage came online. Another alternative would have included the user id in the tweet id itself at the cost of tweet ids no longer being k-sortable and numeric.

Watch out for what kind of information you encode in your ids, explicitly and implicitly. Clients may use the structure of your ids to de-anonymize private data, crawl your system in ways you didn’t expect (auto-incrementing ids are a typical sore point), or a host of other things you won’t expect.

Exploit data-locality. The closer the processing and caching of your data is kept to its persistent storage, the more efficient your processing, and the easier it will be to keep your caching consistent and fast. Networks have more failures and more latency than pointer dereferences and fread(3).

Of course, data-locality implies locality in space, but also locality in time. If multiple users are making the same expensive request at nearly the same time, perhaps their requests can be joined into one. If multiple instances of requests for the same kind of data are made near to one another, they could be joined into one larger request. Doing so often affords lower communication overheard and easier fault management.

Writing cached data back to persistent storage is bad. This happens in more systems than you’d think. Especially ones originally designed by people less experienced in distributed systems. Many systems you’ll inherit will have this flaw. If the implementers talk about “Russian-doll caching”, you have a large chance of hitting highly visible bugs. This entry could have been left out of the list, but I have a special hate in my heart for it. A common presentation of this flaw is user information (e.g. screennames, emails, and hashed passwords) mysteriously reverting to a previous value.

Computers can do more than you think they can. In the field today, there’s plenty of misinformation about what a machine is capable of from practitioners that do not have a great deal of experience.

At the end of 2012, a light web server had 6 or more processors, 24 GB of memory and more disk space than you can use. A relatively complex CRUD application in a modern language runtime on a single machine is trivially capable of doing thousands of requests per second within a few hundred milliseconds. And that’s a deep lower bound. In terms of operational ability, hundreds of requests per second per machine is not something to brag about in most cases.

Greater performance is not hard to come by, especially if you are willing to profile your application and introduce efficiencies based on your measurements.

Use the CAP theorem to critique systems. The CAP theorem isn’t something you can build a system out of. It’s not a theorem you can take as a first principle and derive a working system from. It’s much too general in its purview, and the space of possible solutions too broad.

However, it is well-suited for critiquing a distributed system design, and understanding what trade-offs need to be made. Taking a system design and iterating through the constraints CAP puts on its subsystems will leave you with a better design at the end. For homework, apply the CAP theorem’s constraints to a real world implementation of Russian-doll caching.

One last note: Out of C, A, and P, you can’t choose CA.

Extract services. “Service” here means “a distributed system that incorporates higher-level logic than a storage system and typically has a request-response style API”. Be on the lookout for code changes that would be easier to do if the code existed in a separate service instead of in your system.

An extracted service provides the benefits of encapsulation typically associated with creating libraries. However, extracting out a service improves on creating libraries by allowing for changes to be deployed faster and easier than upgrading the libraries in its client systems. (Of course, if the extracted service is hard to deploy, the client systems are the ones that become easier to deploy.) This ease is owed to the fewer code and operational dependencies in the smaller, extracted service and the strict boundary it creates makes it harder to “take shortcuts” that a library allows for. These shortcuts almost always make it harder to migrate the internals or the client systems to new versions.

The coordination costs of using a service is also much lower than a shared library when there are multiple client systems. Upgrading a library, even with no API changes needed, requires coordinating deploys of each client system. This gets harder when data corruption is possible if the deploys are performed out of order (and it’s harder to predict that it will happen). Upgrading a library also has a higher social coordination cost than deploying a service if the client systems have different maintainers. Getting others aware of and willing to upgrade is surprisingly difficult because their priorities may not align with yours.

The canonical service use case is to hide a storage layer that will be undergoing changes. The extracted service has an API that is more convenient, and reduced in surface area compared to the storage layer it fronts. By extracting a service, the client systems don’t have to know about the complexities of the slow migration to a new storage system or format and only the new service has to be evaluated for bugs that will certainly be found with the new storage layout.

There are a great deal of operational and social issues to consider when doing this. I cannot do them justice here. Another article will have to be written.

[1] Of course, Rotem-Gal-Oz’s take on the fallacies is very good.

Much love to my reviewers Bill de hÓra, Coda Hale, JD Maturen, Micaela McDonald, and Ted Nyman. Your insight and care was invaluable.

Node.js Best Practices | Just Build Something

I’ve recently been working on a lot of Node.js projects, for myself, with my students, and for national organizations. Because I’m a University instructor I’ve been getting a lot of questions about what the best practices for Node.js are from every project I’m involved with. I’ve worked with Node.js for years and know all the best practices myself, but I had never seen a list that explained the best practices to my satisfaction. So, I have put one together taking all the best practices agreed on by the community and explaining why each practice is the best way to write Node.js code.

If you want to improve these best practices in any way please don’t hesitate to create a pull request to the GitHub repo.

Here we go.

Before you comment with TL;DR notice that below are links to a description of each best practice. You can skip around and read the ones you are most interested in. The descriptions are concise and worth reading.

  1. Always Use Asynchronous Methods
  2. Never require Modules Inside of Functions
  3. Save a reference to this Because it Changes Based on Context
  4. Always “use strict”
  5. Validate that Callbacks are Callable
  6. Callbacks Always Pass Error Parameter First
  7. Always Check for “error” in Callbacks
  8. Use Exception Handling When Errors Can Be Thorwn
  9. Use module.exports not just exports
  10. Use JSDoc
  11. Use a Process Manager like upstart, forever, or pm2
  12. Follow CommonJS Standard

Always Use Asynchronous Methods

The two most powerful aspect of Node.js are it’s non-blocking IO and asynchronous runtime. Both of these aspects of Node.js are what give it the speed and robustness to serve more requests faster than other languages.

In order to take advantage of these features you have to always use asynchronous methods in your code. Below is an example showing the good and bad way to read files from a system.

The Bad Way reads a file from disk synchronously.

var data = fs.readFileSync('/path/to/file');
// use the data from the file

The Good Way reads a file from disk asynchronously.

fs.readFile('/path/to/file', function (err, data) {
	// err will be an error object if an error occured
	// data will be the file that was read

When a synchronous function is invoked the entire runtime halts. For example, above The Bad Way halts the execution of any other code that could be running while the file is read into memory. This means no users get served during this time. If your file takes five minutes to read into memory no users get served for five minutes.

By contrast The Good Way reads the file into memory without halting the runtime by using an asynchronous method. This means that if your file takes five minutes to read into memory all your users continue to get served.

Never require Modules Inside of Functions

As stated above you should always use asynchronous methods and function calls in Node. The one exception is the require function, which imports external modules.

Node.js always runs require synchronously. This is so the module being required can require other needed modules. The Node.js developers realize that importing modules is an expensive process and so they intend for it to happen only once, when you start your Node.js application. They even cache the required modules so they won’t be requried again.

However, if you require an external module from within functions your module will be synchronously loaded when those functions run and this can cause two problems.

To explain one of the problems imagine you had a module that took 30 minuets to load, which is unreasonable, but just imagine. If that module is only needed in one route handler function it might take some time before someone triggers that route and Node.js has to require that module. When this happens the server would effectively be inaccessible for 30 minutes as that module is loaded. If this happens at peak hours several users would be unable to get any access to your server and requests will queue up.

The second problem is a bigger problem but builds on the first. If the module you require causes an error and crashes the server you may not know about the error for several days, especssially is you use this module in a rarely used route handler. No one wants a call from a client at 4AM telling them the server is down.

The solution to both of these problems is to always require modules at the top of your file, outside of any function call. Save the required module to a variable and use the variable instead of re-requiring the module. Node.js will save the module to that variable and your code will run much faster.

var _ = require('underscore');

function myFunction(someArray){

	// use underscore without the need
	// to require it again
	_.sort(someArray, function(item){
		// do something with item


module.exports.myFunction = myFunction;

Save a reference to this Because it Changes Based on Context

If you have a background with Java, ActionScript, PHP, or basically any language that uses the this keyword you might think you understand how JavaScript treats the same keyword. Unfortunately you would be wrong.

Let me tell you how this is determined officially by ECMAScript.

The this keyword evaluates to the value of the ThisBinding of the current execution context.

Basically that means that the value of the this variable is determined based on context, not encapsulation, as it is in other languages.

For example, if this is used inside a function, this references the object that invoked the function. That means that if you create a constructor function (basically a class in JavaScript) which then has methods attached to it, the this variable in those methods may not refer to the constructor function (class) they are inside of.

The above happens a lot in Node, but it might be hard to understand without seeing code.

In the code below this has two different values.

function MyClass() {
	this.myMethod = function() {

var myClass = new MyClass();
myClass.myMethod(); // this resolves as the instance of MyClass

var someFunction = myClass.myMethod;
someFunction(); // this resolves as the window in a browser and the global object in Node

The best way to solve this is to preserve this as another variable and then use that other variable instead. The most common variable names to use are _this, that, self, or root.

I personally like _this or self best because _this is easy to understand and self will be understood by anyone with Python or Ruby experience as both languages use self instead of this.

After making the changes your code should look like this.

function MyClass() {
	var self = this;
	this.myMethod = function() {

var myClass = new MyClass();
myClass.myMethod(); // self resolves as the instance of MyClass

var someFunction = myClass.myMethod;
someFunction(); // self also resolves as the instance of MyClass

self now always refers to the MyClass instance.

Always “use strict”

“use strict” is a behavior flag you can add to to first line of any JavaScript file or function. It causes errors when certain bad practices are use in your code, and disallows the use of certain functions, such as with.

Believe it or not but the best place I found to describe what JavaScript’s strict mode changes is Microsoft’s JavaScript documentation.

Validate that Callbacks are Callable

As stated before Node.js uses a lot of callbacks. Node.js is also weakly typed. The compiler allows any variable to be converted to any other data type. This lack of typing can cause one big problem.

Only functions are callable.

This means that if you pass a string to a function that needed a callback function your application will crash when it tries to execute that string.

This is obviously bad, but upon first blush there is no simple way to solve it. You could wrap the execution of all callbacks in try catch statements, or you could use if statements to determine if a callback has been passed in.

However, there is a simple way of validating that callbacks are callable which requires only one line of code, and it accounts for optional callbacks as well as checking data type.

callback = (typeof callback === 'function') ? callback : function() {};

This determines if the callback is a function. If it’s not a function for any reason it creates an empty function and sets the callback to be that function. This way all callbacks are callable and optional.

Place that line at the top of each function that receives a callback and you will never crash due to uncallable callbacks again.

Callbacks Always Pass Error Parameter First

Node.js is asynchronous, which means you usually have to use callback functions to determine when your code completes.

After writing Node.js code for a while you will want to start writing your own modules, which need callback functions to be passed in by the user of your module. If an error occurs in your module, how do you communicate that to the user of the module? If you’re a Java developer you might think you should throw an exception, but throwing an exception in Node.js could potentially shutdown the server. Instead you should package the error into an object, and pass it to the callback function as the first parameter. If no error occurred you should pass null.

By convention all callback functions are passed an error as the first parameter.

function myFunction(someArray, callback){

	// an example of an error that could occur
	// if the passed in object is
	// not the right data type
	if( !Array.isArray(someArray) ){
		var err = new TypeError('someArray must be an array');
		callback(err, null);

	// ... do other stuff

	callback(null, someData);


module.export.myFunction = myFunction;

Always Check for “error” in Callbacks

As stated above, by convention an error is always the first parameter passed to any callback function. This is great for making sure your site doesn’t crash and that you can detect errors when they happen.

Now that you know what they are you should start using them. If your database query errors out you need to check for that before using the results. I’ll give you an example.

		some: 'data'
	}, function(err, someReturnedData) {

			// don't use someReturnedData
			// it's not populated

		// do something with someReturnedData
		// we know there was no error


Use Exception Handling When Errors Can Be Thrown

Most methods in Node.js will follow the “error first” convention, but some functions don’t. These functions are not Node.js specific function, they instead come from JavaScript. There are lots of functions that can cause exceptions. One of these functions is JSON.parse which throws an error if it can’t parse a string into JSON.

How do we detect this error without crashing our server?

This is a perfect time to use a classic JavaScript try catch.

var parsedJSON;

try {
	parsedJSON = JSON.parse('some invalid JSON');
} catch (err) {
	// do something with your error

if (parsedJSON) {
	// use parsedJSON

You can now be sure that the JSON was parsed correctly before using it.

This can be even more useful when using it in modules.

function parseJSON(stringToParse, callback) {

	callback = (typeof callback === 'function') ? callback : function() {};

	try {

		var parsedJSON = JSON.parse(stringToParse);

		callback(null, parsedJSON);

	} catch (err) {

		callback(err, null);




Of course the above example is slightly contrived, however, the idea of using try catch is a very good practice.

Use module.exports not exports

You might have used module.exports and exports interchangeably thinking they are the same thing and in may cases they are. However, exports is more of a helper method that collects properties and attaches them to module.exports.

So what the problem? That sounds great.

Well don’t get too excited. exports only collects properties and attaches them if module.exports doesn’t already have existing properties. If module.exports has any properties, everything attached to exports is ignored and not attached to module.exports.

module.exports = {};

exports.someProperty = 'someValue';

someProperty won’t be exported as part of the module.

var exportedObject = require('./mod');

console.log(exportedObject); // {}

The solution is simple. Don’t use exports because it can create confusing, hard to track down bugs.

module.exports = {};

module.exports.someProperty = 'someValue';

someProperty will be exported as part of the module.

var exportedObject = require('./mod');

console.log(exportedObject); // { someProperty: 'someValue' }

Use JSDoc

JavaScript is a weakly typed language. Any variable can be passed to any function without conflict, until you try to use that function.

function multiply(num1, num2) {

	return num1 * num2;


var value = multiply('Some String', 2);

console.log(value) // NaN

Obviously this is a problem above that could easily be fixed by looking at the code. But what if the code was written by someone else and uses complex parameters that you don’t really understand. You could spend several minutes tracking down the expected data type. Worst yet it might accept multiple data types, in which case it may take you longer to track it down.

The best thing to do is use JSDoc. If you’re a Java developer you will have heard of Javadoc. JSDoc is similar and at it’s simplest adds comments above functions to describe how the function works, but it can do a lot more.

Some IDEs will even use JSDoc to make code suggestions.

Use a Process Manager like upstart or forever

Keeping a Node.js progress running can be daunting. Simply using the node command is dangerous. If your Node.js server crashes the node command won’t automatically restart the process.

However, programs like upstart and forever will.

While upstart is a general purpose init daemon forever and pm2 are specific to Node.

Follow CommonJS Standard

Node.js follows a standard for writing code that varies slightly from the standards that govern writing browser based JavaScript.

This standard is called CommonJS.

While CommonJS is far too large for me to cover here it’s worth knowing about about and learning. The most important point are that it mandates certain file organization and behavior that should be expected from the CommonJS module loader (require). It also describes how internals of the Node.js system should work.

Check it out.


That was long but hopefully worth it. These are the main best practices of Node.js that everyone should be following.

Of course there are more best practices that can be followed, like only using one module.exports per module and one return per function, but these will only help with debugging.

Now that you know these best practices I hope either have validated your current way of working or your code improves and that you can write bigger better applications with less confusion.

If you have any questions please don’t hesitate to email me at I do answer quickly and love talking about programming.

Asynchronous method queue chaining in JavaScript

Thursday May 6 2010

Chaining. It's an extremely popular pattern these days in JavaScript. It's easily achieved by continually returning a reference to the same object between linked methods. However one technique you don't often see is queueing up a chain of methods, asynchronously, by which functions can be linked together independent of a callback. This discussion, of course, came from a late work night building the @anywhere JavaScript API with two other mad scientists, Russ D'Sa (@dsa) and Dan Webb (@danwrong). Anyway, let's have a look at some historical conventions and compare them to newer ones. Imagine an iterator class that operated on arrays. It could look like this:
// no chaining

var o = new Iter(['a', 'b', 'c', 'd', 'e']);

o.filter(function(letter) {

  if (letter != 'c') { return letter; }


o.each(function(letter) {



// with chaining

new Iter(alphabet).filter(remove_letter_c).each(append);
This is a simple because we're working on a known existing object in memory (the alphabet array). However an easy way to spoil our soup is to make our methods continue to operate without existing objects. Like say, for example, a result set you had to make an async request to the server to get. Thus, imagine making this work:
In the grand scheme of things, the above example isn't too far off from from currying (which it's not). And to make another point, currying can often lead to bad coupling of code... which in its defense, is often the point as well. Some libraries even call this pattern binding... so... yeah. Anyway, to the point, here is a basic Queue implementation that can be used as a tool to build your own asynchronous method chains.
function Queue() {

  // store your callbacks

  this._methods = [];

  // keep a reference to your response

  this._response = null;

  // all queues start off unflushed

  this._flushed = false;


Queue.prototype = {

  // adds callbacks to your queue

  add: function(fn) {

    // if the queue had been flushed, return immediately

    if (this._flushed) {


    // otherwise push it on the queue

    } else {




  flush: function(resp) {

    // note: flush only ever happens once

    if (this._flushed) {



    // store your response for subsequent calls after flush()

    this._response = resp;

    // mark that it's been flushed

    this._flushed = true;

    // shift 'em out and call 'em back

    while (this._methods[0]) {




With this code, you can put it straight to work for something useful, like say, a jQuery plugin that fetches content remotely and then appends the results to your selector input. For you plugin developers our there, it would look like this...

<script src="jquery.js"></script>

<script src="async-queue.js"></script>


(function($) {

  $.fn.fetch = function(url) {

    var queue = new Queue;

    this.each(function() {

      var el = this;

      queue.add(function(resp) {





      url: url,

      dataType: 'html',

      success: function(html) {




    return this;




Then voila! You can make your DOM queries, fetch remote content, and continue your chain, asynchronously.




Here's a brief example of showing off the example above. Point being, one can only imagine the possibilities you could do. Say for example, having multiple items in the queue waiting to operate on a response. Thus imagine this...
Your internals would look like this with the Queue.

function fetchTweet(url) {

  this.queue = new Queue;

  this.tweet = "";

  var self = this;

  ajax(url, function(resp) {

    self.tweet = resp;




fetchTweet.prototype = {

  linkify: function() {

    this.queue.add(function(self) {

      self.tweet = self.tweet.replace(/b@(w{1,20}b/g, '<a href="...">$1</a>');


  return this;


  filterBadWords: function() {

    this.queue.add(function(self) {

      self.tweet = self.tweet.replace(/b(fuck|shit|piss)b/g, "");


  return this;


  appendTo: function(selector) {

    this.queue.add(function(self) {



  return this;



And with that, you can call it a night. Cheers.

Introduction to Microservices | NGINX

This is a guest post by Chris Richardson. Chris is the founder of the original, an early Java PaaS (Platform-as-a-Service) for Amazon EC2. He now consults with organizations to improve how they develop and deploy applications. He also blogs regularly about microservices at


Microservices are currently getting a lot of attention: articles, blogs, discussions on social media, and conference presentations. They are rapidly heading towards the peak of inflated expectations on the Gartner Hype cycle. At the same time, there are skeptics in the software community who dismiss microservices as nothing new. Naysayers claim that the idea is just a rebranding of SOA. However, despite both the hype and the skepticism, the Microservice architecture pattern has significant benefits – especially when it comes to enabling the agile development and delivery of complex enterprise applications.

This blog post is the first in a 7-part series about designing, building, and deploying microservices. You will learn about the approach and how it compares to the more traditional Monolithic architecture pattern. This series will describe the various elements of the Microservice architecture. You will learn about the benefits and drawbacks of the Microservice architecture pattern, whether it makes sense for your project, and how to apply it.

Let’s first look at why you should consider using microservices.

Building Monolithic Applications

Let’s imagine that you were starting to build a brand new taxi-hailing application intended to compete with Uber and Hailo. After some preliminary meetings and requirements gathering, you would create a new project either manually or by using a generator that comes with Rails, Spring Boot, Play, or Maven. This new application would have a modular hexagonal architecture, like in the following diagram:


At the core of the application is the business logic, which is implemented by modules that define services, domain objects, and events. Surrounding the core are adapters that interface with the external world. Examples of adapters include database access components, messaging components that produce and consume messages, and web components that either expose APIs or implement a UI.

Despite having a logically modular architecture, the application is packaged and deployed as a monolith. The actual format depends on the application’s language and framework. For example, many Java applications are packaged as WAR files and deployed on application servers such as Tomcat or Jetty. Other Java applications are packaged as self-contained executable JARs. Similarly, Rails and Node.js applications are packaged as a directory hierarchy.

Applications written in this style are extremely common. They are simple to develop since our IDEs and other tools are focused on building a single application. These kinds of applications are also simple to test. You can implement end-to-end testing by simply launching the application and testing the UI with Selenium. Monolithic applications are also simple to deploy. You just have to copy the packaged application to a server. You can also scale the application by running multiple copies behind a load balancer. In the early stages of the project it works well.

Marching Towards Monolithic Hell

Unfortunately, this simple approach has a huge limitation. Successful applications have a habit of growing over time and eventually becoming huge. During each sprint, your development team implements a few more stories, which, of course, means adding many lines of code. After a few years, your small, simple application will have grown into a monstrous monolith. To give an extreme example, I recently spoke to a developer who was writing a tool to analyze the dependencies between the thousands of JARs in their multi-million line of code (LOC) application. I’m sure it took the concerted effort of a large number of developers over many years to create such a beast.

Once your application has become a large, complex monolith, your development organization is probably in a world of pain. Any attempts at agile development and delivery will flounder. One major problem is that the application is overwhelmingly complex. It’s simply too large for any single developer to fully understand. As a result, fixing bugs and implementing new features correctly becomes difficult and time consuming. What’s more, this tends to be a downwards spiral. If the codebase is difficult to understand, then changes won’t be made correctly. You will end up with a monstrous, incomprehensible big ball of mud.

The sheer size of the application will also slow down development. The larger the application, the longer the start-up time is. For example, in a recent survey some developers reported start-up times as long as 12 minutes. I’ve also heard anecdotes of applications taking as long as 40 minutes to start up. If developers regularly have to restart the application server, then a large part of their day will be spent waiting around and their productivity will suffer.

Another problem with a large, complex monolithic application is that it is an obstacle to continuous deployment. Today, the state of the art for SaaS applications is to push changes into production many times a day. This is extremely difficult to do with a complex monolith since you must redeploy the entire application in order to update any one part of it. The lengthy start-up times that I mentioned earlier won’t help either. Also, since the impact of a change is usually not very well understood, it is likely that you have to do extensive manual testing. Consequently, continuous deployment is next to impossible to do.

Monolithic applications can also be difficult to scale when different modules have conflicting resource requirements. For example, one module might implement CPU-intensive image processing logic and would ideally be deployed in AWS EC2 Compute Optimized instances. Another module might be an in-memory database and best suited for EC2 Memory-optimized instances. However, because these modules are deployed together you have to compromise on the choice of hardware.

Another problem with monolithic applications is reliability. Because all modules are running within the same process, a bug in any module, such as a memory leak, can potentially bring down the entire process. Moreover, since all instances of the application are identical, that bug will impact the availability of the entire application.

Last but not least, monolithic applications make it extremely difficult to adopt new frameworks and languages. For example, let’s imagine that you have 2 million lines of code written using the XYZ framework. It would be extremely expensive (in both time and cost) to rewrite the entire application to use the newer ABC framework, even if that framework was considerably better. As a result, there is a huge barrier to adopting new technologies. You are stuck with whatever technology choices you made at the start of the project.

To summarize: you have a successful business-critical application that has grown into a monstrous monolith that very few, if any, developers understand. It is written using obsolete, unproductive technology that makes hiring talented developers difficult. The application is difficult to scale and is unreliable. As a result, agile development and delivery of applications is impossible.

So what can you do about it?

Microservices – Tackling the Complexity

Many organizations, such as Amazon, eBay, and Netflix, have solved this problem by adopting what is now known as the Microservice architecture pattern. Instead of building a single monstrous, monolithic application, the idea is to split your application into set of smaller, interconnected services.

A service typically implements a set of distinct features or functionality, such as order management, customer management, etc. Each microservice is a mini-application that has its own hexagonal architecture consisting of business logic along with various adapters. Some microservices would expose an API that’s consumed by other microservices or by the application’s clients. Other microservices might implement a web UI. At runtime, each instance is often a cloud VM or a Docker container.

For example, a possible decomposition of the system described earlier is shown in the following diagram:


Each functional area of the application is now implemented by its own microservice. Moreover, the web application is split into a set of simpler web applications (such as one for passengers and one for drivers in our taxi-hailing example). This makes it easier to deploy distinct experiences for specific users, devices, or specialized use cases.

Each back-end service exposes a REST API and most services consume APIs provided by other services. For example, Driver Management uses the Notification server to tell an available driver about a potential trip. The UI services invoke the other services in order to render web pages. Services might also use asynchronous, message-based communication. Inter-service communication will be covered in more detail later in this series.

Some REST APIs are also exposed to the mobile apps used by the drivers and passengers. The apps don’t, however, have direct access to the back-end services. Instead, communication is mediated by an intermediary known as an API Gateway. The API Gateway is responsible for tasks such as load balancing, caching, access control, API metering, and monitoring, and can be implemented effectively using NGINX. Later articles in the series will cover the API Gateway.


The Microservice architecture pattern corresponds to the Y-axis scaling of the Scale Cube, which is a 3D model of scalability from the excellent book The Art of Scalability. The other two scaling axes are X-axis scaling, which consists of running multiple identical copies of the application behind a load balancer, and Z-axis scaling (or data partitioning), where an attribute of the request (for example, the primary key of a row or identity of a customer) is used to route the request to a particular server.

Applications typically use the three types of scaling together. Y-axis scaling decomposes the application into microservices as shown above in the first figure in this section. At runtime, X-axis scaling runs multiple instances of each service behind a load balancer for throughput and availability. Some applications might also use Z-axis scaling to partition the services. The following diagram shows how the Trip Management service might be deployed with Docker running on AWS EC2.


At runtime, the Trip Management service consists of multiple service instances. Each service instance is a Docker container. In order to be highly available, the containers are running on multiple Cloud VMs. In front of the service instances is a load balancer such as NGINX that distributes requests across the instances. The load balancer might also handle other concerns such as caching, access control, API metering, and monitoring.

The Microservice architecture pattern significantly impacts the relationship between the application and the database. Rather than sharing a single database schema with other services, each service has its own database schema. On the one hand, this approach is at odds with the idea of an enterprise-wide data model. Also, it often results in duplication of some data. However, having a database schema per service is essential if you want to benefit from microservices, because it ensures loose coupling. The following diagram shows the database architecture for the example application.


Each of the services has its own database. Moreover, a service can use a type of database that is best suited to its needs, the so-called polyglot persistence architecture. For example, Driver Management, which finds drivers close to a potential passenger, must use a database that supports efficient geo-queries.

On the surface, the Microservice architecture pattern is similar to SOA. With both approaches, the architecture consists of a set of services. However, one way to think about the Microservice architecture pattern is that it’s SOA without the commercialization and perceived baggage of web service specifications (WS-) and an Enterprise Service Bus (ESB). Microservice-based applications favor simpler, lightweight protocols such as REST, rather than WS-. They also very much avoid using ESBs and instead implement ESB-like functionality in the microservices themselves. The Microservice architecture pattern also rejects other parts of SOA, such as the concept of a canonical schema.

The Benefits of Microservices

The Microservice architecture pattern has a number of important benefits. First, it tackles the problem of complexity. It decomposes what would otherwise be a monstrous monolithic application into a set of services. While the total amount of functionality is unchanged, the application has been broken up into manageable chunks or services. Each service has a well-defined boundary in the form of an RPC- or message-driven API. The Microservice architecture pattern enforces a level of modularity that in practice is extremely difficult to achieve with a monolithic code base. Consequently, individual services are much faster to develop, and much easier to understand and maintain.

Second, this architecture enables each service to be developed independently by a team that is focused on that service. The developers are free to choose whatever technologies make sense, provided that the service honors the API contract. Of course, most organizations would want to avoid complete anarchy and limit technology options. However, this freedom means that developers are no longer obligated to use the possibly obsolete technologies that existed at the start of a new project. When writing a new service, they have the option of using current technology. Moreover, since services are relatively small it becomes feasible to rewrite an old service using current technology.

Third, the Microservice architecture pattern enables each microservice to be deployed independently. Developers never need to coordinate the deployment of changes that are local to their service. These kinds of changes can be deployed as soon as they have been tested. The UI team can, for example, perform A|B testing and rapidly iterate on UI changes. The Microservice architecture pattern makes continuous deployment possible.

Finally, the Microservice architecture pattern enables each service to be scaled independently. You can deploy just the number of instances of each service that satisfy its capacity and availability constraints. Moreover, you can use the hardware that best matches a service’s resource requirements. For example, you can deploy a CPU-intensive image processing service on EC2 Compute Optimized instances and deploy an in-memory database service on EC2 Memory-optimized instances.

The Drawbacks of Microservices

As Fred Brooks wrote almost 30 years ago, there are no silver bullets. Like every other technology, the Microservice architecture has drawbacks. One drawback is the name itself. The term microservice places excessive emphasis on service size. In fact, there are some developers who advocate for building extremely fine-grained 10-100 LOC services. While small services are preferable, it’s important to remember that they are a means to an end and not the primary goal. The goal of microservices is to sufficiently decompose the application in order to facilitate agile application development and deployment.

Another major drawback of microservices is the complexity that arises from the fact that a microservices application is a distributed system. Developers need to choose and implement an inter-process communication mechanism based on either messaging or RPC. Moreover, they must also write code to handle partial failure since the destination of a request might be slow or unavailable. While none of this is rocket science, it’s much more complex than in a monolithic application where modules invoke one another via language-level method/procedure calls.

Another challenge with microservices is the partitioned database architecture. Business transactions that update multiple business entities are fairly common. These kinds of transactions are trivial to implement in a monolithic application because there is a single database. In a microservices-based application, however, you need to update multiple databases owned by different services. Using distributed transactions is usually not an option, and not only because of the CAP theorem. They simply are not supported by many of today’s highly scalable NoSQL databases and messaging brokers. You end up having to use an eventual consistency based approach, which is more challenging for developers.

Testing a microservices application is also much more complex. For example, with a modern framework such as Spring Boot it is trivial to write a test class that starts up a monolithic web application and tests its REST API. In contrast, a similar test class for a service would need to launch that service and any services that it depends upon (or at least configure stubs for those services). Once again, this is not rocket science but it’s important to not underestimate the complexity of doing this.

Another major challenge with the Microservice architecture pattern is implementing changes that span multiple services. For example, let’s imagine that you are implementing a story that requires changes to services A, B, and C, where A depends upon B and B depends upon C. In a monolithic application you could simply change the corresponding modules, integrate the changes, and deploy them in one go. In contrast, in a Microservice architecture pattern you need to carefully plan and coordinate the rollout of changes to each of the services. For example, you would need to update service C, followed by service B, and then finally service A. Fortunately, most changes typically impact only one service and multi-service changes that require coordination are relatively rare.

Deploying a microservices-based application is also much more complex. A monolithic application is simply deployed on a set of identical servers behind a traditional load balancer. Each application instance is configured with the locations (host and ports) of infrastructure services such as the database and a message broker. In contrast, a microservice application typically consists of a large number of services. For example, Hailo has 160 different services and Netflix has over 600 according to Adrian Cockcroft. Each service will have multiple runtime instances. That’s many more moving parts that need to be configured, deployed, scaled, and monitored. In addition, you will also need to implement a service discovery mechanism (discussed in a later post) that enables a service to discover the locations (hosts and ports) of any other services it needs to communicate with. Traditional trouble ticket-based and manual approaches to operations cannot scale to this level of complexity. Consequently, successfully deploying a microservices application requires greater control of deployment methods by developers, and a high level of automation.

One approach to automation is to use an off-the-shelf PaaS such as Cloud Foundry. A PaaS provides developers with an easy way to deploy and manage their microservices. It insulates them from concerns such as procuring and configuring IT resources. At the same time, the systems and network professionals who configure the PaaS can ensure compliance with best practices and with company policies. Another way to automate the deployment of microservices is to develop what is essentially your own PaaS. One typical starting point is to use a clustering solution, such as Mesos or Kubernetes in conjunction with a technology such as Docker. Later in this series we will look at how software-based application delivery approaches such as NGINX, which easily handles caching, access control, API metering, and monitoring at the microservice level, can help solve this problem.


Building complex applications is inherently difficult. A Monolithic architecture only makes sense for simple, lightweight applications. You will end up in a world of pain if you use it for complex applications. The Microservice architecture pattern is the better choice for complex, evolving applications despite the drawbacks and implementation challenges.

In later blog posts, I’ll dive into the details of various aspects of the Microservice architecture pattern and discuss topics such as service discovery, service deployment options, and strategies for refactoring a monolithic application into services.

Stay tuned…

Building a New Application?


Break complex applications into independent and highly-reliable components to increase performance and time to market. Learn about microservices in the new ebook by O'Reilly.

Download ebook

Please enable JavaScript to view the comments powered by Disqus. blog comments powered by