Are you memoizing for a multithreaded environment?

If you’re using the idiomatic Ruby approach to memoization, like this:

def data
  @memo ||= expensive_action
end

you might not get the behavior you expect in multi-threaded environments. If more than one thread calls #data above at the same time (or before the first expensive_action is completed, to be more specific), #expensive_action will run more than once. This might cause problems for you in the following scenarios:

  1. expensive_action is very expensive (in terms of time, API limits, usage fees, locking requirements, etc.).
  2. expensive_action has side effects.
  3. expensive_action may return a different value each time, and it’s important that all threads see the same value.

This is important to understand: only scenario #1 is actually memoization. Memoization is a time optimization applied to an idempotent task. If you need a task to be run only once, you need to combine synchronization with an indicator that the task has been completed. In single-threaded Ruby, ||= is a convenient tool to achieve both goals. In multi-threaded Ruby, you need to provide the synchronization yourself.

In all common Ruby environments, a ‘sleepy’ task (one which is IO-bound, usually) must be synchronized to be memoized efficiently. However, in GIL environments (like YARV), you can memoize a CPU-bound task without using synchronization.

Keep in mind that this is only a problem for the run-time of the expensive task. If you have a method you expect to be called intermittently by many threads, or if the cost of running it several times is not much worse than the cost of running it once, you don’t need to worry about this.

A convenient way to synchronize and memoize at the same time (see below for caveats) is the following:

class SyncedAndMemoized
  include MonitorMixin
  
  def memoized
    synchronize { @memo ||= expensive_task }
  end
  # ...
end

The test

I’ve put a demonstration script on GitHub. There are three classes demonstrating different tasks and synchronization approaches. For each one, I spin up 10 threads, each of which call a method which attempts to memoize the results of an expensive task. That method also prints a message, which allows you to see how many times the expensive task was executed.

SleepMemo attempts to memoize a ‘sleepy’ task without synchronization. The task is sleeping 1 second. In all Ruby environments, the memoization fails, because even Ruby 1.8 yields to other threads during a sleep (or during IO).

BusyMemo attempts to memoize a ‘busy’ task without synchronization. The task is computing the factorial of 10000. In GIL environments (REE 1.8, YARV 1.9) the task only runs once because the first thread to start the task blocks all other threads until it’s done. In non-GIL environments (JRuby, Rubinius 2) the task runs more than once.

SyncMemo uses the standard Ruby library ‘monitor’ to provide synchronization to the memoization task. If you only have one method to memoize, or if it’s acceptable to share the lock between all methods in the class, this is a good approach. Otherwise, you’ll have to create your own Mutex or Monitor objects to do the job. The task runs only once in all Ruby environments, regardless of the threading model, existence of a GIL, length of the task, busyness/sleepiness of the task, etc. This approach incurs a synchronization cost, but it is probably cheaper than running your expensive task more than once.

Addendum

This is not necessarily simply a problem of threading. If you’re using an evented concurrency system that might yield during the execution of the expensive task (em-synchrony, for instance), you’re still at risk. If you’re interested in learning more about Ruby concurrency, or have something you’d like to contribute, please check out ruby.io.