RDF.rb 2.0 was released this spring with support for transaction scopes. To make good on the promise of this interface, 2.0.0 also ships with a fully serializable main-memory quadstore. We achieve this with persistent Hash Array Mapped Tries, implemented in pure Ruby by the excellent Hamster gem.
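
To see what the persistent data structures buy us, here is a tiny sketch using Hamster directly. These are not RDF.rb internals, just an illustration of structural sharing: every "update" returns a new collection and leaves the original untouched.

require 'hamster'

statements = Hamster::Hash[default: Hamster::Set[:s1, :s2]]

# "inserting" builds a new hash and set; the original is untouched
updated = statements.put(:default, statements[:default].add(:s3))

statements[:default].include?(:s3) # => false
updated[:default].include?(:s3)    # => true

Readers holding one snapshot never observe another thread's writes mid-operation, which is what makes snapshot-style transactions cheap to provide.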

While our new implementation theoretically improves the concurrency story for RDF.rb, it isn’t thread-safe. The underlying data representation may be purely functional, but the Repository itself is swimming in shared mutable state. Specifically, we have the potential for a data race during execution of code like @data = data and, more generally, for race conditions wherever our changes depend on previous reads. Notably, this affects #transaction, as demonstrated in the following snippet. Running it in your environment may yield different results; you may even see the expected ones. Nevertheless, trust me: this code is not safe.

require 'rdf'
repo = RDF::Repository.new

threads = []
err_count = 0

# make 10 threads, processing 1000 transactions each
10.times do |n|
  threads << Thread.new do
    1_000.times do |i|
      begin
        repo.transaction(mutable: true) do
          # insert a unique statement for each transaction
          insert RDF::Statement("thread_#{n}".to_sym,
                                RDF::URI('http://example.com/num'),
                                i)
        end
      rescue RDF::Transaction::TransactionError
        # count the transactions that fail on execution
        err_count += 1
      end
    end
  end
end

threads.each(&:join)

# not even close to 10_000!
repo.count + err_count # => 5587

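The failure mode is easier to see with RDF.rb stripped away. The following is just an illustration of the same read-copy-reassign pattern on a shared variable, not library code; any shortfall from 10_000 is a lost update.

shared = [].freeze  # stands in for @data: an immutable value behind a shared, mutable reference

threads = 10.times.map do
  Thread.new do
    1_000.times do
      snapshot = shared                     # read the current value
      shared   = (snapshot + [:st]).freeze  # build a new value, then reassign the reference
    end
  end
end

threads.each(&:join)
shared.size # anything short of 10_000 is a lost update
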
The good news is that races are reasonably isolated. Because we have persistence at the data-structure level, the only shared state we mutate is the @data instance variable itself, reassigning it to point at a new object. Any dreams of perfectly asynchronous concurrency are dashed by these reassignments, but to get thread safety we only need to synchronize the assignments themselves. For transactions, this means #execute; in place of the repo.transaction block above, we have:

# ...
begin
  tx = repo.transaction(mutable: true)
  tx.insert RDF::Statement("thread_#{n}".to_sym,
                           RDF::URI('http://example.com/num'),
                           i)
  mutex.synchronize { tx.execute } # mutex: a single Mutex shared by all the threads
rescue RDF::Transaction::TransactionError
# ...

As an implementation-specific solution, this leaves something to be desired. A better approach is to build locks directly into the transaction (a proof-of-concept branch for this already exists). However, it’s not immediately clear that the trade-offs here are good. The synchronization overhead is a factor, even for single-threaded transactions. For multi-threaded cases, we’ll deliver on our promise of serializable transactions, but it’s not obvious that the concurrency will translate into any performance boost; it’s equally likely that we’ll just see more failed transactions. More work is needed, especially rigorous benchmarking.
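
To make the idea concrete, here is one rough sketch of pushing the lock into the transaction itself. This is not the proof-of-concept branch mentioned above, and the module name and process-wide lock are my own assumptions; it simply moves the synchronization from the caller into #execute.

require 'rdf'

# hypothetical sketch: serialize every commit through one process-wide lock
module SynchronizedExecute
  COMMIT_LOCK = Mutex.new

  def execute
    COMMIT_LOCK.synchronize { super }
  end
end

RDF::Transaction.prepend(SynchronizedExecute)

With something like this in place, the explicit tx.execute above no longer needs an external mutex, but every commit, including single-threaded ones, pays for the lock. That is exactly the overhead question raised above.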

In the end, I suspect that giving more thought to our thread-safety needs in general will uncover better options.