Sinfonia is a service that allows hosts to share application data in a fault-tolerant, scalable, and consistent manner using a novel mini-transaction primitive. We read this paper because it provides an interesting alternative to message passing for building distributed systems.

Data Layout

At a high level, Sinfonia provides multiple independent linear address spaces that live on memory nodes. No structure is imposed on these address spaces, and they can contain arbitrary bytes of data. Data in Sinfonia is referenced using a pair: (memory-node-id, offset). Sinfonia does not perform automatic load balancing, and data placement is left to the application.

Mini-transactions

Operations on data take the form of mini-transactions, which are basically distributed compare and swap/read operations. A mini-transaction consists of a triple (compare-set, write-set, read-set). Elements in the compare-set are tuples: (memory-node-id, offset, length, data) where the first three fields describe the location and size of the data to compare, and the last field is the value expected at that location. The write-set is the same, except that the last field is the value to write to the location. Elements in the read-set are similar, but they omit the data field.

The mini-transaction performs the following operation atomically:

if compare-set comparisons match expected data {
Perform each write in write-set
Return each read in read-set
}

For example, an atomic compare and swap operation on the first byte at memory node 0 (where the expected value is “5” and the new value is “6”) would be:

(
{ (0, 0, 1, 5) },
{ (0, 0, 1, 6) },
{}
)

Mini-transactions are committed using a 2-phase commit protocol where application nodes are coordinators and memory nodes are participants, but mini-transactions operating on a single participant use only a single phase. This allows for operations that perform writes on one memory node as a result of a compare on another memory node. The paper gives several examples of powerful operations that can be implemented using mini-transactions, including: atomic reads across multiple memory nodes, compare and swap, acquiring multiple leases atomically, and changing data if a lease is held.

Mini-transactions were designed so that the operation itself can piggyback on the commit protocol for added performance. This does not work for arbitrary transactions, but the restricted operations available for mini-transactions allows this optimization. For example, a participant can vote “no” to commit a transaction if it knows that the coordinator will abort the transaction as a result of a value it is reading (this occurs when a comparison in the compare-set fails). Additionally, the results for reads in mini-transactions are included with the vote for committing in the first phase.

The design of mini-transactions permits this kind of optimization, but it makes certain operations impossible to fit in a single mini-transaction. For example, reading and copying data between memory nodes requires two mini-transactions (one to read the data and another that atomically checks that the data is still valid and writes the data to another memory node).

Design

Overall, Sinfonia provides an alternative to message passing for implementing distributed systems. Rather than explicitly sending messages, applications describe operations on shared data in terms of mini-transactions that are executed atomically using 2-phase commit.

Sinfonia uses logging and replication to provide fault tolerance and reduce downtime. The design of Sinfonia, however, does not support automatic load-balancing or caching. Load-balancing and caching are left to the application developer who is given some load information by the system.

Coordinator crashes are handled using a recovery coordinator, which is triggered when a transaction has not been committed or aborted after some timeout. The recovery coordinator asks each participant how it voted on that transaction, and the participant replies with its original vote if it had already voted or ABORT if it had not (remembering that it must abort this transaction in the future if the original coordinator reappears).

Sinfonia uses a write-ahead redo-log for performance and fault-tolerance for the participants. When a participant votes to commit, the transaction data is added to the redo-log which is replayed if the participant crashes. Participants also keep track of decided transactions so they know if they can commit the changes or not. Other participants must be contacted for their votes if the recovered participant does not know how the transaction was decided.

Comments

Sinfonia provides an interesting alternative to message passing in distributed systems, but requires application developer intervention to provide data locality, caching, and data placement. The paper details several optimizations, which are especially important for mini-transaction performance. The paper also includes interesting applications built on top of Sinfonia: a file system and a group communication service, which show how Sinfonia can be used to implement real-world applications.

Mini-transactions are somewhat limited in what operations they can perform, but the operations they do support can be efficiently executed.