This design targets a number of issues that have arisen in the server over time.
The current design of the server has operations proceed in a semi-parallel, semi-serialised fashion. Let's consider an “add” operation:
Operation Begins
|
V
Run PRE_ADD plugins
|
V
Begin backend transaction
|
V
Check Access Controls
|
V
Run BETXN_PRE_ADD plugins
|
V
Tentative add to DB
|
V
Run BETXN_POST_ADD plugins
|
V
Commit TXN to DB
|
V
Run POST_ADD plugins
|
V
Result returned to client
Now consider that these operations can run in parallel, which adds many interactions to this process.
So below is the same chart with only some of the locks added in:
(locks denoted by X)
Operation Begins
|
V
Run PRE_ADD plugins X
|
V
Begin backend transaction X
|
V
Check Access Controls X
|
V
Run BETXN_PRE_ADD plugins X
|
V
Tentative add to DB
|
V
Run BETXN_POST_ADD plugins X
|
V
Commit TXN to DB
|
V
Run POST_ADD plugins X
|
V
Result returned to client
Now imagine we are running 30 or more threads. Every single log message, plugin call, ACL check, and backend operation needs a lock. We are effectively serialised for both reads and writes at many points of our operation!
As a result, we have the following server environment:
We need a way for many of these subsystems to operate in either a lockless or a protected manner, without impacting other threads. The final goal is:
I propose that we use copy-on-write data structures to achieve this design. A copy-on-write data structure is one that allows a stable version of the tree to exist while modifications are made on a copy. When the modifications are complete, the root node pointer of the tree is pivoted. Once the previous version's reference count drops to 0, the elements of the previous tree are freed.
In the first image we see the original state of the tree. We can take a read-only transaction on txn_id 1 here.
In this image, we can see a read-only transaction on txn_id 1 and a write occurring in txn_id 2. Note the way the tree branch is copied during the write, so the content of txn_id 1 is consistent throughout its operation. After this is committed, all new read transactions would begin on txn_id 2.
Finally, we close the read transaction on txn_id 1. Since there are no more users of the txn, we can trim the unreferenced nodes and continue with the tree state as txn_id 2.
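The branch copy described above can be sketched as path copying. This is a minimal illustrative example, not the server's actual tree code: inserting copies only the nodes on the search path, and every untouched branch is shared with the previous root, so a reader holding the old root sees an unchanged tree.

```c
#include <stdlib.h>

/* Illustrative COW binary tree node; not the server's actual structure. */
typedef struct node {
    int key;
    struct node *left;
    struct node *right;
} node_t;

static node_t *node_new(int key, node_t *l, node_t *r) {
    node_t *n = malloc(sizeof(*n));
    n->key = key;
    n->left = l;
    n->right = r;
    return n;
}

/* Insert by copying only the search path. Branches not on the path are
 * shared with the previous root, so readers of the old root are
 * unaffected; old nodes are trimmed later, once no transaction
 * references them. */
static node_t *cow_insert(node_t *root, int key) {
    if (root == NULL)
        return node_new(key, NULL, NULL);
    if (key < root->key)
        return node_new(root->key, cow_insert(root->left, key), root->right);
    if (key > root->key)
        return node_new(root->key, root->left, cow_insert(root->right, key));
    return node_new(key, root->left, root->right); /* replace existing key */
}
```

Note that the writer never mutates a node reachable from the old root, which is what makes concurrent lock-free reads of the old snapshot safe.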
Multiple read transaction states can be present. In other words, if you do:
Begin Read 1
Write 2
Begin Read 3
Write 4
Begin Read 4
Write 5
Close 1
Close 3
Close 4
Transactions 1, 3 and 4 are guaranteed to have their contents correct as of the time each read transaction began, despite multiple writes having occurred, even between read transactions.
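The sequence above can be modelled with a small sketch (hypothetical names, not real server code): each committed write publishes a new immutable version, and a read transaction keeps a pointer to whichever version was current when it began.

```c
/* Hypothetical snapshot model: each committed write publishes a new
 * immutable version; a read transaction pins whichever version was
 * current when it began. */
typedef struct snapshot {
    int txn_id;
    int value;
} snapshot_t;

static snapshot_t history[16]; /* zero-initialised: txn 0, value 0 */
static int current = 0;

/* A write commit pivots `current` to a new version. New reads see it;
 * readers that already began keep their pinned pointer. */
static void write_commit(int value) {
    int next = current + 1;
    history[next].txn_id = next;
    history[next].value = value;
    current = next;
}

static const snapshot_t *read_begin(void) {
    return &history[current];
}
```

Here a reader pinned at txn 1 continues to see txn 1's contents even after later commits, matching the guarantee described above. (A real implementation would also reference count versions so they can be freed; that is omitted here.)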
Another property of this design is that the lifetime of memory placed into the tree is bound to the tree and its transactions. The tree is provided with free functions, which are able to correctly dispose of the pointers given to the tree. Provided your free function is correct, you can trust the tree to properly manage the lifetime of your references.
This means we can have the following situation. When an operation begins, we take a transaction of the currently active list of plugins in the tree. These are sorted by order of operation, so we can apply them by walking the tree. During this operation, another operation rapidly disables some plugin X on the server. The next operation would not use plugin X. The first operation still holds a reference to the node containing plugin X, so the plugin would still work. Once the use count of the transaction drops to 0, the plugin is closed and freed, as we can guarantee no future transactions are within the plugin code. If the plugin were re-opened at the same time, this would be a new plugin struct, so it would not be affected by this close.
At the beginning of a new connection, we take a write transaction on:
At the beginning of a read operation by the client (not connection, operation) a set of transactions would be opened. This would be in:
At the beginning of a write operation (add, mod, del) a write transaction would open in:
Due to the change related to transactions, plugins will need to operate differently.
Some plugins may require transactions (i.e. IPA's Kerberos plugin). They will be offered plugin hooks for read begin, read close, write begin, write abort and write commit. Plugins must not block reads and writes, the same property we expect of the server. This is because we may internally need a write from within a read transaction, which would cause a deadlock with pthread_rwlock. We may change this design decision later, but it is unlikely.
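A hypothetical shape for those hooks is sketched below. The struct and function names are illustrative only, not the actual plugin API; the demo implementations simply track how many read transactions are open.

```c
/* Hypothetical hook table for plugin transactions; names are
 * illustrative, not the real plugin API. */
typedef struct plugin_txn_hooks {
    void *(*read_begin)(void *plugin_state);
    void  (*read_close)(void *txn);
    void *(*write_begin)(void *plugin_state);
    void  (*write_abort)(void *txn);
    void  (*write_commit)(void *txn);
} plugin_txn_hooks_t;

/* Demo implementations that track open read transactions. */
static int open_reads = 0;

static void *demo_read_begin(void *state) { open_reads++; return state; }
static void demo_read_close(void *txn)    { (void)txn; open_reads--; }
static void *demo_write_begin(void *state){ return state; }
static void demo_write_abort(void *txn)   { (void)txn; }
static void demo_write_commit(void *txn)  { (void)txn; }

static const plugin_txn_hooks_t demo_hooks = {
    demo_read_begin, demo_read_close,
    demo_write_begin, demo_write_abort, demo_write_commit,
};
```

Crucially, none of these hooks take a lock, which is what lets the framework uphold the "plugins must not block reads and writes" requirement.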
Plugins would be able to drop their internal locking of config etc., because they can guarantee they are the only instance in the critical segment during a write.
Plugins would be able to drop their internal locking in reads, because if they read their config from cn=config, it is guaranteed to be protected during the critical segment of a read (i.e. after the operation, the new config would take effect).
To summarise, these are the hard requirements of a plugin for this API:
As the plugin framework, we guarantee:
We request this because we will use the properties stated above, and there is a risk that plugins that do not follow these explicit rules may cause corruption.
NOTES: We may add a distinction between a sync abort and an async abort for structures that do not support async operation. We prefer async due to performance benefits.
An example of this working is a plugin that registers hooks on the transactions. In the real final version, these transactions will be tied to the global transaction manager, to help prevent double free/abort/commit.
The following is pseudo code for a working design:
struct foo {...};

struct plugin_txn {
    ref_count ...;
    struct foo *fref;
};

static struct plugin_txn *active = NULL;

fn read_operation(plugin_txn) {
    return plugin_txn->fref;
}

// Reads are happening in parallel to this, but only one write!
fn write_operation(plugin_txn) {
    if plugin_txn->fref is NULL {
        plugin_txn->fref = new_foo();
    }
    plugin_txn->fref->member = ...;
}

/*
 * Ref counts go in phases. First, while the txn is active, they increase
 * as reads are taken and decrease only to 1 (never lower). As soon as a
 * new write is committed, they begin to decrease, eventually to 0.
 */
fn dec_ref_count(plugin_txn) {
    result = atomic_decrement(plugin_txn->ref_count);
    if result == 0 {
        free(plugin_txn->fref);
        free(plugin_txn);
    }
}

fn inc_ref_count(plugin_txn) {
    atomic_increment(plugin_txn->ref_count);
}

fn read_begin() {
    inc_ref_count(active);
    return active;
}

fn read_close(plugin_txn) {
    dec_ref_count(plugin_txn);
}

// This is done async; no one else has seen the plugin_txn we wrote to!
fn abort_operation(plugin_txn) {
    free(plugin_txn);
}

// This is done in a way that blocks new reads, so we can update active.
fn commit_operation(plugin_txn) {
    dec_ref_count(active);
    active = plugin_txn;
    inc_ref_count(active);
}
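For concreteness, this scheme could be realised with C11 atomics roughly as follows. This is a sketch under assumptions: the single-writer guarantee and the blocking of new reads during commit are provided by the caller, and the global transaction manager is omitted.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct foo { int member; };

struct plugin_txn {
    atomic_int ref_count;
    struct foo *fref;
};

static struct plugin_txn *active = NULL;

static void inc_ref_count(struct plugin_txn *t) {
    atomic_fetch_add(&t->ref_count, 1);
}

/* Frees the txn (and its foo) once the last reference is dropped. */
static void dec_ref_count(struct plugin_txn *t) {
    if (atomic_fetch_sub(&t->ref_count, 1) == 1) {
        free(t->fref);
        free(t);
    }
}

static struct plugin_txn *read_begin(void) {
    inc_ref_count(active);
    return active;
}

static void read_close(struct plugin_txn *t) {
    dec_ref_count(t);
}

static struct plugin_txn *write_begin(void) {
    /* Fresh txn with ref_count 0; nobody else sees it until commit. */
    return calloc(1, sizeof(struct plugin_txn));
}

/* Async-safe: no one else has seen this txn, so just discard it. */
static void abort_operation(struct plugin_txn *t) {
    free(t->fref);
    free(t);
}

/* Assumed: new reads are blocked while `active` is swapped. */
static void commit_operation(struct plugin_txn *t) {
    struct plugin_txn *old = active;
    active = t;
    inc_ref_count(active);       /* base reference held by `active` */
    if (old != NULL)
        dec_ref_count(old);      /* drop the old base reference */
}
```

Existing readers keep the old txn alive through its reference count, so a commit never invalidates an in-flight read; the old version is freed only when its last reader closes.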
Binds require a write post-bind. There are three solutions.
This is a very large code change, with no guarantee of effectiveness.
Instead of doing this in one large change, we will make smaller changes that build towards this goal. Each change in isolation will help its component, while working towards a final state where we can realise the full benefits of the changes.
For example, the change to the connection table will benefit us without the full set of changes. The change to having ACLs in a tree will help us process operations with less lock contention.
It is key that as a team we distribute this work so we each have an understanding of the safety model of these structures (even if you can’t implement it, you need to know how it works). This way, we are not reliant on a single person to understand the entire system.
Not so much a risk, but something we should do: each of these features should be extensively tested before and after. In some cases, we may reveal that the "before" was broken!
This is not just about correctness, but about performance (no regressions), security (no leaks of information, etc.) and stability (no crashes, no memory leaks). This should involve a combination of the cmocka tests and the lib389 tests. Additionally, we should not just test these features in isolation, but commit to broad tests such as replication and load tests with these changes, to check for changes in behaviour. For example, a change to the cn=config partition may cause a change to replication that consumes the configuration.
Each feature should document its risks and associated required tests. The Pblock Breakup already documents these for the v3 compatibility guarantee.
A concern is that we are serialising write operations. This is already the case in Directory Server! The idea of this change is to serialise them in a way that unblocks them from reads, making each write individually faster, so we should see an improvement in write throughput.
Read throughput would be unblocked from writes, which should allow us to see an improvement in parallelism.
Goals:
1.4.0
cleanup source tree G1:
└── ds
├── configure.ac
├── docs
│ └── ds.doxy
├── m4
├── Makefile.am
├── rfcs
├── rpm
└── src
├── libsds
├── libslapd
├── libspal
├── ns-slapd
│ └── plugins
├── ns-slapd-tools
│ ├── dbscan
│ ├── dsktune
│ ├── infadd
│ ├── ldclt
│ ├── ldif
│ ├── migratecred
│ ├── mmldif
│ ├── pwdhash
│ └── rsearch
└── nunc-stans
COW B+Tree for configuration variable storage G4
Yes, but in itself we are proposing a pretty large lock, in a way. The difference is that the transaction we propose in this document, while it covers the full server, does not block other threads simultaneously, so it will not have the same costs associated with a fat lock.
William Brown – wibrown at redhat.com