Date: 2021 Mar, 4th
Status: Draft
Tickets: https://issues.redhat.com/browse/IDMDS-302
See backend redesign (initial phases)
This document is the continuation of Ludwig’s work on Berkeley database removal.
It focuses on:
Working on Design (in other words very unstable draft)
plugin directory: .../back-ldbm/db-mdb
sources files are prefixed with mdb_
function names are prefixed with dbmdb_ (could not use mdb_, which conflicts with the lmdb library)
lmdb libraries: provided by the packages lmdb-devel, lmdb-doc and lmdb-libs
Note: lmdb documentation is in file:///usr/share/doc/lmdb-doc/html/index.html once lmdb-doc is installed.
The plan for this phase is to minimize the impact outside of the mdb plugin as much as possible, so we do not provide pointers to the db mmap outside of the plugin. The global plugin design is similar to the bdb plugin.
- Max Dbs question: (txn cost depends linearly on the max number of dbs)
if we split per suffix we can keep it smaller
- Consistency of txn across suffixes
The question is important for write txn as there is no way to commit them
in a single step (Is there write txn across different suffixes ?)
- Consistency of changelog (Not an issue as the changelog is already per suffix)
- Consistency with existing bdb model (today bdb_make_env is called once,
  with the <INSTALLDIR>/var/lib/dirsrv/slapd-<INSTANCE>/db path)
==> I suspect that we will have to go for a single MDB_env in first phase
The whole db is a single mmap file and lmdb does not provide any interface to list the db names. Two solutions:
==> This will impact the dbstat tool too, as we cannot look for files anymore and we need a way to list the existing suffixes and the existing files in suffixes
- MAXDBS (cf mdb_env_set_maxdbs)
- DBMAXSIZE (cf mdb_env_set_mapsize)
- mdb_env_set_maxreaders: should be around 1 per working thread +
  1 per agmt,
  so we could have an auto tuning value that uses the number of
  working threads + 30
Note: changing these parameters requires db env closure (i.e: restart the instance in first implementation)
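The auto tuning rule above can be sketched as a small helper. This is only an illustration: the function name `dbmdb_pick_max_readers` and the macro `DBMDB_READERS_MARGIN` are hypothetical, and the convention that a configured value of 0 means "auto tune" is taken from the nsslapd-mdb-max-readers description in this document.

```c
/* Margin added on top of the working threads (covers agmts and tools).
 * The value 30 comes from the text; the macro name is hypothetical. */
#define DBMDB_READERS_MARGIN 30

/*
 * Hypothetical helper: compute the value that would be passed to
 * mdb_env_set_maxreaders().
 * configured == 0 means "auto tune from the number of working threads".
 */
unsigned int
dbmdb_pick_max_readers(unsigned int configured, unsigned int nb_working_threads)
{
    if (configured != 0) {
        /* An explicit configuration always wins */
        return configured;
    }
    return nb_working_threads + DBMDB_READERS_MARGIN;
}
```

Since the result is only used when the environment is opened, changing it still requires closing the db env as noted above.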
Here are the limits that I measured in my test.
Database type | Key max | Data max |
---|---|---|
No dup Support | 511 | > 6 GB |
Dup Support | 511 | 511 |
511 is the mdb_env_get_maxkeysize(env) hardcoded limit. Got a bit more than 6 GB in a db with size = 10GB.
**Note from Thierry:** pierre: regarding LMDB keylen, IPA is using ‘eq’ index on attributes (usercertificate, publickey, ...) with keys that are longer than 500 bytes
- db2ldif/db2bak MDB_RDONLY
- ns-slapd 0
- offline bak2db/ldif2db MDB_NOSYNC; use mdb_env_sync() and fflush(mdb_env_get_fd()) before closing the env (depending on the ldif file or the mmap size we may also use the MDB_WRITEMAP flag)
- online bak2db (and maybe online db2ldif) duplicate the environment
- reindex (should probably be 0 to avoid breaking the whole db in case of error). Note: we may be in trouble if ldif2db fails and there are multiple backends
  and the single env strategy is used ...
db | open flags | key | value |
---|---|---|---|
id2entry | MDB_INTEGERKEY + MDB_CREATE | entryId | entry or ‘HUGE entryId nbParts’ |
entrydn | MDB_DUPSORT | normalized dn | Flag.entryId |
index | MDB_DUPSORT + MDB_DUPFIXED + MDB_CREATE | PrefixIndex | Flag.entryId |
vlvindex | MDB_DUPSORT + MDB_DUPFIXED + MDB_CREATE | PrefixIndex | Flag.entryId |
changelog/retrochangelog | MDB_CREATE | csn | change |
#dbname | MDB_CREATE | bename/dbfilename | openFlags |
#huge | MDB_CREATE | bename’/’dbname’:’ContKeyId’.’n | EntryId.complete Key value |
| | | #maxKeyId | max ContKeyId value |
PrefixIndex is the usual index type prefix (i.e. ‘=’, ‘*’, ‘?’ or ‘:MatchingRuleOID:’) concatenated with the key value. Flag.entryId is:
If a txn is provided by dblayer, it is a write txn and should be used for the operation. In the other case:
In all cases, data read from the db is duplicated (strdup/copy) into dbi_val_t
(in phase 4a no pointers towards the memory mmap are kept outside the db-mdb plugin)
BDB implements the vlv index by using an index whose key is based on the entry value and the vlv search sort order (so that the vlv index records are directly sorted according to the vlv sort order). The data is the entry ID. Note: the key is usually accessed by its position in the tree (i.e. RECNO) rather than by its value.
Unlike bdb, lmdb does not implement a recno lookup, so we cannot use that. We use a VLV index as in bdb and, for each VLV index, a second “vlv cache” database that contains the following records:
- Key: D{key}{data}{key-size} Data: {key-size}{data-size}{recno}{key}{data}
- Key: R{recno} Data: {key-size}{data-size}{recno}{key}{data}
- Key: OK Data: OK

The data and recno records exist for all vlv records whose recno modulo RECNO_CACHE_INTERVAL (1000) is 1.
When looking up a recno in the cache, we look for the nearest key smaller than or equal to R{recno}. Then we look up the key/data in the vlv index and move the vlv index cursor one record at a time until we reach the wanted recno.
When removing/adding a key from vlv index the cache is cleared and rebuilt at next query.
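The arithmetic behind the recno lookup can be sketched as follows. This is a minimal illustration, assuming recno numbering starts at 1 (which is what makes "recno modulo 1000 is 1" select records 1, 1001, 2001, ...); the function names are hypothetical and no lmdb call is involved.

```c
#include <stdint.h>

#define RECNO_CACHE_INTERVAL 1000

/* A vlv record has D/R cache records when its recno modulo the interval is 1 */
int
vlv_recno_is_cached(uint64_t recno)
{
    return (recno % RECNO_CACHE_INTERVAL) == 1;
}

/*
 * Nearest cached recno that is smaller than or equal to the wanted one.
 * Assumption: recno >= 1.
 */
uint64_t
vlv_cache_anchor(uint64_t recno)
{
    return ((recno - 1) / RECNO_CACHE_INTERVAL) * RECNO_CACHE_INTERVAL + 1;
}

/* Number of single cursor moves needed in the vlv index after
 * positioning on the anchor record */
uint64_t
vlv_steps_from_anchor(uint64_t recno)
{
    return recno - vlv_cache_anchor(recno);
}
```

With this layout a lookup never needs more than RECNO_CACHE_INTERVAL - 1 cursor moves, at the price of one cached record per 1000 vlv records.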
BDB supports two kinds of bulk read operations:
In MDB there is not much support for bulk read operations (not very surprising because read operations are pretty fast anyway). The interesting point is that we could avoid the copy overhead for bulk operations, because in bdb the returned data is stored in a local buffer and no longer used once the cursor is released, so:
**IMPORTANT NOTE:** The above description is not doable for bulk record reads. The issue is that, in the changelog case, the cursor gets closed between the value retrieval phase and the result iteration phase.
==> Neither keeping a cursor reference nor a pointer towards the mmap is safe.
==> We will have to copy the keys and data into the buffer until reaching one of the following:
* End of cursor
* Buffer size limit
* Limit on the number of records in a bulk operation
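The copy loop with its three stop conditions can be sketched as below. To keep the example self-contained, an in-memory array stands in for the lmdb cursor; the types `rec_t` and `bulk_stop_t`, the function `bulk_fill` and the buffer layout (key length, data length, key bytes, data bytes) are all hypothetical choices made for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one record returned by a cursor */
typedef struct {
    const void *key;
    size_t key_len;
    const void *data;
    size_t data_len;
} rec_t;

typedef enum {
    BULK_END_OF_CURSOR, /* no more records to read */
    BULK_BUFFER_FULL,   /* next record does not fit in the buffer */
    BULK_MAX_RECORDS    /* record count limit reached */
} bulk_stop_t;

/*
 * Copy records starting at *pos into buf and report why the copy stopped.
 * *used is the number of bytes written, *copied the number of records.
 * Note: if the first record alone is larger than buf_size, the function
 * returns BULK_BUFFER_FULL with *copied == 0 and the caller must cope.
 */
bulk_stop_t
bulk_fill(const rec_t *recs, size_t nrecs, size_t *pos,
          unsigned char *buf, size_t buf_size, size_t max_records,
          size_t *used, size_t *copied)
{
    *used = 0;
    *copied = 0;
    while (*pos < nrecs) {
        const rec_t *r = &recs[*pos];
        size_t need = 2 * sizeof(size_t) + r->key_len + r->data_len;

        if (*copied >= max_records) {
            return BULK_MAX_RECORDS;
        }
        if (need > buf_size - *used) {
            return BULK_BUFFER_FULL;
        }
        /* The bytes are copied, so nothing points into the mmap afterwards */
        memcpy(buf + *used, &r->key_len, sizeof(size_t));
        *used += sizeof(size_t);
        memcpy(buf + *used, &r->data_len, sizeof(size_t));
        *used += sizeof(size_t);
        memcpy(buf + *used, r->key, r->key_len);
        *used += r->key_len;
        memcpy(buf + *used, r->data, r->data_len);
        *used += r->data_len;
        (*pos)++;
        (*copied)++;
    }
    return BULK_END_OF_CURSOR;
}
```

Because `*pos` is kept by the caller, the next bulk call can resume where the previous one stopped, even if the underlying cursor was closed in between.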
Here are available values
size_t ms_entries
Here is what openldap monitors:
Attribute | Description | IMHO Notes |
---|---|---|
olmDbDirectory | Path name of the directory where the database environment resides | should not be a monitored value but a config one |
olmMDBPagesMax | Maximum number of pages | |
olmMDBPagesUsed | Number of pages in use | |
olmMDBPagesFree | Number of free pages | |
olmMDBReadersMax | Maximum number of readers | Is also a config attribute |
olmMDBReadersUsed | Number of readers in use | |
olmMDBEntries | Number of entries in DB | |
mdb hardcodes some limits on the record size:
**Note from Thierry:** pierre: regarding LMDB keylen, IPA is using ‘eq’ index on attributes (usercertificate, publickey, ...) with keys that are longer than 500 bytes. So we have to somehow support search on long keys.
A solution is to replace the long key value by a smaller one (typically a hash) just before encrypting the keys (i.e. in attrcrypt_encrypt_index_key) if needed:
Before checking whether the key value must be encrypted, check whether the key length is smaller than max_key_size. If it is not the case (the key is too long):
- replace the value by: HASH_PREFIX ORIGINAL_KEY_PREFIX bin2hexa(hash(original_key_value))
- disable filterbypass
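The key replacement described above could look like the following sketch. Several things here are assumptions and not part of the design: the hash is a 64-bit FNV-1a used only as a placeholder (a real implementation would likely pick a stronger hash), HASH_PREFIX is arbitrarily set to ‘#’, ORIGINAL_KEY_PREFIX is read as the index type prefix of the original key, and the function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_KEY_SIZE 511 /* value reported by mdb_env_get_maxkeysize() */
#define HASH_PREFIX '#'  /* hypothetical marker for hashed keys */

/* Placeholder hash (64-bit FNV-1a); NOT the hash chosen by the design */
static uint64_t
fnv1a64(const unsigned char *p, size_t len)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/*
 * Build the index key in out (at least MAX_KEY_SIZE + 1 bytes).
 * If prefix + value fits in MAX_KEY_SIZE it is copied as is, otherwise it
 * is replaced by HASH_PREFIX, the original prefix and the hash in hexa.
 * *hashed is set to 1 in the second case, so that the caller knows that
 * filterbypass must be disabled. Returns the key length.
 */
size_t
shorten_index_key(char index_prefix, const unsigned char *val, size_t val_len,
                  char *out, int *hashed)
{
    if (1 + val_len <= MAX_KEY_SIZE) {
        out[0] = index_prefix;
        memcpy(out + 1, val, val_len);
        out[1 + val_len] = '\0';
        *hashed = 0;
        return 1 + val_len;
    }
    *hashed = 1;
    return (size_t)snprintf(out, MAX_KEY_SIZE + 1, "%c%c%016llx",
                            HASH_PREFIX, index_prefix,
                            (unsigned long long)fnv1a64(val, val_len));
}
```

Two different long values can hash to the same key, which is why the filter test cannot be bypassed for hashed keys: the candidate entries must be checked against the original value.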
At implementation level it means:
- Ignore it (return DBI_NOT_FOUND when looking for it)
and reject it when trying to set the key
- Split the key and have a continuation mechanism. For example:
{<Prefix>KeyPart0 ---> idPart1 (where <Prefix> is = ~ :id: as usual)
.<idPart1>.Keypart1 ---> idPart2
.<idPartN>}KeypartN ---> idEntry
#MaxPartID --> idPartN (or something else anyway that is the max id )
- Have a specific table for oversized keys in which we store:
key hash -> id + key
- use a specific index type based on hash

## ldif import/export ##
mdb architecture impacted both the ldif export and ldif import:
- on the export side the change is minor and due to the fact that the entries read from id2entry
  are direct pointers towards the read-only memory map (but the ldif_getline libldap function that
  reads entries from ldif was directly modifying the memory to remove \r).
  The ldif_getline_ro function was rewritten without tampering with the memory,
  and a dup_ldif_line function was also added to copy the line into a berval
- several issues impacted the ldif import (mostly related to the fact that we can only have 1 open write txn at a time)
- and some write operations (typically those needed to determine the children/parent relationship)
  should be done in a synchronous way while most of the others may be asynchronous
- to limit the code impact at backend level the back_txn struct was modified to add a callback
  (back_special_handling_fn)
- The backend code:
- BTXNACT_INDEX_ADD, /* data is a index_update_t */
- BTXNACT_INDEX_DEL, /* data is a index_update_t */
- BTXNACT_VLV_ADD, /* data is an entry ID */
- BTXNACT_VLV_DEL, /* data is an entry ID */
- BTXNACT_ID2ENTRY_ADD, /* data is the entry */
- BTXNACT_ENTRYRDN_ADD, /* key is a srdn, data is an id */
- BTXNACT_ENTRYRDN_DEL /* key is a srdn, data is an id */
is modified to call the callback (if set) with the caller location (one of the values listed above)
instead of writing to the database.
- at db-mdb level:
- A new “writing thread” is created among the thread pool; it handles two operation queues
- The usual foreman and worker threads now use a pseudo_back_txn_t (a back_txn with the callback set, followed by some context to identify the worker info and the target file) when calling back the dblayer functions, and they queue the open or write action directly in the right writing thread queue (and wait for success/failure if it is the synchronous queue, i.e. entryid, parentid and entryrdn)
- The pseudo_back_txn_t callback also queues the operation in the right queue as above
- and the writing thread loops waiting on available operations:
  for each operation on the synchronous queue, perform the operation and send the result to the calling worker thread; otherwise perform an asynchronous operation.
BTW there is a third queue needed to handle the very special case of upgrade (when entries need to be rewritten); in this case a temporary file is used to queue the operations. Thinking more about it, it will probably be safer to duplicate id2entry into a special db, then clear id2entry, wire the provider on that special db (rather than on id2entry) and perform the import as usual. Anyway there is nothing to upgrade right now, so we have time to change that.
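The callback mechanism used by the import can be illustrated with the sketch below. Only the BTXNACT_* values and the back_special_handling_fn name come from this document; the callback signature, the `back_txn_sketch` struct, the `queued` counter and the helper functions are hypothetical simplifications and do not reflect the real back_txn layout.

```c
#include <stddef.h>

/* Caller locations, as listed in this document */
typedef enum {
    BTXNACT_INDEX_ADD,
    BTXNACT_INDEX_DEL,
    BTXNACT_VLV_ADD,
    BTXNACT_VLV_DEL,
    BTXNACT_ID2ENTRY_ADD,
    BTXNACT_ENTRYRDN_ADD,
    BTXNACT_ENTRYRDN_DEL
} back_txn_action;

struct back_txn_sketch;

/* Hypothetical signature for the callback */
typedef int (*back_special_handling_fn)(struct back_txn_sketch *txn,
                                        back_txn_action action,
                                        const void *key, const void *data);

/* Simplified stand-in for back_txn / pseudo_back_txn_t */
typedef struct back_txn_sketch {
    void *db_txn;                                 /* real txn, unused here */
    back_special_handling_fn special_handling_fn; /* NULL outside import */
    int queued;                                   /* stands for the writer queue */
} back_txn_sketch;

/* Counts the writes done directly in the database (stand-in for a db put) */
int direct_writes = 0;

static int
write_index_record(const void *key, const void *data)
{
    (void)key;
    (void)data;
    direct_writes++;
    return 0;
}

/* Import side: queue the operation for the writing thread instead of writing */
int
import_queue_cb(back_txn_sketch *txn, back_txn_action action,
                const void *key, const void *data)
{
    (void)action;
    (void)key;
    (void)data;
    txn->queued++;
    return 0;
}

/* Backend side: call the callback if it is set, else write to the db */
int
backend_index_add(back_txn_sketch *txn, const void *key, const void *data)
{
    if (txn != NULL && txn->special_handling_fn != NULL) {
        return txn->special_handling_fn(txn, BTXNACT_INDEX_ADD, key, data);
    }
    return write_index_record(key, data);
}
```

The benefit of this shape is that the backend code only gains one test per write location, while all the queueing logic stays inside the db-mdb plugin.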
Entry: cn=bdb,cn=config,cn=ldbm database,cn=plugins,cn=config
Parameters similar to the bdb ones:
Name | Default Value | Comment |
---|---|---|
nsslapd-db-home-directory | ||
nsslapd-search-bypass-filter-test | on | More a backend parameter than a bdb one |
nsslapd-serial-lock | on | More a backend parameter than a bdb one |
mdb specific parameters:
Name | Default Value | Comment |
---|---|---|
nsslapd-mdb-max-size | 0 | 0 means disk remaining size (when creating the db). Supported value: a number followed by a suffix (typically M/G/T). Note: the value is rounded down to a multiple of 10 MB |
nsslapd-mdb-max-readers | 0 | 0 means number of working threads + 30 |
nsslapd-mdb-max-dbs | 128 | |
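Parsing the nsslapd-mdb-max-size value could be sketched as follows. This is an illustration under assumptions: the function name `parse_mdb_max_size` is hypothetical, the suffixes are interpreted as binary multiples (1M = 1024 * 1024 bytes), only M, G and T (or no suffix, meaning bytes) are accepted, and "10 MB" is taken as 10 * 1024 * 1024 bytes.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Rounding granularity: 10 MB (assumed to be 10 * 1024 * 1024 bytes) */
#define MDB_SIZE_GRANULARITY (10ULL * 1024 * 1024)

/*
 * Parse "<number>[M|G|T]" into a byte count rounded down to a multiple
 * of MDB_SIZE_GRANULARITY. Returns 0 on success and -1 on a malformed or
 * overflowing value. A result of 0 keeps its "disk remaining size"
 * meaning; note that values below 10 MB also round down to 0.
 */
int
parse_mdb_max_size(const char *s, uint64_t *out)
{
    char *end = NULL;
    unsigned long long n;
    uint64_t mult = 1;

    /* strtoull() accepts signs and leading spaces: reject them here */
    if (s == NULL || s[0] < '0' || s[0] > '9') {
        return -1;
    }
    errno = 0;
    n = strtoull(s, &end, 10);
    if (errno == ERANGE) {
        return -1;
    }
    switch (*end) {
    case 'M': case 'm': mult = 1ULL << 20; end++; break;
    case 'G': case 'g': mult = 1ULL << 30; end++; break;
    case 'T': case 't': mult = 1ULL << 40; end++; break;
    case '\0': break;
    default: return -1;
    }
    if (*end != '\0' || n > UINT64_MAX / mult) {
        return -1;
    }
    n *= mult;
    *out = n - (n % MDB_SIZE_GRANULARITY);
    return 0;
}
```

For instance "25M" is stored as 20971520 bytes (two steps of 10 MB), since the remaining 5 MB are dropped by the rounding.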
Added a few debugging modules to trace:
The debug module sources are in mdb_debug.* files and triggered by:
These are raw ideas (that would need some benefit/cost evaluation):
- Keeping pointers to the db mmap in dbi_val_t (avoids duplicating each value)
- Could have a single operation-wide txn for all operations except search; for search operations we could use the read txn and the cursor reopen feature after each entry (or group of entries)