Database Backend Redesign - Phase 4


Status

Date: 4 March 2021

Status: Draft

Tickets: https://issues.redhat.com/browse/IDMDS-302

Motivation

See backend redesign (initial phases)

High level summary

This document continues Ludwig’s work on removing the Berkeley database. See backend redesign (initial phases).

It focuses on:

Current state

Working on the design (in other words, a very unstable draft)

Naming

plugin directory: .../back-ldbm/db-mdb
source files are prefixed with mdb_
function names are prefixed with dbmdb_ (mdb_ could not be used because it conflicts with the lmdb library)
lmdb library: provided by the packages lmdb-devel, lmdb-doc and lmdb-libs

Note: lmdb documentation is in file:///usr/share/doc/lmdb-doc/html/index.html once lmdb-doc is installed.

Design

The plan for this phase is to minimize the impact outside of the mdb plugin as much as possible, so we do not provide pointers into the db mmap outside of the plugin. The global plugin design is similar to the bdb plugin.

Architecture choices

creating a single MDB_env versus one MDB_env per backend

- Max dbs question: (txn cost depends linearly on the max number of dbs)
  if we split per suffix we can keep it smaller
- Consistency of txn across suffixes
  The question is important for write txn as there is no way to commit them
  in a single step (Are there write txn across different suffixes?)
- Consistency of changelog (Not an issue as the changelog is already per suffix)
- Consistency with existing bdb model (today bdb_make_env is called once,
  with the <INSTALLDIR>/var/lib/dirsrv/slapd-<INSTANCE>/db path)

==> I suspect that we will have to go for a single MDB_env in the first phase

db filename list

The whole db is a single mmap file and lmdb does not provide any interface to list the db names. Two solutions:

==> This will impact the dbstat tool too, as we cannot look for files anymore and we need a way to list the existing suffixes and the existing files in each suffix

mdb specific config parameters

- MAXDBS (cf mdb_env_set_maxdbs)
- DBMAXSIZE (cf mdb_env_set_mapsize)
- mdb_env_set_maxreaders: should be around 1 per working thread +
   1 per agmt,
   so we could have an auto-tuning value that uses the number of
   working threads + 30

Note: changing these parameters requires db env closure (i.e: restart the instance in first implementation)

mdb limitations

Here are the limits that I measured in my tests.

| Database type | Key max | Data max |
|---|---|---|
| No dup Support | 511 | > 6 GB |
| Dup Support | 511 | 511 |

511 is the hardcoded limit returned by mdb_env_get_maxkeysize(env). I got a bit more than 6 GB in a db with size = 10GB.

Note from Thierry: Pierre, regarding the LMDB key length, IPA uses ‘eq’ indexes on attributes (usercertificate, publickey, ...) with keys longer than 500 bytes.

Other mdb limitations

mdb_env_open flags

- db2ldif/db2bak: MDB_RDONLY
- ns-slapd: 0
- offline bak2db/ldif2db: MDB_NOSYNC; use mdb_env_sync() and fflush(mdb_env_get_fd()) before closing the env (depending on the ldif file or the mmap size we may also use the MDB_WRITEMAP flag)
- online bak2db (and maybe online db2ldif): duplicate the environment
- reindex: should probably be 0, to avoid breaking the whole db in case of error

Note: we may be in trouble if ldif2db fails, there are multiple backends and the single env strategy is used ...
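
As a rough illustration of the flag choice listed above, here is a minimal Python sketch that maps an operation to the flags it would pass to mdb_env_open. The operation labels and the function name are hypothetical; the numeric values are the ones I believe lmdb.h defines and should be checked against the installed header.

```python
# Flag values as (to my knowledge) defined in lmdb.h; verify before relying on them.
MDB_NOSYNC = 0x10000
MDB_RDONLY = 0x20000
MDB_WRITEMAP = 0x80000


def env_open_flags(operation, use_writemap=False):
    """Return the mdb_env_open flags for a (hypothetical) operation label."""
    if operation in ("db2ldif", "db2bak"):
        return MDB_RDONLY
    if operation in ("offline-bak2db", "offline-ldif2db"):
        # No sync on commit; the caller must sync explicitly before closing the env.
        flags = MDB_NOSYNC
        if use_writemap:
            flags |= MDB_WRITEMAP
        return flags
    if operation in ("ns-slapd", "reindex"):
        return 0
    raise ValueError("unknown operation: %s" % operation)
```

The online bak2db case is left out on purpose because the text handles it by duplicating the environment rather than by a flag.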

db format

| db | open flags | key | value |
|---|---|---|---|
| id2entry | MDB_INTEGERKEY + MDB_CREATE | entryId | entry or ‘HUGE entryId nbParts’ |
| entrydn | MDB_DUPSORT | normalized dn | Flag.entryId |
| index | MDB_DUPSORT + MDB_DUPFIXED + MDB_CREATE | PrefixIndex | Flag.entryId |
| vlvindex | MDB_DUPSORT + MDB_DUPFIXED + MDB_CREATE | PrefixIndex | Flag.entryId |
| changelog/retrochangelog | MDB_CREATE | csn | change |
| #dbname | MDB_CREATE | bename/dbfilename | openFlags |
| #huge | MDB_CREATE | bename’/’dbname’:’ContKeyId’.’n | EntryId.complete Key value |
| | | #maxKeyId | max ContKeyId value |

PrefixIndex is the usual index type prefix (i.e. ‘=’ or ‘*’ or ‘?’ or ‘:MatchingRuleOID:’) concatenated with the key value.

Flag.entryId is:

Type Mapping

TXN and value Handling

If a txn is provided by dblayer, it is a write txn and should be used for the operation. In the other case:

In all cases, data read from the db is copied (strdup) into a dbi_val_t
(in phase 4a no pointers into the memory map are kept outside the db-mdb plugin).

VLV and RECNO

BDB implements the vlv index by using an index whose key is based on the entry value and the vlv search sort order (so that the vlv index records are directly sorted according to the vlv sort order). The data is the entry ID. Note: the key is usually accessed by its position in the tree (i.e. RECNO) rather than by its value.

Unlike bdb, lmdb does not implement a recno lookup, so we cannot use that. We use a VLV index as in bdb and, for each VLV index, a second “vlv cache” database used as a cache. The cache contains the following records:

- Key: D{key}{data}{key-size} Data: {key-size}{data-size}{recno}{key}{data}
- Key: R{recno} Data: {key-size}{data-size}{recno}{key}{data}
- Key: OK Data: OK

The data and recno records exist for all vlv records whose recno modulo RECNO_CACHE_INTERVAL (1000) is 1.

When looking up a recno in the cache, we look up the nearest key smaller than or equal to R{recno}. Then we look up that key/data in the vlv index and move the vlv index cursor one record at a time until we reach the wanted recno.

When a key is added to or removed from the vlv index, the cache is cleared and rebuilt at the next query.
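
The recno lookup described above can be modelled with a short Python sketch. It is only an illustration under simplifying assumptions: the vlv index is a sorted in-memory list, the cache is reduced to the list of cached recnos, and the function names are hypothetical. The real cache stores the key/data of the anchor record so that the vlv cursor can be positioned on it.

```python
import bisect

RECNO_CACHE_INTERVAL = 1000  # interval quoted in the text


def build_recno_cache(vlv_records):
    """Return the cached recnos (1, 1001, 2001, ...) of a sorted vlv index."""
    return [recno for recno in range(1, len(vlv_records) + 1)
            if recno % RECNO_CACHE_INTERVAL == 1]


def lookup_by_recno(vlv_records, cache, recno):
    """Return (record, steps) for a 1-based recno, or (None, 0) if out of range."""
    if recno < 1 or recno > len(vlv_records):
        return None, 0
    # Nearest cached recno smaller than or equal to the wanted one.
    anchor = cache[bisect.bisect_right(cache, recno) - 1]
    # Position the "cursor" on the anchor record, then move one record at a time.
    position = anchor - 1  # list index of the anchor record
    steps = 0
    while position + 1 < recno:
        position += 1
        steps += 1
    return vlv_records[position], steps
```

With this layout a lookup never walks more than RECNO_CACHE_INTERVAL - 1 records after the cache hit.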

bulk operation

BDB supports two kinds of bulk read operations:

In MDB there is not much support for bulk read operations (not very surprising, because read operations are pretty fast anyway). The interesting point is that we could avoid the copy overhead for bulk operations, because in bdb the returned data is stored in a local buffer and is no longer used once the cursor is released, so:

IMPORTANT NOTE: The above description is not doable for bulk record reads. The issue is that in the changelog case the cursor gets closed between the value retrieval phase and the result iteration phase.

==> Neither keeping a cursor reference nor keeping a pointer into the mmap is safe.
==> We will have to copy the keys and data into the buffer until reaching one of the following:

- End of cursor
- Buffer size limit
- Limit on the number of records in a bulk operation
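
The copy loop with its three stop conditions can be sketched as follows. This is a Python model, not the plugin code: the cursor is represented by a list of (key, data) views plus a start position, the function name is hypothetical, and accepting the first record even when it exceeds the buffer size is my assumption to guarantee progress.

```python
def bulk_read(records, start, buffer_size, max_records):
    """Copy records[start:] into private buffers until a limit is reached.

    Returns (copied, next_position); next_position is where the next
    bulk read should resume.
    """
    copied = []
    used = 0
    pos = start
    while pos < len(records):                 # stop 1: end of cursor
        if len(copied) >= max_records:        # stop 2: record count limit
            break
        key, data = records[pos]
        size = len(key) + len(data)
        if copied and used + size > buffer_size:  # stop 3: buffer size limit
            break
        # bytes() copies the views, so nothing points into the map afterwards.
        copied.append((bytes(key), bytes(data)))
        used += size
        pos += 1
    return copied, pos
```

Because the result only holds copies, it remains valid after the cursor is closed or the underlying memory changes.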

monitoring

Here are the available values:

Here is what openldap monitors:

| Attribute | Description | IMHO Notes |
|---|---|---|
| olmDbDirectory | Path name of the directory where the database environment resides | should not be a monitored value but a config one |
| olmMDBPagesMax | Maximum number of pages | |
| olmMDBPagesUsed | Number of pages in use | |
| olmMDBPagesFree | Number of free pages | |
| olmMDBReadersMax | Maximum number of readers | Is also a config attribute |
| olmMDBReadersUsed | Number of readers in use | |
| olmMDBEntries | Number of entries in DB | |

Handling oversized index keys

mdb hardcodes some limits on the record size:

Note from Thierry: Pierre, regarding the LMDB key length, IPA uses ‘eq’ indexes on attributes (usercertificate, publickey, ...) with keys longer than 500 bytes. So we have to somehow support searches on long keys.

implemented solution

A solution is to replace the long key value by a smaller one (typically a hash) just before encrypting the keys (i.e. in attrcrypt_encrypt_index_key) if needed:

Before checking whether the key value must be encrypted, check that the key length is smaller than max_key_size. If it is not (i.e. the key is too long):

- replace the value by: HASH_PREFIX ORIGINAL_KEY_PREFIX bin2hexa(hash(original_key_value))
- disable filterbypass
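
The replacement rule above can be sketched in Python as follows. The HASH_PREFIX value, the choice of SHA-256 and the function name are assumptions made for illustration; the text only says that a hash is used and does not name it. The index prefix is passed explicitly here for simplicity.

```python
import hashlib

MAX_KEY_SIZE = 511    # mdb_env_get_maxkeysize() value quoted in this document
HASH_PREFIX = b"#"    # hypothetical marker, not the plugin's real constant


def shorten_index_key(key, index_prefix):
    """Return (key_to_store, filterbypass_allowed).

    key is the full index key (prefix included); index_prefix is the
    index type prefix such as b"=".
    """
    if len(key) < MAX_KEY_SIZE:
        return key, True
    # SHA-256 is an assumption; hexdigest() plays the role of bin2hexa(hash(...)).
    digest = hashlib.sha256(key).hexdigest().encode("ascii")
    # The stored key no longer carries the original value, so the filter
    # test can no longer be bypassed for this key.
    return HASH_PREFIX + index_prefix + digest, False
```

Lookups must apply the same transformation to the search key, so that a long value maps to the same stored key on both the write and the read side.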

At implementation level it means:

rejected alternatives

- Ignore it (return DBI_NOT_FOUND when looking for it) 
   and reject it when trying to set the key
- Split the key and have a continuation mechanism. For example:
    {<Prefix>KeyPart0   ---> idPart1 (where <Prefix> is = ~ :id: as usual)
    .<idPart1>.Keypart1 ---> idPart2
    .<idPartN>}KeypartN ---> idEntry
    #MaxPartID --> idPartN (or something else anyway that is the max id )
- Have a specific table for oversized keys in which we store:
	key hash -> id + key 
- Use a specific index type based on hash

ldif import/export

The mdb architecture impacted both the ldif export and the ldif import:
- On the export side the change is minor and due to the fact that the entries read from id2entry
  are direct pointers into the read-only memory map (but the ldif_getline libldap function that
  reads entries from ldif was directly modifying the memory to remove \r).
  The ldif_getline_ro function was rewritten without tampering with the memory,
  and a dup_ldif_line function was also added to copy the line into a berval.
- Several issues impacted the ldif import, mostly related to the fact that we can only have 1 open write txn at a time
  and that some write operations (typically the operations needed to determine the children/parent relationship)
  should be done in a synchronous way while most of the others may be asynchronous.
- To limit the code impact at backend level, the back_txn struct was modified to add a callback
   (back_special_handling_fn).
- The backend code:
    - BTXNACT_INDEX_ADD,            /* data is a index_update_t */
    - BTXNACT_INDEX_DEL,            /* data is a index_update_t */
    - BTXNACT_VLV_ADD,              /* data is an entry ID */
    - BTXNACT_VLV_DEL,              /* data is an entry ID */
    - BTXNACT_ID2ENTRY_ADD,         /* data is the entry */
    - BTXNACT_ENTRYRDN_ADD,         /* key is a srdn, data is an id */
    - BTXNACT_ENTRYRDN_DEL          /* key is a srdn, data is an id */
  is modified to call the callback (if set) with the caller location (as listed above)
  instead of writing in the database.
- At db-mdb level:
    - A new "writing thread" is created among the thread pool; it handles two operation queues.
    - The usual foreman and worker threads now use a pseudo_back_txn_t (a back_txn with the callback set, followed by some context to identify the worker info and the target file) when calling back the dblayer functions, and they queue the open or write action directly in the right writing thread queue (and wait for success/failure if it is the synchronous queue, i.e. entryid, parentid and entryrdn).
    - The pseudo_back_txn_t callback also queues the operation in the right queue, as above.
    - The writing thread loops waiting for available operations:
       for each operation on the synchronous queue, perform the operation and send the result to the calling worker thread; otherwise perform the asynchronous operation.

BTW there is a third queue needed to handle the very special case of upgrade (when entries need to be rewritten); in this case a temporary file is used to queue the operations. (But thinking more about it, it will probably be safer to duplicate id2entry toward a special db, then clear id2entry, wire the provider on that special db (rather than id2entry) and perform the import as usual. Anyway there is nothing to upgrade right now, so we have time to change that.)
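
The two-queue writing thread described in this section can be modelled with the Python sketch below. It is a simplified illustration and not the db-mdb code: the database is a plain dict, the class and method names are hypothetical, and serving the synchronous queue before the asynchronous one is my assumption (the text does not state a priority).

```python
import threading
from collections import deque


class WriterThread:
    """Single writer that owns all writes and serves two operation queues."""

    def __init__(self, db):
        self.db = db                  # dict standing in for the database
        self.sync_q = deque()         # callers wait for the result
        self.async_q = deque()        # fire and forget
        self.cond = threading.Condition()
        self.stopping = False
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def write_async(self, key, value):
        with self.cond:
            self.async_q.append((key, value))
            self.cond.notify()

    def write_sync(self, key, value):
        done = threading.Event()
        result = {}
        with self.cond:
            self.sync_q.append((key, value, done, result))
            self.cond.notify()
        done.wait()                   # block until the writer reports back
        return result["rc"]

    def stop(self):
        with self.cond:
            self.stopping = True
            self.cond.notify()
        self.thread.join()

    def _run(self):
        while True:
            with self.cond:
                while not (self.sync_q or self.async_q or self.stopping):
                    self.cond.wait()
                if self.sync_q:
                    op, is_sync = self.sync_q.popleft(), True
                elif self.async_q:
                    op, is_sync = self.async_q.popleft(), False
                else:
                    return            # stopping and both queues are drained
            if is_sync:
                key, value, done, result = op
                self.db[key] = value
                result["rc"] = 0      # send the result to the waiting worker
                done.set()
            else:
                key, value = op
                self.db[key] = value
```

A worker thread would call write_sync for the entryid, parentid and entryrdn updates and write_async for the other index updates; stop() drains both queues before the thread exits.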

Config entry

Entry: cn=mdb,cn=config,cn=ldbm database,cn=plugins,cn=config

Parameters similar to the bdb ones:

| Name | Default Value | Comment |
|---|---|---|
| nsslapd-db-home-directory | /var/lib/dirsrv/slapd-/dbDIR> | |
| nsslapd-search-bypass-filter-test | on | More a backend parameter than a bdb one |
| nsslapd-serial-lock | on | More a backend parameter than a bdb one |

mdb specific parameters:

| Name | Default Value | Comment |
|---|---|---|
| nsslapd-mdb-max-size | 0 | 0 means disk remaining size (when creating the db). Supported value: a number followed by a suffix (typically M/G/T). Note: the value is rounded down to a multiple of 10Mb |
| nsslapd-mdb-max-readers | 0 | 0 means number of working threads + 30 |
| nsslapd-mdb-max-dbs | 128 | |
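
The defaulting rules for nsslapd-mdb-max-size and nsslapd-mdb-max-readers can be sketched as below. This is an illustration only: the function names are hypothetical, and I assume that the suffixes are powers of 1024, that "10Mb" means 10 * 1024 * 1024 bytes, and that the rounding also applies to the "0" (remaining disk size) case.

```python
SUFFIXES = {"M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}  # assumed binary units
ROUNDING = 10 * 1024 ** 2                                    # assumed meaning of "10Mb"


def resolve_max_size(value, free_disk_bytes):
    """Turn an nsslapd-mdb-max-size string into a byte count."""
    value = value.strip()
    if value == "0":
        size = free_disk_bytes        # 0 means remaining disk size
    elif value[-1].upper() in SUFFIXES:
        size = int(value[:-1]) * SUFFIXES[value[-1].upper()]
    else:
        size = int(value)             # plain number of bytes
    return size - size % ROUNDING     # round down to a multiple of 10Mb


def resolve_max_readers(value, working_threads):
    """0 means auto-tuning: number of working threads + 30."""
    return working_threads + 30 if value == 0 else value
```

For example, under these assumptions "1G" resolves to 102 blocks of 10Mb, and 16 working threads with the default of 0 give 46 readers.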

debugging

Added a few debugging modules to trace:

The debug module sources are in the mdb_debug.* files and are triggered by:

Phases

Ideas about future improvements

These are raw ideas (that would need some benefit/cost evaluation)

Last modified on 23 August 2021