Database Backend Redesign - Phase 4


Status

Date: March 4th, 2021

Status: Draft

Tickets: https://issues.redhat.com/browse/IDMDS-302

Motivation

See backend redesign (initial phases)

High level summary

This document is the continuation of Ludwig's work on Berkeley Database removal. backend redesign (initial phases)

It focuses on:

Current state

Working on the design (in other words, a very unstable draft)

Naming

plugin directory: .../back-ldbm/db-mdb
source files are prefixed with mdb_
function names are prefixed with dbmdb_  (mdb_ could not be used as it conflicts with the lmdb library's own symbols)
lmdb library: provided by the packages lmdb-devel, lmdb-doc and lmdb-libs

Note: lmdb documentation is in file:///usr/share/doc/lmdb-doc/html/index.html once lmdb-doc is installed.

Design

The plan for this phase is to minimize, as much as possible, impacts outside of the mdb plugin: we do not provide pointers into the db mmap outside of the plugin. The global plugin design is similar to the bdb plugin.

Architecture choices

creating a single MDB_env versus one MDB_env per backend

- Max Dbs question: (txn cost depends linearly on the max number of dbs)
  if we split per suffix we can keep it smaller
- Consistency of txn across suffixes
  The question is important for write txns, as there is no way to commit them
  in a single step  (Are there write txns spanning different suffixes?)
- Consistency of changelog (not an issue, as the changelog is already per suffix)
- Consistency with the existing bdb model (today bdb_make_env is called once,
  with the <INSTALLDIR>/var/lib/dirsrv/slapd-<INSTANCE>/db path)

==> I suspect that we will have to go for a single MDB_env in the first phase

db filename list

The whole db is a single mmap file and lmdb does not provide any interface to list the db names. Two solutions:

==> This will impact the dbstat tool too, as we cannot look for files anymore and we need a way to list existing suffixes and existing files within suffixes

mdb specific config parameters

- MAXDBS  (cf mdb_env_set_maxdbs)
- DBMAXSIZE (cf mdb_env_set_mapsize)
- mdb_env_set_maxreaders: should be around 1 per working thread +
  1 per agmt,
  so we could have an auto-tuned value of number of working threads + 30

Note: changing these parameters requires closing the db env (i.e. restarting the instance in the first implementation)

mdb limitations

Here are the limits that I measured in my tests.

Database type     Key max    Data max
No dup support    511        > 6 GB
Dup support       511        511

511 is the hardcoded mdb_env_get_maxkeysize(env) limit. I got a bit more than 6 GB in a db with size = 10 GB.

** Note from Thierry: ** pierre: regarding the LMDB keylen, IPA is using 'eq' indexes on attributes (usercertificate, publickey, ...) with keys that are > 500 bytes

Other mdb limitations

mdb_env_open flags

- db2ldif/db2bak: MDB_RDONLY
- ns-slapd: 0
- offline bak2db/ldif2db: MDB_NOSYNC; use mdb_env_sync() and fsync() on the fd from mdb_env_get_fd() before closing the env (depending on the ldif file or the mmap size we may also use the MDB_WRITEMAP flag)
- online bak2db (and maybe online db2ldif): duplicate the environment
- reindex: should probably be 0, to avoid breaking the whole db in case of error.  Note: we may be in trouble if ldif2db fails while there are multiple backends
  and the single env strategy is used ...

db format

db                        open flags                                 key                                value
id2entry                  MDB_INTEGERKEY + MDB_CREATE                entryId                            entry or 'HUGE entryId nbParts'
entrydn                   MDB_DUPSORT                                normalized dn                      Flag.entryId
index                     MDB_DUPSORT + MDB_DUPFIXED + MDB_CREATE    PrefixIndex                        Flag.entryId
vlvindex                  MDB_DUPSORT + MDB_DUPFIXED + MDB_CREATE    PrefixIndex                        Flag.entryId
changelog/retrochangelog  MDB_CREATE                                 csn                                change
#dbname                   MDB_CREATE                                 bename/dbfilename                  openFlags
#huge                     MDB_CREATE                                 bename'/'dbname':'ContKeyId'.'n    EntryId.complete key value
    #maxKeyId: max ContKeyId value

PrefixIndex is the usual index type prefix (i.e. '=', '*', '?', or ':MatchingRuleOID:') concatenated with the key value. Flag.entryId is:
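The PrefixIndex scheme can be illustrated with a small helper (a sketch only; the real key layout lives in the mdb plugin and `make_index_key` is an invented name):

```c
/* Sketch: build an index database key by concatenating the index type
 * prefix ('=' equality, '*' substring, '?' presence, or
 * ':MatchingRuleOID:') with the normalized key value. */
#include <stdio.h>
#include <string.h>

static int make_index_key(char *out, size_t outlen,
                          const char *prefix, const char *value)
{
    return snprintf(out, outlen, "%s%s", prefix, value);
}
```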

Type Mapping

TXN and value Handling

The rule is: if a txn is provided by dblayer, it is a write txn and should be used for the operation. In the other case:

In all cases, data read from the db are strdup'ed/copied into dbi_val_t
(in phase 4a no pointers towards the memory mmap are kept outside the db-mdb plugin)

VLV and RECNO

BDB implements vlv indexes by using an index whose key is based on the entry value and the vlv search sort order (so that the vlv index records are directly sorted according to the vlv sort order). The data is the entry ID. Note: the key is usually accessed by its position in the tree (i.e. RECNO) rather than by its value.

Unlike bdb, lmdb does not implement a recno lookup, so we cannot use that. We use a VLV index as in bdb and, for each VLV index, a second database used as a cache. The cache will contain:

For each VLV index there is a second "vlv cache" database that contains the following records:
  Key: D{key}{data}{key-size}   Data: {key-size}{data-size}{recno}{key}{data}
  Key: R{recno}                 Data: {key-size}{data-size}{recno}{key}{data}
  Key: OK                       Data: OK
The data and recno records exist for all vlv records whose recno modulo RECNO_CACHE_INTERVAL (1000) is 1.

When looking up a recno in the cache, we perform a lookup for the nearest key smaller than or equal to R{recno}, then we look up the key/data in the vlv index and move the vlv index cursor by one until we reach the wanted recno.
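With RECNO_CACHE_INTERVAL = 1000 and cache entries at recno % 1000 == 1, the lookup arithmetic works out as follows (a sketch of the computation only; function names are invented):

```c
/* Sketch: given a wanted recno, find the nearest cached recno (cache
 * entries exist where recno % RECNO_CACHE_INTERVAL == 1) and how many
 * cursor moves are then needed on the vlv index. */
#define RECNO_CACHE_INTERVAL 1000

static unsigned long nearest_cached_recno(unsigned long recno)
{
    return ((recno - 1) / RECNO_CACHE_INTERVAL) * RECNO_CACHE_INTERVAL + 1;
}

static unsigned long cursor_moves(unsigned long recno)
{
    return recno - nearest_cached_recno(recno);  /* at most 999 moves */
}
```

So a lookup never walks more than RECNO_CACHE_INTERVAL - 1 records from a cache entry.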

When a key is added to or removed from the vlv index, the cache is cleared and rebuilt at the next query.

bulk operation

BDB supports two kinds of bulk read operations:

In MDB there is not much support for bulk read operations (not very surprising, because read operations are pretty fast anyway). The interesting point is that we could avoid the copy overhead for bulk operations (because in bdb the returned data are stored in a local buffer and are no longer used once the cursor is released), so:

** IMPORTANT NOTE ** The above description is not doable for bulk record reads: the issue is that, in the changelog case, the cursor gets closed between the value retrieval phase and the result iteration phase.
==> Neither keeping a cursor reference nor a pointer towards the mmap is safe.
==> We will have to copy the keys and data into the buffer until reaching one of the following:
  * end of cursor
  * buffer size limit
  * limit on the number of records in the bulk operation
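The copy-into-buffer loop with its three stop conditions can be sketched independently of lmdb (the record source is faked here; names and the length-prefixed layout are illustrative):

```c
/* Sketch: copy records into a caller-supplied buffer, stopping at end
 * of data, the buffer-size limit, or the record-count limit. Records
 * are stored length-prefixed in the buffer. */
#include <string.h>

typedef struct { const char *data; size_t len; } record_t;

/* Returns the number of records copied into buf. */
static int bulk_copy(const record_t *recs, int nrecs,   /* fake cursor */
                     char *buf, size_t buflen, int maxrecs)
{
    size_t used = 0;
    int n = 0;
    for (; n < nrecs && n < maxrecs; n++) {      /* end / count limits */
        size_t need = sizeof(size_t) + recs[n].len;
        if (used + need > buflen)                /* buffer size limit */
            break;
        memcpy(buf + used, &recs[n].len, sizeof(size_t));
        memcpy(buf + used + sizeof(size_t), recs[n].data, recs[n].len);
        used += need;
    }
    return n;
}
```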

monitoring

Here are the available values:

Here is what OpenLDAP monitors:

Attribute           Description                                                         IMHO Notes
olmDbDirectory      Path name of the directory where the database environment resides   Should be a config value rather than a monitored one
olmMDBPagesMax      Maximum number of pages
olmMDBPagesUsed     Number of pages in use
olmMDBPagesFree     Number of free pages
olmMDBReadersMax    Maximum number of readers                                            Is also a config attribute
olmMDBReadersUsed   Number of readers in use
olmMDBEntries       Number of entries in DB

Config entry

Entry: cn=mdb,cn=config,cn=ldbm database,cn=plugins,cn=config

Parameters similar to the bdb ones:

Name                               Default Value                                    Comment
nsslapd-db-home-directory          <INSTALLDIR>/var/lib/dirsrv/slapd-<INSTANCE>/db
nsslapd-search-bypass-filter-test  on                                               More a backend parameter than a bdb one
nsslapd-serial-lock                on                                               More a backend parameter than a bdb one

mdb specific parameters:

Name                     Default Value   Comment
nsslapd-mdb-max-size     0               0 means the remaining disk size (when creating the db).
                                         Supported value: a number followed by a suffix (typically M/G/T).
                                         Note: the value is rounded down to a multiple of 10 MB.
nsslapd-mdb-max-readers  0               0 means number of working threads + 30
nsslapd-mdb-max-dbs      128
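The suffix parsing and 10 MB round-down for nsslapd-mdb-max-size could look like this (a sketch under the assumptions above; the function name is invented and the 0 = "remaining disk size" case is left out):

```c
/* Sketch: parse a nsslapd-mdb-max-size value such as "2G" and round
 * it down to a multiple of 10 MB. Returns 0 on an unsupported suffix. */
#include <stdlib.h>

#define TEN_MB (10UL * 1024 * 1024)

static unsigned long parse_mdb_max_size(const char *val)
{
    char *end = NULL;
    unsigned long n = strtoul(val, &end, 10);
    switch (*end) {
        case 'M': case 'm': n *= 1024UL * 1024; break;
        case 'G': case 'g': n *= 1024UL * 1024 * 1024; break;
        case 'T': case 't': n *= 1024UL * 1024 * 1024 * 1024; break;
        case '\0': break;             /* plain number of bytes */
        default: return 0;            /* unsupported suffix */
    }
    return (n / TEN_MB) * TEN_MB;     /* round down to a 10 MB multiple */
}
```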

debugging

Added a few debugging modules to trace:

The debug module sources are in the mdb_debug.* files and are triggered by:

Phases

Ideas about future improvements

These are raw ideas (that would need some benefit/cost evaluation)

Last modified on 29 June 2021