Date: 2020 Nov, 26th
Status: Draft
Tickets: https://issues.redhat.com/browse/IDMDS-302
This document continues Ludwig’s work on Berkeley database removal. See backend redesign (initial phases).
It focuses on:
Some of the changes have already been pushed to the upstream branch.
(The plugin implementation and most of the dblayer wrappers are already coded.) When the dblayer API is initialized, the value of the nsslapd-backend-implement configuration parameter is used to load the plugin of that name and then to call its <value>_init function, which fills a set of callbacks in li->priv.
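
The initialization flow described above can be pictured with the minimal sketch below. It is not the ns-slapd code: the real server loads a plugin, whereas this sketch uses a static lookup table to stay self-contained, and every name in it (demo_priv_t, demo_bdb_init, demo_dblayer_setup) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the callback set stored in li->priv. */
typedef struct {
    const char *impl_name;
    int (*start_fn)(void);
    int (*close_fn)(void);
} demo_priv_t;

typedef int (*demo_init_fn_t)(demo_priv_t *priv);

static int demo_bdb_start(void) { return 0; }
static int demo_bdb_close(void) { return 0; }

/* Plays the role of the <value>_init function when the value is "bdb". */
static int demo_bdb_init(demo_priv_t *priv)
{
    priv->impl_name = "bdb";
    priv->start_fn = demo_bdb_start;
    priv->close_fn = demo_bdb_close;
    return 0;
}

/* Static table used here instead of loading a plugin at run time. */
static const struct {
    const char *name;
    demo_init_fn_t init;
} demo_plugins[] = {
    { "bdb", demo_bdb_init },
};

/* Pick the implementation named by the configuration value and let it
 * fill the callbacks. Returns 0 on success, -1 for an unknown name. */
int demo_dblayer_setup(const char *backend_implement, demo_priv_t *priv)
{
    size_t i;

    assert(backend_implement != NULL && priv != NULL);
    for (i = 0; i < sizeof(demo_plugins) / sizeof(demo_plugins[0]); i++) {
        if (strcmp(backend_implement, demo_plugins[i].name) == 0) {
            return demo_plugins[i].init(priv);
        }
    }
    fprintf(stderr, "unknown backend implementation: %s\n", backend_implement);
    return -1;
}
```

The point of the design is that the caller only ever sees the filled callback set, so adding another implementation means adding one more init function and nothing else at this level.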
The remaining dependencies are:
* dblayer\_log\_print: must be moved into the implementation-dependent plugin; the callback should then be removed
* Value compare callbacks: must be moved into the implementation-dependent plugin as they work directly on db values
  * idl\_new\_compare\_dups
  * entryrdn\_compare\_dups
* Same with the key compare callback? Should the use of DBTcmp also be removed?
* Move the monitoring statistics into the bdb plugin and add a wrapper at the dblayer level
  * perfctrs\_update should be moved into bdb and a wrapper added
  * perfctrs\_terminate should be split: memory cleanup should stay at the backend level but statistics should be cleared at the bdb plugin level. This will also allow getting rid of the dblayer\_db\_uses\_\* functions that check for existing features
* Remove old macros in dblayer that are already useless:
  * DB\_OPEN
  * TXN\_BEGIN
  * TXN\_COMMIT
  * TXN\_ABORT
  * TXN\_CHECKPOINT
  * MEMP\_STAT
  * MEMP\_TRICKLE
  * LOG\_ARCHIVE
  * LOG\_FLUSH
Include file: dbimpl.h
Name | Role | Opaque | Old bdb name |
---|---|---|---|
dbi_env_t | The global environment | PseudoOpaque(1) | DB_ENV |
dbi_db_t | A database instance | PseudoOpaque(1) | DB |
dbi_txn_t | A transaction | Yes(2) | DB_TXN |
dbi_cursor_t | A cursor (i.e: iterator on DB data) | PseudoOpaque(1) | DBC |
dbi_data_t | A key or a value | No | DBT |
dbi_cb_t | Contains all DB implementation callbacks | No | N/A |
(1) DB_ENV is used as an opaque struct, except for dbenv->get_open_flags, which is used in db_uses_feature; that function should be moved into the bdb plugin anyway.
(2) Already used as an opaque struct.
PseudoOpaque types are:

    typedef struct {
        DBI_CB *cb;      /* The callbacks */
        void *<name>;    /* The implementation opaque struct (name is env, db or cursor) */
        void *plg_ctx;   /* A context that the implementation plugin is free to use (may not be needed) */
    } PseudoOpaque;

They are used because the code sometimes uses functions that only have access to the underlying element and not to the upper-layer context (e.g. a cursor without the backend or li_instance).

    typedef struct {
        DBI_CB *cb;
        DBI_MEM_OPTION flags;
        void *data;
        size_t size;
        void *ctx; /* Context handled by db implementation plugin */
    } DBI_DATA;

    typedef struct {
        struct DBI_CB *cb;
        void *cursor;
    } DBI_CURSOR;

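
To illustrate why the callback pointer travels inside the pseudo-opaque struct, here is a small sketch of a wrapper that receives only a cursor and still reaches the implementation. The struct has the same shape as DBI_CURSOR above, but all names (demo_cb_t, demo_cursor_t, demo_cursor_close, fake_cursor_close) are hypothetical and the "native" cursor is faked with a counter.

```c
#include <assert.h>
#include <stddef.h>

struct demo_cursor;

/* Hypothetical, reduced callback table (plays the role of DBI_CB). */
typedef struct demo_cb {
    int (*cursor_close)(struct demo_cursor *cur);
} demo_cb_t;

/* Same shape as DBI_CURSOR: callbacks plus the implementation handle. */
typedef struct demo_cursor {
    demo_cb_t *cb;
    void *cursor;
} demo_cursor_t;

/* Fake implementation: the native handle is a counter of open cursors. */
static int fake_cursor_close(demo_cursor_t *cur)
{
    int *open_count = cur->cursor;

    (*open_count)--;
    cur->cursor = NULL;
    return 0;
}

static demo_cb_t fake_cb = { fake_cursor_close };

/* Generic wrapper: no backend or instance argument is needed because
 * the cursor itself carries the callbacks. */
int demo_cursor_close(demo_cursor_t *cur)
{
    assert(cur != NULL && cur->cb != NULL);
    if (cur->cursor == NULL) {
        return 0; /* already closed */
    }
    return cur->cb->cursor_close(cur);
}
```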
DBI_OP /* Represents a cursor operation */
Name | Role | Old bdb function | Old bdb value |
---|---|---|---|
DBI_OP_MOVE_TO_KEY | Move cursor to first record having the key and get its value | c_get | DB_SET |
DBI_OP_MOVE_NEAR_KEY | Move cursor to the record having the smallest key greater than or equal to the specified one, then get the record | c_get | DB_SET_RANGE |
DBI_OP_MOVE_TO_DATA | Move cursor to key+value record | c_get | DB_GET_BOTH |
DBI_OP_MOVE_NEAR_DATA | Move cursor to the record having the specified key and the smallest data greater than or equal to the specified data, then get the value | c_get | DB_GET_BOTH_RANGE |
DBI_OP_MOVE_TO_RECNO | Move cursor to specified record number then get it. | c_get | DB_SET_RECNO |
DBI_OP_MOVE_TO_LAST | Move cursor to last record then get it. | c_get | DB_LAST |
DBI_OP_GET | Get the record associated with the specified key. | get | DB_GET |
DBI_OP_GET_RECNO | Get current record number. | c_get | DB_GET_RECNO |
DBI_OP_NEXT | Move cursor to next record then get it. | c_get | DB_NEXT |
DBI_OP_NEXT_DATA | Move cursor to next record having the same key then get the value. | c_get | DB_NEXT_DUP |
DBI_OP_NEXT_KEY | Move cursor to next record having different key then get the record. | c_get | DB_NEXT_NODUP |
DBI_OP_PREV | Move cursor to previous record then get it. | c_get | DB_PREV |
DBI_OP_PUT | Insert new key-data | put | DB_PUT |
DBI_OP_REPLACE | Overwrite current position value | c_put | DB_CURRENT |
DBI_OP_ADD | Insert new key-data if it does not already exist | put | DB_NODUPDATA |
DBI_OP_ADD | Insert new key-data if it does not already exist | c_put | DB_NODUPDATA |
DBI_OP_DEL | Delete key-data record | del | 0 |
DBI_OP_DEL | Delete record at cursor position | c_del | 0 |
DBI_OP_CLOSE | Close cursor | c_close | N/A |
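
Inside the bdb plugin, the table above boils down to a translation from the generic operation to the libdb flag. A minimal sketch, covering only some of the c_get rows: the DEMO_* names are hypothetical, and the flag values are arbitrary placeholders (a real plugin would include db.h and return DB_SET, DB_NEXT and so on).

```c
#include <assert.h>

/* Subset of the cursor operations listed in the table above. */
typedef enum {
    DEMO_OP_MOVE_TO_KEY,
    DEMO_OP_MOVE_NEAR_KEY,
    DEMO_OP_MOVE_TO_LAST,
    DEMO_OP_NEXT,
    DEMO_OP_NEXT_DATA,
    DEMO_OP_NEXT_KEY,
    DEMO_OP_PREV,
    DEMO_OP_REPLACE
} demo_op_t;

/* Placeholders for the libdb flags; the numeric values are arbitrary. */
enum {
    DEMO_DB_SET = 1,
    DEMO_DB_SET_RANGE,
    DEMO_DB_LAST,
    DEMO_DB_NEXT,
    DEMO_DB_NEXT_DUP,
    DEMO_DB_NEXT_NODUP,
    DEMO_DB_PREV
};

/* Translate a generic operation into the flag passed to c_get.
 * Returns -1 when the operation is not a c_get operation. */
int demo_op_to_c_get_flag(demo_op_t op)
{
    switch (op) {
    case DEMO_OP_MOVE_TO_KEY:   return DEMO_DB_SET;
    case DEMO_OP_MOVE_NEAR_KEY: return DEMO_DB_SET_RANGE;
    case DEMO_OP_MOVE_TO_LAST:  return DEMO_DB_LAST;
    case DEMO_OP_NEXT:          return DEMO_DB_NEXT;
    case DEMO_OP_NEXT_DATA:     return DEMO_DB_NEXT_DUP;
    case DEMO_OP_NEXT_KEY:      return DEMO_DB_NEXT_NODUP;
    case DEMO_OP_PREV:          return DEMO_DB_PREV;
    default:                    return -1; /* e.g. REPLACE goes through c_put */
    }
}
```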
Value handling options
Name | Role | Old bdb value |
---|---|---|
DBI_MEM_USER | Tell impl plugin to neither alloc nor free the memory | DB_DBT_USERMEM |
DBI_MEM_MALLOC | Tell impl plugin to alloc and free the memory | DB_DBT_MALLOC |
DBI_MEM_REALLOC | Tell impl plugin to reuse or realloc the memory | DB_DBT_REALLOC |
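
As an illustration of how an implementation plugin could honor these three options when handing a record back, here is a hedged sketch. It is an assumption about the behavior, not the actual plugin code: the ulen capacity field does not exist in the DBI_DATA struct shown earlier, and the names demo_val_t and demo_store_result are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { DEMO_MEM_USER, DEMO_MEM_MALLOC, DEMO_MEM_REALLOC } demo_mem_t;

typedef struct {
    demo_mem_t flags;
    void *data;
    size_t size; /* number of valid bytes in data */
    size_t ulen; /* capacity of data (assumption, see lead-in) */
} demo_val_t;

/* Copy a record returned by the database into val according to val->flags.
 * Returns 0 on success, -1 if a user buffer is too small or if an
 * allocation fails. */
int demo_store_result(demo_val_t *val, const void *src, size_t len)
{
    void *buf;

    assert(val != NULL && (src != NULL || len == 0));
    switch (val->flags) {
    case DEMO_MEM_USER:
        if (len > val->ulen) {
            val->size = len; /* report the size that would be needed */
            return -1;
        }
        break;
    case DEMO_MEM_MALLOC:
        buf = malloc(len ? len : 1);
        if (buf == NULL) {
            return -1;
        }
        free(val->data); /* the plugin owns the buffer in this mode */
        val->data = buf;
        val->ulen = len;
        break;
    case DEMO_MEM_REALLOC:
        if (len > val->ulen) {
            buf = realloc(val->data, len);
            if (buf == NULL) {
                return -1;
            }
            val->data = buf;
            val->ulen = len;
        }
        break;
    }
    if (len > 0) {
        memcpy(val->data, src, len);
    }
    val->size = len;
    return 0;
}
```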
Error codes
Name | Role | Old bdb value |
---|---|---|
DBI_RC_SUCCESS | No error | 0 |
DBI_RC_NOMEM | Memory allocation error (usually it does not happen because slapi_ch_malloc cannot return NULL) | DB_BUFFER_SMALL |
DBI_RC_KEYEXIST | Key exists and duplicate keys are not allowed. | DB_KEYEXIST |
DBI_RC_RETRY | Transient error: operation should be retried. | DB_LOCK_DEADLOCK |
DBI_RC_NOTFOUND | Record not found: Key does not exist. | DB_NOTFOUND |
DBI_RC_RUNRECOVERY | Recovery must be performed. | DB_RUNRECOVERY |
DBI_RC_OTHER | Other database errors | N/A |
Note: the implementation plugin should log an error with the error code and error text when getting an error that cannot be mapped (to ease diagnosis in case of an unexpected error).
(TODO: get the callback names and prototypes from dblayer.h and put them in this document to have the full API.)
Name | Role | Old bdb value |
---|---|---|
dblayer_start_fn_t *dblayer_start_fn | ||
dblayer_close_fn_t *dblayer_close_fn | ||
dblayer_instance_start_fn_t *dblayer_instance_start_fn | ||
dblayer_backup_fn_t *dblayer_backup_fn | ||
dblayer_verify_fn_t *dblayer_verify_fn | ||
dblayer_db_size_fn_t *dblayer_db_size_fn | ||
dblayer_ldif2db_fn_t *dblayer_ldif2db_fn | ||
dblayer_db2ldif_fn_t *dblayer_db2ldif_fn | ||
dblayer_db2index_fn_t *dblayer_db2index_fn | ||
dblayer_cleanup_fn_t *dblayer_cleanup_fn | ||
dblayer_upgradedn_fn_t *dblayer_upgradedn_fn | ||
dblayer_upgradedb_fn_t *dblayer_upgradedb_fn | ||
dblayer_restore_fn_t *dblayer_restore_fn | ||
dblayer_txn_begin_fn_t *dblayer_txn_begin_fn | ||
dblayer_txn_commit_fn_t *dblayer_txn_commit_fn | ||
dblayer_txn_abort_fn_t *dblayer_txn_abort_fn | ||
dblayer_get_info_fn_t *dblayer_get_info_fn | ||
dblayer_set_info_fn_t *dblayer_set_info_fn | ||
dblayer_back_ctrl_fn_t *dblayer_back_ctrl_fn | ||
dblayer_get_db_fn_t *dblayer_get_db_fn | ||
dblayer_delete_db_fn_t *dblayer_delete_db_fn | ||
dblayer_rm_db_file_fn_t *dblayer_rm_db_file_fn | ||
dblayer_import_fn_t *dblayer_import_fn | ||
dblayer_load_dse_fn_t *dblayer_load_dse_fn | ||
dblayer_config_get_fn_t *dblayer_config_get_fn | ||
dblayer_config_set_fn_t *dblayer_config_set_fn | ||
instance_config_set_fn_t *instance_config_set_fn | ||
instance_config_entry_callback_fn_t *instance_add_config_fn | ||
instance_config_entry_callback_fn_t *instance_postadd_config_fn | ||
instance_config_entry_callback_fn_t *instance_del_config_fn | ||
instance_config_entry_callback_fn_t *instance_postdel_config_fn | ||
instance_cleanup_fn_t *instance_cleanup_fn | ||
instance_create_fn_t *instance_create_fn | ||
instance_create_fn_t *instance_register_monitor_fn | ||
instance_search_callback_fn_t *instance_search_callback_fn | ||
dblayer_auto_tune_fn_t *dblayer_auto_tune_fn | ||
Callbacks not yet implemented
Name | Role | Old bdb value |
---|---|---|
dblayer_cursor_op(DBI_CUR *cur, DBI_OP op, DBI_DATA *key, DBI_DATA *data) | Move cursor and get record | cursor->c_get |
dblayer_cursor_op(DBI_CUR *cur, DBI_OP op, DBI_DATA *key, DBI_DATA *data) | Add/replace a record | cursor->c_put |
dblayer_cursor_op(DBI_CUR *cur, DBI_OP op, DBI_DATA *key, DBI_DATA *data) | Remove a record | cursor->c_del |
dblayer_cursor_op(DBI_CUR *cur, DBI_OP op, DBI_DATA *key, DBI_DATA *data) | Close the cursor | cursor->c_close |
dblayer_new_cursor(be,db,txn, cursor) | Should store the backend in cldb_Handle to retrieve it. | db->cursor(db, db_txn, &cursor, 0); |
dblayer_db_op(DBI_DB *db, DBI_OP op, DBI_DATA *key, DBI_DATA *data) | Get a record | db->get |
dblayer_db_op(be, DBI_DB *db, DBI_OP op, DBI_DATA *key, DBI_DATA *data) | Add/replace a record | db->put |
dblayer_db_op(be, DBI_DB *db, DBI_OP op, DBI_DATA *key, DBI_DATA *data) | Delete a record | db->del |
dblayer_get_db_id | | db->fname |
dblayer_init_bulk_op(DBI_DATA *bulk) | Initialize iterator for bulk operation | DB_MULTIPLE_INIT |
dblayer_next_bulk_op(DBI_DATA *bulk, DBI_DATA *key, DBI_DATA *data) | Get next operation from bulk operation | DB_MULTIPLE_NEXT |
I wonder whether we should keep the callback definition at the dblayer level.
IMHO it would be better to define a callback struct in dbimpl.h (i.e. DBI_CB) because:
This is the plugin that implements the dbimpl API callbacks and calls the libdb functions. The important points are:
* When calling a libdb function from the plugin (with dbi_val_t values):

        bdb_dbival2dbt(key, &bdb_key, PR_FALSE); /* Convert dbi_val_t to DBT before the libdb call */
        bdb_dbival2dbt(data, &bdb_data, PR_FALSE);
        rc = some_native_libdb_function(..., &bdb_key, &bdb_data, ...);
        bdb_dbt2dbival(&bdb_key, key, PR_TRUE); /* Convert the DBT back to dbi_val_t after the libdb call */
        bdb_dbt2dbival(&bdb_data, data, PR_TRUE);
        return bdb_map_error(__FUNCTION__, rc);

* When calling a backend function from the plugin (with DBT values):

        bdb_dbt2dbival(&key, &dbikey, PR_FALSE);
        idl = idl_fetch(be, db, &dbikey, NULL, NULL, &ret);
        bdb_dbival2dbt(&dbikey, &key, PR_TRUE);

Note: In both cases isresponse is set to PR_FALSE before the operation and to PR_TRUE after it. If a key or data gets allocated or reallocated, the original key/data gets freed (if the value flags allow it).
dup_cmp_fn callback
As these callbacks are called directly from within libdb (i.e. using DBT), they have been moved into the db-bdb plugin. Rather than setting the callback directly in the upper layer, there is a dbimpl function to set a specific function.
The bdb_map_error function converts well-known errors to their DBI counterparts. For other errors, a generic value is returned after the native bdb error has been logged.
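
A minimal sketch of what such a mapping function could look like, following the error table given earlier. This is not the actual bdb_map_error: the demo_ and FAKE_ names are hypothetical, the native codes are placeholders (a real plugin takes them from db.h and gets the text from db_strerror), and logging goes to stderr instead of the server error log.

```c
#include <assert.h>
#include <stdio.h>

/* Generic codes named after the error table; numeric values are arbitrary. */
typedef enum {
    DEMO_RC_SUCCESS = 0,
    DEMO_RC_KEYEXIST,
    DEMO_RC_RETRY,
    DEMO_RC_NOTFOUND,
    DEMO_RC_RUNRECOVERY,
    DEMO_RC_OTHER
} demo_rc_t;

/* Placeholders for the native libdb error codes. */
enum {
    FAKE_KEYEXIST = -101,
    FAKE_LOCK_DEADLOCK = -102,
    FAKE_NOTFOUND = -103,
    FAKE_RUNRECOVERY = -104
};

/* Placeholder for the native error-to-text function. */
static const char *fake_strerror(int rc)
{
    (void)rc;
    return "native error text";
}

/* Map well-known native errors to generic codes. Anything else is logged
 * with its native code and text, then reported as DEMO_RC_OTHER. */
demo_rc_t demo_map_error(const char *funcname, int native_rc)
{
    assert(funcname != NULL);
    switch (native_rc) {
    case 0:                  return DEMO_RC_SUCCESS;
    case FAKE_KEYEXIST:      return DEMO_RC_KEYEXIST;
    case FAKE_LOCK_DEADLOCK: return DEMO_RC_RETRY;
    case FAKE_NOTFOUND:      return DEMO_RC_NOTFOUND;
    case FAKE_RUNRECOVERY:   return DEMO_RC_RUNRECOVERY;
    default:
        fprintf(stderr, "%s failed with db error %d: %s\n",
                funcname, native_rc, fake_strerror(native_rc));
        return DEMO_RC_OTHER;
    }
}
```

Logging at the point of mapping is what keeps the unexpected case diagnosable even though the caller only sees the generic code.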
Proposed solution
* Solution 1
  * Remap the errors to generic values
  * Add a function in bdb that remaps the value (should be a simple switch). If the value cannot be mapped we could:
    * Add a string in thread local storage and return DBI\_RC\_OTHER. The string should contain the original return code and its associated message (i.e. "bdb error code: %d : %s", native\_rc, db\_strerror(native\_rc))
    * Modify dblayer\_strerror to print a message for generic errors and, if the code is DBI\_RC\_OTHER, to generate a message from the thread local data string.
  * This solution has the advantage that:
    * It does not impact the back-ldbm/changelog code (except for dblayer\_strerror)
    * It is quite efficient in the usual case as it handles a switch with few values
    * It keeps the ability to diagnose errors in the unexpected case
  * The drawbacks:
    * The message can be wrong if creative error handling is performed, i.e. rc1 = dblayer\_xxx(li, ...); rc2 = dblayer\_xxx(li, ...); log(dblayer\_strerror(rc1)); prints the rc2 message if both values are DBI\_RC\_OTHER. We should double check that when hitting unexpected errors we just log an error message and abort the operation (as it is possible that we abort the txn before logging the error)
    * Error handling should be done in the same thread as the operation (this is IMHO the case)
* Solution 2: I thought about keeping the db code as is, but that implies a lot of changes as we need to access the db plugin to determine what action to take or to log the error (and the dblayer instance context is not always easily available when the message is logged).
* Solution 3: Same as solution 1 but without storing data in thread local storage. The problem is that we are left clueless in case of an unexpected database error, unless an error message is logged by the plugin. (Note: this is the solution that was finally implemented.)
These questions will need to be solved in phase 4.
VLV and RECNO
Not an issue for this phase, but it will be an issue when writing the lmdb implementation plugin (i.e. Phase 4). So far I have no idea how to implement efficiently the DBI_OP_GET_RECNO (i.e. DB_GET_RECNO) and DBI_OP_MOVE_TO_RECNO (i.e. DB_SET_RECNO) operations on lmdb.
VLV searches the index records by record number. bdb is able to do that on btree databases but lmdb does not offer this feature. The bad thing is that this numbering comes directly from the VLV LDAP RFC draft, so it is not something that we can easily change.
I wonder if having a vlv index would then still be useful (maybe only to avoid having to sort the entries).
(And the paged control could also benefit from the cache to avoid having to rebuild the complete request.)
Read transaction support
ns-slapd does not use read-only txns with bdb (read operations are transactionless) while lmdb requires them. We should determine the txn strategy:
* Having a single read txn for the whole ldap read operation.
* Having a read txn for every db read operation (is that efficient?)
* Mixed approach: having a read txn for specific functions (like building an idl from an index)

Anyway it is not an issue for this phase (the only concern in phase 3 is that the architecture should be flexible enough to easily support that evolution). My feeling is that in phase 4 we simply use the read txn inside the lmdb plugin:
* generating a read txn for a single db operation if no txn is provided
* generating a read txn from cursor creation until cursor deletion if no txn is provided

and copy the db key and value results into the dbi_val_t buffer (as bdb does with the DBT buffer). This is not the most performant approach but it is fast to implement and it mimics the current bdb behavior. Then, once bdb is out, we could have a perf improvement phase to boost read operations by using a global txn and avoiding the need to duplicate the keys and values (there is no need to duplicate the data returned by the db as they stay mmapped until the txn is aborted or committed), which would also offer better consistency than the current model. But we cannot do it while the bdb plugin is still around because of the risk of deadlock and excessive retries on bdb.
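
The "read txn for a single db operation if no txn is provided" idea can be sketched as follows. This is only an illustration with a faked database: none of these names exist in lmdb or in ns-slapd, and the counters are there just to make the behavior observable. The part worth noticing is that the result is copied into the caller's buffer before the local txn ends, since the returned pointer is only valid while the txn is active.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical transaction handle. */
typedef struct { int active; } demo_txn_t;

static int demo_txn_begun; /* counters used to observe the behavior */
static int demo_txn_ended;
static const char demo_mapped_page[] = "value-in-mapped-memory";

static void demo_read_txn_begin(demo_txn_t *txn) { txn->active = 1; demo_txn_begun++; }
static void demo_read_txn_end(demo_txn_t *txn) { txn->active = 0; demo_txn_ended++; }

/* Fake native lookup: the returned pointer is valid only while txn is active. */
static const char *demo_native_get(demo_txn_t *txn, const char *key)
{
    (void)key;
    return txn->active ? demo_mapped_page : NULL;
}

/* Read one value. If the caller gives no txn, a read txn is opened for this
 * single operation. Returns 0 on success, -1 if not found or out is too small. */
int demo_db_get(demo_txn_t *txn, const char *key, char *out, size_t outlen)
{
    demo_txn_t local = { 0 };
    demo_txn_t *use = txn;
    const char *found;
    int rc = -1;

    assert(key != NULL && out != NULL && outlen > 0);
    if (use == NULL) {
        demo_read_txn_begin(&local);
        use = &local;
    }
    found = demo_native_get(use, key);
    if (found != NULL && strlen(found) < outlen) {
        strcpy(out, found); /* copy before the txn ends */
        rc = 0;
    }
    if (use == &local) {
        demo_read_txn_end(&local); /* only end the txn we started */
    }
    return rc;
}
```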
Phase 3 is about being able to remove the bdb dependencies (i.e. being able to build ns-slapd, libback-ldbm and replication without the bdb include and lib). Due to the size of these changes (FYI: Phase 3a already impacts 53 files), it seems better to split the phase into sub-phases: