Date: 2020 Nov, 26th
Status: Mostly Implemented
Tickets: https://issues.redhat.com/browse/IDMDS-302
See backend redesign (initial phases)
This document continues Ludwig's work on Berkeley database removal, described in backend redesign (initial phases).
It focuses on:
* Error code handling and logging. This should be independent of the database implementation while still providing relevant data in case of unexpected trouble (common errors are handled by the dbimpl API; unusual errors are logged within the db plugin).
* Moving the monitoring statistics into the bdb plugin and adding a wrapper at dblayer level:
    * perfctrs_update should be moved into bdb and a wrapper added.
    * perfctrs_terminate should be split: memory cleanup should stay at backend level, but the statistics should be cleared at bdb plugin level. This will also allow getting rid of the dblayer_db_uses_* functions that check for existing features.
* Removing old macros in dblayer that are already useless:
    * DB_OPEN
    * TXN_BEGIN
    * TXN_COMMIT
    * TXN_ABORT
    * TXN_CHECKPOINT
    * MEMP_STAT
    * MEMP_TRICKLE
    * LOG_ARCHIVE
    * LOG_FLUSH
The dbimpl API is part of libback-ldbm: dbimpl API users need to include dbimpl.h and link with libback-ldbm.
When initializing the dblayer API (or when requesting private access to a file), the value of the nsslapd-backend-implement configuration parameter is used to call the value_init function (within libback-ldbm), which fills a set of callbacks in li->priv.
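As a rough illustration of this initialization step, the sketch below picks an init function from the configured implementation name and lets it fill a callback table. It is not the actual libback-ldbm code: demo_dbi_cb_t, demo_bdb_init and demo_dbimpl_setup are hypothetical names, the lookup table is an assumption about how the dispatch could be done, and the real callback set is much larger.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced callback table (the real one is stored in li->priv). */
typedef struct {
    const char *impl_name;
    int (*start_fn)(void);
    int (*close_fn)(void);
} demo_dbi_cb_t;

static int demo_bdb_start(void) { return 0; }
static int demo_bdb_close(void) { return 0; }

/* Hypothetical init function for the "bdb" value: fills the callbacks. */
static int demo_bdb_init(demo_dbi_cb_t *cb)
{
    cb->impl_name = "bdb";
    cb->start_fn = demo_bdb_start;
    cb->close_fn = demo_bdb_close;
    return 0;
}

typedef int (*demo_init_fn_t)(demo_dbi_cb_t *cb);

/* Known implementations, keyed by the configuration value. */
static const struct {
    const char *value;
    demo_init_fn_t init;
} demo_impls[] = {
    { "bdb", demo_bdb_init },
};

/* Returns 0 on success, -1 if no implementation matches the value. */
int demo_dbimpl_setup(const char *backend_implement, demo_dbi_cb_t *cb)
{
    size_t i;

    assert(cb != NULL);
    *cb = (demo_dbi_cb_t){0};
    if (backend_implement == NULL) {
        return -1;
    }
    for (i = 0; i < sizeof(demo_impls) / sizeof(demo_impls[0]); i++) {
        if (strcmp(backend_implement, demo_impls[i].value) == 0) {
            return demo_impls[i].init(cb);
        }
    }
    return -1;
}
```

Once the table is filled, the upper layer only goes through the callbacks and no longer needs to know which implementation is behind them.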
Include file: dbimpl.h
Name | Role | Opaque | Old bdb name |
---|---|---|---|
dbi_env_t | The global environment | PseudoOpaque(1) | DB_ENV |
dbi_db_t | A database instance | PseudoOpaque(1) | DB |
dbi_txn_t | A transaction | Yes(2) | DB_TXN |
dbi_cursor_t | A cursor (i.e: iterator on DB data) | PseudoOpaque(1) | DBC |
dbi_data_t | A key or a value | No | DBT |
dbi_cb_t | Contains all DB implementation callbacks | No | N/A |
(1) DB_ENV is used as an opaque struct, except for dbenv->get_open_flags, which is used in db_uses_feature and should be moved into the bdb plugin anyway
(2) already used as an opaque struct
PseudoOpaque types are:

    typedef struct {
        DBI_CB *cb;      /* The callbacks */
        void *<name>;    /* The implementation opaque struct (name is env, db or cursor) */
        void *plg_ctx;   /* A context that the implementation plugin is free to use (may not be needed) */
    } PseudoOpaque;

They are used because the code sometimes uses functions that only have access to the underlying element and not to the upper layer context (i.e. a cursor without the backend or li_instance).

    typedef struct {
        DBI_CB *cb;
        DBI_MEM_OPTION flags;
        void *data;
        size_t size;
        void *ctx; /* Context handled by db implementation plugin */
    } DBI_DATA;

    typedef struct {
        struct DBI_CB *cb;
        void *cursor;
    } DBI_CURSOR;

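To make the pseudo-opaque idea more concrete, here is a minimal sketch of a cursor handle that carries its own callback pointer, so a function that only receives the cursor can still reach the implementation plugin. All demo_* names are hypothetical stand-ins and not the real dbimpl types.

```c
#include <assert.h>
#include <stddef.h>

struct demo_cb;

/* Pseudo-opaque cursor: callbacks plus the implementation-private cursor. */
typedef struct {
    struct demo_cb *cb;
    void *cursor;
} demo_cursor_t;

/* Reduced callback table with a single cursor callback. */
typedef struct demo_cb {
    int (*cursor_close)(demo_cursor_t *cur);
} demo_cb_t;

static int demo_closed_count = 0; /* lets the caller observe the dispatch */

/* Stand-in for what an implementation plugin would provide. */
static int demo_impl_cursor_close(demo_cursor_t *cur)
{
    cur->cursor = NULL;
    demo_closed_count++;
    return 0;
}

static demo_cb_t demo_callbacks = { demo_impl_cursor_close };

/* Upper-layer wrapper: needs neither the backend nor the li instance. */
int demo_cursor_close(demo_cursor_t *cur)
{
    assert(cur != NULL && cur->cb != NULL);
    return cur->cb->cursor_close(cur);
}
```

The cost is one extra pointer per handle; the benefit is that helpers working on a bare cursor do not need the backend passed down to them.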
DBI_OP /* Represents a cursor operation */
Name | Role | Old bdb function | Old bdb value |
---|---|---|---|
DBI_OP_MOVE_TO_KEY | Move cursor to first record having the key and get its value | c_get | DB_SET |
DBI_OP_MOVE_NEAR_KEY | Move cursor to the record having the smallest key greater than or equal to the specified one, then get the record | c_get | DB_SET_RANGE |
DBI_OP_MOVE_TO_DATA | Move cursor to key+value record | c_get | DB_GET_BOTH |
DBI_OP_MOVE_NEAR_DATA | Move cursor to the record having the specified key and the smallest data greater than or equal to the specified data, then get the value | c_get | DB_GET_BOTH_RANGE |
DBI_OP_MOVE_TO_RECNO | Move cursor to specified record number then get it. | c_get | DB_SET_RECNO |
DBI_OP_MOVE_TO_FIRST | Move cursor to first record then get it. | c_get | DB_FIRST |
DBI_OP_MOVE_TO_LAST | Move cursor to last record then get it. | c_get | DB_LAST |
DBI_OP_GET | Get record from key. | get | DB_GET |
DBI_OP_GET_RECNO | Get current record number. | c_get | DB_GET_RECNO |
DBI_OP_NEXT | Move cursor to next record then get it. | c_get | DB_NEXT |
DBI_OP_NEXT_DATA | Move cursor to next record having the same key then get the value. | c_get | DB_NEXT_DUP |
DBI_OP_NEXT_KEY | Move cursor to next record having different key then get the record. | c_get | DB_NEXT_NODUP |
DBI_OP_PREV | Move cursor to previous record then get it. | c_get | DB_PREV |
DBI_OP_PUT | Insert new key-data | put | DB_PUT |
DBI_OP_REPLACE | Overwrite current position value | c_put | DB_CURRENT |
DBI_OP_ADD | Insert new key-data if it does not already exist | put | DB_NODUPDATA |
DBI_OP_ADD | Insert new key-data if it does not already exist | c_put | DB_NODUPDATA |
DBI_OP_DEL | Delete key-data record | del | 0 |
DBI_OP_DEL | Delete record at cursor position | c_del | 0 |
DBI_OP_CLOSE | Close cursor | c_close | N/A |
dbi_val_t flags
Name | Role | Berkeley db flags |
---|---|---|
0 | data should be allocated or reallocated | DB_DBT_MALLOC (if data is NULL) or DB_DBT_REALLOC |
DBI_VF_PROTECTED | data should not be freed | |
DBI_VF_DONTGROW | data should not be reallocated | N/A |
DBI_VF_DONTGROW+DBI_VF_PROTECTED | data should not be reallocated | DB_DBT_USERMEM |
DBI_VF_READONLY | data should not be modified | DB_DBT_READONLY |
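The flag table above can be read as a small mapping function. The sketch below follows the table row by row; the numeric values are placeholders chosen for this example (they are neither the real DBI_VF_* values nor the libdb DB_DBT_* values), and returning 0 for combinations the table does not map is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder values for the dbi_val_t flags. */
enum {
    DEMO_VF_PROTECTED = 0x1,
    DEMO_VF_DONTGROW  = 0x2,
    DEMO_VF_READONLY  = 0x4,
    DEMO_VF_ALL       = 0x7
};

/* Placeholder values standing in for the Berkeley DB DBT flags. */
enum {
    DEMO_DBT_MALLOC   = 0x10,
    DEMO_DBT_REALLOC  = 0x20,
    DEMO_DBT_USERMEM  = 0x40,
    DEMO_DBT_READONLY = 0x80
};

/* Returns the DBT flag matching the value flags, or 0 when the table
 * gives no Berkeley DB counterpart for that combination. */
unsigned int demo_val_flags_to_dbt(unsigned int vf, const void *data)
{
    assert((vf & ~(unsigned int)DEMO_VF_ALL) == 0);
    switch (vf) {
    case 0:
        /* Buffer may be allocated (no data yet) or grown. */
        return data == NULL ? DEMO_DBT_MALLOC : DEMO_DBT_REALLOC;
    case DEMO_VF_DONTGROW | DEMO_VF_PROTECTED:
        /* Caller-owned fixed buffer. */
        return DEMO_DBT_USERMEM;
    case DEMO_VF_READONLY:
        return DEMO_DBT_READONLY;
    default:
        return 0;
    }
}
```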
dbi_val_t flags to DBT flags mapping
dbi_val_t | DBT |
---|---|
0 | DB_DBT_MALLOC (if data is NULL) or DB_DBT_REALLOC |
DBI_VF_PROTECTED | data should not be freed |
dbi_bulk_t flags
Name | Role |
---|---|
DBI_VF_BULK_DATA | Bulk operation on data only |
DBI_VF_BULK_RECORD | Bulk operation on key+data |
error codes
Name | Role | Old bdb value |
---|---|---|
DBI_RC_SUCCESS | No error | 0 |
DBI_RC_NOMEM | Memory allocation error (usually it does not happen because slapi_ch_malloc cannot return NULL) | DB_BUFFER_SMALL |
DBI_RC_KEYEXIST | Key exists and duplicate keys are not allowed. | DB_KEYEXIST |
DBI_RC_RETRY | Transient error: operation should be retried. | DB_LOCK_DEADLOCK |
DBI_RC_NOTFOUND | Record not found: Key does not exist. | DB_NOTFOUND |
DBI_RC_RUNRECOVERY | Recovery must be performed. | DB_RUNRECOVERY |
DBI_RC_OTHER | Other database errors | N/A |
Note: the implementation plugin should log an error with the error code and error text when it gets an error that cannot be mapped (to ease diagnosis in case of unexpected errors).
(TODO: get the callback names and prototypes from dblayer.h and put them in this document to have the full API.)
Name | Role | Old bdb value |
---|---|---|
dblayer_start_fn_t *dblayer_start_fn | ||
dblayer_close_fn_t *dblayer_close_fn | ||
dblayer_instance_start_fn_t *dblayer_instance_start_fn | ||
dblayer_backup_fn_t *dblayer_backup_fn | ||
dblayer_verify_fn_t *dblayer_verify_fn | ||
dblayer_db_size_fn_t *dblayer_db_size_fn | ||
dblayer_ldif2db_fn_t *dblayer_ldif2db_fn | ||
dblayer_db2ldif_fn_t *dblayer_db2ldif_fn | ||
dblayer_db2index_fn_t *dblayer_db2index_fn | ||
dblayer_cleanup_fn_t *dblayer_cleanup_fn | ||
dblayer_upgradedn_fn_t *dblayer_upgradedn_fn | ||
dblayer_upgradedb_fn_t *dblayer_upgradedb_fn | ||
dblayer_restore_fn_t *dblayer_restore_fn | ||
dblayer_txn_begin_fn_t *dblayer_txn_begin_fn | ||
dblayer_txn_commit_fn_t *dblayer_txn_commit_fn | ||
dblayer_txn_abort_fn_t *dblayer_txn_abort_fn | ||
dblayer_get_info_fn_t *dblayer_get_info_fn | ||
dblayer_set_info_fn_t *dblayer_set_info_fn | ||
dblayer_back_ctrl_fn_t *dblayer_back_ctrl_fn | ||
dblayer_get_db_fn_t *dblayer_get_db_fn | ||
dblayer_delete_db_fn_t *dblayer_delete_db_fn | ||
dblayer_rm_db_file_fn_t *dblayer_rm_db_file_fn | ||
dblayer_import_fn_t *dblayer_import_fn | ||
dblayer_load_dse_fn_t *dblayer_load_dse_fn | ||
dblayer_config_get_fn_t *dblayer_config_get_fn | ||
dblayer_config_set_fn_t *dblayer_config_set_fn | ||
instance_config_set_fn_t *instance_config_set_fn | ||
instance_config_entry_callback_fn_t *instance_add_config_fn | ||
instance_config_entry_callback_fn_t *instance_postadd_config_fn | ||
instance_config_entry_callback_fn_t *instance_del_config_fn | ||
instance_config_entry_callback_fn_t *instance_postdel_config_fn | ||
instance_cleanup_fn_t *instance_cleanup_fn | ||
instance_create_fn_t *instance_create_fn | ||
instance_create_fn_t *instance_register_monitor_fn | ||
instance_search_callback_fn_t *instance_search_callback_fn | ||
dblayer_auto_tune_fn_t *dblayer_auto_tune_fn | ||
dblayer_cursor_op(DBI_CUR *cur, DBI_OP op, DBI_DATA *key, DBI_DATA *data) | Move cursor and get record | cursor->c_get |
dblayer_cursor_op(DBI_CUR *cur, DBI_OP op, DBI_DATA *key, DBI_DATA *data) | Add/replace a record | cursor->c_put |
dblayer_cursor_op(DBI_CUR *cur, DBI_OP op, DBI_DATA *key, DBI_DATA *data) | Remove a record | cursor->c_del |
dblayer_cursor_op(DBI_CUR *cur, DBI_OP op, DBI_DATA *key, DBI_DATA *data) | Close the cursor | cursor->c_close |
dblayer_new_cursor(be,db,txn, cursor) | Should store the backend in cldb_Handle to retrieve it. | db->cursor(db, db_txn, &cursor, 0); |
dblayer_db_op(DBI_DB *db, DBI_OP op, DBI_DATA *key, DBI_DATA *data) | Get a record | db->get |
dblayer_db_op(be, DBI_DB *db, DBI_OP op, DBI_DATA *key, DBI_DATA *data) | Add/replace a record | db->put |
dblayer_db_op(be, DBI_DB *db, DBI_OP op, DBI_DATA *key, DBI_DATA *data) | Delete a record | db->del |
dblayer_get_db_id | | db->fname |
dblayer_init_bulk_op(DBI_DATA *bulk) | Initialize iterator for bulk operation | DB_MULTIPLE_INIT |
dblayer_next_bulk_op(DBI_DATA *bulk, DBI_DATA *key, DBI_DATA *data) | Get next operation from bulk operation | DB_MULTIPLE_NEXT |
The bdb plugin is the plugin that implements the dbimpl API callbacks and calls the libdb functions. The important points are:

    bdb_dbival2dbt(key, &bdb_key, PR_FALSE);     /* Convert dbi_val_t to DBT before the libdb call */
    bdb_dbival2dbt(data, &bdb_data, PR_FALSE);
    rc = some_native_libdb_function(..., &bdb_key, &bdb_data, ...);
    bdb_dbt2dbival(&bdb_key, key, PR_TRUE);      /* Convert back the DBT to dbi_val_t after the libdb call */
    bdb_dbt2dbival(&bdb_data, data, PR_TRUE);
    return bdb_map_error(__FUNCTION__, rc);

    /* Reverse direction: convert DBT to dbi_val_t before calling the upper layer, then convert back */
    bdb_dbt2dbival(&key, &dbikey, PR_FALSE);
    idl = idl_fetch(be, db, &dbikey, NULL, NULL, &ret);
    bdb_dbival2dbt(&dbikey, &key, PR_TRUE);

Note: In both cases isresponse is set to PR_FALSE before the operation and to PR_TRUE after it. If a key or data gets allocated or reallocated, the original key/data gets freed (if the value flags allow it).
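The convert, call, convert back pattern, including the release of a replaced buffer, could look roughly like the sketch below. It uses stand-in types (demo_val_t for dbi_val_t, demo_native_t for DBT) and a fake native call that always returns a freshly allocated value; none of these names exist in the real plugin, and the handling of a protected buffer that gets replaced is deliberately left simple.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { DEMO_VF_PROTECTED = 0x1 };   /* placeholder flag value */

typedef struct { unsigned int flags; void *data; size_t size; } demo_val_t;
typedef struct { void *data; size_t size; } demo_native_t;   /* stand-in for DBT */

/* Before the native call: expose the caller's buffer to the native layer. */
static void demo_val2native(const demo_val_t *v, demo_native_t *n)
{
    n->data = v->data;
    n->size = v->size;
}

/* After the native call: take the result back, freeing the old buffer
 * if it was replaced and the value flags allow it. */
static void demo_native2val(const demo_native_t *n, demo_val_t *v)
{
    if (n->data != v->data) {
        if (!(v->flags & DEMO_VF_PROTECTED)) {
            free(v->data);
        }
        v->data = n->data;
    }
    v->size = n->size;
}

/* Fake native "get": returns a newly allocated copy of a fixed value. */
static int demo_native_get(demo_native_t *key, demo_native_t *data)
{
    static const char value[] = "value";
    char *copy = malloc(sizeof(value));

    (void)key;
    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, value, sizeof(value));
    data->data = copy;
    data->size = sizeof(value);
    return 0;
}

int demo_get(demo_val_t *key, demo_val_t *data)
{
    demo_native_t nkey, ndata;
    int rc;

    assert(key != NULL && data != NULL);
    demo_val2native(key, &nkey);
    demo_val2native(data, &ndata);
    rc = demo_native_get(&nkey, &ndata);
    demo_native2val(&nkey, key);
    demo_native2val(&ndata, data);
    return rc;
}
```

In this sketch the key buffer is untouched by the native call, so it is never freed, while the data buffer is swapped for the one the native layer allocated.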
dup_cmp_fn callback: as these callbacks are called directly within libdb (i.e. using DBT), they have been moved into the db-bdb plugin, and rather than setting the callback directly in the upper layer, there is a dbimpl function to set some specific function.
The bdb_map_error function converts some well-known errors to their DBI counterparts. For other errors, a generic value is returned after the bdb native error has been logged.
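A minimal sketch of such a mapping function is shown below. It is not the real bdb_map_error: the DEMO_NATIVE_* values are invented placeholders (the real ones come from db.h), demo_native_strerror stands in for db_strerror, and logging goes to stderr rather than to the server error log.

```c
#include <assert.h>
#include <stdio.h>

typedef enum {
    DEMO_RC_SUCCESS = 0,
    DEMO_RC_KEYEXIST,
    DEMO_RC_RETRY,
    DEMO_RC_NOTFOUND,
    DEMO_RC_RUNRECOVERY,
    DEMO_RC_OTHER
} demo_rc_t;

/* Placeholder native error codes, for illustration only. */
enum {
    DEMO_NATIVE_KEYEXIST      = -101,
    DEMO_NATIVE_LOCK_DEADLOCK = -102,
    DEMO_NATIVE_NOTFOUND      = -103,
    DEMO_NATIVE_RUNRECOVERY   = -104
};

/* Stand-in for the native error text lookup. */
static const char *demo_native_strerror(int native_rc)
{
    (void)native_rc;
    return "unexpected native error";
}

demo_rc_t demo_map_error(const char *funcname, int native_rc)
{
    assert(funcname != NULL);
    switch (native_rc) {
    case 0:
        return DEMO_RC_SUCCESS;
    case DEMO_NATIVE_KEYEXIST:
        return DEMO_RC_KEYEXIST;
    case DEMO_NATIVE_LOCK_DEADLOCK:
        return DEMO_RC_RETRY;
    case DEMO_NATIVE_NOTFOUND:
        return DEMO_RC_NOTFOUND;
    case DEMO_NATIVE_RUNRECOVERY:
        return DEMO_RC_RUNRECOVERY;
    default:
        /* Unmapped: log the native details here, since the caller
         * will only ever see the generic code. */
        fprintf(stderr, "%s failed with db error %d : %s\n",
                funcname, native_rc, demo_native_strerror(native_rc));
        return DEMO_RC_OTHER;
    }
}
```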
value handling:
Proposed solution
* Solution 1
    * Remap the errors to generic values
        * Add a function in bdb that remaps the value (should be a simple switch). If the value cannot be mapped we could:
            * add a string in thread local storage and return DBI\_RC\_OTHER. The string should contain the original return code and its associated message (i.e. "bdb error code: %d : %s", native\_rc, db\_strerror(native\_rc))
        * Modify dblayer\_strerror to print a message for generic errors and, if DBI\_RC\_OTHER, to generate a message from the thread local data string.
    * This solution has the advantage that:
        * It does not impact the back-ldbm/changelog code (except for dblayer\_strerror)
        * It is quite efficient in the usual case as it handles a switch with few values
        * It keeps the ability to diagnose errors in the unexpected case
    * The drawbacks:
        * The message can be wrong if creative error handling is performed (i.e. rc1 = dblayer\_xxx(li, ...); rc2 = dblayer\_xxx(li, ...); log(dblayer\_strerror(rc1)) prints the rc2 message if both values are DBI\_RC\_OTHER). We should double check that when hitting unexpected errors we just log an error message and abort the operation (as it is possible that we abort the txn before logging the error)
        * Error handling should be done in the same thread as the operation (this is IMHO the case)
* Solution 2: I thought about keeping the db code as is, but that implies a lot of changes as we need to access the db plugin to determine what action to take or to log the error (and the dblayer instance context is not always easily available when the message is logged).
* Solution 3: Same as solution 1 but without storing data in thread local storage. The problem is that we are left clueless in case of an unexpected database error, unless an error message is logged by the plugin (note: that is finally the implemented solution).
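For comparison, solution 1 (which was not the one finally implemented) could be sketched as follows: unmapped errors leave their native details in thread local storage, and the strerror function reads them back. The names and code values are hypothetical, and C11 _Thread_local is assumed to be available.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { DEMO_RC_SUCCESS = 0, DEMO_RC_NOTFOUND = 1, DEMO_RC_OTHER = 2 };

/* One message buffer per thread, overwritten by each unmapped error. */
static _Thread_local char demo_last_error[128];

/* Remap a native code; -1 is a placeholder for the native "not found". */
int demo_remap(int native_rc, const char *native_msg)
{
    assert(native_msg != NULL);
    switch (native_rc) {
    case 0:
        return DEMO_RC_SUCCESS;
    case -1:
        return DEMO_RC_NOTFOUND;
    default:
        snprintf(demo_last_error, sizeof(demo_last_error),
                 "bdb error code: %d : %s", native_rc, native_msg);
        return DEMO_RC_OTHER;
    }
}

const char *demo_strerror(int rc)
{
    switch (rc) {
    case DEMO_RC_SUCCESS:
        return "no error";
    case DEMO_RC_NOTFOUND:
        return "record not found";
    case DEMO_RC_OTHER:
        /* Message comes from whatever unmapped error was seen last. */
        return strlen(demo_last_error) > 0 ? demo_last_error
                                           : "unknown database error";
    default:
        return "invalid error code";
    }
}
```

The first drawback listed above shows up directly: after two unmapped errors in the same thread, asking for the message of the first one returns the text of the second, because only one buffer exists per thread.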
These questions will need to be solved in phase 4.
VLV and RECNO
Not an issue for this phase, but it will be an issue when writing the lmdb implementation plugin (i.e. phase 4). So far I have no idea how to efficiently implement the DBI_OP_GET_RECNO (i.e. DB_GET_RECNO) and DBI_OP_MOVE_TO_RECNO (i.e. DB_SET_RECNO) operations on lmdb.
VLV searches the index records by record number. bdb is able to do that on btree databases, but lmdb does not offer this feature. The bad thing is that this numbering comes directly from the VLV LDAP RFC draft, so it is not something that we can easily change.
I wonder if having a vlv index would then still be useful (maybe only to avoid having to sort the entries).
(And the paged control could also benefit from the cache to avoid having to rebuild the complete request.)
Read transaction support
ns-slapd does not use read-only txns with bdb (read operations are transactionless) while lmdb requires them. We should determine the txn strategy:
* Having a single read txn for the whole ldap read operation.
* Having a read txn for every db read operation (is that efficient?)
* Mixed approach: having a read txn for specific functions (like building an idl from an index)

Anyway it is not an issue for this phase (the only concern in phase 3 is that the architecture should be flexible enough to easily support that evolution).
My feeling is that in phase 4 we simply use the read txn inside the lmdb plugin:
* generating a read txn for a single db operation if no txn is provided
* generating a read txn from cursor creation until cursor deletion if no txn is provided

and copy the resulting db keys and values into the dbi_val_t buffer (as bdb does with the DBT buffer). This is not the most performant option, but it is fast to implement and it mimics the current bdb behavior.
Then, once bdb is out, we could have a perf improvement phase to boost the read operations by using a global txn and avoiding the need to duplicate the keys and values (there is no need to duplicate the data returned by the db as they stay mmapped until the txn is aborted/committed), which would also offer better consistency than the current model. But we cannot do it while the bdb plugin is still around because of the risk of deadlock and excessive retries on bdb.
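The "generate a read txn if none is provided" idea could be sketched as below. This is only an illustration with stand-in types and a one-record fake store; it does not call the lmdb API, and demo_txn_t, demo_lookup and demo_db_get are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* Stand-in read transaction; a real plugin would wrap the native handle. */
typedef struct { int active; } demo_txn_t;

static int demo_txn_begun = 0;   /* counters so the behavior can be observed */
static int demo_txn_ended = 0;

static void demo_txn_begin(demo_txn_t *txn) { txn->active = 1; demo_txn_begun++; }
static void demo_txn_end(demo_txn_t *txn)   { txn->active = 0; demo_txn_ended++; }

/* Fake store holding a single record; reads require an active txn. */
static const char *demo_lookup(const demo_txn_t *txn, const char *key)
{
    assert(txn->active);
    return strcmp(key, "cn") == 0 ? "admin" : NULL;
}

/* Returns 0 on success, 1 if the key is not found or out is too small. */
int demo_db_get(demo_txn_t *txn, const char *key, char *out, size_t outlen)
{
    demo_txn_t local;
    int own_txn = (txn == NULL);
    int rc = 1;
    const char *val;

    assert(key != NULL && out != NULL);
    if (own_txn) {
        /* No txn provided: open one just for this operation. */
        txn = &local;
        demo_txn_begin(txn);
    }
    val = demo_lookup(txn, key);
    if (val != NULL && strlen(val) < outlen) {
        /* Copy the result: it must stay valid after the txn ends. */
        memcpy(out, val, strlen(val) + 1);
        rc = 0;
    }
    if (own_txn) {
        demo_txn_end(txn);
    }
    return rc;
}
```

When the caller passes its own txn, the function neither begins nor ends one, which leaves room for the later "global read txn" optimization without changing the call signature.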
Phase 3 is about being able to remove the bdb dependencies (i.e. being able to build ns-slapd, libback-ldbm and replication without the bdb includes and libs). Due to the size of these changes (FYI: phase 3a already impacts 53 files), it seems better to split the phase into sub-phases: