With very large databases, some queries go through a lot of work to build huge ID lists for filter components with many matching IDs. For example, a search for “(&(objectclass=inetorgperson)(uid=foo))” may build a huge idlist for “objectclass=inetorgperson” only to throw it away to intersect it with “uid=foo”. In these cases, it would be useful to be able to tell the indexing code to use a different idlistscanlimit for certain indexes, or use no idlist at all. In the above case, it would be useful to tell the indexing code to skip building an idlist for objectclass=inetorgperson, but still use the default idlistscanlimit for other objectclass searches (e.g. objectclass=groupOfNames).
Another example of this problem is https://github.com/389ds/389-ds-base/issues/811 - if there are several million IDs for each of the objectclass= filter components, being able to skip id list generation for the objectclass values would make that query very fast.
Unfortunately, nsslapd-idlistscanlimit is a blunt instrument - it applies to all indexes in the database. We currently have no way to specify a different idlistscanlimit for different types of search filters.
In these cases, we have a search filter which has an attribute with a large number of matching values, and another attribute with a very small number of matching values. We need to have the ability to control the max ID list size for a given index, including the type of index (eq, sub), and even the equality value - a different max ID list size for specific values of objectclass - for example, in the above, a size of 0 for objectclass=inetorgperson, but use the default for objectclass=groupOfNames.
If there are 10 million entries that match objectclass=inetorgperson, then a search for (&(objectclass=inetorgperson)(uid=foo)) will build an ID list containing 10m IDs for objectclass=inetorgperson, then throw it away to intersect it with uid=foo. If there are many simultaneous searches like this, the amount of CPU and RAM resources consumed by the server is great, and the performance suffers.
In some cases, the user may want to allow a very large ID list for some searches, in order that they are indexed and fast, but not increase the nsslapd-idlistscanlimit for all searches. For example, an application wants to be able to search for users by pattern (substring), and wants that search to be as fast as possible (i.e. indexed, not look through).
The user wants the search for (sn=sm*) to be very fast (indexed), and so wants to have a very high idlistscanlimit for the “sn” “sub” index (there may be hundreds of thousands of users that match “sm*” - “smith”, “small”, etc.), but wants to have a very low idlistscanlimit for other searches.
We cannot reuse or re-purpose nsslapd-idlistscanlimit, so the proposed solution is to add a new allowed attribute to the nsIndex objectclass: nsIndexIDListScanLimit
nsIndexIDListScanLimit: limit=NNN [type=eq[,sub,...]] [flags=AND[,XXX..]] [values=val[,val...]]
This is a multi-valued, DirectoryString syntax attribute, which is allowed (optional) for the nsIndex objectclass.
type=eq values=abc
because “abc” is not integer syntax.
If the values contain spaces, commas, nulls, other values which require escapes, the LDAP filter escape syntax should be used - backslash ‘\’ followed by the 2 hex digit code for the character. For example, DN syntax values:
dn: cn=uniquemember,cn=index,cn=userRoot,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsIndexIDListScanLimit
nsIndexIDListScanLimit: limit=0 type=eq values=uid=kvaughan\2Cou=People\2Cdc=example\2Cdc=com,uid=rdaugherty\2Cou=People\2Cdc=example\2Cdc=com
the commas in the DN values are replaced with ”\2C”
The values are processed as if they were filter values - so for “sub” values, values like “values=*sm*ith*” will be processed as if they were values in a substring search filter like (sn=*sm*ith*)
dn: cn=objectclass,...
objectclass: nsIndex
nsIndexType: eq
nsIndexIDListScanLimit: limit=0 type=eq flags=AND values=inetOrgPerson
nsIndexIDListScanLimit: limit=1 type=eq values=inetOrgPerson
nsIndexIDListScanLimit: limit=2 type=eq flags=AND
nsIndexIDListScanLimit: limit=3 type=eq
nsIndexIDListScanLimit: limit=4 flags=AND
nsIndexIDListScanLimit: limit=5
For example, in the objectclass case, you might want to have a maxsize of 0 for inetorgperson:
dn: cn=objectclass,cn=index,...
objectclass: nsIndex
nsIndexType: eq
nsIndexIDListScanLimit: limit=0 type=eq values=inetorgperson
This means, build no ID list (“0”) for “objectclass=inetorgperson”, but use the default nsslapd-idlistscanlimit value for other objectclass=$value searches.
Or, to apply the limit only to the use of objectclass=inetorgperson in AND clauses
dn: cn=objectclass,cn=index,...
objectclass: nsIndex
nsIndexType: eq
nsIndexIDListScanLimit: limit=0 type=eq flags=AND values=inetorgperson
This means, build no ID list (“0”) for “objectclass=inetorgperson” when used in an AND clause like (&(objectclass=inetorgperson)(uid=someuser)), but use the default nsslapd-idlistscanlimit value for other objectclass=$value searches, like (objectclass=inetorgperson) or in an OR clause like (|(objectclass=inetorgperson)(objectclass=posixAccount))
Or, for a substring example:
dn: cn=givenName,cn=index,...
objectclass: nsIndex
nsIndexType: sub
nsIndexIDListScanLimit: limit=-1 type=sub values=ab*
This means, use an unlimited max ID list size for (givenname=ab*) but use the default for other givenname substring searches.
The struct attrinfo used to store per-attribute index information has been extended with a new member ai_idlistinfo which holds the parsed nsIndexIDListSizeLimit values for the index. This is implemented as a DataList object. ai_idlistinfo is only created if there is at least one valid nsIndexIDListSizeLimit. There is much additional parsing code added to ldbm_attr.c. The list of values is first checked for syntax, then converted to index keys (normalized) and stored as a Slapi_ValueSet. This allows fast value lookup during the indexing.
NOTE: There is no result checking for dynamic index setting over LDAP. The client has no way to know if the change was valid or not. The only way to tell currently is to look at the errors log for errors. Tests should be implemented to examine the errors log.
The server already had support for using a different idlistscanlimit (called allidslimit in the index code) for each lookup. An additional function was inserted into index_read_ext_allids() - index_get_allids(). When an index lookup happens, the values from the search filter are converted to keys in the same format. If the index specified nsIndexIDListSizeLimit with values, each index value is looked up in the Slapi_ValueSet to find a match. The index type (eq, sub) are also compared, and the flags. Each match is scored, and the idlistscanlimit corresponding to the best match is returned.
If nsIndexIDListScanLimit is not specified for an index, the nsslapd-idlistscanlimit value for the database is used.
The nsIndexIDListScanLimit setting is fully dynamic - live changes take place immediately.
There should be no impact on replication.
There should be no impact on update/upgrade.
This will not change any dependencies.
Documentation. We will need to document the new attribute in the various guides to which it is applicable, and in the Release Notes. We will need to advertise this to other groups heavily using 389 such as DogTag and FreeIPA.