- Indexer service
- Introduction
- Indexer API access
- Re-index
- Reindex given records
- Troubleshoot indexing issues
The Indexer API provides a mechanism for indexing documents that contain structured or unstructured data. Documents and indices are saved in a separate persistent store that is optimized for search operations. The Indexer API can index any number of documents.
The indexer contains index attributes defined in the schema. The schema can be created during record ingestion in the Managed Planning Data Foundation by the Schema service. The Indexer service also adds a number of Managed Planning Data Foundation meta attributes, such as ID, kind, parent, acl, namespace, type, version, legaltags, and index to each record during indexing.
Required roles
The Indexer service requires that users and service accounts have dedicated roles in order to use it. Users must be a member of users.datalake.admins or users.datalake.ops. You can assign roles using the Entitlements service. Refer to the API documentation for specific requirements.
In addition to service roles, users must be a member of the data groups to access the data.
Required headers
The Managed Planning Data Foundation stores data in different partitions, depending on the different accounts in the OSDU system.
A user may belong to more than one account. As a user, after logging into the OSDU portal, you need to select the account you wish to be active. Also, when using the Search APIs, you need to specify the active account in the header called data-partition-id. The correct data-partition-id can be obtained from the CFS service. The data-partition-id enables the search within the mapped partition. For example:
data-partition-id: opendes
Optional headers
The correlation-id is a traceable ID used to track the journey of a single request. The correlation-id can be a GUID sent as a request header. It is a best practice to provide the correlation-id so the request can be tracked through all of the services. For example:
correlation-id: 1e0fef08-22fd-49b1-a5cc-dffa21bc0b70
If the service is initiating the request, an ID should be generated. If the correlation-id is not provided, then a new ID is generated by the service so that the request is traceable.
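The header rules above can be sketched as a small helper. This is a minimal illustration, not part of the service: the function name build_headers is hypothetical, and it assumes the client generates a GUID itself when the caller does not pass a correlation-id.

```python
import uuid


def build_headers(token, data_partition_id, correlation_id=None):
    """Return the HTTP headers for a request to the Indexer service.

    build_headers is a hypothetical helper. When no correlation-id is given,
    a new GUID is generated so the request stays traceable from the client.
    """
    return {
        "accept": "application/json",
        "authorization": f"Bearer {token}",
        "content-type": "application/json",
        # Required: selects the active account (partition).
        "data-partition-id": data_partition_id,
        # Optional: reuse the caller's ID, otherwise create a fresh one.
        "correlation-id": correlation_id or str(uuid.uuid4()),
    }


headers = build_headers("<JWT>", "opendes")
```

Passing the same correlation-id to every downstream call made on behalf of one request keeps the whole chain searchable under a single ID.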
The Re-index API allows you to re-index a kind without re-ingesting the records using the Storage API. Re-indexing a kind is an asynchronous operation. When a user calls this API, it responds with HTTP 200 if it can launch the re-indexing or the appropriate error code if it cannot. The current status of the indexing can be tracked by calling the Search API and querying for this particular kind. Note that re-indexing may take from a few seconds to a few hours, because latency depends on factors such as the number of records in the kind and the current load on the Indexer service.
Note: If a kind has been previously indexed with a particular schema and you wish to apply schema changes during re-indexing, the previous kind index has to be deleted using the Index Delete API. Without this clean-up, the Reindex API will use the same schema and overwrite the records with the same IDs.
Prerequisite: Users must be a member of the users.datalake.admins group.
POST /api/indexer/v2/reindex
{
"kind": "opendes:welldb:wellbore:1.0.0"
}
**Curl**
curl --request POST \
--url '/api/indexer/v2/reindex' \
--header 'accept: application/json' \
--header 'authorization: Bearer <JWT>' \
--header 'content-type: application/json' \
--header 'data-partition-id: opendes' \
--data '{
"kind": "opendes:welldb:wellbore:1.0.0"
}'
The Full Re-index API allows you to re-index an entire partition without re-ingesting the records using the Storage API. Similar to the Reindex API for a specific kind, it is an asynchronous operation. When a user calls this API, it responds with HTTP 200 if it can launch the re-indexing or the appropriate error code if it cannot. Note that re-indexing may take from a few seconds to a few hours, because latency depends on factors such as the number of records and the current load on the Indexer service. The API may return a 502 Bad Gateway response because of the configured connection timeout. If that happens, do not call the Full Re-index API again immediately; the re-index operation may still be running in the backend. The current status of the indexing can be tracked by calling the Search API.
The Full Re-index API takes a parameter named "force_clean" which cleans up all previous indices and re-indexes the records. The default value set is "false".
Note: Setting this parameter to true fully removes all indexes for all kinds. There is also a re-index limitation in OSDU: kinds with more than 250K records do not get fully re-indexed. If your kind has more records than this, you will need to re-ingest your records after calling the re-index so that all records are available and queryable by attributes.
Prerequisite: Users must be a member of the users.datalake.ops group.
Query parameters:
force_clean
(optional, Boolean) If there is any inconsistency between the storage records and the indexed records, you can use this query parameter to synchronize them. If true, it drops the current indexed data, applies the latest schema changes, and re-indexes the records. If false, the Reindex API applies the latest schema and overwrites records with the same IDs. The default value is false.
**Curl**
curl --request PATCH \
--url '/api/indexer/v2/reindex?force_clean=false' \
--header 'accept: application/json' \
--header 'authorization: Bearer <JWT>' \
--header 'content-type: application/json' \
--header 'data-partition-id: opendes'
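The two re-index calls described above can also be built programmatically. The following is a sketch using only the Python standard library. BASE_URL and the function names are hypothetical placeholders, and the requests are constructed but not sent.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://osdu.example.com"  # hypothetical host, replace with your own


def _headers(token, partition):
    # Same headers as the curl examples in this document.
    return {
        "accept": "application/json",
        "authorization": f"Bearer {token}",
        "content-type": "application/json",
        "data-partition-id": partition,
    }


def reindex_kind_request(kind, token, partition):
    """Build the POST request that re-indexes a single kind."""
    return urllib.request.Request(
        url=f"{BASE_URL}/api/indexer/v2/reindex",
        data=json.dumps({"kind": kind}).encode("utf-8"),
        headers=_headers(token, partition),
        method="POST",
    )


def full_reindex_request(token, partition, force_clean=False):
    """Build the PATCH request that re-indexes the whole partition."""
    # The query parameter is sent as a lowercase boolean string.
    query = urllib.parse.urlencode({"force_clean": str(force_clean).lower()})
    return urllib.request.Request(
        url=f"{BASE_URL}/api/indexer/v2/reindex?{query}",
        headers=_headers(token, partition),
        method="PATCH",
    )


req = reindex_kind_request("opendes:welldb:wellbore:1.0.0", "<JWT>", "opendes")
```

A request built this way could be sent with urllib.request.urlopen(req). Because both operations are asynchronous, a successful response only means the re-indexing was launched, not that it has finished.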
The Reindex records API allows users to re-index specific records by providing their record IDs, without re-ingesting the records via the Storage API. It is an asynchronous operation: when a user calls this API, it responds with HTTP 202 if it can launch the re-indexing or the appropriate error code if it cannot. The response body indicates which of the given records were re-indexed and which were not found in storage. It supports up to 1000 records per API call.
POST /api/indexer/v2/reindex/records HTTP/1.1
{
"recordIds": ["opendes:work-product-component--WellLog:17763fcc18864f4f8eab62e320f8913d", "opendes:work-product-component--WellLog:566edebc-1a9f-4f4d-9a30-ed458e959ac7"]
}
**Curl**
curl --request POST \
--url '/api/indexer/v2/reindex/records' \
--header 'accept: application/json' \
--header 'authorization: Bearer <JWT>' \
--header 'content-type: application/json' \
--header 'data-partition-id: opendes' \
--data '{
"recordIds": ["opendes:work-product-component--WellLog:17763fcc18864f4f8eab62e320f8913d", "opendes:work-product-component--WellLog:566edebc-1a9f-4f4d-9a30-ed458e959ac7"]
}'
Prerequisite: Users must be a member of the users.datalake.admins group.
recordIds
(required, Array of String) Storage records to be re-indexed.
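Because each call accepts at most 1000 record IDs, a longer list has to be split across several requests. The following is a minimal sketch of that batching. The helper name batch_record_ids is hypothetical, and sending each body is left to the caller.

```python
MAX_RECORDS_PER_CALL = 1000  # limit stated for the Reindex records API


def batch_record_ids(record_ids, batch_size=MAX_RECORDS_PER_CALL):
    """Yield request bodies holding at most batch_size record IDs each.

    batch_record_ids is a hypothetical helper, not part of the service.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(record_ids), batch_size):
        # Each body has the shape expected by POST /api/indexer/v2/reindex/records.
        yield {"recordIds": record_ids[start:start + batch_size]}


# Illustrative IDs only; real IDs come from the Storage service.
ids = [f"opendes:work-product-component--WellLog:{n}" for n in range(2500)]
bodies = list(batch_record_ids(ids))
```

Each yielded body can be serialized with json.dumps and posted as a separate request.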
{
"reIndexedRecords": [
"opendes:work-product-component--WellLog:566edebc-1a9f-4f4d-9a30-ed458e959ac7"
],
"notFoundRecords": [
"opendes:work-product-component--WellLog:17763fcc18864f4f8eab62e320f8913d"
]
}
The Indexer service adds internal metadata to each record which registers the status of the indexing. The metadata includes the status and the last indexing date and time. This additional meta block helps to see the details of the indexing. The format of the index meta block is as follows:
{
"index": {
"trace": [
String,
String
],
"statusCode": Integer,
"lastUpdateTime": Datetime
}
}
Example:
{
"results": [
{
"index": {
"trace": [
"datetime parsing error: unknown format for attribute: endDate | value: 9000-01-01T00:00:00.0000000",
"datetime parsing error: unknown format for attribute: startDate | value: 1990-01-01T00:00:00.0000000"
],
"statusCode": 400,
"lastUpdateTime": "2018-11-16T01:44:08.687Z"
}
}
],
"totalCount": 31895
}
Details of the index block:
trace: This field collects all the issues related to the indexing and concatenates them using '|'. This is a String field.
statusCode: This field determines the category of the error. This is an integer field. It can have the following values:
- 200 - All OK.
- 404 - Schema is missing in the Schema service.
- 400 - Some fields were not properly mapped with the schema defined. For example, the schema defines a field as int but the input record had a text value for that attribute.
lastUpdateTime: This field captures the last time the record was updated by the Indexer service. This is a datetime field, so you can do range queries on it.
You can query the index status using the following example query:
curl --request POST \
--url /search/v2/query \
--header 'Authorization: Token' \
--header 'Content-Type: application/json' \
--header 'Data-Partition-Id: Data partition id' \
--data '{"kind": "*:*:*:*","query": "index.statusCode:404","returnedFields": ["index"]}'
Note: By default, the API response excludes the 'index' attribute block. The user must specify 'index' in 'returnedFields' in order to see it in the response.
The above query returns all records whose schema is missing in the Schema service (status code 404). To find records that had problems due to field mismatches, query for index.statusCode:400 instead.
Refer to the Search service documentation for examples on different kinds of search queries.
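Once a search response containing the index block is available, the trace messages can be grouped by status code to see which category of problem dominates. The sketch below assumes a response shaped like the example in this section. The names STATUS_MEANING and summarize_index_status are hypothetical.

```python
from collections import defaultdict

# Status codes and meanings as listed in this document.
STATUS_MEANING = {
    200: "All OK",
    404: "Schema is missing in the Schema service",
    400: "Some fields were not properly mapped with the schema",
}


def summarize_index_status(search_response):
    """Group trace messages by statusCode across all results."""
    summary = defaultdict(list)
    for result in search_response.get("results", []):
        index = result.get("index")
        if index is None:
            # 'index' was not requested in returnedFields for this result.
            continue
        summary[index["statusCode"]].extend(index.get("trace", []))
    return dict(summary)


response = {
    "results": [
        {
            "index": {
                "trace": [
                    "datetime parsing error: unknown format for attribute: "
                    "endDate | value: 9000-01-01T00:00:00.0000000"
                ],
                "statusCode": 400,
                "lastUpdateTime": "2018-11-16T01:44:08.687Z",
            }
        }
    ],
    "totalCount": 31895,
}

for code, traces in summarize_index_status(response).items():
    print(code, STATUS_MEANING.get(code, "Unknown status"), len(traces))
```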