SearchElasticsearch

SearchElasticsearch 2.8.0

Bundle: org.apache.nifi | nifi-elasticsearch-restapi-nar
Description: A processor that allows the user to repeatedly run a paginated query (with aggregations) written with the Elasticsearch JSON DSL. Search After/Point in Time queries must include a valid "sort" field. The processor will retrieve multiple pages of results until either no more results are available or the Pagination Keep Alive expiration is reached, after which the query will restart with the first page of results being retrieved.
Tags: elasticsearch, elasticsearch7, elasticsearch8, elasticsearch9, json, page, query, scroll, search
Input Requirement: FORBIDDEN
Supports Sensitive Dynamic Properties: false

Additional Details for SearchElasticsearch 2.8.0

This processor is intended for use with the Elasticsearch JSON DSL and Elasticsearch 5.X and newer. It is designed to be able to take a JSON query (e.g. from Kibana) and execute it as-is against an Elasticsearch cluster in a paginated manner. Like all processors in the “restapi” bundle, it uses the official Elastic client APIs, so it supports leader detection.

The query to execute must be provided in the Query configuration property.

The query is paginated in Elasticsearch using one of the available methods: “Scroll” or “Search After” (optionally with a “Point in Time” for Elasticsearch 7.10+ with X-Pack enabled). The number of results per page can be controlled using the size parameter in the Query JSON. For Search After functionality, a sort parameter must be present within the Query JSON.

Search results and aggregation results can be split up into multiple flowfiles. Aggregation results will only be split at the top level because nested aggregations lose their context (and thus lose their value) if separated from their parent aggregation. Additionally, the results from all pages can be combined into a single flowfile (but the processor will only load one page of data into memory at any one time).
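
As a rough illustration of splitting aggregations at the top level only, the following Python sketch turns one Elasticsearch response into one record per top-level aggregation while leaving nested aggregations inside their parent. The function name, the record layout, and the numbering from 1 are assumptions made for illustration; this is not the processor's implementation.

```python
import json


def split_top_level_aggregations(response: dict) -> list[dict]:
    """Return one record per top-level aggregation (hypothetical helper).

    Nested aggregations are not separated; they stay inside the JSON
    content of their parent so that they keep their context.
    """
    aggregations = response.get("aggregations", {})
    records = []
    # Numbering from 1 is an assumption, not something the source states.
    for number, (name, body) in enumerate(aggregations.items(), start=1):
        records.append({
            "attributes": {
                "aggregation.name": name,
                "aggregation.number": str(number),
            },
            "content": json.dumps(body),
        })
    return records


# Made-up response with one nested and one flat aggregation.
example_response = {
    "aggregations": {
        "weekly_sales": {
            "buckets": [
                {
                    "key": "2024-01-01",
                    "doc_count": 3,
                    "items": {"buckets": [{"key": "pizza", "doc_count": 2}]},
                }
            ]
        },
        "total_price": {"value": 42.5},
    }
}

split = split_top_level_aggregations(example_response)
for record in split:
    print(record["attributes"]["aggregation.name"], record["content"])
```

The nested "items" aggregation remains inside the "weekly_sales" record, which matches the reasoning above: separated from its parent bucket it would no longer say which week it belongs to.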

The following is an example query that would be accepted:
```
{
  "size": 10000,
  "sort": {
    "product": "desc"
  },
  "query": {
    "match": {
      "restaurant.keyword": "Local Pizzaz FTW Inc"
    }
  },
  "aggs": {
    "weekly_sales": {
      "date_histogram": {
        "field": "date",
        "interval": "week"
      },
      "aggs": {
        "items": {
          "terms": {
            "field": "product",
            "size": 10
          }
        }
      }
    }
  }
}
```
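
Because "size" and "sort" are top-level keys of an Elasticsearch search request body (siblings of "query", not children of it), it can be useful to sanity-check a query before configuring it. The Python sketch below is one possible check; the function name and messages are hypothetical, and the pagination type names are taken from the Allowable Values of the Pagination Type property.

```python
import json


def check_paginated_query(query_json: str, pagination_type: str) -> list[str]:
    """Return a list of problems found; an empty list means none (hypothetical helper)."""
    body = json.loads(query_json)
    problems = []

    # "size" and "sort" placed inside the "query" clause are a common mistake.
    query_clause = body.get("query")
    if isinstance(query_clause, dict):
        for key in ("size", "sort"):
            if key in query_clause:
                problems.append(f'"{key}" is nested inside "query"; move it to the top level')

    # Search After / Point in Time queries must include a "sort".
    if pagination_type in ("SEARCH_AFTER", "POINT_IN_TIME") and "sort" not in body:
        problems.append('a top-level "sort" is required for this pagination type')

    return problems


good = '{"size": 100, "sort": {"product": "desc"}, "query": {"match_all": {}}}'
print(check_paginated_query(good, "SEARCH_AFTER"))  # []
print(check_paginated_query('{"query": {"match_all": {}}}', "SEARCH_AFTER"))
```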
Query Pagination Across Processor Executions

This processor runs on a schedule in order to execute the same query repeatedly. Once a paginated query has been initiated within Elasticsearch, this processor will continue to retrieve results for that same query until no further results are available. After that point, a new paginated query will be initiated using the same Query JSON.

If the results are “Combined” from this processor, then the paginated query will run continually within a single invocation until no more results are available (then the processor will start a new paginated query upon its next invocation). If the results are “Split” or “Per Page”, then each invocation of this processor will retrieve the next page of results until either there are no more results or the paginated query expires within Elasticsearch.
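
The difference between the two modes can be pictured with a small in-memory simulation. This is only a sketch under stated assumptions: the `PaginatedQuery` class and `invoke` function are hypothetical, a list of pages stands in for Elasticsearch, and the choice to emit nothing on the invocation that discovers the end of the results is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass


@dataclass
class PaginatedQuery:
    """Stand-in for a paginated query held open in Elasticsearch."""
    pages: list[list[str]]
    next_page: int = 0  # plays the role of the scrollId / searchAfter / pitId in state

    def fetch(self):
        """Return the next page of hits, or None when no results remain."""
        if self.next_page >= len(self.pages):
            return None
        page = self.pages[self.next_page]
        self.next_page += 1
        return page


def invoke(query: PaginatedQuery, combined: bool) -> list[list[str]]:
    """Simulate one processor invocation; returns the flowfiles (lists of hits) emitted."""
    if combined:
        # "Combined": keep paging within this single invocation until exhausted.
        hits = []
        while (page := query.fetch()) is not None:
            hits.extend(page)
        query.next_page = 0  # the next invocation starts a new paginated query
        return [hits]
    # "Split" / "Per Page": one page per invocation.
    page = query.fetch()
    if page is None:
        query.next_page = 0  # assumption: restart on the following invocation
        return []
    return [page]


per_page = PaginatedQuery(pages=[["a", "b"], ["c"]])
print(invoke(per_page, combined=False))  # [['a', 'b']]
print(invoke(per_page, combined=False))  # [['c']]

combined_query = PaginatedQuery(pages=[["a", "b"], ["c"]])
print(invoke(combined_query, combined=True))  # [['a', 'b', 'c']]
```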

Resetting Queries / Clearing Processor State

Local State is used to track the progress of a paginated query within this processor. If there is a need to restart the query completely or to change the processor configuration after a paginated query has already been started, be sure to “Clear State” of the processor after it has been stopped and before it is restarted.

Duplicate Results

This processor does not attempt to de-duplicate results between queries. For example, if the same query runs twice and some or all of the results are identical, the output will contain those same results for both invocations. This might happen if the NiFi Primary Node changes while a page of data is being retrieved, or if the processor state is cleared and the processor is then restarted.

This processor will continually run the same query unless the processor properties are updated, so unless the data in Elasticsearch has changed, the same data will be retrieved multiple times.

Properties

Aggregation Results Format
Format of Aggregation output.
Display Name

Aggregation Results Format

Description

Format of Aggregation output.

API Name

Aggregation Results Format

Default Value

FULL

Allowable Values
- FULL
- BUCKETS_ONLY
- METADATA_ONLY
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Aggregation Results Split
Output a flowfile containing all aggregations or one flowfile for each individual aggregation.
Display Name

Aggregation Results Split

Description

Output a flowfile containing all aggregations or one flowfile for each individual aggregation.

API Name

Aggregation Results Split

Default Value

splitUp-no (PER_RESPONSE)

Allowable Values
- PER_HIT
- PER_RESPONSE
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Aggregations
One or more query aggregations (or "aggs"), in JSON syntax. Ex: {"items": {"terms": {"field": "product", "size": 10}}}
Display Name

Aggregations

Description

One or more query aggregations (or "aggs"), in JSON syntax. Ex: {"items": {"terms": {"field": "product", "size": 10}}}

API Name

Aggregations

Expression Language Scope

Environment variables and FlowFile Attributes

Sensitive

false

Required

false

Dependencies
- Query Definition Style is set to any of [build]
Client Service
An Elasticsearch client service to use for running queries.

Display Name

Client Service

Description

An Elasticsearch client service to use for running queries.

API Name

Client Service

Service Interface

org.apache.nifi.elasticsearch.ElasticSearchClientService

Service Implementations

org.apache.nifi.elasticsearch.ElasticSearchClientServiceImpl

Expression Language Scope

Not Supported

Sensitive

false

Required

true
Fields
Fields of indexed documents to be retrieved, in JSON syntax. Ex: ["user.id", "http.response.*", {"field": "@timestamp", "format": "epoch_millis"}]
Display Name

Fields

Description

Fields of indexed documents to be retrieved, in JSON syntax. Ex: ["user.id", "http.response.*", {"field": "@timestamp", "format": "epoch_millis"}]

API Name

Fields

Expression Language Scope

Environment variables and FlowFile Attributes

Sensitive

false

Required

false

Dependencies
- Query Definition Style is set to any of [build]
Index
The name of the index to use.

Display Name

Index

Description

The name of the index to use.

API Name

Index

Expression Language Scope

Environment variables and FlowFile Attributes

Sensitive

false

Required

true
Max JSON Field String Length
The maximum allowed length of a string value when parsing a JSON document or attribute.

Display Name

Max JSON Field String Length

Description

The maximum allowed length of a string value when parsing a JSON document or attribute.

API Name

Max JSON Field String Length

Default Value

20 MB

Expression Language Scope

Not Supported

Sensitive

false

Required

true
Output No Hits
Output a "hits" flowfile even if no hits are found for the query. If true, an empty "hits" flowfile will be output even if "aggregations" are output.
Display Name

Output No Hits

Description

Output a "hits" flowfile even if no hits are found for the query. If true, an empty "hits" flowfile will be output even if "aggregations" are output.

API Name

Output No Hits

Default Value

false

Allowable Values
- true
- false
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Pagination Keep Alive
Pagination "keep_alive" period. The period for which Elasticsearch will keep the scroll/PIT cursor alive between requests (this is not the time expected for all pages to be returned, but the maximum time allowed between page retrievals).
Display Name

Pagination Keep Alive

Description

Pagination "keep_alive" period. The period for which Elasticsearch will keep the scroll/PIT cursor alive between requests (this is not the time expected for all pages to be returned, but the maximum time allowed between page retrievals).

API Name

Pagination Keep Alive

Default Value

10 mins

Expression Language Scope

Not Supported

Sensitive

false

Required

true

Dependencies
- Pagination Type is set to any of [pagination-pit, pagination-scroll]
Pagination Type
Pagination method to use. Not all types are available for all Elasticsearch versions; check the Elasticsearch docs to confirm which are applicable and recommended for your service.
Display Name

Pagination Type

Description

Pagination method to use. Not all types are available for all Elasticsearch versions; check the Elasticsearch docs to confirm which are applicable and recommended for your service.

API Name

Pagination Type

Default Value

pagination-scroll (SCROLL)

Allowable Values
- SCROLL
- SEARCH_AFTER
- POINT_IN_TIME
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Query
A query in JSON syntax, not Lucene syntax. Ex: {"query":{"match":{"somefield":"somevalue"}}}. If the query is empty, a default JSON Object will be used, which will result in a "match_all" query in Elasticsearch.
Display Name

Query

Description

A query in JSON syntax, not Lucene syntax. Ex: {"query":{"match":{"somefield":"somevalue"}}}. If the query is empty, a default JSON Object will be used, which will result in a "match_all" query in Elasticsearch.

API Name

Query

Expression Language Scope

Environment variables and FlowFile Attributes

Sensitive

false

Required

false

Dependencies
- Query Definition Style is set to any of [full]
Query Attribute
If set, the executed query will be set on each result flowfile in the specified attribute.

Display Name

Query Attribute

Description

If set, the executed query will be set on each result flowfile in the specified attribute.

API Name

Query Attribute

Expression Language Scope

Environment variables and FlowFile Attributes

Sensitive

false

Required

false
Query Clause
A "query" clause in JSON syntax, not Lucene syntax. Ex: {"match":{"somefield":"somevalue"}}. If the query is empty, a default JSON Object will be used, which will result in a "match_all" query in Elasticsearch.
Display Name

Query Clause

Description

A "query" clause in JSON syntax, not Lucene syntax. Ex: {"match":{"somefield":"somevalue"}}. If the query is empty, a default JSON Object will be used, which will result in a "match_all" query in Elasticsearch.

API Name

Query Clause

Expression Language Scope

Environment variables and FlowFile Attributes

Sensitive

false

Required

false

Dependencies
- Query Definition Style is set to any of [build]
Query Definition Style
How the JSON Query will be defined for use by the processor.
Display Name

Query Definition Style

Description

How the JSON Query will be defined for use by the processor.

API Name

Query Definition Style

Default Value

full (FULL_QUERY)

Allowable Values
- FULL_QUERY
- BUILD_QUERY
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Restart On Finish
Whether the processor should start another search with the same query once a paginated search has completed.
Display Name

Restart On Finish

Description

Whether the processor should start another search with the same query once a paginated search has completed.

API Name

Restart On Finish

Default Value

true

Allowable Values
- true
- false
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Script Fields
Fields to create using script evaluation at query runtime, in JSON syntax. Ex: {"test1": {"script": {"lang": "painless", "source": "doc['price'].value * 2"}}, "test2": {"script": {"lang": "painless", "source": "doc['price'].value * params.factor", "params": {"factor": 2.0}}}}
Display Name

Script Fields

Description

Fields to create using script evaluation at query runtime, in JSON syntax. Ex: {"test1": {"script": {"lang": "painless", "source": "doc['price'].value * 2"}}, "test2": {"script": {"lang": "painless", "source": "doc['price'].value * params.factor", "params": {"factor": 2.0}}}}

API Name

Script Fields

Expression Language Scope

Environment variables and FlowFile Attributes

Sensitive

false

Required

false

Dependencies
- Query Definition Style is set to any of [build]
Search Results Format
Format of Hits output.
Display Name

Search Results Format

Description

Format of Hits output.

API Name

Search Results Format

Default Value

FULL

Allowable Values
- FULL
- SOURCE_ONLY
- METADATA_ONLY
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Search Results Split
Output one flowfile containing all hits in a response, one flowfile for each individual hit, or one flowfile containing all hits from all paged responses.
Display Name

Search Results Split

Description

Output one flowfile containing all hits in a response, one flowfile for each individual hit, or one flowfile containing all hits from all paged responses.

API Name

Search Results Split

Default Value

splitUp-no (PER_RESPONSE)

Allowable Values
- PER_HIT
- PER_RESPONSE
- PER_QUERY
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Size
The maximum number of documents to retrieve in the query. If the query is paginated, this "size" applies to each page of the query, not the "size" of the entire result set.
Display Name

Size

Description

The maximum number of documents to retrieve in the query. If the query is paginated, this "size" applies to each page of the query, not the "size" of the entire result set.

API Name

Size

Expression Language Scope

Environment variables and FlowFile Attributes

Sensitive

false

Required

false

Dependencies
- Query Definition Style is set to any of [build]
Sort
Sort results by one or more fields, in JSON syntax. Ex: [{"price" : {"order" : "asc", "mode" : "avg"}}, {"post_date" : {"format": "strict_date_optional_time_nanos"}}]
Display Name

Sort

Description

Sort results by one or more fields, in JSON syntax. Ex: [{"price" : {"order" : "asc", "mode" : "avg"}}, {"post_date" : {"format": "strict_date_optional_time_nanos"}}]

API Name

Sort

Expression Language Scope

Environment variables and FlowFile Attributes

Sensitive

false

Required

false

Dependencies
- Query Definition Style is set to any of [build]
Type
The type of this document (used by Elasticsearch for indexing and searching).

Display Name

Type

Description

The type of this document (used by Elasticsearch for indexing and searching).

API Name

Type

Expression Language Scope

Environment variables and FlowFile Attributes

Sensitive

false

Required

false
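
The properties that depend on the "build" Query Definition Style (Query Clause, Size, Sort, Aggregations, Fields, Script Fields) each hold one fragment of a search request. The Python sketch below shows how such fragments could be assembled into a single request body using the standard Elasticsearch body keys. The function and its parameter names are hypothetical; how the processor itself combines the properties is not described here, so treat this as an approximation.

```python
import json


def build_query(query_clause=None, size=None, sort=None,
                aggs=None, fields=None, script_fields=None) -> str:
    """Assemble a search request body from per-property JSON strings (hypothetical helper)."""
    parts = {
        "query": query_clause,
        "size": size,
        "sort": sort,
        "aggs": aggs,
        "fields": fields,
        "script_fields": script_fields,
    }
    body = {}
    for key, raw in parts.items():
        if raw:  # unset or blank properties are left out of the body
            body[key] = json.loads(raw)
    return json.dumps(body)


print(build_query(
    query_clause='{"match": {"somefield": "somevalue"}}',
    size="100",
    sort='[{"price": {"order": "asc"}}]',
))
print(build_query())  # {}
```

With nothing set, the result is an empty JSON object, which Elasticsearch treats as a "match_all" query, consistent with the Query Clause description.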

Dynamic Properties

The name of the HTTP request header
Prefix: HEADER: - adds the specified property name/value as an HTTP request header in the Elasticsearch request. If the Record Path expression results in a null or blank value, the HTTP request header will be omitted.

Name

The name of the HTTP request header

Description

Prefix: HEADER: - adds the specified property name/value as an HTTP request header in the Elasticsearch request. If the Record Path expression results in a null or blank value, the HTTP request header will be omitted.

Value

A Record Path expression to retrieve the HTTP request header value

Expression Language Scope

ENVIRONMENT
The name of a URL query parameter to add
Adds the specified property name/value as a query parameter in the Elasticsearch URL used for processing. These parameters will override any matching parameters in the query request body. For SCROLL type queries, these parameters are only used in the initial (first page) query as the Elasticsearch Scroll API does not support the same query parameters for subsequent pages of data.

Name

The name of a URL query parameter to add

Description

Adds the specified property name/value as a query parameter in the Elasticsearch URL used for processing. These parameters will override any matching parameters in the query request body. For SCROLL type queries, these parameters are only used in the initial (first page) query as the Elasticsearch Scroll API does not support the same query parameters for subsequent pages of data.

Value

The value of the URL query parameter

Expression Language Scope

ENVIRONMENT
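
To make the two kinds of dynamic property easier to tell apart, the following Python sketch sorts a set of property names and values into request headers and URL query parameters. It is a sketch under assumptions: the function name is hypothetical, and stripping the "HEADER:" prefix to obtain the header name is an assumed interpretation of the description above, not confirmed processor behaviour.

```python
HEADER_PREFIX = "HEADER:"


def split_dynamic_properties(properties: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Separate dynamic properties into (headers, query parameters) (hypothetical helper)."""
    headers = {}
    params = {}
    for name, value in properties.items():
        if name.startswith(HEADER_PREFIX):
            # Null or blank header values are omitted, as the description states.
            if value is not None and value.strip():
                headers[name[len(HEADER_PREFIX):].strip()] = value
        else:
            params[name] = value
    return headers, params


headers, params = split_dynamic_properties({
    "HEADER:X-Request-Source": "nifi",  # made-up header name
    "HEADER:X-Empty": "  ",             # blank, so it is dropped
    "routing": "customer-1",
})
print(headers)  # {'X-Request-Source': 'nifi'}
print(params)   # {'routing': 'customer-1'}
```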

State Management

Scopes	Description
LOCAL	The pagination state (scrollId, searchAfter, pitId, hitCount, pageCount, pageExpirationTimestamp) is retained in between invocations of this processor until the Scroll/PiT has expired (when the current time is later than the last query execution plus the Pagination Keep Alive interval).
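
The expiry rule in the table above (current time later than the last query execution plus the Pagination Keep Alive interval) amounts to a simple timestamp comparison. A minimal Python sketch follows; the function names are hypothetical, and the 10 minute keep-alive simply mirrors the property's default value.

```python
from datetime import datetime, timedelta, timezone


def page_expiration(last_execution: datetime, keep_alive: timedelta) -> datetime:
    """Timestamp after which the Scroll/PiT is considered expired."""
    return last_execution + keep_alive


def pagination_expired(now: datetime, expiration: datetime) -> bool:
    """True when the retained pagination state should no longer be reused."""
    return now > expiration


last_run = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
expires_at = page_expiration(last_run, timedelta(minutes=10))

print(pagination_expired(datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc), expires_at))   # False
print(pagination_expired(datetime(2024, 1, 1, 12, 11, tzinfo=timezone.utc), expires_at))  # True
```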

System Resource Considerations

Resource	Description
MEMORY	Care should be taken on the size of each page because each response from Elasticsearch will be loaded into memory all at once and converted into the resulting flowfiles.

Relationships

Name	Description
aggregations	Aggregations are routed to this relationship.
failure	All flowfiles that fail for reasons unrelated to server availability go to this relationship.
hits	Search hits are routed to this relationship.
retry	All flowfiles that fail due to server/cluster availability go to this relationship.

Writes Attributes

Name	Description
mime.type	application/json
aggregation.name	The name of the aggregation whose results are in the output flowfile
aggregation.number	The number of the aggregation whose results are in the output flowfile
page.number	The number of the page (request), starting from 1, in which the results were returned that are in the output flowfile
hit.count	The number of hits that are in the output flowfile
elasticsearch.query.error	The error message provided by Elasticsearch if there is an error querying the index.
