Optimize Joins
Data fusion joins are processed in the order in which they are specified in the UI. This affects the resulting data and the performance of the join. In addition, the type of join you select affects whether fusion processing time is optimized.
Joins are usually performed in-memory. However, when join processing can be pushed down to the data connectors to perform, fusion processing time is greatly reduced. Composer supports pushdown join processing in the following ways.
If a data connector supports pushdown joins and if the data to be joined comes from the same data source connection, Composer pushes the join operation to the underlying data connectors and allows them to join the data instead. Several examples are given later.
If the type of join is an inner join and aggregate functions SUM, MIN, MAX, or COUNT are used in the data, the Composer engine intelligently pushes the aggregate queries to the underlying data connectors, thus reducing the amount of data that needs to be processed. In these cases, the aggregation is performed first before the data is joined. This aggregate pushdown occurs when joining data from the same or from different data sources.
Because most joins are performed in-memory, a configurable limit has been placed on the number of records that can be processed from each joined source. This limit is initially set at 1,000,000 records per joined data source and can be configured by your Composer administrator or supervisor using the qe.zengine.edc.rows.limit
property in the query-engine.properties
file. See Manage the Composer Query Engine. When this threshold is exceeded, no data is shown on the visuals containing the fused data and a message appears indicating that the threshold (maximum row number) is exceeded. If you find you are hitting this limit, use filters on the visual or dashboard to reduce the number of records processed and shown.
Support for this feature by Composer connectors is shown in the following table.
Key:Y - Supported; N - Not Supported; N/A - not applicable
Examples
Example 1
In the following two fusion data sources, Fusion Data Source 1 will be pushed to the Impala connector to perform, whereas Fusion Data Source 2 will be performed in-memory because the two data sources use different Impala connections.
Fusion Data Source 1 join: Impala-Data-Source1-using-Impala-Connection-1 Impala-Data-Source2-using-Impala-Connection-1
Fusion Data Source 2 join: Impala-Data-Source1-using-Impala-Connection-1 Impala-Data-Source3-using-Impala-Connection-2
Example 2
The following multisource fusion example has more than one join defined. Assuming both joins are inner joins, join 1 will be performed by the Impala connector and join 2 will be performed in-memory.
Fusion Data Sourceinner join 1: Impala-Data-Source1-using-Impala-Connection-1 Impala-Data-Source2-using-Impala-Connection-1 inner join 2: Impala-Data-Source1-using-Impala-Connection-1 Elasticsearch-Data-Source1-using Elasticsearch-Connection-1
The following diagram depicts the relationship of the joins in the fused data source:
Example 3
If the join order from Example 2 is switched as shown below and if the first join is changed to a left join, neither join can be performed by data connectors. They are both performed in-memory.
Fusion Data Sourceleft join 1: Elasticsearch-Data-Source1-using-Elasticsearch-Connection-1 Impala-Data-Source1-using-Impala-Connection-1 inner join 2: Impala-Data-Source1-using-Impala-Connection-1 Impala-Data-Source2-using-Impala-Connection-1
The following diagram depicts the relationship of the joins in the fused data source:
Comments
0 comments
Please sign in to leave a comment.