Discussion:
[GitHub] vvysotskyi commented on a change in pull request #1552: DRILL-6865: Query returns wrong result when filter pruning happens
(too old to reply)
GitBox
2018-11-23 17:46:50 UTC
Permalink
vvysotskyi commented on a change in pull request #1552: DRILL-6865: Query returns wrong result when filter pruning happens
URL: https://github.com/apache/drill/pull/1552#discussion_r235994435



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetGroupScan.java
##########
@@ -262,41 +273,31 @@ public GroupScan applyFilter(LogicalExpression filterExpr, UdfUtilities udfUtili

Map<SchemaPath, ColumnStatistics> columnStatisticsMap = statCollector.collectColStat(schemaPathsInExpr);

- if (filterPredicate == null) {
- ErrorCollector errorCollector = new ErrorCollectorImpl();
- LogicalExpression materializedFilter = ExpressionTreeMaterializer.materializeFilterExpr(
- filterExpr, columnStatisticsMap, errorCollector, functionImplementationRegistry);
-
- if (errorCollector.hasErrors()) {
- logger.error("{} error(s) encountered when materialize filter expression : {}",
- errorCollector.getErrorCount(), errorCollector.toErrorString());
- return null;
- }
- logger.debug("materializedFilter : {}", ExpressionStringBuilder.toString(materializedFilter));
-
- Set<LogicalExpression> constantBoundaries = ConstantExpressionIdentifier.getConstantExpressionSet(materializedFilter);
- filterPredicate = ParquetFilterBuilder.buildParquetFilterPredicate(materializedFilter, constantBoundaries, udfUtilities);
-
- if (filterPredicate == null) {
- return null;
- }
- }
-
- ParquetFilterPredicate.RowsMatch match = ParquetRGFilterEvaluator.matches(filterPredicate, columnStatisticsMap, rowGroup.getRowCount(), parquetTableMetadata, rowGroup.getColumns(), schemaPathsInExpr);
+ ParquetFilterPredicate.RowsMatch match = ParquetRGFilterEvaluator.matches(filterPredicate,
+ columnStatisticsMap, rowGroup.getRowCount(), parquetTableMetadata, rowGroup.getColumns(), schemaPathsInExpr);
if (match == ParquetFilterPredicate.RowsMatch.NONE) {
continue; // No row comply to the filter => drop the row group
}
- rowGroup.setRowsMatch(match);
+ if (matchAllRowGroupsLocal) {

Review comment:
Thanks, done

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
***@infra.apache.org


With regards,
Apache Git Services
GitBox
2018-11-23 17:46:50 UTC
Permalink
vvysotskyi commented on a change in pull request #1552: DRILL-6865: Query returns wrong result when filter pruning happens
URL: https://github.com/apache/drill/pull/1552#discussion_r235995816



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetGroupScan.java
##########
@@ -310,13 +311,60 @@ public GroupScan applyFilter(LogicalExpression filterExpr, UdfUtilities udfUtili
AbstractParquetGroupScan cloneGroupScan = cloneWithFileSelection(qualifiedFilePath);
cloneGroupScan.rowGroupInfos = qualifiedRGs;
cloneGroupScan.parquetGroupScanStatistics.collect(cloneGroupScan.rowGroupInfos, cloneGroupScan.parquetTableMetadata);
+ cloneGroupScan.matchAllRowGroups = matchAllRowGroupsLocal;
return cloneGroupScan;

} catch (IOException e) {
logger.warn("Could not apply filter prune due to Exception : {}", e);
return null;
}
}
+
+ /**
+ * Returns parquet filter predicate built from specified {@code filterExpr}.
+ *
+ * @param filterExpr filter expression to build
+ * @param udfUtilities udf utilities
+ * @param functionImplementationRegistry context to find drill function holder
+ * @param optionManager option manager
+ * @param omitUnsupportedExprs whether expressions which cannot be converted
+ * may be omitted from the resulting expression
+ * @return parquet filter predicate
+ */
+ public ParquetFilterPredicate getParquetFilterPredicate(LogicalExpression filterExpr,
+ UdfUtilities udfUtilities, FunctionImplementationRegistry functionImplementationRegistry,
+ OptionManager optionManager, boolean omitUnsupportedExprs) {
+ // used first row group to receive fields list
+ RowGroupInfo rowGroup = rowGroupInfos.iterator().next();

Review comment:
Thanks, done.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
***@infra.apache.org


With regards,
Apache Git Services
GitBox
2018-11-23 17:46:50 UTC
Permalink
vvysotskyi commented on a change in pull request #1552: DRILL-6865: Query returns wrong result when filter pruning happens
URL: https://github.com/apache/drill/pull/1552#discussion_r235994025



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetGroupScan.java
##########
@@ -85,6 +85,8 @@

private List<EndpointAffinity> endpointAffinities;
private ParquetGroupScanStatistics parquetGroupScanStatistics;
+ // whether all row groups of this group scan fully matches the filter

Review comment:
Thanks, fixed

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
***@infra.apache.org


With regards,
Apache Git Services
GitBox
2018-11-23 17:46:50 UTC
Permalink
vvysotskyi commented on a change in pull request #1552: DRILL-6865: Query returns wrong result when filter pruning happens
URL: https://github.com/apache/drill/pull/1552#discussion_r235997768



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFilterBuilder.java
##########
@@ -71,18 +72,24 @@
*
* @return parquet filter predicate
*/
- public static ParquetFilterPredicate buildParquetFilterPredicate(LogicalExpression expr, final Set<LogicalExpression> constantBoundaries, UdfUtilities udfUtilities) {
- LogicalExpression logicalExpression = expr.accept(new ParquetFilterBuilder(udfUtilities), constantBoundaries);
+ public static ParquetFilterPredicate buildParquetFilterPredicate(LogicalExpression expr,
+ Set<LogicalExpression> constantBoundaries, UdfUtilities udfUtilities, boolean omitUnsupportedExprs) {
+ LogicalExpression logicalExpression =
+ expr.accept(new ParquetFilterBuilder(udfUtilities, omitUnsupportedExprs), constantBoundaries);
if (logicalExpression instanceof ParquetFilterPredicate) {
return (ParquetFilterPredicate) logicalExpression;
+ } else if (logicalExpression instanceof TypedFieldExpr) {
+ // Calcite simplifies `= true` expression to field name, wrap it with is true predicate
+ return (ParquetFilterPredicate) ParquetIsPredicate.createIsPredicate(FunctionGenerationHelper.IS_TRUE, logicalExpression);

Review comment:
This change was added because now we try to convert every expression, especially arguments of `AND` operator.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
***@infra.apache.org


With regards,
Apache Git Services
GitBox
2018-11-23 17:46:50 UTC
Permalink
vvysotskyi commented on a change in pull request #1552: DRILL-6865: Query returns wrong result when filter pruning happens
URL: https://github.com/apache/drill/pull/1552#discussion_r235997054



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetPushDownFilter.java
##########
@@ -172,14 +170,30 @@ protected void doOnMatch(RelOptRuleCall call, FilterPrel filter, ProjectPrel pro


Stopwatch timer = logger.isDebugEnabled() ? Stopwatch.createStarted() : null;
- final GroupScan newGroupScan = groupScan.applyFilter(conditionExp, optimizerContext,
+ AbstractParquetGroupScan newGroupScan = groupScan.applyFilter(conditionExp, optimizerContext,
optimizerContext.getFunctionRegistry(), optimizerContext.getPlannerSettings().getOptions());
if (timer != null) {
logger.debug("Took {} ms to apply filter on parquet row groups. ", timer.elapsed(TimeUnit.MILLISECONDS));
timer.stop();
}

- if (newGroupScan == null ) {
+ if (newGroupScan == null) {
+ if (groupScan.isMatchAllRowGroups()) {
+ RelNode child = project == null ? scan : project;
+ // if current row group fully matches filter,
+ // but row group pruning wasn't happened, removes filter.

Review comment:
thanks, changed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
***@infra.apache.org


With regards,
Apache Git Services
GitBox
2018-11-23 17:46:50 UTC
Permalink
vvysotskyi commented on a change in pull request #1552: DRILL-6865: Query returns wrong result when filter pruning happens
URL: https://github.com/apache/drill/pull/1552#discussion_r235995999



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFilterBuilder.java
##########
@@ -63,6 +63,7 @@
static final Logger logger = LoggerFactory.getLogger(ParquetFilterBuilder.class);

private final UdfUtilities udfUtilities;
+ private final boolean omitUnsupportedExprs;

Review comment:
Thanks, added.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
***@infra.apache.org


With regards,
Apache Git Services
GitBox
2018-11-23 17:46:50 UTC
Permalink
vvysotskyi commented on a change in pull request #1552: DRILL-6865: Query returns wrong result when filter pruning happens
URL: https://github.com/apache/drill/pull/1552#discussion_r235995661



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetGroupScan.java
##########
@@ -310,13 +311,60 @@ public GroupScan applyFilter(LogicalExpression filterExpr, UdfUtilities udfUtili
AbstractParquetGroupScan cloneGroupScan = cloneWithFileSelection(qualifiedFilePath);
cloneGroupScan.rowGroupInfos = qualifiedRGs;
cloneGroupScan.parquetGroupScanStatistics.collect(cloneGroupScan.rowGroupInfos, cloneGroupScan.parquetTableMetadata);
+ cloneGroupScan.matchAllRowGroups = matchAllRowGroupsLocal;
return cloneGroupScan;

} catch (IOException e) {
logger.warn("Could not apply filter prune due to Exception : {}", e);
return null;
}
}
+
+ /**
+ * Returns parquet filter predicate built from specified {@code filterExpr}.
+ *
+ * @param filterExpr filter expression to build
+ * @param udfUtilities udf utilities
+ * @param functionImplementationRegistry context to find drill function holder
+ * @param optionManager option manager
+ * @param omitUnsupportedExprs whether expressions which cannot be converted
+ * may be omitted from the resulting expression
+ * @return parquet filter predicate
+ */
+ public ParquetFilterPredicate getParquetFilterPredicate(LogicalExpression filterExpr,

Review comment:
`applyFilter()` method from the previous code returns `null` if the filter wasn't created from first row group.
I agree with you that schema change may break filter pushdown, but currently, we cannot predict that the filter built from one row group will be suitable for other ones.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
***@infra.apache.org


With regards,
Apache Git Services
GitBox
2018-11-23 17:46:50 UTC
Permalink
vvysotskyi commented on a change in pull request #1552: DRILL-6865: Query returns wrong result when filter pruning happens
URL: https://github.com/apache/drill/pull/1552#discussion_r235997432



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetPushDownFilter.java
##########
@@ -155,39 +170,53 @@ protected void doOnMatch(RelOptRuleCall call, FilterPrel filter, ProjectPrel pro


Stopwatch timer = logger.isDebugEnabled() ? Stopwatch.createStarted() : null;
- final GroupScan newGroupScan = groupScan.applyFilter(conditionExp,optimizerContext,
+ AbstractParquetGroupScan newGroupScan = groupScan.applyFilter(conditionExp, optimizerContext,
optimizerContext.getFunctionRegistry(), optimizerContext.getPlannerSettings().getOptions());
if (timer != null) {
logger.debug("Took {} ms to apply filter on parquet row groups. ", timer.elapsed(TimeUnit.MILLISECONDS));
timer.stop();
}

- if (newGroupScan == null ) {
+ if (newGroupScan == null) {
+ if (groupScan.isMatchAllRowGroups()) {
+ RelNode child = project == null ? scan : project;
+ // if current row group fully matches filter,
+ // but row group pruning wasn't happened, removes filter.
+ if (nonConvertedPredList.size() == 0) {
+ call.transformTo(child);
+ } else if (nonConvertedPredList.size() < predList.size()) {

Review comment:
For the case when `nonConvertedPredList.size() == predList.size()`, none of the predicates participated in filter pushdown, so `call.transformTo()` shouldn't be called for this case.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
***@infra.apache.org


With regards,
Apache Git Services
GitBox
2018-11-23 17:46:50 UTC
Permalink
vvysotskyi commented on a change in pull request #1552: DRILL-6865: Query returns wrong result when filter pruning happens
URL: https://github.com/apache/drill/pull/1552#discussion_r235997009



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetPushDownFilter.java
##########
@@ -134,12 +132,29 @@ protected void doOnMatch(RelOptRuleCall call, FilterPrel filter, ProjectPrel pro

// get a conjunctions of the filter condition. For each conjunction, if it refers to ITEM or FLATTEN expression
// then we could not pushed down. Otherwise, it's qualified to be pushed down.
- final List<RexNode> predList = RelOptUtil.conjunctions(condition);
+ final List<RexNode> predList = RelOptUtil.conjunctions(RexUtil.toCnf(filter.getCluster().getRexBuilder(), condition));

Review comment:
We need to convert initial expression to conjunctive normal form, so it will be splitted into predicates more precisely and they will be divided into predicates which are supported by parquet filter pushdown and predicates which aren't.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
***@infra.apache.org


With regards,
Apache Git Services
GitBox
2018-11-25 11:50:27 UTC
Permalink
vvysotskyi commented on a change in pull request #1552: DRILL-6865: Query returns wrong result when filter pruning happens
URL: https://github.com/apache/drill/pull/1552#discussion_r236069655



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetPushDownFilter.java
##########
@@ -155,39 +170,53 @@ protected void doOnMatch(RelOptRuleCall call, FilterPrel filter, ProjectPrel pro


Stopwatch timer = logger.isDebugEnabled() ? Stopwatch.createStarted() : null;
- final GroupScan newGroupScan = groupScan.applyFilter(conditionExp,optimizerContext,
+ AbstractParquetGroupScan newGroupScan = groupScan.applyFilter(conditionExp, optimizerContext,
optimizerContext.getFunctionRegistry(), optimizerContext.getPlannerSettings().getOptions());
if (timer != null) {
logger.debug("Took {} ms to apply filter on parquet row groups. ", timer.elapsed(TimeUnit.MILLISECONDS));
timer.stop();
}

- if (newGroupScan == null ) {
+ if (newGroupScan == null) {
+ if (groupScan.isMatchAllRowGroups()) {
+ RelNode child = project == null ? scan : project;
+ // if current row group fully matches filter,
+ // but row group pruning wasn't happened, removes filter.
+ if (nonConvertedPredList.size() == 0) {
+ call.transformTo(child);
+ } else if (nonConvertedPredList.size() < predList.size()) {

Review comment:
In this case, `else` will include both `nonConvertedPredList.size() < predList.size()` and `nonConvertedPredList.size() == predList.size()` cases, but as I pointed in the comment above, we shouldn't do anything for the last case.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
***@infra.apache.org


With regards,
Apache Git Services
GitBox
2018-11-25 11:50:27 UTC
Permalink
vvysotskyi commented on a change in pull request #1552: DRILL-6865: Query returns wrong result when filter pruning happens
URL: https://github.com/apache/drill/pull/1552#discussion_r236069944



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFilterBuilder.java
##########
@@ -63,6 +63,7 @@
static final Logger logger = LoggerFactory.getLogger(ParquetFilterBuilder.class);

private final UdfUtilities udfUtilities;
+ private final boolean omitUnsupportedExprs;

Review comment:
Added to its Javadoc case when it should be used.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
***@infra.apache.org


With regards,
Apache Git Services

Loading...