Zaloni Ideas
Status On the Roadmap
Created by Sanjay Yadav
Created on Dec 23, 2021

To be able to do selective Data Quality and Data Profiling

Customer use case: customers may have tables with huge data volumes and a large number of partitions.

In such cases, it would help to be able to run selective DP or DQ (like using a WHERE clause in a Hive query).
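
For illustration, a minimal sketch of what such a selective profile could look like in Spark; the table name, partition predicate, and profiled columns below are hypothetical, not part of the product:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Restrict the scan the way a WHERE clause would in a Hive query,
    # so only one partition is profiled instead of the full table.
    subset = spark.table("sales.transactions").where("ds = '2021-12-01'")

    profile = subset.agg(
        F.count("*").alias("row_count"),
        F.countDistinct("customer_id").alias("distinct_customers"),
        F.sum(F.col("amount").isNull().cast("int")).alias("null_amounts"),
        F.min("amount").alias("min_amount"),
        F.max("amount").alias("max_amount"),
    )
    profile.show()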

Customer Impact: Major inconvenience
  • PRODUCT MANAGEMENT RESPONSE
    Feb 15, 2022

    We are working to handle large datasets in a better way as part of our rewrite of Data Profiling in Spark.

    We are also making some improvements to our incremental profiling capabilities and would be happy to review these changes with any interested customers.
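
    As a hedged illustration only (not the actual implementation), incremental profiling in Spark could skip partitions that were already profiled on a previous run; the table, partition names, and state store are hypothetical:

        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.enableHiveSupport().getOrCreate()

        # Partitions profiled on earlier runs, e.g. loaded from a state store.
        already_profiled = {"ds=2021-11-29", "ds=2021-11-30"}

        # Assumes a single partition key; multi-key partitions would need
        # extra parsing of the "key=value/key=value" strings.
        partitions = [row["partition"] for row in
                      spark.sql("SHOW PARTITIONS sales.transactions").collect()]

        for part in partitions:
            if part in already_profiled:
                continue  # only new partitions are scanned
            key, value = part.split("=", 1)
            stats = (spark.table("sales.transactions")
                     .where(F.col(key) == value)
                     .agg(F.count("*").alias("row_count"))
                     .first())
            print(part, stats["row_count"])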

  • Jeet Medhi
    Dec 23, 2021

    Current Challenge:

    We have a set of tables with more than 3,000,000,000 rows. While running data profiling on these tables, the YARN Resource Manager becomes unresponsive and we get the following error:

    ERROR org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator - Container complete event for unknown container container_e646_1635575024973_2452737_01_007282
    03:15:07.609 [RMCommunicator Allocator] ERROR org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator - Container complete event for unknown container container_e646_1635575024973_2452737_01_007614

    One way of overcoming this is to allocate more resources to the YARN Resource Manager by tweaking the properties below in Cloudera Manager:

    yarn.nodemanager.resource.memory-mb
    yarn.scheduler.maximum-allocation-mb

    However, changing properties in CM for every large table is not a feasible solution, particularly in production environments.

    Proposed Solution:

    To be able to run Data Profiling in chunks of data, or by using a WHERE clause, so that the whole table is not loaded in a single MR job, which creates the YARN resource constraint. A rough sketch of the chunk-wise idea follows below.

    The same use case can be applied to the Data Quality action as well.
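
    To make the chunking idea concrete, a sketch under assumed table, column, and chunk boundaries (none of these names come from the product). Additive statistics such as row and null counts merge cleanly across chunks; distinct counts would not:

        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.enableHiveSupport().getOrCreate()

        # Date-range chunks so no single job scans all ~3B rows at once.
        chunks = [("2021-01-01", "2021-04-01"),
                  ("2021-04-01", "2021-07-01"),
                  ("2021-07-01", "2021-10-01")]

        totals = {"row_count": 0, "null_amounts": 0}
        for lo, hi in chunks:
            stats = (spark.table("sales.transactions")
                     .where((F.col("ds") >= lo) & (F.col("ds") < hi))
                     .agg(F.count("*").alias("row_count"),
                          # coalesce guards against a null sum on an empty chunk
                          F.coalesce(F.sum(F.col("amount").isNull().cast("int")),
                                     F.lit(0)).alias("null_amounts"))
                     .first())
            totals["row_count"] += stats["row_count"]
            totals["null_amounts"] += stats["null_amounts"]

        print(totals)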

  • Jeet Medhi
    Dec 23, 2021

    The customer has Hive tables with more than 4 billion rows. In such cases, the YARN RM becomes unresponsive while running Data Profiling. One way to proceed is to increase the YARN Resource Manager memory from CM, but this is not feasible in production for a single table.

    So, we would need a way to partially profile data or to run the job in smaller chunks.

  • Jeet Medhi
    Dec 23, 2021

    One such issue with a large table is recorded in PS-34177.