Zaloni Ideas
Status: Future consideration
Created by Shyam Sundar Baishya
Created on Aug 3, 2020

Identify Hadoop resource consumption by each ZDP/Arena workflow step

Hi,

As we know, most of the Hive queries/MapReduce jobs that are triggered from a BI tool, ZDP, or Arena spawn a YARN job in the Hadoop ecosystem, and the details of that job can be retrieved using the YARN API. Once we have the YARN job details, we can extract valuable information such as vcores used, memory used, time taken, user, etc. With that information in hand, we can display it in a separate UI (like a Monitoring dashboard). The question is: how can we link each YARN application's information with the workflow instance step that was executed via ZDP/Arena?
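As a rough sketch, these details for a single application could be pulled from the ResourceManager REST API roughly as below (the ResourceManager host/port is an assumption; the field names follow the standard YARN Cluster Application API):

import requests

RM_URL = "http://resourcemanager-host:8088"   # assumed ResourceManager address

def fetch_app_metrics(app_id):
    # Cluster Application API: GET /ws/v1/cluster/apps/{appid}
    resp = requests.get(f"{RM_URL}/ws/v1/cluster/apps/{app_id}")
    resp.raise_for_status()
    app = resp.json()["app"]
    return {
        "name": app["name"],                    # DAG_NAME / JOB_NAME
        "user": app["user"],
        "queue": app["queue"],
        "finalStatus": app["finalStatus"],
        "startedTime": app["startedTime"],      # epoch milliseconds
        "finishedTime": app["finishedTime"],    # epoch milliseconds
        "memorySeconds": app["memorySeconds"],  # aggregate memory (MB-seconds)
        "vcoreSeconds": app["vcoreSeconds"],    # aggregate vcore-seconds
    }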

Linking a workflow instance step with a YARN application:

If we set the property below on each job that we submit to the cluster via ZDP, the DAG_NAME of each YARN application becomes user defined. For example, if we execute a Hadoop job with the property below, the DAG_NAME of the resulting YARN application will be populated with the value we provide.

For any MapReduce job submitted from ZDP we can set it as below:

-Dmapred.job.name=<WORKFLOW_ID>_<STEP_ID>_<BEDROCK_INSTANCE_ID>

For any Hive query that we submit from ZDP we can set it as below:

set hive.query.name=<WORKFLOW_ID>_<STEP_ID>_<BEDROCK_INSTANCE_ID>
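As a minimal sketch (the helper name and the sample IDs are hypothetical), the job name can be built once per step and reused for both the MapReduce property and the Hive setting:

def build_job_name(workflow_id, step_id, bedrock_instance_id):
    # <WORKFLOW_ID>_<STEP_ID>_<BEDROCK_INSTANCE_ID>
    return f"{workflow_id}_{step_id}_{bedrock_instance_id}"

job_name = build_job_name(1293, 182903, 3441)

# MapReduce job submitted from ZDP:
mapred_flag = f"-Dmapred.job.name={job_name}"

# Hive query submitted from ZDP (prepended to the query text):
hive_prefix = f"set hive.query.name={job_name};"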

Construct a row with the columns below and insert the details into a new MySQL table, for example:

APP_ID: application_1590079866110_1301137
JOB_NAME: 1293_182903_3441
QUEUE: etl_universal
FINAL_STATUS: SUCCEEDED
USER: bdsa_ingest
JOB_START_TIME: 2020-07-07 20:16:21
JOB_FINISHED_TIME: 2020-07-07 20:17:00
NO_OF_MINUTES_JOB_RUN: 0
NO_OF_SEC_JOB_RUN: 39
MEMORY_SECONDS: 1334691
VCORE_SECONDS: 211

where the DAG_NAME or JOB_NAME (1293_182903_3441) is basically <WORKFLOW_ID>_<STEP_ID>_<BEDROCK_INSTANCE_ID>.
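A minimal sketch of such a table (the column layout mirrors the row above; the table/schema names and the use of mysql-connector-python are assumptions):

import mysql.connector   # assumes the mysql-connector-python package

CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS yarn_job_metrics (
    APP_ID                VARCHAR(64) PRIMARY KEY,
    JOB_NAME              VARCHAR(255),
    QUEUE                 VARCHAR(128),
    FINAL_STATUS          VARCHAR(32),
    `USER`                VARCHAR(128),
    JOB_START_TIME        DATETIME,
    JOB_FINISHED_TIME     DATETIME,
    NO_OF_MINUTES_JOB_RUN INT,
    NO_OF_SEC_JOB_RUN     INT,
    MEMORY_SECONDS        BIGINT,
    VCORE_SECONDS         BIGINT
)
"""

conn = mysql.connector.connect(host="mysql-host", user="zdp",
                               password="***", database="zdp_metrics")  # assumed connection details
cur = conn.cursor()
cur.execute(CREATE_TABLE_SQL)
conn.commit()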

Now that we have this information in place and have set the YARN job name the way we want, we can have a module that calls the YARN API at the end of each workflow instance (a hidden step), extracts the WORKFLOW_ID, INSTANCE_ID, STEP_ID, and other parameters, and populates them into MySQL or Elasticsearch, from where a dashboard can be built.
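A minimal sketch of that hidden end-of-instance step, building on the two sketches above (the way candidate applications are found here, listing recent apps and matching on the job-name pattern, is an assumption):

from datetime import datetime
import requests

def collect_instance_metrics(workflow_id, bedrock_instance_id, cursor, rm_url=RM_URL):
    # List recent applications and keep those whose name matches
    # <WORKFLOW_ID>_<STEP_ID>_<BEDROCK_INSTANCE_ID> for this instance.
    resp = requests.get(f"{rm_url}/ws/v1/cluster/apps", params={"limit": 1000})
    resp.raise_for_status()
    apps = (resp.json().get("apps") or {}).get("app", [])

    rows = []
    for app in apps:
        parts = app["name"].split("_")
        if len(parts) != 3:
            continue
        wf_id, step_id, inst_id = parts
        if wf_id != str(workflow_id) or inst_id != str(bedrock_instance_id):
            continue
        elapsed_sec = (app["finishedTime"] - app["startedTime"]) // 1000
        rows.append((
            app["id"], app["name"], app["queue"], app["finalStatus"], app["user"],
            datetime.fromtimestamp(app["startedTime"] / 1000),
            datetime.fromtimestamp(app["finishedTime"] / 1000),
            elapsed_sec // 60, elapsed_sec,
            app["memorySeconds"], app["vcoreSeconds"],
        ))

    if rows:
        cursor.executemany(
            "INSERT INTO yarn_job_metrics VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
            rows,
        )

Called from the hidden step, e.g. collect_instance_metrics(1293, 3441, cur) followed by conn.commit(), this populates the same table that the MySQL- or Elasticsearch-backed dashboard would read from.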

Below are a few snapshots of what we have already tested on our side.


Customer Impact: Nice to have