The real-time data representation often requires quality content. In most cases, users flooded with multiple options or methodologies that may overwhelm the user to opt for the right techniques.
The quick notes from many geeks over the internet is a good start and at some instances, the conciseness of the article might meet our requirement; some other cases, it may direct with some pointers.
I believe that knowledge sharing is an effective method to grow with mutual terms. Over time, I realized how important to contribute and document the experience in a way to contribute to the communities.
This is an effort to recollect my 14 years of IT experience in the areas of data, data integration, data warehouse, and derive various data solutions through Hadoop eco-systems.
In this article, I will walk-through:
- Discuss the pre-requisites
- Migration from traditional data warehouse to Hadoop system
- Tools and techniques
Note:I touch-base some of tools or technology over period. In this series, I will walk-through Hadoop, HDFS, Sqoop, scala, python, Hive, Hbase, Spark2, mapreduce, kafka, flumes, Oozie, Unix, Shell, SSH, sql server, oracle, Teradata, informatica power center, Informatica Data Quality, Syncsort, Salseforce, Data bricks, Dremio, beeline, API, execution plan, spark physical joins, automation and best practices.
In this section, you will see how to build a data product:
- Integrate data that are available in existing Data warehouses, legacy applications and near real time sources.
- Implement Data standardization
- Discuss Business logic and reporting needs
- Prepare data for rendering reports in Hadoop platform
Note: We will keep infrastructure needs, planning and platform setup out of scope from this article.
In this section, let us discuss the steps:
- Setup a data lake and Ingest data
- Data processing using Apache Spark
- Automate using Apache Oozie
Let us discuss the aforementioned steps:
The enterprise data can be of structured, semi-structured, and unstructured data. In our case, we will consider structured and semi-structured data. To perform the data ingestion, we would need to build two types of the process;
- Sqoop import for handling structured data; sources could be relational or delimited files
- Spark streaming by reading a kafka topic
To start with, you can perform sqoop import using sqoop command from unix installed with a release of a hadoop installed. In order to support variety of sources like RDBMS, flat files and complex files you can build a framework with any of scripting or programming languages like shell scripting or advanced java based programs. With no bias decision will rely on the availability of resources and the amount of time you are ready to spend on building the reusable component and you are the right to one to decide what is right for you. In either case you will realize its benefits as short time versus long term respectively.
Here is the sample sqoop commands for relational source;
# sqoop import –connect “jdbc:oracle:thin:@<hostname:datasource>” –username BIGDATA_USER_T –password-file /user/tbdingd01/sqoop/BIGDATA_USER_T –query ‘select 1 from dual where $CONDITIONS’ -m1 –target-dir /tmp/kiran/test_ora.txt
Once the sqoop import runs successfully you can either create an externa table on hive with required table configurations or perform Hadoop file operations. By default sqoop imports file in text file format with comma as delimiter.
In the process of building a data product one would end-up applying many resource-intensive analytical operations on a medium to large data-set in an efficient way. Apache Spark is the bet in this scenario to perform faster job execution by caching data in memory and enabling parallelism in a distributed data environments.
Components involved in Spark implementation:
- Initialize spark session using scala program
- Ingest data from data lake through hive queries
- Apply business logic using scala constructs or hive queries
- Load data into HDFS or Hive targets
- Execute spark programs through spark submit
We will learn the basics of implementation using Scala. Please refer this git branch click here
Let us see the simple spark submit and meaning of each configuration items
spark-submit –class org.apache.sparksample.sample \
–master yarn-client \
–num-executors 1 \
–driver-memory 512m \
–executor-memory 512m \
–executor-cores 1 \
Please refer the link Running Sample Spark 2.x Applications for apache documentation.
Automate using oozie workflow
There are different ways to automate the data pipelines. In this section, we will see the required components to configure Oozie:
- Sample Oozie property file
- Sample Oozie workflow
- Nodes in Oozie workflow
- Starting Oozie Workflow
Note: As Oozie do not support spark2, we will try to Customize Oozie Workflow to support Spark2 and submit the workflow through SSH.
Please refer my git oozie sample branch for the xml and configuration files to build your oozie workflow. You can build the workflow using xml file or the oozie workflow manager through ambari url. Once you build the workflow, I prefer to validate once using workflow import from hdfs path. You should be able to see the workflow view as below as long as your workflow.xml file is valid.
The simple oozie workflow job submit will look as below:
oozie job -oozie $OozieURL -config $jobPropertiesName -run -DnameNode=$nameNode -DjobTracker=$jobTracker -DresourceManager=$resourceManager -DjdbcUrl=$JDBCUrl -Dhs2Principal=$hs2Principal -DqueueName=$QueueName -DhdpVersion=$HdpVersion -DzeroCountTblVar=$ZeroCountTblVar -DsshHost=$SSHHost -DmyFolder=$myScriptFolder`
oozie job -oozie http://hostname:11000/oozie -config /appl1/jobs/oozie_sample.properties -run -DnameNode=hdfs://hostname-DjobTracker=hostname:8050 -DresourceManager=hostname:8050 -DjdbcUrl=jdbc:hive2://hostname:2181,hostname2:2181,hostname3:2181/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;transportMode=http;httpPath=cliservice; -Dhs2Principal=hive/_HOST@hostname -DqueueName=default -DhdpVersion=22.214.171.124-2 -DsshHost=hostname -DmyFolder=oozie_sample
Oozie does support re-run of failed workflows and below is a sample:
oozie job -oozie http://hostname:11000/oozie -config /appl1/jobs/script/rerun/wf_oozie_sample.xml -rerun 0000123-19234432643631-oozie-oozi-W
0000123-19234432643631-oozie-oozi-W is the job id you can find it on the failed workflow on the oozie monitor info.
We have variety of ways to get things done, I have opted simplest way may be there are better ways to do build Hadoop data pipelines, enable logging and schedule the jobs. I have tried best to keep the scripts simple, clean and follow some basic standards for easier quality learning exchange.
If you find it interesting or wish to leave a comment, please feel free to provide your inputs to improve the quality of my content.