Prerequisites

Servers

The cluster consists of three CentOS 7 VMs, listed below.

Hostname                          IP Address        Type
ebdp-po-dkr10d.sys.comcast.net    147.191.72.175    Master
ebdp-po-dkr11d.sys.comcast.net    147.191.72.176    Slave
ebdp-po-dkr12d.sys.comcast.net    147.191.74.184    Slave


JDK 1.8

JDK 1.8u131 was installed. JAVA_HOME should be set as follows.
# echo $JAVA_HOME
/usr/java/jdk1.8.0_131

User for Hadoop

The hduser account was created on all the nodes to run the Hadoop daemons.
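
If the account does not exist yet, it can be created on each node with something like the following (a sketch; adjust the shell and groups to your environment).

# useradd -m -s /bin/bash hduser
# passwd hduser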

Passwordless SSH

The master node should be able to log in to the slaves via SSH without a password.
First, an SSH key was created with hduser on every node as follows.
# su - hduser
# ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa


Then, the public key of each node needs to be registered in the authorized_keys file of every node (including itself).
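
One way to distribute the keys is ssh-copy-id, run from each node against every node (a sketch; it assumes password authentication is still enabled at this point).

# ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@ebdp-po-dkr10d.sys.comcast.net
# ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@ebdp-po-dkr11d.sys.comcast.net
# ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@ebdp-po-dkr12d.sys.comcast.net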

Here is an example of ~/.ssh/authorized_keys.

ssh-dss AAAA...HD3no= hduser@ebdp-po-dkr10d.sys.comcast.net
ssh-dss AAAA...YBnYs= hduser@ebdp-po-dkr11d.sys.comcast.net
ssh-dss AAAA...mREIg== hduser@ebdp-po-dkr12d.sys.comcast.net


The permissions should be changed as follows.

# chmod go-w $HOME $HOME/.ssh
# chmod 600 $HOME/.ssh/authorized_keys
# chown hduser $HOME/.ssh/authorized_keys


Finally, to allow access to other nodes via shortcuts, ~/.ssh/config should contain the following entries.

(Note that the HostName under the "localhost" entry should be adjusted to each node's own hostname.)

Host dk10
    HostName ebdp-po-dkr10d.sys.comcast.net
    User hduser
Host dk11
    HostName ebdp-po-dkr11d.sys.comcast.net
    User hduser
Host dk12
    HostName ebdp-po-dkr12d.sys.comcast.net
    User hduser
Host localhost
    HostName ebdp-po-dkr10d.sys.comcast.net
    User hduser
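Once this is in place, passwordless login can be verified from the master with the shortcuts, for example:

# ssh dk11 hostname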


Installation

Download

Hadoop package tarballs can be downloaded from http://hadoop.apache.org.
I selected version 2.7.5; many mirrors are available on the download page.


Untar the tarball

My location is /app/bigdata/hadoop.
In anticipation of future releases, I created a symlink pointing to the version 2.7.5 directory.
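
The steps were roughly as follows (a sketch; the tarball name depends on the version downloaded).

# tar -xzf hadoop-2.7.5.tar.gz -C /app/bigdata
# ln -s /app/bigdata/hadoop-2.7.5 /app/bigdata/hadoop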

Environment Variables

Here is an example of the environment variables set in hduser's .bashrc.
export JAVA_HOME=/usr/java/jdk1.8.0_131
export HADOOP_HOME=/app/bigdata/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HDFS_NAMENODE_USER="hduser"
export HDFS_DATANODE_USER="hduser"
export HDFS_SECONDARYNAMENODE_USER="hduser"
export YARN_RESOURCEMANAGER_USER="hduser"
export YARN_NODEMANAGER_USER="hduser"
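After editing, the variables can be loaded into the current session without logging in again.

# source ~/.bashrc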


Configurations

The configuration files are located at $HADOOP_CONF_DIR (/app/bigdata/hadoop/etc/hadoop).

Masters and Slaves

Files named "masters" and "slaves" should be created at $HADOOP_CONF_DIR on every node.

They simply list the master and slave nodes, respectively.

# echo "ebdp-po-dkr10d.sys.comcast.net" >> $HADOOP_CONF_DIR/masters
# echo "ebdp-po-dkr11d.sys.comcast.net" >> $HADOOP_CONF_DIR/slaves
# echo "ebdp-po-dkr12d.sys.comcast.net" >> $HADOOP_CONF_DIR/slaves


core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ebdp-po-dkr10d.sys.comcast.net:54310</value>
    <description>The name of the default file system.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>


hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/app/bigdata/hadoop/namedir</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/app/bigdata/hadoop/datadir</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:50090</value>
  </property>
</configuration>


mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:54311</value>
    <description>Map Reduce jobtracker</description>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/app/bigdata/hadoop/mapred-localdir</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/app/bigdata/hadoop/mapred-systemdir</value>
  </property>
</configuration>


yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:8035</value>
  </property>
</configuration>


Format Namenode

Note that this is required on the master node only.
# hadoop namenode -format


Launch Hadoop Daemons

This, too, is required on the master node only.
# cd $HADOOP_HOME
# ./sbin/start-dfs.sh
....
# ./sbin/start-yarn.sh
....


Note that "$HADOOP_HOME/sbin/start-all.sh" is equivalent to the 2 commands above but it is deprecated.


Verification

Main Webpage

The web UIs can be used to check the cluster: the NameNode UI at http://ebdp-po-dkr10d.sys.comcast.net:50070 and the ResourceManager UI at http://ebdp-po-dkr10d.sys.comcast.net:8088 (the Hadoop 2.x defaults).


JPS

On the master:
# jps
22480 NameNode
23558 Jps
22874 ResourceManager
22700 SecondaryNameNode


On a slave:

# jps
12448 DataNode
12811 Jps
12590 NodeManager


Hadoop Command

# hadoop fs -df -h
Filesystem                   Size  Used  Available  Use%
hdfs://hadoop-master:54310  2.0 T   8 K      1.9 T    0%
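A more detailed per-datanode view is also available.

# hdfs dfsadmin -report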


Spark

Download


I downloaded the spark-2.2.1-bin-hadoop2.7 package from http://spark.apache.org.

Installation

Spark can run on YARN or in standalone mode.
Running on YARN assumes that Hadoop is already running on the cluster.
Standalone mode does not require Hadoop, but Spark must be installed on every node of the cluster.

Here is the layout of the cluster.

Hostname                          Spark     Hadoop
ebdp-po-dkr10d.sys.comcast.net    Master    Master
ebdp-po-dkr11d.sys.comcast.net    Worker    Slave
ebdp-po-dkr12d.sys.comcast.net    Worker    Slave


Note that the following procedure was done with the "hduser" account on each node.


Untar the tarball

The downloaded tarball was decompressed to /app/bigdata/spark-2.2.1-bin-hadoop2.7.
In addition, a symlink, /app/bigdata/spark, was created to point to that directory.
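
A sketch of the commands (assuming the standard tarball name):

# tar -xzf spark-2.2.1-bin-hadoop2.7.tgz -C /app/bigdata
# ln -s /app/bigdata/spark-2.2.1-bin-hadoop2.7 /app/bigdata/spark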

Env Variables

The following environment variables were added to /etc/profile.
export JAVA_HOME=/usr/java/jdk1.8.0_131
export HADOOP_HOME=/app/bigdata/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HDFS_NAMENODE_USER="hduser"
export HDFS_DATANODE_USER="hduser"
export HDFS_SECONDARYNAMENODE_USER="hduser"
export YARN_RESOURCEMANAGER_USER="hduser"
export YARN_NODEMANAGER_USER="hduser"
export SPARK_HOME=/app/bigdata/spark
export PATH=$PATH:$SPARK_HOME/bin

Configuration

$SPARK_HOME/conf/slaves should be created, listing the worker nodes.
ebdp-po-dkr11d.sys.comcast.net
ebdp-po-dkr12d.sys.comcast.net
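Since standalone mode needs Spark on every node, the installation can be copied to the workers, for example with rsync over the SSH shortcuts defined earlier (the /app/bigdata/spark symlink should then be recreated on each worker).

# rsync -a /app/bigdata/spark-2.2.1-bin-hadoop2.7 dk11:/app/bigdata/
# rsync -a /app/bigdata/spark-2.2.1-bin-hadoop2.7 dk12:/app/bigdata/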

Test

Running on YARN

Cluster Mode

spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    $SPARK_HOME/examples/jars/spark-examples*.jar \
    10
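In cluster mode, the driver runs inside YARN, so the computed value of Pi appears in the application logs rather than on the console. The logs can be retrieved with, for example:

# yarn logs -applicationId <application ID>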


Client Mode

spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode client \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    $SPARK_HOME/examples/jars/spark-examples*.jar \
    10

Standalone

First, you need to start the Spark cluster from the master node.

$SPARK_HOME/sbin/start-all.sh


Then, you can run an example as follows.

spark-submit \
     --master spark://ebdp-po-dkr10d.sys.comcast.net:7077 \
     --class org.apache.spark.examples.SparkPi \
     $SPARK_HOME/examples/jars/spark-examples*.jar \
     100

