Installation of Hadoop 2.7.5 on CentOS 7 Cluster
bailey..
2018. 3. 9. 08:49
Prerequisites
Servers
There are three CentOS 7 VMs, listed below.
Hostname                        | IP Address     | Type
ebdp-po-dkr10d.sys.comcast.net  | 147.191.72.175 | Master
ebdp-po-dkr11d.sys.comcast.net  | 147.191.72.176 | Slave
ebdp-po-dkr12d.sys.comcast.net  | 147.191.74.184 | Slave
JDK 1.8
JDK 1.8u131 was installed. JAVA_HOME should be set as follows.
# echo $JAVA_HOME
/usr/java/jdk1.8.0_131
User for Hadoop
The user hduser was created on all nodes to run the Hadoop daemons.
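If the account does not exist yet, it can be created on each node roughly as follows (a minimal sketch, run as root; group membership and password policy are up to your environment):
# useradd hduser
# passwd hduser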
Passwordless SSH
The master node should be able to log in to the slaves via SSH without a password.
First, an SSH key was created as hduser on every node, as follows.
# su - hduser
# ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa
Then, the public key of each node needs to be registered in the authorized_keys file of every node (including itself).
Here is an example of ~/.ssh/authorized_keys.
ssh-dss AAAA...HD3no= hduser@ebdp-po-dkr10d.sys.comcast.net
ssh-dss AAAA...YBnYs= hduser@ebdp-po-dkr11d.sys.comcast.net
ssh-dss AAAA...mREIg== hduser@ebdp-po-dkr12d.sys.comcast.net
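One way to collect the keys is to run ssh-copy-id as hduser from every node toward every node. This is just a sketch using the full hostnames from the table above; it prompts for the password once per target:
# ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@ebdp-po-dkr10d.sys.comcast.net
# ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@ebdp-po-dkr11d.sys.comcast.net
# ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@ebdp-po-dkr12d.sys.comcast.net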
The permissions should be changed as follows.
# chmod go-w $HOME $HOME/.ssh
# chmod 600 $HOME/.ssh/authorized_keys
# chown hduser $HOME/.ssh/authorized_keys
Finally, to access the other nodes with a shortcut, ~/.ssh/config should contain the following entries.
(Note that the "localhost" entry should point to each node's own hostname.)
Host dk10
    HostName ebdp-po-dkr10d.sys.comcast.net
    User hduser
Host dk11
    HostName ebdp-po-dkr11d.sys.comcast.net
    User hduser
Host dk12
    HostName ebdp-po-dkr12d.sys.comcast.net
    User hduser
Host localhost
    HostName ebdp-po-dkr10d.sys.comcast.net
    User hduser
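Once the keys and the config are in place, logging in from the master with the aliases should not prompt for a password, for example:
# ssh dk11
# ssh dk12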
Installation
Download
Hadoop package tarballs can be downloaded from http://hadoop.apache.org.
I selected version 2.7.5; the download page lists many mirrors.
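For example, the 2.7.5 tarball can be fetched with wget from the Apache archive (any mirror URL from the download page works just as well):
# cd /tmp
# wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz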
Untar the tarball into a directory.
My location is /app/bigdata/hadoop.
In anticipation of future releases, I created it as a symlink pointing to the version 2.7.5 directory.
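A sketch of that layout, assuming the tarball sits in /tmp and /app/bigdata is the parent directory (adjust the paths to your environment):
# mkdir -p /app/bigdata
# cd /app/bigdata
# tar -xzf /tmp/hadoop-2.7.5.tar.gz
# ln -s hadoop-2.7.5 hadoop
# chown -R hduser /app/bigdata/hadoop-2.7.5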
Environment Variables
Here is an example of the environment variables set in hduser's .bashrc.
export JAVA_HOME=/usr/java/jdk1.8.0_131
export HADOOP_HOME=/app/bigdata/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HDFS_NAMENODE_USER="hduser"
export HDFS_DATANODE_USER="hduser"
export HDFS_SECONDARYNAMENODE_USER="hduser"
export YARN_RESOURCEMANAGER_USER="hduser"
export YARN_NODEMANAGER_USER="hduser"
Configurations
The configuration files are located at $HADOOP_CONF_DIR (/app/bigdata/hadoop/etc/hadoop).
Masters and Slaves
Files named "masters" and "slaves" should be created on every node in $HADOOP_CONF_DIR.
They simply list the master and slave nodes.
# echo "ebdp-po-dkr10d.sys.comcast.net" >> $HADOOP_CONF_DIR/masters # echo "ebdp-po-dkr11d.sys.comcast.net" >> $HADOOP_CONF_DIR/slaves # echo "ebdp-po-dkr12d.sys.comcast.net" >> $HADOOP_CONF_DIR/slaves | cs |
core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ebdp-po-dkr10d.sys.comcast.net:54310</value>
    <description>The name of the default file system.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/app/bigdata/hadoop/namedir</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/app/bigdata/hadoop/datadir</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:50090</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:54311</value>
    <description>Map Reduce jobtracker</description>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/app/bigdata/hadoop/mapred-localdir</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/app/bigdata/hadoop/mapred-systemdir</value>
  </property>
</configuration>
yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:8035</value>
  </property>
</configuration>
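The same configuration files are needed on every node. Assuming the directory layout above and the SSH aliases from ~/.ssh/config, one simple way is to copy them from the master:
# scp $HADOOP_CONF_DIR/*.xml $HADOOP_CONF_DIR/masters $HADOOP_CONF_DIR/slaves dk11:$HADOOP_CONF_DIR/
# scp $HADOOP_CONF_DIR/*.xml $HADOOP_CONF_DIR/masters $HADOOP_CONF_DIR/slaves dk12:$HADOOP_CONF_DIR/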
Format Namenode
Note that this is required on the master node only.
# hadoop namenode -format
Launch Hadoop Daemons
This, too, is required on the master node only.
# cd $HADOOP_HOME
# ./sbin/start-dfs.sh
....
# ./sbin/start-yarn.sh
....
Note that "$HADOOP_HOME/sbin/start-all.sh" is equivalent to the two commands above, but it is deprecated.
Verification
Main Webpage
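With the default ports of Hadoop 2.7 (assuming they were not overridden in the configuration), the NameNode web UI is served at http://ebdp-po-dkr10d.sys.comcast.net:50070 and the ResourceManager web UI at http://ebdp-po-dkr10d.sys.comcast.net:8088.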
JPS
On the master:
# jps
22480 NameNode
23558 Jps
22874 ResourceManager
22700 SecondaryNameNode
On a slave:
# jps
12448 DataNode
12811 Jps
12590 NodeManager
Hadoop Command
# hadoop fs -df -h
Filesystem                   Size  Used  Available  Use%
hdfs://hadoop-master:54310  2.0 T   8 K      1.9 T    0%