Installation of Hadoop 2.7.5 on CentOS 7 Cluster
bailey..
2018. 3. 9. 08:49
Prerequisites
Servers
There are three CentOS 7 VMs, listed below.
Hostname                        | IP Address     | Type
ebdp-po-dkr10d.sys.comcast.net  | 147.191.72.175 | Master
ebdp-po-dkr11d.sys.comcast.net  | 147.191.72.176 | Slave
ebdp-po-dkr12d.sys.comcast.net  | 147.191.74.184 | Slave
JDK 1.8
JDK 1.8u131 was installed. JAVA_HOME should be set as follows.
# echo $JAVA_HOME
/usr/java/jdk1.8.0_131
User for Hadoop
The user hduser was created on all nodes to run the Hadoop daemons.
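If the account does not exist yet, it can be created on each node roughly as follows (a minimal sketch, run as root; group membership and password policy are up to your environment):
# useradd hduser
# passwd hduser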
Passwordless SSH
The master node should be able to log in to the slaves via SSH without a password.
First, an SSH key was created as hduser on every node, as follows.
# su - hduser
# ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa
Then, the public key of each node needs to be registered in the authorized_keys file of every node (including itself).
Here is an example of ~/.ssh/authorized_keys.
ssh-dss AAAA...HD3no= hduser@ebdp-po-dkr10d.sys.comcast.net
ssh-dss AAAA...YBnYs= hduser@ebdp-po-dkr11d.sys.comcast.net
ssh-dss AAAA...mREIg== hduser@ebdp-po-dkr12d.sys.comcast.net
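One way to collect the keys is to run ssh-copy-id as hduser from every node toward every node. This is just a sketch using the full hostnames from the table above; it prompts for the password once per target:
# ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@ebdp-po-dkr10d.sys.comcast.net
# ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@ebdp-po-dkr11d.sys.comcast.net
# ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@ebdp-po-dkr12d.sys.comcast.net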
The permissions should be changed as follows.
# chmod go-w $HOME $HOME/.ssh
# chmod 600 $HOME/.ssh/authorized_keys
# chown hduser $HOME/.ssh/authorized_keys
Finally, to access the other nodes with a shortcut, ~/.ssh/config should contain the following entries.
(Note that the "localhost" entry should point to each node's own hostname.)
Host dk10
    HostName ebdp-po-dkr10d.sys.comcast.net
    User hduser
Host dk11
    HostName ebdp-po-dkr11d.sys.comcast.net
    User hduser
Host dk12
    HostName ebdp-po-dkr12d.sys.comcast.net
    User hduser
Host localhost
    HostName ebdp-po-dkr10d.sys.comcast.net
    User hduser
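Once the keys and the config are in place, logging in from the master with the aliases should not prompt for a password, for example:
# ssh dk11
# ssh dk12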
Installation
Download
Hadoop package tarballs can be downloaded from http://hadoop.apache.org.
I selected version 2.7.5; the download page lists many mirrors.
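For example, the 2.7.5 tarball can be fetched with wget from the Apache archive (any mirror URL from the download page works just as well):
# cd /tmp
# wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz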
Untar the tarball into a directory.
My location is /app/bigdata/hadoop.
In anticipation of future releases, I created it as a symlink pointing to the version 2.7.5 directory.
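A sketch of that layout, assuming the tarball sits in /tmp and /app/bigdata is the parent directory (adjust the paths to your environment):
# mkdir -p /app/bigdata
# cd /app/bigdata
# tar -xzf /tmp/hadoop-2.7.5.tar.gz
# ln -s hadoop-2.7.5 hadoop
# chown -R hduser /app/bigdata/hadoop-2.7.5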
Environment Variables
Here is an example of the environment variables set in hduser's .bashrc.
export JAVA_HOME=/usr/java/jdk1.8.0_131
export HADOOP_HOME=/app/bigdata/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HDFS_NAMENODE_USER="hduser"
export HDFS_DATANODE_USER="hduser"
export HDFS_SECONDARYNAMENODE_USER="hduser"
export YARN_RESOURCEMANAGER_USER="hduser"
export YARN_NODEMANAGER_USER="hduser"
Configurations
The configuration files are located at $HADOOP_CONF_DIR (/app/bigdata/hadoop/etc/hadoop).
Masters and Slaves
Files named "masters" and "slaves" should be created on every node in $HADOOP_CONF_DIR.
They simply list the master and slave nodes.
# echo "ebdp-po-dkr10d.sys.comcast.net" >> $HADOOP_CONF_DIR/masters # echo "ebdp-po-dkr11d.sys.comcast.net" >> $HADOOP_CONF_DIR/slaves # echo "ebdp-po-dkr12d.sys.comcast.net" >> $HADOOP_CONF_DIR/slaves | cs |
core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ebdp-po-dkr10d.sys.comcast.net:54310</value>
    <description>The name of the default file system.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/app/bigdata/hadoop/namedir</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/app/bigdata/hadoop/datadir</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:50090</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:54311</value>
    <description>Map Reduce jobtracker</description>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/app/bigdata/hadoop/mapred-localdir</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/app/bigdata/hadoop/mapred-systemdir</value>
  </property>
</configuration>
yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:8035</value>
  </property>
</configuration>
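The same configuration files are needed on every node. Assuming the directory layout above and the SSH aliases from ~/.ssh/config, one simple way is to copy them from the master:
# scp $HADOOP_CONF_DIR/*.xml $HADOOP_CONF_DIR/masters $HADOOP_CONF_DIR/slaves dk11:$HADOOP_CONF_DIR/
# scp $HADOOP_CONF_DIR/*.xml $HADOOP_CONF_DIR/masters $HADOOP_CONF_DIR/slaves dk12:$HADOOP_CONF_DIR/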
Format Namenode
Note that this is required on the master node only.
# hadoop namenode -format
Launch Hadoop Daemons
This, too, is required on the master node only.
# cd $HADOOP_HOME
# ./sbin/start-dfs.sh
....
# ./sbin/start-yarn.sh
....
Note that "$HADOOP_HOME/sbin/start-all.sh" is equivalent to the two commands above, but it is deprecated.
Verification
Main Webpage
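With the default ports of Hadoop 2.7 (assuming they were not overridden in the configuration), the NameNode web UI is served at http://ebdp-po-dkr10d.sys.comcast.net:50070 and the ResourceManager web UI at http://ebdp-po-dkr10d.sys.comcast.net:8088.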
JPS
On the master:
# jps
22480 NameNode
23558 Jps
22874 ResourceManager
22700 SecondaryNameNode
On a slave:
# jps
12448 DataNode
12811 Jps
12590 NodeManager
Hadoop Command
# hadoop fs -df -h
Filesystem                   Size  Used  Available  Use%
hdfs://hadoop-master:54310  2.0 T   8 K      1.9 T    0%