Prerequisites

Servers

The cluster consists of three CentOS 7 VMs, listed below.

Hostname                          IP Address        Type
ebdp-po-dkr10d.sys.comcast.net    147.191.72.175    Master
ebdp-po-dkr11d.sys.comcast.net    147.191.72.176    Slave
ebdp-po-dkr12d.sys.comcast.net    147.191.74.184    Slave


JDK 1.8

JDK 1.8u131 was installed. JAVA_HOME should be set as follows.
# echo $JAVA_HOME
/usr/java/jdk1.8.0_131

User for Hadoop

The hduser account was created on all the nodes to run the Hadoop daemons.
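
If the account does not exist yet, it can be created on each node with something like the following (a sketch; adjust the shell and groups to your environment).

# useradd -m -s /bin/bash hduser
# passwd hduser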

Passwordless SSH

The master node should be able to log in to the slaves via SSH without a password.
First, an SSH key was created with hduser on every node as follows.
# su - hduser
# ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa


Then, the public key of each node needs to be registered in the authorized_keys file of every node (including itself).
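
One way to distribute the keys is ssh-copy-id, run from each node against every node (a sketch; it assumes password authentication is still enabled at this point).

# ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@ebdp-po-dkr10d.sys.comcast.net
# ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@ebdp-po-dkr11d.sys.comcast.net
# ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@ebdp-po-dkr12d.sys.comcast.net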

Here is an example of ~/.ssh/authorized_keys.

ssh-dss AAAA...HD3no= hduser@ebdp-po-dkr10d.sys.comcast.net
ssh-dss AAAA...YBnYs= hduser@ebdp-po-dkr11d.sys.comcast.net
ssh-dss AAAA...mREIg== hduser@ebdp-po-dkr12d.sys.comcast.net


The permissions should be changed as follows.

# chmod go-w $HOME $HOME/.ssh
# chmod 600 $HOME/.ssh/authorized_keys
# chown hduser $HOME/.ssh/authorized_keys


Finally, to allow access to other nodes via shortcuts, ~/.ssh/config should contain the following entries.

(Note that the HostName under the "localhost" entry should be adjusted to each node's own hostname.)

Host dk10
    HostName ebdp-po-dkr10d.sys.comcast.net
    User hduser
Host dk11
    HostName ebdp-po-dkr11d.sys.comcast.net
    User hduser
Host dk12
    HostName ebdp-po-dkr12d.sys.comcast.net
    User hduser
Host localhost
    HostName ebdp-po-dkr10d.sys.comcast.net
    User hduser
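Once this is in place, passwordless login can be verified from the master with the shortcuts, for example:

# ssh dk11 hostname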


Installation

Download

Hadoop package tarballs can be downloaded from http://hadoop.apache.org.
I selected version 2.7.5; many mirrors are available on the download page.


Untar the tarball

My location is /app/bigdata/hadoop.
In anticipation of future releases, I created a symlink pointing to the version 2.7.5 directory.
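
The steps were roughly as follows (a sketch; the tarball name depends on the version downloaded).

# tar -xzf hadoop-2.7.5.tar.gz -C /app/bigdata
# ln -s /app/bigdata/hadoop-2.7.5 /app/bigdata/hadoop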

Environment Variables

Here is an example of the environment variables set in hduser's .bashrc.
export JAVA_HOME=/usr/java/jdk1.8.0_131
export HADOOP_HOME=/app/bigdata/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HDFS_NAMENODE_USER="hduser"
export HDFS_DATANODE_USER="hduser"
export HDFS_SECONDARYNAMENODE_USER="hduser"
export YARN_RESOURCEMANAGER_USER="hduser"
export YARN_NODEMANAGER_USER="hduser"
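After editing, the variables can be loaded into the current session without logging in again.

# source ~/.bashrc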


Configurations

The configuration files are located at $HADOOP_CONF_DIR (/app/bigdata/hadoop/etc/hadoop).

Masters and Slaves

Files named "masters" and "slaves" should be created at $HADOOP_CONF_DIR on every node.

They simply list the master and slave nodes, respectively.

# echo "ebdp-po-dkr10d.sys.comcast.net" >> $HADOOP_CONF_DIR/masters
# echo "ebdp-po-dkr11d.sys.comcast.net" >> $HADOOP_CONF_DIR/slaves
# echo "ebdp-po-dkr12d.sys.comcast.net" >> $HADOOP_CONF_DIR/slaves


core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ebdp-po-dkr10d.sys.comcast.net:54310</value>
    <description>The name of the default file system.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>


hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/app/bigdata/hadoop/namedir</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/app/bigdata/hadoop/datadir</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:50090</value>
  </property>
</configuration>


mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:54311</value>
    <description>Map Reduce jobtracker</description>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/app/bigdata/hadoop/mapred-localdir</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/app/bigdata/hadoop/mapred-systemdir</value>
  </property>
</configuration>


yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:8035</value>
  </property>
</configuration>


Format Namenode

Note that this is required on the master node only.
# hadoop namenode -format


Launch Hadoop Daemons

This, too, is required on the master node only.
# cd $HADOOP_HOME
# ./sbin/start-dfs.sh
....
# ./sbin/start-yarn.sh
....


Note that "$HADOOP_HOME/sbin/start-all.sh" is equivalent to the 2 commands above but it is deprecated.


Verification

Main Webpage

The web UIs can be used to check the cluster: the NameNode UI at http://ebdp-po-dkr10d.sys.comcast.net:50070 and the ResourceManager UI at http://ebdp-po-dkr10d.sys.comcast.net:8088 (the Hadoop 2.x defaults).


JPS

On the master:
# jps
22480 NameNode
23558 Jps
22874 ResourceManager
22700 SecondaryNameNode


On a slave:

# jps
12448 DataNode
12811 Jps
12590 NodeManager


Hadoop Command

# hadoop fs -df -h
Filesystem                   Size  Used  Available  Use%
hdfs://hadoop-master:54310  2.0 T   8 K      1.9 T    0%
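A more detailed per-datanode view is also available.

# hdfs dfsadmin -report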


Spark

Download


I downloaded the spark-2.2.1-bin-hadoop2.7 package from http://spark.apache.org.

Installation

Spark can run on YARN or in standalone mode.
Running on YARN assumes that Hadoop is already running on the cluster.
Standalone mode does not require Hadoop, but Spark must be installed on every node of the cluster.

Here is the layout of the cluster.

Hostname                          Spark     Hadoop
ebdp-po-dkr10d.sys.comcast.net    Master    Master
ebdp-po-dkr11d.sys.comcast.net    Worker    Slave
ebdp-po-dkr12d.sys.comcast.net    Worker    Slave


Note that the following procedure was done with the "hduser" account on each node.


Untar the tarball

The downloaded tarball was decompressed to /app/bigdata/spark-2.2.1-bin-hadoop2.7.
In addition, a symlink, /app/bigdata/spark, was created to point to that directory.
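
A sketch of the commands (assuming the standard tarball name):

# tar -xzf spark-2.2.1-bin-hadoop2.7.tgz -C /app/bigdata
# ln -s /app/bigdata/spark-2.2.1-bin-hadoop2.7 /app/bigdata/spark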

Env Variables

The following environment variables were added to /etc/profile.
export JAVA_HOME=/usr/java/jdk1.8.0_131
export HADOOP_HOME=/app/bigdata/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HDFS_NAMENODE_USER="hduser"
export HDFS_DATANODE_USER="hduser"
export HDFS_SECONDARYNAMENODE_USER="hduser"
export YARN_RESOURCEMANAGER_USER="hduser"
export YARN_NODEMANAGER_USER="hduser"
export SPARK_HOME=/app/bigdata/spark
export PATH=$PATH:$SPARK_HOME/bin

Configuration

$SPARK_HOME/conf/slaves should be created, listing the worker nodes.
ebdp-po-dkr11d.sys.comcast.net
ebdp-po-dkr12d.sys.comcast.net
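Since standalone mode needs Spark on every node, the installation can be copied to the workers, for example with rsync over the SSH shortcuts defined earlier (the /app/bigdata/spark symlink should then be recreated on each worker).

# rsync -a /app/bigdata/spark-2.2.1-bin-hadoop2.7 dk11:/app/bigdata/
# rsync -a /app/bigdata/spark-2.2.1-bin-hadoop2.7 dk12:/app/bigdata/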

Test

Running on YARN

Cluster Mode

spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    $SPARK_HOME/examples/jars/spark-examples*.jar \
    10
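In cluster mode, the driver runs inside YARN, so the computed value of Pi appears in the application logs rather than on the console. The logs can be retrieved with, for example:

# yarn logs -applicationId <application ID>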


Client Mode

spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode client \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    $SPARK_HOME/examples/jars/spark-examples*.jar \
    10

Standalone

First, you need to start the Spark cluster from the master node.

$SPARK_HOME/sbin/start-all.sh


Then, you can run an example as follows.

spark-submit \
     --master spark://ebdp-po-dkr10d.sys.comcast.net:7077 \
     --class org.apache.spark.examples.SparkPi \
     $SPARK_HOME/examples/jars/spark-examples*.jar \
     100

