Installation

yum install -y squid


Configuration

Here is an example Squid configuration (/etc/squid/squid.conf) that allows all requests.
visible_hostname localhost
acl all src 0.0.0.0/0.0.0.0
http_access allow all
http_port 3128


You need to restart the squid service after changing the configuration.

systemctl restart squid.service
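A quick way to verify the proxy is up after the restart is to check that its port accepts connections. A sketch using bash's /dev/tcp (the host here is a placeholder; on the Squid host itself, localhost:3128 should be open):

```shell
# Probe the proxy port; prints whether a TCP connection succeeded.
proxy_host=localhost   # placeholder; use your proxy's hostname
proxy_port=3128
if timeout 2 bash -c "echo > /dev/tcp/${proxy_host}/${proxy_port}" 2>/dev/null; then
  echo "port ${proxy_port} open on ${proxy_host}"
else
  echo "port ${proxy_port} not reachable on ${proxy_host}"
fi
```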


Configuration on Client

The following environment variables should be set on the client.
Note that this assumes HTTP_PROXY_HOSTNAME is already set to the proxy server's hostname.
export http_proxy=http://$HTTP_PROXY_HOSTNAME:3128
export https_proxy=http://$HTTP_PROXY_HOSTNAME:3128
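Most tools also honor a no_proxy variable for hosts that should bypass the proxy. A sketch building on the exports above (the fallback hostname is a placeholder):

```shell
# Exclude local addresses from proxying; HTTP_PROXY_HOSTNAME is assumed
# to be set already, as in the exports above.
HTTP_PROXY_HOSTNAME=${HTTP_PROXY_HOSTNAME:-proxy.example.com}  # placeholder default
export http_proxy=http://$HTTP_PROXY_HOSTNAME:3128
export https_proxy=http://$HTTP_PROXY_HOSTNAME:3128
export no_proxy=localhost,127.0.0.1
```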


Test

If your host cannot reach the public network (e.g. google.com) without a proxy, curl gets stuck at the connection attempt, as follows.
# curl -v http://www.google.com
* About to connect() to www.google.com port 80 (#0)
*   Trying 172.217.3.228...


After configuring the proxy, the result should look like the following.

# curl -v http://www.google.com
* About to connect() to proxy ebdp-po-dkr10d.sys.comcast.net port 3128 (#0)
*   Trying 147.191.72.175...
* Connected to ebdp-po-dkr10d.sys.comcast.net (147.191.72.175) port 3128 (#0)
> GET http://www.google.com/ HTTP/1.1
> User-Agent: curl/7.29.0
> Host: www.google.com
> Accept: */*
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 200 OK
....


A heredoc terminated by EOF can be used to create a multi-line file in a shell script, as follows.

cat << EOF > /tmp/yourfilehere
These contents will be written to the file.
        This line is indented.
EOF


https://stackoverflow.com/questions/2953081/how-can-i-write-a-heredoc-to-a-file-in-bash-script
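A related point from that thread: quoting the delimiter ('EOF') disables variable expansion inside the heredoc, which matters when the file should contain literal $ signs. A small demonstration (file paths are placeholders):

```shell
name="world"

# Unquoted delimiter: $name is expanded before writing.
cat << EOF > /tmp/expanded.txt
hello $name
EOF

# Quoted delimiter: $name is written literally.
cat << 'EOF' > /tmp/literal.txt
hello $name
EOF
```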


However, heredocs cannot be used directly in a Dockerfile RUN instruction.

Instead, the following approach achieves a similar result.

RUN echo $'[user]\n\
    email = bumjoon_kim@comcast.com\n\
    name = Bumjoon Kim\n[push]\n\
    default = current\n' >> /root/.gitconfig
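An alternative to bash's $'...' quoting is printf, which interprets \n escapes itself and therefore behaves the same under any RUN shell. A sketch (the file path and contents are illustrative placeholders, not the values from the original):

```shell
# printf expands \n in its format string, so one line can emit a
# multi-line file. In a Dockerfile: RUN printf '...' >> /root/.gitconfig
printf '[user]\n    email = user@example.com\n    name = Example User\n[push]\n    default = current\n' >> /tmp/gitconfig-demo
cat /tmp/gitconfig-demo
```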

Prerequisites

Servers

There are three CentOS 7 VMs, as follows.

Hostname                        IP Address      Type
ebdp-po-dkr10d.sys.comcast.net  147.191.72.175  Master
ebdp-po-dkr11d.sys.comcast.net  147.191.72.176  Slave
ebdp-po-dkr12d.sys.comcast.net  147.191.74.184  Slave


JDK 1.8

JDK 1.8u131 was installed. JAVA_HOME should be set as follows.
# echo $JAVA_HOME
/usr/java/jdk1.8.0_131

User for Hadoop

The user hduser was created to run the Hadoop daemons across all the nodes.

Passwordless SSH

The master node must be able to log in to the slaves via SSH without a password.
First, an SSH key was created as hduser on every node, as follows.
# su - hduser
# ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa


Then, each node's public key must be registered in the authorized_keys file of every node (including itself).

Here is an example of ~/.ssh/authorized_keys.

ssh-dss AAAA...HD3no= hduser@ebdp-po-dkr10d.sys.comcast.net
ssh-dss AAAA...YBnYs= hduser@ebdp-po-dkr11d.sys.comcast.net
ssh-dss AAAA...mREIg== hduser@ebdp-po-dkr12d.sys.comcast.net


The permissions should be set as follows.

# chmod go-w $HOME $HOME/.ssh
# chmod 600 $HOME/.ssh/authorized_keys
# chown hduser $HOME/.ssh/authorized_keys
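One caveat: recent OpenSSH releases disable ssh-dss keys by default, so on newer systems an ed25519 (or RSA) key is a safer choice than the DSA key used above. A sketch under that assumption (the key path is a demo placeholder; on the cluster it would be ~/.ssh/id_ed25519):

```shell
# Generate an ed25519 keypair non-interactively if one does not exist yet.
key=/tmp/id_demo_ed25519    # demo path; use ~/.ssh/id_ed25519 on real hosts
[ -f "$key" ] || ssh-keygen -t ed25519 -N "" -f "$key" -q
ls -l "$key" "$key.pub"
```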


Finally, to access the other nodes with shortcuts, ~/.ssh/config should contain the following entries.

(Note that the HostName under "localhost" should be adjusted to each node's own hostname.)

Host dk10
    HostName ebdp-po-dkr10d.sys.comcast.net
    User hduser
Host dk11
    HostName ebdp-po-dkr11d.sys.comcast.net
    User hduser
Host dk12
    HostName ebdp-po-dkr12d.sys.comcast.net
    User hduser
Host localhost
    HostName ebdp-po-dkr10d.sys.comcast.net
    User hduser


Installation

Download

Hadoop package tarballs can be downloaded from http://hadoop.apache.org.
I selected version 2.7.5, which is available from many mirrors.


Untar the tarball into a directory

My location is /app/bigdata/hadoop.
In anticipation of future releases, I created a symlink pointing to the 2.7.5 directory.
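With this layout, an upgrade only requires re-pointing the symlink. A sketch under a /tmp prefix for illustration (on the real hosts the prefix is /app/bigdata and the versioned directory comes from the extracted tarball):

```shell
# Extracted tree lives in a versioned directory; the stable path is a symlink.
prefix=/tmp/demo-bigdata      # use /app/bigdata on the cluster
mkdir -p "$prefix/hadoop-2.7.5"
ln -sfn "$prefix/hadoop-2.7.5" "$prefix/hadoop"
ls -ld "$prefix/hadoop"
```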

Environment Variables

Here is an example of the environment variables set in hduser's .bashrc.
export JAVA_HOME=/usr/java/jdk1.8.0_131
export HADOOP_HOME=/app/bigdata/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HDFS_NAMENODE_USER="hduser"
export HDFS_DATANODE_USER="hduser"
export HDFS_SECONDARYNAMENODE_USER="hduser"
export YARN_RESOURCEMANAGER_USER="hduser"
export YARN_NODEMANAGER_USER="hduser"
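After sourcing .bashrc, it is worth confirming that each variable points at a real directory before going further. A minimal sketch (the check_dir helper is introduced here for illustration):

```shell
# Sanity-check that a variable is set and names an existing directory.
check_dir() {
  if [ -d "$2" ]; then echo "$1 OK: $2"; else echo "$1 missing: $2"; fi
}
check_dir JAVA_HOME "$JAVA_HOME"
check_dir HADOOP_HOME "$HADOOP_HOME"
check_dir HADOOP_CONF_DIR "$HADOOP_CONF_DIR"
```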


Configurations

The configuration files are located at $HADOOP_CONF_DIR (/app/bigdata/hadoop/etc/hadoop).

Masters and Slaves

Files named "masters" and "slaves" should be created on every node in $HADOOP_CONF_DIR.

They simply list the master and slave nodes.

# echo "ebdp-po-dkr10d.sys.comcast.net" >> $HADOOP_CONF_DIR/masters
# echo "ebdp-po-dkr11d.sys.comcast.net" >> $HADOOP_CONF_DIR/slaves
# echo "ebdp-po-dkr12d.sys.comcast.net" >> $HADOOP_CONF_DIR/slaves


core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ebdp-po-dkr10d.sys.comcast.net:54310</value>
    <description>The name of the default file system.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>


hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication</description>
  </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/app/bigdata/hadoop/namedir</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/app/bigdata/hadoop/datadir</value>
        <final>true</final>
    </property>
 
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>ebdp-po-dkr10d.sys.comcast.net:50090</value>
    </property>
</configuration>
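The directories referenced in core-site.xml and hdfs-site.xml (hadoop.tmp.dir, dfs.namenode.name.dir, dfs.datanode.data.dir) must exist and be writable by hduser before the namenode is formatted. A sketch using a /tmp prefix for illustration (on the real hosts, drop the prefix and run the chown as root):

```shell
# Create the storage directories from the configs above; the /tmp prefix
# is only for demonstration -- the real paths start at /app.
prefix=/tmp/hadoop-dirs-demo
mkdir -p "$prefix/app/hadoop/hadoop/tmp" \
         "$prefix/app/bigdata/hadoop/namedir" \
         "$prefix/app/bigdata/hadoop/datadir"
# On the cluster (as root): chown -R hduser: /app/hadoop /app/bigdata/hadoop
ls -d "$prefix"/app/bigdata/hadoop/*
```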


mapred-site.xml

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>ebdp-po-dkr10d.sys.comcast.net:54311</value>
        <description>Map Reduce jobtracker</description>
    </property>
  <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/app/bigdata/hadoop/mapred-localdir</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/app/bigdata/hadoop/mapred-systemdir</value>
  </property>
</configuration>


yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>ebdp-po-dkr10d.sys.comcast.net:8035</value>
  </property>
</configuration>


Format Namenode

Note that this is required on the master node only. (In Hadoop 2.x, "hdfs namenode -format" is the preferred command; "hadoop namenode -format" still works but prints a deprecation warning.)
# hadoop namenode -format


Launch Hadoop Daemons

This, too, is required on the master node only.
# cd $HADOOP_HOME
# ./sbin/start-dfs.sh
....
# ./sbin/start-yarn.sh
....


Note that "$HADOOP_HOME/sbin/start-all.sh" is equivalent to the two commands above, but it is deprecated.


Verification

Main Webpage

The NameNode web UI (default port 50070 in Hadoop 2.x) and the ResourceManager web UI (default port 8088) on the master should now be reachable in a browser.


JPS

On the master:
# jps
22480 NameNode
23558 Jps
22874 ResourceManager
22700 SecondaryNameNode


On a slave:

# jps
12448 DataNode
12811 Jps
12590 NodeManager


Hadoop Command

# hadoop fs -df -h
Filesystem                   Size  Used  Available  Use%
hdfs://hadoop-master:54310  2.0 T   8 K      1.9 T    0%

