How to Install and Run Apache Hadoop on AWS EC2 (Single-Node Cluster)

2 minute read

Introduction

In this blog, I’ll guide you through the steps to install and run Apache Hadoop on an AWS EC2 instance as a single-node cluster. If you haven’t set up an EC2 instance yet, check out my previous blog, Guide to Creating an EC2 Instance, before proceeding.

Prerequisites

  • An active AWS account
  • An EC2 instance (preferably Ubuntu)
  • Basic knowledge of Linux command line

Step 1: Update and Install Java

First, make sure your EC2 instance is up-to-date and install OpenJDK 11.

sudo apt update
sudo apt install openjdk-11-jdk -y
java -version

The java -version command at the end confirms that Java was installed correctly.
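
Later, we will point JAVA_HOME at the JDK’s install directory, so it’s worth confirming where the package actually landed. On Ubuntu amd64 the OpenJDK 11 package installs under /usr/lib/jvm/java-11-openjdk-amd64, but verify the exact name on your instance:

ls /usr/lib/jvm/
# Expect to see java-11-openjdk-amd64; this is the path used for JAVA_HOME in Step 3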

Step 2: Download and Extract Hadoop

Next, download Hadoop 3.4.0 and extract the tarball:

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
tar -xvf hadoop-3.4.0.tar.gz
mv hadoop-3.4.0 hadoop

This will create a directory called hadoop containing the Hadoop binaries.
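
Optionally, verify the download before extracting. Apache publishes a .sha512 checksum next to each release tarball (assuming the file is still hosted at the same mirror path; older releases move to archive.apache.org):

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz.sha512
sha512sum hadoop-3.4.0.tar.gz
cat hadoop-3.4.0.tar.gz.sha512
# The two hashes should match; if not, re-download the tarball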

Step 3: Set Up Environment Variables

Configure the Hadoop environment variables by editing the .bashrc file:

vi ~/.bashrc

Add the following lines:

export HADOOP_HOME=/home/ubuntu/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Save the file and apply the changes:

source ~/.bashrc
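
To confirm the variables took effect, check that the hadoop binary is now on your PATH:

echo $HADOOP_HOME
hadoop version
# Should print /home/ubuntu/hadoop followed by the Hadoop 3.4.0 version banner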

Step 4: Configure Hadoop Files

Navigate to the Hadoop configuration directory:

cd $HADOOP_HOME/etc/hadoop
ls

4.1 Edit core-site.xml

Configure the HDFS URL in the core-site.xml file:

vi core-site.xml

Add the following content:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ec2-44-223-67-165.compute-1.amazonaws.com:9000</value>
  </property>
</configuration>

Replace the public DNS above with your own instance’s public DNS, which you can find in the EC2 dashboard.
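
You can also fetch the public hostname from the instance metadata service instead of the dashboard. A sketch using IMDSv2 (the token-based metadata access that newer instances require):

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/public-hostname
# Prints something like ec2-xx-xx-xx-xx.compute-1.amazonaws.com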

4.2 Edit hdfs-site.xml

Set up the namenode and datanode directories:

vi hdfs-site.xml

Add the following:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/ubuntu/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/ubuntu/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
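
The namenode and datanode directories referenced above don’t exist yet. Hadoop can create them on first use, but creating them up front avoids permission surprises:

mkdir -p /home/ubuntu/hadoop/hadoopdata/hdfs/namenode
mkdir -p /home/ubuntu/hadoop/hadoopdata/hdfs/datanode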

4.3 Edit yarn-site.xml

Configure YARN in the yarn-site.xml file:

vi yarn-site.xml

Add the following configuration:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>ec2-44-223-67-165.compute-1.amazonaws.com</value>
  </property>
</configuration>

As before, replace the public DNS above with your own instance’s public DNS from the EC2 dashboard.

4.4 Edit mapred-site.xml

Configure MapReduce in the mapred-site.xml file:

vi mapred-site.xml

Add the following lines:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>ec2-44-223-67-165.compute-1.amazonaws.com:8021</value>
  </property>
</configuration>

Again, replace the public DNS above with your own instance’s public DNS from the EC2 dashboard.
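
Since the same hostname appears in three files, you can substitute it in one pass. A sketch, assuming MY_DNS holds your instance’s public DNS (for example, from the metadata query in section 4.1) and the files still contain the placeholder hostname used above:

MY_DNS=ec2-your-public-dns.compute-1.amazonaws.com   # replace with your own
sed -i "s/ec2-44-223-67-165.compute-1.amazonaws.com/$MY_DNS/g" \
  core-site.xml yarn-site.xml mapred-site.xml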

Step 5: Update /etc/hosts

Update the /etc/hosts file to map the local IP address to your EC2 hostname:

sudo vi /etc/hosts

Remove the line containing 127.0.0.1 localhost and add a line that maps your instance’s private IP to its public DNS (both shown in the EC2 dashboard), for example:

172.31.87.190   ec2-3-83-109-16.compute-1.amazonaws.com
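
You can also read the private IP straight from the instance. A minimal sketch, assuming the first address reported is the primary private IPv4 (usually the case on a single-interface instance):

hostname -I | awk '{print $1}'
# Prints the private IPv4 address, e.g. 172.31.x.x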

Step 6: Add the Java Path

Still inside $HADOOP_HOME/etc/hadoop, set the Java path in hadoop-env.sh:

vi hadoop-env.sh

Find the commented-out JAVA_HOME line, remove the leading #, and set it to:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Step 7: Set Up Passwordless SSH (as the non-root user)

Hadoop’s start scripts log in to the node over SSH, so the ubuntu user needs passwordless SSH to itself:

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod og-wx ~/.ssh/authorized_keys
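
Verify that passwordless SSH works before starting Hadoop, since a host-key prompt on first connection can stall the start scripts (accept-new accepts the key automatically on first use):

ssh -o StrictHostKeyChecking=accept-new localhost true && echo "SSH OK"
# Should print SSH OK without asking for a password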

Step 8: Format the NameNode and Start Hadoop

Format the NameNode first (do this only once, on initial setup; formatting wipes any existing HDFS metadata):

hdfs namenode -format

Then navigate to the Hadoop sbin directory:

cd $HADOOP_HOME/sbin

Start all Hadoop services (start-all.sh is deprecated but still works; it simply runs start-dfs.sh and start-yarn.sh):

./start-all.sh

Check that all processes are running:

jps

You should see NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager listed.
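
As a quick smoke test, write a file into HDFS and list it back (the paths here are just examples):

hdfs dfs -mkdir -p /user/ubuntu
hdfs dfs -put /etc/hosts /user/ubuntu/
hdfs dfs -ls /user/ubuntu
# A successful listing confirms the NameNode and DataNode are communicating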

Conclusion

That’s it! You now have Apache Hadoop running as a single-node cluster on your EC2 instance. You can access the NameNode UI at http://<ec2-public-ip>:9870 and the YARN ResourceManager UI at http://<ec2-public-ip>:8088 (make sure your security group allows inbound traffic on those ports).

Happy Hadooping!
