How to Install and Run Apache Hadoop on AWS EC2 (Multi-Node Cluster)


Introduction

To set up a multi-node Hadoop cluster on AWS EC2, follow the steps below. You will need four Ubuntu EC2 instances: a NameNode, a Secondary NameNode, and two DataNodes (DataNode1 and DataNode2). If you’re unfamiliar with setting up EC2 instances, check out my Guide to Creating an EC2 Instance before proceeding; when launching, set the instance count to 4.


Pre-requisites

  1. Create AWS EC2 Instances:
    • Launch 4 Ubuntu EC2 instances for NameNode, Secondary NameNode, and DataNodes (1 & 2). Assign names to your instances accordingly.

  2. SSH into the NameNode:
    • Copy the key pair you created for the EC2 instances from your local machine to the NameNode. Locate the .pem file locally and run the following (then tighten its permissions, as shown below):

      scp -i Laptopkey.pem Laptopkey.pem ubuntu@<namenode_dns>:~/
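
    • On the NameNode, restrict the key’s permissions so that ssh will accept it later; a small extra step, assuming the key landed in the ubuntu user’s home directory as above:

      chmod 400 ~/Laptopkey.pem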
      
  3. Update and Restart Instances:
    • After logging into all 4 EC2 instances, update and restart them:

      sudo apt-get update && sudo apt-get -y dist-upgrade
      sudo reboot
      
  4. Update /etc/hosts:
    • On each of the 4 instances, update the /etc/hosts file:

      sudo vi /etc/hosts
      
    • Below the 127.0.0.1 localhost line, add an entry for each of the four instances, mapping its private IP to its DNS name, so the nodes can resolve one another (a complete example follows this step). For example:

      172.31.45.2 ec2-54-172-91-40.compute-1.amazonaws.com
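
    • Taken together, the /etc/hosts file on every node ends up looking roughly like the sketch below; the IP addresses are placeholders for the private IPs of your own instances:

      127.0.0.1   localhost
      172.31.45.2 ec2-54-172-91-40.compute-1.amazonaws.com   # NameNode
      172.31.45.3 <secondary_namenode_dns>                   # Secondary NameNode
      172.31.45.4 <datanode1_dns>                            # DataNode1
      172.31.45.5 <datanode2_dns>                            # DataNode2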
      


Install Java and Hadoop

  1. Install Java:
    • Install Java on all instances:

      sudo apt-get -y install openjdk-8-jdk-headless
      

  2. Download Hadoop:
    • Download and extract Hadoop on all 4 instances:

      wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
      tar -xvzf hadoop-3.4.0.tar.gz
      mv hadoop-3.4.0 hadoop
      
  3. Update .bashrc:
    • Update the .bashrc file on all 4 instances:

      vi ~/.bashrc
      
    • Add the following lines:

      export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
      export HADOOP_HOME=/home/ubuntu/hadoop
      export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
      export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
      
  4. Source the .bashrc file:

      source ~/.bashrc
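
    • To confirm the environment is picked up correctly, a quick check on each instance (exact version strings will vary):

      java -version
      hadoop version
      echo $HADOOP_HOME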
    

Configure Password-less SSH

  1. Generate SSH Keys on NameNode (Master):

    ssh-keygen -t ed25519    # press Enter to accept the defaults (no passphrase)
    

    • Copy the public key to each of the other three nodes (repeat the command for the Secondary NameNode and both DataNodes; a loop that covers all three at once follows this step):

      scp -i Laptopkey.pem /home/ubuntu/.ssh/id_ed25519.pub ubuntu@ec2-54-172-91-40.compute-1.amazonaws.com:/home/ubuntu/.ssh/id_ed25519.pub
      

    • Then, on the NameNode itself and on each of the other nodes, append the public key to the authorized_keys file:

      cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
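
    • Instead of repeating the two commands above for every node, a short loop run from the NameNode handles the other three in one go (a sketch; substitute your own DNS names for the placeholders):

      for host in <secondary_namenode_dns> <datanode1_dns> <datanode2_dns>; do
        scp -i ~/Laptopkey.pem ~/.ssh/id_ed25519.pub ubuntu@$host:~/.ssh/
        ssh -i ~/Laptopkey.pem ubuntu@$host 'cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys'
      done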
      
  2. Set up SSH Config:

    • Modify ~/.ssh/config on the NameNode:

      vi ~/.ssh/config
      
    • Add the following:

      Host nnode
        HostName <namenode_dns>
        User ubuntu
        IdentityFile ~/.ssh/id_ed25519
      
      Host snnode
        HostName <secondary_namenode_dns>
        User ubuntu
        IdentityFile ~/.ssh/id_ed25519
      
      Host dnode1
        HostName <datanode1_dns>
        User ubuntu
        IdentityFile ~/.ssh/id_ed25519
      
      Host dnode2
        HostName <datanode2_dns>
        User ubuntu
        IdentityFile ~/.ssh/id_ed25519
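
    • With the config in place, password-less login can be verified from the NameNode; each command should print the remote hostname without prompting for a password:

      for node in nnode snnode dnode1 dnode2; do ssh -o StrictHostKeyChecking=accept-new $node hostname; done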
      

Set up Hadoop Cluster

  1. Create HDFS Directories on All Nodes:

    sudo mkdir -p /usr/local/hadoop/hdfs/data
    sudo chown -R ubuntu:ubuntu /usr/local/hadoop/hdfs/data
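
    • Because password-less SSH is already configured, the directories can be created on all four nodes from the NameNode in one pass (a sketch using the SSH aliases defined earlier):

      for node in nnode snnode dnode1 dnode2; do
        ssh $node 'sudo mkdir -p /usr/local/hadoop/hdfs/data && sudo chown -R ubuntu:ubuntu /usr/local/hadoop/hdfs/data'
      done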
    

  2. Configure Hadoop Files:
    • Modify the following Hadoop configuration files:

      • hadoop-env.sh:

        vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
        
        • Add or update the following line:

          export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
          

      • core-site.xml:

        vi $HADOOP_HOME/etc/hadoop/core-site.xml
        
        • Add the following configuration:

          <configuration>
            <property>
              <name>fs.defaultFS</name>
              <value>hdfs://<namenode_dns>:9000</value>
            </property>
          </configuration>
          
      • hdfs-site.xml:

        vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
        
        • Add the following configuration (replication is set to 2 to match the two DataNodes, and the Secondary NameNode address tells the start scripts which host should run that daemon):

          <configuration>
            <property>
              <name>dfs.replication</name>
              <value>2</value>
            </property>
            <property>
              <name>dfs.namenode.name.dir</name>
              <value>file:///usr/local/hadoop/hdfs/data</value>
            </property>
            <property>
              <name>dfs.namenode.secondary.http-address</name>
              <value><secondary_namenode_dns>:9868</value>
            </property>
          </configuration>
          
      • mapred-site.xml:

        vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
        
        • Add the following configuration:

          <configuration>
            <property>
              <name>mapreduce.framework.name</name>
              <value>yarn</value>
            </property>
          </configuration>
          
      • yarn-site.xml:

        vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
        
        • Add the following configuration:

          <configuration>
            <property>
              <name>yarn.resourcemanager.hostname</name>
              <value><namenode_dns></value>
            </property>
          </configuration>
          
    • Navigate to the Hadoop configuration directory:

      cd $HADOOP_HOME/etc/hadoop
      
  3. Copy Configuration Files to the Other Nodes:

    • Copy the configuration files from the NameNode to the Secondary NameNode and both DataNodes. Run the command below once per node, or use the loop that follows it:

      scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml ubuntu@<node_dns>:~/hadoop/etc/hadoop/
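
    • A loop that pushes all five files to the three other nodes at once, using the SSH aliases (a sketch):

      for node in snnode dnode1 dnode2; do
        scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml $node:~/hadoop/etc/hadoop/
      done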
      
  4. Set up the Workers File:

    • Hadoop 3.x no longer uses the old masters and slaves files; the start scripts read a single workers file on the node where they are run. On the NameNode, edit it:

      vi $HADOOP_HOME/etc/hadoop/workers
      
      • Replace the default localhost entry with the DNS names of both DataNodes, one per line:

        <datanode1_dns>
        <datanode2_dns>
        
    • Note: The Secondary NameNode is not listed here; start-dfs.sh picks its host up from the dfs.namenode.secondary.http-address property set in hdfs-site.xml above. A quick sanity check follows this step.
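
    • To double-check which hosts the start scripts will use, run the following on the NameNode:

      cat $HADOOP_HOME/etc/hadoop/workers
      hdfs getconf -namenodes
      hdfs getconf -secondaryNameNodes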


Start the Hadoop Cluster

  1. Format the NameNode (run this once, on the NameNode only):

    hdfs namenode -format
    
  2. Start HDFS and YARN (also from the NameNode):

    $HADOOP_HOME/sbin/start-dfs.sh
    $HADOOP_HOME/sbin/start-yarn.sh
    

  3. Verify Hadoop Cluster:

    • Use the following command on each node to verify the daemons:

      jps
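
    • To check every node from the NameNode in one pass: the NameNode should report NameNode and ResourceManager, the Secondary NameNode should report SecondaryNameNode, and each DataNode should report DataNode and NodeManager (plus Jps itself everywhere):

      for node in nnode snnode dnode1 dnode2; do echo "== $node =="; ssh $node jps; done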
      


Conclusion

That’s it! You now have Apache Hadoop running on a multi-node cluster on your EC2 instances. You can access the NameNode web UI at http://<ec2-public-ip>:9870 and the YARN ResourceManager UI at http://<ec2-public-ip>:8088 (the instance security group must allow inbound traffic on those ports).
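
As a final smoke test, the following commands on the NameNode confirm that HDFS is writable end to end (a sketch; the file name and path are arbitrary):

    hdfs dfsadmin -report                      # should list both DataNodes as live
    echo "hello hadoop" > test.txt
    hdfs dfs -mkdir -p /user/ubuntu
    hdfs dfs -put test.txt /user/ubuntu/
    hdfs dfs -cat /user/ubuntu/test.txt        # should print: hello hadoop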

Final Output:

