<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://abdulrahmanh.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://abdulrahmanh.com/" rel="alternate" type="text/html" /><updated>2025-06-16T06:18:22+02:00</updated><id>https://abdulrahmanh.com/feed.xml</id><title type="html">Abdul Rahman</title><subtitle>Personal website of Abdul Rahman. We talk about Cloud Data Science, Public Speaking and Community.</subtitle><author><name>Abdul Rahman</name><email>mailforrahman197@gmail.com</email></author><entry><title type="html">PySpark for Data Engineers: Build Scalable Data Pipelines</title><link href="https://abdulrahmanh.com/blog/pyspark-guide" rel="alternate" type="text/html" title="PySpark for Data Engineers: Build Scalable Data Pipelines" /><published>2025-02-06T00:00:00+01:00</published><updated>2025-02-06T00:00:00+01:00</updated><id>https://abdulrahmanh.com/blog/pyspark-guide</id><content type="html" xml:base="https://abdulrahmanh.com/blog/pyspark-guide"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>In the era of big data, processing large datasets efficiently is crucial. <strong>Apache Spark</strong>, an open-source distributed computing system, has become a go-to tool for handling massive data workloads. <strong>PySpark</strong> is its Python API, making it easier for data engineers and analysts to work with Spark using Python.</p>

<p>In this guide, we will explore PySpark’s architecture, key components, and practical commands to help you get started with big data processing.</p>

<hr />

<h2 id="-what-is-pyspark">🔥 What is PySpark?</h2>

<p><strong>PySpark</strong> is the Python interface for <strong>Apache Spark</strong>, allowing users to leverage Spark’s powerful distributed computing capabilities using Python. It is widely used for big data analytics, machine learning, and ETL (Extract, Transform, Load) processes.</p>

<h3 id="why-learn-pyspark">Why Learn PySpark?</h3>

<ul>
  <li>🚀 <strong>Handles large datasets efficiently</strong></li>
  <li>⏳ <strong>Faster than single-machine tools like Pandas once data outgrows a single machine</strong></li>
  <li>🔀 <strong>Works seamlessly with distributed computing clusters</strong></li>
  <li>☁️ <strong>Supports cloud platforms like Databricks and Google Cloud</strong></li>
</ul>

<hr />

<h2 id="-pyspark-architecture">🏗 PySpark Architecture</h2>

<p>PySpark follows a <strong>distributed computing model</strong>, breaking large tasks into smaller ones that run in parallel. Its architecture includes:</p>

<h3 id="1️⃣-driver-program">1️⃣ Driver Program</h3>

<ul>
  <li>The <strong>driver</strong> is the main control program that runs the <strong>SparkContext</strong>.</li>
</ul>

<h3 id="2️⃣-cluster-manager">2️⃣ Cluster Manager</h3>

<ul>
  <li>Allocates resources across a <strong>cluster</strong> (e.g., YARN, Mesos, Kubernetes, or Standalone mode).</li>
</ul>

<h3 id="3️⃣-executors">3️⃣ Executors</h3>

<ul>
  <li>Run tasks assigned by the driver and store data in memory or on disk.</li>
</ul>

<h3 id="4️⃣-rdds-resilient-distributed-datasets">4️⃣ RDDs (Resilient Distributed Datasets)</h3>

<ul>
  <li>The fundamental data structure in Spark, ensuring fault tolerance and parallel processing.</li>
</ul>
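<p>The driver/executor split above can be illustrated with a small pure-Python sketch (a conceptual analogy only, not Spark itself): the "driver" splits a dataset into partitions, workers process their slices in parallel, and the driver combines the partial results:</p>

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, n):
    """Driver-side step: break the dataset into n roughly equal slices."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def executor_task(partition):
    """Executor-side step: process one partition independently."""
    return sum(x * x for x in partition)

data = list(range(1, 101))
partitions = split_into_partitions(data, 4)

# The "cluster manager" hands partitions to workers running in parallel;
# the driver then combines (reduces) the partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(executor_task, partitions))

total = sum(partial_sums)
print(total)  # sum of squares 1..100 = 338350
```

<p>Spark applies the same divide-process-combine pattern, but across machines and with fault tolerance built in.</p>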

<p><img src="../assets/images/posts/2025-02-06-pyspark-guide/1.jpg" alt="" /></p>

<hr />

<h2 id="-setting-up-pyspark">🛠 Setting Up PySpark</h2>

<p>You can install PySpark using pip:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>pyspark
</code></pre></div></div>

<p>To start PySpark in an interactive mode:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pyspark
</code></pre></div></div>

<p>For Jupyter Notebook users:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>jupyter findspark
</code></pre></div></div>

<p>Then, in Python:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">findspark</span>
<span class="n">findspark</span><span class="p">.</span><span class="n">init</span><span class="p">()</span>
<span class="kn">import</span> <span class="nn">pyspark</span>
<span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>

<span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="p">.</span><span class="n">builder</span><span class="p">.</span><span class="n">appName</span><span class="p">(</span><span class="s">"MyApp"</span><span class="p">).</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">spark</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="../assets/images/posts/2025-02-06-pyspark-guide/2.png" alt="" /></p>

<hr />

<h2 id="-pyspark-components">📊 PySpark Components</h2>

<p>PySpark consists of several key components:</p>

<h3 id="1️⃣-sparkcontext">1️⃣ SparkContext</h3>

<ul>
  <li>The entry point to Spark’s core functionality.</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pyspark</span> <span class="kn">import</span> <span class="n">SparkContext</span>
<span class="c1"># Note: only one SparkContext can be active per JVM; stop it with sc.stop() when done.
</span><span class="n">sc</span> <span class="o">=</span> <span class="n">SparkContext</span><span class="p">(</span><span class="s">"local"</span><span class="p">,</span> <span class="s">"MyApp"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="2️⃣-sparksession">2️⃣ SparkSession</h3>

<ul>
  <li>The unified entry point for working with DataFrames and Datasets.</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
<span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="p">.</span><span class="n">builder</span><span class="p">.</span><span class="n">appName</span><span class="p">(</span><span class="s">"MyApp"</span><span class="p">).</span><span class="n">getOrCreate</span><span class="p">()</span>
</code></pre></div></div>

<h3 id="3️⃣-rdds-resilient-distributed-datasets">3️⃣ RDDs (Resilient Distributed Datasets)</h3>

<ul>
  <li>Immutable, distributed collections of data.</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rdd</span> <span class="o">=</span> <span class="n">sc</span><span class="p">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="n">rdd</span><span class="p">.</span><span class="n">collect</span><span class="p">())</span>  <span class="c1"># [1, 2, 3, 4, 5]
</span></code></pre></div></div>

<h3 id="4️⃣-dataframes">4️⃣ DataFrames</h3>

<ul>
  <li>Similar to Pandas DataFrames but optimized for distributed processing.</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="p">[(</span><span class="s">"Alice"</span><span class="p">,</span> <span class="mi">25</span><span class="p">),</span> <span class="p">(</span><span class="s">"Bob"</span><span class="p">,</span> <span class="mi">30</span><span class="p">)]</span>
<span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">"Name"</span><span class="p">,</span> <span class="s">"Age"</span><span class="p">]</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">columns</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="../assets/images/posts/2025-02-06-pyspark-guide/3.png" alt="" /></p>

<hr />

<h2 id="-essential-pyspark-commands">⚡ Essential PySpark Commands</h2>

<table>
  <thead>
    <tr>
      <th>Function</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">df.show()</code></td>
      <td>Displays the DataFrame</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">df.printSchema()</code></td>
      <td>Prints schema of DataFrame</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">df.select("col")</code></td>
      <td>Selects a specific column</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">df.filter(df["col"] &gt; value)</code></td>
      <td>Filters data</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">df.groupBy("col").count()</code></td>
      <td>Groups data and counts occurrences</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">df.orderBy("col")</code></td>
      <td>Sorts data</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">df.withColumn("new_col", df["col"] * 2)</code></td>
      <td>Adds a new column</td>
    </tr>
  </tbody>
</table>

<p>Example:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>  <span class="c1"># Displays the DataFrame
</span></code></pre></div></div>
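<p>To see what the table's operations compute, here is a plain-Python sketch of the same logic on an in-memory list of rows (a conceptual analogy only — in PySpark these run distributed across executors). The rows extend the hypothetical <code class="language-plaintext highlighter-rouge">Name</code>/<code class="language-plaintext highlighter-rouge">Age</code> DataFrame from earlier:</p>

```python
from collections import Counter

# In-memory stand-in for DataFrame rows (conceptual sketch, not Spark):
# each tuple is (Name, Age).
rows = [("Alice", 25), ("Bob", 30), ("Alice", 31)]

# df.filter(df["Age"] > 26)  -> keep rows where the predicate holds
adults = [r for r in rows if r[1] > 26]

# df.groupBy("Name").count() -> count occurrences per key
counts = Counter(name for name, _ in rows)

# df.withColumn("AgeDoubled", df["Age"] * 2) -> derive a new column per row
with_new_col = [(name, age, age * 2) for name, age in rows]

print(adults)   # [('Bob', 30), ('Alice', 31)]
print(counts)   # Counter({'Alice': 2, 'Bob': 1})
```

<p>The difference in Spark is that these operations are lazy and distributed: nothing executes until an action such as <code class="language-plaintext highlighter-rouge">show()</code> or <code class="language-plaintext highlighter-rouge">collect()</code> is called.</p>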

<hr />

<h2 id="-pyspark-cheat-sheet">📌 PySpark Cheat Sheet</h2>

<p><img src="../assets/images/posts/2025-02-06-pyspark-guide/5.jpg" alt="PySpark Cheat Sheet" /></p>

<hr />

<h2 id="-tips-to-learn-pyspark-faster">🎯 Tips to Learn PySpark Faster</h2>

<ul>
  <li>✅ Start with small datasets before moving to big data.</li>
  <li>✅ Practice using <strong>DataFrames</strong> and <strong>RDDs</strong> in Jupyter Notebook.</li>
  <li>✅ Explore official documentation and community resources.</li>
  <li>✅ Work on real-world datasets to solidify concepts.</li>
  <li>✅ Try cloud-based platforms like <strong>Databricks</strong> for hands-on experience.</li>
</ul>

<hr />

<h2 id="-conclusion">🎯 Conclusion</h2>

<p>PySpark is a powerful tool for processing big data at scale. With its support for distributed computing and cloud platforms like Databricks, it is a must-have for any data engineer. This guide covered:</p>

<ul>
  <li>PySpark architecture &amp; components</li>
  <li>Essential commands &amp; DataFrame operations</li>
</ul>

<p>Now you’re ready to dive deeper into big data processing with PySpark! 🚀</p>

<p><strong>Happy Pysparking!</strong></p>]]></content><author><name>Abdul Rahman</name><email>mailforrahman197@gmail.com</email></author><category term="PySpark" /><category term="Big Data" /><category term="Cloud" /><category term="Spark" /><category term="PySpark" /><category term="Data Engineering" /><summary type="html"><![CDATA[Learn everything about PySpark, from its architecture to essential commands, to master big data processing.]]></summary></entry><entry><title type="html">Mastering Apache Kafka: A Comprehensive Guide</title><link href="https://abdulrahmanh.com/blog/kafka-guide" rel="alternate" type="text/html" title="Mastering Apache Kafka: A Comprehensive Guide" /><published>2025-01-28T00:00:00+01:00</published><updated>2025-01-28T00:00:00+01:00</updated><id>https://abdulrahmanh.com/blog/kafka-guide</id><content type="html" xml:base="https://abdulrahmanh.com/blog/kafka-guide"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>Apache Kafka is a distributed event-streaming platform designed for high-throughput, fault-tolerant, real-time data processing. From building event-driven architectures to powering critical business operations, Kafka is a go-to tool for data engineers and developers alike.</p>

<p>In this blog, we will dive deep into Kafka’s architecture, set up a Kafka environment using Docker Compose, and explore commands to manage topics, producers, and consumers.</p>

<hr />

<h2 id="️-kafka-architecture-overview">⚙️ Kafka Architecture Overview</h2>

<p>Kafka’s architecture comprises the following components:</p>

<ol>
  <li><strong>Producers</strong>: Send messages to Kafka topics.</li>
  <li><strong>Topics</strong>: Logical channels where data is organized.</li>
  <li><strong>Partitions</strong>: Subdivisions within a topic that enable parallel processing.</li>
  <li><strong>Consumers</strong>: Read messages from topics.</li>
  <li><strong>Brokers</strong>: Kafka servers that handle message storage and distribution.</li>
  <li><strong>ZooKeeper</strong>: Manages cluster metadata and coordination (newer Kafka releases can replace it with KRaft mode).</li>
</ol>

<p><img src="../assets/images/posts/2025-01-28-kafka-guide/1.png" alt="" /></p>

<h3 id="key-features">Key Features:</h3>
<ul>
  <li>Distributed and Fault-Tolerant</li>
  <li>Horizontal Scalability</li>
  <li>Real-Time Stream Processing</li>
  <li>Exactly-Once Semantics</li>
</ul>
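<p>To make the broker-side model concrete, here is a tiny pure-Python sketch (an analogy, not Kafka itself) of a partition as an append-only log: producers append records and get back offsets, and each consumer reads independently from its own offset without deleting anything:</p>

```python
class PartitionLog:
    """Toy append-only log modelling a single Kafka partition (conceptual sketch)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Producer side: append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Consumer side: read everything from a given offset onward.
        Reading does not remove records, so multiple consumers can each
        track their own position independently."""
        return self._records[offset:]

log = PartitionLog()
for msg in ["order-1", "order-2", "order-3"]:
    log.append(msg)

print(log.read_from(0))  # a consumer reading from the beginning sees all three
print(log.read_from(2))  # a consumer that already processed offsets 0 and 1
```

<p>This separation of write position (the log's end) from read position (each consumer's offset) is what lets Kafka fan out one stream to many independent consumers.</p>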

<hr />

<h2 id="-setting-up-kafka-with-docker-compose">🛠 Setting Up Kafka with Docker Compose</h2>

<p>To get started with Kafka, we will set up a local Kafka environment using Docker Compose. Below is the <code class="language-plaintext highlighter-rouge">docker-compose.yml</code> configuration:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">version</span><span class="pi">:</span> <span class="s1">'</span><span class="s">3'</span>
<span class="na">services</span><span class="pi">:</span>
  <span class="na">zookeeper</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">zookeeper:3.6.3</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">2181:2181"</span>

  <span class="na">kafka</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">wurstmeister/kafka:2.13-2.7.0</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">9092:9092"</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="na">KAFKA_LISTENERS</span><span class="pi">:</span> <span class="s">PLAINTEXT://0.0.0.0:9092</span>
      <span class="na">KAFKA_ADVERTISED_LISTENERS</span><span class="pi">:</span> <span class="s">PLAINTEXT://localhost:9092</span>
      <span class="na">KAFKA_ZOOKEEPER_CONNECT</span><span class="pi">:</span> <span class="s">zookeeper:2181</span>
    <span class="na">depends_on</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">zookeeper</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">/var/run/docker.sock:/var/run/docker.sock</span>
</code></pre></div></div>
<p><img src="../assets/images/posts/2025-01-28-kafka-guide/2.jpg" alt="" /></p>

<h3 id="steps-to-run">Steps to Run:</h3>
<ol>
  <li>Save the above content into a file named <code class="language-plaintext highlighter-rouge">docker-compose.yml</code>.</li>
  <li>Open a terminal and navigate to the directory containing the file.</li>
  <li>
    <p>Run the command:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker-compose up <span class="nt">-d</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>Verify that the Kafka and Zookeeper containers are running using:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker ps
</code></pre></div>    </div>
  </li>
</ol>

<hr />

<h2 id="-hands-on-with-kafka-commands">🚀 Hands-On with Kafka Commands</h2>

<p>Once your Kafka setup is running, follow these steps to interact with Kafka.</p>

<h3 id="access-kafka-container">Access Kafka Container</h3>

<p>To access the Kafka container, execute:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker <span class="nb">exec</span> <span class="nt">-it</span> &lt;kafka_container_id&gt; /bin/bash
</code></pre></div></div>

<p><img src="../assets/images/posts/2025-01-28-kafka-guide/3.jpg" alt="" /></p>

<h3 id="locate-kafka-binaries">Locate Kafka Binaries</h3>

<p>Inside the container, navigate to the Kafka binaries:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> /opt/kafka/bin
</code></pre></div></div>

<h3 id="creating-a-topic">Creating a Topic</h3>

<p>Create a new topic named <code class="language-plaintext highlighter-rouge">test-topic</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./kafka-topics.sh <span class="nt">--create</span> <span class="nt">--topic</span> test-topic <span class="nt">--bootstrap-server</span> localhost:9092 <span class="nt">--replication-factor</span> 1 <span class="nt">--partitions</span> 1
</code></pre></div></div>

<h3 id="running-a-kafka-producer">Running a Kafka Producer</h3>

<p>To send messages to the <code class="language-plaintext highlighter-rouge">test-topic</code>, run the producer:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./kafka-console-producer.sh <span class="nt">--topic</span> test-topic <span class="nt">--bootstrap-server</span> localhost:9092
</code></pre></div></div>
<p>Type messages in the terminal and press Enter to send them.</p>

<p><img src="../assets/images/posts/2025-01-28-kafka-guide/4.jpg" alt="" /></p>

<h3 id="running-a-kafka-consumer">Running a Kafka Consumer</h3>

<p>To consume messages from <code class="language-plaintext highlighter-rouge">test-topic</code>:</p>

<ol>
  <li>Open a new terminal and access the Kafka container.</li>
  <li>
    <p>Navigate to the Kafka binaries:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> /opt/kafka/bin
</code></pre></div>    </div>
  </li>
  <li>
    <p>Run the consumer:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./kafka-console-consumer.sh <span class="nt">--topic</span> test-topic <span class="nt">--bootstrap-server</span> localhost:9092 <span class="nt">--from-beginning</span>
</code></pre></div>    </div>
  </li>
</ol>

<p>Now, whenever you send a message from the producer, it appears in the consumer terminal almost immediately.</p>

<p><img src="../assets/images/posts/2025-01-28-kafka-guide/5.jpg" alt="" /></p>

<h3 id="creating-a-topic-with-multiple-partitions">Creating a Topic with Multiple Partitions</h3>

<p>Create a topic named <code class="language-plaintext highlighter-rouge">test-topic-two</code> with three partitions:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./kafka-topics.sh <span class="nt">--create</span> <span class="nt">--topic</span> test-topic-two <span class="nt">--bootstrap-server</span> localhost:9092 <span class="nt">--replication-factor</span> 1 <span class="nt">--partitions</span> 3
</code></pre></div></div>

<p>If the topic already exists, you will receive an error message.</p>

<hr />

<h2 id="-deep-dive-into-topics-and-partitions">📊 Deep Dive into Topics and Partitions</h2>

<h3 id="what-are-partitions">What Are Partitions?</h3>
<p>Partitions divide a topic into smaller, manageable chunks, enabling parallelism and scalability. Within a consumer group, each partition is processed by at most one consumer, so more partitions allow more consumers to work in parallel.</p>

<h3 id="key-value-advantage">Key-Value Advantage</h3>
<p>Messages can be routed to specific partitions based on keys, ensuring data locality for related events.</p>
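<p>The routing rule itself is simple to sketch. Kafka's default partitioner hashes the message key (with murmur2) and takes the result modulo the partition count; the pure-Python sketch below substitutes a stand-in hash, which is enough to show the property that matters — the same key always lands in the same partition:</p>

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Simplified stand-in for Kafka's default partitioner. Real Kafka
    uses murmur2, but any stable hash demonstrates the idea."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for the same key land in the same partition, preserving
# per-key ordering; different keys spread across partitions.
p1 = partition_for(b"user-42", 3)
p2 = partition_for(b"user-42", 3)
print(p1 == p2)  # True — deterministic routing
```

<p>Note one consequence: changing the number of partitions changes where keys hash to, which is why partition counts are usually chosen up front.</p>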

<hr />

<h2 id="-troubleshooting-tips">🔍 Troubleshooting Tips</h2>

<h3 id="checking-existing-topics">Checking Existing Topics</h3>
<p>List all topics in the Kafka cluster:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./kafka-topics.sh <span class="nt">--list</span> <span class="nt">--bootstrap-server</span> localhost:9092
</code></pre></div></div>

<h3 id="viewing-topic-details">Viewing Topic Details</h3>
<p>Describe a topic to inspect its configuration:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./kafka-topics.sh <span class="nt">--describe</span> <span class="nt">--topic</span> test-topic <span class="nt">--bootstrap-server</span> localhost:9092
</code></pre></div></div>

<h3 id="debugging-producer-and-consumer">Debugging Producer and Consumer</h3>
<ul>
  <li>If the producer or consumer cannot connect, verify that the Kafka container is running and reachable on port 9092.</li>
  <li>
    <p>Check the logs of the Kafka container:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker logs &lt;kafka_container_id&gt;
</code></pre></div>    </div>
  </li>
</ul>

<hr />

<h2 id="-conclusion">🌐 Conclusion</h2>

<p>Congratulations! You have successfully set up Kafka, created topics, and interacted with producers and consumers. With this knowledge, you are equipped to dive deeper into advanced Kafka features such as stream processing with Kafka Streams, connecting external systems using Kafka Connect, and deploying Kafka clusters in production.</p>

<p><strong>Happy Streaming!</strong></p>]]></content><author><name>Abdul Rahman</name><email>mailforrahman197@gmail.com</email></author><summary type="html"><![CDATA[Learn everything about Apache Kafka, from architecture to practical commands, to master this powerful event-streaming platform.]]></summary></entry><entry><title type="html">How to Install Apache Airflow Using Docker and Write Your First DAG</title><link href="https://abdulrahmanh.com/blog/airflow-docker" rel="alternate" type="text/html" title="How to Install Apache Airflow Using Docker and Write Your First DAG" /><published>2025-01-01T00:00:00+01:00</published><updated>2025-01-01T00:00:00+01:00</updated><id>https://abdulrahmanh.com/blog/airflow-docker</id><content type="html" xml:base="https://abdulrahmanh.com/blog/airflow-docker"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. It enables data engineers and developers to manage workflows efficiently with a rich UI and dynamic pipeline creation using Python.</p>

<p>In this guide, you will learn how to install Airflow using Docker, explore its components and architecture, and create your first DAG (Directed Acyclic Graph) using BashOperator.</p>

<hr />

<h2 id="-overview-of-apache-airflow">⚡ Overview of Apache Airflow</h2>

<p>Apache Airflow provides a powerful and flexible way to programmatically orchestrate workflows. Here’s a quick breakdown of its architecture:</p>

<h3 id="components">Components:</h3>
<ul>
  <li><strong>Scheduler</strong>: Monitors tasks and DAGs, triggering task instances once dependencies are met.</li>
  <li><strong>Webserver</strong>: Provides a rich user interface to manage workflows.</li>
  <li><strong>Worker</strong>: Executes tasks assigned by the scheduler.</li>
  <li><strong>Triggerer</strong>: Runs an event loop for deferrable tasks.</li>
  <li><strong>Database</strong>: Stores metadata about workflows and tasks.</li>
  <li><strong>Executor</strong>: Defines how tasks are executed (e.g., CeleryExecutor for distributed execution).</li>
</ul>

<h3 id="architecture-overview">Architecture Overview:</h3>
<p>Airflow uses a <strong>centralized metadata database</strong> to track workflows, and its components interact through this database. Workflows are defined in Python scripts as <strong>DAGs</strong>, which describe dependencies and execution order.</p>
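<p>The scheduler's core idea — trigger a task once everything it depends on has finished — can be sketched in a few lines of plain Python (a conceptual analogy, not Airflow's implementation). The toy graph mirrors the DAG built later in this post, where one task fans out to two downstream tasks:</p>

```python
# Hypothetical task graph: task ids mapped to the tasks they depend on.
dag = {
    "first_task": [],
    "second_task": ["first_task"],
    "third_task": ["first_task"],
}

def run_order(dag):
    """Repeatedly pick tasks whose dependencies are all done (Kahn-style
    topological ordering) — a toy version of what a scheduler does."""
    done, order = set(), []
    while len(done) < len(dag):
        ready = sorted(t for t, deps in dag.items()
                       if t not in done and all(d in done for d in deps))
        if not ready:
            raise ValueError("cycle detected — not a valid DAG")
        for t in ready:
            order.append(t)
            done.add(t)
    return order

print(run_order(dag))  # ['first_task', 'second_task', 'third_task']
```

<p>Airflow adds retries, scheduling intervals, and distributed execution on top, but dependency resolution over an acyclic graph is the heart of it.</p>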

<hr />

<h2 id="️-installing-apache-airflow-using-docker">🛠️ Installing Apache Airflow Using Docker</h2>

<h3 id="prerequisites">Prerequisites</h3>
<p>Ensure the following are installed:</p>
<ul>
  <li>Docker (<a href="https://docs.docker.com/get-docker/">Installation Guide</a>)</li>
  <li>Docker Compose (<a href="https://docs.docker.com/compose/install/">Installation Guide</a>)</li>
</ul>

<h3 id="fetching-docker-composeyaml">Fetching <code class="language-plaintext highlighter-rouge">docker-compose.yaml</code></h3>
<p>To deploy Airflow using Docker Compose, download the <code class="language-plaintext highlighter-rouge">docker-compose.yaml</code> file:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-LfO</span> <span class="s1">'https://airflow.apache.org/docs/apache-airflow/2.10.4/docker-compose.yaml'</span>
</code></pre></div></div>

<p><strong>Important</strong>: Since July 2023, Compose V1 has no longer received updates. Upgrade to Docker Compose V2 to ensure compatibility.</p>

<p>This file defines several services:</p>
<ul>
  <li><strong>airflow-scheduler</strong>: Monitors and schedules workflows.</li>
  <li><strong>airflow-webserver</strong>: Accessible at <code class="language-plaintext highlighter-rouge">http://localhost:8080</code>.</li>
  <li><strong>airflow-worker</strong>: Executes tasks.</li>
  <li><strong>airflow-triggerer</strong>: Manages deferrable tasks.</li>
  <li><strong>airflow-init</strong>: Initializes the Airflow environment.</li>
  <li><strong>postgres</strong>: Airflow metadata database.</li>
  <li><strong>redis</strong>: Message broker between scheduler and worker.</li>
</ul>

<p>Optionally, enable Flower (a monitoring tool) by running:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker compose <span class="nt">--profile</span> flower up
</code></pre></div></div>

<p>Flower will be available at <code class="language-plaintext highlighter-rouge">http://localhost:5555</code>.</p>

<hr />

<h3 id="setting-up-the-environment">Setting Up the Environment</h3>

<h4 id="step-1-set-the-right-airflow-user">Step 1: Set the Right Airflow User</h4>
<p>On Linux:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> <span class="nt">-p</span> ./dags ./logs ./plugins ./config
<span class="nb">echo</span> <span class="nt">-e</span> <span class="s2">"AIRFLOW_UID=</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-u</span><span class="si">)</span><span class="s2">"</span> <span class="o">&gt;</span> .env
</code></pre></div></div>

<p>For other OS, create an <code class="language-plaintext highlighter-rouge">.env</code> file manually with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">AIRFLOW_UID</span><span class="o">=</span>50000
</code></pre></div></div>

<h4 id="step-2-initialize-the-database">Step 2: Initialize the Database</h4>
<p>Run the following to initialize the environment and database:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker-compose up airflow-init
</code></pre></div></div>

<p>You should see a message indicating successful initialization:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>airflow-init_1       | Upgrades done
airflow-init_1       | Admin user airflow created
airflow-init_1       | 2.10.4
start_airflow-init_1 exited with code 0
</code></pre></div></div>

<h4 id="step-3-start-airflow">Step 3: Start Airflow</h4>
<p>Start all services:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker-compose up
</code></pre></div></div>

<p>Access the Airflow UI at <code class="language-plaintext highlighter-rouge">http://localhost:8080</code> with the username <code class="language-plaintext highlighter-rouge">airflow</code> and password <code class="language-plaintext highlighter-rouge">airflow</code>.</p>

<p><img src="../assets/images/posts/2025-01-01-airflow-docker/1.jpg" alt="" /></p>

<hr />

<h3 id="cleaning-up">Cleaning Up</h3>
<p>To clean up the environment:</p>

<ol>
  <li>
    <p>Stop services and remove volumes:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker-compose down <span class="nt">--volumes</span> <span class="nt">--remove-orphans</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>Remove the directory containing <code class="language-plaintext highlighter-rouge">docker-compose.yaml</code>:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">rm</span> <span class="nt">-rf</span> <span class="s1">'&lt;DIRECTORY&gt;'</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>Re-download <code class="language-plaintext highlighter-rouge">docker-compose.yaml</code> and start again.</p>
  </li>
</ol>

<hr />

<h2 id="️-writing-your-first-dag">✍️ Writing Your First DAG</h2>

<h3 id="steps-to-create-and-run-your-first-dag">Steps to Create and Run Your First DAG</h3>

<p>Create a Python file (e.g., <code class="language-plaintext highlighter-rouge">first_dag.py</code>) in the <code class="language-plaintext highlighter-rouge">./dags</code> folder with the following content:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span>
<span class="kn">from</span> <span class="nn">airflow</span> <span class="kn">import</span> <span class="n">DAG</span>
<span class="kn">from</span> <span class="nn">airflow.operators.bash</span> <span class="kn">import</span> <span class="n">BashOperator</span>

<span class="n">default_args</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">'owner'</span><span class="p">:</span> <span class="s">'rahman'</span><span class="p">,</span>
    <span class="s">'retries'</span><span class="p">:</span> <span class="mi">5</span><span class="p">,</span>
    <span class="s">'retry_delay'</span><span class="p">:</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">minutes</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="p">}</span>

<span class="k">with</span> <span class="n">DAG</span><span class="p">(</span>
    <span class="n">dag_id</span><span class="o">=</span><span class="s">'our_first_dag_v6'</span><span class="p">,</span>
    <span class="n">default_args</span><span class="o">=</span><span class="n">default_args</span><span class="p">,</span>
    <span class="n">description</span><span class="o">=</span><span class="s">'This is our first dag that we write'</span><span class="p">,</span>
    <span class="n">start_date</span><span class="o">=</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2024</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">29</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
    <span class="n">schedule_interval</span><span class="o">=</span><span class="s">'@daily'</span>
<span class="p">)</span> <span class="k">as</span> <span class="n">dag</span><span class="p">:</span>
    <span class="n">task1</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span>
        <span class="n">task_id</span><span class="o">=</span><span class="s">'first_task'</span><span class="p">,</span>
        <span class="n">bash_command</span><span class="o">=</span><span class="s">"echo hello world, this is the first task!"</span>
    <span class="p">)</span>

    <span class="n">task2</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span>
        <span class="n">task_id</span><span class="o">=</span><span class="s">'second_task'</span><span class="p">,</span>
        <span class="n">bash_command</span><span class="o">=</span><span class="s">"echo hey, I am task2 and will be running after task1!"</span>
    <span class="p">)</span>

    <span class="n">task3</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span>
        <span class="n">task_id</span><span class="o">=</span><span class="s">'third_task'</span><span class="p">,</span>
        <span class="n">bash_command</span><span class="o">=</span><span class="s">"echo hey, I am task3 and will be running after task1 at the same time as task2!"</span>
    <span class="p">)</span>

    <span class="n">task1</span> <span class="o">&gt;&gt;</span> <span class="p">[</span><span class="n">task2</span><span class="p">,</span> <span class="n">task3</span><span class="p">]</span>
</code></pre></div></div>
<p><img src="../assets/images/posts/2025-01-01-airflow-docker/2.jpg" alt="" /></p>

<ol>
  <li>Save the file and refresh the Airflow UI (<code class="language-plaintext highlighter-rouge">http://localhost:8080</code>).</li>
</ol>

<p><img src="../assets/images/posts/2025-01-01-airflow-docker/3.jpg" alt="" /></p>

<ol>
  <li>Locate the newly added DAG, enable it, and click on the DAG to view its details.</li>
</ol>

<p><img src="../assets/images/posts/2025-01-01-airflow-docker/4.jpg" alt="" /></p>

<ol>
  <li>
    <p>Trigger the DAG run manually by clicking the play button.</p>
  </li>
  <li>
    <p>Monitor the job status in the UI. You can view logs for each task by clicking on it in the Graph View or Tree View.</p>
  </li>
</ol>

<p><img src="../assets/images/posts/2025-01-01-airflow-docker/5.jpg" alt="" /></p>

<ol>
  <li>Verify the job completion and review logs for insights.</li>
</ol>

<hr />

<p>With these steps, you have successfully installed Apache Airflow using Docker, created a DAG, and monitored its execution. Happy workflow orchestration!</p>]]></content><author><name>Abdul Rahman</name><email>mailforrahman197@gmail.com</email></author><summary type="html"><![CDATA[Learn how to set up Apache Airflow using Docker, access the UI, and create your first DAG with BashOperators.]]></summary></entry><entry><title type="html">How to tune a foundational model in watsonx.ai</title><link href="https://abdulrahmanh.com/blog/foundation-model-tuning" rel="alternate" type="text/html" title="How to tune a foundational model in watsonx.ai" /><published>2024-12-13T00:00:00+01:00</published><updated>2024-12-13T00:00:00+01:00</updated><id>https://abdulrahmanh.com/blog/foundation-model-tuning</id><content type="html" xml:base="https://abdulrahmanh.com/blog/foundation-model-tuning"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>Tuning a foundation model is a crucial step in customizing AI systems to generate desired outputs efficiently. This guide provides an in-depth walkthrough of how to perform prompt tuning for foundation models using Watsonx.ai, including setting up a tuning experiment and optimizing for specific tasks.</p>

<hr />
<h2 id="requirements">Requirements</h2>

<p>Before diving into tuning, ensure you have access to <strong>Projects</strong> in Watsonx.ai. Note that availability varies by plan and data center, so confirm which foundation models are available for tuning in your region.</p>

<p>To begin, a default project named <code class="language-plaintext highlighter-rouge">sandbox</code> is created for Watsonx.ai users. If you don’t see this project, create one manually by following these steps:</p>

<ol>
  <li>Expand <strong>Projects</strong> from the main menu and click <strong>All projects</strong>.</li>
  <li>Click <strong>New project</strong>.</li>
  <li>Name your project and optionally add a description.</li>
  <li>Click <strong>Create</strong>.</li>
</ol>

<p>For additional project options like reporting or logging, refer to the <strong>Creating a project</strong> documentation.</p>

<hr />
<h2 id="️-before-you-begin">🛠️ Before You Begin</h2>

<p>Make decisions about the following tuning options:</p>

<ol>
  <li><strong>Select the Foundation Model</strong>:
    <ul>
      <li>Choose a model that aligns with your use case.</li>
    </ul>
  </li>
  <li><strong>Prepare Example Prompts</strong>:
    <ul>
      <li>Create example prompts based on your prompt engineering work.</li>
    </ul>
  </li>
</ol>

<h2 id="download-the-datasets-from-hugging-face">Download the dataset from Hugging Face</h2>

<ol>
  <li><strong>Dataset Details</strong>:</li>
</ol>

<p>In this project, I am using the <a href="https://huggingface.co/datasets/alespalla/chatbot_instruction_prompts">alespalla/chatbot_instruction_prompts</a> dataset.</p>

<p><img src="../assets/images/posts/2024-12-13-foundation-model-tuning/1.jpg" alt="" /></p>

<ol>
  <li><strong>Download and clean the dataset</strong>:</li>
</ol>

<p>Download the dataset and convert it into JSON or JSONL format (e.g., an input-output format).</p>
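<p>As a sketch, the downloaded rows can be reshaped into the input-output JSONL layout with a few lines of Python. The <code class="language-plaintext highlighter-rouge">prompt</code>/<code class="language-plaintext highlighter-rouge">response</code> column names are assumptions based on this dataset’s card; adjust them to your data.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

# Reshape dataset rows into the input/output JSONL layout used for
# prompt tuning. Column names ("prompt"/"response") are assumptions
# based on the chatbot_instruction_prompts dataset card.
def to_jsonl(records, path):
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            line = json.dumps({"input": rec["prompt"], "output": rec["response"]})
            f.write(line + "\n")

rows = [
    {"prompt": "What is AI?", "response": "AI simulates human intelligence."},
    {"prompt": "Define ETL.", "response": "Extract, transform, load."},
]
to_jsonl(rows, "train.jsonl")
</code></pre></div></div>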

<ol>
  <li><strong>Upload the dataset to IBM Cloud Object Storage (COS), or upload it manually</strong></li>
</ol>

<h2 id="-how-to-tune-a-foundation-model">🔧 How to Tune a Foundation Model</h2>
<h3 id="step-1-start-a-tuning-experiment">Step 1: Start a Tuning Experiment</h3>

<ol>
  <li>From the Watsonx.ai home page, select your project.</li>
  <li>Click <strong>New asset &gt; Tune a foundation model with labeled data</strong>.</li>
  <li>Name your tuning experiment and optionally add a description and tags.</li>
  <li>Click <strong>Create</strong>.</li>
</ol>

<p><img src="../assets/images/posts/2024-12-13-foundation-model-tuning/2.jpg" alt="" /></p>

<h3 id="step-2-choose-a-foundation-model">Step 2: Choose a Foundation Model</h3>

<ol>
  <li>Click <strong>Select a foundation model</strong>.</li>
  <li>Browse through available models and view detailed information by selecting their tiles.</li>
  <li>Once decided, click <strong>Select</strong>.</li>
</ol>

<h3 id="step-3-upload-the-dataset-and-initialize-the-prompt">Step 3: Upload the dataset and Initialize the Prompt</h3>
<p>Upload the dataset from Cloud Object Storage or upload it manually.</p>

<p><img src="../assets/images/posts/2024-12-13-foundation-model-tuning/3.jpg" alt="" /></p>

<p>Choose one of the following initialization options:</p>
<ul>
  <li><strong>Text</strong>: Provide specific initialization text.</li>
  <li><strong>Random</strong>: Let the system generate initialization values.</li>
</ul>

<p><img src="../assets/images/posts/2024-12-13-foundation-model-tuning/4.jpg" alt="" /></p>

<h4 id="adding-initialization-text">Adding Initialization Text</h4>

<p>For the <strong>Text</strong> method, provide task-specific instructions:</p>
<ul>
  <li><strong>Classification</strong>: Include task details and class labels, e.g., <em>Classify sentiment as Positive or Negative.</em></li>
  <li><strong>Generation</strong>: Provide a detailed request, e.g., <em>Generate an email promoting remote work.</em></li>
  <li><strong>Summarization</strong>: Specify objectives, e.g., <em>Summarize meeting highlights.</em></li>
</ul>

<h3 id="step-4-specify-the-task-type">Step 4: Specify the Task Type</h3>

<p>Select the task type that fits your goal:</p>
<ul>
  <li><strong>Classification</strong>: Assign categorical labels.</li>
  <li><strong>Generation</strong>: Produce text outputs.</li>
  <li><strong>Summarization</strong>: Extract main ideas from text.</li>
</ul>

<p><img src="../assets/images/posts/2024-12-13-foundation-model-tuning/5.jpg" alt="" /></p>
<h3 id="step-5-add-training-data">Step 5: Add Training Data</h3>

<ol>
  <li>Upload training data or use an existing project asset.</li>
  <li>To preview data format templates, expand <strong>What should your data look like?</strong></li>
  <li>Optionally, adjust the maximum tokens allowed for input and output to optimize processing time.</li>
</ol>

<h3 id="step-6-configure-tuning-parameters">Step 6: Configure Tuning Parameters</h3>

<ol>
  <li>Edit parameter values for the tuning experiment by clicking <strong>Configure parameters</strong>.</li>
  <li>After adjustments, click <strong>Save</strong>.</li>
</ol>

<h3 id="step-7-start-tuning">Step 7: Start Tuning</h3>

<p>Click <strong>Start tuning</strong> to begin the experiment. The duration depends on the dataset size and compute resource availability. Once complete, the status will display as <strong>Completed</strong>.</p>

<p><img src="../assets/images/posts/2024-12-13-foundation-model-tuning/6.jpg" alt="" /></p>

<hr />
<h2 id="-tips-for-token-management">🧩 Tips for Token Management</h2>

<p>Tokens are key units for natural language processing. Adjust token limits to optimize performance:</p>
<ul>
  <li><strong>Maximum Input Tokens</strong>: Controls input size (e.g., 256 tokens).</li>
  <li><strong>Maximum Output Tokens</strong>: Limits generated output size (e.g., 128 tokens).</li>
</ul>
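<p>As a rough illustration of why the input limit matters, here is a toy truncation helper. Whitespace splitting is only a stand-in for real subword tokenization, so treat this as a sketch:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy illustration of a maximum-input-tokens limit. Real models use
# subword tokenizers, so whitespace splitting is only an approximation.
def truncate_tokens(text, max_input_tokens=256):
    tokens = text.split()
    return tokens[:max_input_tokens]

# Anything beyond the limit is simply not seen by the model
truncate_tokens("classify the sentiment of this review", max_input_tokens=4)
</code></pre></div></div>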

<h3 id="example">Example</h3>
<p>For classification tasks, reduce output size to encourage concise results (e.g., <em>Positive</em> or <em>Negative</em>).</p>

<hr />
<h2 id="-evaluating-the-tuning-experiment">📊 Evaluating the Tuning Experiment</h2>

<p>The experiment results include a loss function graph that visualizes model improvement:</p>
<ul>
  <li><strong>X-axis</strong>: Epochs.</li>
  <li><strong>Y-axis</strong>: Difference between predicted and actual results.</li>
</ul>

<h2 id="-deploy-a-tuned-model">🔧 Deploy a Tuned Model</h2>

<h3 id="steps-to-deploy">Steps to Deploy:</h3>

<ol>
  <li>From the navigation menu, expand <strong>Projects</strong> and click <strong>All projects</strong>.</li>
  <li>Select your project and navigate to the <strong>Assets</strong> tab.</li>
  <li>Open the tuning experiment associated with the model you wish to deploy.</li>
  <li>From the <strong>Tuned models</strong> list, locate the completed experiment and click <strong>New deployment</strong>.</li>
</ol>

<p><img src="../assets/images/posts/2024-12-13-foundation-model-tuning/7.jpg" alt="" /></p>
<ol>
  <li>Provide a name for the tuned model. Optionally, add a description and tags.</li>
  <li>Select a deployment container:
    <ul>
      <li><strong>This project</strong>: For testing within the project.</li>
      <li><strong>Deployment space</strong>: For production-ready deployment.</li>
    </ul>
  </li>
</ol>

<p>Click <strong>Deploy</strong> to complete the process.</p>

<p><img src="../assets/images/posts/2024-12-13-foundation-model-tuning/8.jpg" alt="" /></p>

<h2 id="-testing-the-deployed-model">🔍 Testing the Deployed Model</h2>

<h3 id="test-options">Test Options:</h3>

<ul>
  <li><strong>Project</strong>: Ideal for development and testing phases.</li>
  <li><strong>Deployment space</strong>: Test programmatically or via the API Reference tab.</li>
  <li><strong>Prompt Lab</strong>: Offers an intuitive interface for detailed prompting and testing.</li>
</ul>

<p><img src="../assets/images/posts/2024-12-13-foundation-model-tuning/9.jpg" alt="" /></p>

<h4 id="testing-in-prompt-lab">Testing in Prompt Lab:</h4>
<ol>
  <li>Open the deployed model in the project or deployment space.</li>
  <li>Click <strong>Open in Prompt Lab</strong>.</li>
  <li>Input a prompt tailored to the model’s tuning and click <strong>Generate</strong>.</li>
</ol>

<h3 id="trained-model-response">Trained Model response</h3>
<p>This is the output generated by the tuned Granite model.</p>

<p><img src="../assets/images/posts/2024-12-13-foundation-model-tuning/10.jpg" alt="" /></p>

<h3 id="without-trained-model-response">Without Trained Model response</h3>
<p>This is the normal output from the base (untuned) model.</p>

<p><img src="../assets/images/posts/2024-12-13-foundation-model-tuning/11.jpg" alt="" /></p>

<hr />
<h2 id="conclusion">Conclusion</h2>

<p>Tuning foundation models allows for customization that aligns AI outputs with your specific needs. By following this guide, you’ll maximize the potential of Watsonx.ai for your use cases.</p>]]></content><author><name>Abdul Rahman</name><email>mailforrahman197@gmail.com</email></author><category term="AI" /><category term="Foundation Models" /><category term="Watsonx" /><category term="Watsonx" /><category term="AI" /><category term="Foundation Models" /><category term="Machine Learning" /><summary type="html"><![CDATA[A detailed guide to tuning foundation models for optimal performance using Watsonx.ai.]]></summary></entry><entry><title type="html">Deploying Falcon-7B from Hugging Face to Watsonx.ai Using IBM Cloud Storage</title><link href="https://abdulrahmanh.com/blog/falcon-7b-in-IBM-Cloud" rel="alternate" type="text/html" title="Deploying Falcon-7B from Hugging Face to Watsonx.ai Using IBM Cloud Storage" /><published>2024-12-10T00:00:00+01:00</published><updated>2024-12-10T00:00:00+01:00</updated><id>https://abdulrahmanh.com/blog/falcon-7b-in-IBM-Cloud</id><content type="html" xml:base="https://abdulrahmanh.com/blog/falcon-7b-in-IBM-Cloud"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>IBM’s watsonx.ai provides a flexible and robust platform for deploying foundation models, enabling developers to integrate them into generative AI solutions. In this guide, we will show you how to deploy the <strong>Falcon-7B</strong> foundation model from Hugging Face on IBM Cloud using <strong>Windows commands</strong>.</p>

<p>You will learn about the necessary steps to:</p>
<ol>
  <li>Ensure the model meets the requirements for deployment.</li>
  <li>Download and prepare the model in the correct format.</li>
  <li>Upload it to IBM Cloud Object Storage.</li>
  <li>Deploy and test the model using watsonx.ai.</li>
</ol>

<hr />

<h2 id="-prerequisites">📋 Prerequisites</h2>

<p>Before starting, ensure you have:</p>
<ul>
  <li>An IBM Cloud account.</li>
  <li>Access to watsonx.ai (trial or paid).</li>
  <li>A Hugging Face account with API token.</li>
  <li>A Windows system with Python and necessary tools installed.</li>
</ul>

<hr />

<h2 id="️-step-by-step-deployment-guide">🛠️ Step-by-Step Deployment Guide</h2>

<h3 id="step-1-ensure-model-compatibility">Step 1: Ensure Model Compatibility</h3>

<p>Your foundation model must meet the following criteria to be deployed on watsonx.ai:</p>

<ul>
  <li>Compatible with the <strong>Text Generation Inference (TGI)</strong> standard.</li>
  <li>Built with supported model architecture and <code class="language-plaintext highlighter-rouge">gptq</code> model type.</li>
  <li>Available in <strong>safetensors</strong> format.</li>
  <li>Includes <code class="language-plaintext highlighter-rouge">config.json</code> and <code class="language-plaintext highlighter-rouge">tokenizer.json</code> files.</li>
</ul>

<blockquote>
  <p><strong>💡 Tip:</strong> You can verify these files exist for the Falcon-7B model on <a href="https://huggingface.co/tiiuae/falcon-7b">Hugging Face</a>.</p>
</blockquote>

<p><img src="../assets/images/posts/2024-12-10-falcon-7b-in-IBM-Cloud/1.jpg" alt="Hugging Face" /></p>

<hr />

<h3 id="step-2-download-the-model">Step 2: Download the Model</h3>

<p>To download the Falcon-7B model on Windows:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Set up a virtual environment</span>
python <span class="nt">-m</span> venv myenv
myenv<span class="se">\S</span>cripts<span class="se">\a</span>ctivate

<span class="c"># Install Hugging Face CLI</span>
pip <span class="nb">install</span> <span class="nt">-U</span> <span class="s2">"huggingface_hub[cli]"</span>

<span class="c"># Log in to Hugging Face</span>
huggingface-cli login <span class="nt">--token</span> YOUR_HF_TOKEN

<span class="c"># Set model and directory variables</span>
<span class="nb">set </span><span class="nv">MODEL_NAME</span><span class="o">=</span>tiiuae/falcon-7b
<span class="nb">set </span><span class="nv">MODEL_DIR</span><span class="o">=</span>C:<span class="se">\m</span>odels<span class="se">\f</span>alcon-7b
<span class="nb">mkdir</span> %MODEL_DIR%

<span class="c"># Download the model</span>
huggingface-cli download %MODEL_NAME% <span class="nt">--local-dir</span> %MODEL_DIR% <span class="nt">--cache-dir</span> %MODEL_DIR%
</code></pre></div></div>

<p><img src="../assets/images/posts/2024-12-10-falcon-7b-in-IBM-Cloud/2.jpg" alt="Downloading Falcon-7B" /></p>

<hr />

<h3 id="optional-step-3-convert-the-model">(Optional) Step 3: Convert the Model</h3>

<p>Convert the model to meet TGI requirements for Text Generation:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Pull the TGI image</span>
docker pull quay.io/modh/text-generation-inference:rhoai-2.8-58cac74

<span class="c"># Convert the model</span>
docker run <span class="nt">--rm</span> <span class="nt">-v</span> %MODEL_DIR%:/tmp quay.io/modh/text-generation-inference:rhoai-2.8-58cac74 bash <span class="nt">-c</span> <span class="s2">"export MODEL_PATH=/tmp; text-generation-server convert-to-safetensors </span><span class="k">${</span><span class="nv">MODEL_PATH</span><span class="k">}</span><span class="s2">; text-generation-server convert-to-fast-tokenizer </span><span class="k">${</span><span class="nv">MODEL_PATH</span><span class="k">}</span><span class="s2">"</span>
</code></pre></div></div>

<hr />

<h3 id="step-4-upload-to-cloud-object-storage">Step 4: Upload to Cloud Object Storage</h3>

<p>Prepare and upload the model to IBM Cloud Object Storage:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Install AWS CLI</span>
pip <span class="nb">install </span>awscli

<span class="c"># Set environment variables</span>
<span class="nb">set </span><span class="nv">AWS_ACCESS_KEY_ID</span><span class="o">=</span>&lt;your_aws_access_key&gt;
<span class="nb">set </span><span class="nv">AWS_SECRET_ACCESS_KEY</span><span class="o">=</span>&lt;your_aws_secret_access_key&gt;
<span class="nb">set </span><span class="nv">ENDPOINT</span><span class="o">=</span>&lt;s3_endpoint_url&gt;
<span class="nb">set </span><span class="nv">BUCKET_NAME</span><span class="o">=</span>&lt;bucket_name&gt;
<span class="nb">set </span><span class="nv">MODEL_FOLDER</span><span class="o">=</span>&lt;model_folder&gt;

<span class="c"># Upload the model</span>
aws <span class="nt">--endpoint-url</span> %ENDPOINT% s3 <span class="nb">cp</span> %MODEL_DIR% s3://%BUCKET_NAME%/%MODEL_FOLDER%/ <span class="nt">--recursive</span> <span class="nt">--follow-symlinks</span>
</code></pre></div></div>

<p><img src="../assets/images/posts/2024-12-10-falcon-7b-in-IBM-Cloud/3.jpg" alt="Uploading Falcon-7B" /></p>

<hr />

<h3 id="step-5-import-the-model-to-watsonxai">Step 5: Import the Model to watsonx.ai</h3>

<ol>
  <li>Navigate to your deployment space in watsonx.ai.</li>
  <li>Go to <strong>Assets</strong> → <strong>Import</strong>.</li>
  <li>Choose <strong>Custom Foundation Model</strong>.</li>
  <li>Connect to your cloud storage and select the folder containing the model.</li>
</ol>

<p><img src="../assets/images/posts/2024-12-10-falcon-7b-in-IBM-Cloud/4.jpg" alt="Importing Falcon-7B" /></p>

<hr />

<h3 id="step-6-deploy-the-model">Step 6: Deploy the Model</h3>

<p>Then create a deployment space and deploy the model; wait for the status to change from <strong>Initializing</strong> to <strong>Deployed</strong>.</p>

<p><img src="../assets/images/posts/2024-12-10-falcon-7b-in-IBM-Cloud/5.jpg" alt="Deploying Falcon-7B" /></p>

<p>Once the status indicator turns green, you are ready to go.</p>

<p><img src="../assets/images/posts/2024-12-10-falcon-7b-in-IBM-Cloud/6.jpg" alt="Deployed" /></p>

<hr />

<h3 id="step-7-test-your-deployment">Step 7: Test Your Deployment</h3>

<p>Use the watsonx.ai Prompt Lab or API to test your model:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-X</span> POST <span class="s2">"https://&lt;your_cloud_hostname&gt;/ml/v1/deployments/&lt;deployment_id&gt;/text/generation?version=2024-01-29"</span> ^
<span class="nt">-H</span> <span class="s2">"Authorization: Bearer &lt;your_token&gt;"</span> ^
<span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> ^
<span class="nt">--data</span> <span class="s1">'{
 "input": "Hello, what is your name?",
 "parameters": {
    "max_new_tokens": 200,
    "min_new_tokens": 20
 }
}'</span>
</code></pre></div></div>
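<p>The same call can be issued from Python. The sketch below only builds the request pieces that the curl example uses; the hostname, deployment ID, and token are placeholders you must replace, and actually sending the request (e.g., with <code class="language-plaintext highlighter-rouge">requests.post</code>) is left to you:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Build the text-generation request mirroring the curl example above.
# hostname, deployment_id, and token are placeholders, not real values.
def build_generation_request(hostname, deployment_id, token, prompt):
    url = (f"https://{hostname}/ml/v1/deployments/"
           f"{deployment_id}/text/generation?version=2024-01-29")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    payload = {
        "input": prompt,
        "parameters": {"max_new_tokens": 200, "min_new_tokens": 20},
    }
    return url, headers, payload

url, headers, payload = build_generation_request(
    "YOUR_CLOUD_HOSTNAME", "YOUR_DEPLOYMENT_ID", "YOUR_TOKEN",
    "Hello, what is your name?")
# send with: requests.post(url, headers=headers, json=payload)
</code></pre></div></div>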

<p><strong>Test the Falcon-7B model in Prompt Lab. Adjust the tokens, decoding method, and system prompt to get your answers right.</strong></p>

<p><img src="../assets/images/posts/2024-12-10-falcon-7b-in-IBM-Cloud/7.jpg" alt="Testing Falcon-7B in prompt lab" /></p>

<hr />

<h2 id="-summary">🎉 Summary</h2>

<p>By following this guide, you have successfully deployed Falcon-7B to IBM Cloud using watsonx.ai. You can now leverage this powerful foundation model for generative AI applications tailored to your needs.</p>

<p>For more advanced features and configurations, visit the <a href="https://www.ibm.com/docs/en/watsonx">IBM watsonx.ai documentation</a>.</p>

<p>View more of my blogs at <a href="https://abdulrahmanh.com/blog">https://abdulrahmanh.com/blog</a>.</p>

<p><strong>Happy Deploying!</strong></p>]]></content><author><name>Abdul Rahman</name><email>mailforrahman197@gmail.com</email></author><summary type="html"><![CDATA[A step-by-step guide to importing and deploying the Falcon-7B foundation model from Hugging Face to IBM Cloud using watsonx.ai.]]></summary></entry><entry><title type="html">Getting Started with Watsonx.ai Generative AI and Foundation Models</title><link href="https://abdulrahmanh.com/blog/Watsonx-generative-ai-and-foundational-models" rel="alternate" type="text/html" title="Getting Started with Watsonx.ai Generative AI and Foundation Models" /><published>2024-12-02T00:00:00+01:00</published><updated>2024-12-02T00:00:00+01:00</updated><id>https://abdulrahmanh.com/blog/Watsonx-generative-ai-and-foundational-models</id><content type="html" xml:base="https://abdulrahmanh.com/blog/Watsonx-generative-ai-and-foundational-models"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>Artificial Intelligence (AI) is revolutionizing industries by enabling machines to perform tasks that traditionally required human intelligence. IBM’s Watsonx.ai is at the forefront of this transformation, offering tools to build, fine-tune, and deploy AI models with ease. In this blog, we’ll explore foundational AI concepts, generative AI capabilities, and how to get started with Watsonx.ai.</p>

<hr />
<p><img src="../assets/images/posts/2024-12-02-Watsonx-generative-ai-and-foundational-models/1.gif" alt="" /></p>

<h2 id="-key-ai-terms-and-definitions">⚡ Key AI Terms and Definitions</h2>

<h3 id="1-artificial-intelligence">1. <strong>Artificial Intelligence</strong></h3>
<p>The simulation of human intelligence by machines, enabling them to perform tasks like reasoning, learning, and problem-solving.</p>

<h3 id="2-machine-learning">2. <strong>Machine Learning</strong></h3>
<p>A subset of AI focused on developing algorithms that allow computers to learn from data and make decisions based on statistical predictions.</p>

<h3 id="3-deep-learning">3. <strong>Deep Learning</strong></h3>
<p>A subset of machine learning that uses artificial neural networks with multiple layers to process vast amounts of data. It excels at handling unstructured data like images and text.</p>

<h3 id="4-foundation-models">4. <strong>Foundation Models</strong></h3>
<p>Specific types of deep learning models built using neural network architectures like transformers. These models are pre-trained on vast datasets and fine-tuned for specific tasks.</p>

<h3 id="5-generative-ai">5. <strong>Generative AI</strong></h3>
<p>AI algorithms capable of creating new content such as text, images, code, or audio. Unlike traditional AI, generative AI generates outputs rather than simply recognizing patterns.</p>

<h3 id="6-large-language-models-llms">6. <strong>Large Language Models (LLMs)</strong></h3>
<p>A type of foundation model trained on extensive text datasets using self-supervised learning. LLMs can perform tasks ranging from natural language understanding to code generation.</p>

<h3 id="7-hallucination">7. <strong>Hallucination</strong></h3>
<p>A phenomenon in LLMs where the system generates incorrect or nonsensical outputs that may appear plausible.</p>

<h3 id="8-natural-language-processing-nlp">8. <strong>Natural Language Processing (NLP)</strong></h3>
<p>Technology that enables computers to understand, interpret, and generate human language in text or spoken forms.</p>

<h3 id="9-prompt">9. <strong>Prompt</strong></h3>
<p>The input or query used to interface with AI models. Well-crafted prompts can improve the accuracy and relevance of AI responses.</p>

<h3 id="10-prompt-engineering">10. <strong>Prompt Engineering</strong></h3>
<p>The process of designing effective prompts to optimize the performance of AI models.</p>

<h3 id="11-decoder-only-model">11. <strong>Decoder-only Model</strong></h3>
<p>Models designed specifically for generative AI tasks, such as GPT-based architectures.</p>

<h3 id="12-encoder-only-model">12. <strong>Encoder-only Model</strong></h3>
<p>Models optimized for non-generative tasks, such as text classification or sentiment analysis.</p>

<h3 id="13-encoder-decoder-model">13. <strong>Encoder-Decoder Model</strong></h3>
<p>Models that combine encoding and decoding mechanisms, supporting both generative and non-generative tasks efficiently.</p>

<h3 id="14-tokens">14. <strong>Tokens</strong></h3>
<p>Units of text (e.g., words, subwords, or characters) used by AI models. Tokenization is the process of converting text into these units for model processing.</p>
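<p>A minimal sketch of word-level tokenization follows. Production LLMs use subword schemes such as BPE or SentencePiece, so this is illustration only:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

# Naive word-level tokenizer for illustration; real LLM tokenizers
# operate on subword units learned from data (e.g., BPE).
def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

tokenize("Tokens are units of text!")
</code></pre></div></div>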

<hr />

<h2 id="️-pre-trained-models">🛠️ Pre-trained Models</h2>

<p>Pre-trained models are AI models that have already been trained on large datasets to perform general tasks. These models can be fine-tuned for specific use cases, saving time and computational resources.</p>

<h3 id="benefits-of-pre-trained-models">Benefits of Pre-trained Models:</h3>
<ol>
  <li><strong>Faster Development</strong>: Avoid starting from scratch.</li>
  <li><strong>Cost-Effective</strong>: Reduce training costs by leveraging existing models.</li>
  <li><strong>High Accuracy</strong>: Benefit from the vast amount of data and computational power used during pre-training.</li>
</ol>

<h3 id="examples-of-pre-trained-models-in-watsonxai">Examples of Pre-trained Models in Watsonx.ai:</h3>
<ol>
  <li><strong>Text Models</strong>: LLMs for tasks like summarization, translation, and content generation.</li>
  <li><strong>Image Models</strong>: Models trained for image recognition and object detection.</li>
  <li><strong>Code Models</strong>: Models optimized for generating and debugging code.</li>
</ol>

<hr />

<h2 id="-getting-started-with-watsonxai">🚀 Getting Started with Watsonx.ai</h2>

<p>Watsonx.ai provides an intuitive platform to explore and deploy foundation models. Here’s how you can get started:</p>

<h3 id="step-1-access-watsonxai">Step 1: Access Watsonx.ai</h3>

<ul>
  <li>Sign up for IBM Cloud and navigate to the Watsonx.ai section.</li>
  <li>Log in with your IBM account to access the platform.</li>
</ul>

<h3 id="step-2-explore-foundation-models">Step 2: Explore Foundation Models</h3>

<ul>
  <li>Browse the library of pre-trained models.</li>
  <li>Select a model suited to your task, such as text generation or image classification.</li>
</ul>

<h3 id="step-3-fine-tune-models">Step 3: Fine-tune Models</h3>

<ul>
  <li>Use your own dataset to fine-tune pre-trained models for specific use cases.</li>
  <li>Adjust parameters like temperature and max tokens for optimal performance.</li>
</ul>
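<p>A hypothetical set of generation parameters might look like the dictionary below. The names mirror common Prompt Lab controls but should be verified against the current Watsonx.ai documentation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical generation settings; names follow common watsonx.ai
# Prompt Lab controls and should be checked against the current docs.
generation_params = {
    "decoding_method": "sample",  # sampling, so that temperature applies
    "temperature": 0.7,           # higher values give more varied output
    "max_new_tokens": 200,        # cap on generated length
    "min_new_tokens": 20,
}
</code></pre></div></div>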

<h3 id="step-4-deploy-models">Step 4: Deploy Models</h3>

<ul>
  <li>Deploy your model as an API for integration into applications.</li>
  <li>Use the Watsonx.ai SDK for seamless interaction with your deployed models.</li>
</ul>

<hr />

<h2 id="-conclusion">🌐 Conclusion</h2>

<p>Watsonx.ai empowers businesses and developers to harness the potential of AI with minimal effort. From understanding foundational concepts to deploying state-of-the-art models, Watsonx.ai provides the tools needed to succeed in the AI era.</p>

<p>Take the first step today and explore the capabilities of Watsonx.ai. The future of AI is here!</p>]]></content><author><name>Abdul Rahman</name><email>mailforrahman197@gmail.com</email></author><summary type="html"><![CDATA[Learn the fundamentals of AI, the latest in generative AI, and how to get started with Watsonx.ai.]]></summary></entry><entry><title type="html">Chat with LLaMA: Explore IBM’s Latest AI Vision Models</title><link href="https://abdulrahmanh.com/blog/chat-with-images-using-llama" rel="alternate" type="text/html" title="Chat with LLaMA: Explore IBM’s Latest AI Vision Models" /><published>2024-11-27T00:00:00+01:00</published><updated>2024-11-27T00:00:00+01:00</updated><id>https://abdulrahmanh.com/blog/chat-with-images-using-llama</id><content type="html" xml:base="https://abdulrahmanh.com/blog/chat-with-images-using-llama"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>Welcome to the future of AI-powered image analysis! IBM has recently released advanced vision models as part of their Watson AI suite. These models combine cutting-edge image recognition with conversational AI capabilities, allowing users to analyze images and ask questions about their contents seamlessly.</p>

<p>In this blog, we’ll explore how to build a simple <strong>Streamlit application</strong> to interact with these models. Whether you’re analyzing photos, diagrams, or charts, this app can provide insights and context through intelligent conversations.</p>

<hr />

<h2 id="-what-makes-ibms-vision-models-unique">🎯 What Makes IBM’s Vision Models Unique?</h2>

<p>IBM’s latest vision models, like the <strong>Meta-LLaMA 3-2-90b Vision Instruct</strong>, are designed to:</p>

<ol>
  <li><strong>Analyze Visual Data</strong>: Extract meaningful insights from images, from object recognition to contextual understanding.</li>
  <li><strong>Enable Conversational Interaction</strong>: Use natural language to query images, blending vision and language capabilities.</li>
  <li><strong>Empower Developers</strong>: Simplify integration into apps using intuitive APIs.</li>
</ol>

<p><strong>Key Features:</strong></p>
<ul>
  <li>Seamless integration with Streamlit for rapid prototyping.</li>
  <li>High accuracy in visual and conversational tasks.</li>
  <li>Easy-to-use Watson API for streamlined development.</li>
</ul>

<hr />

<h2 id="-building-the-chat-with-images-app">🚀 Building the Chat with Images App</h2>

<p>Let’s dive into creating a <strong>Streamlit</strong> app that connects with IBM’s vision models. This app allows users to upload an image, analyze it using the Watson AI model, and interact with it via chat.</p>

<p><img src="../assets/images/posts/2024-11-27-chat-with-images-using-llama/1.jpg" alt="" /></p>

<h3 id="prerequisites">Prerequisites</h3>

<ol>
  <li>An <strong>IBM Cloud Account</strong> with access to Watson AI services.</li>
  <li><strong>Python</strong> installed on your machine.</li>
</ol>
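<p>The code in the next section reads your IBM credentials from a <code class="language-plaintext highlighter-rouge">.env</code> file via <code class="language-plaintext highlighter-rouge">python-dotenv</code>. Create one in your project folder before running the app. The variable name matches the <code class="language-plaintext highlighter-rouge">os.getenv("IBM_API_KEY")</code> call used below; the value shown is a placeholder for your own key:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>IBM_API_KEY=your_ibm_cloud_api_key_here
</code></pre></div></div>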

<h2 id="️-code-implementation">🛠️ Code Implementation</h2>

<h3 id="part-1-setup-and-environment-configuration">Part 1: Setup and Environment Configuration</h3>

<p>This section initializes necessary libraries, loads environment variables, and provides utility functions for image conversion and authentication.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import streamlit as st
import base64
from PIL import Image
import os
from dotenv import load_dotenv
import requests

# Load environment variables
load_dotenv()
api_key = os.getenv("IBM_API_KEY")

def convert_image_to_base64(uploaded_file):
    """Convert an uploaded image to a Base64 string."""
    bytes_data = uploaded_file.getvalue()
    base64_image = base64.b64encode(bytes_data).decode()
    return base64_image

def get_auth_token(api_key):
    """Retrieve an IAM authentication token using the IBM API key."""
    auth_url = "https://iam.cloud.ibm.com/identity/token"

    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/json"
    }

    data = {
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key
    }

    response = requests.post(auth_url, headers=headers, data=data)

    if response.status_code == 200:
        return response.json().get("access_token")
    else:
        raise Exception("Failed to get authentication token")
</code></pre></div></div>
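<p>Outside of Streamlit, the <code class="language-plaintext highlighter-rouge">convert_image_to_base64</code> helper reduces to standard-library <code class="language-plaintext highlighter-rouge">base64</code> calls. A quick sanity check with plain bytes (no Watson or Streamlit required):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import base64

# Any raw bytes work; plain text bytes are enough for a sanity check
bytes_data = b"hello"
encoded = base64.b64encode(bytes_data).decode()
print(encoded)                     # aGVsbG8=
print(base64.b64decode(encoded))   # b'hello'
</code></pre></div></div>

<p>In the app, the resulting string is embedded in a <code class="language-plaintext highlighter-rouge">data:image/png;base64,...</code> URL so the model can receive the image inline.</p>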

<h3 id="part-2-user-interaction-and-state-management">Part 2: User Interaction and State Management</h3>

<p>Here, we define the app’s main logic, including file upload handling, chat state initialization, and chat message rendering.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def main():
    st.title("Chat with Images")

    # Initialize chat history and uploaded-file state
    if "messages" not in st.session_state:
        st.session_state.messages = []
    if "uploaded_file" not in st.session_state:
        st.session_state.uploaded_file = None

    # Button to clear the uploaded image and reset the chat
    if st.session_state.uploaded_file:
        if st.button("Clear Uploaded Image"):
            st.session_state.uploaded_file = None
            st.session_state.messages = []

    # User input: upload an image
    uploaded_file = st.file_uploader("Choose an image...", type=["jpg", "jpeg", "png"])
    if uploaded_file is not None:
        st.session_state.uploaded_file = uploaded_file
        image = Image.open(uploaded_file)
        with st.chat_message("user"):
            st.image(image, caption='Uploaded Image', use_container_width=True)
            base64_image = convert_image_to_base64(uploaded_file)
            st.session_state.messages.append({"role": "user", "content": [{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}]})

    # Display chat messages (skip the first entry, which holds the image)
    for msg in st.session_state.messages[1:]:
        if msg['role'] == "user":
            with st.chat_message("user"):
                if msg['content'][0]['type'] == "text":
                    st.write(msg['content'][0]['text'])
        else:
            st.chat_message("assistant").write(msg["content"])

</code></pre></div></div>
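<p>The chat history is a plain list of dictionaries, so the rendering loop above can be reasoned about in isolation. Here is a minimal sketch with hypothetical sample messages (no Streamlit required); it mirrors the loop’s behavior of skipping the first entry, which holds the uploaded image:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical chat history in the same shape the app stores in session state
messages = [
    {"role": "user", "content": [{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}]},
    {"role": "user", "content": [{"type": "text", "text": "What is in this image?"}]},
    {"role": "assistant", "content": "A cat sitting on a chair."},
]

# Skip the first entry (the image), show user text items and assistant strings
for msg in messages[1:]:
    if msg["role"] == "user":
        if msg["content"][0]["type"] == "text":
            print("user:", msg["content"][0]["text"])
    else:
        print("assistant:", msg["content"])
</code></pre></div></div>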

<h3 id="part-3-api-integration-and-response-handling">Part 3: API Integration and Response Handling</h3>

<p>This part handles API requests to Watson’s model, processes the responses, and updates the chat interface.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    # User input: chat message
    user_input = st.chat_input("Type your message here...")

    if user_input:
        message = {"role": "user", "content": [{"type": "text", "text": user_input}]}
        st.session_state.messages.append(message)
        st.chat_message(message['role']).write(user_input)

        # Prepare and send the API request
        url = "https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29"

        model_messages = []
        latest_image_url = None
        for msg in st.session_state.messages:
            if msg["role"] == "user" and isinstance(msg["content"], list):
                content = []
                for item in msg["content"]:
                    if item["type"] == "text":
                        content.append(item)
                    elif item["type"] == "image_url":
                        latest_image_url = item
                if latest_image_url:
                    content.append(latest_image_url)
                model_messages.append({"role": msg["role"], "content": content})
            else:
                model_messages.append({"role": msg["role"], "content": [{"type": "text", "text": msg["content"]}] if isinstance(msg["content"], str) else msg["content"]})

        body = {
            "messages": [model_messages[-1]],  # Send only the latest message (with the image attached)
            "project_id": "833c9053-ef07-455e-819f-6557dea2f8bc",  # Replace with your own watsonx.ai project ID
            "model_id": "meta-llama/llama-3-2-90b-vision-instruct",
            "decoding_method": "greedy",
            "repetition_penalty": 1,
            "max_tokens": 900
        }

        try:
            access_token = get_auth_token(api_key)

            headers = {
                "Accept": "application/json",
                "Content-Type": "application/json",
                "Authorization": f"Bearer {access_token}"
            }

            response = requests.post(
                url,
                headers=headers,
                json=body
            )

            if response.status_code == 200:
                res_content = response.json()['choices'][0]['message']['content']
                if isinstance(res_content, list):
                    res_content = " ".join([item.get("text", "") for item in res_content])
                st.session_state.messages.append({"role": "assistant", "content": res_content})
                with st.chat_message("assistant"):
                    st.write(res_content)
            else:
                error_message = "Sorry, I couldn't process your request. Please try again later."
                st.session_state.messages.append({"role": "assistant", "content": error_message})
                with st.chat_message("assistant"):
                    st.write(error_message)

        except Exception as e:
            st.error(f"An error occurred: {e}")

if __name__ == "__main__":
    main()

</code></pre></div></div>
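<p>The success branch above assumes the chat endpoint returns an OpenAI-style payload, where the answer may arrive either as a plain string or as a list of text fragments. You can exercise that parsing logic against a mocked response dictionary (the sample values here are made up):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Mocked response body in the shape the parsing code above expects
mock_response = {
    "choices": [
        {"message": {"role": "assistant",
                     "content": [{"type": "text", "text": "The image shows"},
                                 {"type": "text", "text": "a red bicycle."}]}}
    ]
}

res_content = mock_response['choices'][0]['message']['content']
if isinstance(res_content, list):
    res_content = " ".join([item.get("text", "") for item in res_content])
print(res_content)  # The image shows a red bicycle.
</code></pre></div></div>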

<h2 id="to-run-this-program">To run this program</h2>

<h3 id="1-save-the-above-three-parts-in-a-single-apppy-file">1. Save the above three parts in a single app.py file</h3>
<h3 id="2-create-a-requirementstxt-file-with-the-following-content">2. Create a requirements.txt file with the following content:</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>streamlit
requests
Pillow
python-dotenv
</code></pre></div></div>
<h3 id="3-setting-up-and-running">3. Setting Up and Running</h3>

<p>Set Up a Virtual Environment (Optional):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-m</span> venv venv
<span class="nb">source </span>venv/bin/activate  <span class="c"># For Windows: `venv\Scripts\activate`</span>

</code></pre></div></div>
<h3 id="4-install-dependencies">4. Install Dependencies</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="nt">-r</span> requirements.txt
</code></pre></div></div>
<h3 id="5-run-the-app">5. Run the app</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>streamlit run app.py
</code></pre></div></div>
<h3 id="6-try-the-live-app-live-app-in-your-browser">6. 🎉 Try the <a href="https://huggingface.co/spaces/RAHMAN00700/chat_with_images_using_llama-3-2-90bvi">live app</a> in your browser</h3>

<h2 id="summary">Summary</h2>

<ul>
  <li><strong>Part 1</strong>: Sets up essential libraries, authentication, and utility functions.</li>
  <li><strong>Part 2</strong>: Manages user input, file uploads, and chat session state.</li>
  <li><strong>Part 3</strong>: Integrates with IBM’s Watson AI API and handles AI-driven responses.</li>
</ul>

<p>Now you can build your app to combine visual data with conversational AI capabilities! 🚀</p>]]></content><author><name>Abdul Rahman</name><email>mailforrahman197@gmail.com</email></author><summary type="html"><![CDATA[A guide to leveraging IBM's cutting-edge AI models for analyzing images and engaging in conversations based on visual data.]]></summary></entry><entry><title type="html">Importing ML-Models and Creating Batch Deployments in IBM watsonx.ai</title><link href="https://abdulrahmanh.com/blog/importing-models-in-watsonx.ai" rel="alternate" type="text/html" title="Importing ML-Models and Creating Batch Deployments in IBM watsonx.ai" /><published>2024-11-20T00:00:00+01:00</published><updated>2024-11-20T00:00:00+01:00</updated><id>https://abdulrahmanh.com/blog/importing-models-in-watsonx.ai</id><content type="html" xml:base="https://abdulrahmanh.com/blog/importing-models-in-watsonx.ai"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>IBM watsonx.ai enables you to import machine learning models trained outside of its environment. Imported models are stored in the watsonx.ai repository (a Cloud Object Storage bucket) and can optionally be deployed for testing.</p>

<h2 id="ways-to-import-models">Ways to Import Models</h2>

<ol>
  <li><strong>Directly through the UI</strong></li>
  <li><strong>By using a path to a file</strong></li>
  <li><strong>By using a path to a directory</strong></li>
</ol>

<h3 id="steps-to-import-a-model-using-the-ui">Steps to Import a Model Using the UI</h3>

<ol>
  <li>Navigate to the <strong>Assets</strong> tab in your watsonx.ai  space.</li>
  <li>Click <strong>Import assets</strong>.</li>
</ol>

<p><img src="../assets/images/posts/2024-11-20-importing-models-in-watsonx.ai/1.jpg" alt="" /></p>

<ol start="3">
  <li>Select <strong>Local file</strong>, then <strong>Model</strong>.</li>
</ol>

<p><img src="../assets/images/posts/2024-11-20-importing-models-in-watsonx.ai/2.jpg" alt="" /></p>

<ol start="4">
  <li>Choose the model file and click <strong>Import</strong>.
    <ul>
      <li>The system will automatically select a matching model type based on the version string in the file.</li>
    </ul>
  </li>
</ol>

<p><img src="../assets/images/posts/2024-11-20-importing-models-in-watsonx.ai/3.jpg" alt="" /></p>

<h3 id="supported-frameworks-and-import-options">Supported Frameworks and Import Options</h3>

<table>
  <thead>
    <tr>
      <th>Import Option</th>
      <th>Spark MLlib</th>
      <th>Scikit-learn</th>
      <th>XGBoost</th>
      <th>TensorFlow</th>
      <th>PyTorch</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Importing a model object</td>
      <td>✓</td>
      <td>✓</td>
      <td>✓</td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>Importing a model via a file path</td>
      <td> </td>
      <td>✓</td>
      <td>✓</td>
      <td>✓</td>
      <td>✓</td>
    </tr>
    <tr>
      <td>Importing a model via a directory</td>
      <td> </td>
      <td>✓</td>
      <td>✓</td>
      <td>✓</td>
      <td>✓</td>
    </tr>
  </tbody>
</table>

<p><strong>Note:</strong> Models in PMML format can be imported directly by uploading the <code class="language-plaintext highlighter-rouge">.xml</code> file.</p>

<hr />

<h1 id="creating-batch-deployments">Creating Batch Deployments</h1>

<p>Batch deployments process input data from files or data connections and write the output to a specified destination. Unlike online deployments, batch deployments are designed for asynchronous processing.</p>

<h2 id="steps-to-create-a-batch-deployment-from-ui">Steps to Create a Batch Deployment from UI</h2>

<ol>
  <li>Organize resources in a deployment space, adding deployable assets and data files.</li>
  <li>Deploy the asset (e.g., machine learning model) with <strong>Batch</strong> as the deployment type.</li>
</ol>

<p><img src="../assets/images/posts/2024-11-20-importing-models-in-watsonx.ai/4.jpg" alt="" /></p>

<ol start="3">
  <li>Configure the batch deployment job by specifying the following in a <strong>new job</strong>:</li>
</ol>

<p><img src="../assets/images/posts/2024-11-20-importing-models-in-watsonx.ai/5.jpg" alt="" /></p>

<ul>
  <li>Input data location</li>
  <li>Output data destination</li>
</ul>

<p><img src="../assets/images/posts/2024-11-20-importing-models-in-watsonx.ai/8.jpg" alt="" /></p>

<ul>
  <li>Scheduling details (if needed)</li>
</ul>

<p><img src="../assets/images/posts/2024-11-20-importing-models-in-watsonx.ai/9.jpg" alt="" /></p>

<ol start="4">
  <li>Click <strong>Create</strong>. The status will change to <strong>Deployed</strong> upon successful creation.</li>
</ol>

<p><img src="../assets/images/posts/2024-11-20-importing-models-in-watsonx.ai/11.jpg" alt="" /></p>

<ol start="5">
  <li>Run the job, which processes the input data and writes the output to the specified location.</li>
</ol>

<h3 id="supported-asset-types-for-batch-deployment">Supported Asset Types for Batch Deployment</h3>

<ul>
  <li><strong>Models</strong>: AutoAI, Scikit-learn, TensorFlow, XGBoost, Spark MLlib, PyTorch-ONNX, PMML, SPSS Modeler</li>
  <li><strong>Scripts</strong>: Python scripts</li>
  <li><strong>Functions</strong>: Python functions, Decision Optimization models</li>
</ul>

<h3 id="testing-batch-deployments">Testing Batch Deployments</h3>

<ol>
  <li>Create a batch job from the deployment space.</li>
  <li>Define the job, including input data and run schedule.</li>
  <li>Run the job manually or as per the schedule.</li>
  <li>View or download the output from the <strong>Assets</strong> page.</li>
</ol>

<p><img src="../assets/images/posts/2024-11-20-importing-models-in-watsonx.ai/13.jpg" alt="" /></p>

<hr />

<h1 id="conclusion">Conclusion</h1>

<p>IBM watsonx.ai provides powerful tools for managing AI workflows, from importing models to executing batch deployments. Use this guide to streamline your AI model deployment and processing tasks.</p>

<p>Happy Deploying! 🎉</p>]]></content><author><name>Abdul Rahman</name><email>mailforrahman197@gmail.com</email></author><summary type="html"><![CDATA[A step-by-step guide to importing machine learning models and creating batch deployments using IBM watsonx.ai.]]></summary></entry><entry><title type="html">Chat with Multiple Documents Using Streamlit and Watsonx</title><link href="https://abdulrahmanh.com/blog/chat-with-multidocs-watsonx" rel="alternate" type="text/html" title="Chat with Multiple Documents Using Streamlit and Watsonx" /><published>2024-11-10T00:00:00+01:00</published><updated>2024-11-10T00:00:00+01:00</updated><id>https://abdulrahmanh.com/blog/chat-with-multidocs-watsonx</id><content type="html" xml:base="https://abdulrahmanh.com/blog/chat-with-multidocs-watsonx"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>The ability to extract meaningful information from multiple document types (like PDFs, DOCX, CSV, JSON, and more) has become essential for businesses and researchers. This blog explains how to build a Streamlit app that integrates <strong>Watsonx.ai</strong>, <strong>LangChain</strong>, and retrieval-augmented generation (RAG) to make querying documents seamless and efficient.</p>

<h2 id="live-app">Live App</h2>
<p><a href="https://huggingface.co/spaces/RAHMAN00700/Chat-with-Multiple-Documents-Using-Streamlit-and-Watsonx">Link to live app</a></p>

<p><img src="../assets/images/posts/2024-11-10-chat-with-multidocs-watsonx/1.jpg" alt="GUI image" /></p>

<hr />

<h2 id="what-is-rag">What is RAG?</h2>

<p><strong>Retrieval-Augmented Generation (RAG)</strong> is a technique that combines document retrieval with large language models (LLMs) to generate accurate and context-based responses. It retrieves relevant information from your documents before generating answers, making it highly effective for specialized queries.</p>
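<p>Stripped of the surrounding frameworks, the retrieval step can be illustrated with a toy keyword-overlap ranker. This is a deliberately simplified sketch with made-up documents; real RAG pipelines, including this app, rank chunks with vector embeddings instead:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy retrieval: rank document chunks by word overlap with the query,
# then prepend the best chunk to the prompt sent to the LLM
docs = [
    "Invoices must be submitted within 30 days.",
    "The refund policy allows returns within 14 days.",
    "Employees accrue 1.5 vacation days per month.",
]

def retrieve(query, docs):
    q_words = set(query.lower().split())
    return max(docs, key=lambda d: len(q_words.intersection(set(d.lower().split()))))

query = "How many vacation days do employees get?"
context = retrieve(query, docs)
prompt = f"Answer using this context: {context}\n\nQuestion: {query}"
print(context)  # Employees accrue 1.5 vacation days per month.
</code></pre></div></div>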

<hr />

<h2 id="what-is-watsonxai">What is Watsonx.ai?</h2>

<p><strong>IBM Watsonx.ai</strong> is IBM’s next-generation platform for foundation models and generative AI. It offers pre-trained language models that can be fine-tuned for tasks like document querying, answering questions, and more. In this project, Watsonx.ai acts as the backbone for generating context-aware answers.</p>

<hr />

<h2 id="what-is-langchain">What is LangChain?</h2>

<p><strong>LangChain</strong> is a framework for developing applications powered by LLMs. It simplifies tasks like document retrieval, question answering, and conversation handling by connecting LLMs with tools like embeddings and databases.</p>

<hr />

<h2 id="what-is-streamlit">What is Streamlit?</h2>

<p><strong>Streamlit</strong> is a Python-based framework for building data-driven web apps quickly. It provides an intuitive interface for users to interact with your application, making it an ideal choice for creating this multi-document retrieval tool.</p>

<hr />
<h2 id="features">Features</h2>

<ul>
  <li><strong>File Support</strong>: Supports multiple file formats such as PDFs, Word documents, PowerPoint presentations, CSV, JSON, YAML, HTML, and plain text.</li>
  <li><strong>Watsonx LLM Integration</strong>: Utilize IBM Watsonx’s LLM models for querying and generating answers.</li>
  <li><strong>Embeddings</strong>: Uses <code class="language-plaintext highlighter-rouge">HuggingFace</code> embeddings for document indexing.</li>
  <li><strong>RAG (Retrieval Augmented Generation)</strong>: Combines document-based retrieval with LLMs for accurate responses.</li>
  <li><strong>Streamlit Interface</strong>: Provides an intuitive user experience.</li>
</ul>

<hr />

<h2 id="installation">Installation</h2>

<p>Follow these steps to clone and run the project locally:</p>

<h3 id="prerequisites">Prerequisites</h3>

<ol>
  <li><strong>Python 3.8+</strong> installed on your system.</li>
  <li>Install <code class="language-plaintext highlighter-rouge">pip</code> (Python package manager).</li>
  <li>An IBM Watsonx API key and Project ID.</li>
  <li>Install Git if not already installed.</li>
</ol>

<h3 id="clone-the-repository">Clone the Repository</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/Abd-al-RahmanH/Multi-Doc-Retrieval-Watsonx.git
<span class="nb">cd </span>Multi-Doc-Retrieval-Watsonx
</code></pre></div></div>
<p><img src="../assets/images/posts/2024-11-10-chat-with-multidocs-watsonx/2.jpg" alt="Github cloning" /></p>

<h3 id="install-dependencies">Install Dependencies</h3>

<ol>
  <li>
    <p>Create a virtual environment (optional but recommended):</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> python <span class="nt">-m</span> venv <span class="nb">env
 source env</span>/bin/activate  <span class="c"># On Windows: .\env\Scripts\activate</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>Install required Python packages:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> pip <span class="nb">install</span> <span class="nt">-r</span> requirements.txt
</code></pre></div>    </div>
  </li>
</ol>

<h3 id="set-environment-variables">Set Environment Variables</h3>

<p>Create a <code class="language-plaintext highlighter-rouge">.env</code> file in the project directory with the following keys:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">WATSONX_API_KEY</span><span class="o">=</span>&lt;your_watsonx_api_key&gt;
<span class="nv">WATSONX_PROJECT_ID</span><span class="o">=</span>&lt;your_watsonx_project_id&gt;
</code></pre></div></div>
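<p>At startup, <code class="language-plaintext highlighter-rouge">python-dotenv</code>’s <code class="language-plaintext highlighter-rouge">load_dotenv()</code> copies these entries into the process environment. A small stdlib-only check that the variables are visible to Python (the real app reads them the same way through <code class="language-plaintext highlighter-rouge">os.getenv</code>):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os

# In the app, load_dotenv() has already populated os.environ from .env
api_key = os.getenv("WATSONX_API_KEY")
project_id = os.getenv("WATSONX_PROJECT_ID")

if not api_key or not project_id:
    print("Missing WATSONX_API_KEY or WATSONX_PROJECT_ID - check your .env file")
else:
    print("Credentials loaded")
</code></pre></div></div>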

<h3 id="run-the-app">Run the App</h3>

<ol>
  <li>
    <p>Start the Streamlit app by running:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> streamlit run app.py
</code></pre></div>    </div>
  </li>
  <li>
    <p>Open the URL displayed in your terminal (usually <a href="http://localhost:8501">http://localhost:8501</a>) to access the app.</p>
  </li>
</ol>

<hr />

<h2 id="how-to-use">How to Use</h2>

<ol>
  <li><strong>Upload Documents</strong>: Drag and drop supported files (e.g., PDFs, DOCX, JSON) in the app sidebar.</li>
  <li><strong>Select Model and Parameters</strong>: Choose a Watsonx model and configure settings like output tokens and decoding methods.</li>
  <li><strong>Ask Questions</strong>: Enter queries in the chat input to retrieve answers based on the uploaded document.</li>
</ol>

<p><img src="../assets/images/posts/2024-11-10-chat-with-multidocs-watsonx/3.jpg" alt="How to use" /></p>

<hr />

<h2 id="project-structure">Project Structure</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Multi-Doc-Retrieval-Watsonx/
├── app.py               # Main application file
├── requirements.txt     # Python dependencies
├── README.md            # Project documentation
└── .env                 # Environment variables (not included in repo, create manually)
</code></pre></div></div>

<hr />

<h2 id="dependencies">Dependencies</h2>

<ul>
  <li><strong>Streamlit</strong>: For building the user interface.</li>
  <li><strong>LangChain</strong>: For document retrieval and RAG implementation.</li>
  <li><strong>HuggingFace Transformers</strong>: For embedding and vector representation.</li>
  <li><strong>Watsonx Foundation Models</strong>: For querying and text generation.</li>
  <li><strong>Various Python Libraries</strong>: For file handling, including <code class="language-plaintext highlighter-rouge">pandas</code>, <code class="language-plaintext highlighter-rouge">python-docx</code>, <code class="language-plaintext highlighter-rouge">python-pptx</code>, and more.</li>
</ul>

<hr />

<h2 id="contributing">Contributing</h2>

<p>We welcome contributions! If you’d like to improve this project:</p>

<ol>
  <li>Fork the repository.</li>
  <li>Create a feature branch: <code class="language-plaintext highlighter-rouge">git checkout -b feature-name</code>.</li>
  <li>Commit your changes: <code class="language-plaintext highlighter-rouge">git commit -m 'Add a new feature'</code>.</li>
  <li>Push to the branch: <code class="language-plaintext highlighter-rouge">git push origin feature-name</code>.</li>
  <li>Open a Pull Request.</li>
</ol>

<hr />

<h2 id="more-blogs-and-interesting-projects">More Blogs and Interesting Projects</h2>

<p>For more blogs and interesting projects, visit my personal website: <a href="https://abdulrahmanh.com">https://abdulrahmanh.com</a></p>

<h2 id="license">License</h2>

<p>This project is licensed under the MIT License. See the <a href="LICENSE">LICENSE</a> file for details.</p>

<hr />]]></content><author><name>Abdul Rahman</name><email>mailforrahman197@gmail.com</email></author><summary type="html"><![CDATA[Explore how to build a Streamlit-powered app that uses IBM Watsonx and LangChain for retrieval-augmented generation (RAG) with multiple document types.]]></summary></entry><entry><title type="html">How to Install IBM Watsonx Data 2.0 Developer Edition on Ubuntu EC2 (On-Premises)</title><link href="https://abdulrahmanh.com/blog/Install-Watsonx-Data-on-Ubuntu-EC2" rel="alternate" type="text/html" title="How to Install IBM Watsonx Data 2.0 Developer Edition on Ubuntu EC2 (On-Premises)" /><published>2024-11-05T00:00:00+01:00</published><updated>2024-11-05T00:00:00+01:00</updated><id>https://abdulrahmanh.com/blog/Install-Watsonx-Data-on-Ubuntu-EC2</id><content type="html" xml:base="https://abdulrahmanh.com/blog/Install-Watsonx-Data-on-Ubuntu-EC2"><![CDATA[<p>Setting up <strong>IBM Watsonx Data 2.0 Developer Edition</strong> on an Ubuntu EC2 instance enables you to leverage IBM’s data lakehouse capabilities on the cloud. This guide provides detailed steps, from configuring entitlement to starting the Watsonx Data containers.</p>

<p>For guidance on creating an EC2 instance, check out my previous blog: <a href="https://abdulrahmanh.com/blog/How-to-Create-an-AWS-EC2-Instance">How to Create an AWS EC2 Instance</a>. Make sure to choose a larger instance type (e.g., <strong>t3.xlarge</strong>) and allow <strong>All traffic</strong> in the instance’s security group.</p>

<hr />

<h2 id="prerequisites">Prerequisites</h2>

<p>Ensure that you have:</p>
<ul>
  <li>An <strong>IBM Entitlement Key</strong> for Watsonx Data.</li>
  <li>An Ubuntu EC2 instance in AWS.</li>
</ul>

<h2 id="step-1-set-up-entitlement-key">Step 1: Set Up Entitlement Key</h2>

<ol>
  <li>Log in to your <a href="https://myibm.ibm.com/products-services/containerlibrary">IBM container library</a>.</li>
  <li>Go to <strong>Add New key</strong> and create a new API key for entitlement.</li>
  <li>Store the API key securely, as you’ll need it for the Watsonx Data installation.</li>
</ol>
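
<p>As a small convenience (a sketch, not part of IBM’s instructions), you can keep the key in a file readable only by you and source it when needed, so it never sits in your shell history. The file name <code class="language-plaintext highlighter-rouge">~/.ibm_entitlement</code> is an arbitrary choice:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: store the entitlement key in a private file and load it on demand
key_file="$HOME/.ibm_entitlement"
printf '%s\n' 'export IBM_ENTITLEMENT_KEY=YOUR_KEY_HERE' | tee "$key_file"
chmod 600 "$key_file"

# Later, load it into the current shell:
. "$key_file"
echo "$IBM_ENTITLEMENT_KEY"
</code></pre></div></div>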

<hr />

<h2 id="step-2-install-docker">Step 2: Install Docker</h2>

<p>Watsonx Data requires Docker to manage its containers. Install Docker as follows:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Update package information</span>
<span class="nb">sudo </span>apt update

<span class="c"># Install Docker</span>
<span class="nb">sudo </span>apt <span class="nb">install</span> <span class="nt">-y</span> docker.io

<span class="c"># Start Docker service</span>
<span class="nb">sudo </span>systemctl start docker
<span class="nb">sudo </span>systemctl <span class="nb">enable </span>docker
</code></pre></div></div>

<p>Verify Docker installation:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker <span class="nt">--version</span>
</code></pre></div></div>

<hr />

<h2 id="step-3-set-up-installation-directory-and-environment-variables">Step 3: Set Up Installation Directory and Environment Variables</h2>

<p>Switch to the root user, create an installation directory, and set the necessary environment variables:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>su -
<span class="nb">mkdir </span>watsonxdata
<span class="nb">cd </span>watsonxdata

<span class="c"># Set environment variables</span>
<span class="nb">export </span><span class="nv">LH_ROOT_DIR</span><span class="o">=</span>&lt;NEW DIRECTORY-watsonxdata&gt;
<span class="nb">export </span><span class="nv">LH_RELEASE_TAG</span><span class="o">=</span>latest
<span class="nb">export </span><span class="nv">IBM_LH_TOOLBOX</span><span class="o">=</span>cp.icr.io/cpopen/watsonx-data/ibm-lakehouse-toolbox:<span class="nv">$LH_RELEASE_TAG</span>
<span class="nb">export </span><span class="nv">LH_REGISTRY</span><span class="o">=</span>cp.icr.io/cp/watsonx-data
<span class="nb">export </span><span class="nv">PROD_USER</span><span class="o">=</span><span class="nb">cp
export </span><span class="nv">IBM_ENTITLEMENT_KEY</span><span class="o">=</span>&lt;YOUR_IBM_ENTITLEMENT_KEY&gt;
<span class="nb">export </span><span class="nv">IBM_ICR_IO</span><span class="o">=</span>cp.icr.io
</code></pre></div></div>
<p>Replace <code class="language-plaintext highlighter-rouge">&lt;YOUR_IBM_ENTITLEMENT_KEY&gt;</code> with the key obtained in Step 1, and <code class="language-plaintext highlighter-rouge">&lt;NEW DIRECTORY-watsonxdata&gt;</code> with the full path of the directory you just created (here, <code class="language-plaintext highlighter-rouge">/root/watsonxdata</code>).</p>

<p>For Docker, set <code class="language-plaintext highlighter-rouge">DOCKER_EXE</code> as follows:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">DOCKER_EXE</span><span class="o">=</span>docker
</code></pre></div></div>
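
<p>Before moving on, it can help to confirm nothing was mistyped. The helper below is a sketch (not part of the official setup) that flags any variable that is still empty; it relies on bash’s indirect expansion <code class="language-plaintext highlighter-rouge">${!v}</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: report any required variable that is unset or empty
check_vars() {
  local missing=0
  for v in "$@"; do
    if [ -z "${!v}" ]; then
      echo "Missing: $v"
      missing=1
    fi
  done
  return "$missing"
}

if check_vars LH_ROOT_DIR LH_RELEASE_TAG LH_REGISTRY IBM_ENTITLEMENT_KEY DOCKER_EXE; then
  echo "All required variables are set"
fi
</code></pre></div></div>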
<hr />

<h2 id="step-4-download-and-extract-watsonx-data-developer-package">Step 4: Download and Extract Watsonx Data Developer Package</h2>

<ol>
  <li>
    <p>Pull the Watsonx Data developer package and copy it to the host system:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$DOCKER_EXE</span> pull <span class="nv">$IBM_LH_TOOLBOX</span>
<span class="nb">id</span><span class="o">=</span><span class="si">$(</span><span class="nv">$DOCKER_EXE</span> create <span class="nv">$IBM_LH_TOOLBOX</span><span class="si">)</span>
<span class="nv">$DOCKER_EXE</span> <span class="nb">cp</span> <span class="nv">$id</span>:/opt - <span class="o">&gt;</span> /tmp/pkg.tar
<span class="nv">$DOCKER_EXE</span> <span class="nb">rm</span> <span class="nv">$id</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>Extract the package and verify the checksum:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">tar</span> <span class="nt">-xf</span> /tmp/pkg.tar <span class="nt">-C</span> /tmp
<span class="nb">cat</span> /tmp/opt/bom.txt
<span class="nb">cksum</span> /tmp/opt/<span class="k">*</span>/<span class="k">*</span>
<span class="nb">tar</span> <span class="nt">-xf</span> /tmp/opt/dev/ibm-lh-dev-<span class="k">*</span>.tgz <span class="nt">-C</span> <span class="nv">$LH_ROOT_DIR</span>
</code></pre></div>    </div>
  </li>
</ol>
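
<p>The <code class="language-plaintext highlighter-rouge">cksum</code> command prints a CRC checksum and a byte count for each file, which you can compare against the values recorded in <code class="language-plaintext highlighter-rouge">bom.txt</code>. A tiny illustration of the output format, using a throwaway file rather than the real package:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># cksum output format: CRC, size in bytes, filename
printf 'watsonx\n' | tee /tmp/cksum-demo.txt
cksum /tmp/cksum-demo.txt   # second field is the size: 8 bytes here
</code></pre></div></div>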

<hr />

<h2 id="step-5-authenticate-with-ibm-registry">Step 5: Authenticate with IBM Registry</h2>

<p>Log in to the IBM registry to authenticate and pull additional resources. (Tip: piping the key via <code class="language-plaintext highlighter-rouge">--password-stdin</code> instead of passing <code class="language-plaintext highlighter-rouge">--password</code> on the command line keeps it out of your shell history.)</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$DOCKER_EXE</span> login <span class="k">${</span><span class="nv">IBM_ICR_IO</span><span class="k">}</span> <span class="nt">--username</span><span class="o">=</span><span class="k">${</span><span class="nv">PROD_USER</span><span class="k">}</span> <span class="nt">--password</span><span class="o">=</span><span class="k">${</span><span class="nv">IBM_ENTITLEMENT_KEY</span><span class="k">}</span>
</code></pre></div></div>
<hr />

<h2 id="step-6-run-setup-script">Step 6: Run Setup Script</h2>

<p>Run the setup script to initialize the Watsonx Data Developer environment. You can set a custom password with the <code class="language-plaintext highlighter-rouge">--password</code> option; otherwise, the default password is <code class="language-plaintext highlighter-rouge">password</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$LH_ROOT_DIR</span>/ibm-lh-dev/bin/setup <span class="nt">--license_acceptance</span><span class="o">=</span>y <span class="nt">--runtime</span><span class="o">=</span><span class="nv">$DOCKER_EXE</span>
</code></pre></div></div>

<hr />

<h2 id="step-7-start-watsonx-data-containers">Step 7: Start Watsonx Data Containers</h2>

<p>Start the Watsonx Data containers using the following command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$LH_ROOT_DIR</span>/ibm-lh-dev/bin/start
</code></pre></div></div>

<hr />

<h2 id="step-8-access-watsonx-data-console">Step 8: Access Watsonx Data Console</h2>

<ol>
  <li>Open the Watsonx Data console by visiting <code class="language-plaintext highlighter-rouge">https://&lt;YOUR_EC2_PUBLIC_IP&gt;:9443</code> (or the port specified during setup).</li>
  <li>Log in with the username <code class="language-plaintext highlighter-rouge">ibmlhadmin</code> and the password you set during setup (default is <code class="language-plaintext highlighter-rouge">password</code>).</li>
</ol>

<hr />

<h2 id="managing-watsonx-data">Managing Watsonx Data</h2>

<h3 id="check-container-status">Check Container Status</h3>

<p>To view the status of all containers:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$LH_ROOT_DIR</span>/ibm-lh-dev/bin/status <span class="nt">--all</span>
</code></pre></div></div>

<h3 id="stop-all-containers">Stop All Containers</h3>

<p>To stop all containers:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$LH_ROOT_DIR</span>/ibm-lh-dev/bin/stop
</code></pre></div></div>

<h3 id="startstop-a-specific-container">Start/Stop a Specific Container</h3>

<p>To manage individual containers, use <code class="language-plaintext highlighter-rouge">stop_service</code> and <code class="language-plaintext highlighter-rouge">start_service</code> commands. Replace <code class="language-plaintext highlighter-rouge">&lt;container_name&gt;</code> with the name from the <code class="language-plaintext highlighter-rouge">docker ps</code> output:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$LH_ROOT_DIR</span>/ibm-lh-dev/bin/stop_service &lt;container_name&gt;
<span class="nv">$LH_ROOT_DIR</span>/ibm-lh-dev/bin/start_service &lt;container_name&gt;
</code></pre></div></div>
<hr />

<h2 id="step-9-log-in-to-watsonx-data">Step 9: Log In to Watsonx Data</h2>

<p>Once Watsonx Data is up and running, access the login page via your browser at <code class="language-plaintext highlighter-rouge">https://&lt;YOUR_EC2_PUBLIC_IP&gt;:9443</code>.</p>

<p><img src="../assets/images/posts/2024-11-05-Install-Watsonx-Data-on-Ubuntu-EC2/1.jpg" alt="Login Page" /></p>
<blockquote>
  <p><strong>Login Page</strong>: Enter your username and password to access the Watsonx Data console.</p>
</blockquote>

<p>Once you log in, you will see the dashboard:</p>

<p><img src="../assets/images/posts/2024-11-05-Install-Watsonx-Data-on-Ubuntu-EC2/5.jpg" alt="Dashboard" /></p>

<hr />

<h2 id="step-10-infrastructure-manager">Step 10: Infrastructure Manager</h2>

<p>After logging in, navigate to the <strong>Infrastructure Manager</strong> to monitor and manage system resources and services.</p>

<p><img src="../assets/images/posts/2024-11-05-Install-Watsonx-Data-on-Ubuntu-EC2/2.jpg" alt="Infrastructure Manager" /></p>
<blockquote>
  <p><strong>Infrastructure Manager</strong>: View and control Watsonx Data’s underlying infrastructure and resource allocations.</p>
</blockquote>

<hr />

<h2 id="step-11-explore-the-query-workspace">Step 11: Explore the Query Workspace</h2>

<p>Use the <strong>Query Workspace</strong> to write and run SQL queries directly within Watsonx Data.</p>

<p><img src="../assets/images/posts/2024-11-05-Install-Watsonx-Data-on-Ubuntu-EC2/3.jpg" alt="Query Workspace" /></p>
<blockquote>
  <p><strong>Query Workspace</strong>: Execute SQL queries and analyze data with Watsonx Data’s SQL editor.</p>
</blockquote>

<hr />

<h2 id="step-12-access-query-history">Step 12: Access Query History</h2>

<p>The <strong>Query History</strong> section lets you review past queries, making it easy to track, repeat, or debug previous SQL commands.</p>

<p><img src="../assets/images/posts/2024-11-05-Install-Watsonx-Data-on-Ubuntu-EC2/4.jpg" alt="Query History" /></p>
<blockquote>
  <p><strong>Query History</strong>: Review and manage past queries for efficient workflow management.</p>
</blockquote>

<hr />

<p>Congratulations! You have successfully installed and configured <strong>IBM Watsonx Data 2.0 Developer Edition</strong> on your Ubuntu EC2 instance.</p>

<p><strong>Resources:</strong></p>
<ul>
  <li><a href="https://www.ibm.com/docs/en/watsonx">IBM Watsonx Data Documentation</a></li>
  <li><a href="https://docs.docker.com/">Docker Documentation</a></li>
</ul>

<p>For more insights, check out my <a href="https://abdulrahmanh.com/blog">Blog Section</a>.</p>]]></content><author><name>Abdul Rahman</name><email>mailforrahman197@gmail.com</email></author><category term="IBM" /><category term="Watsonx" /><category term="Ubuntu" /><category term="EC2" /><category term="IBM Watsonx" /><category term="Data" /><category term="Ubuntu" /><category term="EC2" /><category term="Installation" /><summary type="html"><![CDATA[A step-by-step guide for setting up IBM Watsonx Data Developer Edition on an Ubuntu EC2 instance, with Docker installation, environment setup, and starting the application.]]></summary></entry></feed>