Spark Installation
Apache Spark is an open-source distributed computing framework for processing large datasets. It can run on its own standalone cluster manager or on Hadoop YARN, and it is compatible with other Hadoop ecosystem tools. Spark is fast, scalable, and offers low-latency, in-memory processing. In this article, we will discuss how to install Apache Spark using the command line interface (CLI).
Before proceeding with installation, ensure that you have the following prerequisites:
- Java Development Kit (JDK) 8 or above installed on your system (you can verify this as shown below).
- A Hadoop cluster or a single-node Hadoop setup (only needed if you plan to run Spark on YARN).
- Optionally, an IDE such as Eclipse or IntelliJ IDEA for application development.
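For example, you can confirm from the terminal that a suitable JDK is on your PATH:
java -version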
Step 1: Download Apache Spark
The first step is to download Apache Spark from the official website. Go to the following URL and choose a Spark release together with a package type (pre-built for your Hadoop version): https://spark.apache.org/downloads.html.
For example, on an Ubuntu machine, you can download the tar.gz file for Apache Spark as follows:
wget https://archive.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
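Optionally, verify the integrity of the download against the SHA-512 checksum published alongside the release on the download page:
sha512sum spark-2.4.7-bin-hadoop2.7.tgz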
Step 2: Extract the downloaded tar.gz file
Once you have downloaded the tar.gz file, extract it using the following command:
tar -xvzf spark-2.4.7-bin-hadoop2.7.tgz
This will extract the contents of the tar.gz file into a directory named "spark-2.4.7-bin-hadoop2.7" in your current working directory.
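You can list the extracted directory to confirm its layout. The key subdirectories are bin (user-facing launchers such as spark-shell, pyspark, and spark-submit), sbin (cluster administration scripts), and conf (configuration templates):
ls spark-2.4.7-bin-hadoop2.7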
Step 3: Configure Apache Spark Environment Variables
Before we can run Apache Spark, we need to set some environment variables. Open the terminal and execute the following command to add the Apache Spark bin directory to your PATH variable:
export PATH=$PATH:/path/to/spark-2.4.7-bin-hadoop2.7/bin
Replace "/path/to/spark-2.4.7-bin-hadoop2.7" with the actual path to your Apache Spark installation directory.
You can also set the environment variable permanently by adding the same export line to your .bashrc or .zshrc file:
export PATH=$PATH:/path/to/spark-2.4.7-bin-hadoop2.7/bin
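Many Spark tools also expect a SPARK_HOME variable pointing at the installation directory. A typical .bashrc snippet (the path here is illustrative) looks like this:
export SPARK_HOME=/path/to/spark-2.4.7-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
After editing the file, run source ~/.bashrc (or open a new terminal) for the change to take effect.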
Step 4: Run Apache Spark
Now that we have set the environment variables, we can run Apache Spark. Open another terminal window and execute the following command to start the Spark shell:
spark-shell
This will launch the Spark shell, a Scala REPL where you can write and execute Spark code interactively. You can also use the other entry points in the bin directory, such as pyspark (Python) or spark-sql.
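As a quick sanity check, you can pipe a one-line Scala expression into the shell; the predefined SparkContext sc should print the sum 5050.0:
echo 'println(sc.parallelize(1 to 100).sum())' | spark-shell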
Step 5: Configure Apache Spark Properties
By default, Apache Spark runs in local mode when you start the Spark shell or any other Spark application. If you want to run it on a cluster instead, you need to set some properties in your spark-defaults.conf file.
Open the following file using a text editor:
/path/to/spark-2.4.7-bin-hadoop2.7/conf/spark-defaults.conf
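If the file does not exist yet, copy it from the template that ships with the distribution (run from the Spark installation directory):
cp conf/spark-defaults.conf.template conf/spark-defaults.conf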
Add the following lines to the end of the file:
spark.master spark://myCluster:7077
spark.driver.memory 1g
spark.executor.memory 2g
Replace "myCluster" with the name of your Hadoop cluster and adjust the memory settings as per your requirements.
Save the file and exit.
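Note that the same properties can also be supplied per invocation on the command line, without editing the file:
spark-shell --master spark://myCluster:7077 --driver-memory 1g --executor-memory 2g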
Step 6: Start Apache Spark Cluster Manager
Before running any Spark application on a cluster, the cluster manager must be running. If you are using Spark's standalone cluster manager, start it from the Spark installation directory:
./sbin/start-all.sh
If you are using Hadoop YARN, start the HDFS and YARN daemons from your Hadoop installation instead:
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
These commands start the NameNode and DataNode daemons and the ResourceManager and NodeManager daemons, respectively, on your Hadoop cluster.
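You can verify that the daemons came up with the jps utility that ships with the JDK, which lists the running JVM processes on the machine:
jps
On a YARN setup you should see NameNode, ResourceManager, and their worker daemons in the output; on a standalone setup you should see Master and Worker.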
Step 7: Run Apache Spark Client
Now that we have set up the Spark properties and started the Spark cluster manager, we can run any Spark client on our Hadoop cluster. Open another terminal window and execute the following command to start a PySpark shell:
./bin/pyspark
This will launch a PySpark shell where you can write and execute PySpark code. You can also use any other Spark entry point, such as spark-shell or spark-sql.
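As an illustrative end-to-end check, you can also submit a small PySpark script non-interactively with spark-submit (the file path and application name here are arbitrary):
cat > /tmp/sanity.py <<'EOF'
from pyspark.sql import SparkSession

# create (or reuse) a session; the master URL comes from spark-defaults.conf
spark = SparkSession.builder.appName("sanity-check").getOrCreate()
print(spark.range(100).count())  # should print 100
spark.stop()
EOF
./bin/spark-submit /tmp/sanity.py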
Conclusion
In this article, we have discussed how to install Apache Spark using the command line interface (CLI). We covered the prerequisites for installation, downloaded and extracted the tar.gz file, configured environment variables, started the Apache Spark cluster manager, and ran a PySpark shell. With minor adjustments to version numbers and paths, the same steps apply to other Apache Spark releases on most Unix-like operating systems.