Spark setup:
- Install Java and Python
- Download Spark from the Apache Spark downloads page and extract it to a directory, say C:\spark-2.3.1-bin-hadoop2.7
- Clone the winutils Git repository. Let’s say it is at C:\winutils
- Set the system environment variables as below
HADOOP_HOME=C:\winutils\hadoop-2.7.1
SPARK_HOME=C:\spark-2.3.1-bin-hadoop2.7
Verify the environment variables from a command prompt → echo %SPARK_HOME%
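The same check can be done from Python, which is handy later for debugging PyCharm runs. A quick sanity-check sketch (the expected values are the paths assumed above):

    import os

    # Both variables must be visible to the Python process that launches Spark
    print(os.environ.get("SPARK_HOME"))   # expect C:\spark-2.3.1-bin-hadoop2.7
    print(os.environ.get("HADOOP_HOME"))  # expect C:\winutils\hadoop-2.7.1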
**Restart the cmd window/computer if required for these variables to take effect.
- Create a temporary directory for Hive. Let’s say the path is C:\tmp\hive
- Change the permissions on the C:\tmp\hive directory to give Everyone Full Control (restrict to a specific user if required), for example with the winutils tool: C:\winutils\hadoop-2.7.1\bin\winutils.exe chmod 777 C:\tmp\hive
- Now run bin\pyspark in a cmd window from the SPARK_HOME directory; a quick smoke test is shown below.
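Once the shell starts, a minimal smoke test (spark is the SparkSession that the pyspark shell creates automatically):

    >>> spark.range(5).count()   # run a trivial job end to end
    5

If this returns 5 without errors, Spark, Hadoop winutils, and the Hive scratch directory are all wired up correctly.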
PyCharm configuration:
- Install the pyspark package (File→ Settings→ Project Interpreter→ click on + and search for pyspark→ Install Package)
- Create a new project called spark-test and create a file called test.py in the project
- Write a simple script to read SPARK_HOME\README.md and display the lines that contain ‘Spark’ (a sketch is shown below)
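A minimal sketch of such a test.py, assuming the install path C:\spark-2.3.1-bin-hadoop2.7 from above (the app name spark-test is just a label):

    from pyspark.sql import SparkSession

    # Start a local SparkSession; local[*] uses all available cores
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName("spark-test") \
        .getOrCreate()

    # Read README.md as a DataFrame with a single 'value' string column
    lines = spark.read.text(r"C:\spark-2.3.1-bin-hadoop2.7\README.md")

    # Keep only the lines that mention 'Spark' and print them
    lines.filter(lines.value.contains("Spark")).show(truncate=False)

    spark.stop()

Note that the pyspark import only resolves once the pyspark libraries are added to the project, as described in the next step.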
- Add pyspark libraries to the project
File→ Settings→ Search for Project Structure→ click on Add Content Root→ select SPARK_HOME\python\lib
- Run configuration:
Run→ Edit Configurations→ click on the + button to add a new configuration→ select Python and set the Script path to test.py
Execute the program. The lines of README.md that contain ‘Spark’ should be printed to the console.