Friday 2 February 2018

Pyspark Setup with PyCharm Community edition on Windows

Spark setup:

  1. Install Java and Python
  2. Download Spark from the Apache Spark downloads page (https://spark.apache.org/downloads.html) and extract it to a directory, e.g. C:\spark-2.3.1-bin-hadoop2.7
  3. Clone the winutils Git repository (e.g. https://github.com/steveloughran/winutils). Let’s say it is at C:\winutils
  4. Set the system environment variables as below
    HADOOP_HOME=C:\winutils\hadoop-2.7.1
    SPARK_HOME=C:\spark-2.3.1-bin-hadoop2.7
    Verify the environment variables from a command prompt→ echo %SPARK_HOME%
    Note: restart the cmd window (or the computer) if required for these variables to take effect
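    For example, the variables can be set from an elevated cmd window with setx (/M writes them system-wide; the values assume the paths above):
      setx HADOOP_HOME "C:\winutils\hadoop-2.7.1" /M
      setx SPARK_HOME "C:\spark-2.3.1-bin-hadoop2.7" /M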
  5. Create a temporary directory for hive. Let’s say the path is C:\tmp\hive
  6. Change the permissions on the C:\tmp\hive directory to give Everyone Full Control (restrict to a specific user if required)
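    For example, from a cmd window, using the winutils utility cloned earlier (the path assumes the HADOOP_HOME above):
      C:\winutils\hadoop-2.7.1\bin\winutils.exe chmod 777 C:\tmp\hive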
  7. Now run the pyspark command in a cmd window from the SPARK_HOME directory, for example:
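      C:\spark-2.3.1-bin-hadoop2.7> bin\pyspark
    If the setup is correct, the shell starts with a >>> prompt and a SparkSession available as ‘spark’.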

PyCharm configuration:

  1. Install the pyspark interpreter (File→ Settings→ Project Interpreter→ click on +, search for pyspark and click Install Package)
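    Alternatively, the same package can be installed from a cmd window with pip:
      pip install pyspark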

  2. Create a new project called spark-test and create a file called test.py in the project

  3. Write a simple script to read SPARK_HOME\README.md and display the records containing ‘Spark’
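    A minimal sketch of such a script (the README path assumes the SPARK_HOME set earlier):

      # test.py - print the lines of README.md that contain 'Spark'
      from pyspark.sql import SparkSession

      # Start a local SparkSession for the test
      spark = SparkSession.builder \
          .appName("spark-test") \
          .master("local[*]") \
          .getOrCreate()

      # Each line of the file becomes a row in a single 'value' column
      lines = spark.read.text("C:/spark-2.3.1-bin-hadoop2.7/README.md")

      # Keep only the lines mentioning 'Spark' and display them in full
      lines.filter(lines.value.contains("Spark")).show(truncate=False)

      spark.stop()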

  4. Add pyspark libraries to the project

    File→ Settings→ Search for Project Structure→ click on Add Content Root→ select SPARK_HOME\python\lib

  5. Run configuration: 

    Run→ Edit Configurations→ click on the + button to add a new config→ select Python and configure as below
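    A minimal sketch of the configuration (field names vary slightly by PyCharm version): point Script path at test.py and, if the variables from the Spark setup are not visible to PyCharm, repeat them under Environment variables:
      SPARK_HOME=C:\spark-2.3.1-bin-hadoop2.7
      HADOOP_HOME=C:\winutils\hadoop-2.7.1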

  6. Execute the program. You should see results like the sample below
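    With the script above, the matching lines come back as a one-column table (exact rows depend on the README shipped with your Spark version):
      +--------------+
      |value         |
      +--------------+
      |# Apache Spark|
      |...           |
      +--------------+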
