Friday 2 February 2018

Pyspark Setup with PyCharm Community edition on Windows

Spark setup:

  1. Install Java and Python
  2. Download Spark from the Apache Spark downloads page (https://spark.apache.org/downloads.html) and extract it to a directory, e.g. C:\spark-2.3.1-bin-hadoop2.7
  3. Clone the winutils Git repository (e.g. https://github.com/steveloughran/winutils). Let’s say it is at C:\winutils
  4. Set the system environment variables as below
    HADOOP_HOME=C:\winutils\hadoop-2.7.1
    SPARK_HOME=C:\spark-2.3.1-bin-hadoop2.7
    Verify the environment variables from a command prompt→ echo %SPARK_HOME%
    Note: restart the cmd window (or the computer) if required for these variables to take effect
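    For example, the variables can be set from an elevated cmd window with setx (/M writes them system-wide; the values assume the paths above):
      setx HADOOP_HOME "C:\winutils\hadoop-2.7.1" /M
      setx SPARK_HOME "C:\spark-2.3.1-bin-hadoop2.7" /M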
  5. Create a temporary directory for hive. Let’s say the path is C:\tmp\hive
  6. Change the permissions on the C:\tmp\hive directory to give Everyone Full Control (restrict to a specific user if required)
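    For example, from a cmd window, using the winutils utility cloned earlier (the path assumes the HADOOP_HOME above):
      C:\winutils\hadoop-2.7.1\bin\winutils.exe chmod 777 C:\tmp\hive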
  7. Now run the pyspark command in a cmd window from the SPARK_HOME directory, for example:
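      C:\spark-2.3.1-bin-hadoop2.7> bin\pyspark
    If the setup is correct, the shell starts with a >>> prompt and a SparkSession available as ‘spark’.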

PyCharm configuration:

  1. Install the pyspark interpreter (File→ Settings→ Project Interpreter→ click on +, search for pyspark and click Install Package)
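    Alternatively, the same package can be installed from a cmd window with pip:
      pip install pyspark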

  2. Create a new project called spark-test and create a file called test.py in the project

  3. Write a simple script to read SPARK_HOME\README.md and display the records containing ‘Spark’
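    A minimal sketch of such a script (the README path assumes the SPARK_HOME set earlier):

      # test.py - print the lines of README.md that contain 'Spark'
      from pyspark.sql import SparkSession

      # Start a local SparkSession for the test
      spark = SparkSession.builder \
          .appName("spark-test") \
          .master("local[*]") \
          .getOrCreate()

      # Each line of the file becomes a row in a single 'value' column
      lines = spark.read.text("C:/spark-2.3.1-bin-hadoop2.7/README.md")

      # Keep only the lines mentioning 'Spark' and display them in full
      lines.filter(lines.value.contains("Spark")).show(truncate=False)

      spark.stop()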

  4. Add pyspark libraries to the project

    File→ Settings→ Search for Project Structure→ click on Add Content Root→ select SPARK_HOME\python\lib

  5. Run configuration: 

    Run→ Edit Configurations→ click on the + button to add a new config→ select Python and configure as below
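    A minimal sketch of the configuration (field names vary slightly by PyCharm version): point Script path at test.py and, if the variables from the Spark setup are not visible to PyCharm, repeat them under Environment variables:
      SPARK_HOME=C:\spark-2.3.1-bin-hadoop2.7
      HADOOP_HOME=C:\winutils\hadoop-2.7.1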

  6. Execute the program. You should see results like the sample below
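    With the script above, the matching lines come back as a one-column table (exact rows depend on the README shipped with your Spark version):
      +--------------+
      |value         |
      +--------------+
      |# Apache Spark|
      |...           |
      +--------------+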
