Thursday 22 February 2018

Convert single column into multiple in Spark

Consider the sample JSON below, which has an Address column containing nested fields (Address1, Address2, City).
Let's expand that column into multiple columns dynamically.

{"Name":"Hanu","Address":{"Address1":"11213","Address2":"N TEST BLVD","City":"MIAMI"},"State":"FL"}

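The code below expands Address into one column per nested field:
// In spark-shell, spark.implicits._ is already imported, so 'Name and $"..." resolve to columns
val sample = spark.read.json("/myHome/sample.json")
sample.select('Name, $"Address.*", 'State).show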
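For readers working in PySpark, the same expansion can be sketched as follows. This is a minimal sketch, not the original author's code: the helper name expand_address and the app name are made up for illustration, and the file path is the one used in the Scala snippet. It relies on DataFrame.select accepting the "Address.*" string to expand a struct column.

```python
def expand_address(df):
    """Return df with the Address struct expanded into top-level columns.

    "Address.*" selects every nested field of the Address struct,
    so no field names have to be listed by hand.
    """
    return df.select("Name", "Address.*", "State")


def main():
    # Imported here so the helper above can be reused without starting Spark.
    from pyspark.sql import SparkSession

    # Assumption: a local session is enough for this small sample file.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("flatten-address")  # hypothetical app name
             .getOrCreate())
    sample = spark.read.json("/myHome/sample.json")
    expand_address(sample).show()
    spark.stop()
```

Calling main() should print a table whose columns are Name, the nested Address fields, and State.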



Friday 2 February 2018

Pyspark Setup with PyCharm Community edition on Windows

 Spark setup:

  1. Install Java and Python
  2. Download Spark and extract it to a directory, for example C:\spark-2.3.1-bin-hadoop2.7
  3. Clone the winutils Git repository, for example to C:\winutils
  4. Set the system environment variables as below
    HADOOP_HOME=C:\winutils\hadoop-2.7.1
    SPARK_HOME=C:\spark-2.3.1-bin-hadoop2.7
    Verify the environment variables from a command prompt → echo %SPARK_HOME%
    Note: restart the cmd window (or the computer) if required for these variables to take effect
  5. Create a temporary directory for hive. Let’s say the path is C:\tmp\hive
  6. Change the permissions on the C:\tmp\hive directory to Full control for Everyone (restrict to a specific user if required)
  7. Now run the pyspark command (bin\pyspark) in a cmd window from the SPARK_HOME directory.
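Before launching pyspark, the variables from step 4 can be sanity-checked with a short script. The following is only a sketch: the function name check_spark_env is hypothetical, and it assumes winutils.exe lives in the bin folder under HADOOP_HOME, which is how the winutils repository is usually laid out.

```python
import os


def check_spark_env(env):
    """Return a list of problems found in the given environment mapping."""
    problems = []
    for name in ("SPARK_HOME", "HADOOP_HOME"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} points to a missing directory: {value}")

    # Assumption: winutils.exe sits in %HADOOP_HOME%\bin
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home and os.path.isdir(hadoop_home):
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not os.path.isfile(winutils):
            problems.append(f"winutils.exe not found at {winutils}")
    return problems


# Report on the current process environment.
for message in check_spark_env(os.environ) or ["environment looks OK"]:
    print(message)
```

If the script reports a variable as not set right after you defined it, open a new cmd window and run it again.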

Pycharm configuration:

  1. Install the pyspark package in the project interpreter (File → Settings → Project Interpreter → click +, search for pyspark, and click Install Package)



  2. Create a new project called spark-test and create a file called test.py in the project



  3. Write a simple script that reads SPARK_HOME\README.md and displays the lines containing ‘Spark’



  4. Add pyspark libraries to the project

    File→ Settings→ Search for Project Structure→ click on Add Content Root→ select SPARK_HOME\python\lib




  5. Run configuration: 

    Run→ Edit Configuration→ click on + button to add new config → Select Python and configure as below




  6. Execute the program. The output should list the lines of README.md that contain ‘Spark’.
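The test.py script from step 3 might look like the sketch below. It is an illustration rather than the original script: the helper readme_path and the app name are assumptions, and it expects SPARK_HOME to be set as in the Spark setup section.

```python
import os


def readme_path(spark_home):
    """Build the path to README.md inside the Spark installation."""
    return os.path.join(spark_home, "README.md")


def main():
    # Imported here so the path helper works even without pyspark installed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("spark-test")  # matches the project name from step 2
             .getOrCreate())

    # spark.read.text yields one row per line, in a single column named "value".
    lines = spark.read.text(readme_path(os.environ["SPARK_HOME"]))

    # Keep only the lines that mention Spark and print them in full.
    lines.filter(lines.value.contains("Spark")).show(truncate=False)
    spark.stop()
```

Add a call to main() at the bottom of test.py before running it from PyCharm.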


 A good reference for Shell scripting  https://linuxcommand.org/lc3_writing_shell_scripts.php