Tuesday 17 July 2018

When to use RDD, Dataframe and Dataset

When to use RDDs?


  • you want low-level transformation and actions and control on your dataset;
  • your data is unstructured, such as media streams or streams of text;
  • you want to manipulate your data with functional programming constructs than domain specific expressions
  • you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column; and
  • you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.

When should I use DataFrames or Datasets?


  • If you want rich semantics, high-level abstractions, and domain specific APIs, use DataFrame or Dataset.
  • If your processing demands high-level expressions, filters, maps, aggregation, averages, sum, SQL queries, columnar access and use of lambda functions on semi-structured data, use DataFrame or Dataset.
  • If you want higher degree of type-safety at compile time, want typed JVM objects, take advantage of Catalyst optimization, and benefit from Tungsten’s efficient code generation, use Dataset.
  • If you want unification and simplification of APIs across Spark Libraries, use DataFrame or Dataset.
  • If you are a R user, use DataFrames.
  • If you are a Python user, use DataFrames and resort back to RDDs if you need more control.
Reference:

 A good reference for Shell scripting  https://linuxcommand.org/lc3_writing_shell_scripts.php