Happy Learning: July 2018

Tuesday, 17 July 2018

Use RDDs when:
you want low-level transformations and actions and control over your dataset;
your data is unstructured, such as media streams or streams of text;
you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
you don’t care about imposing a schema, such as a columnar format, while processing or accessing data attributes by name or column; and
you can forgo some of the optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.

If you want rich semantics, high-level abstractions, and domain-specific APIs, use DataFrame or Dataset.
If your processing demands high-level expressions, filters, maps, aggregation, averages, sum, SQL queries, columnar access and use of lambda functions on semi-structured data, use DataFrame or Dataset.
If you want a higher degree of type safety at compile time, want typed JVM objects, and want to benefit from Catalyst optimization and Tungsten’s efficient code generation, use Dataset.
If you want unification and simplification of APIs across Spark libraries, use DataFrame or Dataset.
If you are an R user, use DataFrames.
If you are a Python user, use DataFrames and fall back to RDDs if you need more control.

Reference: