Tuesday 5 November 2019

Spark Notes

    1. Spark is a distributed, (mostly) in-memory computing framework/processing engine designed for batch and streaming data, featuring SQL queries, graph processing and machine learning.
    2. MapReduce is a 2-stage framework. Spark is a multi-stage framework operating on a DAG.
    3. Why are RDDs immutable? --> to avoid modifying the source/raw data, and multiple threads can't change the data. An RDD can be recreated at any time, which helps with failures, caching, sharing and replication.
    4. Spark deployment modes - Local, client and Cluster.
      • Client vs Cluster - in client mode, the driver program runs outside the YARN cluster (i.e. at the client).
      • In cluster mode, the driver program runs alongside the Application Master inside the YARN cluster.
    5. RDD vs DataFrame vs DataSet:
      1. Use RDDs for
        1. low-level transformations/actions
        2. when a schema is not necessary, e.g. you don't need a columnar format or to access fields by name
        3. when data is unstructured, like media streams or streamed text (Spark Streaming is available now)

      2. DataFrame (DF) vs DataSet (DS):
        1. A DF is a distributed collection of objects of type Row; a DS lets users assign a Java/Scala class to the records inside a DF.
        2. DF is not type-safe; DS is type-safe (compile-time error checks).
        3. DF is available in Scala, Java, Python and R; DS only in Scala and Java.
        4. Both leverage Tungsten's fast in-memory encoding.
        5. DS encoders are highly optimized and use runtime code generation to build custom serde; as a result they are faster than Java/Kryo serialization.
        6. Encoded DS objects are comparatively smaller, which improves network transfer speeds.
        7. DS gives a single interface usable in both Java and Scala.
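    A minimal sketch of the same filter written against all three APIs, to make the type-safety difference concrete (the Person case class, app name and local master are assumptions for illustration):

      import org.apache.spark.sql.SparkSession

      case class Person(name: String, age: Int)   // hypothetical record type

      val spark = SparkSession.builder().appName("rdd-df-ds").master("local[*]").getOrCreate()
      import spark.implicits._

      val people = Seq(Person("a", 30), Person("b", 15))

      // RDD: low-level, no schema, no Catalyst optimization
      val adultsRdd = spark.sparkContext.parallelize(people).filter(_.age >= 18)

      // DataFrame: Dataset[Row]; a typo in the column name only fails at runtime
      val adultsDf = people.toDF().filter($"age" >= 18)

      // Dataset: typed; a typo in the field name fails at compile time
      val adultsDs = people.toDS().filter(_.age >= 18)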
    1. Rdd.toDebugString to get RDD lineage graph
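    For example (continuing the sketch above), the lineage of a small chain of transformations:

      val nums = spark.sparkContext.parallelize(1 to 100)
      // toDebugString prints each RDD in the dependency chain, e.g.
      // MapPartitionsRDD <- MapPartitionsRDD <- ParallelCollectionRDD
      println(nums.map(_ * 2).filter(_ % 3 == 0).toDebugString)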
    1. Tuning -
      1. Data Serialization: use Kryo. But it doesn't support all serializable types and requires registering the classes. If the objects are large, increase spark.kryoserializer.buffer. If a class is not registered, Kryo still works, but it stores the full class name with each object, which is wasteful.
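      A sketch of enabling Kryo and registering classes (the Event/UserSession classes and buffer sizes are illustrative assumptions, not recommendations):

        import org.apache.spark.SparkConf
        import org.apache.spark.sql.SparkSession

        case class Event(id: Long, payload: Array[Byte])          // hypothetical
        case class UserSession(user: String, events: Seq[Event])  // hypothetical

        val conf = new SparkConf()
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          // registered classes are written as small IDs instead of full class names
          .registerKryoClasses(Array(classOf[Event], classOf[UserSession]))
          // raise the per-object buffer only if individual objects are large
          .set("spark.kryoserializer.buffer", "1m")
          .set("spark.kryoserializer.buffer.max", "128m")

        val spark = SparkSession.builder().config(conf).appName("kryo-demo").getOrCreate()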
      2. Memory Tuning:
        1. Avoid using String (adds ~40 bytes of overhead over the raw data) and common collection classes like HashMap, LinkedList etc. (they create a wrapper object for each entry).
        2. Prefer arrays of objects and primitive types - e.g. the fastutil library.
        3. If you have less than 32 GB of RAM, set the JVM flag -XX:+UseCompressedOops to make pointers be four bytes instead of eight
    Check whether there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times before a task completes, there isn't enough memory available for executing tasks.
    Use -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to collect GC stats and tune GC accordingly.
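    A sketch of attaching those flags to executors; the GC lines then show up in each executor's log. (Driver-side JVM flags normally go in spark-defaults.conf or on spark-submit, since the driver JVM is already running by the time application code builds a SparkConf.)

      import org.apache.spark.SparkConf

      val gcConf = new SparkConf()
        // GC logging plus compressed oops (useful when the executor heap is < 32 GB)
        .set("spark.executor.extraJavaOptions",
             "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops")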
    1. Level of parallelism: 2-3 tasks per CPU core is recommended
    2. Memory usage of reduce tasks: persist RDDs serialized if possible. Try to reduce shuffles (prefer reduceByKey over groupByKey, coalesce over repartition, etc.).
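    For example (word-count style pairs; an existing SparkSession named spark is assumed):

      import org.apache.spark.storage.StorageLevel

      val pairs = spark.sparkContext.parallelize(Seq("a" -> 1, "b" -> 1, "a" -> 1))

      // reduceByKey combines values on the map side before shuffling
      val counts = pairs.reduceByKey(_ + _)
      // groupByKey would ship every value over the network first:
      // pairs.groupByKey().mapValues(_.sum)

      // serialized persist trades CPU for a much smaller memory footprint
      counts.persist(StorageLevel.MEMORY_ONLY_SER)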
    3. Broadcasting large variables: broadcast large lookup variables - an RDD lookup is O(m) while a broadcast lookup is O(1).
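    A sketch of the broadcast-lookup pattern (the lookup table and data are made up; spark is an existing session):

      val countryNames = Map("IN" -> "India", "US" -> "United States")
      val bCountries = spark.sparkContext.broadcast(countryNames)

      val orders = spark.sparkContext.parallelize(Seq(("o1", "IN"), ("o2", "US")))
      // each task reads the broadcast value locally - no shuffle, no join against a lookup RDD
      val withNames = orders.map { case (id, cc) => (id, bCountries.value.getOrElse(cc, "unknown")) }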
    4. Data Locality:
    5. Other considerations:
      1. Use Datasets whenever possible. Avoid UDFs/UDAFs; if a UDF is really needed, implement it by extending Catalyst's expressions so the optimizer can still reason about it.
      2. akka.frameSize - frame size for Spark's Akka-based message broker used for data transfer over the network (only relevant on older Spark versions that still use Akka).
      3. Check the time taken for execution from stages tab
      4. Check the cache utilization from storage tab.
      5. In Spark's JDBC overwrite mode, the actions below are performed (not advisable, since we lose metadata like column constraints):
        1. drop the table, removing all metadata like indexes, PK, FK etc.
        2. recreate the table (column definitions only)
        3. write the data
    Better approach(es):
    • collect the data to the driver and use standard JDBC tools like scalikejdbc to perform the required operation
    • use Spark's truncate=true option along with overwrite mode (see the sketch below)
    • first truncate the table using standard JDBC tools, then use append mode
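    A sketch of the truncate=true approach (resultDf, the JDBC URL, table name and credentials are placeholders):

      import org.apache.spark.sql.SaveMode

      resultDf.write
        .mode(SaveMode.Overwrite)
        .option("truncate", "true")      // keep the existing table definition, only delete the rows
        .format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/appdb")
        .option("dbtable", "reports.daily_summary")
        .option("user", "report_user")
        .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
        .save()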
    1. Partition calculation: no. of executors * no. of cores per executor
    2. Executors & Memory calculation: https://mylearninginbigdata.blogspot.com/2018/06/number-of-executors-and-memory.html
    memoryOverhead = MAX(driver/executor memory * 0.1, 384 MB) --> default calculation.
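    A worked example along the lines of the linked post, assuming 16-core / 64 GB nodes (all numbers here are illustrative):

      val coresPerNode      = 16 - 1                    // leave 1 core for OS / Hadoop daemons
      val memPerNodeGb      = 64 - 1                    // leave ~1 GB for the OS
      val coresPerExecutor  = 5                         // keep <= 5 for healthy HDFS throughput
      val executorsPerNode  = coresPerNode / coresPerExecutor          // = 3
      val memPerExecutorGb  = memPerNodeGb / executorsPerNode          // = 21
      val overheadGb        = math.max(memPerExecutorGb * 0.1, 0.384)  // = 2.1
      val executorMemoryGb  = memPerExecutorGb - overheadGb            // ~= 18.9 -> request ~19g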
    1. spark.driver.supervise can be used to restart the driver automatically if it fails with a non-zero exit code. It is supported in standalone and Mesos cluster modes only.
    1. spark.memory.fraction: the lower this is, the more frequently spills and cached-data eviction occur.
    2. spark.shuffle.memoryFraction: if spills happen often, consider increasing this value at the expense of spark.storage.memoryFraction.
    1. File formats - which one and when --> see the File Formats notes
    2. Spark v1 vs v2 -
      1. Unified DataFrame and DataSet API
      2. Standard SQL support
      3. SparkSession is the entry point and subsumes SQLContext and HiveContext (see the sketch after this list).
      4. Improved Accumulator performance
      5. DF based ML APIs
      6. ML pipeline persistence - Users can now save and load machine learning pipelines and models across all programming languages supported by Spark
      7. Distributing algorithms in R - Added support for Generalized Linear Models (GLM), Naive Bayes, Survival Regression, and K-Means in R.
      8. User-defined functions (UDFs) in R - Added support for running partition level UDFs (dapply and gapply) and hyper-parameter tuning (lapply).
      9. Improved Catalyst optimizer, Tungsten improvements (encoders, whole-stage code generation that fuses a stage into a single function, etc.) and a vectorized Parquet reader
      10. Spark Structured Streaming - rich integration with batch processing
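    The sketch referenced in item 3 above - SparkSession as the single entry point (enableHiveSupport only if Hive classes and a metastore are available):

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .appName("v2-entry-point")
        .enableHiveSupport()          // optional; subsumes the old HiveContext
        .getOrCreate()

      // the old entry points are still reachable for legacy code
      val sc = spark.sparkContext
      val sqlContext = spark.sqlContext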
    3. Spark catalog is used to manage views/tables.
      1. Ex: spark.catalog.listTables.show (lists session-local tables/views)
      2. Ex: spark.catalog.listTables("global_temp").show (lists global and local temp views)
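      A small sketch showing where local vs global temp views land in the catalog (view names are made up):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("catalog-demo").master("local[*]").getOrCreate()
        import spark.implicits._

        val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
        df.createOrReplaceTempView("local_view")            // session-scoped
        df.createOrReplaceGlobalTempView("shared_view")     // stored in the global_temp database

        spark.catalog.listTables().show()                   // shows local_view
        spark.catalog.listTables("global_temp").show()      // shows shared_view and local_view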
    4. In CSV read, the options for mode are PERMISSIVE, DROPMALFORMED and FAILFAST.
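    For example (the input path is a placeholder; an existing session spark is assumed):

      val permissive = spark.read.option("header", "true")
        .option("mode", "PERMISSIVE")      // default: keep bad rows, null out unparsable fields
        .csv("/data/input.csv")

      val dropBad = spark.read
        .option("mode", "DROPMALFORMED")   // silently drop rows that fail to parse
        .csv("/data/input.csv")

      val strict = spark.read
        .option("mode", "FAILFAST")        // throw as soon as a malformed row is seen
        .csv("/data/input.csv")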
    5. coalesce can be used to reduce the number of partitions; it doesn't shuffle the data but instead instructs Spark to read multiple partitions as one.
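    For example (partition counts are arbitrary; spark is an existing session):

      val df = spark.range(0, 1000000).toDF("id")
      println(df.rdd.getNumPartitions)

      val fewer      = df.coalesce(4)       // narrow: merges existing partitions, no full shuffle
      val rebalanced = df.repartition(50)   // full shuffle: use to increase partitions or fix skew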
    6. readStream (provide the topic).load(); writeStream(topic, key, value).start(). withWatermark("10 minutes") must be applied to the same event-time column that is used in the aggregation, and before the aggregation.
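    A sketch of that flow against a Kafka source (broker, topic and window sizes are placeholders; assumes the spark-sql-kafka connector is on the classpath):

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions._

      val spark = SparkSession.builder().appName("stream-demo").getOrCreate()
      import spark.implicits._

      val events = spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load()
        .select($"timestamp", $"value".cast("string").as("body"))

      // watermark on the same event-time column as the window, declared before the aggregation
      val counts = events
        .withWatermark("timestamp", "10 minutes")
        .groupBy(window($"timestamp", "5 minutes"))
        .count()

      val query = counts.writeStream
        .outputMode("update")
        .format("console")
        .start()
      query.awaitTermination()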
    7. Code executes on executors only when Spark APIs are used, i.e. operations on RDDs, DFs or DSs. All other code - anything before the SparkSession/context/Spark APIs are used, or operations on collected data - executes on the driver.
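    A small illustration of where each line runs (assuming a local session):

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("where-it-runs").master("local[*]").getOrCreate()

      val threshold = 100                                    // driver
      val data = spark.sparkContext.parallelize(1 to 1000)   // definition happens on the driver

      val big = data.filter(_ > threshold)                   // the closure runs on executors

      val local = big.collect()                              // results pulled back to the driver
      println(local.length)                                  // driver again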
    8. Mistakes to avoid:
      1. Wrong calculation of executors - remember to consider yarn memory overhead (15-20%)
      2. No Spark shuffle block can be greater than 2 GB
      3. The default number of partitions when a shuffle is involved is 200, so use repartition or coalesce wisely (spark.sql.shuffle.partitions)
        1. How many partitions? ~256MB per partition
        2. Remember the number 2000 (a Spark bookkeeping limit); if the number of partitions is close to 2000, bump it past 2000
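        A rough sizing sketch for spark.sql.shuffle.partitions (the 600 GB shuffle size is a made-up figure; spark is an existing session):

          val shuffleInputGb    = 600.0
          val targetPartitionMb = 256.0
          val partitions = math.ceil(shuffleInputGb * 1024 / targetPartitionMb).toInt   // = 2400
          // 2400 is already past the ~2000 bookkeeping threshold; if the estimate had landed
          // just under 2000, bump it above 2000 rather than sitting at the boundary
          spark.conf.set("spark.sql.shuffle.partitions", partitions.toString)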
    References:



 A good reference for Shell scripting  https://linuxcommand.org/lc3_writing_shell_scripts.php