PySpark & Coalesce

Let’s say you have a big dataset: 1,000,000 files of roughly 1 MB each, for about 1 TB in total. On HDFS, the default block size is 128 MB, so this dataset would be stored across roughly 7,800 partitions (dataset size divided by HDFS block size), meaning your job...
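As a quick sanity check, the partition count above is just the total dataset size divided by the HDFS block size. A minimal sketch of that arithmetic in plain Python (the variable names here are illustrative, not part of any Spark API):

```python
# Back-of-the-envelope partition count for the example above:
# 1,000,000 files at ~1 MB each ≈ 1 TB total (decimal units).
dataset_size_mb = 1_000_000 * 1      # total dataset size in MB
hdfs_block_size_mb = 128             # default HDFS block size in MB

partitions = dataset_size_mb / hdfs_block_size_mb
print(round(partitions))             # 7812, i.e. roughly 7,800 partitions
```

In a real PySpark job you would read this number back with `df.rdd.getNumPartitions()` before deciding whether to `coalesce` it down.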