107 views
0

`sc.parallelize([2,1,8,5],4).fold(2,lambda a,b:a+1) ` `sc.parallelize([2,1,8,5],4).fold(2,lambda a,b:b+1)`

Spark gives 6 for first one and 4 for the second. I am not able to understand these outputs. Below is my intuition for the second example:
Spark divides the data into 4 partitions:
P1 -> [2]
P2 -> [1]
P3 -> [8]
P4 -> [5]

Now I apply the lambda function on each partition taking 2 as my zero value for each partition. So a is the zero value and b is the current value in each partition. So now data becomes:
P1 -> [2,2] -> b+1 -> 3
P2 -> [2,1] -> 2
P3 -> [2,8] -> 9
P4 -> [2,5] -> 6

Now lets do the final combining for all partitions with 2 as zero value
[2,3,2,9,6] -> [4,2,9,6] (*since a is 2 and b is 3. so b+1 = 4*) ->
[3,9,6] -> [10,6] ->
[7]

I am getting 7 as the final answer but spark gives 4. Please help me understand where I am going wrong. Looked it up online but not able to find an answer.

I tried with multiple examples but my output differs from what spark gives as an output.

In: 0

With fold ‘a’ is your accumulator, and ‘b’ is your current value.

Your first example initialises the accumulator with ‘2’, and on each element it ignores the current value and adds ‘1’ to the accumulator. So really it is just adding ‘1’ each time. You have 4 rows so you get 2+1+1+1+1=6

Your second example ignores the accumulator and just adds 1 to the current value. Each partition will send to the reduce operation whatever the value is in its last row, plus 1. Ready of those results are ignored in the final fold, since the accumulator is ignored, so in the end you get the result of the last partition, plus 1. I would say the original value of ‘2’ has ended up in the last position in the reduce, its partition returned ‘3’ (2+1), which in turn got another +1 to wind up at 4