Use It or Lose It Notes

Because I mostly don’t use it, and then end up losing it.

This is my living blog of quick, forgotten patterns. Not profound, just practical.


Spark

1. Create a SparkSession (Boilerplate I always forget)

SparkSession spark = SparkSession.builder()
        .appName("UILI")
        .master("local[*]")
        .getOrCreate();

2. Create a Dataset from Strings (not from files)

Dataset<String> ds = spark.createDataset(
        Arrays.asList("Abc", "xyz"),
        Encoders.STRING()
);

3. SparkConf and RDD creation options

SparkConf conf = new SparkConf().setAppName("uili").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);

// Create RDDs from:
sc.parallelize(...);
sc.textFile("path");
sc.wholeTextFiles("path");

// Or convert from DataFrame:
df.javaRDD();

// Or from Kafka (spark-streaming-kafka-0-10; needs a JavaStreamingContext, not sc):
JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(...);
stream.foreachRDD(rdd -> { /* process rdd */ });

4. Word Count with Stopword Filtering

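// lines: JavaRDD<String>; SPACE: a regex Pattern; stopWords: the words to skip
lines
        .flatMap(s -> Arrays.asList(SPACE.split(s)).iterator())
        .filter(word -> !word.isEmpty() && !stopWords.contains(word))
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum)
        .mapToPair(Tuple2::swap)      // now (count, word), so we can sort by count
        .sortByKey(false)             // descending
        .take(10)
        // after the swap, _1 is the count and _2 is the word; print as "word: count"
        .forEach(tuple -> System.out.println(tuple._2() + ": " + tuple._1()));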
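
The snippet above uses SPACE and stopWords without showing them. Here is a rough plain-Java version of the same pipeline (streams only, no Spark) that I can run on a tiny list to check the logic. The class name, the single-space pattern, and the stop word list are all my own assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LocalWordCount {
    // Assumed definitions: the Spark snippet references these names but never declares them.
    static final Pattern SPACE = Pattern.compile(" ");
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    /** Returns up to `limit` lines of the form "word: count", most frequent first. */
    static List<String> topWords(List<String> lines, int limit) {
        // Same steps as flatMap + filter + reduceByKey, done locally.
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(SPACE.split(line)))
                .filter(word -> !word.isEmpty() && !STOP_WORDS.contains(word))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        return counts.entrySet().stream()
                // Count descending, then word ascending so ties come out in a stable order.
                .sorted((a, b) -> {
                    int byCount = Long.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(limit)
                .map(e -> e.getKey() + ": " + e.getValue())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat sat", "a cat of the hat", "hat cat");
        topWords(lines, 10).forEach(System.out::println);
    }
}
```

Splitting on a single space leaves empty tokens when there are double spaces, which is why the isEmpty() check is worth keeping in the filter.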

5. Custom Partitioner

rdd.partitionBy(new UserIdModPartitioner(10));

public static class UserIdModPartitioner extends Partitioner {
    private final int numPartitions;

    public UserIdModPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    @Override
    public int numPartitions() {
        return numPartitions;
    }

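    @Override
    public int getPartition(Object key) {
        // floorMod keeps the index in [0, numPartitions) even for negative IDs;
        // plain % would give a negative value, which is not a valid partition ID.
        return Math.floorMod((Integer) key, numPartitions);
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof UserIdModPartitioner) {
            return ((UserIdModPartitioner) obj).numPartitions == this.numPartitions;
        }
        return false;
    }

    // Keep hashCode consistent with equals.
    @Override
    public int hashCode() {
        return Integer.hashCode(numPartitions);
    }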
}
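
To see where keys actually land with a modulo partitioner, here is a small plain-Java sketch (no Spark needed). It also shows the thing I keep forgetting: Spark expects getPartition to return a value from 0 to numPartitions - 1, and Java's % goes negative for negative keys, so Math.floorMod is the safer choice. The class name and the sample user IDs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ModPartitionDemo {
    /** Partition index in [0, numPartitions) for any int key, including negative ones. */
    static int partitionFor(int userId, int numPartitions) {
        return Math.floorMod(userId, numPartitions);
    }

    /** Groups user IDs by the partition index they would be assigned to. */
    static Map<Integer, List<Integer>> assign(List<Integer> userIds, int numPartitions) {
        Map<Integer, List<Integer>> byPartition = new TreeMap<>();
        for (int userId : userIds) {
            byPartition
                    .computeIfAbsent(partitionFor(userId, numPartitions), p -> new ArrayList<>())
                    .add(userId);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        // Hypothetical user IDs, including a negative one.
        List<Integer> userIds = Arrays.asList(7, 17, 23, 40, -3);
        System.out.println(assign(userIds, 10)); // {0=[40], 3=[23], 7=[7, 17, -3]}
        System.out.println(-3 % 10);             // -3, not usable as a partition index
    }
}
```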

6. groupByKey vs reduceByKey vs mapValues

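  • groupByKey() → Avoid for large data. Every value is shuffled across the network before grouping.
  • reduceByKey() → Preferred for aggregations. Combines values within each partition before the shuffle.
  • mapValues() → Lightweight transform on values only. Keys are untouched, so there is no shuffle and the partitioner is preserved.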
rdd.mapValues(value -> transform(value));
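
Why the local aggregation matters, as a toy model in plain Java (not Spark code): count how many records would leave each partition with and without combining first. The class name, method names, and sample data are invented for illustration, and real Spark shuffles involve serialization and spilling that this ignores.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ShuffleSizeDemo {
    /** groupByKey-style: every (word, 1) record leaves its partition as-is. */
    static int recordsShuffledWithoutCombine(List<List<String>> partitions) {
        int shuffled = 0;
        for (List<String> partition : partitions) {
            shuffled += partition.size();
        }
        return shuffled;
    }

    /** reduceByKey-style: each partition sums locally, then sends one record per distinct key. */
    static int recordsShuffledWithCombine(List<List<String>> partitions) {
        int shuffled = 0;
        for (List<String> partition : partitions) {
            Map<String, Integer> localCounts = new HashMap<>();
            for (String word : partition) {
                localCounts.merge(word, 1, Integer::sum);
            }
            shuffled += localCounts.size();
        }
        return shuffled;
    }

    public static void main(String[] args) {
        // Two pretend partitions holding the keys of (word, 1) pairs.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "a", "a", "b"),
                Arrays.asList("a", "b", "b", "b"));
        System.out.println(recordsShuffledWithoutCombine(partitions)); // 8
        System.out.println(recordsShuffledWithCombine(partitions));    // 4
    }
}
```

The gap grows with the number of repeated keys per partition, which is the usual case for counting jobs.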

Java

Random nuggets that fade away from memory faster than they should.