Spark Streaming
Here is an example of very simple streaming from and to files with PySpark: https://gist.github.com/sallos-cyber/a14e03da49cc0c873a651628dba4d096
Key tuning settings:
- spark.dynamicAllocation.enabled
- spark.dynamicAllocation.initialExecutors
- spark.dynamicAllocation.minExecutors
- spark.dynamicAllocation.maxExecutors
- spark.sql.shuffle.partitions
- spark.default.parallelism = spark.executor.instances * spark.executor.cores * 2
- spark.sql.files.maxPartitionBytes

Example: with roughly 40 GB of input bytes, choose the number of partitions so that each partition is at most about 200 MB.
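As a quick sanity check, the partition-sizing rule above can be written out as a small calculation. This is plain Python; the 40 GB input size and 200 MB target come from the note, while the helper name and the executor/core counts are made-up examples:

```python
import math

def recommended_partitions(input_bytes, target_partition_bytes=200 * 1024**2):
    """Smallest partition count that keeps each partition <= the target size."""
    return max(1, math.ceil(input_bytes / target_partition_bytes))

input_bytes = 40 * 1024**3  # 40 GB of input data
print(recommended_partitions(input_bytes))  # -> 205 partitions of <= 200 MB each

# Rule of thumb from the note: parallelism = executors * cores * 2
executors, cores = 10, 4    # hypothetical cluster size
print(executors * cores * 2)  # -> 80
```

With ~205 partitions each holds at most 200 MB of the 40 GB input, which matches the "partition size <= 200 MB" guideline.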
Assumptions: Zeppelin 0.10.0 and Spark 3.1.1. I assume Spark runs in one thread on a single machine (local mode) and Zeppelin runs on the same machine. The SPARK_HOME variable has been set.
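For reference, pointing Zeppelin at a local Spark installation usually comes down to exporting SPARK_HOME before starting Zeppelin. The paths below are placeholders, not taken from the note:

```shell
# Placeholder installation path -- adjust to your machine.
export SPARK_HOME=/opt/spark-3.1.1
export PATH="$SPARK_HOME/bin:$PATH"

# Zeppelin also reads SPARK_HOME from conf/zeppelin-env.sh, e.g.:
# echo 'export SPARK_HOME=/opt/spark-3.1.1' >> "$ZEPPELIN_HOME/conf/zeppelin-env.sh"
```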
Renaming: If you are reading from a topic to which you sent data formatted as JSON, you must deserialize the data, optionally process it, and finally serialize it again before writing it back out.
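A minimal sketch of that read, deserialize, process, serialize cycle with Structured Streaming. The broker address, topic names, schema, and checkpoint path are all placeholders, not values from the note:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_json, struct
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("kafka-json-roundtrip").getOrCreate()

# Placeholder schema for the JSON payload.
schema = StructType([
    StructField("name", StringType()),
    StructField("count", IntegerType()),
])

# 1. Read: Kafka delivers the payload as bytes in the `value` column.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "input-topic")
       .load())

# 2. Deserialize the JSON string into typed columns.
parsed = (raw.select(from_json(col("value").cast("string"), schema).alias("data"))
             .select("data.*"))

# 3. Process (here: a trivial filter as an example).
processed = parsed.where(col("count") > 0)

# 4. Serialize back to a JSON string in `value` and write to another topic.
query = (processed.select(to_json(struct("*")).alias("value"))
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "output-topic")
         .option("checkpointLocation", "/tmp/kafka-roundtrip-ckpt")
         .start())
```

Note that the Kafka source/sink requires the `spark-sql-kafka` package on the classpath (e.g. via `--packages` on spark-submit).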
Useful Kafka console commands:
- List all topics
- Consume from a given topic from the console
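Those two tasks map onto Kafka's bundled CLI tools roughly as follows. The broker address and topic name are placeholders:

```shell
# Placeholder broker address and topic name.
BROKER=localhost:9092

# List all topics.
kafka-topics.sh --bootstrap-server "$BROKER" --list

# Consume from a given topic on the console, starting from the beginning.
kafka-console-consumer.sh --bootstrap-server "$BROKER" \
    --topic my-topic --from-beginning
```

The scripts live in the `bin/` directory of the Kafka distribution (on some installs they are named without the `.sh` suffix).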
Assumption: Spark and ClickHouse are up and running. According to the official ClickHouse documentation, we can use the ClickHouse-Native-JDBC driver. To use it with Python we simply download the shaded JAR.
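With the shaded JAR on Spark's classpath, a read from ClickHouse can be sketched like this. The JAR path, database, table, and user are placeholders; the driver class and URL scheme are those documented for ClickHouse-Native-JDBC (native protocol on port 9000):

```python
from pyspark.sql import SparkSession

# The shaded JAR must be on the classpath when the session is created;
# the path below is a placeholder.
spark = (SparkSession.builder.appName("clickhouse-read")
         .config("spark.jars", "/path/to/clickhouse-native-jdbc-shaded.jar")
         .getOrCreate())

# Placeholder database, table, and user.
df = (spark.read.format("jdbc")
      .option("driver", "com.github.housepower.jdbc.ClickHouseDriver")
      .option("url", "jdbc:clickhouse://localhost:9000")  # native protocol port
      .option("dbtable", "default.my_table")
      .option("user", "default")
      .load())

df.show()
```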