Key takeaways
- Apache Beam is a powerful open source project for batch and streaming data processing
- Its portability allows pipelines to run on multiple execution backends, from Apache Spark to Google Cloud Dataflow
- Beam is extensible, which means you can write and share new SDKs, IO connectors and transformers
- Beam currently supports Python, Java and Go
- By using its Java SDK, you can take full advantage of the JVM
In this article, we introduce Apache Beam, a powerful open source project for batch and streaming processing, used by large companies such as eBay to integrate its streaming pipelines and by Mozilla to move data securely between its systems.
Overview
Apache Beam is a programming model for data processing that supports both batch and streaming.
Using the provided SDKs for Java, Python and Go, you can develop pipelines and then select the backend that will run the pipeline.
Benefits of Apache Beam
Beam Model (Frances Perry and Tyler Akidau)
- Built-in I/O connectors
  - Apache Beam connectors make it easy to extract and load data from several types of storage
  - The main connector types are:
    - File-based (e.g. Apache Parquet, Apache Thrift)
    - File system (e.g. Hadoop, Google Cloud Storage, Amazon S3)
    - Messaging (e.g. Apache Kafka, Google Pub/Sub, Amazon SQS)
    - Database (e.g. Apache Cassandra, Elasticsearch, MongoDB)
  - As an OSS project, support for new connectors keeps growing (e.g. InfluxDB, Neo4j)
- Portability
  - Beam provides several runners to execute pipelines, letting you choose the best one for each use case and avoid vendor lock-in.
  - Distributed processing backends such as Apache Flink, Apache Spark, or Google Cloud Dataflow can be used as runners.
- Distributed parallel processing
  - Each element of the data set is processed independently by default, so processing can be optimized by running in parallel.
  - Developers do not have to distribute the workload among workers manually, as Beam provides an abstraction for it.
The Beam Model
The key concepts in the Beam programming model are:
- PCollection: represents a collection of data, e.g. an array of numbers or the words extracted from a text.
- PTransform: a transform function that receives and returns a PCollection, e.g. summing all numbers.
- Pipeline: manages the interactions between PTransforms and PCollections.
- PipelineRunner: specifies where and how the pipeline should run.
Quick start
A basic pipeline operation consists of three steps: reading, processing and writing the result of the transformation. Each of these steps is defined programmatically using one of the Apache Beam SDKs.
In this section we will create pipelines using the Java SDK. You can either create a local application (using Gradle or Maven) or use the online Playground. The examples will use the local runner, as it makes it easier to verify the results with JUnit assertions.
Java local dependencies
- beam-sdks-java-core: contains all classes of the Beam Model.
- beam-runners-direct-java: by default, the Apache Beam SDK will use the direct runner, which means that the pipeline will run on your local machine.
Multiply by 2
In this first example, the pipeline receives an array of numbers and maps each element to its value multiplied by 2.
The first step is to create the pipeline instance that will receive the input array and run the transform function. Because we are using JUnit to run Apache Beam, we can easily create a TestPipeline as a test class attribute. If you prefer to run it from your main application instead, you will need to set the pipeline configuration options.
@Rule public final transient TestPipeline pipeline = TestPipeline.create();
Now we can create the PCollection that will be used as input to the pipeline. It will be an array instantiated directly from memory, but it could be read from any source supported by Apache Beam:
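PCollection<Integer> numbers = pipeline.apply(Create.of(1, 2, 3, 4, 5));
We then apply our transformation function, which multiplies each element of the data set by two:
PCollection<Integer> output = numbers.apply(
    MapElements.into(TypeDescriptors.integers())
        .via((Integer number) -> number * 2));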
To check the results, we can write an assertion:
PAssert.that(output).containsInAnyOrder(2, 4, 6, 8, 10);
Note that the results do not have to be in the same order as the input, because Apache Beam processes each element independently and in parallel.
At this point the test is complete, and we run the pipeline by calling:
pipeline.run();
Reduce operation
A reduce operation combines multiple input elements into a smaller collection, usually containing a single element.
MapReduce (Frances Perry and Tyler Akidau)
Now let’s extend the example above to sum all the elements multiplied by two, which results in a MapReduce transform.
Each PCollection transformation results in a new PCollection instance, which means that we can chain transformations using the apply method. In this case, the Sum operation is applied after multiplying each input by 2:
PCollection<Integer> numbers = pipeline.apply(Create.of(1, 2, 3, 4, 5));
PCollection<Integer> output = numbers
    .apply(MapElements.into(TypeDescriptors.integers())
        .via((Integer number) -> number * 2))
    .apply(Sum.integersGlobally());
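To check the expected result without running a pipeline, the same map-then-sum logic can be sketched with plain Java streams. This is only an analogy and not Beam code: the class `MapReduceSketch` and the method `doubleAndSum` are hypothetical names, and the stream runs inside a single JVM instead of on a runner.

```java
import java.util.List;

final class MapReduceSketch {
    // Map step: multiply each element by 2.
    // Reduce step: combine all mapped elements into a single sum.
    static int doubleAndSum(List<Integer> numbers) {
        return numbers.stream()
                .mapToInt(n -> n * 2)
                .sum();
    }

    public static void main(String[] args) {
        // 2 + 4 + 6 + 8 + 10 = 30
        System.out.println(doubleAndSum(List.of(1, 2, 3, 4, 5)));
    }
}
```

For the input 1 to 5 the sketch prints 30, which is the single element the reduced collection would be expected to contain.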
Working with FlatMap
FlatMap is an operation that first applies a map to each input element, which usually returns a new collection, resulting in a collection of collections. A flatten operation is then applied to merge all the nested collections into a single one.
The next example transforms an array of strings into a single array containing each word.
First, we declare our list of words that will be used as input to the pipeline:
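final String[] WORDS_ARRAY = new String[] {"hi bob", "hello alice", "hi sue"};
final List<String> WORDS = Arrays.asList(WORDS_ARRAY);
Then we create the input PCollection using the list above:
PCollection<String> input = pipeline.apply(Create.of(WORDS));
We then apply the FlatMap transformation, which splits each string into words and merges the results into a single list:
PCollection<String> output = input.apply(
    FlatMapElements.into(TypeDescriptors.strings())
        .via((String line) -> Arrays.asList(line.split(" "))));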
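The map-then-flatten behaviour described in this section can also be sketched with plain Java streams. This is an analogy rather than Beam code: `FlatMapSketch` and `splitIntoWords` are hypothetical names, the sample sentences are chosen for illustration, and a sequential stream keeps the input order, which a distributed runner does not guarantee.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class FlatMapSketch {
    // Map step: split each line into an array of words (a nested collection).
    // Flatten step: merge all nested collections into one flat list.
    static List<String> splitIntoWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [hi, bob, hello, alice, hi, sue]
        System.out.println(splitIntoWords(List.of("hi bob", "hello alice", "hi sue")));
    }
}
```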
Group operation
A common task in data processing is aggregating or counting by a specific key. We will demonstrate this by counting the number of occurrences of each word from the previous example.
Once we have the flat array of strings, we can chain another PTransform:
PCollection<KV<String, Long>> output = input
    .apply(FlatMapElements.into(TypeDescriptors.strings())
        .via((String line) -> Arrays.asList(line.split(" "))))
    .apply(Count.<String>perElement());
Resulting in:
PAssert.that(output).containsInAnyOrder(KV.of("hi", 2L), KV.of("hello", 1L), KV.of("alice", 1L), KV.of("sue", 1L), KV.of("bob", 1L));
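The per-key counting asserted above can be reproduced with plain Java collectors as a quick sanity check. This is an analogy and not Beam code: `CountSketch` and `countWords` are hypothetical names, and the result is an in-memory map rather than a PCollection of key-value pairs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class CountSketch {
    // Split each line into words, then group equal words together
    // and count how many times each one occurs.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countWords(List.of("hi bob", "hello alice", "hi sue"));
        // "hi" occurs twice; every other word occurs once.
        counts.forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```

As with the pipeline result, the iteration order of the map entries is not meaningful; only the word-to-count pairs are.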
Reading from a file
One of the principles of Apache Beam is reading data from anywhere, so let’s see in practice how to use a text file as a data source.
The following example reads a file named “words.txt” whose content is “Advanced Unified Programming Model”. The transform function then returns a PCollection containing each word of the text.
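// The location of words.txt under src/main/resources is an assumption
PCollection<String> input = pipeline.apply(TextIO.read().from("src/main/resources/words.txt"));
PCollection<String> output = input.apply(
    FlatMapElements.into(TypeDescriptors.strings())
        .via((String line) -> Arrays.asList(line.split(" "))));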
Save output to file
As with input in the previous example, Apache Beam has multiple built-in output connectors. In the following example, we count the occurrences of each word in the text file “words.txt”, which contains only one sentence (“Advanced Unified Programming Model”), and save the output in text file format.
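// The location of words.txt under src/main/resources is an assumption
PCollection<String> input = pipeline.apply(TextIO.read().from("src/main/resources/words.txt"));
PCollection<KV<String, Long>> output = input
    .apply(FlatMapElements.into(TypeDescriptors.strings())
        .via((String line) -> Arrays.asList(line.split(" "))))
    .apply(Count.<String>perElement());
// TextIO writes strings, so each pair is formatted as one line (format chosen for illustration)
output
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((KV<String, Long> kv) -> kv.getKey() + " " + kv.getValue()))
    .apply(TextIO.write().to("src/main/resources/wordcount"));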
Even file writing is optimized for parallelism by default, which means that Beam will determine the best number of shards (files) to persist the result. The files will be located in the src/main/resources folder and their names will consist of the prefix “wordcount”, the shard number and the total number of shards, as defined in the last output transformation.
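The naming scheme described above (prefix, shard number, total number of shards) can be illustrated with a small plain-Java helper. It is a sketch that assumes Beam's default shard name template `-SSSSS-of-NNNNN` with zero-based shard numbers; `ShardNameSketch`, `shardName` and the shard count of 3 are hypothetical, since the actual number of shards is chosen by the runner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class ShardNameSketch {
    // Builds one file name: prefix, five-digit shard index, five-digit shard total.
    static String shardName(String prefix, int shard, int totalShards) {
        return String.format(Locale.ROOT, "%s-%05d-of-%05d", prefix, shard, totalShards);
    }

    // Lists the names of all shards for a given total (indices start at 0).
    static List<String> allShardNames(String prefix, int totalShards) {
        List<String> names = new ArrayList<>();
        for (int shard = 0; shard < totalShards; shard++) {
            names.add(shardName(prefix, shard, totalShards));
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [wordcount-00000-of-00003, wordcount-00001-of-00003, wordcount-00002-of-00003]
        System.out.println(allShardNames("wordcount", 3));
    }
}
```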
When I run it on my laptop, …