2.6 C
New York
Friday, February 3, 2023

Data Oriented Programming in Java – InfoQ.com

InfoQ Live July 19: How to Successfully Build and Deploy Applications on the Cloud? Register Now
Facilitating the Spread of Knowledge and Innovation in Professional Software Development


Avdi Grimm describes the future of development, which is already here. Get a tour of a devcontainer, and contrast it with a deployment container.
Fran Mendez discusses event-driven or asynchronous APIs, comparing AsyncAPI with OpenAPI/Swagger, AMQP/MQTT/Kafka with HTTP, and publish/subscribe with request/response.
Roland Meertens shows how one can get started deploying models without requiring any data, discussing foundational models, and examples of them, such as GPT-3 and OpenAI CLIP.
As much as it has been discussed and praised, digital transformation has suffered from a lack of clear definition. While the promise of becoming “customer-centric” and “disruptive” has been widely publicized, there has still been little in actual guidance as to how to achieve those and many of the other benefits commonly associated with digital transformation. This article offers concrete advice
This article summarizes how we see the "cloud computing and DevOps" space in 2022, which focuses on fundamental infrastructure and operational patterns, the realization of patterns in technology frameworks, and the design processes and skills that a software architect or engineer must cultivate.
Learn how to migrate an application to serverless and what are the common mistakes to avoid. Register Now.
Understand the emerging software trends you should pay attention to. Attend in-person on Oct 24-28, 2022.
InfoQ Homepage Articles Data Oriented Programming in Java
Jun 20, 2022 22 min read
by
Brian Goetz
reviewed by
Daniel Bryant
Project Amber has brought a number of new features to Java in recent years — local variable type inferencetext blocksrecords, sealed classes, pattern matching, and more. While each of these features are self-contained, they are also designed to work together. Specifically, records, sealed classes, and pattern matching work together to enable easier data-oriented programming in Java. In this article, we'll cover what is meant by this term and how it might affect how we program in Java.
The goal of any programming paradigm is to manage complexity. But complexity comes in many forms, and not all paradigms handle all forms of complexity equally well. Most programming paradigms have a one-sentence slogan of the form "Everything is a …"; for OOP, this is obviously "everything is an object." Functional programming says "everything is a function"; actor-based systems say "everything is an actor", etc. (Of course, these are all overstatements for effect.)
OOP encourages us to model complex entities and processes using objects, which combine state and behavior. OOP encourages encapsulation (object behavior mediates access to object state) and polymorphism (multiple kinds of entities can be interacted with using a common interface or vocabulary), though the mechanisms for accomplishing these goals vary across OO languages. When modeling the world with objects, we are encouraged to think in terms of is-a (a savings account is-a bank account) and has-a (a savings account has-a owner and account number) relationships.
The platform to power synchronized digital experiences in realtime. Guaranteed to deliver at scale. Get started for free.
While some developers take pleasure in loudly declaring object-oriented programming to be a failed experiment, the truth is more subtle; like all tools, it is well-suited to some things and less well-suited to others. OOP done badly can be awful, and a lot of people have been exposed to OOP principles taken to ridiculous extremes. (Rants like the kingdom of nouns may be amusing and therapeutic, but they're not really railing against OOP, as much as a cartoon exaggeration of OOP.) But if we understand what OOP is better or worse at, we can use it where it offers more value and use something else where it offers less.
OOP is at its best when it is defining and defending boundaries — maintenance boundaries, versioning boundaries, encapsulation boundaries, compilation boundaries, compatibility boundaries, security boundaries, etc.
Independently maintained libraries are built, maintained, and evolved separately from the applications that depend on them (and from each other), and if we want to be able to freely move from one version of the library to the next, we need to ensure that boundaries between libraries and their clients are clear, well-defined, and deliberate. Platform libraries may have privileged access to the underlying operating system and hardware, which must be carefully controlled; we need a strong boundary between the platform libraries and the application to preserve system integrity. OO languages provide us with tools for precisely defining, navigating, and defending these boundaries.
Dividing a large program into smaller parts with clear boundaries helps us manage complexity because it enables modular reasoning — the ability to analyze one part of the program at a time, but still reason about the whole. In a monolithic program, placing sensible internal boundaries helped us build bigger applications that spanned multiple teams. It is no accident that Java thrived in the era of monoliths.
Since then, programs have gotten smaller; rather than building monoliths, we compose larger applications out of many smaller services. Within a small service, there is less need for internal boundaries; sufficiently small services can be maintained by a single team (or even a single developer.) Similarly, within such smaller services, we have less need for modeling long-running stateful processes.
Java's strong static typing and class-based modeling can still be tremendously useful for smaller programs, just in different ways. Where OOP encourages us to use classes to model business entities and processes, smaller codebases with fewer internal boundaries will often get more mileage out of using classes to model data. Our services consume requests that come from the outside world, such as via HTTP requests with untyped JSON/XML/YAML payloads. But only the most trivial of services would want to work directly with data in this form; we'd like to represent numbers as int or long rather than as strings of digits, dates as classes like LocalDateTime, and lists as collections rather than long comma-delimited strings. (And we want to validate that data at the boundary, before we act on it.)
Data-oriented programming encourages us to model data as (immutable) data, and keep the code that embodies the business logic of how we act on that data separately. As this trend towards smaller programs has progressed, Java has acquired new tools to make it easier to model data as data (records), to directly model alternatives (sealed classes), and to flexibly destructure polymorphic data (pattern matching) patterns.
Data-oriented programming encourages us to model data as data. Records, sealed classes, and pattern matching, work together to make that easier.
Programming with data as data doesn't mean giving up static typing. One could do data-oriented programming with only untyped maps and lists (one often does in languages like Javascript), but static typing still has a lot to offer in terms of safety, readability, and maintainability, even when we are only modeling plain data. (Undisciplined data-oriented code is often called "stringly typed", because it uses strings to model things that shouldn't be modeled as strings, such as numbers, dates, and lists.)
Records, sealed classes, and pattern matching are designed to work together to support data-oriented programming. Records allow us to simply model data using classes; sealed classes let us model choices; and pattern matching provides us with an easy and type-safe way of acting on polymorphic data. Support for pattern matching has come in several increments; the first added only type-test patterns and only supported them in instanceof; the next supported type-test patterns in switch as well; and most recently, deconstruction patterns for records were added in Java 19. The examples in this article will make use of all of these features.
While records are syntactically concise, their main strength is that they let us cleanly and simply model aggregates. Just as with all data modeling, there are creative decisions to make, and some modelings are better than others. Using the combination of records and sealed classes also makes it easier to make illegal states unrepresentable, further improving safety and maintainability.
As a first example, consider how we might model invocation options in a command line program. Some options take arguments; some do not. Some arguments are arbitrary strings, whereas others are more structured, such as numbers or dates. Processing command line options should reject bad options and malformed arguments early in the execution of the program. A quick-and-dirty approach might be to loop through the command line arguments and for each known option we encounter, squirrel away the presence or absence of the option, and possibly the option's parameter, in variables. This is simple, but now our program is dependent on a set of stringly-typed, effectively global variables. If our program is tiny, this might be OK, but it doesn't scale very well. Not only is this likely to impede maintainability as the program grows, but it makes our program less testable — we can only test the program as a whole through its command line.
A slightly less quick-and-dirty approach might be to create a single class representing a command line option, and parse the command line into a list of option objects. If we had a cat-like program that copies lines from one or more files to another, can trim files to a certain line count, and can optionally include line numbers, we might model these options using an enum and an Option class:
This is an improvement over the previous approach; at least now there is a clean separation between parsing the command line options and consuming them, which means we can test our business logic separately from the command-line shell by feeding it lists of options. But it's still not very good. Some options have no parameter, but we can't see that from looking at the enum of options, and we still model them with an OptionValue object that has an optionValue field. And even for options that do have parameters, they are always stringly typed.
The better way to do this is to model each option directly. Historically, this might have been prohibitively verbose, but fortunately this is no longer the case. We can used a sealed class to represent an Option, and have a record for each kind of option:
The Option subclasses are pure data. The option values have nice clean names and types; options that have parameters represent them with the appropriate type; options without parameters do not have useless parameter variables that might be misinterpreted. Further, it is easy to process the options with a pattern matching switch (usually one line of code per kind of option.) And because Option is sealed, the compiler can type-check that a switch handles all the option types. (If we add more option types later, the compiler will remind us which switches need to be extended.)
We've probably all written code like that outlined in the first two versions, even though we know better. Without the ability to cleanly and concisely model the data, doing it "right" is often too much work (or too much code.)
What we've done here is take messy, untyped data from across the invocation boundary (command line arguments) and transformed it into data that is strongly typed, validated, easily acted upon (through pattern matching), and makes many illegal states (such as specifying --input-file but not providing a valid path) unrepresentable. The rest of the program can just use it with confidence.
This combination of records and sealed types is an example of what are called algebraic data types (ADTs). Records are a form of "product types", so-called because their state space is the cartesian product of that of their components. Sealed classes are a form of "sum types", so-called because the set of possible values is the sum (union) of the value sets of the alternatives. This simple combination of mechanisms — aggregation and choice — is deceptively powerful, and shows up in many programming languages. (Our example here was restricted to one level of hierarchy, but this need not be the case in general; one of the permitted subtypes of a sealed interface could be another sealed interface, allowing modeling of complex structures.)
In Java, algebraic data types can be modeled precisely as sealed hierarchies whose leaves are records. Java's interpretation of algebraic data types have a number of desirable properties. They are nominal — the types and components have human-readable names. They are immutable, which makes them simpler and safer and can be freely shared without worry of interference. They are easily testable, because they contain nothing but their data (possibly with behavior derived purely from the data). They can easily be serialized to disk or across the wire. And they are expressive — they can model a broad range of data domains.
One of the simplest but most frequently used applications of algebraic data types is complex return types. Since a method can only return a single value, it is often tempting to overload the representation of the return value in questionable or complex ways, such as using null to mean "not found", encoding multiple values into a string, or using an overly abstract type (arrays, List or Map) to cram all the different kinds of information a method could return into a single carrier object. Algebraic data types make it so easy to do the right thing, that these approaches become less tempting.
In Sealed Classes, we gave an example of how this technique that could be used to abstract over both success and failure conditions without using exceptions:
The advantage of this approach is that the client can handle success and failure uniformly by pattern matching over the result, rather than having to handle success via the return value and the various failure modes via separate catch blocks:
Another benefit of sealed classes is that if you switch over them without a default, the compiler will remind you if you've forgotten a case. (Checked exceptions do this too, but in a more intrusive manner.)
As another example, imagine a service that looks up entities (users, documents, groups, etc) by name, and which distinguishes between "no match found", "exact match found", and "no exact match, but there were close matches." We can all imagine ways to cram this into a single List or array, and while this may make the search API easy to write, it makes it harder to understand, use, or test. Algebraic data types make both sides of this equation easy. We can craft a concise API that says exactly what we mean:
If we encountered this return hierarchy while browsing the code or the Javadoc, it is immediately obvious what this method might return, and how to handle its result:
While such a clear encoding of the return value is good for the readability of the API and for its ease of use, such encodings are also often easier to write as well, because the code virtually writes itself from the requirements. On the other hand, trying to come up with (and document) "clever" encodings that cram complex results into abstract carriers like arrays or maps takes more work.
Algebraic data types are also useful for modeling ad-hoc versions of general purpose data structures. The popular class Optional could be modeled as an algebraic data type:
(This is actually how Optional is defined in most functional languages.) Common operations on Opt can be implemented with pattern matching:
Similarly, a binary tree can be implemented as:
and we can implement the usual operations with pattern matching:
It may seem odd to see this behavior written as static methods, when common behaviors like traversal should "obviously" be implemented as abstract methods on the base interface. And certainly, some methods may well make sense to put into the interface. But the combination of records, sealed classes, and pattern matching offers us alternatives that we didn't have before; we could implement them the old fashioned way (with an abstract method in the base class and concrete methods in each subclass); as default methods in the abstract class implemented in one place with pattern matching; as static methods; or (when recursion is not needed), as ad-hoc traversals inline at the point of use.
Because the data carrier is purpose-built for the situation, we get to choose whether we want the behavior to travel with the data or not. This approach is not at odds with object orientation; it is a useful addition to our toolbox that can be used alongside OO, as the situation demands.
If you look closely enough at the JSON spec, you'll see that a JSON value is also an ADT:
When presented as such, the code to extract the relevant bits of information from a blob of JSON is pretty straightforward; if we want to match the JSON blob { "name":"John", "age":30, "city":"New York" } with pattern matching, this is:
When we model data as data, both creating aggregates and taking them apart to extract their contents (or repack them into another form) is straightforward, and because pattern matching fails gracefully when something doesn't match, the code to take apart this JSON blob is relatively free of complex control flow for enforcing structural constraints. (While we might be inclined to use a more industrial-strength JSON library than this toy example, we could actually implement the toy with only a few dozen additional lines of parsing code which follows the lexical rules outlined in the JSON spec and turns them into a JsonValue.)
The domains we've looked at so far have either been "throwaways" (return values used across a call boundary) or modeling general domains like lists and trees. But the same approach is also useful for more complex application-specific domains. If we wanted to model an arithmetic expression, we could do so with:
Having the intermediate sealed interface BinaryNode which abstracts over addition and multiplication gives us the choice when matching over a Node; we could handle both addition and multiplication together by matching on BinaryNode, or handle them individually, as the situation requires. The language will still make sure we covered all the cases.
Writing an evaluator for these expressions is trivial. Since we have variables in our expressions, we'll need a store for those, which we pass into the evaluator:
The records which define the terminal nodes have reasonable toString implementations, but the output is probably more verbose than we'd like. We can easily write a formatter to produce output that looks more like a mathematical expression:
As before, we could express these as static methods, or implement them in the base class as instance methods but with a single implementation, or implement them as ordinary instance methods — we're free to choose which feels most readable for the domain.
Having defined our domain abstractly, we can easily add other operations on it as well. We can symbolically differentiate with respect to a single variable easily:
Before we had records and pattern matching, the standard approach to writing code like this was the visitor pattern. Pattern matching is clearly more concise than visitors, but it is also more flexible and powerful. Visitors require the domain to be built for visitation, and imposes strict constraints; pattern matching supports much more ad-hoc polymorphism. Crucially, pattern matching composes better; we can use nested patterns to express complex conditions that can be much messier to express using visitors. For example, the above code will yield unnecessarily messy trees when, say, we have a multiplication node where one subnode is a constant. We can use nested patterns to handle these special cases more eagerly:
Doing this with visitors — especially at multiple levels of nesting — can quickly become quite messy and error-prone.
Many of the ideas outlined here may look, at first, to be somewhat "un-Java-like", because most of us have been taught to start by modeling entities and processes as objects. But in reality, our programs often work with relatively simple data, which often comes from the "outside world" where we can't count on it fitting cleanly into the Java type system. (In our JSON example, we modeled numbers as double, but in fact the JSON specification is silent on the range of numeric values; code at the boundary of a system is going to have to make a decision of whether to truncate or reject values that don't fit into the local representation.)
When we're modeling complex entities, or writing rich libraries such as java.util.stream, OO techniques have a lot to offer us. But when we're building simple services that process plain, ad-hoc data, the techniques of data-oriented programming may offer us a straighter path. Similarly, when exchanging complex results across an API boundary (such as our match result example), it is often simpler and clearer to define an ad-hoc data schema using ADTs, than to complect results and behavior in a stateful object (as the Java Matcher API does.)
The techniques of OOP and data-oriented programming are not at odds; they are different tools for different granularities and situations. We can freely mix and match them as we see fit.
Whether modeling a simple return value, or a more complex domain such as JSON or our expression trees, there are some simple principles that usually lead us to simple, reliable data-oriented code.
Model the data, the whole data, and nothing but the data. Records should model data. Make each record model one thing, make it clear what each record models, and choose clear names for its components. Where there are choices to be modeled, such as "a tax return is filed either by the taxpayer, or by a legal representative", model these as sealed classes, and model each alternative with a record. Behavior in record classes should be limited to implementing derived quantities from the data itself, such as formatting.
Data is immutable. An object that has a mutable int field does not model an integer; it models a time-varying relationship between a specific object identity and an integer. If we want to model data, we should not have to worry about our data changing out from under us. Records give us some help here, as they are shallowly immutable, but it still requires some discipline to avoid letting mutability inject itself into our data models.
Validate at the boundary. Before injecting data into our system, we should ensure that it is valid. This might be done in the record constructor (if the validation applies universally to all instances), or by the code at the boundary that has received the data from another source.
Make illegal states unrepresentable. Records and sealed types make it easy to model our domains in such a way that erroneous states simply cannot be represented. This is much better than having to check for validity all the time! Just as immutability eliminates many common sources of errors in programs, so does avoiding modeling techniques that allow us to model invalid data.
A hidden benefit of this approach is testability. Not only is it easy to test code when its inputs and outputs are simple, well-defined data, but it opens the door to easier generative testing, which is often far more effective at finding bugs than hand-crafting individual test cases.
The combination of records, sealed types, and pattern matching makes it easy to follow these principles, yielding more concise, readable, and more reliable programs. While programming with data as data may be a little unfamiliar given Java's OO underpinnings, these techniques are well worth adding to our toolbox.

Becoming an editor for InfoQ was one of the best decisions of my career. It has challenged me and helped me grow in so many ways. We’d love to have more people join our team.

A round-up of last week’s content on InfoQ sent out every Tuesday. Join a community of over 250,000 senior developers. View an example

We protect your privacy.
You need to Register an InfoQ account or or login to post comments. But there’s so much more behind being registered.
Get the most out of the InfoQ experience.
Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

A round-up of last week’s content on InfoQ sent out every Tuesday. Join a community of over 250,000 senior developers. View an example

We protect your privacy.
Real-world technical talks. No product pitches.
Practical ideas to inspire you and your team.
QCon San Francisco – Oct 24-28, In-person.

QCon San Francisco brings together the world’s most innovative senior software engineers across multiple domains to share their real-world implementation of emerging trends and practices.
Uncover emerging software trends and practices to solve your complex engineering challenges, without the product pitches.Save your spot now
InfoQ.com and all content copyright © 2006-2022 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we’ve ever worked with.
Privacy Notice, Terms And Conditions, Cookie Policy

source

Related Articles

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Latest Articles