
How To Speed Up Large Collections Processing in Java – InfoQ.com

Aug 26, 2022 13 min read
by
Nahla Davies
reviewed by
Erik Costlow
 
Programming today involves working with large data sets, often including many different types of data. Manipulating these data sets can be a complex and frustrating task. To ease the programmer’s job, Java introduced the Java Collections Framework (JCF) in 1998.
This article discusses the purpose behind the Java Collections Framework, how Java collections work, and how developers and programmers can use Java collections to their best advantage.
Although it has passed the venerable age of 25, Java remains one of the most popular programming languages today. Over 1,000,000 websites use Java in some form, and more than a third of software developers have Java in their toolbox.
Throughout its life, Java has undergone substantial evolution. One early advancement came in 1998 when Java introduced the Collections Framework (JCF), which simplified working with groups of Java objects. The JCF provided a standardized interface and common methods for collections, reduced programming effort, and increased the speed of Java programs.
Understanding the distinction between Java collections and the Java Collections Framework is essential. Java collections are simply data structures representing a group of Java objects. Developers can work with collections in much the same way they work with other data types, performing common tasks such as searches or manipulating the collection's contents.
An example of a collection in Java is the Set interface (java.util.Set). A Set is a collection that does not allow duplicate elements and, in general, does not guarantee any particular element order. The Set interface inherits its instance methods from Collection (java.util.Collection) and adds no new instance methods of its own.
In addition to sets, there are lists (java.util.List), queues (java.util.Queue), and maps (java.util.Map). Maps are not collections in the strictest sense, as Map does not extend the Collection interface, but developers can manipulate maps much as they do collections. Sets, queues, lists, and maps each have descendants, such as sorted sets (java.util.SortedSet) and navigable maps (java.util.NavigableMap).
In working with collections, developers need to be familiar with and understand some specific collections-related terminology:
Beginning programmers may find it difficult to grasp the difference between unmodifiable and immutable collections. Unmodifiable collections are not necessarily immutable. Indeed, an unmodifiable collection is often a wrapper around a modifiable collection, and any code that holds a reference to the underlying collection can still change it. It will take some time working with collections to gain a degree of comfort with the distinction.
As an example, consider creating a modifiable list of the top five cryptocurrencies by market capitalization. You can create an unmodifiable view of that list using the java.util.Collections.unmodifiableList() method. You can still modify the underlying list, and the change will appear in the unmodifiable view. But you cannot modify the list through the unmodifiable view itself.
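A minimal sketch of this scenario follows. The class name, helper method, and coin names are illustrative assumptions rather than the article's original listing, and the names are not meant as a current market-cap ranking.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class UnmodifiableViewDemo {

    // Hypothetical helper: wraps a modifiable list in an unmodifiable view.
    static List<String> unmodifiableViewOf(List<String> backing) {
        return Collections.unmodifiableList(backing);
    }

    public static void main(String[] args) {
        // Illustrative names only (an assumption, not real ranking data).
        List<String> coins = new ArrayList<>(
                List.of("Bitcoin", "Ethereum", "Tether", "USD Coin", "BNB"));
        List<String> readOnlyCoins = unmodifiableViewOf(coins);

        // Changing the backing list is allowed and shows through the view.
        coins.add("Cardano");
        System.out.println("View size after adding to the backing list: "
                + readOnlyCoins.size());

        // Changing the list through the view is rejected.
        try {
            readOnlyCoins.add("Solana");
        } catch (UnsupportedOperationException e) {
            System.out.println("The unmodifiable view cannot be modified directly");
        }
    }
}
```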
On execution, you will see that an addition to the underlying modifiable list shows up as a modification of the unmodifiable list.
Note the difference, however, if you create an immutable list and then attempt to change the underlying list. There are many ways to create immutable lists from existing modifiable lists, and below, we use the List.copyOf() method.
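The following sketch shows the immutable case under the same assumptions as before: the class name, helper method, and coin names are illustrative, and List.copyOf() requires Java 10 or later.

```java
import java.util.ArrayList;
import java.util.List;

class ImmutableCopyDemo {

    // Hypothetical helper: returns an immutable snapshot of the source list.
    // Later changes to 'source' are not visible in the returned list.
    static List<String> snapshotOf(List<String> source) {
        return List.copyOf(source);
    }

    public static void main(String[] args) {
        // Illustrative names only (an assumption, not real ranking data).
        List<String> coins = new ArrayList<>(
                List.of("Bitcoin", "Ethereum", "Tether", "USD Coin", "BNB"));
        List<String> frozenCoins = snapshotOf(coins);

        // The underlying list changes, but the snapshot does not.
        coins.add("Cardano");
        System.out.println("Underlying list size: " + coins.size());
        System.out.println("Immutable copy size:  " + frozenCoins.size());

        // Direct modification of the immutable list is rejected.
        try {
            frozenCoins.add("Solana");
        } catch (UnsupportedOperationException e) {
            System.out.println("The immutable list cannot be modified");
        }
    }
}
```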
After modifying the underlying list, the immutable list does not display the change. And trying to modify the immutable list directly results in an UnsupportedOperationException.

Prior to the introduction of the JCF, developers could group objects using arrays or a few special classes, namely Vector and Hashtable. Unfortunately, these had significant limitations. In addition to lacking a common interface, they were difficult to extend.
The JCF provided an overarching common architecture for working with collections. The framework contains several different components, including interfaces, implementation classes, and algorithms.
The JCF offered developers many benefits compared to the prior object grouping methods. Notably, the JCF made Java programming more efficient by reducing the need for developers to write their own data structures.
But the JCF also fundamentally altered how developers worked with APIs. With a new common language for dealing with different APIs, the JCF made it simpler for developers to learn and design APIs and implement them. In addition, APIs became vastly more interoperable. An example is Eclipse Collections, an open source Java collections library fully compatible with different Java collections types.
Additional development efficiencies arose because the JCF provided structures that made it much easier to reuse code. As a result, development time decreased, and program quality increased.
The JCF has a defined hierarchy of interfaces. java.util.Collection extends the superinterface java.lang.Iterable. Within Collection there are numerous descendant interfaces and classes, including the Set, List, Queue, and Deque interfaces.

As noted previously, Sets are unordered groups of unique objects. Lists, on the other hand, are ordered collections that may contain duplicates. While you can add elements at any point in a list, the remainder of the order is maintained.
Queues are collections where elements are typically added at one end and removed from the other, i.e., they behave as first-in, first-out (FIFO) structures. Deques (double-ended queues) allow for the addition or removal of elements at either end.
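The differences between these interfaces can be sketched in a few lines. The class and method names below are hypothetical, and HashSet, ArrayList, and ArrayDeque are simply one reasonable choice of implementations.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

class CollectionKindsDemo {

    // A Set drops duplicates, so this returns the number of distinct items.
    static int distinctCount(List<String> items) {
        Set<String> unique = new HashSet<>(items);
        return unique.size();
    }

    // A Queue used in FIFO fashion hands elements back in insertion order.
    static List<String> drainFifo(List<String> items) {
        Queue<String> queue = new ArrayDeque<>();
        for (String item : items) {
            queue.offer(item);              // add at the tail
        }
        List<String> drained = new ArrayList<>();
        while (!queue.isEmpty()) {
            drained.add(queue.poll());      // remove from the head
        }
        return drained;
    }

    // A Deque can also be drained from the tail, which reverses the order.
    static List<String> drainFromTail(List<String> items) {
        Deque<String> deque = new ArrayDeque<>(items);
        List<String> drained = new ArrayList<>();
        while (!deque.isEmpty()) {
            drained.add(deque.pollLast());
        }
        return drained;
    }

    public static void main(String[] args) {
        List<String> letters = new ArrayList<>(List.of("a", "c", "a"));
        letters.add(1, "b");                // a List keeps order and duplicates
        System.out.println("List:           " + letters);
        System.out.println("Distinct count: " + distinctCount(letters));
        System.out.println("FIFO order:     " + drainFifo(letters));
        System.out.println("From the tail:  " + drainFromTail(letters));
    }
}
```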
Each interface in the JCF, including java.util.Collection, has specific methods available for accessing and manipulating individual elements of the collection. Among the more common methods used in collections are add(), remove(), contains(), size(), isEmpty(), and iterator().
Each subinterface may have additional methods as well. For example, although the Set interface includes only the methods from the Collection interface, the List interface has many additional methods based on accessing specific list elements by position, including get(), set(), add() and remove() with an index argument, and indexOf().
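As a brief illustration, the sketch below exercises a handful of Collection methods and then some of the index-based methods that List adds. The class name, method names, and sample values are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class CollectionMethodsDemo {

    // Uses only methods declared on java.util.Collection.
    static Collection<String> basicOperations() {
        Collection<String> names = new ArrayList<>();
        names.add("Ada");
        names.add("Grace");
        names.add("Linus");
        if (names.contains("Linus")) {
            names.remove("Linus");          // removes one matching element
        }
        return names;                       // now holds Ada and Grace
    }

    // Uses the positional methods that java.util.List adds.
    static List<String> positionalEdits() {
        List<String> ordered = new ArrayList<>(basicOperations());
        ordered.add(1, "Barbara");          // insert at index 1
        ordered.set(0, "Alan");             // replace the element at index 0
        ordered.remove(0);                  // remove by index
        return ordered;
    }

    public static void main(String[] args) {
        Collection<String> names = basicOperations();
        System.out.println("Size: " + names.size() + ", empty: " + names.isEmpty());

        List<String> ordered = positionalEdits();
        System.out.println("After positional edits: " + ordered);
        System.out.println("Index of Grace: " + ordered.indexOf("Grace"));
        System.out.println("Element at index 0: " + ordered.get(0));
    }
}
```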
As the size of collections grows, they can develop noticeable performance issues. And it turns out that the proper selection of collection types and associated collection design can also substantially affect performance.
The ever-increasing amount of data available to developers and applications led Java to introduce new ways to process collections to increase overall performance. In Java 8, released in 2014, Java introduced Streams – new functionality whose purpose was to simplify and increase the speed of bulk object processing. Since their introduction, Streams have had numerous improvements.
It is essential to understand that streams are not themselves data structures. Instead, as the java.util.stream package documentation puts it, the package provides "Classes to support functional-style operations on streams of elements, such as map-reduce transformations on collections."
Streams use pipelines of methods to process data received from a data source such as a collection. Every stream operation in a pipeline is either an intermediate method (one that returns a new stream that can be further processed) or a terminal method (after which no additional stream processing is possible). Intermediate methods in the pipeline are lazy; that is, they are not evaluated until a terminal method is invoked.
Both parallel and sequential execution options exist for streams. Streams are sequential by default.
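The laziness of intermediate methods can be observed with a counter. In the sketch below, which uses hypothetical class and method names, the filter predicate does not run when the pipeline is built; it runs only once the terminal collect() call starts pulling elements through a sequential stream.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyStreamDemo {

    // Builds a pipeline of two intermediate operations but does not run it.
    // 'filterCalls' counts how often the filter predicate is evaluated.
    static Stream<Integer> evenSquares(List<Integer> input, AtomicInteger filterCalls) {
        return input.stream()
                .filter(n -> {
                    filterCalls.incrementAndGet();
                    return n % 2 == 0;
                })
                .map(n -> n * n);
    }

    public static void main(String[] args) {
        AtomicInteger filterCalls = new AtomicInteger();
        Stream<Integer> pipeline = evenSquares(List.of(1, 2, 3, 4, 5, 6), filterCalls);

        // Nothing has been evaluated yet: no terminal operation has run.
        System.out.println("Filter calls before terminal operation: " + filterCalls.get());

        // collect() is a terminal operation; it triggers the whole pipeline.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("Result: " + result);
        System.out.println("Filter calls after terminal operation: " + filterCalls.get());
    }
}
```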

Processing large collections in Java can be cumbersome. While Streams simplified dealing with large collections and coding operations on them, they were no guarantee of improved performance; indeed, programmers frequently found that using Streams actually slowed processing.
Website users, in particular, will wait only a few seconds for a page to load before they move on out of frustration. So to provide the best possible customer experience and maintain a reputation for quality products, developers must consider how to optimize processing for large data collections. And while parallel processing cannot guarantee improved speeds, it is a promising place to start.
Parallel processing, i.e., breaking the processing task into smaller chunks and running them simultaneously, offers one way to reduce the processing time when dealing with large collections. But although parallel streams are simple to code, they can still decrease performance. In essence, the overhead associated with managing multiple threads can offset the benefits of running threads in parallel.
Because most collections are not thread-safe, parallel processing can result in thread interference or memory inconsistency errors (when parallel threads do not see changes made in other threads and therefore have differing views of the same data). The Collections Framework attempts to prevent these inconsistencies using synchronization wrappers. While a wrapper can make a collection thread-safe, so that it can be shared between parallel threads, it can have undesirable effects. Specifically, synchronization can cause thread contention, which can result in threads executing more slowly or ceasing execution.
Java has a native parallel processing function for collections: Collection.parallelStream(). One significant difference between the default sequential stream processing and parallel processing is that the order of execution and output, which is always the same when processing sequentially, can vary from execution to execution when using parallel processing.
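A synchronization wrapper can be sketched as follows. The class and method names are hypothetical; the example wraps an ArrayList with Collections.synchronizedList() and lets several threads add to it at once. Every add() call contends for the same lock, which is the source of the contention described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SynchronizedWrapperDemo {

    // Starts 'threads' workers that each add 'perThread' elements to 'target',
    // waits for all of them, and returns the final size of the list.
    static int fillConcurrently(List<Integer> target, int threads, int perThread) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    target.add(i);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for workers", e);
        }
        return target.size();
    }

    public static void main(String[] args) {
        // The wrapper serializes access, so no additions are lost.
        // A plain ArrayList shared this way could lose elements or throw.
        List<Integer> safe = Collections.synchronizedList(new ArrayList<>());
        System.out.println("Synchronized list size: " + fillConcurrently(safe, 4, 10_000));
    }
}
```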
As a result, parallel processing is particularly effective in situations where processing order does not affect the final output. However, in situations where the state of one thread can affect the state of another, parallel processing can create problems.
Consider a simple example where we create a list of current accounts receivable for 1,000 customers. We want to determine how many of those customers have receivables in excess of $25,000. We can perform this check either sequentially or in parallel, with differing processing speeds.
To set the example up for parallel processing, we will use the
Code execution demonstrates that parallel processing may lead to performance improvements when processing data collections:
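A minimal sketch of this comparison follows. It assumes randomly generated balances between $0 and $50,000; the class, method, and variable names are illustrative, not the article's original listing. Timing a single run with System.nanoTime() is a crude measure, so treat the printed numbers as indicative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class ReceivablesDemo {

    static final double THRESHOLD = 25_000.0;

    // Generates illustrative balances in the range [0, 50,000).
    static List<Double> randomReceivables(int customers, long seed) {
        Random random = new Random(seed);
        List<Double> balances = new ArrayList<>(customers);
        for (int i = 0; i < customers; i++) {
            balances.add(random.nextDouble() * 50_000.0);
        }
        return balances;
    }

    static long countSequential(List<Double> balances) {
        return balances.stream().filter(b -> b > THRESHOLD).count();
    }

    static long countParallel(List<Double> balances) {
        return balances.parallelStream().filter(b -> b > THRESHOLD).count();
    }

    public static void main(String[] args) {
        List<Double> balances = randomReceivables(1_000, 42L);

        long start = System.nanoTime();
        long sequentialCount = countSequential(balances);
        long sequentialNanos = System.nanoTime() - start;

        start = System.nanoTime();
        long parallelCount = countParallel(balances);
        long parallelNanos = System.nanoTime() - start;

        System.out.printf("Sequential: %d customers in %d ns%n",
                sequentialCount, sequentialNanos);
        System.out.printf("Parallel:   %d customers in %d ns%n",
                parallelCount, parallelNanos);
    }
}
```

Both methods return the same count because counting does not depend on processing order; only the elapsed time differs between runs.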

Note, however, that each time you execute the code, you will obtain different timings. In some instances, sequential processing will still outperform parallel processing.

In this example, we used Java’s native processes for splitting the data and assigning threads.
Unfortunately, Java’s native parallel processing is not always faster than sequential processing, and indeed, it is frequently slower.
As one example, parallel processing is not useful when dealing with linked lists. Whereas data sources like ArrayLists are simple to split for parallel processing, the same is not true of LinkedLists. TreeMaps and HashSets lie somewhere in between.
One method for making decisions about whether to utilize parallel processing is Oracle’s NQ model. In the NQ model, N represents the number of data elements to be processed. Q, in turn, is the amount of computation required per data element. In the NQ model, you calculate the product of N and Q, with higher numbers indicating a higher likelihood that parallel processing will lead to performance improvements.
When using the NQ model, there is an inverse relationship between N and Q. That is, the higher the amount of computing required per element, the smaller the data set can be for parallel processing to have benefits. A rule of thumb is that for low computational requirements, a minimum data set of 10,000 elements is the baseline for using parallel processing.
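The rule of thumb can be written down as a small helper. This is only a sketch: the class and method names are hypothetical, Q is expressed in arbitrary relative cost units where a trivial operation such as a comparison counts as roughly 1, and the threshold constant is the rule-of-thumb figure rather than an official value.

```java
class NqHeuristic {

    // Rule-of-thumb threshold; an assumption for illustration only.
    static final long THRESHOLD = 10_000L;

    // n: number of elements; q: relative cost of the work done per element.
    // Returns true when N * Q suggests parallel processing is worth measuring.
    static boolean worthTryingParallel(long n, long q) {
        return n * q >= THRESHOLD;
    }

    public static void main(String[] args) {
        // 1,000 cheap elements: likely too little work to amortize the overhead.
        System.out.println(worthTryingParallel(1_000, 1));
        // 10,000 cheap elements: at the baseline.
        System.out.println(worthTryingParallel(10_000, 1));
        // 100 expensive elements: a small N can be offset by a large Q.
        System.out.println(worthTryingParallel(100, 200));
    }
}
```

A result of true is a reason to benchmark the parallel version, not a guarantee that it will be faster.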
Although beyond the scope of this article, there are more advanced methods for optimizing parallel processing in Java collections. For example, advanced developers can adjust the partitioning of data elements in the collection to maximize parallel processing performance. There are also third-party add-ons and replacements for the JCF that can improve performance. Beginners and intermediate developers, however, should focus on understanding which operations will benefit from Java’s native parallel processing features for data collections.
In a world of big data, finding ways to improve the processing of large data collections is a must to create high-performing web pages and applications. Java provides built-in collection processing features that help developers improve data processing, including the Collections Framework and native parallel processing functions. Developers need to become familiar with how to use these features and understand when sequential processing is adequate and when they should shift to parallel processing.

InfoQ.com and all content copyright © 2006-2022 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we’ve ever worked with.