IBM Sterling B2B Integrator: Architecture and Production

IBM Sterling B2B Integrator is one of those platforms that move trillions of dollars in transactions every year and that almost nobody outside certain industries has heard of. In banking, finance, retail, and supply chain, Sterling is the piece that connects partners, processes EDI files, orchestrates transfers, and ensures data arrives where it needs to go.

I've spent years running it in production banking environments. What follows isn't IBM's official documentation. It's what I learned keeping the platform running when things didn't go as planned.

What Sterling is and why it matters

Sterling B2B Integrator is an enterprise integration platform built for B2B communications. Its core function is receiving, transforming, routing, and delivering data between organizations. It supports protocols like AS2, SFTP, FTPS, HTTP/S, MQ, and specific connectors for EDI standards (X12, EDIFACT).

In financial services, Sterling handles payment files, reconciliations, regulatory reports, and interbank communications. If Sterling goes down, transactions stop. It's not just another application — it's critical infrastructure.

Internal architecture: the pieces that matter

Sterling has three fundamental layers that every operator needs to understand:

Adapters: connectors to the outside world. Each protocol has its own adapter: SFTP Adapter, HTTP Adapter, MQ Adapter, File System Adapter. Adapters listen for incoming connections or initiate outbound ones. When an adapter fails, that communication channel is down. Monitoring adapter status is the first thing you do every morning.
Business Processes (BPs): workflows that orchestrate the data flow. A BP defines the steps: receive file, validate format, transform, route, deliver, notify. They're designed in the Graphical Process Modeler (GPM) and executed in the workflow engine. A stuck BP is the most common problem and the most frustrating to diagnose.
Database: the nerve center — and the weak point. Sterling stores everything in the database: configuration, BP state, in-transit documents, logs, partner metadata. A slow database turns Sterling into a slow system. A locked database turns it into a dead system.
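
The morning adapter check can be partly scripted. Here is a minimal Python sketch, assuming your inbound adapters listen on known TCP ports. It only confirms that something accepts connections on each port, which is a weaker signal than the adapter status shown in the console. The hostname and port in the usage comment are hypothetical.

```python
import socket
from typing import Dict, Tuple


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_listeners(listeners: Dict[str, Tuple[str, int]]) -> Dict[str, bool]:
    """Map each adapter label to whether its listener accepts connections."""
    return {
        name: port_is_open(host, port)
        for name, (host, port) in listeners.items()
    }


# Hypothetical usage; replace with your own hosts and ports:
# check_listeners({"sftp": ("sterling-node1.example.com", 10022)})
```

A successful connect does not prove the adapter can authenticate partners or hand files to a BP, so treat a True result as "not obviously down" rather than "healthy".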

On top of all this, Sterling File Gateway (SFG) acts as a simplified routing layer. SFG abstracts the complexity of BPs for the most common use case: moving files between partners. You define rules based on partner, protocol, format, and destination, and SFG handles the rest. For operations teams that don't build custom BPs, SFG is the primary interface with the platform.
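
To make the rule-matching idea concrete, here is a toy model in Python. It is not SFG's data model or API: `RoutingRule` and `pick_destination` are names invented for illustration, and first-match-wins ordering is an assumption of this sketch.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional, Sequence


@dataclass(frozen=True)
class RoutingRule:
    """Hypothetical rule: who sent it, over what protocol, matching which name."""
    partner: str
    protocol: str
    filename_pattern: str  # shell-style wildcard, e.g. "PAY_*.xml"
    destination: str


def pick_destination(
    rules: Sequence[RoutingRule], partner: str, protocol: str, filename: str
) -> Optional[str]:
    """Return the destination of the first matching rule, or None if none match."""
    for rule in rules:
        if (
            rule.partner == partner
            and rule.protocol == protocol
            and fnmatchcase(filename, rule.filename_pattern)  # case-sensitive
        ):
            return rule.destination
    return None
```

The matching here is deliberately case-sensitive. In this model, a file named pay_1.xml does not match the pattern PAY_*.xml, which is the kind of mismatch that looks like a platform problem and turns out to be configuration.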

Production operation: what the docs don't tell you

Running Sterling in production is an exercise in constant vigilance. These are the areas that demand daily attention:

BP monitoring: Sterling's console shows the status of each business process (Success, Error, Waiting, Interrupted). BPs that sit in "Waiting" longer than expected usually signal a problem. Set up alerts for BPs exceeding their normal execution time. Don't wait for someone to report a missing file.
Logs: Sterling generates logs across multiple locations. The most useful are BP logs (accessible from the console), system logs (noapp.log, system.log), and adapter logs. When something fails, correlate all three. The BP error tells you what failed; the system log tells you why.
JVM tuning: Sterling runs on the JVM, and its performance depends directly on memory configuration. A heap that's too small triggers frequent Full GCs. A heap that's too large extends GC pauses. The sweet spot depends on transaction volume, but in banking environments I've worked with 8-16 GB heaps and G1GC as the collector.
Connection pools: Sterling maintains connection pools to the database and external services. An exhausted pool blocks everything. Monitor pool usage, and size each pool according to the demand you actually observe rather than the generic defaults in IBM's documentation.
Purge schedules: Sterling accumulates data in the database: processed documents, BP logs, transaction metadata. Without periodic purging, the database grows until performance degrades. Configure aggressive purge schedules for data you no longer need. I've seen 500 GB databases that should have been 50 GB.
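
The alert on BPs exceeding their normal execution time can be sketched in a few lines of Python. This assumes you already have a way to list BP instances with their state and start time (console export, database query, or whatever your monitoring uses). That part is left out, and `BpInstance`, the BP names, and the 15-minute default are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Mapping


@dataclass(frozen=True)
class BpInstance:
    """Hypothetical snapshot of one running business process."""
    name: str
    state: str
    started_at: datetime


def overdue_waiting(
    instances: Iterable[BpInstance],
    normal_runtime: Mapping[str, timedelta],
    now: datetime,
    default: timedelta = timedelta(minutes=15),
) -> List[BpInstance]:
    """Return Waiting instances that have run longer than their normal runtime."""
    overdue = []
    for inst in instances:
        if inst.state != "Waiting":
            continue
        # Fall back to a default limit for BPs without a known baseline.
        limit = normal_runtime.get(inst.name, default)
        if now - inst.started_at > limit:
            overdue.append(inst)
    return overdue
```

Passing `now` in explicitly keeps the function easy to test. The per-BP baselines matter more than the code: a reconciliation BP that normally runs for an hour should not share a threshold with a payment file that normally clears in seconds.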

Real troubleshooting: the problems you will face

After years running Sterling, these are the recurring issues and how to approach them:

BP stuck in "Waiting": the most frequent scenario. A business process sits in Waiting state and won't advance. First, check which workflow step it stopped at. Then verify whether the required adapter is active and functional. If the adapter is fine, check for database locks blocking the transaction. As a last resort, a thread dump of the Java process will show whether there's thread contention.
File Gateway routing failures: SFG won't route a file. Check the routing channel configuration: correct partner, correct protocol, matching filename pattern. Routing errors are usually configuration errors, not platform bugs. Check SFG logs to see which rule was evaluated and why the file was rejected.
Database contention: Sterling makes heavy use of the database. Prolonged lock waits degrade everything. Identify the slowest queries, verify that indexes and their statistics are up to date, and make sure purges are running. In Oracle, check wait events. In DB2, look for lock escalations. The database is always the first suspect.
Thread pool exhaustion: Sterling has thread pools for processing BPs, adapters, and internal operations. If all threads are busy, new operations queue indefinitely. Capture a thread dump to see what threads are doing. If they're all waiting on database responses, the problem isn't the pool — it's the database.
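
Reading a thread dump by eye gets tedious once there are hundreds of threads. Here is a minimal Python sketch for summarizing one, assuming a HotSpot-style text dump (the layout `jstack` produces, with a `java.lang.Thread.State:` line per thread). IBM J9 javacore files use a different layout and would need a different parser. The driver package prefixes in the usage comment are assumptions; use whatever appears in your own stack traces.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as "   java.lang.Thread.State: WAITING (parking)".
STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+(\w+)", re.MULTILINE)


def count_thread_states(dump_text: str) -> Counter:
    """Count threads per JVM state (RUNNABLE, WAITING, BLOCKED, ...)."""
    return Counter(STATE_RE.findall(dump_text))


def count_threads_in(dump_text: str, markers: Iterable[str]) -> int:
    """Count thread stacks containing any of the given substrings."""
    markers = tuple(markers)
    # Thread entries are separated by blank lines and start with a quoted name.
    blocks = re.split(r"\n\s*\n", dump_text)
    return sum(
        1
        for block in blocks
        if block.lstrip().startswith('"') and any(m in block for m in markers)
    )


# Hypothetical usage against a dump saved to disk:
# text = open("threads.txt").read()
# print(count_thread_states(text))
# print(count_threads_in(text, ("oracle.jdbc", "com.ibm.db2")))
```

If most worker threads have JDBC driver frames in their stacks, that supports the "it's the database, not the pool" diagnosis. Take two or three dumps a few seconds apart before concluding anything, since a single dump only shows one instant.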

High availability: how not to depend on a single instance

Sterling supports active-active configurations with load balancing. The typical architecture includes:

Two or more Sterling nodes behind a load balancer to distribute incoming connections.
Shared storage (NFS, GPFS, or enterprise storage) so all nodes access the same files.
Clustered database (Oracle RAC, DB2 HADR) as the shared persistence layer.
Session affinity configured on the balancer so long-running BPs don't hop between nodes.

The critical point in HA is the database. If the database fails, every Sterling node fails with it. It doesn't matter how many application nodes you have: if they share a database without redundancy, you don't have real high availability. Database failover must be tested, automated, and rehearsed periodically.

Sterling is as robust as the attention you give it in production. It's not a system you configure once and forget. It's critical infrastructure that demands constant monitoring, disciplined purging, and a team that understands its internal architecture.