Asyncio vs Multiprocessing for Payment Ingestion: Engineering Deterministic Throughput

At 02:00 UTC a reconciliation service pulls a 1.4 GB NACHA file off an SFTP gateway, and an engineer has to decide — before the batch is written — whether each record is decoded on the asyncio event loop or handed to a separate OS process. Choose wrong and the container OOM-kills mid-file, a Fedwire acknowledgement times out, and the 14-minute gap that follows eats into a Reg E investigation clock. This page sits inside the Async Batch Processing Architectures design and, within the broader Automated File Ingestion & Parsing Pipelines framework, isolates one surgical question: which concurrency primitive owns which half of a payment ingestion workload, and how to wire them together so neither starves the other.

The short answer is that asyncio and multiprocessing are not competitors — they own different resources. asyncio owns file descriptors and sockets; multiprocessing owns CPU cores. A production payment pipeline needs both, split precisely along the line where network waiting ends and byte-level work begins. Getting that line right is what separates a pipeline that clears its settlement window from one that stalls behind its own fixed-width NACHA decoding.

Concept Spec: The GIL, the Event Loop, and the Memory Model

asyncio is a single-threaded cooperative scheduler. It excels at I/O-bound work — polling SFTP, streaming chunked HTTP from correspondent banks, awaiting SWIFT gpi acknowledgements — because every await yields the one thread back to the reactor while the kernel waits on the socket. Thousands of concurrent connections cost only their buffers. But the event loop runs Python bytecode under the Global Interpreter Lock (GIL), so any CPU work executed inside a coroutine holds that single thread hostage. Slice a fixed-width record, compile a regex, instantiate a Pydantic model, or sum a decimal.Decimal column directly in a coroutine and every other socket freezes until it returns.
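The stall described above can be observed directly. The following is a minimal sketch, not part of the pipeline: `decode_block` and `heartbeat` are hypothetical stand-ins for record decoding and a socket handler. It uses the default thread executor only so that it runs anywhere; that keeps the loop responsive but still contends for the GIL, which is why the real pipeline uses a process pool.

```python
import asyncio
from typing import List, Tuple


def decode_block(n: int) -> int:
    """Stand-in for CPU-bound record decoding (hypothetical workload)."""
    total = 0
    for i in range(n):
        total += i % 7
    return total


async def heartbeat(ticks: List[int]) -> None:
    """Pretend socket handler: records a tick each time the loop schedules it."""
    while True:
        ticks.append(1)
        await asyncio.sleep(0)


async def compare_inline_vs_offloaded(n: int = 200_000) -> Tuple[int, int]:
    ticks: List[int] = []
    hb = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # let the heartbeat start

    before = len(ticks)
    decode_block(n)  # inline: no await, so the heartbeat cannot run
    inline_ticks = len(ticks) - before

    before = len(ticks)
    loop = asyncio.get_running_loop()
    # Offloaded: the coroutine suspends, so the loop keeps scheduling the heartbeat.
    await loop.run_in_executor(None, decode_block, n)
    offloaded_ticks = len(ticks) - before

    hb.cancel()
    return inline_ticks, offloaded_ticks
```

Running `asyncio.run(compare_inline_vs_offloaded())` should report zero heartbeat ticks during the inline call and at least one during the offloaded call.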

multiprocessing (via concurrent.futures.ProcessPoolExecutor) sidesteps the GIL by running isolated interpreter processes, each with its own memory space and its own GIL, which the OS can schedule onto separate cores. That is exactly where CPU-bound payment work belongs: positional decoding, hash-based keying, strict schema validation, cent arithmetic. The cost is that every argument and return value crossing a process boundary is pickle-serialized and copied, so payloads must stay small.

The memory model is the crux. A naive asyncio.gather over every record materializes $O(n)$ coroutines and their payloads before a single one runs, and naive multiprocessing hands each of $w$ workers the whole file, for $O(w \cdot n)$ resident bytes. The bounded hybrid below reads the file as a generator of record-aligned chunks and caps in-flight work with a semaphore, so peak resident memory is $O(w \cdot c)$ for $w$ workers and chunk size $c$, independent of file size. Time stays a single streaming pass at $O(n)$, divided across cores by the pool.

The Failure This Prevents: OOM During a Peak ACH Window

The anti-pattern that forces this design is spawning asyncio.gather over tens of thousands of parse coroutines. Concretely: a team wraps parse_record in an async def, fires 500,000 of them into gather, and every coroutine plus its raw line lands on the heap before the loop schedules the first. Worse, parse_record is pure CPU, so once it does run it blocks the reactor — TCP keep-alives lapse, the correspondent bank's socket drops, and the download that was feeding records dies mid-transfer. Memory climbs linearly to the 4 GB container ceiling and Kubernetes issues an OOMKilled. Three distinct bugs stack: unbounded task creation exhausts the heap, CPU work starves the event loop, and GIL contention prevents the surviving I/O from progressing. The fix is not "add more workers" — it is to move CPU work off the loop entirely and bound the queue.
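To make "bound the queue" concrete, here is a minimal sketch of bounded dispatch as an alternative to one large gather. `bounded_map` is a hypothetical helper, not part of the pipeline, and it collects results in a list only for brevity; the full implementation yields them instead so they never accumulate.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def bounded_map(
    func: Callable[[T], Awaitable[R]], items: Iterable[T], limit: int
) -> List[R]:
    """Run func over items with at most `limit` tasks alive at once."""
    pending: set = set()
    results: List[R] = []
    for item in items:
        if len(pending) >= limit:
            # Back-pressure: stop creating tasks until one finishes.
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            results.extend(task.result() for task in done)
        pending.add(asyncio.ensure_future(func(item)))
    if pending:
        done, _ = await asyncio.wait(pending)
        results.extend(task.result() for task in done)
    return results
```

Results arrive in completion order, not input order, so a caller that needs ordering would have to carry an index alongside each item.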

Full Annotated Python Implementation

The pattern streams the file with mmap (no full-file allocation), aligns chunks to record boundaries, and dispatches each chunk to the process pool through a semaphore. Results are yielded as workers finish, so decoded records never accumulate on the heap. Amounts stay integer cents across the process boundary — never float — and are only widened to decimal.Decimal for aggregation on the far side.

python

import asyncio
import mmap
from concurrent.futures import ProcessPoolExecutor
from typing import AsyncIterator, List, Dict, Any


# CPU-bound worker: runs in an isolated process, so it never touches the loop
# and is free of GIL contention with the network layer.
def parse_ach_chunk(chunk_bytes: bytes) -> List[Dict[str, Any]]:
    """Decode a record-aligned slice of a NACHA file into entry-detail dicts.

    Strictly CPU-bound: ASCII decode, positional slicing, integer-cent coercion.
    Monetary values stay as integer cents here; widen to decimal.Decimal only
    at the aggregation boundary, never to float.
    """
    records: List[Dict[str, Any]] = []
    for line in chunk_bytes.decode("ascii").splitlines():
        if len(line) < 94 or line[0:1] != "6":  # Entry Detail records only
            continue
        # NACHA Entry Detail positional layout (0-indexed slices):
        #   [1:3]   transaction code
        #   [3:12]  receiving DFI routing number (incl. check digit)
        #   [12:29] account number
        #   [29:39] amount in cents (zero-padded, no decimal point)
        records.append({
            "transaction_code": line[1:3],
            "routing_number": line[3:12],
            "account_number": line[12:29].strip(),
            "amount_cents": int(line[29:39]),
            "individual_name": line[54:76].strip(),
        })
    return records


class PaymentIngestionPipeline:
    """Streams a payment file through an asyncio loop into a CPU-bound pool."""

    def __init__(self, file_path: str, chunk_size: int = 1 << 20, max_workers: int = 4) -> None:
        self.file_path = file_path
        self.chunk_size = chunk_size
        self.max_workers = max_workers
        self.executor = ProcessPoolExecutor(max_workers=max_workers)
        # Cap in-flight chunks at twice the worker count so the reader blocks
        # (back-pressure) instead of queueing the whole file into memory.
        self.semaphore = asyncio.Semaphore(max_workers * 2)

    async def _stream_chunks(self) -> AsyncIterator[bytes]:
        """Yield newline-aligned byte chunks via mmap — no full-file RAM load."""
        with open(self.file_path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                offset = 0
                size = mm.size()
                while offset < size:
                    end = min(offset + self.chunk_size, size)
                    # Extend to the next newline so a 94-byte record is never
                    # split across two workers.
                    newline = mm.find(b"\n", end)
                    end = size if newline == -1 else newline
                    yield bytes(mm[offset:end])
                    offset = end + 1

    async def run(self) -> AsyncIterator[List[Dict[str, Any]]]:
        """Dispatch chunks to the pool and stream decoded batches as they land."""
        loop = asyncio.get_running_loop()
        pending: set = set()
        async for chunk in self._stream_chunks():
            await self.semaphore.acquire()  # blocks the reader under back-pressure
            fut = loop.run_in_executor(self.executor, parse_ach_chunk, chunk)
            fut.add_done_callback(lambda _f: self.semaphore.release())
            pending.add(fut)
            if len(pending) >= self.max_workers * 2:
                done, pending = await asyncio.wait(
                    pending, return_when=asyncio.FIRST_COMPLETED
                )
                for f in done:
                    yield f.result()  # freed from the heap once consumed
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for f in done:
                yield f.result()

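    def close(self) -> None:
        """Release the pool without blocking the caller.

        wait=False returns immediately; chunks already submitted still run to
        completion. Use wait=True to drain before exit, or cancel_futures=True
        (Python 3.9+) to drop chunks that have not started.
        """
        self.executor.shutdown(wait=False)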

Every property from the concept spec is enforced here: mmap keeps file bytes out of the heap, newline alignment protects record integrity, the semaphore makes the reader itself the back-pressure valve, and asyncio.wait(..., FIRST_COMPLETED) streams results so decoded records are garbage-collected the moment the caller consumes them.
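The newline-alignment rule is the easiest part to test in isolation. The sketch below restates the boundary logic of _stream_chunks over a plain bytes object; `iter_aligned_chunks` is a hypothetical helper introduced only for this check, and it assumes records are terminated by a single "\n".

```python
from typing import Iterator


def iter_aligned_chunks(data: bytes, chunk_size: int) -> Iterator[bytes]:
    """Yield slices of data that always end on a record boundary."""
    offset, size = 0, len(data)
    while offset < size:
        end = min(offset + chunk_size, size)
        # Extend to the next newline so no record straddles two chunks.
        newline = data.find(b"\n", end)
        end = size if newline == -1 else newline
        yield data[offset:end]
        offset = end + 1  # skip the newline that closed this chunk
```

With ten 94-byte records and a deliberately awkward 200-byte chunk size, every chunk should still contain only whole records.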

Calibration & Configuration

The knobs are chunk_size, max_workers, and the semaphore multiplier, and they tune differently per payment rail.

ACH batch files (large, uniform 94-byte records): favor throughput. Set max_workers = os.cpu_count() - 1, use a 1–4 MB chunk_size to amortize pickle overhead across many records per dispatch, and keep the semaphore at 2 * max_workers. Confirm peak RSS stays at or below 70% of the container limit against a real peak-window file.
Wire messages (Fedwire, low count, high value): favor latency and priority. Shrink chunk_size so a high-value wire is not stuck behind a full ACH chunk, and consider a dedicated priority pool or worker lane so cut-off-critical wires bypass bulk reconciliation. The pickle cost is negligible at wire volumes.
ISO 20022 (pain.001 / camt.053, nested XML): the CPU cost per message is far higher (XML tree parsing, not slicing), so lower max_workers if each message is memory-heavy and keep chunk payloads well under a few megabytes — the serialization boundary punishes large XML blobs. The ISO 20022 vs legacy format tradeoffs shape how much work each worker actually does.

As a sizing rule of thumb, peak resident memory tracks $2 \cdot w \cdot c$: doubling workers or chunk size doubles the memory floor, while adding workers cuts wall-clock time only until you saturate cores or the source socket.
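The rule of thumb is simple enough to compute before a deployment. This sketch assumes 94-byte records with a one-byte newline terminator; the helper names are illustrative only.

```python
RECORD_BYTES = 94 + 1  # 94-byte NACHA record plus an assumed single "\n"


def in_flight_chunk_bytes(workers: int, chunk_size: int, multiplier: int = 2) -> int:
    """Raw chunk bytes held at once when the semaphore is multiplier * workers."""
    return multiplier * workers * chunk_size


def records_per_chunk(chunk_size: int, record_bytes: int = RECORD_BYTES) -> int:
    """Whole records that fit in one chunk of the given size."""
    return chunk_size // record_bytes
```

For 7 workers and 2 MiB chunks this gives about 28 MiB of raw chunk bytes in flight. Decoded Python dicts and per-process interpreter overhead sit on top of that figure, so measured RSS will be considerably higher; treat the formula as a floor and profile a real file.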

Validation Example: Before and After

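Consider a 1.4 GB file of ~15 million Entry Detail records, one shaped like this (routing 021000021, account 0001234567, amount $1,250.00; the 15-character individual identification field is left blank and every field is padded to its full width, for 94 characters in total):

code

6270210000210001234567       0000125000               ACME PAYROLL LLC        0091000010000001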
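Fixed-width layouts are easy to misalign by a single space, so one way to double-check a sample record is to assemble it field by field and decode it with the same slices parse_ach_chunk uses. This is a sketch under assumptions: `build_entry_detail` and `decode_entry_detail` are hypothetical helpers, and the individual identification and discretionary fields are simply left blank.

```python
from typing import Any, Dict


def build_entry_detail(
    transaction_code: str, routing: str, account: str,
    amount_cents: int, name: str, trace: str,
) -> str:
    """Assemble one 94-character Entry Detail line, field by field."""
    line = (
        "6"                        # record type code (1)
        + transaction_code         # transaction code (2)
        + routing                  # receiving DFI routing incl. check digit (9)
        + account.ljust(17)[:17]   # account number, space-padded (17)
        + f"{amount_cents:010d}"   # amount in cents, zero-padded (10)
        + " " * 15                 # individual ID, blank here (assumption)
        + name.ljust(22)[:22]      # individual name (22)
        + " " * 2                  # discretionary data, blank (2)
        + "0"                      # addenda record indicator (1)
        + trace                    # trace number (15)
    )
    if len(line) != 94:
        raise ValueError(f"expected 94 characters, got {len(line)}")
    return line


def decode_entry_detail(line: str) -> Dict[str, Any]:
    """Same slices as parse_ach_chunk, applied to a single line."""
    return {
        "transaction_code": line[1:3],
        "routing_number": line[3:12],
        "account_number": line[12:29].strip(),
        "amount_cents": int(line[29:39]),
        "individual_name": line[54:76].strip(),
    }
```

Calling `build_entry_detail("27", "021000021", "0001234567", 125000, "ACME PAYROLL LLC", "091000010000001")` and passing the result to `decode_entry_detail` should round-trip the routing number, account, name, and an amount of 125000 cents.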

Before (single asyncio.gather over per-record coroutines): the loop queues ~15M coroutines and their lines onto the heap; RSS crosses 4 GB in under 90 seconds and the process is OOMKilled at roughly record 6.1M. Zero records committed, and the raw file must be re-ingested next cycle — a Reg E timeline hit.

After (bounded hybrid above, 7 workers, 2 MB chunks): the semaphore holds at most 14 chunks in flight. mmap keeps the 1.4 GB out of the heap; peak RSS settles near 900 MB. Workers decode in parallel and run() yields one batch per chunk, up to roughly 22,000 records each (2 MiB divided by 95 bytes per newline-terminated record), which a downstream consumer streams into the transaction matching and reconciliation algorithms. The full file clears in a single $O(n)$ pass inside the batch window, with the amount surfacing as amount_cents = 125000, an exact integer with no float rounding.
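The integer-cents discipline can be sketched in a few lines. `total_as_decimal` is a hypothetical aggregation helper, shown only to illustrate summing in integers and widening once at the edge.

```python
from decimal import Decimal
from typing import Iterable


def total_as_decimal(amounts_cents: Iterable[int]) -> Decimal:
    """Sum integer cents exactly, then widen once to Decimal dollars."""
    return Decimal(sum(amounts_cents)) / 100
```

Summing ten 10-cent entries this way gives exactly one dollar, whereas the float equivalent `sum([0.1] * 10)` does not compare equal to 1.0.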

Failure Modes & Guardrails

Three edge cases silently corrupt payment data if left unguarded.

Non-ASCII bytes in addenda payloads. chunk_bytes.decode("ascii") raises UnicodeDecodeError the instant a legacy export smuggles a 0xA0 or accented name into a remittance field, killing the whole chunk. Catch it in the worker and route the offending line to a dead-letter queue rather than aborting; the systematic handling of this belongs in handling encoding drift in legacy bank exports.
A semaphore that never releases on worker crash. The add_done_callback release runs whether the future completes normally or with an exception, so a failed chunk still frees its slot. The real hazard is swallowing the failure: if a worker process dies outright, ProcessPoolExecutor marks the pool broken and every pending future raises BrokenProcessPool. Always consume fut.result() (as run() does) so that exception surfaces and the pipeline stops instead of silently dropping chunks.
float creeping into cent arithmetic. int(line[29:39]) is safe, but the moment a downstream step does amount_cents / 100 for display you have introduced binary floating-point. Keep money as integer cents through the process boundary and convert with decimal.Decimal(amount_cents) / 100 only at the aggregation edge; a single stray float sum across millions of records drifts reconciliation totals by cents that never balance.
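For the first guardrail, one possible shape is to decode line by line inside the worker and set aside anything that is not clean ASCII. This is a sketch only: `decode_lines_with_dead_letter` is a hypothetical helper, and the "dead-letter queue" is represented here by a plain list that the caller would forward elsewhere.

```python
from typing import List, Tuple


def decode_lines_with_dead_letter(
    chunk_bytes: bytes,
) -> Tuple[List[str], List[Tuple[int, bytes]]]:
    """Split a chunk into decoded lines and (line_index, raw_bytes) rejects."""
    good: List[str] = []
    dead: List[Tuple[int, bytes]] = []
    for index, raw in enumerate(chunk_bytes.split(b"\n")):
        raw = raw.rstrip(b"\r")  # tolerate CRLF exports
        if not raw:
            continue
        try:
            good.append(raw.decode("ascii"))
        except UnicodeDecodeError:
            # Keep the raw bytes so the line can be inspected or replayed.
            dead.append((index, raw))
    return good, dead
```

Decoding per line costs a little more than one decode over the whole chunk, but a single bad byte then rejects one record instead of the tens of thousands that share its chunk.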

Frequently Asked Questions

Why not just add threads instead of processes?

Threads share one interpreter and the GIL, so CPU-bound decoding still serializes onto a single core — you gain nothing for byte-level work. ThreadPoolExecutor only helps when the "CPU" work is actually a C extension that releases the GIL (some Arrow or NumPy paths). For pure-Python fixed-width slicing and Pydantic validation, use processes.

How big should each chunk be?

Large enough that pickle overhead is amortized over many records, small enough that 2 * max_workers chunks fit in memory. For 94-byte ACH records, 1–4 MB (roughly 11k–44k records per chunk) is a good starting band; profile peak RSS against a real file and adjust down if you approach the container ceiling.

The event loop periodically freezes even with the pool — why?

Something CPU-bound is still running inside a coroutine. Set PYTHONASYNCIODEBUG=1: in debug mode asyncio logs every callback that runs longer than loop.slow_callback_duration (0.1 s by default). Each logged callback is work that leaked onto the loop, usually a validation or Decimal sum called directly instead of dispatched through run_in_executor.
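Debug mode can also be switched on in code, which is convenient in a test. The sketch below is illustrative: `leaky_coroutine` is a hypothetical offender that blocks with time.sleep as a stand-in for CPU work, and the threshold is lowered to 10 ms only to keep the example fast.

```python
import asyncio
import logging
import time
from typing import List


class _ListHandler(logging.Handler):
    """Collects formatted log messages so they can be inspected afterwards."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


async def leaky_coroutine() -> None:
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.01  # lowered from the 0.1 s default
    time.sleep(0.05)  # blocking call on the loop, standing in for CPU work
    await asyncio.sleep(0)


def find_slow_callbacks() -> List[str]:
    """Run the coroutine in debug mode and return the slow-callback warnings."""
    handler = _ListHandler()
    logger = logging.getLogger("asyncio")
    logger.addHandler(handler)
    try:
        asyncio.run(leaky_coroutine(), debug=True)
    finally:
        logger.removeHandler(handler)
    return [m for m in handler.messages if "took" in m]
```

Each returned message names the task or handle that overran, which points at the coroutine that should be moved behind run_in_executor.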

Can I share the mmap across worker processes?

You pass byte slices, not the mmap object — each bytes(mm[offset:end]) is copied into the pickled argument. For truly zero-copy hand-off use multiprocessing.shared_memory and pass offsets, but for most ACH volumes the copy is cheaper than the coordination it replaces.
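A minimal sketch of the offset-passing idea, assuming Python 3.8+ and a platform where multiprocessing.shared_memory is available. `publish` and `read_slice` are hypothetical helpers; in a real pipeline read_slice would run inside the worker, receiving only the block name and two integers.

```python
from multiprocessing import shared_memory


def publish(data: bytes) -> shared_memory.SharedMemory:
    """Copy data into a new shared block once; the caller owns close/unlink."""
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    shm.buf[: len(data)] = data
    return shm


def read_slice(name: str, start: int, end: int) -> bytes:
    """Attach to an existing block by name and copy out one record-aligned slice."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return bytes(shm.buf[start:end])
    finally:
        shm.close()  # detach only; the owner is responsible for unlink()
```

The owner must call close() and unlink() when the file is done, otherwise the block outlives the process. That lifecycle bookkeeping is the coordination cost mentioned above.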

Related

Async Batch Processing Architectures: the parent design that assembles this I/O-vs-CPU split into a full staged pipeline with exception routing and audit.
Fixed-Width File Decoding: the positional slicing each CPU worker performs once a chunk crosses the process boundary.
Optimizing pandas read_fwf for 1 GB NACHA files: vectorized, memory-tuned decoding when a columnar engine beats per-worker Python loops.
Validating NACHA Addenda Records with Pydantic: the strict validation gate that runs inside the worker after decoding.
