Every team using Kafka eventually hits the same moment: a deploy rolls out, consumers restart, and for a few seconds nothing gets processed. That gap is a rebalance — and understanding what's happening during it is the difference between a non-event and a production incident.

What a consumer group actually is

A consumer group is just a named set of consumers that split the partitions of a topic between them so no two consumers in the group read the same partition at the same time. Kafka's own documentation covers the mechanics in depth — see the Kafka Consumer Groups guide for the canonical explanation.

Here's the mental model that matters day-to-day:

Partitions   Consumers   Result
6            3           2 partitions per consumer
6            6           1 partition per consumer
6            8           6 active, 2 sit idle

That last row surprises people the most: within a group, Kafka never assigns one partition to two consumers at the same time, so any consumers beyond the partition count sit idle.
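The arithmetic behind that table can be sketched in a few lines of Java. This is only an even-split model for a single topic, not Kafka's actual assignor code, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: models an even split of one topic's partitions.
final class PartitionSplit {
    // Returns how many partitions each consumer owns when partitions are
    // spread as evenly as possible; consumers past the partition count get 0.
    static List<Integer> split(int partitions, int consumers) {
        if (consumers <= 0) {
            throw new IllegalArgumentException("need at least one consumer");
        }
        List<Integer> counts = new ArrayList<>();
        int base = partitions / consumers;   // what everyone gets
        int extra = partitions % consumers;  // leftovers, one each to the first few
        for (int i = 0; i < consumers; i++) {
            counts.add(base + (i < extra ? 1 : 0));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(split(6, 3)); // [2, 2, 2]
        System.out.println(split(6, 6)); // [1, 1, 1, 1, 1, 1]
        System.out.println(split(6, 8)); // [1, 1, 1, 1, 1, 1, 0, 0]
    }
}
```

The two zeros in the last line are the idle consumers: they are group members, but they own nothing to read.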

What triggers a rebalance

  • A consumer joins or leaves the group (deploy, crash, scale event)
  • The group coordinator receives no heartbeat from a consumer for longer than session.timeout.ms
  • Topic partition count changes
  • A consumer goes longer than max.poll.interval.ms between poll() calls, so it drops out of the group and its partitions are reassigned

That last one is the sneaky one: it's not a crash, just slow processing, yet the rest of the group sees the same thing it would for a dead consumer, namely a member gone and a rebalance.
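One way to keep the two timers apart is to model them as independent checks. In the Java client, heartbeats come from a background thread, so they can stay healthy while the processing thread is stuck between poll() calls. The sketch below is a simplified model with hypothetical names, not the client's real implementation.

```java
// Hypothetical model of the two liveness checks; not Kafka client code.
final class LivenessCheck {
    enum Status { HEALTHY, SESSION_EXPIRED, POLL_INTERVAL_EXCEEDED }

    // msSinceHeartbeat: time since the coordinator last saw a heartbeat.
    // msSincePoll: time since the application last called poll().
    static Status check(long msSinceHeartbeat, long msSincePoll,
                        long sessionTimeoutMs, long maxPollIntervalMs) {
        if (msSinceHeartbeat > sessionTimeoutMs) {
            return Status.SESSION_EXPIRED;
        }
        if (msSincePoll > maxPollIntervalMs) {
            return Status.POLL_INTERVAL_EXCEEDED;
        }
        return Status.HEALTHY;
    }

    public static void main(String[] args) {
        // Heartbeat seen 3 s ago, but the last poll() was 6 minutes ago.
        System.out.println(check(3_000, 360_000, 10_000, 300_000));
        // prints POLL_INTERVAL_EXCEEDED
    }
}
```

Either non-healthy status ends the same way for the group: the member is removed and a rebalance follows.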

A minimal Spring Kafka listener

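import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;

// Lives inside a Spring bean that provides the log and orderService fields.
// concurrency = "3" runs three consumers in the same group within this instance.
@KafkaListener(
    topics = "order-events",
    groupId = "order-processor",
    concurrency = "3"
)
public void handleOrderEvent(ConsumerRecord<String, OrderEvent> record) {
    OrderEvent event = record.value();
    log.info("Processing order {} from partition {}", event.getOrderId(), record.partition());
    orderService.process(event);
}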

And the consumer config that most directly affects rebalance behavior:

# how long the group coordinator waits without a heartbeat before declaring a consumer dead
session.timeout.ms=10000

# max time allowed between poll() calls before the consumer is considered stuck and leaves the group
max.poll.interval.ms=300000

# how many records one poll() can return — lower this if your processing is slow
max.poll.records=200

If you're seeing rebalances correlate with slow downstream calls (a database, an external API), max.poll.interval.ms is usually the first thing to check — not session.timeout.ms.
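A quick back-of-the-envelope check makes the relationship concrete. The sketch below assumes records in a batch are processed one after another and that each takes roughly the same time; the 2-second per-record latency is an invented figure, and the helper names are hypothetical.

```java
// Hypothetical helper: compares worst-case batch time with max.poll.interval.ms.
final class PollBudget {
    // Worst-case time to process one poll() batch if every record takes perRecordMs.
    static long worstCaseBatchMs(int maxPollRecords, long perRecordMs) {
        return (long) maxPollRecords * perRecordMs;
    }

    // Largest max.poll.records that still fits inside max.poll.interval.ms.
    // perRecordMs must be greater than zero.
    static long maxSafeRecords(long maxPollIntervalMs, long perRecordMs) {
        return maxPollIntervalMs / perRecordMs;
    }

    public static void main(String[] args) {
        long batchMs = worstCaseBatchMs(200, 2_000);
        System.out.println(batchMs);                        // 400000
        System.out.println(batchMs > 300_000);              // true: over the limit
        System.out.println(maxSafeRecords(300_000, 2_000)); // 150
    }
}
```

With the config above, a downstream call that degrades to 2 seconds per record pushes a full batch of 200 past the 300-second limit. Lowering max.poll.records to 150 or fewer, or raising max.poll.interval.ms, would bring it back inside the budget under these assumptions.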

Seeing it in action

Confluent has a short walkthrough of the consumer group protocol that pairs well with everything above. It's worth watching if you want to see the JoinGroup/SyncGroup handshake shown visually rather than just described in text.

The rebalance strategies that matter

Kafka has evolved through several partition assignment strategies. The two worth knowing:

  1. Eager rebalancing (the original behavior, and still what RangeAssignor uses): every consumer gives up all its partitions before reassignment, causing a brief full stop across the whole group.
  2. Cooperative sticky rebalancing: consumers only give up the specific partitions that need to move, so unaffected consumers keep processing throughout.
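The difference between the two strategies can be illustrated with a toy model of which partitions stop being processed during a rebalance. This is a simplified sketch with hypothetical names and a hand-picked "after" assignment; it does not reproduce the real protocol or the assignor's placement logic.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model: which partitions are revoked under each strategy.
final class RevocationModel {
    // Eager: every currently owned partition is revoked before reassignment.
    static Set<Integer> revokedEager(Map<Integer, String> before) {
        return new TreeSet<>(before.keySet());
    }

    // Cooperative: only partitions whose owner changes are revoked.
    static Set<Integer> revokedCooperative(Map<Integer, String> before,
                                           Map<Integer, String> after) {
        Set<Integer> moved = new TreeSet<>();
        for (Map.Entry<Integer, String> e : before.entrySet()) {
            if (!e.getValue().equals(after.get(e.getKey()))) {
                moved.add(e.getKey());
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        // Two consumers own 6 partitions; a third (c3) joins and takes 2 and 5.
        Map<Integer, String> before =
            Map.of(0, "c1", 1, "c1", 2, "c1", 3, "c2", 4, "c2", 5, "c2");
        Map<Integer, String> after =
            Map.of(0, "c1", 1, "c1", 2, "c3", 3, "c2", 4, "c2", 5, "c3");
        System.out.println(revokedEager(before));              // [0, 1, 2, 3, 4, 5]
        System.out.println(revokedCooperative(before, after)); // [2, 5]
    }
}
```

In this example the eager model pauses all six partitions, while the cooperative model pauses only the two that actually change hands.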

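If your clients are on Kafka 2.4 or later (the release that introduced it) but still use an eager assignor, switching to CooperativeStickyAssignor is usually a bigger win than tuning any timeout. Moving a live group from eager to cooperative needs a staged rollout, so check the Kafka upgrade notes before flipping it: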

partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor

For the full background on why this strategy exists, KIP-429 is worth reading directly from the source.

Takeaway

Most "Kafka is flaky" complaints are actually rebalance side effects, not broker problems. Before touching infrastructure, check whether processing time is creeping past max.poll.interval.ms, and whether you're still on eager rebalancing when cooperative sticky has been stable for years.

