- Upgrade often to pick up bug fixes and improvements, and follow the upgrade guide carefully. Start from a healthy cluster and upgrade components outward: ZooKeeper first, then the Kafka brokers, then the clients. Do not rush the process, and do not start while any partition reassignments are unresolved.
- Collect JMX metrics to monitor the cluster; without that visibility, outages can be prolonged. The Kafka defaults are suitable for single-node deployments, but the replication factor, thread counts, and other broker configuration should be tuned for larger clusters.
- Use quotas, such as replication throttling and bandwidth or request limits per client or topic, to protect the cluster and its clients. Write each component's logs to a separate file and retain them for a few days. Consider running separate clusters for workloads with different SLAs.
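
The "start with a healthy cluster" rule can be turned into an explicit gate before an upgrade. The following is a minimal sketch, not an official tool: the function `upgrade_blockers` and the shape of its inputs are hypothetical, and it assumes the JMX values have already been collected per broker by whatever exporter is in use. The three MBean names are the ones Kafka brokers expose for under-replicated partitions, offline partitions, and the active controller count; the count of pending reassignments is assumed to be obtained separately.

```python
# Hypothetical pre-upgrade gate: the function name and input shape are
# illustrative and not part of Kafka or any Kafka client library.

# JMX MBean names exposed by Kafka brokers.
UNDER_REPLICATED = "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"
OFFLINE_PARTITIONS = "kafka.controller:type=KafkaController,name=OfflinePartitionsCount"
ACTIVE_CONTROLLERS = "kafka.controller:type=KafkaController,name=ActiveControllerCount"


def upgrade_blockers(broker_metrics, pending_reassignments):
    """Return a list of reasons not to start an upgrade (empty means go).

    broker_metrics: dict of broker id -> dict of MBean name -> value,
        gathered beforehand by a JMX exporter (assumed input).
    pending_reassignments: number of partition reassignments still in flight.
    """
    blockers = []
    active_controllers = 0
    for broker_id, metrics in sorted(broker_metrics.items()):
        # Missing values count as 0 here; a stricter gate could treat a
        # missing metric as a blocker, since it means no visibility.
        if metrics.get(UNDER_REPLICATED, 0) > 0:
            blockers.append(f"broker {broker_id}: under-replicated partitions")
        if metrics.get(OFFLINE_PARTITIONS, 0) > 0:
            blockers.append(f"broker {broker_id}: offline partitions")
        active_controllers += metrics.get(ACTIVE_CONTROLLERS, 0)
    # Exactly one broker in the cluster should report itself as controller.
    if active_controllers != 1:
        blockers.append(f"expected 1 active controller, found {active_controllers}")
    if pending_reassignments > 0:
        blockers.append(f"{pending_reassignments} partition reassignment(s) unresolved")
    return blockers
```

An empty list means none of these checks found a reason to wait. Returning reasons rather than a boolean keeps the output useful in a runbook or a CI step, where the operator needs to know what to fix before retrying.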