Cisco uses Ceph for storage in its OpenStack cloud platform. The initial Ceph cluster design used HDDs which caused stability issues as the cluster grew to petabytes in size. Improvements included throttling client IO, upgrading Ceph versions, moving MON metadata to SSDs, and retrofitting journals to NVMe SSDs. These steps stabilized performance and reduced recovery times. Lessons included having clear stability goals and automating testing to prevent technical debt from shortcuts.