Lessons Learned from Accidentally Deleting a Production Database
Accidentally deleting a production database is one of the worst nightmares for any developer. Losing all of the data can have catastrophic consequences for a business.
In this article, we will discuss the story behind such an experience and some of the lessons learned from it.
Before we proceed, I want to mention that the deleted database belonged to one of our microservices; it was not a monolith's database. It did not contain sensitive or high-value information, but it was still one of the main sources of data used to generate XML feeds, so losing it could affect the platform's paid traffic.
Underlying Issue and Cause
The infrastructure team notified us about an alert that kept appearing in one of our team's projects. After listing the Kubernetes pods, I saw that the Mongo Arbiter was crash-looping, which was also causing intermittent interruptions in database operations.
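For illustration, this is roughly how the crash loop showed up; the pod and namespace names below are placeholders, not our real ones:

```sh
# List the pods in the namespace; a crash-looping container shows a
# high RESTARTS count and a CrashLoopBackOff status
kubectl get pods -n my-namespace

# Inspect the arbiter's previous logs and recent events for the error
kubectl logs my-mongo-arbiter-0 -n my-namespace --previous
kubectl describe pod my-mongo-arbiter-0 -n my-namespace
```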
The Mongo Arbiter is a member of a MongoDB replica set that votes in primary elections but stores no data; it exists to break ties so the replica set can always elect a primary. Its crash-looping was eventually identified as one of the root causes of the issue.
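If you want to see the member roles for yourself, the replica set status makes them explicit. A minimal sketch, assuming `mongosh` is available inside the pod (the pod name is illustrative):

```sh
# rs.status() lists every replica set member; the arbiter appears with
# stateStr "ARBITER": it votes in elections but holds no data
kubectl exec -n my-namespace my-mongo-0 -- \
  mongosh --quiet --eval 'rs.status().members.forEach(m => print(m.name, m.stateStr))'
```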
I started a discovery phase to identify the root cause of the problem. After hours of research and multiple attempts at a fix, it turned out that the version of the MongoDB Helm chart was the culprit: newer chart versions require authentication to be configured, but our database had been deployed without credentials.
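In charts such as Bitnami's `bitnami/mongodb`, the authentication settings live under `auth.*`. As a hedged sketch of what a safer setup could look like, the chart name, version placeholder, and values below are assumptions rather than our actual configuration:

```sh
# Pin the chart version and configure credentials explicitly, so an
# upgrade cannot silently change the authentication requirements
helm upgrade --install my-mongo bitnami/mongodb \
  --namespace my-namespace \
  --version "<known-good-version>" \
  --set auth.enabled=true \
  --set auth.rootPassword="$MONGODB_ROOT_PASSWORD"
```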
However, the password did not update even after uninstalling the Helm release and starting a fresh deployment. According to GitHub issues, this is caused by the persistent volume claims (PVCs), which store the old authentication state and cannot be patched. The only solution appeared to be removing the PVCs as well.
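This behavior is by design: `helm uninstall` does not delete PVCs created through a StatefulSet's volume claim templates, so a fresh deployment re-attaches to the old volumes and picks up their stale authentication state. Roughly what that looked like (release and namespace names are placeholders):

```sh
# Removing the release deletes the pods and services...
helm uninstall my-mongo -n my-namespace

# ...but the PVCs created by the StatefulSet remain, still holding the
# old on-disk data and authentication state
kubectl get pvc -n my-namespace
```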
Take Calculated Risks
The only option for resolving the issue was to delete the database replicas and their persistent volume claims. Before doing so, however, I needed to make sure I had covered all backup plans and weighed the potential side effects.
Unfortunately, I made a huge mistake: I did not review the details of our backups and failed to capture the latest version of the data. Some dependent projects also escaped my attention, which could have created performance bottlenecks.
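The cheap insurance I skipped was a fresh dump taken immediately before the destructive step. A minimal sketch, assuming `mongodump` can reach the replica set (the URI and backup path are placeholders):

```sh
# Take a full dump right before any destructive operation; --oplog also
# captures writes made while the dump is running (replica sets only)
mongodump --uri="mongodb://my-mongo-0.my-mongo-headless:27017" \
  --oplog \
  --out="/backups/pre-deletion-$(date +%Y%m%d-%H%M%S)"
```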
Additionally, in Kubernetes, deleting the replicas of a database instance does not affect the actual data. Deleting the persistent volume claims, however, typically deletes the underlying volumes as well, and the data can no longer be recovered. I didn't expect that, and it ended up costing me a lot after executing the command.
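The distinction deserves spelling out: deleting a pod is safe because the StatefulSet recreates it on the same volume, while deleting the PVC is where the data goes, at least under the common `Delete` reclaim policy. This is roughly the check I should have run first (names are illustrative):

```sh
# Safe: the StatefulSet recreates the pod and re-attaches the same PVC
kubectl delete pod my-mongo-0 -n my-namespace

# Before touching PVCs, check the reclaim policy of the backing volumes;
# with "Delete" (a common default) the data is gone for good
kubectl get pv -o custom-columns=NAME:.metadata.name,RECLAIM:.spec.persistentVolumeReclaimPolicy

# Irreversible if the policy is "Delete"; keep it commented out until you are sure:
# kubectl delete pvc datadir-my-mongo-0 -n my-namespace
```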
Always double-check and review all backup plans and potential side effects before taking any critical step. Taking calculated risks is part of the job, but it's crucial to have a clear understanding of the potential consequences before acting.
Stay Calm and Find a Temporary Solution
After the production data was deleted, the marketing department was quick to contact me with a report of a massive drop in the number of products displayed on Google Merchant Center. This was a serious issue that needed prompt attention to prevent further damage to the company's reputation.
In order to resolve the issue, the following steps were taken:
- Stay calm and focused. Panic is never helpful in this kind of situation, and it's important to approach the problem with a clear head.
- Find a temporary solution to shield users from the problem and prevent any interruption. This could involve rerouting traffic to a different site or temporarily suspending certain services until the issue is resolved.
- Start the recovery phase. This involves identifying the root cause of the issue and implementing a permanent fix to prevent it from happening again. That could include system updates, data backups, or staff training to prevent similar mistakes; a restore sketch follows this list.
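If the recovery involves restoring data from a dump, a minimal sketch might look like the following, assuming the dump was produced by `mongodump` (the URI and path are placeholders):

```sh
# Restore the most recent available dump into the rebuilt replica set;
# --drop replaces existing collections with the backup's contents
mongorestore --uri="mongodb://my-mongo-0.my-mongo-headless:27017" \
  --drop \
  "/backups/pre-deletion-20230101-000000"
```

In our case, the most recent backup was older than it should have been, which is precisely why the dump-before-deletion step described earlier matters.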
Staying calm and focusing on finding a solution is crucial when dealing with critical issues like this. Panicking can lead to poor decision-making and further exacerbate the situation. It's important to take a step back, gather all available information, and work methodically to find a resolution.
Fast Debugging with Clean Code
Maintaining a project can be a daunting task, especially when the code is not clean and well-organized. It's important to remember that clean code is not only about aesthetics, but also about maintainability. By keeping your code clean, you can ensure that your project is easier to maintain, modify, and debug.
In our case, a former teammate had written poorly designed code that made debugging and identifying root causes more difficult and time-consuming.
Sometimes, management pressures developers to deliver projects quickly, which doesn't allow for producing high-quality code. This can be a result of many factors, such as a lack of resources, tight deadlines, or simply a failure to prioritize code quality. In these situations, it is important for developers to communicate the potential risks of producing low-quality code to their management, as well as educate them on the long-term benefits of prioritizing high-quality code. By doing so, developers can not only meet management's expectations for fast delivery, but also ensure that the code they produce is maintainable and scalable in the long term.
In any case, poorly designed code inevitably breaks at some point and makes the project even harder to maintain. Investing time in writing clean code saves time in the long run by making your project easier to maintain and modify. It's a small upfront investment that can pay off big in the future.
Asking for Help
In cases where tough decisions need to be made, it is imperative to engage in internal discussions and brainstorming sessions with your team in order to ensure that everyone is on the same page. This not only helps to ensure that everyone is aligned with the company's goals and objectives, but it also fosters a sense of teamwork and collaboration, which can lead to greater success in the long run.
It is not good to be the only one trying to find a solution to a problem when teammates could have helped overcome the situation. Leaving a coworker alone in a crisis like this shows a lack of team mindset and can eventually lead to the team falling apart.
Conclusion
Accidentally deleting a production database can cause a lot of damage, but it can also be a valuable learning experience. By understanding the underlying issues and causes of the problem, taking calculated risks, staying calm and focused, and investing time in clean code and teamwork, you can minimize the risk of similar mistakes happening in the future.
Remember to always double-check your backup plans and potential side effects before taking any critical steps. Don't be afraid to ask for help from your team, and prioritize communication and collaboration to ensure that everyone is aligned with the company's goals and objectives.
By learning from our mistakes and taking a proactive approach to preventing similar issues in the future, we can ensure that our projects and businesses are as resilient and successful as possible.