By Rahul Ghosh, Senior Tech Lead
25th July, 2025
Bizom uses MySQL as its primary database to support daily operations and meet the needs of its customers. Each day, millions of requests come in, making hundreds of thousands of transactions across the customer base. With the ever-growing list of features and supported use cases, another item grew in parallel. Yes, it’s bugs!
Now, the challenge with bugs is that they need to be consistently reproducible in order for developers to fix them permanently. Debugging directly on a production server is generally a big red flag, and for good reason. Like most teams, we follow this best practice strictly and only consider it in rare, justifiable emergencies.
Bizom has a staging environment, which is a copy of the production environment with customers’ sensitive information masked. Developers use it to identify and fix bugs and to verify newly built features. Other teams, like product and sales, use the staging environment for demos, and sometimes customers themselves use it before confirming a go-live.
In one of our earlier hackathons, a team at Bizom developed a cloning solution that leveraged the MySQL dump-and-import technique to copy the entire database and mask the customer’s sensitive data points. As for the code, the commit running in the production environment would be checked out and the same codebase pushed to the staging server for the respective customer.
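A minimal sketch of that flow, assuming a per-customer schema and purely illustrative host names, credentials, and masking rules (the actual tooling and masked columns at Bizom are not shown here), might look like this:

```python
# clone_schema.py - illustrative dump/import/mask flow (all names are hypothetical).
import subprocess
import pymysql

PROD = {"host": "prod-db.internal", "user": "clone_user", "password": "***"}
STAGE = {"host": "staging-db.internal", "user": "clone_user", "password": "***"}
SCHEMA = "customer_acme"          # one tenant schema per customer (assumed)
DUMP_FILE = f"/tmp/{SCHEMA}.sql"

# 1. Dump the customer's schema from production.
with open(DUMP_FILE, "w") as out:
    subprocess.run(
        ["mysqldump", "-h", PROD["host"], "-u", PROD["user"],
         f"-p{PROD['password']}", "--single-transaction", SCHEMA],
        stdout=out, check=True,
    )

# 2. Load the dump into the staging server.
with open(DUMP_FILE) as dump:
    subprocess.run(
        ["mysql", "-h", STAGE["host"], "-u", STAGE["user"],
         f"-p{STAGE['password']}", SCHEMA],
        stdin=dump, check=True,
    )

# 3. Mask sensitive columns on staging (columns shown are placeholders).
masking_sql = [
    "UPDATE users SET email = CONCAT('user', id, '@example.com'), phone = NULL",
    "UPDATE retailers SET contact_name = CONCAT('Retailer ', id)",
]
conn = pymysql.connect(database=SCHEMA, **STAGE)
with conn.cursor() as cur:
    for stmt in masking_sql:
        cur.execute(stmt)
conn.commit()
conn.close()
```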
Our platform is built on a multi-tenant architecture, a design choice that offers significant benefits in terms of both scalability and security. At present, the total database size has reached around 20TB. The previous solution served us well for the past couple of years, but as the database has grown significantly in size, we’re starting to hit some limitations. Now, to clone the individual schemas to the staging environment, we faced the following challenges:-
Bug fixing became a slow process because it took time to reproduce and confirm issues on the staging environment. Until that step was done, developers couldn’t really make progress on resolving them.
Looking at this problem, we identified several ways to solve these challenges. Some of them are listed as follows:-
We focused heavily on these strategies, each of which went through several rounds of ideation and refinement.
Option 4 would ideally have been the best choice, but it had many limitations for our system. For one, every table would need an identifier (an id or a time range) to copy the data effectively, with an index on that identifier alone. This was not present in all the tables, and adding it to all of them would be time-consuming. Another challenge involved handling foreign key constraints: for instance, an invoice record may reference an order that lies outside the defined scope of the delta sync, resulting in referential integrity issues during the cloning process. Third, other teams used the same schema in the staging environment for demo purposes, and creating conflicting data points would hardly be beneficial to our cause; consistency of data is highly desirable.
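To make the foreign-key problem concrete, here is a sketch, using hypothetical table and column names rather than our real schemas, of what a naive time-range delta copy looks like and where it breaks:

```python
# delta_sync_sketch.py - why a naive time-range delta copy breaks referential integrity.
# Table and column names are hypothetical.
from datetime import datetime, timedelta, timezone

since = datetime.now(timezone.utc) - timedelta(days=1)
cutoff = since.strftime("%Y-%m-%d %H:%M:%S")

# Each table needs an indexed change marker (updated_at here) for this to be cheap;
# several of our tables did not have one.
delta_queries = [
    f"SELECT * FROM orders   WHERE updated_at >= '{cutoff}'",
    f"SELECT * FROM invoices WHERE updated_at >= '{cutoff}'",
]

for query in delta_queries:
    print(query)

# The catch: an invoice updated today can reference an order created last month via
# invoices.order_id -> orders.id. That parent row falls outside the delta window, so
# inserting the invoice on staging fails the foreign key check.
```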
The second and third options also looked good, but they required extra maintenance of a filtered table list per schema. A major drawback was handling foreign key constraints across multiple tables, which would leave developers effectively debugging the cloning procedure instead of the bug they set out to fix. Whenever a new table was added, developers would also need to ensure it was handled accordingly.
Option 1 seemed the best choice, and the tool we wanted to go with was MyDumper. One recurring challenge was the continuous growth of data: in the absence of a consistent archival strategy across all tables, core transactional data kept expanding rapidly. As a result, the scalability of this solution was inherently limited, even in the near term. And while it might work for a single customer, applying it across all customers would be unsustainable.
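For reference, a per-schema export and restore with MyDumper and its companion myloader might look roughly like the following; the hosts, credentials, and thread counts are assumptions, not our actual configuration, and exact flags can vary by MyDumper version:

```python
# mydumper_clone.py - illustrative per-schema export/import using MyDumper.
import subprocess

SCHEMA = "customer_acme"
DUMP_DIR = f"/backups/{SCHEMA}"

# Parallel, consistent export of one tenant schema from production.
subprocess.run(
    ["mydumper",
     "--host", "prod-db.internal",
     "--user", "clone_user",
     "--password", "***",
     "--database", SCHEMA,
     "--threads", "8",
     "--compress",
     "--outputdir", DUMP_DIR],
    check=True,
)

# Parallel restore into the staging server, replacing any existing tables.
subprocess.run(
    ["myloader",
     "--host", "staging-db.internal",
     "--user", "clone_user",
     "--password", "***",
     "--database", SCHEMA,
     "--threads", "8",
     "--overwrite-tables",
     "--directory", DUMP_DIR],
    check=True,
)
```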
The primary goal is to enable safe and efficient debugging in the production environment without compromising data integrity. The key challenge lies in debugging live systems, where unintended writes or overwrites could corrupt critical data, an unacceptable risk for Bizom and its customers.
To address this, we need a solution that allows developers to inspect and troubleshoot issues in real-time while ensuring that no accidental modifications occur. This requires implementing safeguards that prevent unintended data changes while still providing comprehensive visibility into system behavior.
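One standard MySQL-level safeguard of this kind, sketched below with hypothetical account, host, and schema names, is to give debugging sessions an account that is only granted read privileges:

```python
# readonly_access.py - sketch: a read-only MySQL account for debugging sessions.
# The account, host, and schema names here are hypothetical.
import pymysql

ADMIN = {"host": "debug-db.internal", "user": "admin", "password": "***"}
SCHEMA = "customer_acme"

statements = [
    # Developers and QA connect with this account; it cannot write anything.
    "CREATE USER IF NOT EXISTS 'debug_ro'@'%' IDENTIFIED BY 'redacted'",
    f"GRANT SELECT, SHOW VIEW ON `{SCHEMA}`.* TO 'debug_ro'@'%'",
]

conn = pymysql.connect(**ADMIN)
try:
    with conn.cursor() as cur:
        for stmt in statements:
            cur.execute(stmt)
finally:
    conn.close()

# If the debug instance is a dedicated replica, MySQL's read_only / super_read_only
# settings can additionally block writes at the server level.
```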
Talking to our developers and QA team, we found out the following:-
Given the considerations above, it became evident that a separate environment was needed—one that could be accessed by both developers and the QA team. Since real-time, up-to-the-minute data wasn’t a requirement, designing and implementing this environment became significantly more straightforward.
Here’s how:-
These steps create a new setup where developers and QA can safely debug and reproduce bugs, respectively. This gave the following benefits:-
The only trade-off with this solution is the cost. Given the volume of data involved, the monthly expense represents a significant investment for a company of our size. However, we made a strategic decision to move forward, anticipating that it would alleviate a key bottleneck: developer productivity. By enabling faster debugging and issue resolution, we believe the time saved will translate into long-term cost savings for Bizom.
The initial implementation of this solution was focused solely on the web platform. Subsequent iterations introduced enhancements that improved developer experience and usability. Given the clear benefits, the solution was eventually extended to support mobile applications as well.
However, cost optimization remained a key concern. We identified an opportunity to reduce expenses further by evaluating the nature of the database usage. Since the database server was being recreated daily and the previous instance purged, the data was inherently temporary, offering no long-term value in persisting it. Additionally, by analyzing developer login patterns, we gained insights into actual usage times, which further informed our optimization strategy.
A straightforward solution emerged: the new setup would be provisioned only on weekdays (Monday to Friday) and remain inactive over the weekends. Based on observed usage patterns, the setup is automatically torn down at 11:00 PM IST each night and re-initialized the following morning, striking a balance between availability and resource efficiency.
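The scheduling decision itself is simple. A sketch of the logic is shown below; the morning start hour and function name are assumptions for illustration, and the real automation lives in our infrastructure tooling:

```python
# env_schedule.py - sketch of the weekday-only provisioning window (IST times assumed).
from datetime import datetime
from typing import Optional
from zoneinfo import ZoneInfo

IST = ZoneInfo("Asia/Kolkata")
TEARDOWN_HOUR = 23   # environment is torn down at 11:00 PM IST
STARTUP_HOUR = 6     # assumed morning start; the real start time may differ

def environment_should_run(now: Optional[datetime] = None) -> bool:
    """Return True if the debug environment should be up right now."""
    now = now or datetime.now(IST)
    if now.weekday() >= 5:          # Saturday and Sunday: stays down all weekend
        return False
    return STARTUP_HOUR <= now.hour < TEARDOWN_HOUR

if __name__ == "__main__":
    # A periodic job could call this and provision or tear down the setup accordingly.
    print("run" if environment_should_run() else "stop")
```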
With this approach, the total cost of the setup is reduced, resulting in savings of nearly 49%. This not only brings significant cost efficiency to the organization but also provides developers with the added benefit of a fresh, dedicated environment for debugging and issue resolution.
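As a rough sanity check of that figure, assuming the environment comes back up around 6:00 AM IST each weekday:

```python
# cost_savings_estimate.py - back-of-the-envelope check of the ~49% figure.
HOURS_ALWAYS_ON = 7 * 24             # 168 hours per week if the setup ran 24x7
HOURS_WEEKDAY_WINDOW = 5 * (23 - 6)  # 85 hours: Mon-Fri, ~6:00 AM to 11:00 PM IST (start assumed)

savings = 1 - HOURS_WEEKDAY_WINDOW / HOURS_ALWAYS_ON
print(f"{savings:.0%}")              # prints 49%
```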
Internal metrics indicate a 20% improvement in the average bug lifecycle turnaround time. Additionally, monthly clone request numbers have decreased by approximately 20–30%, reflecting greater efficiency in issue resolution and workflow management.