<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Srigovind Nayak's Blog]]></title><description><![CDATA[A result-driven technical leader at BETSOL, helping our customers achieve AI-first digital transformation on the cloud. With over 5 years of experience in software design, architecture, and development, my focus is on delivering value through clear communication and strong leadership. I believe in the open-source foundations supporting organizations around the world and continue to contribute back to the community through projects like restic, pyshamir, VaultSharp and others.]]></description><link>https://blog.srigovindnayak.com</link><image><url>https://cdn.hashnode.com/uploads/logos/61826c2dfd5d634d016953c1/5e4b7ac8-90cb-4992-8c76-d24d48a8e0d7.png</url><title>Srigovind Nayak&apos;s Blog</title><link>https://blog.srigovindnayak.com</link></image><generator>RSS for Node</generator><lastBuildDate>Mon, 11 May 2026 06:22:00 GMT</lastBuildDate><atom:link href="https://blog.srigovindnayak.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Engineering with AI - The Hidden Tax]]></title><description><![CDATA[AI makes your team faster. It may also be quietly overwhelming the people responsible for making sure it all holds together.

Two years ago, something shifted. The developers I work with stopped stari]]></description><link>https://blog.srigovindnayak.com/engineering-with-ai-the-hidden-tax</link><guid isPermaLink="true">https://blog.srigovindnayak.com/engineering-with-ai-the-hidden-tax</guid><category><![CDATA[AI]]></category><category><![CDATA[claude]]></category><category><![CDATA[engineering]]></category><category><![CDATA[leadership]]></category><dc:creator><![CDATA[Srigovind Nayak]]></dc:creator><pubDate>Tue, 07 Apr 2026 19:03:49 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/61826c2dfd5d634d016953c1/30baf2e3-392f-4826-9019-75537f42f816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p><em>AI makes your team faster. It may also be quietly overwhelming the people responsible for making sure it all holds together.</em></p>
<hr />
<p>Two years ago, something shifted. The developers I work with stopped staring at blank files waiting for inspiration. They started shipping. Fast. A feature that once took a week arrived on Friday afternoon in a pull request. A proof-of-concept that used to require a research spike was running in a sandbox by lunchtime. On the surface, it looked like a productivity miracle. And in many ways, it was.</p>
<p>But somewhere in the middle of all that acceleration, I started feeling something I didn't expect: exhausted. Not from doing less work — from doing a fundamentally <em>different</em> kind of work that nobody had really prepared me for. I was no longer just a senior engineer. I had quietly become the human quality gate at the end of an AI-powered production line.</p>
<p>This is a post about that experience — and a reflection on what it means to lead an engineering team in the age of AI-generated code, AI-generated documentation, and AI-accelerated expectations.</p>
<hr />
<h2>The Acceleration Is Real. So Is the Bottleneck.</h2>
<p>There's no point pretending the tools aren't remarkable. I've watched a non-technical executive describe a rough idea in plain English and receive a working, deployable application within 48 hours. I've used AI myself to cut research time from hours to minutes — skipping through documentation, comparing architectural options, generating functional scripts I could validate immediately.</p>
<p>The productivity gains are real, measurable, and genuinely exciting.</p>
<p>But here's what we don't talk about enough: when five developers on a team can each generate production-quality code in a matter of hours, the bottleneck doesn't disappear. It moves. It moves to the one person who has to review all of it, understand all of it, and be accountable for all of it — the lead engineer.</p>
<blockquote>
<p><em>"I used to occasionally receive a large pull request. Now I regularly receive pull requests with more than 10,000 lines of code. That's not a small shift — it's a different job."</em></p>
</blockquote>
<p>Ten thousand lines. Generated in a day or two. Each line potentially containing a perfectly reasonable decision that doesn't fit the codebase, the organisation's standards, or the architectural direction we've been building toward for months.</p>
<hr />
<h2>The Problem Isn't the Code. It's the Context.</h2>
<p>When I review AI-generated code, the issue is almost never that the code is wrong. It often works. Sometimes it's elegant. The problem is that it works in a vacuum — without awareness of the larger system it's being dropped into.</p>
<hr />
<h3>A Tale of Two Auth Flows</h3>
<p><strong>Real scenario:</strong> A developer is building a new service that requires user authentication. They prompt their AI tool: <em>"Implement OAuth 2.0 authentication."</em> The AI, being genuinely helpful, implements a secure, well-structured OAuth 2.0 flow. It chooses the Client Credentials flow — reasonable for certain service-to-service contexts.</p>
<p><strong>The problem:</strong> every other application in the organisation uses the Authorization Code + PKCE flow. The new service is technically secure. But it doesn't fit. It can't be managed the same way. It creates a maintenance burden, an onboarding headache, and a future security audit question.</p>
<hr />
<p>None of this is visible in the code itself. You only know it's wrong if you know the organisation.</p>
<p>The same pattern repeats across every layer of a codebase: pagination strategies (cursor-based vs. page-based), API versioning conventions, logging formats, error response shapes. An AI tool has no way to know what your organisation decided in a design review six months ago. It makes a reasonable choice. The lead engineer has to catch it, explain it, and redirect it — across dozens of pull requests, week after week.</p>
<h3>Over-Engineering by Default</h3>
<p>There's another pattern I've noticed: AI tools tend to gold-plate. Ask for a simple data endpoint and you may receive a fully instrumented, horizontally scalable service with a caching layer, a retry queue, and an event-driven architecture. These are often impressive. They're also sometimes completely unnecessary.</p>
<p>I've seen caching layers added to services that receive ten requests a day. I've seen event-driven patterns introduced into workflows that run once a week. The AI isn't wrong, exactly — these patterns are useful in the right context. But context is precisely what AI doesn't have. And the lead engineer has to not only identify the over-engineering, but explain <em>why</em> simplicity is the right call — a harder conversation than it sounds when the code in front of you looks sophisticated and well-intentioned.</p>
<hr />
<h2>Junior Engineers, Design Patterns, and the Invisible Gap</h2>
<p>One of the trickier dynamics I've observed is the gap between what junior engineers ask for and what the codebase actually needs.</p>
<p>Senior engineers develop pattern instincts over years of reading code, making mistakes, and reviewing others' work. They know when to reach for a Strategy pattern versus a Factory. They recognise when a new module is inadvertently duplicating an abstraction that already exists three directories away. They feel the shape of a codebase.</p>
<p>Junior engineers — even talented, hard-working ones — haven't had time to build that instinct yet. And now they have a tool that can generate a thousand lines of working code from a single prompt. The code may function perfectly and still be architecturally alien to the rest of the project.</p>
<p>The cognitive load here is subtle but real: as a reviewer, I have to hold the existing architecture in my head, understand what the AI has generated, identify where they diverge, and then explain the divergence in a way that's instructive rather than demoralising. Every time. For every PR. On top of everything else.</p>
<blockquote>
<p><em>"The bottleneck used to be 'can we build it?' Now the bottleneck is 'can we integrate it?' — and that second question is entirely on the human."</em></p>
</blockquote>
<hr />
<h2>The Documentation Deluge</h2>
<p>The challenge isn't confined to code. It extends upstream — into the documents, diagrams, and design artefacts that define how we build things before a single line of code is written.</p>
<p>AI tools are extraordinarily good at generating documentation. A single prompt can produce a 15–20 page architecture document, complete with database schema breakdowns, relationship diagrams, data flow descriptions, and edge case analyses. This is genuinely useful. It's also genuinely overwhelming.</p>
<p>The problem is that generating a document is not the same as understanding it. When I hand a stakeholder a comprehensive 20-page requirements document, they reasonably expect me to be able to speak to every page of it. That means I need to have read it carefully, verified it against our actual constraints, and caught any places where the AI has been confidently wrong — which happens more than I'd like to admit.</p>
<hr />
<h3>The Revision Loop</h3>
<p>You generate a 20-page document. You identify sections that need changing. You prompt again with your revisions. The new document comes back. You read all 20 pages again to verify the changes were applied correctly — and that nothing else shifted in the process. Sometimes the AI has quietly reintroduced the very content you asked it to remove, just rephrased slightly. You catch it, or you don't.</p>
<hr />
<p>Multiply this loop by two or three documents a day, every day, and you start to understand the weight of it.</p>
<p>There's also a subtler issue: stakeholders and senior leadership are using AI too. They can now arrive at meetings with polished documents, detailed proposals, and rapid-fire ideas that used to require days of preparation. The pace of conversation has accelerated. The expectation of response time has compressed. Everyone can have a fully articulated perspective on everything, almost instantly.</p>
<p>The lead engineer in the middle of this — responsible for translating between leadership's AI-accelerated vision and the development team's AI-accelerated execution — is now doing a kind of cognitive dance that simply didn't exist before. And it's tiring in ways that are hard to quantify.</p>
<hr />
<h2>What's Actually Missing: Governance at the Prompt Layer</h2>
<p>The honest answer is that we don't yet have good tools for governing how AI is used within a development team. We can add linting rules, we can write coding standards documents, we can include architectural decision records in our repositories. But none of these things reach back to the moment when a developer is sitting at their keyboard, typing a prompt, and an AI tool is making a dozen invisible architectural micro-decisions.</p>
<p>Some teams are experimenting with prompt templates, shared context files, and AI coding guidelines — and these help at the edges. But for genuinely novel problems, for the parts of a system where there's no prior art to reference, the AI still has to make a call. And someone human still has to evaluate that call after the fact.</p>
<p>Right now, that someone is almost always the lead engineer.</p>
<hr />
<h2>A Note to Other Leads in the Same Position</h2>
<p>If you're a senior engineer or engineering lead and any of this feels familiar, I want to say: you're not imagining it. The cognitive load is real. It's structural, not personal. It's a consequence of a genuine and rapid shift in how software gets built — and the tools, processes, and norms for managing that shift at a human level are still catching up.</p>
<p>The goal of this post isn't to argue against AI tools. They've made me faster, made my team more capable, and opened up possibilities that didn't exist two years ago. The goal is to name what's happening honestly — because the first step to solving a problem is admitting it exists.</p>
<p>We need to have a serious conversation with our leaders about what it means to lead a team when the output volume has multiplied but the human review bandwidth hasn't. About how we set expectations with stakeholders who are also using AI and have lost their intuition for how long things actually take. About how we invest in the judgment and context-building that no AI tool can replace.</p>
<p>The bottleneck has moved. Now we need to figure out how to move with it.</p>
<hr />
<div>
<div>💡</div>
<div>Earlier in this series, I wanted to write about AI without using AI tools, but I generated this blog using Apple Intelligence on my phone and Claude Opus 4.6. The content echoes my inner thoughts very well, and I thought it would be great to share it with everyone. I created an Apple voice memo in which I spoke for 16 minutes describing the challenges I face. When I saw the Apple Intelligence transcription, I thought it would be a good idea to ask Claude to write a blog for me. I originally intended only to get ideas, but the content turned out so good that I decided to publish it without any changes.</div>
</div>]]></content:encoded></item><item><title><![CDATA[Mastering Disaster Recovery - Part 4 - Offsite & Offline Backups ]]></title><description><![CDATA[They both solve different problems
Offline and Offsite backup and disaster recovery is often confused as the same thing. Both terms describe data locality but are designed to address different risk in]]></description><link>https://blog.srigovindnayak.com/mastering-disaster-recovery-part-4-offsite-and-offline-backups</link><guid isPermaLink="true">https://blog.srigovindnayak.com/mastering-disaster-recovery-part-4-offsite-and-offline-backups</guid><dc:creator><![CDATA[Srigovind Nayak]]></dc:creator><pubDate>Sat, 07 Mar 2026 19:53:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/61826c2dfd5d634d016953c1/a6e85334-d98d-4572-a44e-00082aa9495a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>They both solve different problems</h1>
<p>Offline and offsite backups are often confused as the same thing. Both terms describe data locality, but they are designed to address different risks in an organization's disaster recovery posture.</p>
<h1>What are offline backups?</h1>
<p>Offline backup involves storing data on a storage medium disconnected from the network such as a tape drive or an external HDD. It is also called “air-gapped backup” because it’s not reachable by any running software.</p>
<h1>What are offsite backups?</h1>
<p>Offsite backups, on the other hand, are copies of data stored at a different physical location than the primary data. Offsite backup copies help provide redundancy and improve availability of data. They provide protection against events like fires, power outages or natural disasters at the primary site.</p>
<h1>The overlap</h1>
<p>These concepts can overlap:</p>
<ul>
<li><p>A backup can be offsite but still online — for example, a cloud replica that is network-accessible.</p>
</li>
<li><p>A backup can be offline but still onsite — for example, a tape stored in a cabinet at the primary site.</p>
</li>
</ul>
<p>To maximize resilience, enterprises should combine offline and offsite backups.</p>
<p>Example: a weekly backup is copied from the primary site to a secondary site over the network. The secondary site then writes that copy to an air-gapped medium (e.g., tape) and retains it in a physically separate location. This approach protects against both localized site failures and network-based threats.</p>
<h1>Upgrading the resiliency of your 3-2-1 backup strategy</h1>
<p>The 3-2-1 backup strategy requires three copies of data, on two different media types, with one copy stored offsite. The most resilient way to implement 3-2-1 is to make the third copy both offline and offsite. The primary data lives on production storage. The second copy is a disk-based backup kept onsite for fast restores. The third copy is a weekly backup sent to a secondary site and written to a tape drive or external HDD. After the backup is completed the tape drive or HDD is disconnected from the network.</p>
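<p>The rule above can be expressed as a small check. This is only a minimal sketch in Python: the <code>BackupCopy</code> type, its field names and the example copies are hypothetical and are not taken from any backup tool.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """Hypothetical description of one copy of the data."""
    name: str
    media: str      # e.g. "disk" or "tape"
    offsite: bool   # stored away from the primary site
    offline: bool   # disconnected from the network (air-gapped)


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Three copies, on two media types, with at least one copy offsite."""
    media_types = {c.media for c in copies}
    return (
        len(copies) >= 3
        and len(media_types) >= 2
        and any(c.offsite for c in copies)
    )


def has_offline_offsite_copy(copies: list[BackupCopy]) -> bool:
    """The more resilient variant: one copy is both offline and offsite."""
    return any(c.offsite and c.offline for c in copies)


# The layout described in the paragraph above.
copies = [
    BackupCopy("production", "disk", offsite=False, offline=False),
    BackupCopy("onsite-backup", "disk", offsite=False, offline=False),
    BackupCopy("weekly-tape", "tape", offsite=True, offline=True),
]

print(satisfies_3_2_1(copies))           # True
print(has_offline_offsite_copy(copies))  # True
```

<p>Dropping the weekly tape from the list makes both checks fail, which is the point of the third copy.</p>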
<h2>Threat 1: Hardware and Media Failure</h2>
<p>Hardware failure is the most common threat to data. Drives fail, storage controllers malfunction and tapes degrade over time.</p>
<p><strong>Mitigation:</strong> An offline copy is protected from cascading hardware failures. A firmware bug that bricks an entire storage array cannot reach a backup that is disconnected. An offsite copy adds further protection. If the failure is caused by a local issue like overheating or unstable power, a copy at a different location is not affected.</p>
<h2>Threat 2: Human Error and Data Corruption</h2>
<p>A misconfigured script can overwrite production data. An application bug can corrupt a database silently over weeks. These failures are dangerous because they propagate. A corrupted database gets replicated to every connected backup target.</p>
<p><strong>Mitigation:</strong> Offline backups break this chain. They preserve the state of the data at the point the backup was taken. If corruption has been spreading for days, the offline copy from last week may be the only clean version that exists.</p>
<h2>Threat 3: Ransomware and Cyber Threats</h2>
<p>Modern ransomware does not just encrypt production data. It actively searches for backups. Network-attached backup shares and cloud-synced repositories are all targets.</p>
<p><strong>Mitigation:</strong> An air-gapped copy such as a tape in a vault or a disconnected drive is not reachable by malware. There is no network path to it. Offsite backups add protection against insider threats. An employee with administrative access to local systems cannot destroy backups at a remote facility they cannot physically access.</p>
<h3>Threat 3.1: Correlated Failures</h3>
<p>The real risk is when a single event compromises multiple copies at once. Ransomware encrypts production data and the network-attached backup. A fire destroys the server room and the backup drives next to it.</p>
<p><strong>Mitigation:</strong> Offline breaks the network link so cyber threats cannot propagate to every copy. Offsite breaks the geographic link so physical events cannot affect every copy. Together they create a backup that is digitally unreachable and physically distant.</p>
<h2>Threat 4: Site-Level Disasters</h2>
<p>Fire, flood or prolonged power outage can destroy everything in a single location. It does not matter how many backup copies exist if they are all in the same building.</p>
<p><strong>Mitigation:</strong> Offsite backups solve this. A copy at a different geographic location ensures a localized disaster cannot wipe out all copies of the data. When combined with offline storage there is an added benefit. A disaster that takes out network connectivity can leave cloud-based offsite backups unreachable. A physical offline copy at a remote facility is available regardless of network conditions.</p>
<h2>The 3-2-1-1-0 strategy and beyond</h2>
<p>It is beneficial to extend 3-2-1 into the 3-2-1-1-0 strategy, where the extra "1" explicitly requires one copy to be offline or immutable and the “0” explicitly requires that backups are verified with zero errors. This makes the air gap a formal part of the backup policy. Backup tools and policies must be configured to ensure that only a backup with zero errors is moved to the offline copy.</p>
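<p>As a rough illustration, the five digits can be checked one by one. This is a sketch under assumptions: the <code>VerifiedCopy</code> type, the <code>verify_errors</code> field and the example values are hypothetical, and a real backup tool would report verification results in its own way.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifiedCopy:
    """Hypothetical record of one copy and its last verification run."""
    name: str
    media: str
    offsite: bool
    offline_or_immutable: bool
    verify_errors: int  # errors found when the copy was last verified


def check_3_2_1_1_0(copies: list[VerifiedCopy]) -> dict[str, bool]:
    """Evaluate each digit of 3-2-1-1-0 separately so gaps are visible."""
    return {
        "3 copies": len(copies) >= 3,
        "2 media types": len({c.media for c in copies}) >= 2,
        "1 offsite": any(c.offsite for c in copies),
        "1 offline or immutable": any(c.offline_or_immutable for c in copies),
        "0 errors": all(c.verify_errors == 0 for c in copies),
    }


def eligible_for_offline(copy: VerifiedCopy) -> bool:
    """Only a backup that verified with zero errors may go to the offline copy."""
    return copy.verify_errors == 0


copies = [
    VerifiedCopy("production", "disk", False, False, 0),
    VerifiedCopy("onsite-backup", "disk", False, False, 0),
    VerifiedCopy("weekly-tape", "tape", True, True, 0),
]

for rule, ok in check_3_2_1_1_0(copies).items():
    print(f"{rule}: {'ok' if ok else 'MISSING'}")
```

<p>Returning one result per digit, rather than a single pass or fail, makes it easier to see which part of the policy a given environment is missing.</p>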
<h1>Conclusion</h1>
<p>In my experience, helping customers understand the difference between offline and offsite backups, and the importance of implementing both, has been critical. Achieving everything I have described in this blog for both offsite and offline backups is very challenging in large organizations. I will continue to explore the world of backup and disaster recovery and come back with more practical insights in the future.</p>
]]></content:encoded></item><item><title><![CDATA[Engineering with AI - Keeping up with AI tools]]></title><description><![CDATA[In my last blog I wrote about how my journey using and learning about AI tools in my software engineering role. I briefly also wrote about my team and I used AI tools to improve our development workflows and speed up our product’s development process...]]></description><link>https://blog.srigovindnayak.com/engineering-with-ai-keeping-up-with-ai-tools</link><guid isPermaLink="true">https://blog.srigovindnayak.com/engineering-with-ai-keeping-up-with-ai-tools</guid><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[open source]]></category><category><![CDATA[chatgpt]]></category><category><![CDATA[claude.ai]]></category><category><![CDATA[claude-code]]></category><dc:creator><![CDATA[Srigovind Nayak]]></dc:creator><pubDate>Sat, 06 Sep 2025 15:17:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1757171769635/6bc9fef3-4bb2-4301-a196-460ce6bca464.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my last blog I wrote about my journey using and learning about AI tools in my software engineering role. I also wrote briefly about how my team and I used AI tools to improve our development workflows and speed up our product’s development process. Since I wrote that blog, the number of tools available for development has grown by leaps and bounds, and it has dramatically changed how I approach a problem in general.</p>
<p>I thought it would be good to write more about what I have seen and continue my train of thought around Engineering with AI. I see a lot of value in spending time to express my thoughts and experiences in this blog series. Trying to write an essay without AI tools has never been harder.</p>
<h1 id="heading-keeping-up">Keeping up</h1>
<p>It’s exciting and challenging to keep up. I feel like I am going to miss the AI train if I don’t keep up. Every few weeks or months there is a big leap in LLM tools. Each model claims to be more powerful than the previous one, with lots of graphs and tables explaining the tests and evaluation metrics. One of the major focuses of these evaluations is how much faster or better the model is at code generation, image generation or video generation tasks.</p>
<p>Sometimes the progress feels small, but other times it feels like AI can solve any given problem in the world. Apart from the tools themselves, there are new ways to interact with LLMs, including MCP servers and Agentic AI. As of writing this blog I haven’t explored many of these.</p>
<h2 id="heading-being-spoilt-for-choice">Being spoilt for choice</h2>
<p>AI tools for the software development industry range all the way from hands-free development websites like <a target="_blank" href="http://v0.dev/">v0.dev</a> by Vercel, lovable.dev and <a target="_blank" href="http://bolt.ai/">bolt.ai</a> to IDE and CLI applications like Cursor IDE, Windsurf IDE, Claude Code &amp; Gemini CLI.</p>
<p>All of these tools have free versions with limits, which is a way to increase the tool's user base. Some users will understand the usefulness and novelty of the AI tool well enough to decide to upgrade to the paid version, while others try to work out the best way to use the free version of the AI tool in question.</p>
<p>Which brings me to the next question.</p>
<h2 id="heading-is-spending-money-on-paid-versions-opportunity-cost">Is spending money on paid versions opportunity cost?</h2>
<p>The FOMO (Fear Of Missing Out) of not utilizing the latest and greatest models has me, more often than not, deciding which paid/Pro/Plus version I must purchase. There have been times when I had ChatGPT Plus, Claude Pro, Cursor and a HuggingFace subscription all at the same time.</p>
<p>The decision to upgrade to a paid version of an AI tool is difficult for various reasons. The pricing of most of these tools is around the 20$ per month mark, which translates to roughly 2000 INR a month. I always trick myself into thinking that this is an opportunity cost I am paying. I say to myself that I will learn something new or be able to solve some complex problem more easily than other people using the free models. In fact, I use some of these paid AI tools more just because I have paid for them, and I have explored some cool features before my peers could. A good example of opportunity cost for me is getting to use Claude Code earlier than my peers who use the free model.</p>
<h2 id="heading-should-you-pay-for-ai-tools">Should you pay for AI tools?</h2>
<p>Paying for AI models is a commitment, and for most users in a country like India it’s difficult to commit to a subscription. The logic is simple: if you spend 20$ per month per tool on one or two tools, you will end up spending 240$ to 480$ a year, or 21,000 INR to 42,000 INR a year. People I work with are surprised to know that I pay for Plus / Pro variants of AI tools, sometimes even multiple paid tools. Their argument is that there are ample free models available which give comparable results to paid versions. Some of the bright minds that I work with have also figured out prompting strategies to make the most of the free models, and in some cases get better results than the paid version. I am not that smart.</p>
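<p>The yearly range quoted above is what you get for one or two subscriptions at 20$ a month. Here is the arithmetic as a small Python sketch; the exchange rate of 87.5 INR per dollar is an assumption, chosen because it is the rate implied by the figures in this post, and actual rates vary.</p>

```python
MONTHLY_PRICE_USD = 20   # typical Plus / Pro price mentioned in this post
INR_PER_USD = 87.5       # assumption: rate implied by 240$ being about 21,000 INR


def yearly_cost(tools: int) -> tuple[int, float]:
    """Return the yearly spend in USD and INR for a number of paid tools."""
    usd = MONTHLY_PRICE_USD * 12 * tools
    return usd, usd * INR_PER_USD


for tools in (1, 2):
    usd, inr = yearly_cost(tools)
    print(f"{tools} tool(s): {usd}$ a year, about {inr:,.0f} INR")

# 1 tool(s): 240$ a year, about 21,000 INR
# 2 tool(s): 480$ a year, about 42,000 INR
```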
<p>I have switched between paying for ChatGPT Plus and Claude Pro multiple times. Depending on what I was doing, I would cancel one subscription and move to the other. Notable examples of when I switched are when the “Ghibli Art” trend caught on and when “Claude Code” was released. I have always found that the Plus/Pro versions gave slightly better results than the free version; this might be confirmation bias, but I tend to be more confident in the outputs I get from the paid version. I have been able to use the latest and greatest models and see the benefits of higher token limits, larger context windows and a variety of output options. Models like o4-mini-high and Claude Opus 4 really helped me increase my trust and skill in AI-assisted problem solving and development.</p>
<p>Trying out these paid tools helped me recognize the value of these tools in the software development space and also stay ahead of the curve. Since you don’t hit the limits frequently, you are encouraged to use the tools more effectively by starting more prompts and experimenting.</p>
<h2 id="heading-ai-is-becoming-more-accessible">AI is becoming more accessible</h2>
<p>I believe that AI is becoming more accessible to people. Free models are getting good and there are more entry-level paid plans for people.</p>
<h3 id="heading-cheaper-plans">Cheaper Plans</h3>
<p>A very good recent example is Perplexity’s partnership with Airtel in India to give away a one-year Perplexity Pro subscription for free. As users utilized the free subscription, they realized that it was the same as the free version with just a few more capabilities. Perplexity here might be trying to tap into a market of users who will build dependencies.</p>
<p>ChatGPT also released a Rs. 399 per month plan so that students and individuals wanting to utilize AI more effectively can claim that slight edge over people who don’t use it. To OpenAI it is a game of numbers in India: monetizing a niche market of young developers and working-class users who might not be able to pay 20$ a month but can spare around 5$.</p>
<h3 id="heading-open-source-models">Open Source Models</h3>
<p>There is also a wave of free open source models which are getting good and becoming accessible for free through Ollama. Some recommended models to try are Mistral, Llama by Meta and Gemma by Google. These models perform well enough with basic compute and memory. I have tried running Mistral 7B and Llama3.1 8B while building local chat bots for some side projects. I find these models are good for most general-purpose workflows like email reformatting, basic bash scripting and content generation. The only limiting factor here is the need for GPU memory and fast storage.</p>
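<p>For anyone curious what talking to a local model looks like, here is a minimal sketch using Ollama's local REST API and only the Python standard library. It assumes Ollama is running on its default address and that a model named <code>mistral</code> has already been pulled; the helper names <code>build_request</code> and <code>generate</code> are my own for illustration.</p>

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

<p>With the server running, calling <code>generate("mistral", "Rewrite this email politely: ...")</code> should return the model's reply as a string. Keeping the request builder separate makes it possible to inspect the payload without a model loaded.</p>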
<p>Projects like Open Web UI are also making it easier for organizations to deploy the above open source models on a chat interface to introduce their employees to AI in high compliance environments like finance and healthcare.</p>
<h1 id="heading-my-recommendation">My recommendation</h1>
<p>Keeping up with AI has been tough. The benefits of using AI tools have never been better. Paying for an AI tool can benefit you in some ways, while choosing to use the free versions helps you learn best practices and prompting strategies to make the most of the free models. Open source models will help you start learning more about the use cases of AI, but you’re limited by GPU memory.</p>
<p>Get one paid plan if you can. Experiment with deploying local models using tools like Ollama. Explore more powerful models for specific tasks using platforms like HuggingFace, which let you pay on an hourly basis for only the compute, memory and GPU resources you have used. This will help you understand the benefits and challenges of using AI.</p>
]]></content:encoded></item><item><title><![CDATA[As an Architect I want to threat model]]></title><description><![CDATA[When I started software development in 2020 I had a very different view on how software was designed and developed. It was heavily influenced by what I learnt in my under graduate course. Dataflow diagrams, High-level design diagrams, low-level diagr...]]></description><link>https://blog.srigovindnayak.com/secure-software-design-threat-modelling</link><guid isPermaLink="true">https://blog.srigovindnayak.com/secure-software-design-threat-modelling</guid><category><![CDATA[threatmodelling]]></category><category><![CDATA[Secure Design Principles]]></category><category><![CDATA[software development]]></category><category><![CDATA[software architecture]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Srigovind Nayak]]></dc:creator><pubDate>Sun, 11 May 2025 17:13:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746983235182/4422940c-c950-4ac5-8474-f040146e02ca.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When I started software development in 2020, I had a very different view of how software was designed and developed. It was heavily influenced by what I learnt in my undergraduate course. Dataflow diagrams, high-level design diagrams, low-level design diagrams, class diagrams and project structure were more or less all I knew.</p>
<p>In early 2021 I was tasked with focusing on the security aspects of my backup project, including securing various secrets and the encryption architecture of files being backed up. With my traditional sense of software design and development, I added dataflow diagrams, high-level design diagrams and low-level design diagrams. I thought I went the extra mile by adding the class diagrams, database schema structures and other details into my document.</p>
<p>I happily scheduled a design review with the architect (my mentor) for the file encryption design. In the design review my mentor asked me a number of questions which I had no answers to. “Where is this encryption key stored?”, “What will you do if someone takes a memory dump and the encryption key is leaked?”, “If the encryption key is leaked, will you re-encrypt all the files?”, “How many encryption keys will you store?”, “How will you encrypt the encryption key, and who will encrypt that key?”. All these questions were difficult to answer. I had focused only on one part of the problem, which was encrypting the file and storing the encryption key, and I had missed all the other points.</p>
<p>My mentor then asked me to do something called threat modelling and come back. This blog pretty much sums up the theory I learnt from there on. It covers what I learnt and how I built my own methodology for threat modelling.</p>
<h1 id="heading-threats-and-vulnerabilities">Threats and Vulnerabilities</h1>
<p>When I started researching the topic, I realised that there is a difference between what is considered a vulnerability and what is considered a threat. I had used these terms interchangeably before and had seen others do it too.</p>
<h2 id="heading-what-is-a-vulnerability">What is a Vulnerability?</h2>
<p>A vulnerability is any weakness of a computer system which can be exploited to negatively impact the security policy of the system.</p>
<p>Example 1: Role-based access control (RBAC) is not implemented on the database containing user data. Any database user can create, update, or delete user records.</p>
<h2 id="heading-what-is-a-threat">What is a Threat?</h2>
<p>A threat is any negative impact on a computer system which can be introduced through a vulnerability.</p>
<p>Example 2: A rogue developer accesses the database and deletes all data from the user database.</p>
<p>Notice that this threat is a result of the vulnerability in Example 1. Since no RBAC was implemented on the database, there is a threat that a rogue developer could delete all the data.</p>
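<p>To make the link between the vulnerability and the threat concrete, here is a minimal sketch of the missing control. The role names, the permission sets and the <code>delete_all_users</code> helper are hypothetical and only for illustration; a real system would enforce this in the database itself.</p>

```python
# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS = {
    "admin": {"create", "read", "update", "delete"},
    "developer": {"read"},
}


def is_allowed(role, action):
    """Return True only if the role has been explicitly granted the action."""
    return action in ROLE_PERMISSIONS.get(role, set())


def delete_all_users(role, users):
    """Delete every user record, but only for roles allowed to delete."""
    if not is_allowed(role, "delete"):
        raise PermissionError(f"role '{role}' may not delete user records")
    users.clear()
```

<p>Without the <code>is_allowed</code> check (the vulnerability), any caller can wipe the records (the threat). With it, a call made with the developer role raises an error and the records stay intact.</p>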
<p>So how do I find all of these vulnerabilities and their corresponding threats? I thought it was not humanly possible to do that before my next design review call. I found out later that there is a way to categorize common threats and vulnerabilities using something called the STRIDE model.</p>
<h2 id="heading-what-is-threat-modelling"><strong>What is Threat Modelling?</strong></h2>
<p>Threat modelling is a pre-emptive process to identify, communicate, and understand threats and mitigations. The result of threat modelling is a threat model.</p>
<h1 id="heading-what-why-and-who">What, why and who?</h1>
<h2 id="heading-what-is-a-threat-model">What is a Threat Model?</h2>
<p>A threat model is a structured representation of all the information that affects the security of an application.</p>
<p>In essence, it is a view of the application and its environment through the lens of security.</p>
<h2 id="heading-why-threat-model"><strong>Why Threat Model?</strong></h2>
<p>When you perform threat modeling, you begin to recognize what can go wrong in a system.</p>
<p>In an ideal scenario, threat modeling should take place as soon as the architecture is in place. However, modelling all threats at this time might not be ideal.</p>
<p>No matter when you end up doing the threat modelling, understand that the cost of resolving issues generally increases the further along you are in the SDLC.</p>
<h2 id="heading-who-should-threat-model">Who Should Threat Model?</h2>
<p>You. Everyone. Anyone who is concerned about the privacy, safety, and security of their system.</p>
<p>This piqued my interest, so I did a little more research into the theory behind threat modelling and its best practices. Since I was in the early part of my career, I used to spend a lot of time doing theoretical research and analysis for any task assigned to me. I found the Threat Modelling Manifesto on my research journeys (pre-ChatGPT era).</p>
<h1 id="heading-threat-modelling-manifesto"><strong>Threat Modelling Manifesto</strong></h1>
<p>The manifesto was written by a group of security researchers and CISOs from across the globe; more information is in the references. I learnt that threat modelling can be as simple as asking four questions while defining a design or solution.</p>
<h2 id="heading-ask-four-key-questions"><strong>Ask Four Key Questions</strong></h2>
<ol>
<li><p>What are we working on?</p>
</li>
<li><p>What can go wrong?</p>
</li>
<li><p>What are we going to do about it?</p>
</li>
<li><p>Did we do a good enough job?</p>
</li>
</ol>
<p>At that time I was working on securing different secrets, so here is an example.</p>
<h3 id="heading-example">Example</h3>
<ol>
<li><p>What are we working on?</p>
<ul>
<li>The user-agent will upload files to an S3 bucket dedicated to that user, using the AccessKeyID and SecretAccessKey of that user.</li>
</ul>
</li>
<li><p>What can go wrong?</p>
<ul>
<li>A memory dump of the user-agent can yield the AccessKeyID and SecretAccessKey, which the user can then use maliciously to get free storage on Wasabi.</li>
</ul>
</li>
<li><p>What are we going to do about it?</p>
<ul>
<li>Create a wrapper on the Remote API to fetch an STS token for the S3 client instead of storing the AccessKeyID and SecretAccessKey on the user-agent.</li>
</ul>
</li>
<li><p>Did we do a good enough job?</p>
<ul>
<li><p>We have moved the threat away from the user-agent to the Remote API.</p>
</li>
<li><p>Securing the deployments from remote access and raising alerts on memory dumps or root commands can help us identify malicious behavior and remediate issues.</p>
</li>
</ul>
</li>
</ol>
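<p>The idea behind the third answer can be sketched roughly as below. This is not the actual Remote API: <code>TemporaryCredentials</code>, <code>issue_temporary_credentials</code> and the 15-minute lifetime are assumptions made for illustration, and random tokens stand in for what a real STS endpoint would return.</p>

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TemporaryCredentials:
    access_key_id: str
    secret_access_key: str
    session_token: str
    expires_at: datetime

    def is_expired(self, now):
        """Credentials are unusable from the expiry instant onwards."""
        return now >= self.expires_at


def issue_temporary_credentials(user_id, now, lifetime=timedelta(minutes=15)):
    """Hypothetical Remote API handler that mints short-lived credentials.

    A real implementation would request these values from the storage
    provider's STS endpoint; random tokens stand in for them here.
    """
    return TemporaryCredentials(
        access_key_id=f"TEMP-{user_id}-{secrets.token_hex(4)}",
        secret_access_key=secrets.token_urlsafe(32),
        session_token=secrets.token_urlsafe(32),
        expires_at=now + lifetime,
    )


now = datetime.now(timezone.utc)
creds = issue_temporary_credentials("user-42", now)
print(creds.access_key_id, "expires at", creds.expires_at.isoformat())
```

<p>The user-agent only ever holds credentials that stop working after a short time, so a memory dump yields far less than a long-lived AccessKeyID and SecretAccessKey would.</p>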
<p>The next section is the interesting part. I designed it based on my experience as a security engineer, and the steps are a mix of various sources stitched together with that experience.</p>
<h1 id="heading-3-step-approach-for-threat-modelling">3 Step Approach for Threat Modelling</h1>
<h2 id="heading-step-1-decompose-the-application">Step 1: Decompose the Application</h2>
<p>At a high level, we do the following:</p>
<ul>
<li><p>Identify entities, entry points, exit points and interactions between various components of the system.</p>
</li>
<li><p>Create a Dataflow diagram (DFD) highlighting different interactions and privilege boundaries.</p>
</li>
</ul>
<p>Tool recommendation: Microsoft Threat Modeling Tool</p>
<h3 id="heading-activities-to-decompose-the-application">Activities to decompose the application:</h3>
<ol>
<li><p>Start with a three-tier architecture</p>
</li>
<li><p>Identify external entities and interactions</p>
</li>
<li><p>Identify internal entities and interactions</p>
</li>
<li><p>Identify boundaries of the application.</p>
</li>
<li><p>Identify the trust zones (network layer, domain layer, interaction layer)</p>
</li>
<li><p>Identify 3rd-party components, integrations and zones.</p>
</li>
<li><p>Create a dataflow diagram with the above information.</p>
</li>
</ol>
<p>It can be as simple as a diagram like this:</p>
<p><img src="https://online.visual-paradigm.com/repository/images/0b487371-28fa-461f-accf-1c42a252b104/threat-model-diagram-design/threat-modeling.png" alt="Threat Modeling" /></p>
<p>In the above diagram you can see the dataflows, network boundaries and interactions between different components of the application.</p>
<h2 id="heading-step-2-categorize-and-rank-threats">Step 2: Categorize and Rank Threats</h2>
<h3 id="heading-step-21-threat-categorization">Step 2.1 Threat Categorization</h3>
<p>A threat categorization such as STRIDE is useful in the identification of threats by classifying attacker goals such as:</p>
<ul>
<li><p><strong>S</strong>poofing</p>
</li>
<li><p><strong>T</strong>ampering</p>
</li>
<li><p><strong>R</strong>epudiation</p>
</li>
<li><p><strong>I</strong>nformation Disclosure</p>
</li>
<li><p><strong>D</strong>enial of Service</p>
</li>
<li><p><strong>E</strong>levation of Privilege</p>
</li>
</ul>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Type</strong></td><td><strong>Description</strong></td><td><strong>Security Control</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Spoofing</td><td>Threat action aimed at illegally accessing and using another user’s credentials, such as a username and password.</td><td>Authentication</td></tr>
<tr>
<td>Tampering</td><td>Threat action intending to maliciously change or modify persistent data, such as records in a database, and the alteration of data in transit between two computers over an open network, such as the Internet.</td><td>Integrity</td></tr>
<tr>
<td>Repudiation</td><td>Threat action aimed at performing prohibited operations in a system that lacks the ability to trace the operations.</td><td>Non-Repudiation</td></tr>
<tr>
<td>Information disclosure</td><td>Threat action intending to read a file that one was not granted access to, or to read data in transit.</td><td>Confidentiality</td></tr>
<tr>
<td>Denial of service</td><td>Threat action attempting to deny access to valid users, such as by making a web server temporarily unavailable or unusable.</td><td>Availability</td></tr>
<tr>
<td>Elevation of privilege</td><td>Threat action intending to gain privileged access to resources in order to gain unauthorized access to information or to compromise a system.</td><td>Authorization</td></tr>
</tbody>
</table>
</div><h3 id="heading-step-22-threat-ranking">Step 2.2 Threat Ranking</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746982716618/e18040ad-cd34-472f-b167-3a5a662d33ef.png" alt class="image--center mx-auto" /></p>
<p>Risk = (Probability of Threat) x (Cost to organization)</p>
<p>This should be analyzed holistically with stakeholders, taking the organization’s risk appetite into account.</p>
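<p>As a rough illustration of the formula above, the sketch below scores a few threats and sorts them by risk. The probabilities and costs are made-up numbers, not real estimates, and the <code>Threat</code> class and <code>rank_threats</code> function are names I am assuming for this example.</p>

```python
from dataclasses import dataclass


@dataclass
class Threat:
    name: str
    probability: float  # estimated chance of occurring, 0.0 to 1.0
    cost: float         # estimated cost to the organization if it occurs

    @property
    def risk(self):
        """Risk = (probability of threat) x (cost to organization)."""
        return self.probability * self.cost


def rank_threats(threats):
    """Return threats ordered from highest to lowest risk."""
    return sorted(threats, key=lambda t: t.risk, reverse=True)


# Made-up numbers, for illustration only.
threats = [
    Threat("Rogue developer deletes user records", probability=0.05, cost=500_000),
    Threat("Access keys leaked via memory dump", probability=0.20, cost=50_000),
    Threat("Master key holder turns malicious", probability=0.01, cost=2_000_000),
]

for threat in rank_threats(threats):
    print(f"{threat.risk:>10,.0f}  {threat.name}")
```

<p>The ranking is only a starting point for the conversation with stakeholders. A threat with a tiny probability can still end up near the top of the list if its cost is large enough.</p>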
<p>This is a very deep topic and I haven’t fully understood it yet. From what I know, you must present the inherent threats and their impact (cost) to the business, and the business decides which threats are worth fixing.</p>
<p>For example, consider a vulnerability where the master key of the application is held by a single admin user. The threat is that the admin user could be malicious and delete the master key, or use it to decrypt all secrets.</p>
<p>The business might calculate the risk to be low if the admin user is the CEO of the company or the business owner himself. The probability is near zero, so we might not need to engineer a fix for the threat immediately. There are other risks here too, such as “what if the CEO is not too secure with how he manages secrets?”, and in that case the CEO accepts the risk. There are lots of scenarios for something as simple as a password. I will try to write another blog sometime later about Shamir’s Secret Sharing, which can solve the problem of distributing secrets. Before I digress further, let’s move to step 3.</p>
<h2 id="heading-step-3-identify-countermeasures-amp-remediations">Step 3: Identify Countermeasures &amp; Remediations</h2>
<h3 id="heading-common-threat-types-and-counter-measures">Common Threat Types and Countermeasures</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Threat Type</strong></td><td><strong>Mitigation Techniques</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Spoofing Identity</td><td>1. Appropriate authentication 2. Protect secret data 3. Don’t store secrets</td></tr>
<tr>
<td>Tampering with data</td><td>1. Appropriate authorization 2. Hashes 3. MACs 4. Digital signatures 5. Tamper resistant protocols</td></tr>
<tr>
<td>Repudiation</td><td>1. Digital signatures 2. Timestamps 3. Audit trails</td></tr>
<tr>
<td>Information Disclosure</td><td>1. Authorization 2. Privacy-enhanced protocols 3. Encryption 4. Protect secrets 5. Don’t store secrets</td></tr>
<tr>
<td>Denial of Service</td><td>1. Appropriate authentication 2. Appropriate authorization 3. Filtering 4. Throttling 5. Quality of service</td></tr>
<tr>
<td>Elevation of privilege</td><td>1. Run with least privilege</td></tr>
</tbody>
</table>
</div><p>At the end of the threat modelling activity, you will have a list of threats which you now have to sit down and categorize. This is a difficult task, since you need to check whether your design fully mitigates each risk. For some threats the cost of development might be high, and they might eventually go into the bucket of non-mitigated threats. It’s good to document this so that there is a record of which threats were identified and which mitigations were put in place.</p>
<h3 id="heading-threat-profiling">Threat Profiling</h3>
<ul>
<li><p><strong>Non mitigated threats</strong>: Threats which have no countermeasures and represent vulnerabilities that can be fully exploited and cause an impact.</p>
</li>
<li><p><strong>Partially mitigated threats</strong>: Threats partially mitigated by one or more countermeasures and can only partially be exploited to cause a limited impact.</p>
<ul>
<li>Note: A threat is categorized as partially mitigated only when the risk level is considerably reduced from its state as a non-mitigated threat.</li>
</ul>
</li>
<li><p><strong>Fully mitigated threats</strong>: These threats have appropriate countermeasures in place and do not expose vulnerabilities.</p>
</li>
</ul>
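<p>One way to keep such a record is a small threat register that stores the STRIDE category and the mitigation status of each threat. The sketch below is only an assumption about how this could look: the <code>Stride</code> and <code>Status</code> enums, the <code>profile</code> function and the register entries are all hypothetical.</p>

```python
from collections import defaultdict
from enum import Enum


class Stride(Enum):
    """STRIDE categories, each mapped to the security control it violates."""
    SPOOFING = "Authentication"
    TAMPERING = "Integrity"
    REPUDIATION = "Non-Repudiation"
    INFORMATION_DISCLOSURE = "Confidentiality"
    DENIAL_OF_SERVICE = "Availability"
    ELEVATION_OF_PRIVILEGE = "Authorization"


class Status(Enum):
    NON_MITIGATED = "non mitigated"
    PARTIALLY_MITIGATED = "partially mitigated"
    FULLY_MITIGATED = "fully mitigated"


def profile(threats):
    """Group (description, category, status) tuples by mitigation status."""
    buckets = defaultdict(list)
    for description, category, status in threats:
        buckets[status].append((description, category))
    return dict(buckets)


# Hypothetical register entries, for illustration only.
register = [
    ("Access keys read from a memory dump",
     Stride.INFORMATION_DISCLOSURE, Status.PARTIALLY_MITIGATED),
    ("Any database user can delete user records",
     Stride.TAMPERING, Status.NON_MITIGATED),
]

for status, entries in profile(register).items():
    for description, category in entries:
        print(f"[{status.value}] {category.name} ({category.value}): {description}")
```

<p>Grouping by status makes the non-mitigated bucket easy to pull out and present to the business when deciding what to fix next.</p>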
<p>This is the theory behind threat modelling. In a future blog, I will take an example and walk through the 3 steps to help understand threat modelling better. I will also show how to use tools like the Microsoft Threat Modeling Tool to build a threat model.</p>
<h1 id="heading-references">References</h1>
<p>[1] <a target="_blank" href="https://owasp.org/www-community/Threat_Modeling">Threat Modeling | OWASP Foundation</a></p>
<p>[2] <a target="_blank" href="https://owasp.org/www-community/Threat_Modeling_Process">Threat Modeling Process | OWASP Foundation</a></p>
<p>[3] <a target="_blank" href="https://www.microsoft.com/en-in/download/details.aspx?id=49168">Download Microsoft Threat Modeling Tool 2016 from Official Microsoft Download Center</a></p>
<p>[4] <a target="_blank" href="https://www.threatmodelingmanifesto.org/">Threat Modeling Manifesto</a></p>
<p>[5] <a target="_blank" href="https://online.visual-paradigm.com/diagrams/templates/threat-model-diagram/threat-modeling/;VPSESSIONID=729E30DEF0017DF4ABDA751E05F09972">https://online.visual-paradigm.com/diagrams/templates/threat-model-diagram/threat-modeling/;VPSESSIONID=729E30DEF0017DF4ABDA751E05F09972</a></p>
]]></content:encoded></item><item><title><![CDATA[Engineering with AI]]></title><description><![CDATA[ChatGPT, Claude, Gemini, Meta AI and many other tools have made a significant impact on the way we interact with technology and source information on the internet. As I type this article, I have an itch to make a prompt to Claude AI to complete this ...]]></description><link>https://blog.srigovindnayak.com/engineering-with-ai</link><guid isPermaLink="true">https://blog.srigovindnayak.com/engineering-with-ai</guid><category><![CDATA[AI]]></category><category><![CDATA[cursor]]></category><category><![CDATA[claude.ai]]></category><category><![CDATA[claude 3.7]]></category><category><![CDATA[chatgpt]]></category><dc:creator><![CDATA[Srigovind Nayak]]></dc:creator><pubDate>Sat, 12 Apr 2025 19:04:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1744484569102/883d499e-0dae-484f-bbec-522efb7d4dbd.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>ChatGPT, Claude, Gemini, Meta AI and many other tools have made a significant impact on the way we interact with technology and source information on the internet. As I type this article, I have an itch to make a prompt to Claude AI to complete this article for me. I know that it is gonna become harder and harder for me in the future as this technology advances. I will try to write this article fully without the help of AI tools.</p>
<p>With this article I want to see if I still have it in me to write an article without the help of AI tools. I will take the good old approach of using Google Search and Microsoft Word Review / Grammarly for grammar. I must say, I have spent more than 6 months slowly adding my thoughts to this article. I am reminiscing about my days in high school, when I had to write an essay of about 500-700 words in less than 30 mins. Little did I know that I would become so spoilt by AI tools that writing a simple article about how AI is taking over my job would be this hard.</p>
<p>The more I think about it, the harder it gets for me to write this article. I started writing this blog on December 31st 2024, so whenever this blog is published, you will know how badly AI has affected my ability to write a blog without the help of AI tools.</p>
<p>If you find any grammatical errors in this content it probably means that I have been successful in not using any AI tools to write this content. XD I may have used it to review the content, but not change any content, but guess what? You will never know.</p>
<h1 id="heading-early-2023-dont-trust-ai-learn-to-validate-information">Early 2023 - Don’t trust AI, Learn to Validate information</h1>
<p>In the early months of 2023 I read a Hacker News article about ChatGPT and how it can write poems for you, create articles and summarize content provided to it. Others in my office also catch wind of ChatGPT. Every technical discussion in my workplace ends with “Have you tried to ChatGPT it?” or “What does ChatGPT say about this?”.</p>
<p>I am annoyed, since I haven’t caught on to ChatGPT for technical research and every point I make in a design discussion is challenged with a quick summary obtained from ChatGPT. With about 2 years of experience, it feels like everything I know about software engineering, cloud and other skills is being challenged. My mentors and stakeholders can simply prompt ChatGPT and validate whatever I am saying. It is super easy now for junior engineers to challenge me and others.</p>
<p>Given enough context, anyone could propose a solution or provide a rebuttal with reasonable confidence. ChatGPT in 2023 was in its initial stages and was not very accurate. It would make lots of mistakes, and some points were easy to dismiss with a quick Google search or documentation link. I was skeptical about using the results for any productive work.</p>
<p>This one time my colleague (you know who you are) presented information output by ChatGPT as a fact and in the end it turned out to be quite the opposite.</p>
<p>This was the first major lesson for me. Never trust AI blindly, validate all information provided by it.</p>
<h1 id="heading-2024-i-start-using-ai-for-everything-building-a-knack-for-prompting-and-validating">2024 - I start using AI for everything; building a knack for prompting and validating</h1>
<p>In 2024 ChatGPT, Microsoft Copilot and Claude AI improved quite a bit in terms of technical accuracy. There were occasional glitches and misinformation which were easy to deal with by providing additional context and simple Google searches about the given topic.</p>
<p>Intrigued by AI tools, I start to explore the different paid subscriptions. From the last half of 2023 up to the middle of 2024 I subscribed to the paid version of ChatGPT for longer conversations, the latest model and priority access in times of high demand on OpenAI servers. After that, every minor inconvenience in writing a piece of code slowly turned into a ChatGPT prompt followed by a Ctrl + C and a Ctrl + V. Creating complete packages in Golang with passing unit tests became a breeze. Writing utility libraries for encryption and decryption, with unit tests, took about 10 mins. Compare this with one of my first tasks when I joined as an Associate Software Engineer: creating an encryption library in .NET Core. The estimate for that task was 6 weeks. The process involved research, design, discussion, approvals and a security review before actual development. Now, all of this was done along with documentation and unit tests. You could also ask ChatGPT to do a security review and give back a threat model, and it would do it for you.</p>
<p>Suddenly I deliver 2x-5x more work than I did without these technologies. It’s surprising that I am not using 100% of the skill sets I needed a year back for the same job role. I spend more time understanding the problem and the use case to craft the right prompt to help me solve the problem. The focus is more on what the business logic must be rather than on which utility libraries, interfaces etc. we need. Validating information provided by AI is the most important skill I am building now.</p>
<h2 id="heading-claude-for-scripting">Claude for scripting</h2>
<p>In the second half of 2024 I start using Claude AI. I use it for absolutely everything, including generating code snippets and bash scripts, as I have to work with Linux and Windows environments. As I start working directly with customers in the sales engineering team, I realize the true power of AI in software engineering. I understand the customer request and convert it into a prompt.</p>
<p>I often need a simple bash script to install something, or a bash script to automate a whole lot of commands that the customer would otherwise have to run manually. A simple example is when we containerized our backup application. We now had to tell the customer to install Docker, change directory to the location of the Docker Compose file, and follow multiple other steps before the container was deployed. Our first customer deployment, done without AI’s help, took around 1 hour and 30 mins.</p>
<p>After this call, my team and I sit together and craft a prompt for Claude AI describing the issues in our first customer deployment and the steps that need to be automated with a simple bash script. Claude built the initial script, which we incrementally modified with additional prompts.</p>
<p>The next time we had to do a customer deployment, we were ready with a setup script. The time for deployment in customer environments reduced to less than 15 mins; this includes the time it takes to download the binaries and the Docker image. Whenever the customer faced an issue with the bash script, it was easy to feed the error back to Claude and ask it to fix it.</p>
<p>Building the same enhancement before these tools (prior to 2022) would have been estimated to take way more time; 3 staff months at the least.</p>
<h2 id="heading-god-mode-development-with-claude-cursor">God mode development with Claude + Cursor</h2>
<p>In Q2 of 2024, while I am improving my ability to use Claude effectively, my colleagues pick up an AI-powered IDE called Cursor. The editor suggests the next logical piece of code in the application based on the context of what you’re typing. It is the best thing to happen to me as a software engineer. Writing Go code is suddenly effortless, and some of the cases you would have missed are just a Tab away.</p>
<p>Our design discussions and ideas translate into code super easily. I boilerplate architecture and code structures in Claude and give them to my team, and they use Cursor to fill in the business logic. Everything is so simple and effortless.</p>
<p>Our goal is to backup a 25 TB database. Where do we get this database was the question. You guessed it right!</p>
<p>“I want to test the back up of a 25 TB database, I want to generate 25 TB of random data and then load this data as rows to my database. Give me a combination of bash and SQL scripts to generate and load this data in the least time possible. I have enough CPU, RAM and Storage available. I want to be able to build this test database in the next 2 days.”</p>
<p>Boom, a script is ready. My mentor and I make some more minor tweaks to the script and run it overnight in parallel; we actually ran the script to load 50 TB by mistake.</p>
<p>Oh God! What do we do now? We decide to go for the backup test, and start the backup tool. The database backup takes 17 hours (we do a local test with two 72-core, 1.5 TB RAM machines physically connected using 10 GbE Ethernet) and the restore about the same time. The 36 hours to verify the contents turn out to be the most nerve-racking. With Claude and Cursor we did it! We beat the expectation by 2X.</p>
<p>All of this was done in less than 2 months. Building the capability to back up a 50 TB database would otherwise have been estimated at 6-12 months.</p>
<h1 id="heading-2025-the-path-to-become-a-10x-developer-with-ai">2025 - The path to becoming a 10X developer with AI</h1>
<p>In late 2024 Claude, ChatGPT and other tools introduce the ability to create projects. My team starts using it and we get an enterprise account for all the team members. It is super easy to create a shared context and chats using projects. All projects have a shared context and information regarding requirements, answering styles etc. AI models are more advanced and we find new ways to ingest software development projects.</p>
<h1 id="heading-what-i-have-understood">What I have understood</h1>
<p>I understand the need for effective prompting to get the best results out of an AI tool. The AI is trying to answer your prompt with the least number of tokens possible, which means there is a lot of data it might not consider while providing you with a response. It’s important to be clear in your prompt about what you expect from a single query. You must be able to build context and ask the AI tool to solve the problem in parts, similar to how you would interact with humans in the real world. Do not rely on it solely yet; use your own judgement and experience to review the results. I know this might not be the case in a couple of months.</p>
<p>As a leader in my organization says, the next big programming language is going to be English, and I agree. Specific domains that R&amp;D has never explored might see new light through prompt engineering, with the cost of experimenting with code and research reducing significantly.</p>
<p>Let me see if I can keep this up and write a few more with my experiences; share some best practices etc. I will try not to use AI tools to generate content, so that this experience is as authentic as possible and I don’t forget how to write sentences and paragraphs in English.</p>
]]></content:encoded></item><item><title><![CDATA[Mastering Disaster Recovery - Part 3 : Business Continuity and Disaster Recovery Planning]]></title><description><![CDATA[In my quest to master disaster recovery, I discovered that backups are a crucial part of any disaster recovery plan. But what exactly is disaster recovery planning?
In this blog, let's explore what a disaster recovery plan is, why we need it, and how...]]></description><link>https://blog.srigovindnayak.com/mastering-disaster-recovery-part-3-business-continuity-and-disaster-recovery-planning</link><guid isPermaLink="true">https://blog.srigovindnayak.com/mastering-disaster-recovery-part-3-business-continuity-and-disaster-recovery-planning</guid><category><![CDATA[business continuity]]></category><category><![CDATA[Disaster recovery]]></category><category><![CDATA[Backup]]></category><category><![CDATA[Disaster Recovery Planning]]></category><dc:creator><![CDATA[Srigovind Nayak]]></dc:creator><pubDate>Sun, 02 Jun 2024 14:07:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1717337137266/5058cac9-efc6-481e-86d1-5611f52bca08.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my quest to master disaster recovery, I discovered that backups are a crucial part of any disaster recovery plan. But what exactly is disaster recovery planning?</p>
<p>In this blog, let's explore what a disaster recovery plan is, why we need it, and how it contributes to an organisation's broader business continuity plan.</p>
<h1 id="heading-the-objective-is-business-continuity">The objective is business continuity</h1>
<p>Business continuity is the capability of an organisation to continue operations of products and services following any disruptive events. These disruptive events include natural disasters like earthquakes, floods and solar storms; other events like terror attacks, arson, security breaches and data breaches. Disruptive events can also be socio-economic factors like global recession, trade deficits, and loss of customers.</p>
<p>Organisations plan and create systems to prevent and prepare for recovery from any of the above-mentioned threats. The plan enumerates a range of disaster scenarios and lists the roles, responsibilities and steps needed to resume regular operations. The key focus is on preparedness, protection, response and recovery strategies.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717335845852/07af7e66-e996-4853-ae8d-b89ea2658308.png" alt class="image--center mx-auto" /></p>
<p>While business continuity focuses on the overall resilience of an organisation, disaster recovery planning is a crucial subset that specifically addresses the restoration of IT systems and technology operations.</p>
<h1 id="heading-disaster-recovery-planning-is-a-subset-of-business-continuity">Disaster recovery planning is a subset of Business Continuity</h1>
<p>Business continuity (BC) and disaster recovery (DR), often collectively referred to as BCDR, are two related but distinct approaches to ensuring an organisation's resilience. BC is a proactive strategy that focuses on minimising risks and ensuring the organisation can continue to operate and deliver products and services during a disaster event by defining ways for employees to continue their work. On the other hand, DR is a reactive subset of BC that concentrates on the specific steps needed to resume IT systems and technology operations after a disaster occurs, and is implemented only when a disaster actually strikes.</p>
<p>To develop an effective disaster recovery plan, it is essential to first understand what the business needs to protect. This involves creating an inventory of critical assets, identifying stakeholders, and cataloging procedure and business process documents.</p>
<h1 id="heading-understand-what-the-business-needs-to-protect">Understand what the business needs to protect</h1>
<h2 id="heading-create-an-inventory">Create an Inventory</h2>
<h3 id="heading-what-do-we-need-to-protect">What do we need to protect?</h3>
<ul>
<li>IT Equipment: Catalog all hardware, software, and network infrastructure, including servers, workstations, routers, and storage devices.</li>
</ul>
<h3 id="heading-who-are-we-dependent-on">Who are we dependent on?</h3>
<ul>
<li><p>Contractors: List all external contractors and their roles in maintaining or supporting the organisation's IT systems.</p>
</li>
<li><p>3rd Party Vendors: Identify all vendors providing essential services, products, or support to the organisation's IT operations.</p>
</li>
</ul>
<h3 id="heading-where-do-we-keep-data-and-where-do-we-backup-to">Where do we keep data and where do we backup to?</h3>
<ul>
<li><p>Primary Sites: Document the main locations where business operations and IT systems are housed.</p>
</li>
<li><p>Recovery Sites: Identify alternative locations that can be used to resume operations in the event of a disaster.</p>
</li>
</ul>
<h2 id="heading-identify-stakeholders-and-key-responsibilities">Identify Stakeholders and key responsibilities</h2>
<h3 id="heading-who-are-the-decision-makers">Who are the decision makers?</h3>
<ul>
<li><p>Executive Management: Support, approve, and communicate the disaster recovery plan; make critical decisions during disasters.</p>
</li>
<li><p>Business Unit Leaders: Collaborate with IT to identify critical processes; provide input for the Business Impact Analysis (BIA); develop business continuity plans; ensure team awareness; participate in testing and training; coordinate during disasters.</p>
</li>
</ul>
<h3 id="heading-who-will-implement-and-maintain-it">Who will implement and maintain it?</h3>
<ul>
<li>IT Department: Develop, implement, and maintain technical aspects of the plan; prioritise IT system recovery; ensure data integrity and availability; conduct testing and updates.</li>
</ul>
<h2 id="heading-catalog-procedure-documents-and-business-process-documents">Catalog Procedure Documents and Business Process Documents:</h2>
<ul>
<li><p>Procedure Documents: Create and maintain a comprehensive set of documents detailing the step-by-step procedures for critical IT operations, such as system backups, data restoration, and emergency response protocols.</p>
</li>
<li><p>Business Process Documents: Document all essential business processes, including their dependencies on IT systems, to ensure a clear understanding of how technology supports the organisation's operations.</p>
</li>
</ul>
<p>Once the critical assets and stakeholders have been identified, it is crucial to collaborate with the business to determine the potential impact of a disaster. This involves conducting a Business Impact Analysis (BIA), Threat and Risk Analysis (RA), and Impact Analysis.</p>
<h1 id="heading-collaborate-with-business-to-decide-what-is-the-impact-of-a-disaster">Collaborate with Business to decide what is the impact of a disaster</h1>
<h2 id="heading-business-impact-analysis-bia">Business Impact Analysis (BIA):</h2>
<ul>
<li><p>Identify and prioritise critical business functions and processes.</p>
</li>
<li><p>Determine the potential impact of disruptions on each function or process, including financial losses, reputation damage, and regulatory consequences.</p>
</li>
<li><p>Establish recovery time objectives (RTOs) and recovery point objectives (RPOs) for each critical function or process.</p>
</li>
</ul>
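<p>As a rough illustration, the outcome of a BIA can be kept as structured data so that recovery priorities are easy to sort and review. The Python sketch below is a minimal example, not a prescribed format: the <code>BusinessFunction</code> class, the function names, and every RTO/RPO value are hypothetical.</p>
<pre><code class="lang-python">from dataclasses import dataclass
from datetime import timedelta


@dataclass
class BusinessFunction:
    """One row of a hypothetical BIA worksheet."""
    name: str
    rto: timedelta  # maximum tolerable time to restore the function
    rpo: timedelta  # maximum tolerable data loss, measured in time


def recovery_priority(functions):
    """Order functions so the tightest RTO (then RPO) is recovered first."""
    return sorted(functions, key=lambda f: (f.rto, f.rpo))


# Illustrative values only; real objectives come from the business.
bia = [
    BusinessFunction("payroll", timedelta(hours=24), timedelta(hours=12)),
    BusinessFunction("order processing", timedelta(hours=1), timedelta(minutes=15)),
    BusinessFunction("internal wiki", timedelta(hours=72), timedelta(hours=24)),
]

for function in recovery_priority(bia):
    print(function.name, function.rto, function.rpo)
</code></pre>
<p>Sorting by RTO first is only one possible rule; an organisation may prefer to weight financial or regulatory impact instead.</p>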
<h2 id="heading-threat-and-risk-analysis-ra">Threat and Risk Analysis (RA):</h2>
<ul>
<li><p>Identify potential threats to the organisation's IT systems and operations, such as natural disasters, cyber-attacks, and equipment failures.</p>
</li>
<li><p>Assess the likelihood and potential impact of each threat.</p>
</li>
<li><p>Develop strategies to mitigate or eliminate identified risks.</p>
</li>
</ul>
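<p>One common way to compare threats is a simple qualitative score of likelihood multiplied by impact. The sketch below assumes a 1 to 5 scale for both values; the scale, the <code>Threat</code> class, and the scores themselves are illustrative assumptions rather than a standard.</p>
<pre><code class="lang-python">from dataclasses import dataclass


@dataclass
class Threat:
    """A threat scored on hypothetical 1-5 likelihood and impact scales."""
    name: str
    likelihood: int
    impact: int

    @property
    def risk_score(self):
        # Simple qualitative scoring: likelihood multiplied by impact.
        return self.likelihood * self.impact


def rank_threats(threats):
    """Return threats ordered from highest to lowest risk score."""
    return sorted(threats, key=lambda t: t.risk_score, reverse=True)


# Illustrative scores only; they are not measurements.
threats = [
    Threat("equipment failure", likelihood=4, impact=2),
    Threat("cyber-attack", likelihood=3, impact=5),
    Threat("natural disaster", likelihood=1, impact=5),
]

for threat in rank_threats(threats):
    print(threat.name, threat.risk_score)
</code></pre>
<p>The ranking is only a starting point for discussion: a rare threat with a severe impact may still deserve mitigation ahead of its score.</p>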
<h2 id="heading-impact-analysis">Impact Analysis:</h2>
<ul>
<li><p>Evaluate the consequences of disruptions on the organisation's overall operations, including the impact on employees, customers, and stakeholders.</p>
</li>
<li><p>Identify the interdependencies between various business functions and IT systems.</p>
</li>
<li><p>Determine the resources required to maintain critical operations during a disaster event and to recover from disruptions.</p>
</li>
</ul>
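<p>Once interdependencies are written down, they can also suggest an order in which to bring systems back. The sketch below uses <code>graphlib.TopologicalSorter</code> from the Python standard library (Python 3.9 or later); the systems and their dependencies are hypothetical.</p>
<pre><code class="lang-python">from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the systems in its set.
dependencies = {
    "order processing": {"database", "payment gateway"},
    "payment gateway": {"network"},
    "database": {"storage", "network"},
    "storage": set(),
    "network": set(),
}


def recovery_order(dependency_map):
    """List systems so that every dependency comes before its dependents."""
    return list(TopologicalSorter(dependency_map).static_order())


order = recovery_order(dependencies)
print(order)
</code></pre>
<p>A circular dependency makes <code>static_order()</code> raise <code>CycleError</code>, which is itself a useful finding during impact analysis.</p>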
<p>With a clear understanding of the potential impact of a disaster and the critical assets that need protection, the organisation can finally build a comprehensive Disaster Recovery Plan (DRP). The DRP should outline the scope, objectives, and general contents necessary for effective disaster recovery.</p>
<h1 id="heading-finally-build-the-disaster-recovery-plan">Finally build the Disaster Recovery Plan</h1>
<p>A disaster recovery plan is a document that describes how to resume operations after a disruption. It focuses mainly on the organisation’s IT infrastructure. The goal of the plan is to minimise data loss and recovery time, and to return system integrity and availability to an acceptable level.</p>
<p>The BIA, RA and impact analysis reports identify the impacts of disruptive events and set the context for the RPO and RTO objectives.</p>
<h2 id="heading-scope-of-a-drp">Scope of a DRP</h2>
<p>The scope of a DRP encompasses all critical IT systems and infrastructure that support the organisation's core business processes.</p>
<h2 id="heading-objectives-of-the-drp">Objectives of the DRP</h2>
<ol>
<li><p>Minimise downtime and data loss during a disaster event</p>
</li>
<li><p>Ensure the timely restoration of critical systems and applications</p>
</li>
<li><p>Maintain the integrity and availability of data and systems</p>
</li>
<li><p>Provide clear guidance and direction to staff involved in the recovery process</p>
</li>
<li><p>Comply with regulatory requirements and industry best practices</p>
</li>
</ol>
<h2 id="heading-general-contents-of-a-disaster-recovery-plan">General contents of a disaster recovery plan</h2>
<ol>
<li><p>Introduction</p>
<ul>
<li><p>Purpose of the DRP</p>
</li>
<li><p>Scope of the plan</p>
</li>
<li><p>Objectives of the plan</p>
</li>
</ul>
</li>
<li><p>Roles and Responsibilities</p>
<ul>
<li><p>DRP team members and their contact information</p>
</li>
<li><p>Roles and responsibilities of each team member</p>
</li>
</ul>
</li>
<li><p>Incident Response</p>
<ul>
<li><p>Incident detection and reporting procedures</p>
</li>
<li><p>Incident classification and prioritisation</p>
</li>
<li><p>Communication plan (internal and external)</p>
</li>
</ul>
</li>
<li><p>Inventory</p>
<ul>
<li><p>Inventory of critical IT systems and infrastructure</p>
</li>
<li><p>Identify backup tools and mechanisms for different workloads and infrastructure</p>
</li>
<li><p>Primary Site information</p>
</li>
<li><p>Secondary Site information</p>
</li>
<li><p>Off-site Backup Location</p>
</li>
</ul>
</li>
<li><p>Business Impact Analysis (BIA)</p>
<ul>
<li><p>Identification of critical business processes</p>
</li>
<li><p>Recovery Time Objectives (RTOs)</p>
</li>
<li><p>Recovery Point Objectives (RPOs)</p>
</li>
<li><p>Prioritisation of recovery efforts</p>
</li>
</ul>
</li>
<li><p>IT Systems Recovery Procedures</p>
<ul>
<li><p>Backup and data replication procedures</p>
</li>
<li><p>Step-by-step recovery procedures for each critical system</p>
</li>
</ul>
</li>
<li><p>Vendor and Third-Party Coordination</p>
<ul>
<li><p>Contact information for key vendors and third-party service providers</p>
</li>
<li><p>Procedures for coordinating with vendors during a disaster</p>
</li>
</ul>
</li>
<li><p>Testing and Maintenance</p>
<ul>
<li><p>Schedule for regular DRP drills, testing and exercises</p>
</li>
<li><p>Procedures for updating and maintaining the DRP</p>
</li>
<li><p>Post-incident review and lessons learned</p>
</li>
</ul>
</li>
<li><p>Appendices</p>
<ul>
<li><p>Contact lists (employees, vendors, stakeholders)</p>
</li>
<li><p>System and network diagrams</p>
</li>
<li><p>Copies of critical documents and agreements</p>
</li>
</ul>
</li>
</ol>
<p>By developing a well-structured Disaster Recovery Plan that covers the elements discussed in this blog post, organisations can significantly improve their ability to recover from disruptive events and ensure business continuity.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Disaster Recovery Planning is crucial to an organisation's business continuity plan. Understanding the business's needs, risks, and potential impacts is key before creating such a plan. The overall cost and complexity of the disaster recovery plan may vary depending on these needs.</p>
<p>The first step is to document the disaster recovery plan. Regular testing and DR failover during maintenance windows can help the IT team identify gaps in the plan and suggest improvements.</p>
<p>When major infrastructure is changed or added, the DR plan must be updated with the latest backup procedures and recovery mechanisms. A regular test plan should also be implemented.</p>
<p>Businesses should regularly audit their Disaster Recovery plans and DR drill documentation.</p>
<h1 id="heading-references">References</h1>
<p><a target="_blank" href="https://en.wikipedia.org/wiki/Business_continuity_planning">Business continuity planning</a></p>
<p><a target="_blank" href="https://cloudian.com/guides/disaster-recovery/disaster-recovery-5-key-features-and-building-your-dr-plan">What Is Disaster Recovery? - Features and Best Practices</a></p>
<p><a target="_blank" href="https://www.techtarget.com/searchdisasterrecovery/definition/disaster-recovery-plan">What is a Disaster Recovery Plan (DRP) and How Do You Write One?</a></p>
]]></content:encoded></item><item><title><![CDATA[My journaling journey using Notion]]></title><description><![CDATA[Remote work and daily updates
My career began in 2020 amidst the lockdown, a challenging time for remote work adaptation. Initially, the learning curve was steep due to the lack of in-person interactions. Resolving simple queries involved cumbersome ...]]></description><link>https://blog.srigovindnayak.com/my-journaling-journey-using-notion</link><guid isPermaLink="true">https://blog.srigovindnayak.com/my-journaling-journey-using-notion</guid><category><![CDATA[journal]]></category><category><![CDATA[notion]]></category><category><![CDATA[KnowledgeManagement]]></category><dc:creator><![CDATA[Srigovind Nayak]]></dc:creator><pubDate>Sat, 27 Jan 2024 12:26:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1706358089716/b1c04f6e-88f4-48db-b49a-7e35ae93ec14.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-remote-work-and-daily-updates">Remote work and daily updates</h2>
<p>My career began in 2020 amidst the lockdown, a challenging time for remote work adaptation. Initially, the learning curve was steep due to the lack of in-person interactions. Resolving simple queries involved cumbersome steps: checking availability, sending messages, and hoping for timely responses. Often, the wait for guidance led to either self-resolution or forgotten issues.</p>
<p>The most challenging part was the early morning standup meetings. After working late nights, waking up at 9 AM to share updates was exhausting. I often struggled to articulate my progress and questions. To address this, I started noting down my daily activities and queries using my phone's notes app. This practice allowed me to efficiently communicate during standups, even when partially awake.</p>
<h2 id="heading-adding-some-structure">Adding some structure</h2>
<p>By mid-February 2021, while working on a complex component, I faced the challenge of effectively sharing and recording vast amounts of research, ranging from articles and forums to research papers. Initially, I shared these resources via chat threads, but as discussions expanded, it became difficult to track and retrieve information.</p>
<p>I attempted to organise this information in a Notepad file, but soon faced issues with chronological order and relevance. This led me to explore other tools.</p>
<h2 id="heading-starting-with-notion">Starting with Notion</h2>
<p>I revisited Notion, an app I had previously registered for but seldom used. I created daily pages titled with the date and sections like “What did I do yesterday?” and “What will I do today?” This format provided clarity and ease in tracking my activities and action items. However, after two months, I encountered difficulties in retrieving specific information from the daily records.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706358154777/95052dfd-046c-42de-967b-3ed3e40f2473.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-calendars-and-databases-on-notion">Calendars and databases on Notion</h2>
<p>To overcome this, I utilised Notion’s calendar database feature, organising daily updates and research into a calendar grid. I adopted descriptive titles for each entry, facilitating easy retrieval through the search function. This system proved highly effective, and I began creating a dedicated calendar database for each month. It became an invaluable resource for recalling solutions and sharing knowledge with colleagues.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706358171913/f57d46cd-833f-4b98-b223-b2676eb6f037.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-my-daily-routine-journaling">My daily routine journaling</h2>
<p>I dedicate 15 minutes daily to reflect on my accomplishments and pending tasks. Summarizing discussions and my contributions helps maintain mindfulness about my work and ensures no task is overlooked. This practice has streamlined my morning routine, offering a clear starting point each day.</p>
<p>Sharing knowledge about previously encountered problems and their solutions has become effortless. I am always just a search away from finding and sharing relevant articles or commands with my teammates.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Keeping a work journal is like having a secret work buddy. It's your daily roadmap, highlighting what needs doing and celebrating what you've done. It saves you from those awkward "I forgot" moments in meetings, acting like your own work highlight reel. Plus, it's a fantastic way to see how much you're growing in your job. And when work gets too much? Jotting things down is like a mini stress-buster session. So, in a nutshell, a work journal is your go-to for staying organized, being on top of your game, and keeping cool under pressure.</p>
]]></content:encoded></item><item><title><![CDATA[Mastering Disaster Recovery - Part 2 : Off-site Backups]]></title><description><![CDATA[In the previous article, I wrote about the seven levels of disaster recovery. Read it here for more context: Mastering Disaster Recovery - Part 1 : Seven Levels
The first level of disaster recovery is to back up data to a magnetic tape or disk drive ...]]></description><link>https://blog.srigovindnayak.com/mastering-disaster-recovery-part-2-off-site-backups</link><guid isPermaLink="true">https://blog.srigovindnayak.com/mastering-disaster-recovery-part-2-off-site-backups</guid><category><![CDATA[tapebackup]]></category><category><![CDATA[Disaster recovery]]></category><category><![CDATA[tape]]></category><category><![CDATA[Backup]]></category><category><![CDATA[Backup Strategy]]></category><dc:creator><![CDATA[Srigovind Nayak]]></dc:creator><pubDate>Sat, 02 Dec 2023 19:31:45 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1701545402545/f6643946-0c9a-4d9b-84d9-8a8616d9e659.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the previous article, I wrote about the seven levels of disaster recovery. Read it here for more context: <a target="_blank" href="https://blog.srigovindnayak.com/mastering-disaster-recovery-part-1-seven-levels"><strong>Mastering Disaster Recovery - Part 1 : Seven Levels</strong></a></p>
<p>The first level of disaster recovery is to back up data to a magnetic tape or disk drive stored off-site. In this blog, I will provide a quick overview of off-site backups.</p>
<h1 id="heading-off-site-backup">Off-site Backup</h1>
<p>The goal of off-site backup is to store data in a different location from its origin. A copy of the data is made over the network to an off-site storage medium, such as tape, network-attached storage (NAS), or a cloud storage solution. The storage of off-site data is also known as <strong>vaulting</strong>.</p>
<p>Data is crucial for any organization, but what happens when disaster strikes the main data center? This is where off-site backups become essential. They provide a secure and accessible alternative to keep your data safe. Let's explore how off-site backups work and why they're so important for protecting your data.</p>
<h2 id="heading-core-propositions-for-off-site-backups">Core Propositions for Off-site Backups</h2>
<h3 id="heading-protection-against-complete-system-failures">Protection against complete system failures</h3>
<p>If a disaster or a cyber attack hits the main office, we can use off-site backups to get everything back up and running. This way, we keep the business going no matter what happens.</p>
<p>Data can be lost not just because of system crashes, but also when hardware such as hard drives breaks down. Hard drives don't last forever and can stop working, causing data loss and expensive delays. Drives with moving parts tend to wear out sooner than flash-based drives, but every type has a finite lifespan. An off-site backup in a different location is a good way to protect against losing data if a hard drive fails.</p>
<h3 id="heading-geo-redundancy">Geo-redundancy</h3>
<p>The geographical separation of data in off-site backups is a critical aspect of a robust data protection strategy. By storing data in a location physically distant from the primary site, off-site backups provide a vital safeguard against local disasters such as fires, floods, earthquakes, or even man-made events like theft or vandalism. This separation ensures that, even if a catastrophic event were to completely compromise the primary business location and its on-site backups, the off-site data remains unaffected and secure.</p>
<h3 id="heading-better-security">Better security</h3>
<p>In the case of tape backups and optical disc backups, storage media can be kept in isolation with physical security in place. Tapes stored offline are physically durable and cannot be reached by attacks over the network, making them reliable for disaster recovery, especially when stored off-site. Cloud storage solutions, on the other hand, provide features like immutability (write once, read many - WORM), which prevents data from being overwritten or deleted.</p>
<p>For example, if ransomware hits an organization's on-site backup, off-site backups provide the immutability and physical isolation needed for the business to continue.</p>
<h3 id="heading-optimization-of-space">Optimization of Space</h3>
<p>Off-site backups ensure that your primary disks are not taken up by backup data. This saves space and means fewer capacity upgrades to the storage array. Keeping only the critical operating system and application files on the primary disks helps the primary server run more smoothly. Off-site backups using cloud storage and tape can also be more cost-effective than traditional on-site backups.</p>
<h2 id="heading-storage-destinations-for-off-site-backups">Storage Destinations for Off-site backups</h2>
<h3 id="heading-network-attached-storage-nas">Network Attached Storage (NAS)</h3>
<p>Network Attached Storage (NAS) is a popular storage solution. Linux and Windows can mount network file storage out of the box, using protocols such as Network File System (NFS) and Server Message Block (SMB). Backup applications either mount the storage onto a local file system or use a session mount for data transfer. It's relatively easy to set up and is cost-effective since no additional infrastructure is required. The only requirement is a network connection between the environments in the two locations, although the costs associated with network peering and management might be high.</p>
<h3 id="heading-cloud-storage">Cloud Storage</h3>
<p>Cloud storage stands out for its minimal upfront costs and flexible scalability. Users pay only for the storage they use, with the ability to easily adjust as needs change, avoiding the high initial investment in hardware and infrastructure required for on-site backups. Additionally, cloud providers handle maintenance, security, and infrastructure management, significantly reducing operational and maintenance costs.</p>
<h3 id="heading-tape-storage">Tape Storage</h3>
<p>Tape backups, while requiring some initial investment in tapes and drives, are more affordable than establishing a full on-site data center. They are particularly cost-effective for long-term archival storage due to their low cost per unit of storage and long shelf life. Tapes also don't consume power while the data sits at rest, which further reduces ongoing costs. Because tapes stored offline cannot be reached over the network, they are a reliable option for disaster recovery, especially when stored off-site.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Implementing off-site backups is more than just a safety measure; it's an essential component of the disaster recovery plan. By storing data off-site, organizations can safeguard against a spectrum of risks, from natural disasters to sophisticated cyber threats.</p>
<p>Cloud and tape storage options enhance data security and offer a cost-effective, scalable alternative to traditional on-site methods. This approach ensures business continuity in adverse scenarios and contributes to the overall efficiency and resilience of the IT infrastructure.</p>
<h1 id="heading-references">References</h1>
<p><a target="_blank" href="https://www.techtarget.com/searchdatabackup/definition/off-site-backup">What Is Off-Site Backup? | Definition from TechTarget</a></p>
<p><a target="_blank" href="https://en.wikipedia.org/wiki/Off-site_data_protection">Off-site data protection - Wikipedia</a></p>
<p><a target="_blank" href="https://www.liquidweb.com/blog/offsite-backups/">Offsite Data Backup [Top 5 Reasons for Data Security] | Liquid Web</a></p>
<p><a target="_blank" href="https://datastorage-na.fujifilm.com/tape-storage-vs-disk-storage-getting-the-facts-straight-about-total-cost-of-ownership/">Tape Storage vs. Disk Storage: Getting the Facts Straight about Total Cost of Ownership Calculations - Fujifilm Data Storage</a></p>
]]></content:encoded></item><item><title><![CDATA[Mastering Disaster Recovery - Part 1 : Seven Levels]]></title><description><![CDATA[When discussing business continuity plans, it's important to understand the concepts of high-availability (HA) and disaster recovery. High-availability is a system's ability to remain resilient against single points of failure, ensuring consistent pe...]]></description><link>https://blog.srigovindnayak.com/mastering-disaster-recovery-part-1-seven-levels</link><guid isPermaLink="true">https://blog.srigovindnayak.com/mastering-disaster-recovery-part-1-seven-levels</guid><category><![CDATA[Disaster recovery]]></category><category><![CDATA[Backup]]></category><category><![CDATA[replication]]></category><category><![CDATA[RPO]]></category><category><![CDATA[rto ]]></category><dc:creator><![CDATA[Srigovind Nayak]]></dc:creator><pubDate>Sun, 12 Nov 2023 18:27:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1699813251421/f5635ac6-59dc-44d8-88eb-5ae92c028d95.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When discussing business continuity plans, it's important to understand the concepts of high-availability (HA) and disaster recovery. High-availability is a system's ability to remain resilient against single points of failure, ensuring consistent performance and uptime. However, HA alone is not sufficient. Organisations must also have a robust disaster recovery strategy to quickly restore infrastructure and data with minimal data loss in the event of a disruption.</p>
<p>In this blog, I will provide an overview of disaster recovery and introduce the seven levels of disaster recovery, setting the stage for a deeper exploration in future blogs.</p>
<h2 id="heading-disaster-recovery"><strong>Disaster Recovery</strong></h2>
<p>Disaster recovery is a crucial aspect of maintaining or re-establishing vital infrastructure and systems following a natural or human-induced disaster, such as a storm or cyberattack. It's essential for keeping all critical aspects of a business functioning despite significant disruptive events. Effective disaster recovery requires well-thought-out policies, procedures, and tools to ensure business continuity.</p>
<h3 id="heading-measuring-data-loss-and-recovery-time"><strong>Measuring Data Loss and Recovery Time</strong></h3>
<p>In the event of a disaster, an organisation's primary goal is to restore all systems rapidly while minimising data loss. These objectives are quantified as Recovery Time Objective (RTO) and Recovery Point Objective (RPO):</p>
<ul>
<li><p><strong>Recovery Time Objective (RTO)</strong>: This is the maximum acceptable time to restore infrastructure and data so that business operations can resume.</p>
</li>
<li><p><strong>Recovery Point Objective (RPO)</strong>: This represents the maximum acceptable amount of data loss, measured in time, looking back from the point of the disaster.</p>
</li>
</ul>
<p><a target="_blank" href="https://en.wikipedia.org/wiki/File:RPO_RTO_example_converted.png"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/RPO_RTO_example_converted.png/500px-RPO_RTO_example_converted.png" alt="Example showing longer 'actual' times that do NOT meet either RPO or RTOs ('objectives'). Diagram provides schematic representation of the terms RPO and RTO." class="image--center mx-auto" /></a></p>
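<p>To make the two objectives concrete, the sketch below compares what actually happened in an incident against the agreed targets. It is a minimal Python example; the <code>evaluate_recovery</code> helper, the timestamps, and the four-hour objectives are all hypothetical.</p>
<pre><code class="lang-python">from datetime import datetime, timedelta


def evaluate_recovery(last_backup, disaster, restored, rpo, rto):
    """Compare the actual data loss and downtime against the RPO and RTO."""
    data_loss = disaster - last_backup  # work done since the last good copy
    downtime = restored - disaster      # time until operations resumed
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": rpo >= data_loss,
        "rto_met": rto >= downtime,
    }


# Hypothetical timeline: nightly backup, outage mid-morning.
result = evaluate_recovery(
    last_backup=datetime(2023, 11, 12, 2, 0),
    disaster=datetime(2023, 11, 12, 10, 30),
    restored=datetime(2023, 11, 12, 13, 30),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=4),
)
print(result)
</code></pre>
<p>In this made-up timeline the systems are back within the RTO, but eight and a half hours of data are lost, so the RPO is missed. A gap like that usually points to backing up more often rather than restoring faster.</p>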
<h3 id="heading-the-need-for-a-secondary-site"><strong>The Need for a Secondary Site</strong></h3>
<p>A secondary location equipped with comparable infrastructure—like computing resources, storage, and networking—is necessary, particularly when the primary site is not immediately recoverable. The data restored at this secondary site is crucial for continuing business operations.</p>
<h3 id="heading-states-of-infrastructure-and-data-layers"><strong>States of Infrastructure and Data Layers</strong></h3>
<p>The secondary site can be either active or passive. For instance, while the computing, network, and storage might be active, if the site lacks the necessary data (or state) to function as the primary site, data restoration is needed. In this scenario, the data layer is in a passive state, which impacts the RTO during disaster recovery.</p>
<h3 id="heading-considerations-for-your-disaster-recovery-plan-drp"><strong>Considerations for Your Disaster Recovery Plan (DRP)</strong></h3>
<p>To effectively establish a DRP, businesses must discuss their domain-specific needs to determine appropriate RPO and RTO requirements. For example, banks typically require very low RPO and RTO, aiming for minimal downtime, whereas a university or research organisation might tolerate some data loss and a longer recovery period.</p>
<h3 id="heading-from-backups-to-continuous-data-replication-the-7-tiers-of-disaster-recovery"><strong>From Backups to Continuous Data Replication: The 7 Tiers of Disaster Recovery</strong></h3>
<p>Achieving desired RPO and RTO goals involves understanding the different levels of disaster recovery, ranging from level 0 to level 6. Each level offers varying degrees of data protection and recovery speed, with increasing cost and complexity.</p>
<ol>
<li><p><strong>Level 0 - No Off-Site Data</strong>: This basic level involves storing data exclusively on-site, without off-site backups. It's the most cost-effective but carries the highest risk of total data loss in case of on-site disasters. Ideal for small, non-critical setups.</p>
</li>
<li><p><strong>Level 1 - Backup Tapes Off-Site</strong>: Involves backing up data to magnetic tapes stored off-site. It's a more secure option than Level 0 but can be slow in data recovery. Suited for institutions where data recovery speed is not a critical factor.</p>
</li>
<li><p><strong>Level 2 - Disk Backup Off-Site</strong>: Faster recovery is possible as data is backed up onto disk-based systems off-site. It’s more expensive than tape backups but allows for more frequent backups. Suitable for medium-sized businesses prioritising recovery speed.</p>
</li>
<li><p><strong>Level 3 - Electronic Vaulting</strong>: Data is sent in batches to an off-site location at regular intervals. It strikes a balance between backup frequency and costs, ideal for organisations with moderate data-change rates.</p>
</li>
<li><p><strong>Level 4 - Point-in-Time Copies</strong>: Offers frequent snapshots of data, providing multiple recovery points. This level is storage-intensive and ideal for businesses with high transaction rates or those maintaining critical systems.</p>
</li>
<li><p><strong>Level 5 - Transaction Integrity</strong>: Ensures all transactions are captured up to the point of failure, offering high data integrity. It's technically complex and ideal for setups where transactional consistency is crucial, like financial institutions.</p>
</li>
<li><p><strong>Level 6 - Zero or Near-Zero RPO</strong>: Provides continuous data protection with almost instantaneous recovery and minimal data loss. It's the most sophisticated and costly solution, suitable for large enterprises or critical government systems.</p>
</li>
</ol>
<h3 id="heading-conclusion">Conclusion</h3>
<p>In disaster recovery planning, accurately defining Recovery Point Objective (RPO) and Recovery Time Objective (RTO) is crucial for business resilience. These objectives dictate how quickly and effectively a company can bounce back from disruptions. However, implementing these objectives through appropriate disaster recovery tiers involves a careful balance of costs and capabilities. A successful DR plan aligns with the organisation's risk tolerance and budget, ensuring that the level of investment is proportional to the potential risks and impacts. In essence, a well-crafted DR plan not only protects critical business functions but also aligns with the organisation's financial strategy, ensuring long-term stability and growth.</p>
<h2 id="heading-references">References</h2>
<p><a target="_blank" href="https://www.linkedin.com/pulse/high-availability-vs-disaster-recovery-whats-why-matters-nasser/">High Availability vs Disaster Recovery: What's the Difference and Why it Matters for Your Business</a></p>
<p><a target="_blank" href="https://en.wikipedia.org/wiki/High_availability">High availability</a></p>
<p><a target="_blank" href="https://www.cloud4u.com/blog/seven-tiers-of-disaster-recovery/">7 tiers of disaster recovery</a></p>
<p><a target="_blank" href="https://www.cloud4u.com/blog/disaster-recovery-planning/">Basic Steps for your Business Continuity &amp; Disaster Recovery plan</a></p>
]]></content:encoded></item></channel></rss>