Plan for Successful Workload Migrations

By | September 19, 2020

In this post, I’ll share the considerations and methods relating to migrating virtual machines or workloads. I will not be covering the technical options here; this post is about what ought to be considered.

For those who have been in the industry long enough, you will have experienced the era when lots of physical servers were virtualised using the P2V (Physical to Virtual) process. Those were the days when VMware Converter, PlateSpin, and a few others were heavily used. Today, there are far fewer physical servers to migrate, and some of these names have faded from memory.

What is commonly seen nowadays is V2V (Virtual to Virtual). ESXi to Hyper-V once had a bit of a buzz. Now, what I’m often asked about is migrating to AHV, with ESXi or Hyper-V being the most common sources, followed by physical servers and Amazon EC2.

Regardless of the source and destination types, there are always some key considerations to check through for every single workload. The planning phase is extremely important, and in comes the cliché: “If you fail to plan, you plan to fail”. This is very true for workload migrations. If they are not properly planned, you can run into trouble and cause a really bad outage.

Mass migration activities that are poorly handled are known to be RGEs (Resume Generating Events). On the flip side, very successful migration factories are known to produce lots of engineers who claim to know vSphere well, but all they know is follow-the-doc migration.

But I digress… back to the topic.

No. 1 rule for migration – Plan, Plan and Plan

Of course, in order to plan, there are a few technical aspects that must be understood. Read on to see what they are.

1. Supportability on the destination platform

For me, this is the first question that must be answered, especially if keeping to a supported configuration is important to your organisation. Supportability considerations include the operating system, OS level clustering and the application. Do be aware that there is a difference between supportability and functionality. Supportability means that if you open a support ticket with the vendor, they are committed to helping resolve the issue, especially if it turns out to be a bug in their product. Functionality is simply whether it works or not. Something that works but is unsupported is still unsupported, and most of the time you risk being on your own if issues arise.

  • Operating System – you’ll want to find out two things
    • whether the target platform vendor supports hosting the operating system and version your virtual machine uses
    • and from the other angle, whether your OS vendor will provide support if you run their OS on the target platform
  • Application – some application vendors care about the underlying platform. You would want to know if your application vendor cares.
    • Most of the time, application vendors are more concerned about the operating system the apps run on
    • Some application vendors will go further to qualify the underlying virtualisation platform that the virtual machines run on
    • There are a few that will go even further to qualify exact hosting solutions

These checks have to be done for every single workload, application and virtual machine. Assuming they all check out, or you have no concern about supportability, read on for other considerations.
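
The per-workload checks above lend themselves to a simple checklist. Below is a minimal sketch in Python, assuming the answers have already been gathered by hand from the relevant vendor support matrices. The `Workload` fields and the sample workload names are hypothetical and only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Supportability answers collected for one workload (illustrative fields)."""
    name: str
    platform_supports_os: bool          # target platform vendor supports the guest OS and version
    os_vendor_supports_platform: bool   # OS vendor supports running on the target platform
    app_vendor_supports_platform: bool  # application vendor qualifies the target platform


def unsupported_items(w: Workload) -> list:
    """Return a description of every supportability check that failed."""
    checks = {
        "platform vendor does not support the guest OS": w.platform_supports_os,
        "OS vendor does not support the target platform": w.os_vendor_supports_platform,
        "application vendor does not qualify the target platform": w.app_vendor_supports_platform,
    }
    return [reason for reason, ok in checks.items() if not ok]


# Hypothetical inventory entries
inventory = [
    Workload("web01", True, True, True),
    Workload("legacy-erp", True, True, False),
]

for w in inventory:
    issues = unsupported_items(w)
    status = "OK" if not issues else "; ".join(issues)
    print(f"{w.name}: {status}")
```

Note that this only records supportability. Whether the workload actually functions on the target platform is a separate question, answered by testing.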

2. Downtime availability vs Risk of Downtime

If I haven’t made it clear by now, migration inherently carries risk. The ultimate price to pay if the risks are not carefully assessed is a long, unplanned outage or, worse still, data loss.

While under some conditions there are methods to perform online migrations without shutting down virtual machines, they are not risk free. vMotion is probably one of the best known forms of migration, and for our purpose, the most common form of online migration across infrastructure is vMotion without shared storage, also known as shared-nothing vMotion or xvMotion. It is a great technology, but it does not have a 100% success rate.

If a vMotion fails, the best case is zero impact, and the process rolls back transparently. However, this is a real double-edged sword. It is possible for a VM to get stunned during migration, and for the stun to drag on for a very long time. There is no indication of how long the stun will take to complete, and simply aborting carries a risk of corruption. Large and busy virtual machines are at higher risk of such occurrences. The question then comes down to: is the application able to take that chance? If not, alternatives need to be considered.

Each workload has to be assessed for whether it can take on the risk of unplanned downtime during migration and, if the worst case scenario occurs, whether it can handle the extended outage while waiting for a restore from backup.

If the answer is no and the risk is not acceptable, the alternative is to cater for downtime for the cutover. How long the downtime lasts depends on the techniques used for the actual migration and cutover.

For most situations, a 1 hour cutover window per virtual machine is reasonable.

3. Importance for ease of roll back

If you are wondering why there is a need for roll back, I urge you to read this part carefully. One of the risks of migration is that even if the virtual machine cuts over successfully, something may end up broken.

Data may be fully intact and the application running, but for some reason there is an issue that prevents full function or performance. Most of these cases (though not 100%) can be avoided with pre-testing and validation. Not 100%, because there could be some legacy behaviour or hard coding that only the developer from 10 years ago knows about, and that developer is no longer with the company. Or the whole exercise is botched by a pre-existing issue that no one notices until the migration finishes. Yup, this happens more often than you know.

Though the need to roll back is rare, it does arise more often than we know. The key is to consider how easily, and how quickly, the roll back can be done.

Some migration techniques, again using xvMotion as an example, can take a long time to roll back. xvMotion literally moves the entire virtual machine to the destination, and the source no longer has a copy of it. It will take as long to roll back as it took to do the initial migration. So for a large multi-terabyte virtual machine that took 10 hrs to migrate across, it will take another 10 hrs to roll back. If in between it took 5 hrs to troubleshoot, that’s a good 25 hr window of risk!

Other techniques that involve creating a new copy of the virtual machine, followed by a short outage and cutover, allow a much quicker roll back. All that is needed is to power off the destination copy and power up the original source VM, which can be done well within minutes. Even if the initial cutover took 1 hr, followed by another 5 hrs of troubleshooting, once the call is made to roll back it would take just another 15 minutes.
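
To make the comparison concrete, here is the arithmetic for the two scenarios above as a small Python sketch. The function name is my own, and the clone-based figure assumes the initial copy is taken while the source stays in production, so only the cutover, troubleshooting and roll back count towards the window of risk.

```python
def risk_window_hours(migrate_h: float, troubleshoot_h: float, rollback_h: float) -> float:
    """Total time the workload is exposed, from start of migration to end of roll back."""
    return migrate_h + troubleshoot_h + rollback_h


# Move-based (e.g. xvMotion): rolling back means copying everything back.
move_based = risk_window_hours(migrate_h=10, troubleshoot_h=5, rollback_h=10)

# Clone-based: 1 hr cutover, then roll back by powering the source VM back on (15 min).
clone_based = risk_window_hours(migrate_h=1, troubleshoot_h=5, rollback_h=0.25)

print(f"move-based:  {move_based} hrs")   # 25 hrs
print(f"clone-based: {clone_based} hrs")  # 6.25 hrs
```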

4. Pre-migration testing and validation

As mentioned in the previous point, some testing prior to migration is strongly encouraged to mitigate the risk of workload failures when operating in the new environment. The more comprehensive the checks and validation, the higher the chance of success.

What sort of tests and validations are useful? Below I’ll list some common tests, but think about what your application needs are and add on to the list.

  • Physical connectivity and network failover
    • Network connectivity across all relevant uplinks
  • IP connectivity and routing
    • Network connectivity to the desired subnet
  • Network connectivity to other services
    • e.g. AD, DNS, DB, etc.
  • Network connectivity from upstream services
    • e.g. network load balancers, application servers
  • Compute Capacity
    • if there is sufficient CPU and Memory capacity to host the workload
    • and to achieve the minimum desired compute performance
  • Storage Capacity
    • if there is sufficient storage capacity to accept the workload
    • and to achieve the minimum desired storage performance
  • Network bandwidth
    • if there is sufficient free network bandwidth to support the workload
  • Data protection services
    • if the target environment can provide the desired level of backup
    • consider backup capacity and performance
    • as well as restoration performance
  • Disaster recovery protection
    • particularly if your workload has existing DR protection, the destination needs to provide the same or better RTO and RPO capabilities
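
Some of the connectivity checks in this list are easy to script. Below is a minimal sketch in Python that attempts a TCP connection to each dependent service from a test VM on the destination platform. The host names and ports in the commented usage are hypothetical, and a successful TCP connection only shows the port is reachable, not that the service behind it is healthy.

```python
import socket
from typing import Callable, Iterable, List, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def unreachable_endpoints(
    endpoints: Iterable[Endpoint],
    timeout: float = 3.0,
    connect: Callable = socket.create_connection,
) -> List[Endpoint]:
    """Try a TCP connection to each endpoint and return the ones that fail.

    `connect` is injectable so the function can be exercised without a network.
    """
    failed = []
    for host, port in endpoints:
        try:
            conn = connect((host, port), timeout=timeout)
            conn.close()
        except OSError:  # covers refused connections, timeouts and DNS failures
            failed.append((host, port))
    return failed


# Example usage with hypothetical dependencies (AD, DNS, DB):
# failed = unreachable_endpoints([
#     ("dc01.example.internal", 389),
#     ("dns01.example.internal", 53),
#     ("db01.example.internal", 1433),
# ])
# print("unreachable:", failed)
```

Run it from the destination subnet, and ideally once per uplink during failover testing, so that a missing route or firewall rule shows up before the cutover rather than after.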

It is also possible to perform a test migration, where the workloads are migrated, but into an isolated network. This allows a broader range of tests, while keeping production running. Of course, this will have to be based on some sort of clone based migration, where a copy of the VM is created, rather than a move of the VM.


5. Establish a pre-migration baseline

The last thing we want to do in a migration is to declare a roll back. It would be a pity to roll back due to pre-existing issues that were not detected. The least that should be done is to note down any known errors or issues with the workload. If the workload is deemed healthy enough to be migrated, there is then at least clarity on which errors observed post migration were pre-existing and not due to the migration.

A few examples to look at include Windows Event Logs and relevant application logs. Take note (or better, a screen capture) of any known errors, and document them. They will be used for comparison post migration.
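
If the known errors are recorded in a structured form rather than as screenshots only, the post migration comparison becomes a simple set operation. Here is a minimal sketch in Python, assuming each error is reduced to a (log source, event ID) pair. That representation and the sample entries are my own illustration, not a prescribed format.

```python
from typing import Dict, Set, Tuple

ErrorSignature = Tuple[str, int]  # (log source, event ID), an illustrative choice


def classify_errors(
    baseline: Set[ErrorSignature],
    post_migration: Set[ErrorSignature],
) -> Dict[str, Set[ErrorSignature]]:
    """Split errors seen after migration into pre-existing ones and new ones."""
    return {
        "pre_existing": post_migration & baseline,
        "new": post_migration - baseline,
    }


# Hypothetical signatures captured before and after the cutover
before = {("System", 7031), ("Application", 1000)}
after = {("System", 7031), ("Application", 1026)}

result = classify_errors(before, after)
print("pre-existing:", result["pre_existing"])  # {('System', 7031)}
print("new:", result["new"])                    # {('Application', 1026)}
```

Only the entries under "new" need to be investigated as possible side effects of the migration.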

6. Define post validation checks

So far we have looked at several pre-migration activities. What is equally, if not more, important are the post migration validation checks. These checks are crucial to declare that a migration has completed successfully, and that the applications are fully functional and performing well.

The more critical the application is to the business, the more comprehensive the checks will need to be.

While most will be able to draft a list of function checks, what is commonly missed are performance checks. Remember, once a migration is declared successful, realising any issue a few days later can be quite a disaster and very hard to roll back.

Hence, it is important to note down as much of a performance baseline as possible prior to migration, to use for comparison post migration. Granted, not all performance validations are possible, but gather as much information as you can, especially in case there is a need for troubleshooting or tuning later on. It helps to focus the effort and set a reasonable target if something is not performing up to expectations.
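
As an illustration of how such a baseline can be used, the Python sketch below flags metrics that are worse than the baseline by more than a chosen tolerance. The metric names, the sample values and the 10% tolerance are hypothetical, and the sketch assumes lower is better for every metric (latencies, job durations); throughput-style metrics would need the comparison reversed.

```python
from typing import Dict, Tuple


def regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.10,
) -> Dict[str, Tuple[float, float]]:
    """Return metrics that are worse than baseline by more than `tolerance`.

    Assumes lower values are better for every metric.
    """
    worse = {}
    for metric, before in baseline.items():
        after = current.get(metric)
        if after is None:
            continue  # metric was not collected post migration
        if after > before * (1 + tolerance):
            worse[metric] = (before, after)
    return worse


# Hypothetical measurements taken before and after the cutover
baseline = {"db_query_ms": 40.0, "batch_job_min": 30.0}
current = {"db_query_ms": 42.0, "batch_job_min": 45.0}

print(regressions(baseline, current))  # {'batch_job_min': (30.0, 45.0)}
```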

7. Identify key resources

Migration can be a very large activity involving the whole village, especially for large enterprises with multiple large teams. Knowing the overall strategy and identifying key tasks across teams is critical. Having a strong program management team is essential to drive progress and success.

Identifying key technical resources for an application to run well is absolutely important. For example, the application owners, end users for testing, system administrators, backup administrators, etc.

Equally important, make sure everyone knows their role and what is expected of them.

Wrapping up…

As you can see, to be successful with workload migrations is no small feat, especially as the scale increases. Lots of planning, coordination and testing are part and parcel of successful migrations.

If you or your team does not have enough experience or confidence to pull off any migration, never hesitate to engage some assistance from professional service teams. The Nutanix Xpert Services team are available to provide assistance for any migration to and from Nutanix.

In future posts, I will write about different migration scenarios, and the options to choose from. For example, if you are migrating from VMware vSphere with 3-tier architecture to VMware vSphere with Nutanix AOS, or to Nutanix AHV.

Other advanced topics, including migrating clustered virtual machines, are also in the pipeline. Thanks for reading.
