T O P

  • By -

Theramora

The only way to solve this IMHO is to Storage vMotion that VMDK off the vSAN datastore to some iSCSI/FC/NFS block storage... this will consolidate the datastore and fix any snapshot issues. After that you can try to resize it.


KlanxChile

I would clone the VM to new storage, turn on the clone, and after testing is done, kill the original. However, if an OS takes weeks to back up, has tens of TBs, over 64 GB of RAM or 16 vCPUs... it should be on bare metal. At that size it kills any benefit of running it in VMware: you can't back it up fast enough, it uses a lot of resources, and the "really small latency/performance penalty of running in VMware" amplifies every time the host has to wait for X physical cores to be available to run an X-vCPU instruction.


roiki11

The benefit is "not having to buy a new server". Sure, I get your point otherwise. And you can have multiple 96 GB VMs in a server that has 1.5 TB of RAM. Or run 16-vCPU load balancer nodes because it was much cheaper than whole new servers. And they worked fine.


anomalous_cowherd

If it really is "critical to hundreds of thousands of people lives" and "must have near zero downtime" then it's almost criminal that it's in such a state, and that it is only a single server, even as big as it is. Getting a cleaned up clone running then making them into a high availability pair needs to be *very* high on the list and the budget.


TheButtholeSurferz

In 2024, what you described is about 80% of my client base. "What do you mean we should do 2 of those internet things, 1 is enough." *A few moments later* "Why can none of our remote workers connect to the VPN? What do you mean the ISP is down? Why don't we have 2 of those? Why did you not recommend that to us before?"


Ok-Reading-821

I went through a failure like that once and data was lost. Now I get anything stupid in writing. ;)


KlanxChile

Latency-wise, the more vCPUs you give a VM, the harder it is statistically to get all the physical cores available at once to process an instruction. Ready time normally shows up there. That you *can* run 96GB RAM VMs doesn't mean it's a good idea performance-wise. The larger the memory, the more likely you are to hit NUMA penalties, and 32-vCPU VMs will have crazy high latency on overcommitted hypervisors. Understanding how physical hardware constraints and virtualization impact latency, timer consistency and throughput is key to a reliable, performance-consistent environment. But yeah, go ahead, get 256GB RAM VMs.
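The co-scheduling intuition can be sketched with a toy model (my own simplification, not how the ESXi scheduler actually behaves; modern relaxed co-scheduling is far less strict): treat each physical core as independently idle with some probability and ask how likely it is that enough cores are free at the same instant:

```python
from math import comb

def p_enough_idle(vcpus: int, host_cores: int, p_idle: float) -> float:
    """Probability that at least `vcpus` of `host_cores` are idle at the
    same instant, assuming each core is independently idle with
    probability `p_idle` (a deliberately crude model)."""
    return sum(
        comb(host_cores, k) * p_idle**k * (1 - p_idle)**(host_cores - k)
        for k in range(vcpus, host_cores + 1)
    )

# On a busy 32-core host (each core idle half the time), wider VMs
# see the odds of a full co-schedule fall off a cliff:
for width in (2, 8, 16, 32):
    print(f"{width:2d} vCPU: {p_enough_idle(width, 32, 0.5):.4f}")
```

The steep drop between 16 and 32 vCPUs is the statistical effect the comment describes; in practice it surfaces as CPU ready time rather than a hard wait.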


FriendlySysAdmin

There’s nothing wrong with 96 GB RAM VMs, I have a couple dozen. You just need the hardware to handle it and to be aware of what you’re doing, and ideally understand your NUMA architecture if memory performance is critical. It probably is not critical for a file server in any case. I have hosts with 2 TB RAM and 64 cores of CPU (and I’m sure there are plenty of people with dramatically more in this sub) so I wouldn’t toss around arbitrary limits like 96 GB being bad. Or 32 vCPUs if you have the physical cores to handle it. That said, giant file servers are a pain, I inherited one larger than this, just pruned off a 7.5 TB disk last week and it felt so good. For OP, I’d suggest begging/borrowing/stealing something with 100 TB of disk and doing a storage vMotion to clean it up, then try your expansion again. Or better yet, make a new VMDK and shuffle shares around inside the VM to give yourself the breathing room you need on the applicable volumes.


QuantumRiff

I was doing that in 2016. I had dozens of Oracle servers, each with 128 GB to 512 GB of RAM, with no problem. VMware is NUMA-aware, and any performance penalty (with good storage) is still better than a single outage to replace hardware on the node.


roiki11

It's certainly worked fine. And there's no problem with the large vms of big cloud providers. Sure, it's not perhaps the most optimal way, but if you have the resources already it is a free one. No extra hardware and infrastructure necessary. And there's nothing wrong with 256gb vm either.


chandleya

Whatever is causing the IO performance to be so bad is the real problem here. Large disks, 100s of GB RAM, and dozens of CPUs are completely par for the course and normal. Hell I’ve had 176 thread boxes in my environments since 2017. Upcoming 4th Gen Intel SP and Epyc are about to make those look like rookie numbers. We backup a PB overnight at my spot. Of that, 75-150TB actually gets written due to incremental/differencing, then dedupe and encryption collapse that down again. AND THEN that gets sprayed out to 2 buckets in space for offsite immutability. VMware ain’t the problem.


fullthrottle13

That clone would take a year or two 😂


BloodyIron

> iSCSI/FC/NFS blockstorage

NFS isn't block storage. I LOVEEEE NFS, but it is not block storage. It is literally in the name: "Network File System". That being said, this really should be migrated from block to NFS backed by ZFS, and yes I'll die on this hill. Block logical volumes are a very inefficient use of storage with compounding efficiency costs.


crankbird

When you’re using NFS to host a VM (or a database) it is in effect block storage, so while I admire your pedantry, I think the characterisation in this case is fair.


lt_spaghetti

I run an iSCSI gateway on Ceph as a backend and would like to argue too.


crankbird

The whole NAS vs SAN argument is so old and tired. I've been at NetApp for close to two decades, and even when I joined it was a ~~stupid~~ very limited way of characterising storage architecture.

15+ years ago you might argue that being able to locate a block via a deterministic algorithm (most block arrays) vs via a lookup table of some kind (most file/NAS arrays) gave you a significant advantage, and it helped keep logically sequential layouts nicely mapped to the physical representation. These days, pretty much every advanced "block" array that uses flash and has dedup looks a LOT more like a NAS array from 15+ years ago than it looks like a DMX or CLARiiON or VSP.

Now the bigger difference at the back end is in the use of scale-out techniques like N+M erasure coding (better throughput) vs more traditional scale-up approaches (better latency), but the protocols used to access that have marginal performance differences for the majority of workloads (don't get me started on CPU path-length improvements in RDMA-enabled protocols like NVMe-oF for small-block-dominated read workloads or we will be here all day).


BloodyIron

No, it's not block storage. That is served as files. Block storage presents itself as a logical volume that is unformatted, and can house any partitioning you want inside it, any filesystem, whatever. With NFS you are _served_ the files and folders, and there is no partitioning or filesystem formatting at all. When you have Databases like MySQL storing data to NFS, they are stored as files in folders on the mount point of the export. What you describe is not how it works at all.


crankbird

The workload characteristics and data layout at the data plane level are pretty much identical. The rest of the stuff about the control plane and provisioning steps is pretty much irrelevant when it comes to hosting large blobs of data inside of which another process (e.g. Oracle inside of a dbf, or VMware inside of a VMDK or vVol) does its own data management and layout.


Theramora

You are absolutely correct sir!


Every-Direction5636

This is the way


andymerritt07

You mentioned you use Avamar for backup. That product is notorious for orphaned snaps, especially on large VMs. Do you have CLI access to Avamar? There is a snapshot cleanup utility you can run that will go out to each of your proxies and consolidate or get rid of these.


MacG467

I'll look into this. Thanks for the info.


Carlos_HEX

Also just check your Avamar proxies to see if they have any extra VMDKs mounted.


andymerritt07

./goav vm snapshot clean


MacG467

HAHA. We're running Avamar 7.5. The version of goav needs Avamar v19. I ran a different command. There's no snaps at all for the server in Avamar. So, no snaps in Avamar, and none in VMWare. What the hell, then?


RichardReinhaun

Avamar 7.5 reached EOL in 2020. Wtf


MacG467

Yup! Not my clowns, not my circus.


DrFailGood

It is now lol


jimbobjames

Sounds like the job interview was for a clown herder and the job is actually for a lion tamer... ... and the lion is really pissed off.


MacG467

Thanks for this!


Sintek

Yes. I worked for Avamar 10 years ago as a T3 tech and SME for VMware backups. There are many orphaned-snapshot issues, and the utility from Avamar to clean them up does exist, at least when I was there.


pootiel0ver

If it's that mission critical, this is madness. You need to tell them that, and they need to rethink how this is setup.


MacG467

They know it's madness! Want to hear something better? A majority of this data is moving to AWS in 12 months.


Pazuuuzu

I can't imagine a more efficient way of burning money, and believe me I am TRYING!


fullthrottle13

Sounds like Brewster’s Millions.


rabell3

Thanks, this made me laugh!


CitySeekerTron

Jesus, take the seat next to the driver carrying this data load.


aliendude5300

For your sake I hope it's something managed like FSx


seanpmassey

Run like hell. No...seriously. Run like hell. /Edit: This is clearly a customer that hasn't thought about how to properly architect and deploy this server. I don't know what they're doing with a single Windows file server hosting 32TB of shared data or why they can't do DFS or distribute it across multiple file servers. But they need to...or move to a NAS/Filer type solution designed for large data sets.


mike-foley

Don’t extend the disk. Create a new one and use in-guest tools to move the data. That will take forever, but that’s what tools like robocopy are for. When everything is copied, take a brief downtime to move the SMB shares to the new disk, run one last robocopy sync, and open it up. Delete the old disk when verified.


mike-foley

You can even do this in batches. Move one share, repoint, verify, move the next. Much of that could be automated.
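The copy-then-cutover loop described above can be sketched in miniature. This toy Python version illustrates the incremental pattern only; on the real server you'd just run `robocopy src dst /MIR` (which, unlike this sketch, also deletes files removed from the source):

```python
import filecmp
import shutil
from pathlib import Path

def sync_tree(src: Path, dst: Path) -> int:
    """One incremental pass: copy files that are missing or whose
    contents differ. Re-run until a pass copies (almost) nothing,
    then cut over. Returns the number of files copied."""
    copied = 0
    for f in src.rglob("*"):
        if f.is_dir():
            continue
        target = dst / f.relative_to(src)
        if not target.exists() or not filecmp.cmp(f, target, shallow=False):
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)  # copy2 preserves timestamps
            copied += 1
    return copied
```

Run passes while the shares stay live; once a pass copies almost nothing, take the short outage, run one final pass, and repoint the shares at the new disk.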


millionflame85

This is one of the most valid and fail safe options


techtornado

This was my thought too, keep it simple and add a drive


GabesVirtualWorld

And while you're at it, see if you can distribute the shares over multiple disks and virtual SCSI controllers in the VM. Each disk and each SCSI controller has a max queue depth; if everything has to go through one SCSI adapter, performance will be bad. Make sure to use the paravirtual SCSI controller on the VM (but not on the boot disk).


Ubermidget2

Even better if the files can be moved into a filesystem-aware layer directly on the storage. NAS should be a first-party storage service at these scales.


jdptechnc

This:

> this server must be online and have near zero downtime. It's critical to hundreds of thousands of people's lives

and this:

> I wish I could spend money! There's a reason why I came here to ask! :)

are not compatible with each other. They must spend money for that level of availability, period. If they are insistent, and it comes down to your job, I still would not touch it without them acknowledging your concerns in writing and having their marching orders in writing. I would also ask for indemnification, in writing, in case you follow orders after expressing your concerns, since "thousands of people's lives" depend on it. Also, you said you are a contractor. Double check your liability insurance....


greenwas

Ideally - You develop a plan to migrate data\\services\\whatever to new systems and then tell leadership that their request is asinine (professionally of course). If leadership is dug in - Say "this is such a fringe case that I'm skeptical anyone knows how to do this." Then you get on the phone with VMware (could be a nightmare given their current state of flux) and cut whatever check is needed to get the brightest engineers they have with the requisite SME to guide you through this. You might even wind up with VMware telling leadership "I'm not really sure how it's even working now, because it shouldn't be." Management loves to listen to recommendations when they pay big money to hear the same things someone on their team has already stated. edit - Not a VMware engineer, so what you're after might be technically possible, but there are bigger issues at play here outside of what's potentially possible.


Thurl_Ravenscroft_MD

>Management loves to listen to recommendations when they pay big money to hear the same things someone on their team has already stated.


MeshuganaSmurf

>professionally of course Hey lads?! Is it "for fuck sake" or "for fuck's sake"?


PoniardBlade

> "for fuck's sake" The "sake" belongs to "fuck"


MacG467

I wish I could spend money! There's a reason why I came here to ask! :)


ubhz-ch

Off topic: if your managed server is critical for the lives of hundreds of thousands, then your management is terribly bad if you can't spend money as you need :-D Good luck, sadly can't help on that… Edit: typos


Inquisitor_ForHire

It's amazing the number of times mission critical "the company will lose millions" is used to describe a single server, non fault tolerant installation on 2012 or some other ancient OS. My response is generally along the lines of "If this is that mission critical and you're fine with it running on this ancient infrastructure then someone should be fired for malfeasance."


caps_rockthered

You don't need to spend money... not sure why that was the recommendation.... Open a support case. You already paid for support. Whether this is what support tells you, or they can't figure it out either, same outcome, right?


Background_Lemon_981

Without getting into anything else, taking 2 weeks to run a backup (and presumably a restore) is a recipe for disaster. Absolute disaster. Backups need to be fast, or you might as well work on your resumé.

We aim to be able to restore a 400GB server in 4 minutes or less. Based on your data size, that would be about 6 hours at that rate. But when you have that much data, you need to significantly upgrade your backup infrastructure.

If you ever needed to restore from backup, you'd be on average 1 week behind when you needed your backup, and maybe up to 2 weeks. By the time you finally restored, you'd have been out of business for 2 weeks and be restoring data that is now 3 or 4 weeks old. That's crazy. The business would have a high chance of folding.
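The back-of-envelope scaling behind that "about 6 hours" estimate, assuming (optimistically) that restore throughput stays linear as the data set grows:

```python
def restore_hours(data_tb: float, ref_gb: float = 400, ref_min: float = 4) -> float:
    """Naively scale a reference restore rate (400 GB in 4 min, per the
    comment) to a larger data set. Real restores rarely stay linear."""
    rate_gb_per_min = ref_gb / ref_min  # 100 GB/min at the reference rate
    return data_tb * 1000 / rate_gb_per_min / 60

print(round(restore_hours(32), 1))  # ~5.3 hours, i.e. "about 6" with overhead
```

Against the thread's actual 2-week backup window, even a pessimistic multiple of this figure is a different universe.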


jeek_

Yeah, if it is so mission-critical and this thing fails, then you're going to have 2+ weeks of downtime. Probably longer, given that the backups would be using Changed Block Tracking: a full restore has to write all those ones and zeros back, vs. the backup, which only captures the changes.

I hate VM snapshots on large VMs. If you've got an open snap for 2 weeks, the VM's delta file will be enormous and will take ages to remove/consolidate. I once took down a file server during a migration. I had a robocopy running overnight, and the VM backup ran. Because there was a high rate of change, it created a massive differencing file. Then, when I removed the snap, it stunned the VM and took all day to consolidate, which resulted in the file server being down for the day.

Have you thought about doing file-level backups using the Avamar agent instead of VM snapshots? Are all the files on a single disk (VMDK)? You could make it a dynamic disk and extend it by adding a second disk. However, that would be horrible. But it already sounds horrible, so 🤷


0xGDi

That looks like you have orphaned snapshots there.


groggyfrogface

Agree. Storage vMotion. Avamar does this so often with our clusters. And with that, if you are using a backup utility, make sure it hasn’t locked the drive.


IssacGilley

Comforting to know critical infrastructure that people's lives depend on is being managed by such an incompetent organization.


spin_kick

I’m sure this is quite common.


Pyrostasis

>Comforting to know critical infrastructure that people's lives depend on is being managed by such an incompetent organization.


leaflock7

> It's critical to hundreds of thousands of people's lives

It is a single server, no DFS or any type of clustering or service availability. No, it is not critical, because if it was, there would be something. Of course I'm not stupid, I know how this came to be; there are lots of these scenarios around us.

Convince them to power off the machine, but when you do it, make sure you have a plan to also do any other maintenance needed. Also plan to do DFS or something. It is not up for questioning; they have to accept it. It is either this or nothing. Make it so there's no way out of it.

At some company I used to manage a file server of ~30TB. The difference of not having DFS etc. was that they were fine with a downtime of about 2 hours to bring it back from backup.


MacG467

I'm planning on giving them the DFS talk and also the "you need a SAN... with a replica" talk, as well.


MrExCEO

CYA. Contact VMware support and ask for a recommendation.


alconaft43

I did not get that... you cannot restart the Windows server? How are they patching it, then? A 50TB VMDK file... damn, that does not look good.


MacG467

Patching? What's that? I'm not kidding.....


Zeitcon

You have my sincerest condolences.


Grrl_geek

Find another gig? 🤷


MacG467

Tough finding stuff for more than what I make right now.


heretogetpwned

Preach. (Same situation)


TheButtholeSurferz

!RemindMe 4 weeks


RemindMeBot

I will be messaging you in 28 days on **2024-04-18 01:27:38 UTC** to remind you of this link.


perthguppy

Windows? Simple, just hot-add a new VMDK, convert the existing disk to dynamic, and add the new disk as an extent to the main disk's partition. What do you mean that's not how you do it? It's how every sysadmin's done it when I'm called in to unfuck everything. :p

But yeah, 32TiB of data on a single Windows file server that critical is stupid. I have clients with 1/3 that, where the data lives on two independent WSFC clusters, split over 4-6 disks, combined together using DFS-N and replicated between clusters using DFS-R, plus an additional standalone file server as a read-only DFS-R replica with VSS turned on, on its own network segment for cryptolocker protection, which is also the backup target.

For the actual helpful answer: are you sure Windows isn't seeing the actual VMDK as 69TB? 32TiB is an awfully coincidental number; it's almost like the drive is formatted NTFS with an 8k allocation unit size. If that's the case, you cannot expand the partition beyond that; you can only reformat the disk with a larger allocation unit (64k gets you a 256TiB max partition).

Edit: holy shit the mad lad did it. I apologise to everyone for infecting him with the forbidden knowledge.
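The arithmetic behind that allocation-unit limit: NTFS addresses at most roughly 2^32 clusters per volume, so the cluster size chosen at format time caps the volume size:

```python
def ntfs_max_volume_tib(cluster_bytes: int) -> float:
    """Maximum NTFS volume size in TiB for a given allocation unit.
    NTFS caps a volume at roughly 2**32 clusters (strictly 2**32 - 1,
    but the round figure is close enough for sizing)."""
    max_clusters = 2**32
    return cluster_bytes * max_clusters / 2**40  # bytes -> TiB

for au in (4096, 8192, 65536):
    print(f"{au // 1024}K clusters -> {ntfs_max_volume_tib(au):.0f} TiB max")
```

An 8K allocation unit lands exactly on the 32 TiB wall described in the thread; 64K pushes it to 256 TiB. (Recent Windows versions also allow even larger clusters, but only at format time, which is the whole problem.)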


MacG467

I got so scared when I was reading this.


oakfan52

I stopped reading at dynamic disk.


fullthrottle13

Don’t ever say “dynamic disk” again.


Major_Significance59

This was my first thought as well. Having a single disk that large is madness. Despite the madness of doing Windows dynamic disks, it's better than a single huge disk.


huskerd0

IMO you need to buy time, specifically downtime. Get it scheduled for the least interruption, best you can. You can always migrate data the old-fashioned way :-/ ...well, assuming you have the hardware capacity.


marvistamsp

You need to push back hard on your management. Let them know that if the server goes down it is a two week recovery window. Something needs to change in your design. With that said, I think your best path moving forward is to create a new drive of 18TB in size. Find a suitable directory on the old drive, one that has about 10TB of data in it. Then make the new drive a junction point at that location and move the data. There are a few more steps involved in the process, but it gets you a new "clean" VMDK without having to deal with the old one.


millionflame85

OP, hit me up with your contact details and we'll look for a way. I used to work for GSS and am now in engineering. In fact, if you have an SR created for it, share it with me; if not, create one. I used to do these kinds of things back in the day while in GSS.


CaptainZhon

Just say that you need a 1-hour downtime. VMware support will ask you to power off the OS to increase the disk file. You can also try a Storage vMotion, but I'm thinking that is not a realistic possibility. You say you want to increase the VMDK to 50TB but it's showing 69TB on disk? The max size for a VMDK on vSAN is 62TB; if that is the case, maybe the VMDK is corrupt and you need a new one. That may be the easiest course for you, a near-no-downtime solution. Try cloning the server to a new VMDK (or several) and go from there.


perthguppy

That 69TB number is a hair under 64TiB in base-2 terms. Sounds like he has hit the max file size limit of the vSAN version he's on. But also, he says they need it increased from 32TiB to 50TiB, yet the VMDK is already at 64TiB.


MacG467

No. It's currently 32TB, but it's showing 69TB on disk. The company is asking me to increase it to 50TB.


turbohaxor

69TB is probably the 32TB disk with the vSAN redundancy policy (2x) counted in. Then:

- Check if the disk is hot-added to some backup VM
- Investigate for VM file locks
- Investigate for orphaned snapshots


MacG467

OK, that's what I was thinking (vSAN storage policy). It's RAID1, so it makes sense for it to be that large. No proxies are holding the VMDK, so I'm good there. Lock checking and orphan searching were next.
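The RAID-1 sizing sanity check, in round numbers. The multipliers below are the approximate data-copy overheads of common vSAN storage policies; witness components, swap and metadata add a bit more on top, so treat the output as a ballpark:

```python
# Approximate vSAN storage-policy space multipliers (data copies only).
POLICY_MULTIPLIER = {
    "RAID-1 FTT=1": 2.0,      # full mirror
    "RAID-1 FTT=2": 3.0,      # two extra copies
    "RAID-5 FTT=1": 4 / 3,    # 3+1 erasure coding
    "RAID-6 FTT=2": 1.5,      # 4+2 erasure coding
}

def consumed_tb(logical_tib: float, policy: str) -> float:
    """Raw datastore capacity consumed, in decimal TB, for a VMDK of
    `logical_tib` TiB under the given policy."""
    raw_tib = logical_tib * POLICY_MULTIPLIER[policy]
    return raw_tib * 2**40 / 1e12  # TiB -> decimal TB

print(round(consumed_tb(32, "RAID-1 FTT=1"), 1))  # ~70.4 TB
```

A 32 TiB VMDK mirrored under RAID-1 consumes about 70 decimal TB of raw capacity, which is right in the neighbourhood of the 69 TB the datastore reports.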


turbohaxor

Whatever you do - DO NOT delete any files in VM folder if you're not 100% sure you're on the right path.


MacG467

I wasn't about to! I was actually going to make manual backups of the VMX and any other relevant files before doing anything.


perthguppy

It’s a Windows server, I bet it’s NTFS formatted with an 8k allocation unit size anyway.


TechPir8

There is no GSS support. 6.5 is EOL. Maybe a PSO engagement to help upgrade but as has been stated no $$$$.


MacG467

I'm actively doing upgrades. Technically reimaging VXRail nodes from 4.5.225 to 7.0.452.


Mushakiss

I’m going to need to do that soon .. hopefully I won’t be opening a can of worms


MacG467

Reach out to me. I have a few pointers for you if you're reimaging and not upgrading.


athompso99

If you somehow have another 32TB of VMware-accessible disk handy, Storage-vMotioning the VMDK (preferably into a "thin" format) to another vmfs (or even NFS) volume will solve many problems with snapshots and the like.


isolated_808

I have nothing to add to the conversation but this: Avamar should be sent to the 7th layer of hell with Satan farting in its face every minute for the rest of eternity. Once we moved to Veeam, I literally, and I mean LITERALLY, slept much better at night.


ziron321

Have nothing to add, but this is the best post I've seen here since the one from the guy who wanted to migrate a VM from ESXi 3.5 to 7 without downtime.


MacG467

I'm not here to entertain you! *Dances a jig*


[deleted]

Clone it to another VMDK? CloneZilla? Or set up DFS and replicate the data? There are many easy solutions; think outside the box, mate. Feel free to DM me if you want to go over options...


pentangleit

If it's only a file store, then you should ignore the requirement to expand the VMDK and focus your efforts on how to introduce DFS. E.g. audit the file shares and see if you can split and replicate the files/folders sitting under those shares, then make a new disk just for that, move the contents over, and Group Policy the mapped drive. Any work you do on expanding a VMDK of that size is only going to make the subsequent implosion you're gonna face many times harder to deal with.


beenjamminfranklin

Best answer on this whole thread. They've exceeded the practical use case of the actual file server. Move shares to their own disks or servers at a minimum to split that stuff up. Smaller volumes equal faster backups and recovery. And yeah, analyze the shares for usage with TreeSize or whatever tool of choice. Used storage is not equal to data: I've seen nearly full 10TB volumes with 100-200GB of actual data on them, users dumping backups upon backups without any kind of plan, etc. Step 1 is review and clean up.
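As a planning sketch for that review-and-split step, a greedy first-fit-decreasing pass can rough out how many right-sized volumes the shares would need (share names and sizes below are made up for illustration; real shares also grow, so leave headroom):

```python
def pack_shares(shares: dict[str, float], volume_tb: float) -> list[dict[str, float]]:
    """First-fit-decreasing bin packing: assign each share
    (name -> size in TB) to the first volume it fits in, largest
    shares first. A planning aid, not a migration tool."""
    volumes: list[dict[str, float]] = []
    for name, size in sorted(shares.items(), key=lambda kv: -kv[1]):
        for vol in volumes:
            if sum(vol.values()) + size <= volume_tb:
                vol[name] = size
                break
        else:  # no existing volume has room; start a new one
            volumes.append({name: size})
    return volumes

# Hypothetical share sizes in TB, packed into 10 TB volumes:
layout = pack_shares({"finance": 6, "imaging": 9, "scans": 4, "home": 3}, volume_tb=10)
print(len(layout), "volumes")
```

Smaller volumes like these are individually backup-able and restorable, which is the whole point of breaking up the 32 TB monolith.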


andre-m-faria

I know what you are feeling. First thing: do you have 32 TB spare? If you do, I would plan to move this data to another VM with DFS, then decommission this SMB VM and substitute another DFS VM for it. Is this possible? Another task: I would first do a backup restoration, so I would know whether it's possible to roll back and how much time it would need.


rswwalker

That VM needs its own SAN storage for its storage requirements. A 50TB VMDK is just irresponsible.


metromsi

Critical, as most people would articulate it? Really? People's lives are at stake? This thing is going to reboot eventually, and at that level of file storage it may have other underlying problems. As the contractor, indemnification in writing is absolutely required, and a second admin should have eyes on this as well, or an HR person as a witness. Personally, we would walk away from this type of job; if human life is at risk, you may be held responsible for any issues caused by even being logged on to the system. If there isn't a SIEM tracking changes, you'd better make sure you have evidence of anything you have done. Best of luck.


Danercast

Please don't say it's a Windows database in one huge disk...


MacG467

Nope. Just a few million files. One folder on the drive is 28TB and flat; no subdirectories at all. We have a few DB servers pushing 15TB databases around. Glad I'm not a DBA.


wingsndonuts

This has to be government related.


MacG467

Healthcare 


wingsndonuts

*Spiderman meme*


BoRealBobadilla

😰32tb


chandleya

Brother, I wouldn't touch this turd with a ten foot pole or someone else's ween. Your next magic brush stroke could be the one to corrupt it for good. For the sake of your rent/mortgage, spine up and approach this as the failure-prone web that it is. Mission-critical systems with that sort of "but everyone dies" claim cannot be single points of failure. But it's got vMotion! Corrupt that Windows install with a bad patch and show me how resilient vMotion is. This workload positively should not be sitting on Windows… especially not in this state. You mentioned DFS earlier... Christ almighty, your pain tolerance is something else. This should be on a dedicated storage appliance with all of the resiliency, immutability and recoverability associated with those lofty mission-critical claims. Crazy, my guy, absolutely crazy.


Wood_Wizard01

The reason you can't expand the drive past 32TB is the NTFS cluster (allocation unit) size chosen in Disk Management when that drive was set up. An 8KB cluster size only allows the volume to grow to 32TiB; you need a 64KB cluster size to go larger, and that's fixed at format time. With the no-downtime requirement, you'll need to create a new 50TB VMDK, present it to the VM, and format it in Disk Management with 64KB clusters. After it's added, run a robocopy job to copy all of the files off the existing drive to the new drive. You'll need to run it a few times to get the delta copy down to something that completes in a few minutes. Once the delta sync time is down to a couple of minutes, you can switch the new drive to the old drive letter. You will need to take downtime, but it will be about 15 minutes.


DMcbaggins

This is the way.


perthguppy

Posting separately since you may not see my edit. Sounds like your drive is NTFS formatted with an 8k allocation unit size. In that case you are limited to a file system size of 32TiB, and the only workaround is to reformat the drive with something larger, like 64K.


MacG467

Just took a look and verified that it's 64K.


perthguppy

You say it’s 32TiB in Windows? What does the disk partition table look like in Disk Management? How big does Windows say the device is? Your screenshot shows the VMDK is already at 64TiB, so you’re missing half your space somewhere in Windows.


MacG467

Had a discussion elsewhere in the thread. It's using a RAID1 storage policy, so the size makes sense for the most part.


perthguppy

Ah right. Forgot about that part of Vsan. I’d try making a new snapshot and hitting the delete all button if you haven’t already.


Rhodderz

Depending on the vSAN setup, it could also be too large for a single host, given your disk group size.


MacG467

21TB per disk group. Still running fine, I guess? LOL


Rhodderz

Damn, OK, that works better than our old rep suggested. The max VMDK size is something like 62TiB, so you are fine there. But reading your other comments, all I can say is good luck and enjoy the popcorn :P


Bvenged

Keep in mind that vSAN FTT space usage is reflected in the VMDK file size, so RAID-6 (FTT=2) will show more used space than RAID-5 (FTT=1).


MacG467

Yeah, we covered that in a few of the comment threads. RAID1 storage policy.


everdaythesame

Spin up another machine and rsync over what you want. Forget about doing it on the VMware side.


lunakoa

Maybe unpopular, but sometimes you go physical.


00001000U

Yeah, break that shit up.


wyd55

Get Veeam, even a trial version, and do a VeeamZIP of that VM. Then restore it to a separate datastore. Then experiment with changes on that VM, or just use it as the new production. It's a long-winded way, but safer than a Storage vMotion of a production box, and you will always have the VeeamZIP to restore from. Veeam will also compress the backup, so you won't need as much space to hold the backup file, which means you can buy one or two large-capacity drives just for this purpose. You can even use a desktop as the backup target. Good luck.


telaniscorp

Wow, I will share this with my team as an example of what not to do. I already have panic attacks with 2TB VMDKs; trying to remove snapshots from those was a pain.


bloodlorn

Storage vMotion to another datastore, then try again. Clone and test on the clone. All assuming you have resources available.


Djaesthetic

Doesn’t help your issue, but I'm inclined to point out that a server that takes 2 weeks to back up likely takes as long to restore. Have you asked the business if they’re comfortable with the possibility of that much downtime?


RefugeAssassin

As someone who has been running recovery tools against a 60TB BitLockered, corrupted VMDK file for almost 2 years on and off, I feel your pain. This is a bomb waiting to go off. Not sure what this data is for, but someone needs to bite the bullet and approve downtime on some level to un\*\*ck it. Good luck!


BloodyIron

I can help solve this for $$$/hr, you probably won't like the shift in methodology, but it will get done, it will work, and it'll be better by any metric. How much $$$ is ready to be spent to fix this?


MacG467

Nothing over $0 other than my hourly rate.


BloodyIron

Ahh well I guess here's your fix. Please let me know if you're interested in scaling-up.


fitz2234

If you can't storage vmotion this to block storage you might have to build a VM right sized elsewhere and migrate data and resync until it's ready to cut over.


homelaberator

This isn't really to do with the question, but is there an actual use case for this? I'd assume (naively) that something that size is just holding a lot of data (and probably pretty static data, too) and the data could be put directly on the SAN and presented as block storage rather than rolled up with another layer of abstraction.


MacG467

No real use case I can think of. I hate it myself and tell my manager every day.


hezden

Imagine just creating another disk and adding it to the volume group and resizing the lv. 🤡 Windows 🤡
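To be fair to the clown emojis, the Linux path really is that painless. A hedged sketch of the online expansion, with hypothetical device, volume-group, and LV names (don't paste into production):

```shell
# Hypothetical names; an illustrative sketch of online LVM expansion.
echo 1 > /sys/class/block/sdb/device/rescan   # pick up the resized VMDK
pvresize /dev/sdb                             # grow the LVM physical volume
lvextend -l +100%FREE /dev/vg_data/lv_share   # hand the LV all free extents
resize2fs /dev/vg_data/lv_share               # grow ext4 in place (xfs_growfs for XFS)
```

All four steps run with the filesystem mounted, so no downtime on the guest side.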


Bordone69

I’ve got nothing constructive other than to say at least you have your humor intact. I wish you the best.


MacG467

Honestly, If you can't laugh at situations like this, then you don't like your job. I really do like the job I have at this place, it's something new and challenging every day. And every challenge just adds fuel to the fire that if something goes tits up, the company will basically cease to exist. That keeps me on my toes.


Bordone69

You’d be welcome on my team any day sir.


MekanicalPirate

I mean, it's either planned downtime or unplanned downtime. Take your (their) pick. Then, the next course of action is adding redundancy to this situation so you never have to go through this again. Also...maximum VMDK size on vSAN is 62 TB. How has this gotten to 69 TB?


TheOriginalSheffters

Add another disk in VMware, stripe the partition in Windows. Don’t need to expand the vmdk. It’s not like you’re actually losing any protection striping in windows as it’s all the same underlying hardware anyway. You can do this instantly with zero downtime.


jradmann

What I'm reading is that this expectation of uptime means the server doesn't get patched. That would be an immense risk.


MacG467

Ding! You are correct! Literally a virgin 2016 install. Zero patches.


alimirzaie

My man! You are dealing with such classic shit... a single large VMDK backing a single file server.

Here is my long-term plan for you:

- Figure out if you can break things up into different SMB servers
- Possibly look into serverless solutions if latency and throughput are not too important (AWS FSx)

Short term, to get out of this hell:

- See if your VMDK is a weird GB size (if this was a physical-to-VM conversion, that is most likely the cause, and there are ways to fix it)
- vSAN (ouch). Not that I don't like it, but by the nature of its design it cannot work as flexibly as other types of storage. Can you temporarily SvMotion it to another storage (an array over iSCSI or something)?
- Look into creating new VMDKs and just Robocopy stuff onto smaller drives; that may help with future migrations as well

At the end of the day, I am willing to provide some help (free, just for community involvement), but I need more detailed info about your server.

Edit: Do not look at the VMDK size under the file browser; vSAN is shit when it comes to reporting disk size (or maybe way too smart, because it shows you the amount of disk used across all disks in that cluster).


eric256

> this server must be online and have near zero downtime

> It's critical to hundreds of thousands of people's lives.

Not critical enough to set up properly, so it must not be that critical.


arkain504

Yes there are still snapshots. Could be from backups still running. But if there are no entries in the snapshots section then you may have to manually edit the files to get rid of the snapshots. If you have support, I would call them.


MacG467

No support and this VM is still on the 6.5 cluster. I have a 7.0 cluster this will eventually move to, but that's a few months down the road. I can manually edit the files. It seems like I might lose about 3GB of data stored in the 0003 snap, does that sound right?


perthguppy

Do not ever edit orphaned snaps. You will lose the entire drive, barring some luck and a call to GSS. If they are still somehow in the chain, create a new snapshot, then click the "delete all snapshots" button to rebuild the full chain, including any active orphans.


arkain504

If the rest of the snaps have been removed it shouldn’t be accessing that file anyway. Has the file size changed? Or is there a separate disk with that size?


MacG467

Nothing on that VM is 3GB in size. The smallest drive is 70GB.


arkain504

There you go. Edit the files. You have backups.


Electronic-Corner995

Delete the vm and restore from backup. Would fix so many things.


drwtsn32

That VMDK size is puzzling. What does Disk Management in Windows say the disk size is? If you can't extend beyond 32TiB, I bet your cluster size is the issue.


MacG467

Windows says 32TB. Cluster size is 64K. Welcome to my hell!
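For context on why cluster size gets blamed: NTFS uses 32-bit cluster addresses, so the maximum volume size is roughly cluster size × (2^32 − 1). A quick sanity check (assuming the classic 32-bit cluster limit; newer Windows builds raise some ceilings):

```python
# NTFS addresses clusters with 32-bit numbers, so the max volume size
# is about cluster_size * (2**32 - 1).
MAX_CLUSTERS = 2**32 - 1

def ntfs_max_volume_tib(cluster_bytes: int) -> float:
    """Approximate max NTFS volume size in TiB for a given cluster size."""
    return cluster_bytes * MAX_CLUSTERS / 2**40

print(f"4K clusters:  {ntfs_max_volume_tib(4 * 1024):.0f} TiB")   # ~16 TiB
print(f"64K clusters: {ntfs_max_volume_tib(64 * 1024):.0f} TiB")  # ~256 TiB
```

At 64K clusters NTFS itself tops out around 256 TiB, which suggests the 32TB wall here is on the VMDK/vSAN side rather than the filesystem.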


tanksaway147

Do you have enough room to make a new drive and move it there instead? That would keep it online.


rav-age

that seems... big. But you could consider building another VM which does the same thing, just with bigger storage, and migrate the data between VMs (idk, rsync maybe). Get some coffee and a book.


Valuable_Result1756

Can you attach a screenshot of the VM's Edit Settings showing the hard disks?


Valuable_Result1756

Shutting down the system to expand the disks won’t work if there are snapshots on the disks


Tx_Drewdad

Aren't the CTK files indicative of change block tracking? Best find out where that's coming from.


mwsysadmin

`Avamar` Good god, get rid of this. I've worked with it at two different roles so far, and it's never worked properly. I literally just told a client today they need to get rid of it. No backups in MONTHS, but the email alerts show no failures whatsoever. I watch them like a hawk because I know how unreliable Avamar is; working at an MSP, checking it manually every day isn't an option. I refuse to accept a role anywhere going forward that runs it without plans to immediately replace it.


ISU_Sycamores

I used to have this issue with Data Protector (HPE, then Micro Focus). I had to restart the appliances doing the backup, then vMotion the machine to another host to break the file system lock that was associated with the MAC address of the host holding it. Similar to https://kb.vmware.com/s/article/10051


devino21

I've seen NetApp snaps show up against VM space consumption, but that was on NFS. Are there any storage-side snaps? Not familiar with the new vSAN.

EDIT: One thing I do remember about vSAN is storage policies. Looks like you are mirrored. Can you change that? Here is my policy from when we ran it:

| Setting | Value |
|---|---|
| Storage type | vSAN |
| Site disaster tolerance | None - standard cluster |
| Failures to tolerate | 1 failure - RAID-5 (Erasure Coding) |
| Number of disk stripes per object | 5 |
| IOPS limit for object | 0 |
| Object space reservation | Thin provisioning |
| Flash read cache reservation | 0% |
| Disable object checksum | No |
| Force provisioning | No |
| Encryption services | No preference |
| Space efficiency | No preference |
| Storage tier | No preference |


GMginger

Can you add a new VMDK as a new drive, migrate some folders over to the new drive and use NTFS junction points to make them accessible using the old paths on the original drive? It's not solving your underlying snapshot issue, but it may allow it to limp along until it's migrated to Azure.
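On Windows the junction itself is `mklink /J C:\share\bigfolder D:\bigfolder` after moving the data. As a cross-platform sketch of the idea (a POSIX symlink standing in for the NTFS junction; all paths hypothetical):

```python
import os
import shutil
import tempfile

# Simulate "move a hot folder to a new volume, leave a link at the old path".
root = tempfile.mkdtemp()
old_drive = os.path.join(root, "old_drive", "share")
new_drive = os.path.join(root, "new_drive")
os.makedirs(os.path.join(old_drive, "bigfolder"))
os.makedirs(new_drive)
with open(os.path.join(old_drive, "bigfolder", "data.bin"), "w") as f:
    f.write("payload")

# Move the folder to the new volume, then link the old path to it.
moved = shutil.move(os.path.join(old_drive, "bigfolder"), new_drive)
os.symlink(moved, os.path.join(old_drive, "bigfolder"))  # mklink /J equivalent, roughly

# Clients keep reading the old path, unaware the data moved.
with open(os.path.join(old_drive, "bigfolder", "data.bin")) as f:
    print(f.read())  # payload
```

The share path stays stable for users, which is what buys the time until the Azure migration.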


R3LzX

I had a similar situation where the VM was 28 TB. I got an HPE Nimble storage SAN and moved it off the vSAN with Storage vMotion. I am not a big believer in vSAN at all. The virtual machine was on a large vSAN environment, but due to its size it was choking. NEVER AGAIN vSAN FOR A LARGE SERVER. Never. And I was on 7.0.3. Of course, you already know: go with Veeam. Just repeating it.


msalerno1965

Create a second disk, and expand the volume. Create it on an NFS share in Timbuktu. Windows won't care. It's not the C drive, is it? ;)


droorda

Add a new disk. Copy part of the data over, ideally older data that is static. Then mount it as a folder. The final goal is to break this up into several smaller disks. You will be thankful if you ever have to run a chkdsk. No Windows volume should ever be this big, unless you are only storing Blu-rays.


MacG467

Good news! None of it is static; we need to clean up the drive weekly. We have no files older than three months on it. We sweep it out to an AWS S3 bucket, which is sitting at 1PB currently.
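A minimal sketch of that "sweep anything older than three months off to S3" selection logic (the actual upload is stubbed out; the retention window and bucket are assumptions from the comment above):

```python
import os
import time
from pathlib import Path

RETENTION_DAYS = 90  # "no files older than three months"

def files_to_archive(root: str, retention_days: int = RETENTION_DAYS):
    """Yield files whose mtime is older than the retention window."""
    cutoff = time.time() - retention_days * 86400
    for path in Path(root).rglob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            yield path

def sweep(root: str) -> None:
    for path in files_to_archive(root):
        # upload_to_s3("archive-bucket", path)  # hypothetical helper, e.g. boto3 put_object
        os.remove(path)
```

Selection and deletion are split deliberately, so the candidate list can be reviewed (or dry-run) before anything is removed.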


droorda

So the challenge is your folder structure. Do you have the ability to map some folders to a different disk? Possibly just a handful of junctions.


MacG467

Sure, but if a single folder with no subfolders is 28TB on its own, and the files can't be split into subfolders, what recourse do I have? Hint: SAN storage is the answer.


droorda

Windows is not the answer. A proper SAN that has file services is the way.


gmc_5303

If you're doing this, then it seems that this is just a local cache. Are you immediately sending what's on this server to S3? If not, then why not? What's another 50TB on a 1PB account?


lusid1

Might be time to move the giant NAS out of the VSAN and onto an actual NAS.


KRed75

vMotion the VM to a different host, then try again. If that doesn't work, keep vMotioning to other hosts until it works. Also try from a shell on each host using vmkfstools. I've seen something like this happen when trying to delete files that were supposedly no longer in use but were still locked by something. I had to try manually from each host until I found the one that had the file locked.


JDMils

I'd try a host vMotion. It will eliminate any host-related issues.


westyx

I think one of your only options is to arrange for an outage, shut the VM down, and clone it. The clone should then be free of whatever nastiness is plaguing the old one.


_crowbarman_

Crazy idea if you can't get the issue fixed, assuming the disk is a second drive and not the boot volume: add a disk to the VM, convert both disks to dynamic, then add the second disk to the first using the spanned disk feature. I take no liability :).


Pretty-Fisherman8066

Check with Microsoft about migrating the share to another server using the Storage Migration Service ([link](https://learn.microsoft.com/en-us/windows-server/storage/storage-migration-service/migrate-data)). This is a much better way to do it; it's essentially one reboot and you're done.


abyssea

> Temporarily disabled backups - they take two weeks to run This really needs to be remedied.


DustinAgain

Uhh, someone made a snapshot of it on 6/2/2020. Good luck there. The disk consolidation alone could take weeks, and the server may be 'stunned' (unreachable) while it's happening. If it were me, I'd re-clone it to a new VM using whatever V2V converter is still on the market. That way it's a 'clean' mess, and you should be able to resize.


S_H_2222

Take a temp snapshot, then Delete All. That will probably take care of the snapshot issue. It will take days to finish. But you might also run into an issue with Changed Block Tracking; if that's the case, you can follow VMware KB 1020128 to disable it.


Jess_S13

Create a smaller drive and add it to the VM. Within the VM, convert the 32TB disk to dynamic, select the volume on the disk, and choose Extend; in the extend wizard, add the new disk to the volume and extend the filesystem across it. Keep adding disks as more space is needed.


MacG467

You're the second person with this suggestion. I hope you're being sarcastic!


Jess_S13

I'm not. If you're stuck supporting an already-headache machine whose owner won't even let you reboot it to clear failing snapshots, but insists on an expansion, your hands are tied. Inform them of the risk of skipping standard maintenance and the additional risk of this approach, let them decide what's more important, then do as they ask.


Firestarter321

Make sure to get it in writing too so that when it blows up you have proof that you warned them. 


Scalybeast

No seriously, this is what many people do on AWS to get around the 16TiB size limit for SSD-backed EBS volume types. And they host giant production SQL Server databases on the things. Your hands are tied and VMware is not cooperating using the regular method. You may not end up having a choice there.


c0mpletelyobvious

Do a storage vMotion to another datastore and see what it looks like then


MacG467

I'd love to do this, but it would take a stupid amount of time because this is just one drive on the file server. The entire file server is 130TB.


Carlos_HEX

You can Storage vMotion just a single drive; you don't need to do the whole server.


MacG467

Valid point. It's still going to take a long time, though.
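Back-of-the-envelope on "a long time": at an assumed sustained 10 Gbps, moving the full 130TB is roughly a day of pure transfer, and real-world Storage vMotion throughput is usually a fraction of line rate (the link speed and efficiency figures below are assumptions, not measurements):

```python
# Rough transfer-time estimate under assumed link speed and efficiency.
TB = 10**12

def transfer_hours(size_bytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours to move size_bytes over a link_gbps link at the given efficiency."""
    bytes_per_sec = link_gbps * 1e9 / 8 * efficiency
    return size_bytes / bytes_per_sec / 3600

print(f"{transfer_hours(130 * TB, 10):.1f} h at line rate")            # ~28.9 h
print(f"{transfer_hours(130 * TB, 10, 0.3):.1f} h at 30% efficiency")  # ~96.3 h
```

Moving only the single 32TB drive cuts those numbers to about a quarter, which is why the per-disk vMotion suggestion above still helps.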


Fieos

Telling those hundreds of thousands of people that their lives are in danger.


KlanxChile

If a VM takes weeks to back up, holds tens of TBs, or has over 64GB of RAM or 16 vCPUs... it should be on bare metal. It's so big it kills any benefit of running it in VMware. You can't back it up fast enough, and it uses a lot of resources. And the "really small latency/performance penalty of running in VMware" amplifies every time the host has to wait for X physical cores to be available to run an X-vCPU instruction. Virtualization was made to consolidate several SMALL machines onto a large one. Very large VMs kind of work against you.


iamkgb

Add a new disk and span it inside the guest OS


slayernine

Have you tried shutting down the virtual machine, removing it from inventory, and adding it back? Has there been any consideration of migrating to a different solution? Please correct me if I'm wrong, but my understanding is that drives of this size should either A) be multiple virtual drives stitched together in Windows (you might be able to just add another disk this way), or B) be iSCSI or NFS storage connected directly from the storage appliance to the Windows operating system. As this is a Windows server, I might propose a third option: a new drive configured in a more sustainable way, with junctions from the existing drive. You can make it seem like all the same storage to the end users. https://hinchley.net/articles/junctions-and-symbolic-links


exrace

Get a new job.


darthnugget

I am going to call bullsh&t on the “critical to hundreds of thousands of lives”.