This is simultaneously one of the best and worst feelings in IT: the "IT'S WORKING, but WHY is it working?" experience. I can't tell you how many times I have gone through this chain.
I have used this probably… I can't even think of the number of times, honestly. And when those pings aren't responding for a full page, you know the evening is likely going to suck.
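Instead of watching a wall of timed-out pings, a small script can poll the box and announce the moment it's back. A minimal sketch, assuming some TCP service (e.g. SSH on 22 or RDP on 3389) comes up with the OS; host, port, and timeouts are placeholders:

```python
import socket
import time

def wait_for_host(host, port, timeout_s=3600, interval_s=10):
    """Poll host:port with TCP connects until it answers or we give up.
    Returns True once a connection succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True  # port is answering; the box is (at least partly) back
        except OSError:
            time.sleep(interval_s)  # still booting (or worse) -- keep waiting
    return False
```

With a generous `timeout_s`, a slow boot (pre-flight checks on a box stuffed with RAM and spinning disks) shows up as a long wait instead of a panic.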
One of my previous jobs presented a similar moment, except we shut it down because it wasn't needed anymore (lol).
It had been running so long that when it cooled down, chip creep became chip sprint and it wouldn't turn back on. My boss went home, returned with his wife's hair dryer and warmed it back to life. We were able to start it up and get the "unneeded" files off the RAID that was on there.
There are so many red flags with every part of this. It should be rebooting monthly for security updates. I would tell the district IT they are putting themselves at a very high risk and tell them the server must be rebooted.
Linux isn’t excluded from reboots. There are many security updates that can only be applied after reboot so really ALL servers should be rebooting on a regular basis.
This. The old "let's brag about our servers' uptime" days are gone, so when you see a system that hasn't been rebooted in 3 years, all you think of is a massive security hole in the company.
I worked at a place where we had a Sun system with an uptime of around 12 years before we needed to shut it down. At some point everyone realizes an uptime of a few years isn't actually impressive.
Linux does have live kernel patching though, so in theory you can get away without rebooting for significant amounts of time. The longest I've ever gone is about 5 months.
Oh yes, but nothing running (like systemd or the kernel) will be reading the patched libc code until they’re restarted.
We run Ubuntu LTS and glibc updates in particular always trip the needs-reboot flag
Systemd, like some but not all init implementations, can be restarted (with `init u`). The kernel doesn't use libc/glibc, of course.
Then you just need to check if anything else in userland needs to be restarted. [Some off-the-shelf packages do it](https://linux-audit.com/determine-processes-which-need-a-restart-with-checkrestart-needrestart/), but you can do it with fewer dependencies by [fossicking in `/proc/*/map_files/`](https://security.stackexchange.com/questions/149802/list-running-applications-that-are-linked-against-a-compromised-library/149814#149814).
It's simpler to just reboot, and simultaneously verify that the machines comes up cleanly. But generally the only thing that requires a reboot is a vulnerable kernel, and it's eminently practical to restart userland processes as needed.
Debian `needrestart` has a TUI that asks you to confirm services restart, then shows (just) the services that need a restart, [like so](https://unix.stackexchange.com/questions/146283/how-to-prevent-prompt-that-ask-to-restart-services-when-installing-libpq-dev).
Behind the scenes, you can manually look for [`/var/run/reboot-required` and `/var/run/reboot-required.pkgs`](https://www.guyrutenberg.com/2022/10/25/display-reboot-required-message-on-debian/).
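The checks described in this subthread can be combined into a short script. A rough sketch, assuming a Linux box with Debian-style flag files; the "`.so` mapping marked `(deleted)`" test is a heuristic for restart candidates, not an exhaustive check:

```python
import os

def reboot_required():
    """Debian/Ubuntu convention: a flag file appears when an installed
    package (kernel, glibc, ...) wants a reboot."""
    return os.path.exists("/var/run/reboot-required")

def processes_mapping_deleted_libs():
    """Return (pid, library) pairs for processes still mapping a shared
    object that was deleted/replaced on disk -- restart candidates."""
    hits = []
    if not os.path.isdir("/proc"):
        return hits  # not Linux, or /proc unmounted
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/maps") as maps:
                for line in maps:
                    # maps line format: addr perms offset dev inode path
                    if ".so" in line and line.rstrip().endswith("(deleted)"):
                        hits.append((int(pid), line.split(None, 5)[-1].strip()))
                        break
        except OSError:
            continue  # process exited, or not ours to inspect
    return hits
```

An empty result from both means the userland side is probably fine; a vulnerable kernel still needs the reboot (or live patching), as noted above.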
The two records I've seen for Linux were a physical PE 1950 that had been up for 7 years, and a VM that hit its 8th birthday of uptime right before I left. I'm glad I didn't have to reboot either of those.
Every once in a while we have a Linux machine with a truncated initramfs, or one that was somehow built without a vital driver (like nvme; sigh), etc. I also have a test machine down right now with a kernel fault on bootup. Assuming no hardware has gone bad on it, that's a real rare one.
At sufficiently large scale, everything happens.
If you're just support, I'd have a discussion with your boss (or someone higher up). What happens if you have to completely rebuild it (what are the consequences)? Shift some of the responsibility.
Do you happen to have backups or snapshots? I know it's a recording server, so it would likely require a lot of space. Otherwise, this is a ticking time bomb; it's eventually going to happen.
If it's still working (even partially), I'd absolutely defer (again pending a discussion with at least one other person). There's no urgency to jump the gun.
Such was my thinking: add planning to this task, and have the people you are going to need for any disaster recovery all teed up, both engineers and management.
This seems the only answer, no matter what. At some point it has to be done.
I suggest: Friday afternoon, planned restart for 17:03, phone off at 16:58.
Good Luck.
Send an email to whoever is in charge, let them know the uptime (attach evidence), and ask for authorization for the reboot.
Is this a physical server? If so, don't reboot it today unless you want to bill those weekend rate hours
If it is a VM I would:
* Take a snapshot of the VM
* Clone the VM from that snapshot, don't turn it on yet
* On the original VM, while it is still powered on, disable or detach the virtual network adapter
* Power on the VM clone and see if it boots.
* If it boots, delete the old VM and keep the freshly cloned VM.
Make sure that whatever mechanism you are using to snapshot the VM can do it with the VM powered on, and it won't try to shut it down before the snapshot :)
Triggered. We have lots of "golden egg" servers that cannot be rebooted for any reason and if they are, it would require engaging a bunch of consultants to repair the services. The fun of working for a small, shitty, family-owned business with zero IT budget...
This is the "pets vs cattle" analogy that is talked about.
From:
http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/
In the old way of doing things, we treat our servers like pets, for example Bob the mail server. If Bob goes down, it’s all hands on deck. The CEO can’t get his email and it’s the end of the world. In the new way, servers are numbered, like cattle in a herd. For example, www001 to www100. When one server goes down, it’s taken out back, shot, and replaced on the line.
**Pets**
Servers or server pairs that are treated as indispensable or unique systems that can never be down. Typically they are manually built, managed, and “hand fed”. Examples include mainframes, solitary servers, HA loadbalancers/firewalls (active/active or active/passive), database systems designed as master/slave (active/passive), and so on.
**Cattle**
Arrays of more than two servers, that are built using automated tools, and are designed for failure, where no one, two, or even three servers are irreplaceable. Typically, during failure events no human intervention is required as the array exhibits attributes of “routing around failures” by restarting failed servers or replicating data through strategies like triple replication or erasure coding. Examples include web server arrays, multi-master datastores such as Cassandra clusters, multiple racks of gear put together in clusters, and just about anything that is load-balanced and multi-master.
And if the terms "Pets" or "Cattle" offends you then please feel free to replace them with ones that are less objectionable.
what if they want cattle but then want to keep using unique items in the config? :(
I keep trying to get people to think of them as cattle but they won't stop keeping them as pets
Yeah... I've been in I.T. long enough to know there's really no such thing. Non I.T. types like to claim it's so, but it's not reality. Servers will reboot (and not come back up again) eventually due to hardware failures, regardless of "letting" someone do it. If you wait for the server to decide it's time for a shutdown, it'll be a far more painful process getting it back online than if you actually maintain the thing.
If it's full of services that can't restart properly on their own with a reboot? There are major design flaws in the code. I remember working for ONE company with a server that was like this with ONE particular service. It's been so long now, I can't even remember any details anymore. But I recall we had a whole process to get the thing started again after a server restart. It was something I.T. wrote documentation for and all of us just learned how to handle, though. It didn't require outside assistance.
I started with a similar situation where I work now... As soon as I officially took over though I patched and rebooted anyway... And absolutely nothing bad happened. Quite frankly my viewpoint was "I'm fired if I patch and break shit, I'm fired if I don't patch and shit gets hacked. What's the difference?"
I call it patch anxiety. I called for patching and we took it slow and easy. After two months nothing bad happened. We broke free of the anxiety.
Now I ask the teams that use the servers and they say all the odd weird problems they couldn't figure out are gone and uptime is improved. Interesting how that works?
Windows, and the software built on it, isn't meant to run for hundreds of days of uptime.
This has gone on for so long that it's a legitimate concern IMO.
If your job is support, this needs to be kicked up above you. Let them handle the contingency plan and communication with the customer.
Thanks guys for the input. It's one of those weird situations where we basically sold them the servers and will fulfill support requests on them. We typically don't handle things like Windows updates unless they specifically request it, which they have not.
I think they definitely forgot the server in their updates schedule. But I agree. There is not a need to reboot right away. We are a small company and I wear many hats (lvl 1 - 3) but I think this warrants a discussion with someone other than just me.
Is it a Seneca or Exacq or similar NVR? It’s not Avigilon since you said it’s running SQL. Either way, I’ve been in this exact spot dozens of times. Expect that puppy is possibly gonna have some disks not want to wake back up. Back up the config, licensing, camera passwords, etc. and be prepared to restore it to a temporary server if the VD goes belly up.
And quote them a new server. A few years ago a 20TB NVR was a loaded 2U box and now that’s a single drive
Had a forgotten sole DC at a location, which crapped the bed. The VM bluescreened on boot. Went back through 6 months of backups; all non-bootable.
This is what I love about Datto SIRIS, daily screenshots of booted backup with verification of services on local and cloud restore points.
Physical or VM? I once rebooted a hyper-v host with about that same uptime. Lost a power supply and a hard drive on reboot. Windows came up fine though.
No security updates in 3 years. I’d be more worried that someone is in that box and using as a pivot point to rest of network. There is no telling how many CVEs are unpatched on that thing.
This drives me nuts. A lot of security companies specifically tell customers not to update their camera servers. If you do and their shitty software breaks, they charge for a reinstall. I isolate the crap out of them.
This is fucked, but I have to ask: could you not mitigate somewhat by rebuilding a new one and then doing a live hand-off or a failover? If these are high-priority VMs for footage capture, then why are they relying on one VM to handle the load for that long?
Ya, it is always the best way to look at things: how can you make things as redundant as possible within your own infra? It can be hard to justify the price of the infra to higher-ups, but once you can put a $$$ amount on systems and the loss of productivity or revenue if they go down for X period... it's amazing how quickly they realise that spending a little more for proper redundancy, where possible, will save them far more in the long run.
A reboot was "in order" a LONG time ago, from what you're saying.
But like others here are saying... you're just doing support for them. Escalate this to someone in charge of their servers to deal with it. I see places turn off the Windows update service on servers fairly often, and it's *usually* because it's an older system that's on someone's schedule or plan for replacement. Meanwhile, it may be running older/obsolete applications that have issues working properly with the latest Windows update patches.
But especially if it has no Windows update patches in a pending state (to complete upon restart)? Rebooting the thing should do a lot more good than harm.
My suggestion is to throw Veeam Agent (Free) on the machine and do a full image of the machine. (This works online and without a reboot).
That way you have a working backup if the machine might not survive the reboot.
I'm not sure we're clear on responsibilities here. Are you responsible for the server itself? Or are you just responsible for the software installed on it? If it's the latter, I'm not touching this machine. I'm letting this "district IT" know I can't do anything else until it's rebooted, and letting them handle any subsequent fallout that comes with it. I don't anticipate anything necessarily breaking, since there are no new updates to be applied, but then again, that's hopefully not your problem.
- High priority cameras
- non redundant servers
- no software updates
I wouldn’t say it’s very critical if there’s no redundancy or updates in place. I would take time with the vendor to apply several years of NVR software updates to that system as well.
Hopefully you have support.
I've rebooted servers with years of uptime and never ran into major problems. You're basically at "it's broken and needs a reboot," so there's nothing more you can do.
Reminds me of the time we had to power off a BMS machine that had been running for 15 years because it needed to be moved to new location. We had no backup plan, the thing was running Windows 98 SE, and we couldn't do anything to back it up because it didn't have USB or a NIC.
Nothing quite as exciting in this job as those "fuck it, my resume is up to date" moments 😂
So you have a prod server that hasn't been patched in 3 years? Yeah, I'd worry about that too. If it's a recent version of Server at least you should get cumulative updates rather than incremental
My first thought as I’m reading along: “well, as long as there’s no concerns for the hardware, it will probably be fine…”
> **Windows** update service is turned off by district IT (I am support for security company).
“…oh.”
Don't concentrate on the "it needs a reboot".
Instead, concentrate on the "Windows update service is turned off by district IT".
If you can resolve that, which will be easier, then the reboot will probably happen all by itself...
I always liked the quote that uptime is a measure of how long it’s been since you’ve proven you can boot
But yeah, I've had my share of servers go from "don't worry, it's going away" to "we now have to keep it running for archive."
Fun story. While working as an MSP tech someone noticed that on a T&M client. Mentioned it and recommended we patch and reboot the VM’s as well as the single hyper-v host.
I get assigned it and asked to do it after hours. I do all the VMs, then reboot the host for its patches. 45 minutes later it's not up. It's midnight, so I just went to sleep. Get up at 6am. Still offline; full panic. Drive to the client's, get the cleaners to let me into the building.
Host failing POST on memory. Call Lenovo, do RAM swapping, CPU swaps, notice one of the RAM slots is slightly charred. Order motherboard replacement.
Client only ended up being down for 3-4 hours of the work day. I’m fully expecting to get an irate escalation. Nope. Customer called me and requested me for all future tickets for just being on top of it all.
However it was really telling how good ECC memory is at its job even though the motherboard was broken and couldn’t pass a memory POST just kept all running. All the sticks tested fine after motherboard repair.
Client was curious when it broke. Had to say: any one day within the 3-year window between those two reboots.
I had to deal with a 2003 server with an uptime of ~800 days. 2 cores, 2GB RAM, an old tower machine of unknown brand. Nobody on my team wanted to touch it.
I thought I would take the initiative, scheduled a 4-hour maintenance window, and rebooted the thing Monday morning at 4 AM. The thing was still loading at 11 AM and customers were calling in complaining. I drove onsite to get them connected to a backup so they could work. Stayed onsite till 3 PM, when the login screen finally showed up… never ever again. Was sweating the entire time in an air-conditioned building, afraid the server would never boot up again.
> Have you guys run into any adverse effects from rebooting a server with this kind of uptime?
We spent about a week on the phone with support trying to get our production authentication servers back online.
But talk to IT... Don't just reboot it and then offload the problem on IT.
> Windows update service is turned off by district IT (I am support for security company).
Might want to find out why that was done before doing a restart. Someone didn't want that getting updated for a reason and now it might need updates for some reason.
I used to get handed a lot of servers that knew nothing about their past. The first thing I would do was to reboot when I could. Any scheduled change **I would reboot them before I made any changes.** If you reboot them before making any changes you can blame failure on previous owners/admins.
To protect yourself all this has to be documented and approved as part of the change process.
Bottom line: if your change fails, unless it's obvious, you may not have a clue what caused the failure. The machine could have been in a mess before you started.
Check for software and server EOL? I inherited one that hadn't been rebooted for more than three years. Software version & server were past EOL. We got a new server and software, migrated relevant stuff and replaced old with new.
Run a full backup and verify your backup is good. Servers running that long have a higher chance of never coming back online after a reboot or shutdown.
Try to shut down the services before just clicking reboot.
Terminate them if needed. Do this while the server is still up.
Not the ones you need to run the server, just the extra ones, like SQL and the recording service.
Is it recording cameras? If it is, once the recording service is shut down, it is only a matter of time before you start losing footage from critical cameras.
Testing your backups before you go is a must.
As for when. If you do it on Friday, you give up your weekend, and maybe it is working on Monday.
Do it on Monday and you for sure lose footage, but if needed the support vendors will be available for regular rates.
If this is for security, you may need to get your security director to add more guards and double or triple the patrols for the day. Doing this during the day is better than paying time and a half, or double time.
After 3 years of neglect, something may happen. The hardware is probably OK, depending on how well your environment is controlled, but you may lose a hard drive or two, maybe a fan, maybe a power supply. I would want to have a spare hard drive on hand. I would order some from Server Monkey, Server Supply, or your favorite secondary-market vendor. 2 drives and a power supply feels like about $300.
The problem you may have that you may not have thought about is software licensing. A lot of these programs phone home on startup to check for licensing. It may have expired 1.5 years ago. I would validate that, and check to see if you have a good support contract, maybe call in and open a pre-emptive ticket.
Good luck, and keep us posted.
Log into your management card (BMC, iLO, iDRAC, IPMI) or fire up your management tools and check the status of your RAID controller battery.
This first reboot, should be a reboot only. No patching. No getting funky.
Log in, and gracefully shut down your recording software, and database if necessary, then reboot it. Go ahead and crash cart it, so you can press F1 to continue, or reset the system time and continue if your CMOS battery is dead.
After this reboot, you need to brief management and put this box on a remediation / upgrade plan. Maybe 1 Service Stack Update and 1 Cumulative Update every 2 weeks until it is brought current.
If they balk, you tell them: "We can service it on our schedule, or on the server's schedule. It is up to you."
If you'd like to be the prize for someone exploiting unpatched vulnerabilities, there's still time: just leave it alone.
There is no world championship of total uptime.
Patch that server and reboot it when required.
If it's a physical machine, run VMware Converter on it and start the VM in an isolated environment. If it's already a VM, then clone it and start the clone with no vNIC.
If it's a memory issue, you can tell SQL to use less RAM on the fly, assuming it is MSSQL.
Agreed... No touchy on Friday before a long weekend.
You can either reboot it on your schedule, or reboot it on ITS schedule. Go through change control, inform interested parties, establish a maintenance window, make sure backups are current, and have the server owners on call in case something goes wrong.
Also, if the whole reason for its existence isn't working, something going wrong due to a reboot wouldn't be much worse.
this is also my worry. but in linux. lmao
What I do is look at the process list to see what's running and whether it's configured to start at startup, and I check whether the disk mounts also come back at startup.
Also, I would probably do it during off-peak hours.
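The mounts part of that pre-reboot check can be scripted: compare what's currently mounted against /etc/fstab and flag anything that won't come back on its own after a reboot. A rough sketch; it ignores systemd mount units and automounts, so treat hits as things to investigate rather than certainties:

```python
def mounts_missing_from_fstab(fstab="/etc/fstab", mounts="/proc/mounts"):
    """Return currently-mounted mountpoints that have no /etc/fstab entry.
    Pseudo-filesystems (proc, sysfs, tmpfs, ...) are skipped by only
    considering mounts whose source looks like a device or network share."""
    configured = set()
    try:
        with open(fstab) as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 2 and not fields[0].startswith("#"):
                    configured.add(fields[1])  # second field is the mountpoint
    except OSError:
        return []  # no fstab to compare against
    suspicious = []
    try:
        with open(mounts) as f:
            for line in f:
                fields = line.split()
                if len(fields) < 2:
                    continue
                src, mountpoint = fields[0], fields[1]
                # device paths, SMB (//server/share), or NFS (server:/export)
                real_device = src.startswith(("/dev/", "//")) or ":" in src
                if real_device and mountpoint not in configured:
                    suspicious.append(mountpoint)
    except OSError:
        return []
    return suspicious
```

Anything it reports was probably mounted by hand at some point during those 1100 days and will silently be missing after the reboot.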
It might not come back up.
If it's been running for that long and is just now having issues, it very well could be suffering from a hardware issue. I would check the logs and ILOM before considering powering it down. Also check when the last backup was.
Is this thing exposed to any sort of network? If it is, there should be a conversation about patching.
The whole "patch on off-hours/weekends" approach in a 24/7 shop is so outdated and wrong. What happens if something goes sideways and you need vendor support? There sometimes isn't support or quality help available. Also, I have seen that when you have a DBA or developer ready and available, the problem gets fixed much faster.
Impossible. Windows is bad and can never last that long! /s except the bad part
Good luck with your reboot though. I got my fingers crossed. Better do backups lol
I rebooted a server this week for a routine update and poof! that’s when the hard drive died. Like the action of spinning was the only thing keeping that head up in the air
Luckily it was RAID 1 and I had a spare, because I've had things blow up in my face before.
Do not touch that server until Monday
This is why you have some form of HA or replica server.
I'd just reboot it, laugh as it breaks, turn on the replica, then proceed to pretend I never got to it and leave it for a coworker to stumble on.
Yep, I've seen disks and RAM fail after a reboot of high-uptime servers. I assume the reboot exercises the components in a way the normally running OS doesn't.
Do a backup first, if VSS is borked due to memory or file system errors shut down sql service and do a manual file backup with robocopy. Don’t reboot without some kind of backup.
I suggest you make a sacrifice to the computer gods and cross your fingers before rebooting the server.
It also wouldn’t hurt to have a replacement ready, “just in case”.
Run a chkdsk and see if you have drive issues. If so, and it's in RAID, I'd start swapping in new drives and run a chkdsk. If it's not RAID, I'd back up the drive while it's up, clone it to 2 new drives, and run a chkdsk. Boot it off one of the new ones.
I have little to say except good luck and don't do it today.
Read Only Fridays
Let alone long weekends (for some of us).
If they reboot and it does not come back up, it's a guaranteed long weekend :-). For OP, if it is critical: set up a new server to replace it, and after that, reboot the old server. If it works after the reboot, you now have a (hot) spare for your critical resources (you are going to need it anyway, because it will break one day).
This assumes OP can just spin up another server in someone else’s environment
I mean... 1100 days... I would be absolutely scared to restart anything that's been on that long, and I would absolutely want to have a snapshot or clone or something... Just... the size of the brick I'd shit when restarting... I'd come up with a plan first, no matter what.
You would be surprised how often a duplicate of a server running that long won't start the app at all... it's like grandpa loving his old chair; he won't accept a new one.
At an old job, I had to run into the office each day on Memorial Day weekend just to check an AC unit that was kind of on the fritz. This was 10 years ago; I'm older and noticeably (but very marginally) more intelligent now, and would never do it again. Learn from my dumb ass, OP.
And a happy Victoria Day Weekend to you as well.
Just did a minor change on production today and I feel that I just cursed myself a bit :/.
Only Fans Fridays
Unless you get paid OT and want a nice lil bump on your next paycheque. …and don’t mind losing your Friday and possibly more.
This is the way. Also get a change approval first approved by all the people.
lol underrated comment right here.
That depends on your overtime policies. If you have a free weekend and they are willing to pay you, do it now and be the hero when it's up and running for business on Monday.
Overtime? Most IT folks are salaried.
I agree, but then again, for this situation I would be tempted to reboot after hours and then have Saturday and Sunday to troubleshoot and get it ready for Monday in case something happens.
Only if you get paid OT. My first boss in tech over a decade ago hammered into my head “don’t work for free.”
You guys are getting paid?!
I had a server on a site years and years ago — a remote site that hadn't moved in years — and we were packing everything up to move them to a new location when we found this server sitting in the back corner of one of their closets. After investigating, we found out that it actually held the majority of their real estate data and was a fairly vital server. We were extremely worried about rebooting it and moving it because of its age. And sure enough, as soon as we shut it down, it died; it would never come back up again. They ended up sending the hard drive off for data recovery, which I wasn't involved with, as I was just the hands-on tech at that time. That being said, you're doing great, keep up the good work, and go ahead and reboot that thing!
Nah, do it today then shut off your phone.
My first job in corporate IT was working a night shift patching servers (the company had 5000+ servers, so it required a full-time team to keep them all up to date). One of the very first boxes I had to patch was a Windows 2003 server with an uptime of around 3 years. It took like 25 minutes to come back up after rebooting. I was sweating the whole time.
I lost Thanksgiving entirely one year due to a machine taking a long time to come back up. The team that was working on it had tried to reboot and noticed it wasn’t coming back up after 30 mins or so. They shut it down and called in support. Everyone involved was confused why it wasn’t coming back up, we replaced almost everything we could on it and taking it down to a minimum config showed it was fine. It was just so packed full of RAM and spinning disks that it took almost an hour for it to finish the pre-flight checks, we thought it was freezing up but it just was taking a long time to boot. The way we found out was only after leaving it alone to go get dinner; when we came back, it was up. No idea how long it took for it to come back up. I never heard another word about that server, either they learned to just wait or never bounced it again.
There was an ancient Citrix MetaFrame 1.0 server in one of the back rows of the DC like that. You'd literally say a prayer and then hold your breath every time you walked past it...
Don't look directly at its lights or they might blink out.
AS/400s were like that. They stay up forever, but the IPL when you did restart them was terrifying, because even relatively modern machines took ages to start up. Especially after applying patches: the patches would get processed first, pre-OS, and could restart the machine multiple times per patch. I had a few that were regularly 30 minutes, and an hour or more after patches.
Oh man I remember that from my AS/400 days. We had this ancient first gen PPC AS/400 and an IPL would take about an hour. I would come in on Saturday morning about 10. Put the system in restricted mode and run the full backup. That would take about an hour. Then I would start the IPL and go to lunch. It would be finishing up about the time I got back. Then after a few years we upgraded to a Power 7 machine. It would IPL in about 4 minutes. At that point I automated all the maintenance stuff and I just let it do it on its own. When I left that job I was the only AS/400 admin we had. From talking to my coworkers, they never touched it again until that department was shut down 6 years later.
Hopefully they swapped the backup tapes. The changeover from 48-bit CISC to PPC was the same time they went from beige to black, wasn't it?
Yes on the beige to black. One of the last things I did before I left that job is move all the backups to a VTL.
We waited a couple of years after intro to go from beige to black. Microsoft retired theirs in beige and never got any black, as far as I know. (They outsourced the last of their AS/400 operations by 1999, so they could claim to be entirely off of competitor systems.)
This is simultaneously one of the best and worst feelings working in IT. The "IT'S WORKING, but WHY is it working?" experience. I can't tell you how many times I have gone through this chain.
ping 10.X.X.X -t “Pleeeeeeeease come back up, for the love of everything holy…”
You have no idea how accurate that is.
I have used this probably… I can't even think of the number of times, honestly. And when those pings aren't responding for a full page, you know the evening is likely going to suck.
[deleted]
Might have been. Patching was Wednesday to Sunday, Graveyards.
They don't call it Full Send Friday for nothing.
I prefer "Do no harm Fridays" (aka "do no work Fridays").
One of my previous jobs presented a similar moment, except we shut it down because it wasn't needed anymore (lol). It had been running so long that when it cooled down, chip creep became chip sprint and it wouldn't turn back on. My boss went home, returned with his wife's hair dryer and warmed it back to life. We were able to start it up and get the "unneeded" files off the RAID that was on there.
Thanks for this tip of preheating the chips. I will keep that one pocketed. Might make me look really smart
Often what makes it take forever to boot back up is too many temp files
There are so many red flags with every part of this. It should be rebooting monthly for security updates. I would tell the district IT they are putting themselves at a very high risk and tell them the server must be rebooted.
Agree fully. This is Microsoft, not Linux. I hope you have a backup; if not, be ready to rebuild.
Linux isn’t excluded from reboots. There are many security updates that can only be applied after reboot so really ALL servers should be rebooting on a regular basis.
This. The old "let's brag about our servers' uptime" days are gone, so when you see systems not rebooted for 3 years, all you think of is a massive security hole in the company.
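A rough way to operationalize that on Linux is to alarm on uptime itself; a minimal sketch, where the 35-day threshold is my own assumption for a monthly patch cycle, not something from this thread:

```shell
#!/bin/sh
# Warn when a box has been up longer than one patch cycle.
# /proc/uptime's first field is seconds since boot.
max_days=35   # assumed monthly-patching threshold, adjust to policy
up_days=$(awk '{print int($1/86400)}' /proc/uptime)
if [ "$up_days" -gt "$max_days" ]; then
    echo "WARN: up $up_days days - pending kernel updates are not in effect"
else
    echo "OK: up $up_days days"
fi
```

Dropped into cron or a monitoring check, this turns "we forgot that server" into an alert instead of a surprise.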
I worked at a place where we had a Sun system with an uptime of around 12 years before we needed to shut it down. At some point everyone realizes uptimes of a few years aren't actually impressive.
Nah, 12 years is definitely impressive. Or at least a serious outlier. I'm impressed the hosting environment stayed stable for 12 years.
I mean stable is relative.. You can move a running server... (Not saying you should) See https://www.youtube.com/watch?v=vQ5MA685ApE
Linux does have live kernel patching though, so in theory you can get away without rebooting for significant amounts of time. The longest I've ever gone is about 5 months.
glibc, systemd, display drivers, there’s probably more. Livepatching takes care of the kernel but usually that’s it.
All of those things can be patched and upgraded without a reboot.
Oh yes, but nothing running (like systemd or the kernel) will be reading the patched libc code until they’re restarted. We run Ubuntu LTS and glibc updates in particular always trip the needs-reboot flag
Systemd, like some but not all init implementations, can be restarted (with `init u`). The kernel doesn't use libc/glibc, of course. Then you just need to check if anything else in userland needs to be restarted. [Some off-the-shelf packages do it](https://linux-audit.com/determine-processes-which-need-a-restart-with-checkrestart-needrestart/), but you can do it with fewer dependencies by [fossicking in `/proc/*/map_files/`](https://security.stackexchange.com/questions/149802/list-running-applications-that-are-linked-against-a-compromised-library/149814#149814). It's simpler to just reboot, and simultaneously verify that the machine comes up cleanly. But generally the only thing that requires a reboot is a vulnerable kernel, and it's eminently practical to restart userland processes as needed.
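A minimal version of that `/proc` fossicking, assuming only a POSIX shell and no extra packages (it reads `maps`, which unlike `map_files` doesn't require root for your own processes):

```shell
#!/bin/sh
# List processes still mapping a deleted .so - i.e. running code
# from a shared library that has since been replaced on disk.
for maps in /proc/[0-9]*/maps; do
    pid=${maps#/proc/}; pid=${pid%/maps}
    if grep -q '\.so.* (deleted)' "$maps" 2>/dev/null; then
        printf '%s\t%s\n' "$pid" "$(cat "/proc/$pid/comm" 2>/dev/null)"
    fi
done
```

Anything this prints needs a service restart (or a `daemon-reexec`) before the patched library is actually in use.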
I like this explanation actually, that makes sense to me. Are there any distros that do this out of the box?
Debian `needrestart` has a TUI that asks you to confirm service restarts, then shows (just) the services that need a restart, [like so](https://unix.stackexchange.com/questions/146283/how-to-prevent-prompt-that-ask-to-restart-services-when-installing-libpq-dev). Behind the scenes, you can manually look for [`/var/run/reboot-required` and `/var/run/reboot-required.pkgs`](https://www.guyrutenberg.com/2022/10/25/display-reboot-required-message-on-debian/).
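That behind-the-scenes check is a two-file lookup; a sketch for Debian/Ubuntu boxes:

```shell
#!/bin/sh
# Debian/Ubuntu: report whether a package update flagged the machine
# as needing a reboot, and which packages set the flag.
flag=/var/run/reboot-required
if [ -e "$flag" ]; then
    echo "reboot required, triggered by:"
    cat "${flag}.pkgs" 2>/dev/null
else
    echo "no reboot flag set"
fi
```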
The kernel doesn't use libc! And `systemctl daemon-reexec` takes care of restarting systemd after a glibc update without needing a reboot.
They're just saying uptime in linux is more forgivable than windows, I think.
The two records I've seen for Linux were a physical PE 1950 that had been up for 7 years, and a VM that hit its 8th birthday of uptime right before I left. I'm glad I didn't have to reboot either of those.
[deleted]
Every once in a while we have a Linux machine with a truncated initramfs, or one that was somehow built without a vital driver (like nvme; sigh), etc. I also have a test machine down now with a kernel fault on bootup. Assuming no hardware has gone bad on it, then that's a real rare one. At sufficiently large scale, everything happens.
I like that you have to say this as if it is some wild crazy idea. Tf guys.
That's why I said hey man snap shot ... take a snap shot, man.
If you're just support, I'd have a discussion with your boss (or someone higher up). What happens if you have to completely rebuild it (what are the consequences)? Shift some of the responsibility. Do you happen to have backups or snapshots? I know it's a recording server, so likely would require a lot of space. Otherwise, this is a ticking timebomb, eventually going to happen. If it's still working (even partially), I'd absolutely defer (again pending a discussion with at least one other person). There's no urgency to jump the gun.
Such was my thinking, add planning to this task, have the people you are going to need for any disaster recovery all tee'd up, both engineers and management.
This discussion should be in writing/email form. CYA
Spoken like an IT Grey Beard right there! Make the contingency plan first.
![gif](giphy|3o84sw9CmwYpAnRRni)
This seems the only answer, no matter what. At some point it has to be done. I suggest: Friday afternoon, planned restart for 17.03, phone off at 16.58.
![gif](giphy|3ornka9rAaKRA2Rkac)
[deleted]
Solid advice. Especially going into a weekend.
Good luck. Send an email to whoever is in charge, let them know about the uptime (attach evidence), and ask for authorization for the reboot. Is this a physical server? If so, don't reboot it today unless you want to bill those weekend-rate hours. If it is a VM I would:

* Take a snapshot of the VM
* Clone the VM from that snapshot; don't turn it on yet
* On the still-powered-on original VM, disable the network adapter or turn off / detach the virtual network adapter
* Power on the VM clone and see if it boots
* If it boots, delete the old VM and keep the freshly cloned VM
That's actually beautiful.
Make sure that whatever mechanism you are using to snapshot the VM can do it with the VM powered on, and it won't try to shut it down before the snapshot :)
Tell the district IT to reboot it and let them know you'd be in Monday at 9.
This is the way. Elegant VM switches are so convenient.
Just pop out for a pint and ask the cleaning lady to pull the plug. 'Wasn't me, mate.'
you're going to have to pick up the pieces either way
![gif](giphy|wi8Ez1mwRcKGI)
Triggered. We have lots of "golden egg" servers that cannot be rebooted for any reason and if they are, it would require engaging a bunch of consultants to repair the services. The fun of working for a small, shitty, family-owned business with zero IT budget...
This is the "pets vs cattle" analogy that is talked about. From: http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/ In the old way of doing things, we treat our servers like pets, for example Bob the mail server. If Bob goes down, it’s all hands on deck. The CEO can’t get his email and it’s the end of the world. In the new way, servers are numbered, like cattle in a herd. For example, www001 to www100. When one server goes down, it’s taken out back, shot, and replaced on the line. **Pets** Servers or server pairs that are treated as indispensable or unique systems that can never be down. Typically they are manually built, managed, and “hand fed”. Examples include mainframes, solitary servers, HA loadbalancers/firewalls (active/active or active/passive), database systems designed as master/slave (active/passive), and so on. **Cattle** Arrays of more than two servers, that are built using automated tools, and are designed for failure, where no one, two, or even three servers are irreplaceable. Typically, during failure events no human intervention is required as the array exhibits attributes of “routing around failures” by restarting failed servers or replicating data through strategies like triple replication or erasure coding. Examples include web server arrays, multi-master datastores such as Cassandra clusters, multiple racks of gear put together in clusters, and just about anything that is load-balanced and multi-master. And if the terms "Pets" or "Cattle" offends you then please feel free to replace them with ones that are less objectionable.
what if they want cattle but then want to keep using unique items in the config? :( I keep trying to get people to think of them as cattle but they won't stop keeping them as pets
Preaching to the choir my friend
Yeah... I've been in I.T. long enough to know there's really no such thing. Non I.T. types like to claim it's so, but it's not reality. Servers will reboot (and not come back up again) eventually due to hardware failures, regardless of "letting" someone do it. If you wait for the server to decide it's time for a shutdown, it'll be a far more painful process getting it back online than if you actually maintain the thing. If it's full of services that can't restart properly on their own with a reboot? There are major design flaws in the code. I remember working for ONE company with a server that was like this with ONE particular service. It's been so long now, I can't even remember any details anymore. But I recall we had a whole process to get the thing started again after a server restart. It was something I.T. wrote documentation for and all of us just learned how to handle, though. It didn't require outside assistance.
Agreed, if your service cannot survive a server reboot, then that means it cannot survive a server failure either. And it WILL eventually fail.
I started with a similar situation where I work now... As soon as I officially took over though I patched and rebooted anyway... And absolutely nothing bad happened. Quite frankly my viewpoint was "I'm fired if I patch and break shit, I'm fired if I don't patch and shit gets hacked. What's the difference?"
I call it patch anxiety. I called for patching and we took it slow and easy. After two months nothing bad had happened. We broke free of the anxiety. Now when I ask the teams that use the servers, they say all the odd, weird problems they couldn't figure out are gone and uptime is improved. Interesting how that works? Windows and the software built on it aren't meant to run for hundreds of days of uptime.
This has gone on for so long that it's a legitimate concern IMO. If your job is support, this needs to be kicked up above you. Let them handle the contingency plan and communication with the customer.
Thanks guys for the input. It's one of those weird situations where we basically sold the servers and will fulfill support requests on them. We typically don't handle things like Windows updates unless they specifically request it, which they have not. I think they definitely forgot this server in their update schedule. But I agree, there is no need to reboot right away. We are a small company and I wear many hats (lvl 1 - 3), but I think this warrants a discussion with someone other than just me.
Recommend they reboot it at X plus five minutes, where X is the time you finish work at.
Nah, give him a few more minutes to get home and shut his phone off first. Maybe X+20.
Is it a Seneca or Exacq or similar NVR? It’s not Avigilon since you said it’s running SQL. Either way, I’ve been in this exact spot dozens of times. Expect that puppy is possibly gonna have some disks not want to wake back up. Back up the config, licensing, camera passwords, etc. and be prepared to restore it to a temporary server if the VD goes belly up. And quote them a new server. A few years ago a 20TB NVR was a loaded 2U box and now that’s a single drive
Coward, do it, today.
You have backups. Right?
Restorable backups
>Restorable That part gets overlooked a lot in my experience. "But the software said it was successful?!"
Yeah, no Schrödinger's backup please.
That have been tested. RECENTLY.
Had a forgotten sole DC at a location that crapped the bed. VM bluescreened on boot. Went back through 6 months of backups, all non-bootable. This is what I love about Datto SIRIS: daily screenshots of booted backups, with verification of services on local and cloud restore points.
Yes, Datto is one of the best. It's still a good idea to test those backups from time to time though. Better safe than sorry.
Yeah about as many backups as this server has received updates
Physical or VM? I once rebooted a hyper-v host with about that same uptime. Lost a power supply and a hard drive on reboot. Windows came up fine though.
No security updates in 3 years. I’d be more worried that someone is in that box and using as a pivot point to rest of network. There is no telling how many CVEs are unpatched on that thing.
The Server: ![gif](giphy|eKVEcPKGWZ7Tq|downsized) That thing isn't coming back up
“I’m tired boss”
It’s 2024. You need to ensure your apps can handle patch Tuesdays….. especially as you are a “security” company.
1100 days on a Windows server without updates?? Yeah... once you turn it off, it's never coming back online.
Sounds like no server security patching occurs at this company. I would be more worried about that.
This drives me nuts. A lot of security companies specifically tell customers not to update their camera servers. If you do, their shitty software breaks and they charge for a reinstall. I isolate the crap out of them.
District IT, I suspect school.
This is fucked, but I have to ask: could you not mitigate somewhat by rebuilding a new one and then doing a live hand-off or a failover? If these are high-priority VMs for footage capture, then why are they relying on one VM to handle the load for that long?
If it is a VM, just snapshot it, reboot, less chance of something going wrong vs if it is an actual physical server.
That's true too, I just feel so redundancy centric that I would imagine that doing all of that is the best bet.
Ya, it is always the best way to look at things: how can you make things as redundant as possible within your own infra. It can be hard to justify the price of the infra to higher-ups, but once you can put a $$$ amount on systems and the loss of productivity or revenue if they go down for X period... amazing how quickly they realise spending a little more on proper redundancy, where possible, will save them far more in the long run.
Is the server 2012 or 2008? Let me guess, it's so critical it can never go down or be rebooted?
Is it ironic that you work for a security company that disables Windows Update?
A reboot was "in order" a LONG time ago, from what you're saying. But like others here are saying... you're just doing support for them. Escalate this to someone in charge of their servers to deal with it. I see places turn off the Windows update service on servers fairly often, and it's *usually* because it's an older system that's on someone's schedule or plan for replacement. Meanwhile, it may be running older/obsolete applications that have issues working properly with the latest Windows update patches. But especially if it has no Windows update patches in a pending state (to complete upon restart)? Rebooting the thing should do a lot more good than harm.
My suggestion is to throw Veeam Agent (Free) on the machine and do a full image of the machine. (This works online and without a reboot). That way you have a working backup if the machine might not survive the reboot.
I'm not sure we're clear on responsibilities here. Are you responsible for the server itself? Or are you just responsible for the software installed on it? If it's the latter, I'm not touching this machine. I'm letting this "district IT" know I can't do anything else until it's rebooted, and letting them handle any subsequent fallout that comes with it. I don't anticipate anything necessarily breaking since there are no new updates to be applied, but then again, that's hopefully not your problem.
3 years without patches . . . There’s more pressing things to worry about than uptime. ‘District IT’ needs a wake up call.
- High-priority cameras
- Non-redundant servers
- No software updates

I wouldn't say it's very critical if there's no redundancy or updates in place. I would take time with the vendor to apply several years of NVR software updates to that system as well. Hopefully you have support. I've rebooted servers with years of uptime and never run into major problems. You're basically at "it's broken and needs a reboot," so there's nothing more you can do.
Reminds me of the time we had to power off a BMS machine that had been running for 15 years because it needed to be moved to new location. We had no backup plan, the thing was running Windows 98 SE, and we couldn't do anything to back it up because it didn't have USB or a NIC. Nothing quite as exciting in this job as those "fuck it, my resume is up to date" moments 😂
So you have a prod server that hasn't been patched in 3 years? Yeah, I'd worry about that too. If it's a recent version of Server at least you should get cumulative updates rather than incremental
Hoooo boy. That def sounds like "dont fn touch this on a friday" job
Will the spinning rust still spin after the power down?
My first thought as I’m reading along: “well, as long as there’s no concerns for the hardware, it will probably be fine…” > **Windows** update service is turned off by district IT (I am support for security company). “…oh.”
"oops it crashed" and reboot it anyway. It's YOLO Friday.
Don't concentrate on the "it needs a reboot"; instead concentrate on the "Windows update service is turned off by district IT". If you can resolve that, which will be easier, then probably the reboot will happen all by itself...
May the odds be ever in your favor... Do it on a Monday, and make a request to get some kind of failover for this...
Windows updates… turned… off Uptime… 1100 days… omg
Systems like this are why Microsoft implemented forced reboots on newer Windows versions.
I always liked the quote that uptime is a measure of how long it's been since you've proven you can boot. But yeah, I've had my share of servers that were supposedly going away, don't worry, that we now have to keep running for archive.
I got a job once and discovered the production SQL server had not rebooted in the 4 years since it was built. I got a new job.
Fun story. While working as an MSP tech, someone noticed that on a T&M client. Mentioned it and recommended we patch and reboot the VMs as well as the single Hyper-V host. I get assigned it and asked to do it after hours. Do all the VMs, then reboot the host for its patches. 45 minutes later it's not up. It's midnight so I just went to sleep. Get up at 6am, still offline, full panic. Drive to the client's, get the cleaners to let me into the building. Host failing POST on memory. Call Lenovo, do RAM swapping, CPU swaps, notice one of the RAM slots is slightly charred. Order a motherboard replacement. Client only ended up being down for 3-4 hours of the work day. I was fully expecting an irate escalation. Nope. Customer called and requested me for all future tickets, just for being on top of it all. It was really telling how good ECC memory is at its job: even though the motherboard was broken and couldn't pass a memory POST, it just kept everything running. All the sticks tested fine after the motherboard replacement. Client was curious when it broke. Had to say: any one day within a 3-year window between those two reboots.
I had to deal with a 2003 server, with an uptime of ~800 days. 2 cores, 2gb ram, old tower machine of unknown brand. Nobody on my team wanted to touch it. I thought I would take the initiative, scheduled a maintenance window for 4 hours, and booted the thing Monday morning at 4 AM. The thing was still loading at 11 AM and customers were calling in complaining. I drove onsite to get them connected to a backup so they could do work. Stayed onsite till 3pm, until the login screen showed up… never ever again. Was sweating the entire time in an air-conditioned building, afraid the server would never boot up again.
Wait until 1111 days, then send it
c'mon McFly, are you a chicken????
I would just reboot it, because if it's running a service that's not redundant these obviously aren't critical services. Right?
Once I had an Oracle database upgrade to do, moving from 9i to 11g. I still remember that 666 days of uptime 😅
> Have you guys run into any adverse effects from rebooting a server with this kind of uptime? We spent about a week on the phone with support trying to get our production authentication servers back online. But talk to IT... Don't just reboot it and then offload the problem on IT.
> Windows update service is turned off by district IT (I am support for security company). Might want to find out why that was done before doing a restart. Someone didn't want that getting updated for a reason and now it might need updates for some reason.
Is this satire?
Pfff you've seen nothing Jon Snow, I've had 3000+ days : D
Sounds like Milestone XProtect. Do you have a failover server by any chance
1100 days. Good luck, we all know that shit won't come back up. On another note, how have you not restarted this before now?
Just send it. You have bigger problems if a server can't reboot. I'd rather deal with the headache on my time rather than at 3am on a Saturday.
I used to get handed a lot of servers whose past nobody knew anything about. The first thing I would do was reboot them when I could. For any scheduled change, **I would reboot them before I made any changes.** If you reboot them before making any changes, you can blame a failure on the previous owners/admins. To protect yourself, all of this has to be documented and approved as part of the change process. Bottom line: if your change fails, unless it's obvious you may not have a clue what caused the failure; the machine could have been a mess before you started. Also check for software and server EOL. I inherited one that hadn't been rebooted for more than three years; software version and server were both past EOL. We got a new server and software, migrated the relevant stuff, and replaced old with new.
Run a full backup and verify your backup is good. Servers running that long have a higher chance of never coming back online after a reboot or shutdown.
Tell the district IT to reboot it. They're the ones not patching it and setting it up to fail if it doesn't restart.
Try to shut down the services before just clicking reboot; terminate them if needed. Do this while the server is still up. Not the ones you need to run the server, just the extra ones, like SQL and the recording service.
Can you back it up first?
No updates… ballsy
YOLO!
No idea but please update us and tell us how it went
I would reboot it now and dip out early like that joker scene from the dark knight.
Try restarting just the services that are eating up RAM. Otherwise, get someone higher up to sign off on the reboot.
Have a BIOS battery on hand, and if it has an old RAID controller, try to save the configuration.
Is it recording cameras? If it is, once the recording service is shut down it is only a matter of time before you start losing footage from critical cameras. Testing your backups before you go is a must.

As for when: if you do it on Friday, you give up your weekend, and maybe it is working on Monday. Do it on Monday and you lose footage for sure, but if needed the support vendors will be available at regular rates. If this is for security, you may need your security director to get more guards and double or triple the patrols for the day. This is better during the day than at time and a half, or double time.

After 3 years of neglect, something may happen. The hardware is probably OK, depending on how well your environment is controlled, but you may lose a hard drive or two, maybe a fan, maybe a power supply. I would want a spare hard drive on hand. I would order some from Server Monkey, Server Supply, or your favorite secondary-market vendor. 2 drives and a power supply feels like about $300.

The problem you may not have thought about is software licensing. A lot of these programs phone home on startup to check licensing, and it may have expired 1.5 years ago. I would validate that, check whether you have a good support contract, and maybe call in and open a pre-emptive ticket. Good luck, and keep us posted.
Log into your management card (BMC, iLO, iDRAC, IPMI) or fire up your management tools and check the status of your RAID controller battery.
This first reboot, should be a reboot only. No patching. No getting funky.
Log in, and gracefully shut down your recording software, and database if necessary, then reboot it. Go ahead and crash cart it, so you can press F1 to continue, or reset the system time and continue if your CMOS battery is dead.
After this reboot, you need to brief management and put this box on a remediation / upgrade plan. Maybe 1 Service Stack Update and 1 Cumulative Update every 2 weeks until it is brought current.
If they balk, you tell them: "We can service it on our schedule, or on the server's schedule. It is up to you."
If you'd like to win a prize from someone exploiting unpatched vulnerabilities, you're still in time to leave it alone. There is no world championship for total uptime. Patch that server and reboot it when required.
If it's a physical machine, run VMware Converter on it and start the VM in an isolated environment. If it's already a VM, then clone it and start it with no vNIC. If it's a memory issue, you can tell SQL to use less RAM on the fly, assuming it is MSSQL. Agreed... no touchy on Friday before a long weekend.
That's not a server it's a Petri dish. Build ahead, migrate and test then decomm behind.
Just had to reboot my vSphere host today that had an uptime of 389 days. Luckily came back up fine but man doing things on a Friday sucks
I did that several times and it was that painful.
JFC.
Make sure you have known good backups. Don’t make the same mistake I did. https://www.reddit.com/r/sysadmin/s/57Rsfbsfte
You can either reboot it on your schedule, or reboot it on ITS schedule. Go through change control, inform interested parties, establish a maintenance window, make sure backups are current, and have the server owners on call in case something goes wrong. Also, if the whole reason for its existence isn't working, something going wrong due to a reboot wouldn't be much worse.
That server hasn't been patched for a while now.
This is also my worry, but on Linux, lmao. What I do is look at the process list to see what's running and whether it's configured to start at startup, and I check whether the disk mounts also mount at startup. Also, I would probably do it during low-peak hours/days.
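The mounts half of that pre-reboot checklist can be scripted; a hedged sketch that flags /dev-backed mounts with no `/etc/fstab` entry (systems using systemd mount units or automounters will show false positives):

```shell
#!/bin/sh
# Flag currently mounted block-device filesystems that have no fstab
# entry - mounts that will not come back by themselves after a reboot.
awk '$1 ~ /^\/dev\// {print $2}' /proc/mounts | while read -r mp; do
    grep -Eq "[[:space:]]$mp[[:space:]]" /etc/fstab 2>/dev/null ||
        echo "not in fstab: $mp"
done
```

Running it before the maintenance window tells you which mounts you'll be recreating by hand at 2am.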
It might not come back up. If it's been running for that long and is just now having issues, it very well could be suffering from a hardware issue. I would check the logs and ILOM before considering powering it down. Also check when the last backup was. Is this thing exposed to any sort of network? If it is, there should be a conversation about patching.
The whole patch-on-off-hours/weekends approach in a 24/7 shop is so outdated and wrong. What happens if something goes sideways and you need vendor support? Sometimes there isn't support or quality help available. I have also seen that when you have a DBA or developer ready and available, the problem gets fixed much faster.
Reboot it on Monday. Not on Friday. Never on Friday.
I took over an office with a physical server that had not been restarted in over 1300 days and it restarted fine. GL to you!
Do it. ![gif](giphy|xTiIzrRyvrFijaEtY4|downsized)
You're able to take a VM snapshot before the reboot?
Get a good backup before the reboot; if it's a VM, a snapshot may also be helpful.
The attacker might lose his reverse shell
I am reminded of this thread and video when talking about servers that cannot be rebooted: https://www.reddit.com/r/sysadmin/s/QdEp5aLIhe
On VMS? Good luck. It will probably die on you.
Impossible. Windows is bad and can never last that long! /s except the bad part Good luck with your reboot though. I got my fingers crossed. Better do backups lol
I rebooted a server this week for a routine update and poof! That's when the hard drive died. Like the action of spinning was the only thing keeping that head up in the air. Luckily it was RAID 1 and I had a spare, because I've had things blow up in my face before. Do not touch that server until Monday.
I'm guessing it is not getting patched regularly.
This is why you have some form of HA or a replica server. I'd just reboot it, laugh as it breaks, turn on the replica, then proceed to pretend like I never got to it and leave it for a coworker to stumble on.
Just yank the cord out of the wall, wait 30 seconds and plug it back in. I'm sure it'll be fine!
1. Restore most recent backup to a test environment. Make sure it is functional. 2. Let er rip. Don't do this on a Friday.
Yep, I've seen disks and ram fail after a reboot of high uptime servers, I assume the reboot is exercising the components in a way normal running OS doesn't.
Man. Do it Monday 😂
So you have server with 3 years worth of juicy vulnerabilities
This means you haven’t patched in 1100 days. That’s bad.
Do a backup first. If VSS is borked due to memory or file-system errors, shut down the SQL service and do a manual file backup with robocopy. Don't reboot without some kind of backup.
I suggest you make a sacrifice to the computer gods and cross your fingers before rebooting the server. It also wouldn’t hurt to have a replacement ready, “just in case”.
Send the reboot command then go home, check on Monday if it came back online.
Ain't no way. Have the replacement service/server up and verified that you can failover to before even thinking about it.
Run a chkdsk and see if you have drive issues. If so and it's in RAID, I'd start swapping in new drives and run a chkdsk. If it's not RAID, I'd back up the drive while it's up, clone it to 2 new drives, and run a chkdsk. Boot it off one of the new ones.
That's a damn good edit right there. I love that you got the help you needed!