March 13, 2014

First... if someone tells you that a RAID system never fails, tell them they don't have a clue as to what they are talking about. I have it on good authority that they can, and do.

Were we attacked? No, the shutdown was not caused by any attempt to take our forums down. We had indications that a disk was unstable; we now attribute most of the 503 errors you have seen to that, and to what follows in this story.

The "disk" finally failed last week, sometime after 3:40 a.m. on Thursday. The whole system crashed. That’s not supposed to happen in a RAID array. The system should continue to operate with a failed disk or, depending upon configuration, multiple failed disks.

We replaced the “bad” disk on Saturday. Only once we had replaced disk 1 did the remaining three disks start showing signs that they too were on the road to failure. Importantly, those three disks have not failed - yet. We'll come back to them below.

The answer to why the RAID array did not do what it should have done is that all four disks became corrupted. Disk 1 was just the first to fail.

We decided to go ahead and start the IDERA backup process Tuesday night in spite of the three wobbly disks, and we were encouraged that no errors were showing as late as 20 hours into the process. Nearly 30 hours later, the system came back online after we did some additional dialing and tweaking.

Okay, now about those three questionable disks in the RAID array today… We have ordered and now received a total of five disks… We will replace the three existing failing drives, and we will then expand our system from a total of four drives to six. By doing that, we will have a full RAID 5 configuration plus a hot spare. RAID 5 on its own tolerates one failed disk at a time; with the hot spare standing by to be rebuilt into the array, the system should be able to keep operating through a second failure as well.

The replacement of the three “bad” disks should be pretty straightforward. We will replace one disk per day: swap the first, let it stabilize and let the RAID rebuild do its thing, then replace the second the following day and the third the day after that. In other words, it is going to take three days and six trips to and from the colo to get that job done, but short of shutting the system down and reloading the best backup, that is the quickest and safest way to proceed.

The system is going to show some sporadic slowness over the next day or two. The reason is that the RAID system will be doing its thing with the "parity" process (and I never thought of myself as an I.T. guy - now I is... Parity... Love that word!).

As for all the time it has taken… I can’t begin to tell you how much of a goat rope this whole thing has been. Support delays from HP, Idera, NETWORK2000, PCCW, and Equinix, or confusion about “who’s on first”, have contributed hours of lost time. We had our tech support people sitting in the cage at our colo Tuesday from 11 a.m. until nearly midnight, mostly waiting for support responses. We had the colo accept our first replacement drive on Friday. On Monday they refused to accept the two expansion disks, and they were shipped back to HP. HP had to resend us drives yesterday. We have been forced to have all of our hardware shipped to our tech support person's office some miles away, and he now has to make sure the hardware gets to the colo: time and travel that we pay for.

By the end of next week, if not sooner, we expect to have a fully populated RAID array, operating to spec and with a hot spare if things go cattywampus again. More important to you, as a member, is that we believe the 503 errors will finally go away (yes, there was one about 20 minutes ago, but that was my doing, not the system's). When all is said and done, we think the system will be back to good health and operating at its peak performance.
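For anyone curious what the "parity" process is doing while the array rebuilds, here is a minimal sketch in Python. It assumes plain single-parity XOR, the scheme RAID 5 is built on, applied to small equal-sized blocks. The function names and sample data are made up for illustration; a real controller stripes data and rotates parity across the disks, which this does not attempt to model.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must be the same length")
    # zip(*blocks) walks the blocks column by column (one byte from each).
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Rebuild the one missing block from the survivors (data plus parity).

    XOR-ing every surviving block cancels out everything except the
    contents of the block that was lost.
    """
    return xor_blocks(surviving_blocks)


# Hypothetical stripe: three data blocks, one per disk, plus a parity block.
data = [b"AVSIM", b"forum", b"disks"]
parity = xor_blocks(data)

# Pretend the disk holding block 1 has failed.
lost_index = 1
survivors = [blk for i, blk in enumerate(data) if i != lost_index] + [parity]

rebuilt = rebuild_missing(survivors)
print(rebuilt == data[lost_index])
```

The rebuild has to read every surviving disk from end to end to recompute the missing one, which is why the array runs slower while it is in progress. It is also why a second failure during the rebuild is fatal for single parity.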
March 13, 2014 Tom, thank you for all the hard work that you and everyone on Team AVSIM have been performing to get everything back online.
March 13, 2014 Nice to see you getting your parity bits in order, Tom. If there's one thing that I have learnt in 30 years of IT support, it is never say never - closely followed by "it will only go in one way". Scott
March 13, 2014 Glad you are back Tom, and interesting read, although after I read these technical things I wonder why I bother... it mostly sounded like blah blah blah we are back up blah blah blah. I guess that is why we have IT guys. Mark CYYZ
March 13, 2014 Great news, I feel the forums a bit speedier, many thanks to Tom and all the AVSIM Team. Just wondering if support for TapaTalk will be back someday. My bad! I see TapaTalk support is working! Alexander Colka
March 13, 2014 Glad to have AVSIM back. I'm not saying this as a Tom fan*oy, as someone put it here: http://www.isitdownrightnow.com/avsim.net.html but as a proud AVSIM supporter. \Robert Hamlich/
March 13, 2014 It is very nice to have AVSIM back, I was quite concerned. João Alfredo It is impossible to please Greeks and Trojans É impossivel agradar Gregos e Troianos
March 13, 2014 Thanks a lot Tom, glad everything got solved. And yes, I was also afraid.... Best regards, Luis Hernández Main rig: self built, AMD Ryzen 7 5700X3D (with SMT off and CO -50 mV), 2x16 GB DDR4-3200 RAM, Nvidia RTX 5060Ti 16GB, 256 GB M.2 SSD (OS+apps) + 2x1 TB SATA III SSD (sims) + 1 TB 7200 rpm HDD (storage), ID-Cooling SE-224-XTS air cooler, Viewsonic VX2458-MHD 1920x1080@120-144 Hz (G-sync compatible), Windows 11. Running P3D v5.4 (with v4.5 scenery objects as an additional library, just in case), FSX-SE, MSFS2020, MSFS2024 and even FS9! Lossless Scaling for all my sims. What a godsend... Mobile rig: ASUS Zenbook UM425QA (AMD Ryzen 7 5800H APU @3.2 GHz and boost disabled, 1 TB M.2 SSD, 16 GB RAM, Windows 11 Pro). Running FS9 there. VKB Gladiator NXT Premium Left + GNX THQ as primary controllers. Xbox Series X|S wireless controller as standby/mobile.
March 13, 2014 Thank you for still making the Avsim community possible! -Sean L PPL + IFR, SEL HP/Complex.. LAS WN Ground Ops
March 13, 2014 Nice work, Tom! Thank you, and to all those who worked hard at getting her back up, thank you as well! Don
March 13, 2014 Author
"Great news, I feel the forums a bit speedier, many thanks to Tom and all the AVSIM Team. Just wondering if support for TapaTalk will be back someday. My bad! I see TapaTalk support is working!"
Yes, it is working, and take a look at the number of users being served.
March 13, 2014 Whoa, that was quite the headache for you and the team, to say the least. Strange not having AVSIM for a week. Like a whole community disappeared without a trace, something like that MH370 flight. One thing I've learned working with computers and electronics over 25 years... there are no guarantees. The best redundancy can still fail. The risk is greatly reduced, but it can still fail. The only guarantees in life are Death and Taxes. Glad it is all up and running now. Welcome back. CYVR LSZH I7-14700k 64gb 6000Mhz DDR5 ASUS z690 ROG STRIX Gaming RTX 4080 Super
March 13, 2014 Glad to see you're back up again! I, too, have noticed a performance increase. Hope it holds. I've seen a failure like that before on standalone servers with onboard RAID controllers. One disk goes bad, and on replacement the whole array goes south. Not fun. Thankfully you have good backups! Tony