Thirteen Production Disasters That Could Happen at Any Time ASP.NET

Startups don’t always have adequate resources in necessary areas like system administration, database administration, tech support, QA, and so on. In a startup environment, developers sometimes end up having multiple roles and may not be prepared for production challenges. At Pageflakes, we had several downtime and data loss problems in our first year of operation due to production disasters. However, all of these major problems taught us how to build a secure, reliable production environment that can run a medium-scale web site with 99 percent uptime.

The Hard Drive Crashes, Overheats

We experienced hard drive crashes frequently with cheap hosting providers. There were several times when a hard drive would get overheated and turn itself off, and some were permanently damaged. Hosting providers used cheap SATA drives that were not reliable. If you can spend the money, go for SCSI drives. HP has a variety to choose from, and it’s always better to go for SCSI drives on web and database servers. They are costly, but will save you from frequent disasters.

Pay attention to disk speed rather than CPU speed. Generally, the processor and bus are standard, and their performance doesn’t vary. Disk I/O is generally the main bottleneck for production systems. For a database server, the only thing you should look at is disk speed. Unless you have some very CPU-intensive slow queries, the CPU will never peak. Disk I/O will always be the bottleneck for database servers. So, for databases, you need to choose the fastest storage solution. A SCSI drive at 15,000 RPM is a good solution; anything less than that is too slow.

The Controller Malfunctions

This happens when the servers aren’t properly tested and are handed over to you in a hurry. Before you accept any server, make sure you get a written guarantee that it has passed all sorts of exhaustive hardware tests. Dell’s server BIOS contains test suites for testing controller, disks, CPU, etc. You can also use BurnInTest from PassMark (www.passmark.com) to test your server’s capability under high disk and CPU load. Just run the benchmark test for four to eight hours and see how your server is doing. Keep an eye on the CPU’s temperature meters and hard drives to make sure they don’t overheat.

HD Tach testing hard disk speed

The RAID Malfunctions

A RAID (Redundant Array of Inexpensive Disks) combines physical hard disks into a single logical unit by using either special hardware or software. Often, hardware solutions are designed to present themselves to the attached system as a single hard drive, and the operating system remains unaware of the technical workings. Software solutions are typically implemented in the operating system and also present the RAID drive as a single drive.

At Pageflakes, we had a RAID malfunction that resulted in disks corrupting data. We used Windows 2003’s built-in RAID controller, but we have learned not to depend on the software RAID and now pay extra for a hardware RAID. Make sure when you purchase a server that it has a hardware RAID controller.

Once you have chosen the right disks, the next step is to choose the right RAID configuration. RAID 1 takes two identical physical disks and presents them as one single disk to the operating system. Thus, every disk write goes to both disks simultaneously. If one disk fails, the other disk can take over and continue to serve the logical disk. Nowadays, servers support “hot swap”-enabled disks where you can take a disk out of the server while the server is running.

The controller immediately diverts all requests to the other disk. The controller synchronizes both disks once the new disk is put into the controller. RAID 1 is suitable for web servers.

Pros and cons of RAID 1

Pros:

• Mirroring provides 100 percent duplication of data.

• Read performance is faster than a single disk, if the array controller can perform simultaneous reads from both devices on a mirrored pair. You should make sure your RAID controller has this ability. Otherwise, the disk read will become slower than having a single disk.

• Delivers the best performance of any redundant array type during a rebuild. As soon as you insert a replacement disk, it quickly synchronizes with the operational disk.

• No reconstruction of data is needed. If a disk fails, all you have to do is copy to a new disk on a block-by-block basis.

• There’s no performance hit when a disk fails; storage appears to function normally to the outside world.

Cons:

• Writes the information twice. Because of this, there is a minor performance penalty when compared to writing to a single disk.

• The I/O performance in a mixed read-write environment is essentially no better than the performance of a single disk storage system.

• Requires two disks for 100 percent redundancy, doubling the cost. However, disks are now cheap.

• The size of the logical drive will be the same as the size of the smaller of the two disks.

For database servers, RAID 5 is a better choice because it is faster than RAID 1. RAID 5 is more expensive than RAID 1 because it requires a minimum of three drives. But one drive can fail without affecting the availability of data. In the event of a failure, the controller regenerates the failed drive’s lost data from the other surviving drives.
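To make the regeneration step concrete, here is a minimal Python sketch of parity-based recovery. It assumes the usual XOR parity scheme and a single stripe made of two data blocks and one parity block; real controllers rotate parity across drives and work on much larger blocks. The block contents and the names `xor_blocks`, `data_a`, `data_b`, and `parity` are illustrative only.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe: two data blocks, with the parity block on a third drive.
data_a = b"page"
data_b = b"flak"
parity = xor_blocks([data_a, data_b])

# Simulate losing the drive that held data_b and rebuilding it
# from the surviving data block and the parity block.
rebuilt_b = xor_blocks([data_a, parity])
```

Because XOR is its own inverse, any one missing block can be rebuilt from the remaining blocks in the stripe, which is why the array survives exactly one drive failure.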

Pros and cons of RAID 5

Pros:

• Best suited for heavy read applications, such as database servers, where the SELECT operation is used more than the INSERT/UPDATE/DELETE operations.

• The amount of usable space is the number of physical drives in the virtual drive minus 1.

Cons:

• A single disk failure reduces the array to RAID 0, which has no redundancy at all.

• Performance is slower than RAID 1 when rebuilding.

• Write performance is slower than read (write penalty).
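The capacity rules from the two pros-and-cons lists can be written down as a small calculation. This is a sketch in Python under the assumptions stated in the text: RAID 1 uses two disks and is limited by the smaller one, and RAID 5 needs at least three disks and gives up one disk's worth of space. The function name `usable_capacity_gb` and the sample sizes are made up for illustration.

```python
def usable_capacity_gb(level, disk_sizes_gb):
    """Return the usable capacity in GB for a RAID 1 or RAID 5 array."""
    smallest = min(disk_sizes_gb)
    count = len(disk_sizes_gb)
    if level == 1:
        if count != 2:
            raise ValueError("this sketch models RAID 1 as a two-disk mirror")
        # A mirror can only hold as much as its smaller disk.
        return smallest
    if level == 5:
        if count < 3:
            raise ValueError("RAID 5 requires at least three disks")
        # One disk's worth of space is consumed by parity.
        return smallest * (count - 1)
    raise ValueError("only RAID 1 and RAID 5 are modeled here")
```

For example, `usable_capacity_gb(5, [300, 300, 300])` gives 600, while a mirror of a 146 GB and a 300 GB disk gives only 146.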

The CPU Overheats and Burns Out

One time our server’s CPU burnt out due to overheating. We were partially responsible because a bad query made the SQL Server spike to 100 percent CPU. So, the poor server ran for four hours on 100 percent CPU and then died.

Servers should never burn out because of high CPU usage. Usually, servers have monitoring systems, and the server turns itself off when the CPU is about to burn out. This means the defective server did not have the monitoring system working properly. To avoid this, you should run tools that push your servers to 100 percent CPU for 8 hours to ensure they can withstand it. In the event of overheating, the monitoring systems should turn the servers off and save the hardware. However, if the server has a good cooling system, then the CPU will not overheat in eight hours.

Whenever we move to a new hosting provider, we run stress test tools to simulate 100 percent CPU load on all our servers for 8 to 12 hours. The figure shows seven of our servers running at 100 percent CPU for hours without any problem. None of the servers became overheated or turned themselves off, which means we also invested in a good cooling system.

Stress testing the CPU at 100 percent load for 8 hours on all our servers

The Firewall Goes Down

Our hosting provider’s firewall once malfunctioned and exposed our web servers to the public Internet unprotected. We soon found out the servers became infected and started to automatically shut down, so we had to format and patch them, and turn on Windows Firewall. Nowadays, as best practice, we always turn on Windows Firewall on the external network card, which is connected to the hardware firewall. In fact, we also installed a redundant firewall just to be on the safe side.

You should turn off File and Printer Sharing on the external network card (see Figure). And unless you have a redundant firewall, you should turn on Windows Firewall. You might worry that this will affect performance, but we have seen that Windows Firewall has almost no impact on performance.

Disable File and Printer Sharing from the public network card. You should never copy files via Windows File Sharing over the Internet.

You should also disable the NetBIOS protocol because you should never need it from an external network (see Figure). Your server should be completely invisible to the public network, apart from having port 80 (HTTP) and port 3389 (for remote desktop) open.
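One way to check this from outside is to probe the server from a machine on the public Internet and flag anything that answers besides the two intended ports. The following Python sketch does that with plain TCP connection attempts. The probe list is an assumption chosen for illustration (FTP, Telnet, RPC, NetBIOS, SMB, and SQL Server ports); adjust it to your environment, and only probe servers you own.

```python
import socket

# Ports the text says may stay open: HTTP and remote desktop.
ALLOWED_PORTS = {80, 3389}

# Illustrative probe list: FTP, Telnet, HTTP, RPC, NetBIOS, SMB, SQL Server, RDP.
PROBE_PORTS = [21, 23, 80, 135, 139, 445, 1433, 3389]


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unexpected_open_ports(host, probe_ports=PROBE_PORTS,
                          allowed=ALLOWED_PORTS, timeout=1.0):
    """List probed ports that accept connections but are not on the allowed list."""
    return [port for port in probe_ports
            if port not in allowed and is_port_open(host, port, timeout)]
```

An empty result from `unexpected_open_ports("your.server.example")` (a placeholder host name) only means none of the probed ports answered; it is not a substitute for a full scan.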

The Remote Desktop Stops Working After a Patch Installation

Several times after installing the latest patches from Windows Update, the remote desktop stopped working. Sometimes restarting the server fixed it, but other times we had to uninstall the patch. If this happens, you can only get into your server by using KVM over IP or calling a support technician.

Disable NetBIOS from the public network card. NetBIOS has security vulnerabilities.

KVM over IP (keyboard, video, mouse over IP) is special hardware that connects to servers and transmits the server’s screen output to you. It also takes the keyboard and mouse input and simulates it on the server. KVM works as if a monitor, keyboard, and mouse were all directly connected to the server. You can use regular remote desktop to connect to KVM and work on the server as if you are physically there. Benefits of KVM include:

• Access to all server platforms and all server types.

• A “Direct Connect Real Time” solution with no mouse delays due to conversion of signals. Software has to convert signals, which causes delays.

• Full use of GUIs.

• Full BIOS level access even when the network is down.

• The ability to get to the command line and rebuild servers remotely.

• Visibility into server boot errors and the ability to take action, e.g., “Non-system disk error, please replace system disk and press any key to continue” or “Power supply failure, press F1 to continue.”

• Complete security prevents hacking from the KVM; a physical connection is required to access the system.

If remote desktop is not working, your firewall is down, or your server’s external network card is not working, you can easily get into the server using KVM. Make sure your hosting provider has KVM support.

Remote Desktop Exceeds Connection Limit and Login Fails

This happens when users don’t log off properly from the remote desktop and just close the remote desktop client instead. The proper way to log off from a remote desktop is to go to the Start Menu and select “Log off.” If you don’t, you leave a session open and it will remain as a disconnected session. When disconnected sessions exceed the maximum number of active sessions, it prevents new sessions, which means no one can get into the server. If this happens, go to Run and issue the mstsc /console command. This will launch the same old remote desktop client you use every day, but when you connect to remote desktops, it will connect you in console mode. Console mode is when you connect to the server as if you are right in front of it and using the server’s keyboard and mouse. Only one person can be connected in console mode at a time. Once you get into it, it shows you the regular Windows GUI and there’s nothing different about it. You can launch Terminal Services Manager, see the disconnected sessions, and boot them out.

The Database Becomes Corrupted When Files Are Copied over the Network

Copying large files over the network is not safe; data can be corrupted at any time, especially over the Internet. So, always use WinRAR in Normal compression mode to compress large files and then copy the RAR file over the network. The RAR file maintains a CRC and checks the accuracy of the original file while decompressing. If WinRAR can decompress a file properly, you can be sure that there’s no corruption in the original file. One caution about WinRAR compression modes: do not use the Best compression mode. Always use Normal compression mode. We have seen large files get corrupted with the Best compression mode.
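The same idea, verifying a checksum after the copy, can be applied without an archive. The Python sketch below is not what we did at Pageflakes; it swaps WinRAR’s built-in CRC check for a SHA-256 digest computed on both ends of the transfer. The function names are illustrative.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large backups do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source_path, copied_path):
    """Return True if both files have identical SHA-256 digests."""
    return sha256_of_file(source_path) == sha256_of_file(copied_path)
```

In practice you would compute the digest on the source server, copy the file, compute the digest again on the destination, and compare the two hex strings before trusting the copy.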

The Production Database Was Accidentally Deleted

In the early stages, we did not have professional sysadmins taking care of our servers. We, the developers, used to take care of our servers ourselves. And it was disastrous when one of us accidentally deleted the production database thinking it was a backup database. It was his turn to clean up space from our backup server, so he went to the backup server using remote desktop and logged into SQL Server using the “sa” user name and password. Because he needed to free up some space, he deleted the large “Pageflakes” database. SQL Server warned him that the database was in use, but he never reads alerts with an “OK” button and so clicked OK. We were doomed.

There are some critical lessons to learn from this:

• Don’t become too comfortable with the servers. Take it seriously when working on remote desktop, even though it can be routine, monotonous work.

• Use a different password for every machine. All databases had the same “sa” password. If we had used different passwords, then at least while typing the password he would have seen where he was connecting to. Although this guy connected to the remote desktop on a maintenance server, from SQL Server Management Studio he connected to the primary database server just as he did last time. SQL Server Management Studio remembered the last machine name and user name. So, all he had to do was enter the password, hit Enter, and delete the database. Now that we have learned our lesson, we put the server’s name inside the password. So, while typing the password, we know consciously what server we are going to connect to.

• Don’t ignore confirmation dialogs on remote desktops as you do on your local machine. Nowadays, we consider ourselves to be super experts on everything and never read any confirmation dialog. I myself don’t remember the last time I took a confirmation dialog seriously. This attitude must change when working on servers. SQL Server tried its best to inform him that the database was being used, but, as he does a hundred times a day on his laptop, he clicked OK without reading the confirmation dialog.

• Don’t put the same administrator password on all servers. Although this makes life easier when copying files from one server to another, don’t do it. You will accidentally delete a file on another server (just like we used to do).

• Do not use the administrator user account to do your day-to-day work. We started using a power user account for daily operations, which limits access to a couple of folders only. Using the administrator account on the remote desktop opens doors to all sorts of accidents. If you use a restricted account, there’s limited possibility of such accidents.

• Always have someone beside you when you work on the production server and are doing something important like cleaning up free space, running scripts, restoring a database, etc. And make sure the other person is not taking a nap!
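The lesson about typing the server’s name can also be built into maintenance scripts. The following Python sketch is a hypothetical guard, not something from our tooling: a destructive operation runs only if the operator types the exact name of the server it will hit. The `execute` callable stands in for whatever actually drops the database, and the server names in the usage note are made up.

```python
def confirm_destructive_action(target_server, typed_name):
    """Return True only if the operator typed the target server's name."""
    return typed_name.strip().lower() == target_server.strip().lower()


def drop_database(target_server, database, typed_name, execute):
    """Run a destructive operation only after an explicit, server-specific confirmation.

    `execute` is a hypothetical callable taking (server, database); it is the
    only place where anything destructive would happen.
    """
    if not confirm_destructive_action(target_server, typed_name):
        raise PermissionError(
            "confirmation %r does not match target server %r; nothing was dropped"
            % (typed_name, target_server)
        )
    execute(target_server, database)
```

With this in place, an operator who believes he is on `BACKUP01` but is actually connected to `PROD-DB01` types the wrong name and the call is refused, instead of succeeding behind a generic OK button.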

The Hosting Service Formatted the Running Production Server

We told the support technician to format Server A, but he formatted Server B. Unfortunately, Server B was our production database server that ran the whole site.

We had log shipping in place, and there was a standby database server. We brought it online immediately, changed the connection string in all web.configs, and went live in 10 minutes. We lost about 10 minutes’ worth of data because the last log ship from the production to the standby database did not happen.

From now on, when we ask the support crew to do something on a server, we remotely log in to that server and eject the CD-ROM drive. We then ask the support crew to go to that box and see whether there is an open CD-ROM drive. This way we can be sure the support crew is on the right server. Another idea is to leave a file named Format This Server.txt on a drive and inform the support crew to look for that file to identify the right server.

Windows Was Corrupted Until It Was Reinstalled

The web server’s Windows 2003 64 bit got corrupted several times. Interestingly, the database servers never got corrupted. The corruption happened mostly on servers when we had no firewall device and used the Windows firewall only. So, this must have had something to do with external attacks. The corruption also happened when we were not installing patches regularly. Those security patches from Microsoft are really important: if you don’t install them in a timely fashion, your OS will get corrupted for sure. Nowadays, we can’t run Windows 2003 64 bit without SP2.

When the OS gets corrupted, it behaves abnormally. Sometimes you will see that it’s not accepting inbound connections and this error will appear: “An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.” Other times it takes a long time to log in and log off, remote desktop stops working randomly, or Explorer.exe and the IIS process w3wp.exe frequently crash. These are all good signs that the OS is getting corrupted and it’s time for a patch installation.

We found that once the OS was corrupted, there was usually no way to install the latest patches and bring it back. At least for us, rarely did installing a patch fix the problem; 80 percent of the time we had to format and reinstall Windows and install the latest service pack and patches immediately. This always fixed these OS issues.

Patch management is something you don’t consider a high priority unless you start suffering from these problems frequently. First of all, you cannot turn on “Automatic Update and Install” on production servers. If you do, Windows will download patches, install them, and then restart itself. This means your site will go down unexpectedly. So, you always have to install patches manually: take a server out of the load balancer, patch and restart it, and put it back in the load balancer.
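The manual procedure amounts to a rolling update, one server at a time. The Python sketch below outlines that loop. Every callable passed in (`remove_from_pool`, `patch_and_restart`, `is_healthy`, `add_to_pool`) is a hypothetical hook you would implement against your own load balancer and servers; nothing here talks to a real load balancer API.

```python
def rolling_patch(servers, remove_from_pool, patch_and_restart,
                  is_healthy, add_to_pool):
    """Patch servers one at a time so the site never loses all of them at once.

    Returns (patched, failed): the servers put back in rotation, and the first
    server that failed its health check (or None if all succeeded).
    """
    patched = []
    for server in servers:
        remove_from_pool(server)      # stop sending traffic to this server
        patch_and_restart(server)     # install patches and reboot
        if not is_healthy(server):
            # Leave the broken server out of rotation and stop the rollout,
            # so a bad patch does not spread to the remaining servers.
            return patched, server
        add_to_pool(server)           # server is back and serving traffic
        patched.append(server)
    return patched, None
```

Stopping at the first unhealthy server is a deliberate choice in this sketch: if a patch breaks remote desktop or IIS on one machine, the rest keep serving the site while you investigate.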

The DNS Goes Down

DNS providers sometimes do not have a reliable DNS server. GoDaddy was our host and DNS provider. Its hosting was fine, but the DNS hosting was really poor both in terms of DNS resolution time and availability; it went down seven times in two years. When the DNS dies, your site goes down for all new users and for a majority of the existing users who do not have the DNS result cached in a local browser or the ISP’s DNS server.

When visitors come to your site, the request first goes to a DNS server to get the IP of the domain. So, when the DNS server is down, the IP is unavailable and the site becomes unreachable.

Some DNS hosting companies only do DNS hosting, e.g., NeuStar, DNS Park, and GoDaddy. You should use commercial DNS hosting instead of relying on a domain registration company for a complete package. However, NeuStar was given a negative review in a DNSstuff test. Apparently, NeuStar’s DNS hosting has a single point of failure, which means both the primary and secondary DNS were actually the same server. If that’s true, then it’s very risky. Sometimes this report is given when the DNS server is behind a load balancer, and only the load balancer’s IP is available from the public Internet. This is not a bad thing; in fact, it’s better to have a load balancer distributing traffic to DNS servers. So, when you consider a DNS hosting service, test its DNS servers using DNSstuff and ensure you get a positive report on all aspects. However, if the DNS servers are under load balancers, which means multiple DNS servers are serving a shared IP, then DNSstuff will report negative. It will see the same IP for both the primary and secondary DNS even if they are on different boxes.

When choosing a DNS provider, make sure:

• The primary and secondary DNS servers each resolve to a different IP. If you get the same IP, make sure it’s a load balancer’s IP.

• The different IPs are actually different physical computers. The only way to do it is to check with the service provider.

• DNS resolution takes less than 300 ms outside the U.S. and about 100 ms inside the U.S., if your DNS provider is in the U.S. You can use external tools like DNSstuff to test it.
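The first and last checks in this list can be roughly approximated with a few lines of Python. This is a sketch, not a replacement for an external tool like DNSstuff: it resolves the name server host names through your local resolver, so timings depend on where you run it and on caching. The `resolve` parameter is injectable so the logic can be exercised without network access; the function names and the name server host names in the usage note are placeholders.

```python
import socket
import time


def nameserver_ips(nameservers, resolve=socket.gethostbyname):
    """Map each name server host name to the IPv4 address it resolves to."""
    return {name: resolve(name) for name in nameservers}


def has_shared_ip(nameservers, resolve=socket.gethostbyname):
    """Return True if two or more name servers resolve to the same IP."""
    ips = nameserver_ips(nameservers, resolve)
    return len(set(ips.values())) < len(ips)


def timed_lookup_ms(hostname, resolve=socket.gethostbyname):
    """Return how long one lookup took, in milliseconds, from this machine."""
    start = time.perf_counter()
    resolve(hostname)
    return (time.perf_counter() - start) * 1000.0
```

For example, `has_shared_ip(["ns1.example.com", "ns2.example.com"])` returning True is the cue to ask the provider whether that shared address is a load balancer or a single box.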

The Internet Backbone Goes Down in Different Parts of the World

Internet backbones connect different countries’ Internet together. They are the information superhighway that spans the oceans connecting continents and countries. For example, UUNET is an Internet backbone that covers the U.S. and connects with other countries (see Figure).

There are some other Internet backbone companies, including BT, AT&T, Sprint Nextel, France Télécom, Reliance Communications, VSNL, BSNL, Teleglobe (now a division of VSNL International), FLAG Telecom (now a division of Reliance Communications), TeliaSonera, Qwest, Level 3 Communications, AOL, and SAVVIS.

All hosting companies are either directly or indirectly connected to an Internet backbone. Some hosting providers have connectivity to multiple Internet backbones.

The UUNET Internet backbone covers the U.S. and connects with other countries

At an early stage, we used an inexpensive hosting provider that had connectivity with one Internet backbone only. One day, the connectivity between the U.S. and London went down on a part of the backbone. London was the entry point to the whole of Europe. So, all of Europe and a part of Asia could not reach our server in the U.S. This was a very rare sort of bad luck. Our hosting provider happened to be on the segment of the backbone that was defective. As a result, all web sites hosted by that hosting provider were unavailable for one day to Europe and some parts of Asia.

So, when you choose a hosting provider, make sure it has connectivity with multiple backbones, does not share bandwidth with telecom companies, and does not host online gaming servers. Both telecom companies and gaming servers have a very high bandwidth requirement. Generally, hosting providers provide you a quota of 1,000 GB per month, but not all companies require that much bandwidth, so multiple companies share a connection to the Internet backbone. Thus, if you and a gaming server are on the same pipe, the gaming server will occupy so much of the shared connection that your site’s bandwidth will be limited and you will suffer from network congestion.

A tracert can reveal important information about a hosting provider’s Internet backbone. The figure shows very good connectivity between a hosting provider and the Internet backbone.

Tracert to the Pageflakes server from Bangladesh

The tracert is taken from Bangladesh connecting to a server in Washington, D.C. Some good characteristics about this tracert are:

• Bangladesh and the U.S. are in two different parts of the world, but there are only nine hops, which is very good. This means the hosting provider has chosen a very good Internet backbone and has intelligent routing capability to decide the best hops between different countries.

• There are only three hops from pccwbtn.net to the firewall. Also, the delay between these hops is 1 ms or less. This proves PCCW Global has very good connectivity with the Internet backbone.

• There’s only one backbone company, which is PCCW Global, so it has a direct connection with the backbone and there’s no intermediate connectivity.

Example of a tracert showing bad connectivity

Some interesting characteristics about this tracert include:

• A total of 16 hops and 305 ms latency, compared to 266 ms in the previous example. Therefore, the hosting provider’s network connectivity is bad.

• There are two providers: PCCW Global and Cogent. This means the hosting provider does not have connectivity with tier-1 providers like PCCW Global. It goes via another provider to save money, which introduces additional latency and another point of failure.

• Two hosting providers were connected to Cogent and both of them had latency and intermittent connectivity problems.

• There are four hops from the backbone to the web server. This means there are multiple gateways or firewalls. Both are a sign of poor network design.

• There are too many hops on cogentco.com itself, which is an indication of poor backbone connectivity because traffic is going between several networks to reach the destination web server.

• Traffic goes through five different network segments: 63.218.x.x, 130.117.x.x, 154.54.x.x, 38.20.x.x, and XX.41.191.x. This is a sign of poor routing capability within the backbone.
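The characteristics listed for both tracerts (hop count, end-to-end latency, number of network segments crossed) can be pulled out of a trace programmatically. The Python sketch below assumes you have already parsed the tracert output into `(ip_address, latency_ms)` tuples, and it treats the first two octets of an IPv4 address as the "segment", which mirrors the x.x notation above but is only a rough heuristic. The function name and any sample addresses you feed it are illustrative.

```python
def summarize_route(hops):
    """Summarize a traced route.

    `hops` is a list of (ip_address, latency_ms) tuples in path order.
    Returns the hop count, the distinct first-two-octet segments in the
    order first seen, and the latency reported at the final hop.
    """
    segments = []
    for ip_address, _latency in hops:
        prefix = ".".join(ip_address.split(".")[:2])
        if prefix not in segments:
            segments.append(prefix)
    return {
        "hop_count": len(hops),
        "segments": segments,
        "final_latency_ms": hops[-1][1] if hops else None,
    }
```

Comparing the summaries of traces taken from several countries gives a quick, if crude, way to spot a provider whose traffic bounces through many networks before reaching the server.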


