Archive for the ‘ Troubleshooting ’ Category

I document quite a few tutorials and self-made howtos on our intranet site, so I decided to share this one with everyone. It is mainly from the Proxmox wiki, but the last few parts are from around the net and my own experience. As you can see in the Proxmox wiki, some things were added later that fixed my problem, but I had to reinstall the packages from scratch.

By the way

dpkg --get-selections | grep <whatever>
dpkg --purge <package name>
is a great time saver when trying to fix what mistakes were made.
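
For example, to hunt down and purge a half-installed ZFS package (the package names here are just an illustration):

dpkg --get-selections | grep zfs
dpkg --purge zfs-dkms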

Feel free to email us if there is something wrong or not necessary.

 

Source:

http://pve.proxmox.com/wiki/ZFS#Native_ZFS_for_Linux_on_Proxmox_2.0

Basically, watch which kernel headers you have and make a symbolic link that allows certain packages to build properly.

Install instructions:
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys F6B0FC61
aptitude update

Be wary of which kernel you are running with this one. Change as necessary.

ln -s /lib/modules/2.6.32-11-pve/build /lib/modules/2.6.32-11-pve/source
aptitude install pve-headers-2.6.32-11-pve
aptitude install dkms pve-headers-$(uname -r)
aptitude install ubuntu-zfs
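
By the way, if you'd rather not hard-code the kernel version in that symlink above, this should do the same thing for whatever kernel you are currently running (assuming its headers are installed):

ln -s /lib/modules/$(uname -r)/build /lib/modules/$(uname -r)/source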

I didn’t seem to need this one, but I do it anyway since I’m paranoid.
aptitude install spl-dkms

Just in case the kernel gets upgraded, do this:
aptitude install dkms pve-headers-$(uname -r)
then this to rebuild the packages:
aptitude reinstall spl-dkms zfs-dkms
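
To confirm the modules actually rebuilt against the new kernel, a quick check:

dkms status

It should list spl and zfs as built and installed for the running kernel.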

Now let’s get it to start up with the system as well. Add the following to /etc/rc.local:

#Mount ZFS storage
/usr/local/sbin/zfs-fuse
/usr/local/sbin/zfs mount -a

When the system reboots, the directory that ZFS uses as a mountpoint tends to get mounted before ZFS gets a chance at it, since the zfs mount command is only called from /etc/rc.local. I want to keep things as pristine as possible, so that this all keeps working even if an update changes things a bit, so I will leave it as it is. The only alternative I can think of is to make the commands part of an /etc/init.d script and place it before filesystem mounting, or better yet before whatever Proxmox runs to cause this; a rough sketch of such a script follows.
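
For what it’s worth, here is a rough, untested sketch of what such an init script might look like. The X-Start-Before target and the ordering against whatever Proxmox runs at boot are assumptions that would need checking; the two commands are the same ones from the rc.local snippet above.

#!/bin/sh
### BEGIN INIT INFO
# Provides:          zfs-early-mount
# Required-Start:
# Required-Stop:
# X-Start-Before:    mountall
# Default-Start:     S
# Default-Stop:
# Short-Description: Mount ZFS datasets before local filesystems
### END INIT INFO
case "$1" in
  start)
    /usr/local/sbin/zfs-fuse
    /usr/local/sbin/zfs mount -a
    ;;
  stop)
    /usr/local/sbin/zfs umount -a
    ;;
esac
exit 0

Register it with insserv (or update-rc.d on older setups) so it runs in rcS before local filesystems are mounted.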
I believe the issue comes from what we set in the “Storage” tab, so I left that alone and will not bother it. I ran the following, which creates a symlink to the new ZFS mountpoint, which I changed with the command below.

Change the ZFS mountpoint (no need to create the directory beforehand):
zfs set mountpoint=/mnt/datapool0/vm datapool0

Mount it:
zfs mount -a

Then run this to remove the vm subdirectory and create a link to the correct place:
rm -rf /datapool0/vm; ln -s /mnt/datapool0/vm /datapool0/vm
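
Before rebooting, it doesn’t hurt to verify both the mountpoint property and the symlink:

zfs get mountpoint datapool0
ls -l /datapool0/vm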

Reboot and test. This was a quick fix and still works after reboots. Essentially, just create some BS directory, then symlink it to the real one afterward.

 

***UPDATE 08/05/2012***
I had an issue with the backups using snapshot mode instead of suspend. Suspend causes downtime, so I researched the error.

Undefined subroutine &PVE::Storage::cluster_lock_storage called at /usr/share/perl5/PVE/VZDump/QemuServer.pm line 240

Found this post:
http://forum.proxmox.com/archive/index.php/t-10438.html

I feared that this might mess up the current install of ZFS and its dependencies, given that we needed the kernel headers before. So I ran the suggested fix, and it all came up after a reboot.

"aptitude update && aptitude full-upgrade" then reboot
The backups are working as expected using snapshot and not suspend.

This storage server has been a back-and-forth ordeal, and I was kicking myself profusely trying to find what the hell had gone awry here. After spending more money :( getting another HBA for the SCSI enclosure, I found that the card doesn’t work for my purpose, which was a serious downer. I just ordered the PCI-X riser for the 2950 so I can use some of the old cards I have lying around. In the meantime I started getting errors about memory and the CPU from the LCD on the front…WTF!!!! I just got these procs (at a bargain) and one is bad. Sorry to say I got them from two different vendors, so the packaging is mixed up and scattered around my work space. It was confirmed when I used memtest and the box froze completely after the first pass. I took out one proc, reran it, then left it running. I will check on it later, but that’s crazy.

This might explain the I/O Woes issue with the storage array, meaning that the first RAID card may not have been an issue after all. Hmmmm…DAMMIT!!! Well, I shall move forward and replace the proc after I run a multi-pass memtest to make sure the current proc is still good.

UPDATE:

Well, a little after I wrote the first portion of this, I found that the 8GB of memory I had may have been the issue, but after testing further I got an error showing that one of the DIMM slots may have been the problem. So I reseated everything and tested again after deciding to re-purpose the box, and oddly enough it passed 4 passes with no issues in memtest nor on the Dell LCD. Oh well, I don’t care to make this a storage server anymore, so it will do as is, and since it passed the memtest I guess it’s good for a dev box. :)

I/O Woes :(

I think I am suffering from bottleneck issues. What I am witnessing is downright dismal performance from our server. I have dealt with ZFS in the past, and these are by far the worst performance numbers I have seen. With that being said, I know it’s a configuration issue, and that’s why we test these sorts of things in our development rack. The ZFS filesystem and architecture can handle what I’m doing with ease, so the issue here has to be drive placement and RAID configuration. In my previous post you will see that in the 24-bay SATA enclosure there are drives with white labels. These are part of a RAID 1 array that serves as the OS drive shown in the BIOS. All drives in the enclosure are connected to an HP SAS Expander that connects to an Adaptec 5445 in the main chassis.

Now, when performing an rsync or a disk usage check (du), I get slow responses and ultimately unresponsiveness from the box. I can’t even log in after a while. The most recent issue was that after a reboot, a service was in maintenance mode and caused the startup procedure to halt entirely and drop to a maintenance shell. I eventually got it going, but this is odd. I do see that one of the drives in the Pink array is throwing a S.M.A.R.T. error, but I don’t see it failing yet. It will be replaced soon. I believe I am seeing some serious I/O bottlenecks from this system’s current config. I ordered the following to add to the mix:

2x quad-core 2.33GHz SLAEJ processors
8GB ECC FB-DIMM PC2-5300
The final 146GB 15k SAS drive for the Yellow array
Finally, a Chenbro 12803 SAS expander (I have been looking for this!)
2x 32GB SSDs (YAY!) for ZIL/L2ARC

I may have to reformat once these parts are added and things are moved around, just for peace of mind. I want to use the SSDs to get some sort of increase in IOPS and test how much they improve performance. So far I’m seeing 9 to 32 MB/s in the 146GB 10k U320 RAIDZ2 array (ORANGE) and 20 to 40 MB/s in the 500GB 7.2K SATA array (PINK). I will add the 1.5TB 7.2K SATA drives (YELLOW label, PINK dots) later, once I destroy the array they were in.
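
When chasing numbers like these, watching per-device throughput straight from ZFS helps pin down which disks or vdevs are dragging. Substitute your own pool name; datapool0 is just an example:

zpool iostat -v datapool0 5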

I know that ZFS can perform better than this, and I will make it better once I figure out the best configuration or the reason for the poor performance. For shits and giggles I may just put the OS on its own set of SSDs and velcro them to the case (don’t judge me!) so I can utilize the internal onboard SATA connectors, thus ensuring that the OS partition has its own personal bandwidth highway, I guess.

All in all the experience is great, since I am trying to learn OpenSolaris more. This is exactly why we have a dev rack to test this on. I love it!

So we got Exchange 2007 migrated over to 2010. We still have to shut down the old server and migrate the SPAM filter box over to the new VM host, but those shouldn’t be too hard (crossing fingers). We used two sources of reference for the migration, and they were a great help. The first was really all we needed, since we utilize a single-server deployment. The second is great because it has a knowledge base connected to it that can also be referenced. So now we’re up to date and will deprecate the old Exchange server tomorrow night. This is part of our plan to upgrade many core components in our business to help us learn and grow. So far so good, but that’s what happens when you’ve got two guys who plan things out all the time. :)

(Patting ourselves on the back)

Exchange Migration in 2Hrs

MS Exchange 2010 Deployment Assistant

OLD SCHOOL! | NEW WITH THAT CANDY PAINT!

UPDATE 1:

So everything looked good until I started uninstalling Exchange 2007 from the old server. I decided to follow the tutorial a little backwards and transfer the FSMO roles at the end. The issue I witnessed was not users being unable to log into webmail, which was one of my first fears during this. The behavior I got was with Outlook accessing the user mailbox. The change-over of the back-end server showed up, sure enough, under the user account setup, since we are using Outlook Anywhere, formerly RPC over HTTP. When the user logs in with Outlook, they are prompted for the password, but the client shows Disconnected, and when trying to just make a new profile I get the following error:

“The connection to the Microsoft Exchange Server is unavailable. Outlook must be online or connected to complete this action.”

From the link you see a few fixes that point to RPC issues and GPO pushes. I didn’t really read through it because I was looking for something a bit quicker. I ran the system health check command “Test-SystemHealth” via the Exchange Management Shell and got a warning that the Microsoft Exchange System Attendant service was not running. I researched it and started the service. Then I used a technique I learned a while back: check for services set to “Automatic” but not running. Sure enough, the following services were not running.

  • Microsoft Exchange Edge Sync
  • Microsoft Exchange RPC Client Access <-- looks like we’re on to something here
  • Microsoft Exchange Information Store <-- kind of a big deal, but it was starting up at the time
  • Microsoft Exchange Service Host
  • Microsoft Exchange System Attendant <-- seems like this service is a dependency of others

So I restarted them manually where needed, and the Outlook users were able to connect. Since I haven’t fully transferred the roles from the other server, I will attribute this to that. My reasoning is to make sure the new server can run without the old server, which it has, then transfer the roles accordingly, which I am doing now. Then test again by restarting and ensuring those services are running.
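
For reference, the “set to Automatic but not running” check can be done in one shot from PowerShell via WMI instead of eyeballing services.msc. This is just how I’d sketch it, not something from the tutorial:

Get-WmiObject Win32_Service -Filter "StartMode='Auto' AND State<>'Running'" | Select-Object Name, State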

Currently using this article to transfer over the roles, which is super easy and it has pictures :): Transferring FSMO Roles in Windows Server 2008

Update 2:
So after transferring roles and shutting down the old server without removing it from the domain, the same behavior is occurring. So I think to myself, “I wonder if there are any updates for Exchange 2010?” Guess what: two service packs have been released, and this did not show up in the Update tool at all, nor did the Test-SystemHealth command find it. Ugh. Because of limited bandwidth to my colo space, I have to wait 2 hours for each SP download. I may not need both, but I’m going to get them both anyway, just in case. I should have just searched for updates myself.

Update 3:
OK, I think we’re good. I updated to Exchange 2010 SP2, which should have been done in the first place, but I didn’t even know it was released. I had to search for it, since MS Update didn’t notify me like it usually does for SPs. So, more downtime and restarts. Everything was smooth until those services weren’t starting again. Either way, it’s good now after the update, and I also set the services to Automatic (Delayed Start), because at one point I saw that the Information Store was still starting while those same services were stopping. This is why I opted to have them start later on. Now they start, and my Outlook users are able to connect without an issue. The old server was also deprecated and removed as a DC. Now I have to take it out completely, which shouldn’t be hard…why did I say that :( Well, here’s to hoping.

The following services were set to Automatic (Delayed Start); a scriptable way to do this is shown after the list.

  • Microsoft Exchange Edge Sync
  • Microsoft Exchange RPC Client Access
  • Microsoft Exchange Forms-Based Authentication
  • Microsoft Exchange Service Host
  • Microsoft Exchange Protected Service Host
  • Microsoft Exchange System Attendant
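
For the record, the Delayed Start change can also be scripted with sc instead of clicking through services.msc, something like this per service (the short service name below is an example; note the required space after “start=”):

sc config MSExchangeServiceHost start= delayed-auto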

Update 4:
DONE & DONE. We have finalized the move and deprecated the old Exchange 2007 VM using the tutorial above, with some of the service tweaks listed. Basically it was easy; the issues I ran into were just due to updates being applied and me taking my sweet ass time getting rid of the old VM. Though I personally like to call it being cautious. Either way, everything still works, and the old VM host has been deprecated as well and will be removed soon. Phew! Well, on to the next project.

There will always be pain :( For the latest project in getting the rack back to its former computing glory, we need a new and revitalized storage server. We PLANNED to use Openfiler to provide Fibre Channel targets to our servers. Well, before we could even do that, we had to install all the new parts and cables, then configure the RAID card with some of the 1.5TB drives we had lying around. And here comes the first issue.

When trying to boot, the system can’t get past the RAID card’s kernel loading phase and just sits there. After troubleshooting, researching, and reloading the firmware using a USB floppy disk drive (yeah, I went there), no freak’n luck. Then I decided to remove the cables. It works! So I added the drives back to the system, connected the cables, and rebooted. We’re back to square one. DAMN! I know this card, an Adaptec 31605, can utilize these 1.5s, because they used to be in a RAID 6 volume on the very same card. Then I thought about what I had to do to get them to work. The card has the latest firmware, but guess what: the sweet ass Dell 2950 Gen II system I have doesn’t have the latest BIOS or backplane firmware :( So I guess we’re using 500GB drives. Since I don’t feel like updating the firmware using Win-blows, and I didn’t find a floppy/flash upgrade option, I decided to continue on with the build by installing Openfiler.

Aighty, now we’re cooking. We got the system installed, LACP configured, we’re pinging in and out…yeah, we’re good, baby. OK, let’s take a look around here and get FC targets configured. We drop to the command shell and check out the forums on how the hell to do this…QLogic? Nah, I got an Emulex PCI-E card…what, there is no support for Emulex…in Openfiler…I’M GOING TO BED!

Well, what can we do now? I looked around, and I have used ZFS before, as you may know from some of my previous posts. I had looked into OpenSolaris but didn’t feel like learning a new OS. WELL, guess what: since I’m not buying a QLogic card right now and I want this bad boy up and storing stuff, I guess it’s high time to crack open a book or two. Yeah, I know reading a book is not the shortest route, but think of the benefits. Plus, fuck Windows. So I snagged a copy of OpenIndiana and the OpenSolaris Bible, and after I read the first page I started to write this entry. I’m so focused right now (sarcasm). I will get through it and practice as I go. If anything, this is good for me. I will learn an enterprise OS environment just in case I need it in the future. Now I gotta learn a new OS’s networking setup all over again. Also, I installed the server edition…NO GUI, BABY!!!!!!

So we got a computer to reformat and load up with software, and we have been itching to do this for a bit. We wanted to test this out on our own machines first, but this is the perfect opportunity. We start by installing Windows and all the necessary programs, running updates, transferring files over, running a few programs for the first time to get through some of the BS, and checking to make sure it all looks good. Since this was the first time, I misjudged the size of Windows a bit and didn’t give the utility partition enough pre-allocated space during the Windows XP install process. After everything, it was about 11 GB used. Booted into Ubuntu and resized it real quick without a problem. Booted into Windows to make sure, then began the Ubuntu Desktop 10.10 64-bit installer.

During the installer I chose the side-by-side installation and made my own partition setup, giving our UTILITY PARTITION (sorry, it just sounds so cool) about 16 GB of space and the swap about a gig. The installation process is going on now. Afterward we updated and installed a few programs via a ninite.com installer package. It’s the best way to quickly install packages that are always up to date; it makes the AVG and OpenOffice installs so much easier. Aight, so the computer is about ready to image. We put certain tools that were installed, like Recuva, TeamViewer, and TrueCrypt (just in case), in a folder called MJNS Tools (branding). Finished it off by scheduling some tasks, like a weekly scan and a system defragment, to keep the customer protected and happy, then topped it all off with a defrag with Auslogics Disk Defrag before imaging.

OK, so we’re finished, and the Ubuntu OS install looks to have gone well. Procedure included in this post. The boot menu shows up, and I have to change the timing a bit. Boot time is fast, even for being located at the end of the disk. Began installing updates and the tools we need to make this a really useful partition, as noted in a previous post. Compiled the list of needed programs and made it into a one-liner apt-get to make it easier later on while updating. Looked around a bit using the “df -h” command and found that the space I thought was enough…was not. The fresh install of Ubuntu has already taken up about 3.7 GB, leaving a little over 11 GB for the recovery image. The Windows OS takes up 12.6 GB. I decided to continue on, though, because the partitions can always be resized.

Read up on the compression methods for fsarchiver and chose level 7, which is lzma-1. This will be the default for now. Began the long, arduous, painstaking, time-con…WTF, IT’S DONE ALREADY! Damn, we just started playing Donkey Kong Country Returns too. So it took less than about 45 minutes to image and compress 12.6 GB into a nice 5.6 GB package. While doing this on another machine, we compressed 7.9 GB into a 3.3 GB file, which makes it 41.77% of the original size, a 58.23% compression ratio. This is great, because it still leaves space for us to possibly save data on the utility partition, restore the Windows partition, then copy the data back, all without a single reboot. This will be major, especially for those customers who want us to do everything on site.
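For reference, the save step looks roughly like this (the archive path and source device are examples from this box, and -z 7 is the lzma level mentioned above):

fsarchiver savefs -v -z 7 /media/utility/images/windows.fsa /dev/sda1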

Once we were imaged and backed up, we decided to do a little test. We mount the Windows partition from the utility partition and delete the almighty WINDOWS folder…I ain’t scared!…and reboot. Well, what do you know, Windows doesn’t work. “Oh, whatever shall I do,” cries the customer. The MJNS Computer Super Heroes jump in to save the day. “Why not back up your information and restore it, fair maiden, using our utility partition?”

We began the restore, which took under 15 minutes. Again, WTF, IT’S DONE ALREADY! We mount it, look around, see no problem, then reboot. Afterward it does a check disk on the file system and boots right into Windows. Our heroes high-five each other and leave the customer satisfied with the quick level of service we provide to all those who are protected by MJNS. YAY, we’re awesome!!!!!
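
The restore side is just as quick. Assuming the same example paths as above, and that the Windows partition is the first filesystem (id=0) in the archive:

fsarchiver restfs /media/utility/images/windows.fsa id=0,dest=/dev/sda1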

That’s pretty much it; we pack ’em up and get them to the customer for setup. We are really not doing anything special, and this takes more time than just installing Windows. That being said, we have been through a lot when working on a customer’s computer, and we know the hassle of not knowing the problem for sure, or needing better tools but lacking an Ubuntu Live CD. The added benefit of quick backups and OS recovery that beats the pants off any HP, Dell, or Acer recovery partition is great. It takes time now, but the next time they bring that PC in with some virus issue, it’s back out the door in no time flat.

Instructions:

MJNS Reformat Procedure

fsarchiver-commands

Update 9/10 Latest ninite file:

MJNS_Ninite_Installer

We are in the planning stages of rolling out our own line of custom-built systems. Developing this computer-building division will help us gain a few more clients and offer a one-stop shop for customers. Affordable systems for home users and businesses…blah blah blah.

OK, now let’s get to the cool stuff. My business partner presented the idea that we put our own logo up as the BIOS splash screen. A novel idea, I said. Then I thought of how I have always wanted a cool utility partition to do diagnostic work or disaster recovery of files. You know, instead of breaking out the Ubuntu CD/USB toolkit for Memtest86+, or downloading programs like chntpw and ddrescue only to have it all disappear at reboot. Which sucks, especially if you weren’t finished with the process. Most power users dual boot to get the best of both worlds: speed, open source, and of course compatibility.

Basically, the idea is to install Ubuntu Desktop on another partition of 13 to 15 GB, depending on the Windows partition’s image compression. Yeah, I know that sounds like a lot, but here is what will be on it. A boot screen with our logo that gives you the choice of Windows, MJNS Utility Partition (Ubuntu Desktop 10.10), and Memtest86+. Now, Memtest alone is great to have, because I have found memory to be a major issue in a few customer systems. On the utility partition we will have a slew of tools, which are listed below along with a one-liner to install most of them. The most useful among them are partimage, gddrescue, photorec, ClamAV, TeamViewer, and chntpw. Then add the fact that the partition is a full OS with browser, office suite, and more.

We, along with most other IT shops, use Ubuntu to diagnose issues and recover data. It seems like a no-brainer, of course. Simply one of those things you think of and then say, “WHY DIDN’T I THINK OF THIS SOONER!” We plan to roll this out not only on custom-built systems, but on any computer we reformat. The ability to restore the Windows partition on the fly makes the turnaround time for reformats shorter than the usual 2 to 3 days, a time frame which is sometimes hard to meet depending on our load, while other times easy. I mean, jeeeezzzzz, it just makes sense.

Tools:

  • Partimage (Image/Restore partition for quick reformats)
  • Teamviewer (remote controlled disaster recovery)
  • ClamAV with ClamTK GUI (others can be added of course)
  • CHNTPW (OMG I forgot my password….Again!)
  • NTFS-3g/NTFS Progs (with ntfsundelete)
  • Recovery programs (photorec,foremost,gddrescue)
  • GParted
  • Nmap/Whois
  • Bonnie++ (Disk Benchmarking)
  • SSH Server
  • Samba (For quick shares and drag/drop files)
  • Filezilla (getting files from our support server)
  • Google Chrome (’cause it’s cool)
  • Customized welcome music, possibly a video and our site as the home page…..BRANDING!
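
By the way, the apt-get one-liner mentioned above looks something like this (package names are my best guess for Ubuntu 10.10; TeamViewer and Google Chrome come from their own installers, and photorec ships inside the testdisk package):

sudo apt-get install partimage gddrescue testdisk foremost clamav clamtk chntpw ntfs-3g ntfsprogs gparted nmap whois bonnie++ openssh-server samba filezilla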

Most of these recovery/boot/utility partitions are closed and offer little to nothing as far as functionality; this will be far more robust. Testing begins soon. I spoke to a friend about this, and he mentioned that HP acquired a small open-source distro and plans to implement something like this in their new systems. I may be bugged or something, but it’s all good when it’s open source.

While writing about 2TB of data from a customer’s external drive, an internal drive failed but was revitalized upon restart. Most likely it’s a case of an I/O timeout or a failing drive. NO problem! Even with the pool degraded, since it is configured as a RAIDZ2 (RAID-6 equivalent) pool, quite a bit of data was still written to the array, sustaining speeds typical of normal operations. We informed the customer of the drive failure and assured them no data was lost. Rebooted to make sure everything came up, and guess who decided to join the party again. The drive came up, and ZFS began the resilvering (resyncing/rebuilding) process. It was estimated to take 10 hours, but to my glee and surprise it finished in 4 hours 13 minutes with 251 GB of data resilvered. We will still order a new drive on Monday to be sure, but for now we will begin a scrub on the array to make sure no data corruption occurred. The zpool scrub should take a few hours as well, but we can wait. It’s better than having a corrupted set of data.
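
For anyone following along, kicking off the scrub and checking resilver/scrub progress is a two-command affair (the pool name here is just an example, substitute your own):

zpool scrub datapool0
zpool status -v datapool0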

Update when it asks

When Java, Flash, or Adobe Reader prompts you for an update, you should do it. Recently a customer needed malware removed. Even with the latest OS and up-to-date antivirus, the system was still infected due to an out-of-date Java plugin. It was minor and thankfully easy to remedy, but it could have been avoided by updating the software on a regular basis, or at least when it prompts you to. Out-of-date software produces vulnerabilities that lead to malware and widespread infections. Sooo, yeah, update when it says so, OK? If you’re unsure, post a question here on our Facebook page. An opening like this can destroy a network over time, sometimes in no time at all. We run through system updates on a regular basis to make sure we are in good standing, but even with our best efforts we are occasionally victims ourselves.

Heed its warning

 

 

Recently, Monday in fact, one of our customers had an issue where a drive in their RAID 5 array began to FAIL and caused data corruption across the array. While working with it in the lab, another drive failed while we were trying to back up data……….YEAH, I said that too. My partner and I looked at each other with the almighty “HOLY CRAP!” face and went to buy another drive. Luckily the customer had a backup of their patient database that wasn’t fully corrupted, and after 3 days the server was back up and their medical records database was restored to a working state, thankfully. Serious disaster recovery with a happy ending. We will be working with them closely, since we took over this account with the old backup solution in place. By the way, Symantec’s Backup Exec 11d SUCKS! We spent hours figuring out the V-79-57344-65072 error about being unable to communicate. It was simply a matter of restoring the old database on a new system and choosing to store it somewhere else, not in the old location. This caused hours of frustration for us, and getting support on the issue from Symantec would have cost $250. Of course, the rep hung up on me when I asked to be pointed to a more detailed Knowledge Base article on the matter. AWESOME! Either way, I stayed up till 3:30 AM figuring it out, and my partner handled the rest of it the next day, reconnecting the database with the software’s support team. Either way, the customer is happy and so are we. :D SWEET DEAL!