Converting the ONTAP Simulator to work in VirtualBox

This is a topic that comes up after just about every ONTAP release. The ONTAP simulator is only supported on VMware hypervisors, but with a little effort it can run in VirtualBox. Simulate ONTAP, also known as the ONTAP simulator or the vsim, is distributed as a virtual appliance in OVA format, but VirtualBox fails to import the OVA. Instead, it throws some variation of this error:

The underlying problem is that VirtualBox doesn’t connect IDE devices to the correct IDE ports unless they are presented in just the right order within the OVF XML file. This can be overcome by modifying the XML to avoid the bug in VirtualBox, but that is tedious. There is another way: VirtualBox has a nice ‘vboxmanage’ command line interface that can be used to rebuild the VM and then export it to a VirtualBox-friendly OVA file. It is this more scriptable approach that will be used here.

Obtaining the Simulator

The simulator is available to NetApp customers, most partners, and employees. It is not available to guest accounts. Guest accounts can access the ONTAP Select evaluation as an alternative.

For ONTAP 9.7, the download is available from the beta support site at:

Previous versions can be downloaded from the tool chest at:

Scripting the OVA conversion

This example script is in bash, and uses the ONTAP 9.7 version of the simulator. It runs as-is on OS X, but some adjustments may be needed for other operating systems or other versions of the simulator.

Grab the complete script from this GitHub repo, then follow along with the rest of the blog.

Let’s start by defining some variables. These will need to be adjusted if you are working with a different version of the simulator. name is the base name of the OVA file, without the extension. memory is the amount of RAM to assign to the VM. The VMware version only needs about 5GB, but on VirtualBox a bit more RAM is needed. The IDExx variables hold the names of the corresponding VMDK files within the OVA archive.
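
A sketch of that variable block, assuming the 9.7 file naming conventions (the names here are illustrative; take the real VMDK names from the .ovf file inside your OVA):

```shell
# Illustrative values; adjust to match your simulator version.
name="vsim-netapp-DOT9.7-cm_nodar"    # base name of the OVA, without the extension
memory=8192                           # RAM in MB; VirtualBox wants a bit more than VMware's ~5GB
IDE00="${name}-disk1.vmdk"            # IDE port 0, device 0
IDE01="${name}-disk2.vmdk"            # IDE port 0, device 1
IDE10="${name}-disk3.vmdk"            # IDE port 1, device 0
IDE11="${name}-disk4.vmdk"            # IDE port 1, device 1
```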


Next, do a little sanity check to make sure the vboxmanage command is available:

if [ -z "$(which vboxmanage)" ]; then echo "vboxmanage not found"; exit 1; fi

Now we can extract the contents of the OVA, which is just a tar file with some specific formatting constraints.

tar -xvf "$name".ova

Some older versions of the simulator also gzip the VMDK files, so if you are working with an older version be sure to decompress the VMDK files as well.
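
For those older releases, a small loop handles it (a sketch; it assumes the gzipped disks landed in the current directory with a .vmdk.gz extension):

```shell
# Decompress any gzipped VMDKs extracted from older simulator OVAs
for f in *.vmdk.gz; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    gunzip "$f"
done
```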

Now use vboxmanage to create a new VM:

vboxmanage createvm --name "$name" --ostype "FreeBSD_64" --register
vboxmanage modifyvm "$name" --ioapic on 
vboxmanage modifyvm "$name" --vram 16
vboxmanage modifyvm "$name" --cpus 2
vboxmanage modifyvm "$name" --memory "$memory"

Next add the NICs. Here the internal network, intnet, is used because it makes the conversion predictably scriptable. When the final OVA is re-imported, adjust the networking to suit your needs.

vboxmanage modifyvm "$name" --nic1 intnet --nictype1 82545EM --cableconnected1 on
vboxmanage modifyvm "$name" --nic2 intnet --nictype2 82545EM --cableconnected2 on
vboxmanage modifyvm "$name" --nic3 intnet --nictype3 82545EM --cableconnected3 on
vboxmanage modifyvm "$name" --nic4 intnet --nictype4 82545EM --cableconnected4 on

Next add two serial ports. These are not needed on VMware, but ONTAP may hang on VirtualBox if they are not present.

vboxmanage modifyvm "$name" --uart1 0x3F8 4
vboxmanage modifyvm "$name" --uart2 0x2F8 3

The VirtualBox BIOS enumerates disks a little differently, so to maintain the original device order presented to the VM, add an empty virtual floppy drive.

vboxmanage storagectl "$name" --name floppy --add floppy --controller I82078 --portcount 1 
vboxmanage storageattach "$name" --storagectl floppy --device 0 --medium emptydrive

And now we can add the IDE controller and attach those VMDK files.

vboxmanage storagectl "$name" --name IDE    --add ide    --controller PIIX4  --portcount 2
vboxmanage storageattach "$name" --storagectl IDE --port 0 --device 0 --type hdd --medium "$IDE00"
vboxmanage storageattach "$name" --storagectl IDE --port 0 --device 1 --type hdd --medium "$IDE01"
vboxmanage storageattach "$name" --storagectl IDE --port 1 --device 0 --type hdd --medium "$IDE10"
vboxmanage storageattach "$name" --storagectl IDE --port 1 --device 1 --type hdd --medium "$IDE11"

Now that the VM is built, vboxmanage can export it to an OVA file that VirtualBox can understand.

vboxmanage export "$name" -o "$name"-vbox.ova

And finally, remove the temporary VM from VirtualBox.

vboxmanage unregistervm "$name" --delete

The end result should be a -vbox.ova that can be cleanly imported into VirtualBox. But there are a couple of considerations when running the simulator on this platform.

First, this isn’t VMware, so the VMware tools service will throw errors at startup. That can be avoided by setting a variable at the loader prompt that disables that service.

setenv bootarg.vm.run_vmtools false

Second, the NICs are not enumerated in the order one would expect. Here is the mapping of VirtualBox network adapters to ONTAP ports:

Adapter 1: e0d
Adapter 2: e0a
Adapter 3: e0b
Adapter 4: e0c

And finally, beware the NAT network. ONTAP has multiple Ethernet ports that expect to be able to communicate with each other over the network, but the VirtualBox NAT network isolates each port in its own private broadcast domain. As a result, the NAT network will not work.

Aside from that you should now be able to run a completely unsupported instance of the ONTAP Simulator in VirtualBox.

VMware NIC order may change after SuperMicro BIOS updates

I encountered this issue over the holidays while doing some firmware and BIOS updates in the lab. A couple of my hosts are based on SuperMicro Xeon-D boards from the X10SDV line. These systems have 2 x 1GbE and 2 x 10GbE ports, and the original NIC order enumerated the 1GbE ports first.

After updating to the latest BIOS (2.1), one of my ESX hosts did not come back online. I could see from the IPMI console that the system had booted, but it was not responding to pings. When I checked the management NICs, I discovered the order had changed.

I had to reconfigure my vSwitch uplinks to accommodate the new NIC order, starting by re-assigning the management uplink in the DCUI so I could get back into the host and fix everything else. I don’t know why one system re-ordered the NICs and the other did not, but I am now left with two otherwise identical hosts with different network uplink topologies. That is a mystery for another day, but if you are running SuperMicro-based VMware hosts, proceed with caution.

Deploying ONTAP Select on KVM (on a NUC)

In my last post I went through the process of getting KVM installed and installing the ONTAP Deploy VM. Deploying ONTAP Select is mostly a matter of stepping through a nice wizard, but I will have to make one adjustment in the Swagger interface to deploy it on the NUC. Everything here could be done with RESTful API calls, but unfamiliar things are easier to learn in a GUI.

After logging into a fresh install of the deploy utility you land at this workflow. If you bought a license this is where you would upload the license file. I don’t have any licenses to add, so I’ll run it as an eval cluster. Click Next.

The next step is to add the hypervisor hosts to the inventory, which in this case is just my KVM box. Fill in the form, click add, and wait for it to show up in the list on the right. Next.

This page defines the ONTAP Select cluster. In this example, it’s a single-node cluster running 9.6 on KVM. Fill in the form and click Done.

Done doesn’t really mean done; it just advances to node setup. Under Licenses, pick Evaluation Mode, then fill out the hypervisor particulars.

Undersized hosts like the NUC may not appear in the Hosts drop list. The node can still be assigned to a host from the CLI by connecting to the deploy instance over SSH:

(ONTAPdeploy) node modify -cluster-name otskvm -name otskvm-01 -host-name

Under storage, pick the storage pool from the drop list and assign some of its capacity to ONTAP. Don’t try to assign the entire capacity of the storage pool: ONTAP Select needs about 266GB for its system disks, which is not included in the information presented on this panel. Also, to use a license type other than evaluation, the storage pool capacity needs to be set to at least 1TB. Factoring in the system disks, the storage pool needs about 1.3TB available to accept a licensed instance of ONTAP Select. Here I am deploying in eval mode and only assigning 500GB.
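
To make that arithmetic explicit (numbers from this paragraph, in a quick bit of shell):

```shell
system_disks_gb=266     # consumed by the ONTAP Select system disks
licensed_min_gb=1024    # minimum 1TB of pool capacity for a non-eval license
echo "$(( system_disks_gb + licensed_min_gb ))GB needed"   # prints "1290GB needed", roughly the 1.3TB above
```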
To move on, click Done.

The Next button will become enabled, and the final fields before ‘Create Cluster’ are the cluster admin password. If you are on a host with 6 cores, click Create Cluster and you’re done. If you are following along on a quad-core host like a NUC, we need to use the Swagger interface to change a setting that is not exposed in the GUI.

When deploy creates the VM, it will reserve a full 4 cores’ worth of CPU. That makes for a VM with optimal performance, but on a host that only has 4 cores we need to dial it back a bit. Note that this should not normally be done in production. If you need to do this in production, check in with your account team first to make sure your scenario can be supported.

To access the swagger interface, select “API Documentation” from the help menu. This is where you can access all of the API documentation and try out API calls along the way.

In the swagger interface, scroll down toward the bottom and expand the clusters section.

Find the GET /clusters section, and click “Try it out!”

Record the cluster’s id. It becomes an input on the next API call. Scroll down to GET /clusters/{cluster_id}/nodes. Fill in the cluster ID from the first API call, and click “Try it out!”. The output returned will have the id of the node.

Now that we have both the cluster id, and the node id, we can adjust the reservations setting on the node. Scroll on down to:
PATCH /clusters/{cluster_id}/nodes/{node_id}

Fill in the cluster id, the node id, and the changes shown here. Valid values for cpu_reservation_percentage are 25, 50, 75, and 100, with 100 being the default.
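
The change itself is a small JSON request body. A sketch of what it might look like (the field name is taken from the valid-values note above; 50 is a reasonable choice for a quad-core host):

```json
{
  "cpu_reservation_percentage": 50
}
```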

Once again click “Try it out!”, but this time look for {} in the response body and a response code of 200.

Now switch back to the deploy GUI, pick a cluster admin password, and click create cluster. It will take a while to deploy, but eventually should end in a successful deployment:

It will take several minutes for the cluster’s ONTAP System Manager web interface to become available on the cluster management IP you specified. Be patient and remember to connect over https. There is even a link to it on the clusters tab of the ONTAP deploy UI.

Once you have access to the ONTAP system manager, provisioning storage services is the same as it is in any other ONTAP system. For a walkthrough of setting up CIFS services, see this post.

Installing ONTAP Select Deploy on KVM (on a NUC)

In a previous series of posts I built an ESX host on a NUC and used it to run ONTAP Select. This time around I’ll do it on KVM. This is one of those ‘prove it actually works’ posts, because I keep hearing it doesn’t work. That may have been true at one time, but with a quarterly release cadence this is a product that evolves fairly quickly. This post will cover installing KVM and the ONTAP Deploy utility; the next post will cover the actual ONTAP Select deployment.

ONTAP Select is supported on KVM, so this is mostly just a matter of following the instructions, but the NUC platform brings a few challenges. It only has 4 cores and 1 NIC, which, just like on VMware, is a little below the documented system requirements. Unlike VMware, there is no “standalone eval” image. This time I’ll build it the proper way, using the ONTAP Deploy utility VM. But first, I need to get KVM up and running.

The hardware specifications are the same as the VMware build:

NUC8i5BEH, (4 cores, 8 threads)
512GB NVME drive
1TB SSD drive
Note: To deploy a licensed instance of ONTAP Select a 2TB SSD would be needed.

For this build I chose CentOS 7.6 and these install options, based entirely on personal preference:
Server with GUI
+ Virtualization Client
+ Virtualization Hypervisor
+ Virtualization Tools
+ System Administration Tools

I installed CentOS to the NVME drive, saving the SATA SSD for later.

During setup I created a local account called ‘admin’ and set the password for root.

Later in the process I will be creating a bridge for openvswitch and adding the sole 1GbE NIC, which will drop wired network connectivity to the KVM host. To be able to do that work over SSH, I will use the WiFi adapter for host management and assign the wired interface a link-local-only address.

Time to build KVM. Start by opening an SSH session into the host as admin, and switch to root:


Next use yum to install all the dependencies:

yum install -y qemu-kvm libvirt openvswitch virt-install lshw lsscsi lsof

If openvswitch is missing from the repo, you can either build it from source or grab it from the community build service. This post is long enough without a build-from-source detour, so I’ll grab it from the CBS.

yum install openvswitch-2.7.3-1.1fc27.el7.x86_64.rpm

Next create a storage pool using that data SSD, which on this platform is /dev/sda:

virsh pool-define-as select_pool logical --source-dev /dev/sda --target=/dev/select_pool
virsh pool-build select_pool
virsh pool-start select_pool
virsh pool-autostart select_pool

Now set up openvswitch:

systemctl start openvswitch
systemctl enable openvswitch
ovs-vsctl add-br br0
ifdown eno1
ovs-vsctl add-port br0 eno1
ifup eno1

Set the queue length rules required by ONTAP Select:

echo 'SUBSYSTEM=="net", ACTION=="add", KERNEL=="ontapn*", ATTR{tx_queue_len}="5000"' > /etc/udev/rules.d/99-ontaptxqueuelen.rules
cat /etc/udev/rules.d/99-ontaptxqueuelen.rules

That’s it for KVM. Now for the ONTAP Deploy VM. ONTAP Deploy is part deployment utility, part HA mediator, and part license server. It is the standard supported way to deploy ONTAP Select, regardless of the hypervisor. Deploy does not have to run on the same host as Select; one Deploy instance can manage about 100 instances of Select in an enterprise environment.

A raw image is available for running the Deploy VM on KVM, which you can get from the evaluation section of the NetApp support site, or you can get a 90-day eval here: . Start by downloading the ONTAPdeploy raw.tgz file to your local machine and copying it over with scp:

scp ~/Downloads/ONTAPdeploy2.12.1.raw.tgz admin@

And now back over on the SSH session to the KVM host, extract the tgz:

cd /home/admin
tar -xzvf ONTAPdeploy2.12.1.raw.tgz

Give it a home:

mkdir /home/ontap
mv ONTAPdeploy.raw /home/ontap

And use virt-install to build a VM around it:

virt-install --name=ontapdeploy --vcpus=2 --ram=4096 --os-type=linux --controller=scsi,model=virtio-scsi --disk path=/home/ontap/ONTAPdeploy.raw,device=disk,bus=scsi,format=raw --network "type=bridge,source=br0,model=virtio,virtualport_type=openvswitch" --console=pty --import --wait 0

Set it to autostart:

virsh autostart ontapdeploy

Next use the virsh console to complete the VM’s setup script:

virsh console ontapdeploy

The setup script will look something like this:

Connected to domain ontapdeploy
Escape character is ^]
That does not appear to be a valid hostname
Host name            : ontapdeploy
Use DHCP to set networking information? [n]: n
Net mask             :
Gateway              :
Primary DNS address  :
Secondary DNS address: 
Please enter in all search domains separated by spaces (can be left blank):
Selected IP           :
Selected net mask     :
Selected gateway      :
Selected primary DNS  :
Selected secondary DNS: 
Search domains        : 
Calculated network    :
Calculated broadcast  :
Are these values correct? [y]: y
Applying network configuration. Please wait...
Continuing system startup. Please wait...
Debian GNU/Linux 9 ontapdeploy ttyS0
ontapdeploy login:

The GUI should be available now over https on the specified address. The default credentials are:
username: admin
password: admin123
Log in once now to change the default password, and the system will be ready to deploy ONTAP Select.

Creating a new Active Directory Forest with Ansible

Building new AD forests isn’t something most of us do often enough to need to automate it, but recently I was talking to a good friend and a fellow homelabber who needed to provision some new domains in his lab. I do this a lot because every lab environment I build gets its own AD forest. When I told him I’d been automating it with Ansible he suggested I write it up for the blog.

This playbook creates a new domain in a new forest from a freshly provisioned VM, like the one built in my previous post on building windows VMs with Ansible.

The beginning of the playbook defines all the variables needed to provision the new AD Forest. In practice I keep them in a vars file, but to simplify the example playbook I put them in-line.

- name: Create new Active-Directory Domain & Forest
  hosts: localhost
  gather_facts: no
  vars:
    dc_netmask_cidr: 24
    dc_hostname: 'dc01'
    domain_name: "demo.lab"
    local_admin: '.\administrator'
    temp_password: 'Changeme!'
    dc_password: 'P@ssw0rd'
    recovery_password: 'P@ssw0rd'
    reverse_dns_zone: ""
    ntp_servers: ","
  tasks:

Part of the process of preparing this VM to become a domain controller involves setting a static IP, changing its hostname, and changing its password, so I use Ansible’s dynamic inventory rather than a static inventory file.

First I add it to inventory using the VM’s original IP and password:

  - name: Add host to Ansible inventory
    add_host:
      name: '{{ temp_address }}'
      ansible_user: '{{ local_admin }}'
      ansible_password: '{{ temp_password }}'
      ansible_connection: winrm
      ansible_winrm_transport: ntlm
      ansible_winrm_server_cert_validation: ignore
      ansible_winrm_port: 5986
  - name: Wait for system to become reachable over WinRM
    wait_for_connection:
      timeout: 900
    delegate_to: '{{ temp_address }}'

Next set the static IP. There is no dedicated Windows Ansible module for this task, so it uses win_shell, which in turn runs the command under PowerShell.

  - name: Set static IP address
    win_shell: "(new-netipaddress -InterfaceAlias Ethernet0 -IPAddress {{ dc_address }} -prefixlength {{dc_netmask_cidr}} -defaultgateway {{ dc_gateway }})"
    delegate_to: '{{ temp_address }}'  
    ignore_errors: True 

This task will always return failed, because once the IP changes Ansible can’t check the result of the command. Just set ignore_errors: true and let it time out. Next, add the host back into inventory under its new IP address:

  - name: Add host to Ansible inventory with new IP
    add_host:
      name: '{{ dc_address }}'
      ansible_user: '{{ local_admin }}'
      ansible_password: '{{ temp_password }}'
      ansible_connection: winrm
      ansible_winrm_transport: ntlm
      ansible_winrm_server_cert_validation: ignore
      ansible_winrm_port: 5986
  - name: Wait for system to become reachable over WinRM
    wait_for_connection:
      timeout: 900
    delegate_to: '{{ dc_address }}'

Next set the local administrator password. This password will become the domain admin password later when the system is promoted to a domain controller.

  - name: Set Password
    win_user:
      name: administrator
      password: "{{ dc_password }}"
      state: present
    delegate_to: '{{ dc_address }}'
    ignore_errors: True

Once again re-add it to inventory using its new IP address:

  - name: Add host to Ansible inventory with new Password
    add_host:
      name: '{{ dc_address }}'
      ansible_user: '{{ local_admin }}'
      ansible_password: '{{ dc_password }}'
      ansible_connection: winrm
      ansible_winrm_transport: ntlm
      ansible_winrm_server_cert_validation: ignore
      ansible_winrm_port: 5986
  - name: Wait for system to become reachable over WinRM
    wait_for_connection:
      timeout: 900
    delegate_to: '{{ dc_address }}'

Next set the upstream DNS servers. These will become the DNS forwarders once the AD integrated DNS server is installed.

  - name: Set upstream DNS servers
    win_dns_client:
      adapter_names: '*'
      dns_servers:
      - '{{ upstream_dns_1 }}'
      - '{{ upstream_dns_2 }}'
    delegate_to: '{{ dc_address }}'

Next set the upstream NTP servers. Domain controllers should reference an authoritative time source.

  - name: Stop the time service
    win_service:
      name: w32time
      state: stopped
    delegate_to: '{{ dc_address }}'
  - name: Set NTP Servers
    win_shell: 'w32tm /config /syncfromflags:manual /manualpeerlist:"{{ ntp_servers }}"'
    delegate_to: '{{ dc_address }}'
  - name: Start the time service
    win_service:
      name: w32time
      state: started
    delegate_to: '{{ dc_address }}'

Now, before proceeding, disable the Windows firewall. Otherwise the domain firewall policy will prevent later tasks from succeeding after the system reboots. You can re-enable it and set rules to your liking once the playbook is complete.

  - name: Disable firewall for Domain, Public and Private profiles
    win_firewall:
      state: disabled
      profiles:
      - Domain
      - Private
      - Public
    tags: disable_firewall
    delegate_to: '{{ dc_address }}'

Before promoting a system to a DC, you should set its hostname. It’s much simpler to rename it before it becomes a domain controller. These tasks update the hostname, and reboot if required.

  - name: Change the hostname
    win_hostname:
      name: '{{ dc_hostname }}'
    register: res
    delegate_to: '{{ dc_address }}'
  - name: Reboot
    win_reboot:
    when: res.reboot_required
    delegate_to: '{{ dc_address }}'

Now you are ready to install Active Directory and create the domain.

  - name: Install Active Directory
    win_feature: >
       name=AD-Domain-Services
       include_management_tools=yes
       include_sub_features=yes
       state=present
    register: result
    delegate_to: '{{ dc_address }}'
  - name: Create Domain
    win_domain: >
       dns_domain_name='{{ domain_name }}'
       safe_mode_password='{{ recovery_password }}'
    register: ad
    delegate_to: "{{ dc_address }}"
  - name: reboot server
    win_reboot:
      msg: "Installing AD. Rebooting..."
      pre_reboot_delay: 15
    when: ad.changed
    delegate_to: "{{ dc_address }}"

Once the system reboots there are a few little cleanup tasks. First, domain controllers should use themselves as the DNS server. This should get set during the DC promotion, but I like to make sure it gets set:

  - name: Set internal DNS server
    win_dns_client:
      adapter_names: '*'
      dns_servers:
      - ''
    delegate_to: '{{ dc_address }}'

Next create the reverse lookup zone for the local subnet. The forward lookup zones get created automatically, but the reverse zones do not. Note the retries on this step: at this point in the process the system has just rebooted after becoming a domain controller, and it takes a while for it to really be ready to continue.

  - name: Create reverse DNS zone
    win_shell: "Add-DnsServerPrimaryZone -NetworkID {{reverse_dns_zone}} -ReplicationScope Forest"
    delegate_to: "{{ dc_address }}"    
    retries: 30
    delay: 60
    register: result           
    until: result is succeeded

And the final step in my process is to make sure RDP is enabled so I can remote in and do any one-off customizations:

  - name: Check for xRemoteDesktopAdmin Powershell module
    win_psmodule:
      name: xRemoteDesktopAdmin
      state: present
    delegate_to: "{{ dc_address }}"
  - name: Enable Remote Desktop
    win_dsc:
      resource_name: xRemoteDesktopAdmin
      Ensure: present
      UserAuthentication: NonSecure
    delegate_to: "{{ dc_address }}"

That’s the process, end to end, from newly installed Windows Server to newly provisioned Active Directory forest. The complete playbook is in the examples repo on my GitHub.

Running an ONTAP Select eval cluster on a NUC

In this post, I’ll give a little overview of ONTAP Select, how to get a free eval copy, and how to deploy it on a NUC or other small lab host. I’ll also go through some getting started steps to take the ONTAP Select instance through deployment and on to serving data. This builds on a recent post that covered the install of ESXi on the NUC and turns it into a storage appliance running ONTAP.

About ONTAP Select

ONTAP Select is the ONTAP operating system, running in a Virtual Machine on an ESXi or KVM host. This is the same ONTAP operating system that runs on NetApp FAS and AFF engineered systems, and in the major cloud providers as Cloud Volumes ONTAP. ONTAP can run just about anywhere, but the accessibility of ONTAP Select makes it a great platform for running ONTAP in the homelab. You can use it to learn ONTAP, try out new releases, or just to add some feature rich storage to your lab.

System Requirements

What I need:
According to the documentation, to run a single node ONTAP Select cluster my VMware host needs:
– 2 x 1GbE NICs
– Six physical cores or greater, with four reserved for ONTAP Select
– 24GB or greater with 16GB reserved for ONTAP Select.
– A hardware raid controller or enough internal SSDs to enable software raid

What I have:
– 1 x 1GbE NIC
– 4 physical cores
– 64GB RAM
– 1xNVME drive + 1xSATA SSD drive

So the NUC doesn’t quite meet the documented system requirements, but I’ll make it work anyway.

Obtaining ONTAP Select

ONTAP Select has a free 90-day trial available at the following link:
If you have an existing account on the NetApp support site, you can download it from the evaluation section of the support site. If you need to create a guest account, follow the instructions on the 90-day trial link to get access. Once logged in you’ll find it in the product evaluation section, or just follow this link:
The version I’ll be installing is the Standalone Eval OVA:

Just to be clear, this is not the way a licensed version would be installed. A licensed version would be installed using the ONTAP Deploy utility OVA, which is part deployment tool, part HA mediator, and part license manager. It is possible to install a properly licensed ONTAP Select instance on a NUC, but that is a topic for another day. Today is about having some fun with the Standalone Eval version.

Deploying ONTAP Select

Now that we have the OVA file downloaded, we can deploy it like any other OVA. From a resource standpoint, the VM requires 4 vCPUs, 16GB of RAM, and 300GB+ of disk space, which is just within reach of a small platform like the Intel NUC.

The datastore really should be a local SSD or NVME based datastore for performance reasons, so I will be deploying it to the internal NVME drive on my NUC. The VM will need about 302GB for a thick provisioned deployment.

Connect it to your VM Network, and choose the default deployment type, “ONTAP Select Evaluation – Small”.

The additional settings page contains the IPs and hostname that will be used to create the cluster.

Clustername: ONTAP Select clusters can contain 1, 2, 4, or 8 nodes. This field specifies the name of the cluster, not the underlying node(s).
Data Disk Size for ONTAP Select: This is the size of the virtual disk that will be used to store user data. The default for the eval is 100GB, but you can increase it if more space is available.
Cluster Management IP address: This is the primary management IP for the cluster, regardless of the number of nodes it contains.
Node Management IP address: This IP is used to manage the individual node.
Netmask and Gateway: Set these to match your VM Network subnet.
Node Name: Each node in an ONTAP cluster is assigned a unique name. This cluster will only contain one node, so give it a name.
Administrative Password: This sets the initial password for the cluster’s ‘admin’ account. Use at least a mix of letters and numbers. The deployment may fail if the password is too simple.

Continue on with deployment, then open up the VM console and wait for the login prompt:

You won’t typically use the VM console window after the initial deployment. Instead you would either SSH to the cluster management IP, or log into the ONTAP System Manager GUI in a browser.

Accessing the ONTAP System Manager GUI

After a few minutes the System Manager login should be available on the Cluster Management IP address. By default, the interface is only available via HTTPS. Login as user admin, with the password specified in the OVA deployment workflow.

Once logged in to the ONTAP System Manager GUI, you’ll be at the cluster’s dashboard page:

Preparing to serve data

ONTAP clusters are a platform for running Storage Virtual Machines (SVMs, also known as vservers). The SVMs provide the actual user-facing data services like CIFS, NFS, iSCSI, etc. Before we can create an SVM, we need a data aggregate. To use a virtualization frame of reference: if SVMs are analogous to VMs, data aggregates are analogous to datastores. On larger systems, data aggregates are an ‘aggregate’ of one or more RAID groups. In the case of this little ONTAP Select instance, the data aggregate will be a single virtual disk in RAID 0.

Navigate to Storage->Aggregates & Disks->Aggregates, then click Create:

At this point there will only be one disk available, so just give the aggregate a name. In this example I called it aggr1. Then click Submit.

Adding more storage

My NUC has a SATA SSD in addition to the NVME drive where I deployed the ONTAP Select VM. I could make a datastore on that SSD, put a large VMDK on that datastore, and attach it to the ONTAP Select VM. But I actually prefer to just RDM the whole disk. That takes a little CLI work on the ESX host.

After enabling SSH on my ESX host and logging in, I can find the SATA drive’s device identifier:

And use vmkfstools to create a passthru RDM for that device:
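
Those two steps look roughly like this on the ESXi shell (the device identifier and datastore path here are hypothetical; substitute the ones you find on your own host):

```shell
# List the disks the host can see; the SATA SSD's identifier will be in here
ls /vmfs/devices/disks/

# Create a passthru (physical compatibility) RDM pointer for that device
vmkfstools -z /vmfs/devices/disks/t10.ATA_____Samsung_SSD_860_EVO_1TB \
    /vmfs/volumes/nvme-ds/rdm/sata-ssd-rdm.vmdk
```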

I can then attach that VMDK to my ONTAP Select VM and use it to create another data aggregate. Edit the VM, add a hard disk, and pick “Existing Hard Disk”. Browse to the RDM disk and add it to the VM.

That disk will show up as an unassigned disk in ONTAP, which I can later assign to my node and use to create another data aggregate.

Navigate to Storage->Aggregates & Disks->Disks, select the unassigned disk from the list and click assign.

In the Assign Disks dialogue, click Assign. Then the disk will be available in the aggregate create workflow.

Setting Reservations

The ONTAP Select VM needs some reservations to protect it from other VMs that may be running on the host. In a production deployment, 100% of CPU and RAM would be reserved, but on this tiny platform that is not feasible. We can and should reserve 100% of the RAM, and at least ~25% of the CPU. This host has ~8GHz available over 4 cores, so I’ll set my CPU reservation at 2000MHz and my memory reservation at 16GB.

Treating it like an appliance

Since I’ll be treating this host like a home lab storage appliance, I will set this VM to start and stop with the host. Enable autostart, make this VM start first, and set the shutdown behaviour to ‘shut down’ to allow ONTAP to shut down gracefully.

Passing VLANs to ONTAP

For ONTAP Select to support VLANs like the hardware appliances do, it needs to be attached to a VMware port group assigned to VLAN 4095. This configures the port group as a VLAN trunk and allows VMs to handle VLAN tags on their own. VMware calls this configuration “Virtual Guest Tagging”, or VGT. If you want that, configure a port group as shown below and connect the Select VM’s NICs to that port group.
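
The same port group can be sketched from the ESXi shell with esxcli (the port group and vSwitch names here are hypothetical):

```shell
# Create a port group on the standard vSwitch, then tag it with VLAN 4095 (trunk/VGT)
esxcli network vswitch standard portgroup add --portgroup-name=VM_Trunk --vswitch-name=vSwitch0
esxcli network vswitch standard portgroup set --portgroup-name=VM_Trunk --vlan-id=4095
```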

Creating a Storage Virtual Machine

If you make SVMs all the time you might want to skip ahead to the conclusion; otherwise, read on for a walkthrough of provisioning an SVM.
Start by navigating to Storage->SVMs, and click create:

On the first page of the SVM Setup wizard, specify the SVM name, which protocols to enable, and the DNS domain and name servers required to join Active Directory. It is possible to run CIFS in workgroup mode, but that feature is only available at the command line. For this example I’ll just enable CIFS, join AD, and create my first share.

On the next page of the wizard provide the CIFS configuration details, starting with the IP address to use for CIFS access. In the Assign IP address drop list, select “Without a subnet”, then fill out the IP information in the Add Details box. Then click OK.

This creates a logical interface, which needs to be assigned to a port. Ports in ONTAP are named e0a, e0b, e0c, and so on. Click Browse next to the Port: box and pick e0a.

Next fill in the CIFS server details. The CIFS Server Name is the name of the computer account it will create in Active Directory, and the remaining fields are your Active Directory details. You also have the option of creating an initial CIFS share as part of the SVM setup wizard.

On the Admin details page of the wizard, enter a password for the vsadmin account. Each SVM has its own administrative account that can be used for delegation, or integration with other applications.

Click submit and continue, then OK on the final confirmation page, and your new SVM will be created, along with that initial share.

This was a deliberately simple example. To learn more about ONTAP and ONTAP Select, see the resources available on


The end result of this little lab adventure is an Intel NUC or similar small ESX host, configured to act as an appliance running a single node ONTAP Select cluster, with a couple of RAID0 data aggregates, support for CIFS, NFS and iSCSI, and all the data management features you would get in an enterprise class storage system. It may only last for 90 days, but that’s long enough to learn how to use SnapMirror. With storage efficiencies applied, the results can be pretty impressive. Here is a screenshot from one filled with nested lab VMs.

I have covered the steps to build this lab box interactively, but I’ll revisit this in a future post and replace all these tasks with Ansible so I can spin up a new lab box just by running a playbook. After all, this build has several single points of failure so I’ll need something to SnapMirror to for data protection.

Anatomy of a Virtual Lab Environment

Virtual Labs are everywhere. VMware has HOL (Hands on labs), Microsoft has Hands-on Labs, Cisco has dCloud, NetApp has labondemand, and on and on. They’re great for making complete lab environments available for demos, training, and study. But how do they work, and how can they be scaled down to run in a homelab?

Virtual Labs generally have a few things in common. They have isolated network(s) internal to the lab, they contain a collection of pre-configured VMs, and they are accessed via some sort of jump host. Virtual labs are typically cloned into multiple instances, with every lab instance containing an identical set of VMs and networks.

In this simple example, each lab instance has an identical set of VMs, with identical IP addresses. Each lab’s gateway connects it to the transit network, and lab users connect to their lab instance through a remote display protocol.

For this scheme to work, each lab needs an isolated internal network. In fact the VMs within these labs should be completely identical, down to the mac addresses of their NICs. There are lots of ways this could be accomplished, with VXLAN and NSX at the top of the list, but those are heavyweight solutions at homelab scale. Instead I’ll take a simpler approach, and just use portgroups on an isolated vSwitch to achieve network isolation between lab instances.

Here is a diagram of what those 3 lab instances might look like from a vSwitch perspective:

Each lab instance’s internal lab network is backed by an individual port group, with a unique vlan assignment. A virtual router acts as the lab gateway, with the router’s LAN port connected to the instance’s network, and the router’s WAN port connected to the VM Network. The router provides NAT to the lab instance, and a simple RDP port forward to the jump host facilitates remote access to the lab environment.
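As a sketch of how those per-instance port groups might be created from the ESXi shell — the vSwitch name “vSwitchLabs” and the VLAN IDs 101–103 are assumptions for illustration:

```shell
# Create an isolated vSwitch (attach no uplinks) and one port group per lab instance
esxcli network vswitch standard add --vswitch-name=vSwitchLabs
for i in 1 2 3; do
  esxcli network vswitch standard portgroup add \
    --portgroup-name="VirtualLab-Instance$i" --vswitch-name=vSwitchLabs
  esxcli network vswitch standard portgroup set \
    --portgroup-name="VirtualLab-Instance$i" --vlan-id=$((100 + i))
done
```

With no uplinks on the vSwitch, the unique VLAN IDs keep each instance’s traffic isolated even though every instance uses identical IP and MAC addresses.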

One caveat of this strategy is that lab instances cannot span hosts. If an individual instance needed to span hosts, the networks would need to be VLAN backed, or provisioned with an overlay technology. In practice there are other reasons to keep the VMs of a given instance on the same host, so this isn’t really a limitation, but it does mean there needs to be a way to group these VMs so they always run on the same host. I use two VMware features to accomplish this: vApps and affinity rules.

Here I’ve grouped my lab instance VMs into vApp containers.

vApps can also be cloned using the vCenter UI, providing an easy way to provision more instances. Many of my lab topologies are too complex to survive vApp cloning, but it’s a good way to get started. If you pre-provision a network for the new lab clone, you can map the VMs to that network as part of the New vApp wizard.

Next I can create an affinity rule to keep all the child VMs of that vApp on the same host. But since I have cloned my vApps, all of the child VM names are the same, and the vCenter UI for creating affinity rules cannot distinguish one from another. In this case, it’s much easier to just create the rule with a little snippet of PowerCLI:

New-DrsRule -Cluster "Lab Cluster" -Name "VirtualLab-Instance1" -KeepTogether $true -VM (get-vapp "VirtualLab-Instance1" | get-vm)

So far I’ve covered the general anatomy of a virtual lab, with an emphasis on the networking aspects, and an approach to implementing them on a small scale suitable for a home lab. There is a lot more to cover on this topic. The configuration of the virtual router serving as the gateway, strategies for configuring VMs to survive this kind of cloning, ways to optimize active directory for cloning and long term storage, and strategies for automated provisioning are all important topics. I also have a project on my github with my virtual lab automation and provisioning portal, if you want to see how I really do things in my home lab. It’s a perpetual work in progress, but for now I’ll leave off with a screenshot of the dashboard.

Running ESXi 6.7 on a Bean Canyon Intel NUC NUC8i5BEH.

With 4 cores, 8 logical CPUs, and up to 64 GB of RAM, the 8th generation i5 and i7 Intel NUCs make nice little home lab virtualization hosts. This week I rebuilt one of mine and documented the build process.


In terms of components, there is not much to it. The NUC includes everything except RAM and storage on the motherboard. The components I chose for this build are listed below.

  • NUC8i5BEH
  • 64GB (2x32GB) SoDIMM (M471A4G43MB1)
  • A 32GB USB stick for the ESXi boot disk
  • Local Storage: (optional)
    • Samsung 970 PRO NVMe M.2 drive, 512GB
    • Samsung 960 EVO SSD drive, 1TB

Everything is easily accessible for installation. Loosen 4 screws to remove the bottom cover and everything can be assembled in minutes.

Bios settings:

Next, boot into the BIOS and update it if needed. The BIOS hotkey is F2. If the NUC doesn’t detect a monitor at boot the video out may not work, so plug in and turn on the monitor before powering up the NUC. I have already updated the BIOS on this one, but it is easy to do. Just put the BIOS file on a USB stick and install it from the Visual BIOS.

There are a few BIOS settings that should be adjusted to make things go smoothly. First, to reliably boot ESXi from the USB stick, both UEFI boot and Legacy Boot should be enabled.

Next, on the boot configuration tab, enable “Boot USB devices first”:

Next head over to the Security tab and uncheck “Intel Platform Trust Technology”. The NUC doesn’t have a TPM chip, so if you don’t disable this you’ll get a persistent warning in vCenter: “TPM 2.0 device detected but a connection cannot be established.”

On the Power tab you’ll find the setting that controls what happens after a power failure. By default the NUC will stay powered off. For lab hosts I set it to ‘last state’; for appliance hosts, like my pfSense firewall, I set it to always power on.

ESXi Installation:

ESXi 6.7U1 works out of the box, with no special vibs or image customization required. There is really nothing unique to see here so I’ll skip on to configuration.

ESXi Configuration:

Once ESXi is up and running you can see the 64 GB RAM kit is working, despite the 32 GB limit in Intel’s documentation.

Because I have internal storage, I created a datastore, ‘datastore1’, on the NVMe drive. I’m saving the SATA SSD for a later project, so I am leaving it alone for now.

Next there are a few settings in ESXi that are worth pointing out. First, set the swap location to a datastore. This avoids some situations where a patch may fail to install due to lack of swap.

Similarly, the logs should be moved to a persistent location; here I’ll put them on datastore1. These settings are found in “Advanced settings” on the System tab shown above. Note that I had to pre-create the directory structure on datastore1 before applying this setting.
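For the scripting-inclined, both of these settings can also be made from the ESXi shell. This is a sketch using esxcli, assuming the datastore1 datastore from this build:

```shell
# Point host swap at a datastore so patches don't fail for lack of swap space
esxcli sched swap system set --datastore-enabled true --datastore-name datastore1
# Pre-create the log directory, then move syslog to persistent storage
mkdir -p /vmfs/volumes/datastore1/logs
esxcli system syslog config set --logdir=/vmfs/volumes/datastore1/logs
esxcli system syslog reload   # apply the new log location
```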

The next few settings are less conventional, and not recommended for production, but make life easier in the lab. I’ll explain my reasoning for each and you can decide for yourself how you like your systems configured.

First up is salting. Salting is used to deliberately break transparent page sharing (TPS) in an effort to improve the default security posture of the host. But this isn’t production, it’s a home lab. I fully expect to over-commit memory on this little host, so if I can gain any efficiency by re-enabling transparent page sharing and letting it de-dupe the ram across VMs, I’ll take it.

Next is the BlueScreenTimeout. By default, if an ESXi host panics (a PSOD), it sits on the panic screen forever, so you can diagnose the error and so the host doesn’t go back into service until you’ve had a chance to address the problem. But I run these little NUCs headless, and they don’t have IPMI or even vPro. I would have to plug in a monitor and reboot anyway to get at the console, so I would rather the host just reboot so I can access it over the network. For this setting, 0 means never reboot, and any value greater than 0 is the number of seconds to wait before rebooting. I’m going with 30 seconds:
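Both of these advanced settings can be applied from the ESXi shell as well — a sketch of the equivalent esxcli commands:

```shell
# Disable salting so transparent page sharing can de-dupe RAM across VMs again
esxcli system settings advanced set -o /Mem/ShareForceSalting -i 0
# Reboot automatically 30 seconds after a PSOD instead of waiting forever
esxcli system settings advanced set -o /Misc/BlueScreenTimeout -i 30
```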

And finally I will enable SSH and disable the resulting shell warning. I frequently connect to my lab hosts over SSH, so I prefer to leave SSH enabled. Again, this isn’t something you would do in production. It is purely a lab convenience.

For both of the TSM services, I set the policy to “Start and Stop with Host”, then start the service.

The UI will continue to warn that these services are running. This setting disables that warning:
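From the host console, the same result can be sketched with vim-cmd and an advanced setting:

```shell
# Enable (policy: start with host) and start SSH and the ESXi shell
vim-cmd hostsvc/enable_ssh
vim-cmd hostsvc/start_ssh
vim-cmd hostsvc/enable_esx_shell
vim-cmd hostsvc/start_esx_shell
# Suppress the resulting "shell is enabled" UI warning
esxcli system settings advanced set -o /UserVars/SuppressShellWarning -i 1
```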

Networking Options:

The single onboard gigabit NIC may not be enough for every lab scenario, but network connectivity is not limited to it. The Apple Thunderbolt adapter and Apple Thunderbolt NIC are fully functional:

Or you can install the USB Network driver fling and add some USB 3 gigabit NICs, with some caveats around jumbo frame support and a few other things you can read about while you download the driver.

Here’s a screenshot from a NUC I was testing with 4x 1 Gb connectivity, provided by 2 USB 3 adapters, the Apple dongles, and the onboard NIC.

10 Gb is also an option, with a working Thunderbolt 3 adapter and driver. William Lam has been testing some of the 10 Gb options.


The current generation of NUCs are surprisingly capable and configurable little ESXi lab hosts. This is how I build mine, but if you’ve got other ideas share them in the comments.

Deploying the vCenter Server Appliance OVA with Ansible

The goal of this playbook is to deploy the VCSA virtual appliance without any additional user interaction, but the strategy used here will work with any OVA that leverages OVF properties as part of its deployment.

The Ansible module that does the bulk of the work is vmware_deploy_ovf. The outline is pretty straightforward:

  - vmware_deploy_ovf:
      hostname: '{{ esxi_address }}'
      username: '{{ esxi_username }}'
      password: '{{ esxi_password }}'
      name: '{{ vcenter_hostname }}'
      ovf: '{{ vcsa_ova_file }}'
      wait_for_ip_address: true
      validate_certs: no
      inject_ovf_env: true
      properties:
        property: value
        property: value
    delegate_to: localhost

Setting ‘inject_ovf_env’ to ‘true’ will pass the properties to the VM at power on. We just have to know what properties the virtual appliance is expecting. To get those properties, we have to pull apart the OVA and examine its ovf xml file.

In the case of the VCSA, first we need to grab the OVA file. It’s located in the vcsa folder within the VCSA ISO.

Next we need to extract the OVA to get at the ovf file. OVA files are tar archives with a specific set of constraints, so anything that can extract a tar can extract an OVA.
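To illustrate the point, here is a minimal roundtrip with a stand-in file (the names are hypothetical); the same tar commands work on a real OVA:

```shell
# An OVA is a tar archive: package a (dummy) .ovf, then extract it again
mkdir -p /tmp/ova_demo && cd /tmp/ova_demo
printf '<Envelope/>\n' > appliance.ovf
tar -cf appliance.ova appliance.ovf
mkdir -p extracted && tar -xf appliance.ova -C extracted
ls extracted    # the .ovf comes back out of the archive
```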

The OVF file defines the virtual machine(s), deployment options, and properties that make up the virtual appliance. It is really just an xml file, and it can be viewed in any text editor.

The first thing to look for in a OVF file is the DeploymentOptionSection. This section is not always present, but if an OVF supports multiple deployment options they are defined here. In the case of the VCSA, there are a lot of deployment options to choose from. Most of the time for my homelab or nested lab scenarios the one I want is ‘tiny’.

This becomes the first value in the property list in the playbook:

        DeploymentOption.value: 'tiny'

Next move on to the property section of the xml:

The property name is the value of ovf:key, and the default value is defined in ovf:value. Expected values are usually explained by the <description> text, which is what the OVF deployment UI in vCenter would present to a user performing this task interactively. Notice that the value of ovf:userConfigurable is not always true; some values are not intended to be user configurable. Also, some values are not applicable to all deployment options. In the case of the VCSA, several properties are specific to upgrades, so I’ll ignore those.
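A quick way to enumerate the property keys without reading the whole XML is to grep for ovf:key. Shown here against a hypothetical miniature property section rather than the real VCSA file:

```shell
# Stand-in for a real .ovf ProductSection (the real file has many more properties)
cat > /tmp/demo.ovf <<'EOF'
<ProductSection>
  <Property ovf:key="guestinfo.cis.appliance.net.mode" ovf:value="static" ovf:userConfigurable="true"/>
  <Property ovf:key="guestinfo.cis.vmdir.password" ovf:value="" ovf:userConfigurable="true"/>
</ProductSection>
EOF
# List the property keys the appliance expects
grep -o 'ovf:key="[^"]*"' /tmp/demo.ovf | cut -d'"' -f2
```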

The VCSA has one property that has userConfigurable:false that we need to configure anyway to make the deployment fully automatic:

Currently, Ansible allows properties to be configured in the playbook regardless of the userConfigurable setting, at least when the deployment target is an ESXi host. If you ever need to flip userConfigurable to true for some property, the OVF file itself can be edited, but doing so invalidates the hashes in the .mf (manifest) file within the OVA. In this case just delete the .mf file and deploy the edited ovf file instead of the original ova file.

Here are the relevant properties extracted from the .ovf, formatted for the vmware_deploy_ovf module:

      inject_ovf_env: true
      properties:
        DeploymentOption.value: '{{ vcsa_size }}'
        guestinfo.cis.appliance.net.addr.family: 'ipv4'
        guestinfo.cis.appliance.net.mode: 'static'
        guestinfo.cis.appliance.net.addr: '{{ vcenter_address }}'
        guestinfo.cis.appliance.net.pnid: '{{ vcenter_fqdn }}'
        guestinfo.cis.appliance.net.prefix: '{{ net_prefix }}'
        guestinfo.cis.appliance.net.gateway: '{{ net_gateway }}'
        guestinfo.cis.appliance.net.dns.servers: '{{ dns_servers }}'
        guestinfo.cis.appliance.root.passwd: '{{ vcenter_password }}'
        guestinfo.cis.ceip_enabled: "False"
        guestinfo.cis.deployment.autoconfig: 'True'
        guestinfo.cis.vmdir.password: '{{ vcenter_password }}'
        domain: '{{ domain }}'
        searchpath: '{{ searchpath }}'

I’ve substituted Ansible variable names so I can define these in the vars file associated with this playbook.

Once the OVF deploys, the module can wait for an IP address before continuing. Sometimes this is good enough, but the VCSA may take another 20 minutes or so before it is actually usable. To make sure it is at least ready to take API calls before continuing, I will run vmware_about_facts once a minute until it succeeds.

  - name: Wait for vCenter
    vmware_about_facts:
      hostname: '{{ vcenter_address }}'
      username: 'administrator@vsphere.local'
      password: '{{ vcenter_password }}'
      validate_certs: no
    delegate_to: localhost
    retries: 30
    delay: 60
    register: result           
    until: result is succeeded 

In my lab this playbook takes about 20 minutes to run, and the web client takes a few additional minutes to be ready for interactive use.

The complete playbook and vars file for deploying vCenter can be found in this github repo. This was developed on vCenter 6.7U1, but new versions may bring new properties, or existing properties may be renamed. By following the process outlined in this post, you can adapt the playbook as the VCSA evolves over time, or apply this approach to automating the deployment of other virtual appliances in your environment.

How to build a Windows VM from scratch with Ansible

This should be a simple task. I have already created the answer file for an unattended installation, copied it to a virtual floppy image, and obtained a Windows installation ISO. VMware’s Ansible modules look promising, so you would think you could use vsphere_copy to transfer the .iso and .flp to a datastore, use vmware_guest to create the VM, and sit back and wait for WinRM to start responding.

Unfortunately, VMware’s modules have some coverage gaps that prevent this from working. The vsphere_copy module does not work with standalone hosts. In my greenfield deployment scenario, I am building the domain controller before I deploy vCenter, so vCenter is not available yet. But even if it was, the vmware_guest module can’t create a virtual floppy drive. That feature was in the deprecated vsphere_guest module it replaced, but was somehow lost along the way.

To overcome these limitations I had to take a different approach. I realized that I could use the vmware_host_service_manager module to enable SSH in ESXi, then treat it like a Linux host and use some of the available shell commands to copy bits and edit files. It is not the most elegant solution, but until the VMware modules mature it will have to do.

You can find the complete playbook on the madlabber github, but here are some of the highlights:

Putting the ESX login variables in an anchor:

    esxi_login: &esxi_login
      hostname: '{{ esxi_address }}'
      username: '{{ esxi_username }}'
      password: '{{ esxi_password }}'   
      validate_certs: no 

Enabling SSH (starting the TSM-SSH and TSM services):

  - name: Enable ESX SSH (TSM-SSH)
    vmware_host_service_manager:
      <<: *esxi_login
      esxi_hostname: '{{ esxi_address }}'
      service_name: TSM-SSH
      state: present
    delegate_to: localhost
  - name: Enable ESX Shell (TSM)
    vmware_host_service_manager:
      <<: *esxi_login
      esxi_hostname: '{{ esxi_address }}'
      service_name: TSM
      state: present
    delegate_to: localhost

And telling ESXi to go download the bits. I could have used the Ansible copy module to copy over the floppy image, but the ISO is too large to transfer this way. The copy module copies files to the target system’s temp volume first. Filling the temp space on an ESXi host doesn’t end well, and the ISO never makes it to the destination.

  - name: Download the Windows Server ISO
    shell: 'wget -P /vmfs/volumes/{{ esxi_datastore }} {{ windows_iso_url }}'
    args:
      creates: '/vmfs/volumes/{{ esxi_datastore }}/{{ windows_iso }}'
    delegate_to: '{{ esxi_address }}'
  - name: Download the autounattend floppy .flp
    shell: 'wget -P /vmfs/volumes/{{ esxi_datastore }} {{ windows_flp_url }}'
    args:
      creates: '/vmfs/volumes/{{ esxi_datastore }}/{{ windows_flp }}'
    delegate_to: '{{ esxi_address }}'

Now I can create the VM:

  - name: Create a new Server 2016 VM
    vmware_guest:
      <<: *esxi_login
      folder: /
      name: '{{ vm_name }}'
      state: present
      guest_id: windows9Server64Guest
      cdrom:
        type: iso
        iso_path: '[{{ esxi_datastore }}] {{ windows_iso }}'
      disk:
      - size_gb: '{{ vm_disk_gb }}'
        type: thin
        datastore: '{{ esxi_datastore }}'
      hardware:
        memory_mb: '{{ vm_memory_mb }}'
        num_cpus: '{{ vm_num_cpus }}'
        scsi: lsilogicsas
      networks:
      - name: '{{ vm_network }}'
        device_type: e1000
      wait_for_ip_address: no
    delegate_to: localhost
    register: deploy_vm

Notice how there is no floppy drive. Since I can’t create it with the vmware_guest module, I’ll have to edit the vmx file. It’s a little gruesome, but it works. I should be able to clean this up with customvalues in the vmware_guest module, but that doesn’t currently work on a standalone host.

  - name: Adding VMX Entry - floppy0.fileType
    lineinfile:
      path: '/vmfs/volumes/{{ esxi_datastore }}/{{ vm_name }}/{{ vm_name }}.vmx'
      line: 'floppy0.fileType = "file"'
    delegate_to: '{{ esxi_address }}'
  - name: Adding VMX Entry - floppy0.fileName
    lineinfile:
      path: '/vmfs/volumes/{{ esxi_datastore }}/{{ vm_name }}/{{ vm_name }}.vmx'
      line: 'floppy0.fileName = "/vmfs/volumes/{{ esxi_datastore }}/{{ windows_flp }}"'
    delegate_to: '{{ esxi_address }}'
  - name: Removing VMX Entry - floppy0.present = "FALSE"
    lineinfile:
      path: '/vmfs/volumes/{{ esxi_datastore }}/{{ vm_name }}/{{ vm_name }}.vmx'
      line: 'floppy0.present = "FALSE"'
      state: absent
    delegate_to: '{{ esxi_address }}'

One last thing before I can power it on. The default boot sequence won’t work. This one will:

  - name: Change virtual machine's boot order and related parameters
    vmware_guest_boot_manager:
      <<: *esxi_login
      name: '{{ vm_name }}'
      boot_delay: 1000
      enter_bios_setup: False
      boot_retry_enabled: True
      boot_retry_delay: 20000
      boot_firmware: bios
      secure_boot_enabled: False
      boot_order:
        - cdrom
        - disk
        - ethernet
        - floppy
    delegate_to: localhost
    register: vm_boot_order

Now I can power it on, and wait. Once the VMware tools are responding, I can use the vmware_vm_shell module to run commands inside the guest OS to assign a new hostname, set the IP address, etc. In the playbook this is part of a second play called “Customize Guest”.

  - name: Set password via vmware_vm_shell
    vmware_vm_shell:
      <<: *esxi_login
      vm_username: Administrator
      vm_password: '{{ vm_password_old }}'
      vm_id: '{{ vm_name }}'
      vm_shell: 'c:\windows\system32\windowspowershell\v1.0\powershell.exe'
      vm_shell_args: '-command "(net user Administrator {{ vm_password_new }})"'
      wait_for_process: true
    ignore_errors: yes
  - name: Configure IP address via vmware_vm_shell
    vmware_vm_shell:
      <<: *esxi_login
      vm_username: Administrator
      vm_password: '{{ vm_password_new }}'
      vm_id: '{{ vm_name }}'
      vm_shell: 'c:\windows\system32\windowspowershell\v1.0\powershell.exe'
      vm_shell_args: '-command "(new-netipaddress -InterfaceAlias Ethernet0 -IPAddress {{ vm_address }} -prefixlength {{ vm_netmask_cidr }} -defaultgateway {{ vm_gateway }})"'
      wait_for_process: true
  - name: Configure DNS via vmware_vm_shell
    vmware_vm_shell:
      <<: *esxi_login
      vm_username: Administrator
      vm_password: '{{ vm_password_new }}'
      vm_id: '{{ vm_name }}'
      vm_shell: 'c:\windows\system32\windowspowershell\v1.0\powershell.exe'
      vm_shell_args: '-command "(Set-DnsClientServerAddress -InterfaceAlias Ethernet0 -ServerAddresses {{ vm_dns_server }})"'
      wait_for_process: true
  - name: Rename Computer via vmware_vm_shell
    vmware_vm_shell:
      <<: *esxi_login
      vm_username: Administrator
      vm_password: '{{ vm_password_new }}'
      vm_id: '{{ vm_name }}'
      vm_shell: 'c:\windows\system32\windowspowershell\v1.0\powershell.exe'
      vm_shell_args: '-command "(Rename-Computer -NewName {{ vm_name }})"'
      wait_for_process: true

One more reboot and the VM is ready for advanced configuration by another playbook. In the complete playbook and the corresponding example vars file, there are a few extra steps I take to capture and later restore the state of the TSM & TSM-SSH services since most people don’t leave those enabled.