Nathan Stratton’s Homepage

Optimizing for Memory Intensive Workloads

by Nathan Stratton on Aug.20, 2013, under Hardware

Processor speed is an important factor when deciding on a new server spec, but with virtualization and other memory intensive workloads the memory system often has a far greater impact on performance than raw CPU speed. Xeon E5-2600 CPUs support 3 basic types of third generation double data rate (DDR3) memory via 4 channels, with up to 3 banks per channel, for each CPU. How those slots are filled, and with what, is very important to understand.

Types of Memory
The basic element of each type of DIMM is the DRAM chip, which provides 4 or 8 bits of data. When ECC is used the DIMM is 72 bits wide (64 data bits plus 8 check bits), allowing single-bit errors to be corrected without data loss. Since ECC functions differently with 4-bit and 8-bit chips, different DIMM types should never be mixed. The DRAMs on a DIMM are arranged in groups called ranks: groups of chips that can be accessed simultaneously by the same chip select (CS) control signal.

UDIMM – With unregistered DIMMs, each DRAM chip on the DIMM has its data and control lines tied directly over the memory bus to the memory controller integrated into each CPU. Each DRAM on this bus adds to the electrical load, and because of this load UDIMM support is limited to only 2 dual-rank UDIMMs per channel. However, this direct access to the DRAMs by the memory controller allows UDIMMs to provide the fastest, lowest latency memory access of all three types.

RDIMM – Registered DIMMs are the most common type in servers today. RDIMMs have an extra chip, a register that buffers the control lines between the memory controller and each DRAM chip. This buffer slightly increases latency, but it allows RDIMMs to support up to quad ranks and to populate all 3 DIMM banks.

LRDIMM – Load Reduced DIMMs are a relatively new type of DDR3 memory that buffers all control and data lines from the DRAM chips. This isolation decreases the electrical load on the memory controller and allows the highest memory configurations possible. Since the DRAM chips are hidden behind the buffer, LRDIMMs are able to implement rank multiplication, presenting the memory controller with fewer virtual ranks than the physical ranks on the DIMM. This hiding of physical ranks allows more ranks than the CPU's DDR3 memory architecture natively supports. The increased capacity comes at the price of not only speed and latency, but also increased power consumption.

Memory speed
The clock frequency of the memory bus used to access DIMMs on the E5-2600 series is 1600, 1333, 1066, or 800 MHz (up to 1866 MHz on the new E5-2600 v2). The memory bus speed is controlled by the BIOS and is set per system; it is not possible to access memory in different banks at different speeds. The maximum memory speed on the E5-2600 series is limited by the number of banks and ranks used and by the speed of the QPI ring. To support full 1600 MHz memory an 8.0 GT/s QPI is required, something that is not available on the standard or basic E5-2600 processors.

The E5-2600 supports up to 8 physical ranks per channel; single, dual, or quad rank DIMMs can be used, however quad rank DIMMs lower the clock frequency of the memory bus. The more ranks available to a channel, the more parallelism can be performed by the memory controller, increasing memory performance, so dual rank DIMMs should be used if possible.

While the E5-2600 can physically support 3 banks of memory, only 2 banks can be used at 1600 MHz. If all 3 banks are used, the maximum clock frequency supported on the memory bus is 1066 MHz. Fully populating channels is not required on the E5-2600, but it is highly recommended that all 4 channels be populated, if possible with two DIMMs in each channel to increase the ranks available on each channel.

Latency
Column Address Strobe, or CAS, latency with DDR3 is the number of clock cycles between the moment the memory controller requests data from a DRAM and the moment that data is available from the DRAM chips on the DIMM. When shopping for memory, particularly lower cost UDIMMs, pay attention to the CAS latency and stay away from anything over a CAS of 9.
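
As a rough rule of thumb (real access time depends on the full set of DIMM timings, not just CAS), you can convert CAS cycles into nanoseconds using the memory clock; DDR3-1600 runs an 800 MHz clock:

# CAS cycles x 1000 / memory clock (MHz) = first-word latency in ns
echo "scale=2; 9 * 1000 / 800" | bc      # CAS 9 at DDR3-1600 = 11.25 ns
echo "scale=2; 11 * 1000 / 800" | bc     # CAS 11 at DDR3-1600 = 13.75 ns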

Maximum Performance
Bottom line… if you want maximum memory performance, use 16 sticks of 1600 MHz 1.5 volt dual rank UDIMMs with the lowest CAS latency possible, populating the first two banks of all 4 channels. ECC DIMMs are also preferred, but not required, by the E5-2600.
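
Once the box is built, you can sanity-check what actually ended up in the slots, and what speed the BIOS clocked it at, from a running Linux install. Something like this works as a rough sketch (run as root; the Rank field is only reported by newer BIOSes and dmidecode versions):

# show size, type, clocked speed, and rank for each DIMM slot
dmidecode -t memory | grep -E 'Size:|Type:|Speed:|Rank:'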


Reality of H.264 SVC

by Nathan Stratton on Feb.04, 2013, under Software, Telephony

I frequently get asked about H.264 SVC, so I decided to write this post to share my thoughts. My background is in the service provider space, so let me first say that what I am about to talk about relates to SVC used for real time two way communication.

Let's start with some background. Scalable Video Coding (SVC) is the Annex G extension of H.264 AVC, approved in July 2007. As the name implies, the idea behind SVC was to create a bit stream containing scalable subsets that represent different resolutions, quality levels, and frame rates. You can think of SVC as a layered stream where the decoder is allowed to decode everything, or just a base subset of the stream, depending on the desired quality. If all that is needed is the base stream, the decoder simply discards the packets from the other layers.

If you do a search on H.264 SVC, you are quickly overwhelmed by the flood of pages singing the praises of SVC. The sad part is, most of them are authored by marketing guys rather than technologists who understand the technical differences between SVC and standard H.264 AVC.

Let's take a look at some of the proclaimed benefits of SVC and the reality when it comes to service providers.

SVC allows up to 20% packet loss without affecting quality.

A number of vendors use SVC layers to add redundancy information to a bit stream. Providing redundancy is nothing new: an 8 disk RAID 5 array can lose 1 disk without any information lost, and a RAID 6 array can lose 2 disks without losing any data. This redundancy comes at a cost; depending on the desired level of redundancy, extra data is sent, making the total bit stream 10 – 30% larger. Many times this increase in bandwidth may itself lead to quality issues, but for right now let's pretend that it is worth sending the extra data. I have seen many tests showing how great using SVC layers for redundancy is. They show lots of loss, and you can clearly see that quality is far superior to standard AVC. However, if you dig deeper you start to see the flaw in this approach. If you run an IP network, you know that most of your packet loss is not random, it is bursty: a router's queue gets full and a bunch of packets are dumped on the floor. In cases like that, the most common packet loss on the internet, redundancy information does not do a thing, because the base stream AND the redundant information are both lost!

SVC allows a single encoder to provide different resolutions, quality levels, or frame rates.

Yes, this is very true; in fact, it is what SVC was designed for. The issue is how relevant this is for two way real time video communication on the internet. The problem is that ALL of this data is in the full bit stream. As I mentioned before, if a decoder does not need the more advanced layers, it simply discards the packets, but they are still sent! Many vendors have a solution to this: you simply buy their boxes and put them all over your network and your customers' networks, and the devices then drop the unneeded layers. While I am sure this approach is great for the guys selling more boxes, from a network engineering side there are many drawbacks.

SVC is a standard, just like H.264 AVC

Again, this is correct; however, just because there is a standard for SVC it does NOT mean that vendors who are deploying SVC are able to interoperate. Let's face it, in our industry most of our “standards” are known as RFCs out of the IETF. I am a big fan of the IETF and I love the process. It has some HUGE advantages over other standards bodies such as the ITU, where standards become inch-thick documents and the process is sometimes more political than technical. At the same time, let's not forget that RFC stands for “Request For Comments”; many times they are only a dozen or so pages in length, with much of the implementation left to the reader. Some “standards” are only internet-drafts published by 1 or 2 people from a company that just wants to say it is following a “standard”, and they never turn into an RFC. We are still today working out issues with H.264 AVC over the public internet; SVC, with a much smaller following, will in my opinion never be “standardized” by the market, something that is much more crucial than the technical standard.

SVC provides a totally new way of offering low cost conferencing

One interesting approach to SVC is the idea of an application router that simply routes SVC layers between users rather than a typical MCU approach that decodes and then re-encodes the video stream. While I find this interesting, I don't think it is more than a niche play. First let me point out that this idea is far from new: Skype and others who do not have centralized MCUs have been routing streams between users for years with H.264 AVC and other CODECs. A big problem with this approach is that it shifts the burden from a centralized MCU onto, in many respects, the client. If 8 people are in the conference, each client is forced to decode 7 video streams rather than just one, and the bandwidth required to receive and send video to and from 7 other users is far greater than a single flow to an MCU. To make matters worse, it goes in the opposite direction from where the industry has been trying to move. We have been working on technologies like rtcp-mux and bundle to allow voice, video, and the 2 RTCP flows to all be on one flow rather than 4. With the application router approach, even if rtcp-mux and bundle are used, you still require port bindings for each of the other participants' audio, video, and signaling streams.
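
To put rough numbers on that (the 1 Mbit/s per video stream is purely an assumption for illustration):

# hypothetical 8-way call at 1 Mbit/s per video stream
echo "(8 - 1) * 1" | bc   # streams each client must receive and decode: ~7 Mbit/s down
echo "(8 - 1) * 2" | bc   # RTP port bindings per client (audio + video per remote party), even with rtcp-mux and bundle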

SVC and quality based on the user's bandwidth.

While closely related to the encoder being able to provide different resolutions, quality levels, or frame rates, this one is more focused on the end device. Last mile bandwidth has always been a real issue for two way video services. The idea here is that the device can somehow request just the layers it has bandwidth for rather than everything. However, if we look at last mile bandwidth closer, this is trying to fix the wrong direction! The problem lies in the fact that most last mile bandwidth is asymmetrical; I won't bore you with the reasons, but upload speed is normally a fraction of download. If a device backs off on what it is receiving (the big direction), that does nothing for what it is sending. Does the device simply send AVC, or does it try to send the base plus high quality layers using SVC up the small side of the pipe? If one wants to be smarter about this, a better approach would be to use standard AVC on both ends and allow both encoders to be adjusted in real time rather than only at the start of the call.

SVC encoding speed

The SVC guys like to say it's faster, but I am sorry, in the real world it is not. Yes, encoding 1080p + 720p + VGA + QVGA separately with H.264 is generally slower than SVC with QVGA plus the other layers, but why would we be sending ALL of that to one device? You are much better off encoding just the required video feed for each endpoint rather than trying to encode everything. The SVC guy will then say, but if you do it once with SVC you can send that same feed to everyone. While I think there are applications where the exact same content can be sent to every user, in the real world this is rarely the case. Different users want different layouts, options, etc.; the world of pushing the same static thing to everyone is going away, people want it their way! If you add the issue of CPU and especially DSP optimization for AVC over SVC, it becomes even more clear.

 

Is SVC good for anything? Yes, I believe it is. SVC is great for the content storage industry, allowing them to store one video file with many different quality levels. It's great for the security industry, allowing video feeds to be processed by systems at a base level and then expanded in quality if something needs to be analyzed further. However, I think most of the people making noise about SVC today are just looking for marketing spin rather than real technical advantages.

 


The Blade Myth and 10Gb Ethernet

by Nathan Stratton on Apr.21, 2012, under Hardware


Over the last several years there has been a big push to adopt blade servers, the idea being that you can cram more CPU cores into less space, allowing you to build a more efficient data center. Let's take a few minutes and compare IBM BladeCenter to 1U white label pizza boxes. The results may surprise you.

I picked IBM BladeCenter because it has the largest market share in the blade space and frankly is a great product! It is price competitive with other blades and I think gives a fair representation of the market.

The BladeCenter E chassis supports 14 bays in a high density chassis. At only 9U high you can support 4 chassis in a rack and still have 6U free to support two top of rack switches and any other support hardware. This provides a maximum density of 56 dual processor blades in a standard 42U rack.

Before we jump into the actual blades, let's look at 1U chassis. In a standard 42U rack you can support 40 dual processor 1U servers, leaving room for two 1U top of rack switches. This is one of the biggest selling points for the blade camp: blades let you support many more raw CPUs per rack.

Economics

Our test systems use dual Intel Xeon E5-2680 8 core 2.7 GHz CPUs, dual 10Gb Ethernet, and 128 GB of RAM. The BladeCenter HS23 rings in at $11,713 (including 1/14th of the BladeCenter H chassis) and the white label, based on the Supermicro X9DRH-7TF motherboard, rings in at $5,882.

Both systems come with 10 Gb Ethernet, however they use very different form factors. On the IBM we are handed older SFP+ connectors, while the white label has newer copper 10GBase-T ports that support standard Cat-6 cable.

For aggregation, I chose Arista Networks 7050 series switches based on the Broadcom Trident+ ASIC. These switches provide 48 10 GbE ports plus 4 QSFP+ 40 GbE ports (64 10 GbE ports equivalent) on a 1.28 Terabit/sec fabric. The white label solution uses the DCS-7050T-64-F ($20,995) with 10GBase-T ports, and the BladeCenter requires the somewhat more expensive DCS-7050S-64-F ($24,995) with SFP+ ports.

Our white label box has 10GBase-T ports on the motherboard, but the BladeCenter requires two Ethernet Pass-Through Modules at $4,999 each plus $75 twinax cables. That brings our cost per 10 Gb Ethernet port, including the switch, to $328 for white label and $1,175 for IBM BladeCenter.

To compare the numbers in an easy way, I decided to look at cost per CPU core, counting dual 8 core CPUs and dual 10 Gb Ethernet ports per server. The BladeCenter comes in at $834 per core and the white label at $408.
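
As a sanity check on the white label figure (using only the prices quoted above and ignoring rack, power, and cabling), the math is simply the server cost plus its two switch ports divided by the core count:

# white label: $5,882 server + 2 x $328 10 GbE ports, 16 cores per server
echo "(5882 + 2 * 328) / 16" | bc     # ~$408 per core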

But what about cost for space?

It turns out that the cost of space does not affect the per core number that much. With the BladeCenter at 896 cores per rack and a rack cost of $750, the space cost per core over two years is only $22.77, compared to $31.88 per core over 2 years for the 640 core white label rack. So even if we take twice the amount of space for our 1U white label servers, we still come way ahead using 1U pizza box servers, saving almost $7K per server!

Distributed storage in today’s clouds

How does this fit with today's cloud deployments? Well, the interesting thing is that many cloud deployments today use servers for more than just compute resources. Modern 10Gb Ethernet controllers such as the Intel X540 are able to access CPU cache directly, greatly reducing the processor and memory requirements of 10 Gb flows. This allows servers to also act as cloud storage using new distributed file systems such as Gluster and Hadoop.

Unlike blades, our 1U servers have plenty of room for disks. In fact, the Supermicro X9DRH-7TF motherboard already has an LSI SAS2208 hardware RAID controller supporting 6Gb/s SAS/SATA drives. With eight 2.5″ 1TB SATA drives per server, one could at low cost add 280TB of usable RAID 5 storage per rack to the compute cloud.
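
That per-rack figure is just the usable RAID 5 capacity of each server times the 40 servers in the rack:

# RAID 5 usable capacity = (drives - 1) x drive size, times 40 servers per rack
echo "(8 - 1) * 1 * 40" | bc     # 7 TB per server x 40 servers = 280 TB per rack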


Dracut PXE Boot with bonded interfaces

by Nathan Stratton on Mar.07, 2012, under Software

It’s taken me a while to get dracut PXE Boot working with bonded interfaces, so I wanted to take a moment and share.

My setup is as follows: 20 servers with dual gigabit Ethernet ports connected to two Cisco 3750 switches, which are stacked together in a ring. The first Ethernet port, eth0, from each server is connected to switch 1, and the second port, eth1, is connected to the second switch. The ring configuration allows the two switches to look like one larger switch, providing redundancy while still allowing for things like trunks spanning more than one switch.

Switch Configuration

The Cisco 3750 is configured as follows:

interface Port-channel1
 description virt1
 switchport trunk encapsulation dot1q

interface GigabitEthernet1/0/1
 switchport trunk encapsulation dot1q
 speed 1000
 duplex full
 spanning-tree portfast
 channel-protocol lacp
 channel-group 1 mode passive

interface GigabitEthernet2/0/1
 switchport trunk encapsulation dot1q
 speed 1000
 duplex full
 spanning-tree portfast
 channel-protocol lacp
 channel-group 1 mode passive

The above config first sets up a port-channel (a bonded interface) and sets the encapsulation to dot1q, the standard that allows VLAN tagging. The two physical interfaces are then configured: I set the speed, duplex, and spanning-tree portfast to help speed up port setup time. The ports are both configured to use standard LACP and are made part of the port-channel interface with the channel-group 1 mode passive command. Passive mode is important: the switch does not bring the ports up as part of the channel until the other end (our server) brings up the LACP bond. This allows the server to do a standard PXE boot with DHCP and TFTP on a plain interface rather than failing because the port was already in trunk mode.

Dracut Configuration

Dracut allows you to boot a server with as little as possible hard-coded into the initramfs. To make the image I typed:

dracut dracut.img 3.2.7-1.fc16.x86_64
dracut --add-drivers bonding -f dracut.img

The first line builds the image and the second line adds bonding support to it. Note that the kernel version is important; you can pull it with uname -r. The boot configuration lives on the TFTP server in the pxelinux.cfg/default file. Mine looks like:

prompt 1
default Fedora-16_3.2.7-1.fc16.x86_64
timeout 10
serial 0 115200
console 0

label Fedora-16_3.2.7-1.fc16.x86_64
kernel vmlinuz-3.2.7-1.fc16.x86_64
append initrd=dracut.img root=10.10.0.4:/diskless/Fedora16_020303 console=ttyS0,115200 biosdevname=0 bond=bond0:eth0,eth1:mode=4 bridge=ovirtmgmt:bond0 ip=ovirtmgmt:dhcp

This file configures a serial console on the first serial port at a speed of 115,200 and points pxelinux at the kernel and dracut image to load from the TFTP server. A breakdown of the append line is as follows:

initrd=dracut.img                          The name of my dracut image.
root=10.10.0.4:/diskless/Fedora16_020303   NFS server IP and path for the root image.
console=ttyS0,115200                       Sets the serial device and speed.
biosdevname=0                              Keeps the old ethX naming scheme.
bond=bond0:eth0,eth1:mode=4                Bonds eth0 and eth1 using mode 4 (802.3ad/LACP).
bridge=ovirtmgmt:bond0                     Creates bridge ovirtmgmt attached to bond0.
ip=ovirtmgmt:dhcp                          Runs DHCP on the ovirtmgmt interface.

Now the problem…

So far we have a setup that will correctly DHCP and PXE boot, and the server will have access to VLAN 1 but not the other VLANs; this is because the switch LACP port is not yet running as a trunk. Cisco can negotiate this automatically if there is another Cisco device on the other end (via a proprietary protocol), but the Linux box does not support this. To get around this problem and still PXE boot, we have a script that adds “switchport mode trunk” to the Port-channel interface once the server is up (a rough sketch follows below). Once this is done you will be able to talk on all the VLANs you have set up. This is an ugly hack, but so far it is the only way I have found to make a Cisco work in this setup.
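
For what it's worth, the script itself is nothing fancy. Here is a rough sketch of the idea, assuming the switch accepts configuration commands over a piped SSH session; the hostname and credentials are made up, and depending on your IOS version you may need an interactive tool like expect or RANCID's clogin instead:

#!/bin/bash
# after the server has finished PXE booting, flip the port-channel into trunk mode
SWITCH=switch1               # hypothetical switch hostname
PORT_CHANNEL=Port-channel1   # the bonded interface configured above

ssh admin@$SWITCH <<EOF
configure terminal
interface $PORT_CHANNEL
switchport mode trunk
end
write memory
exit
EOF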


Hello World, the current temp is:

by Nathan Stratton on May.02, 2010, under Hardware, Software

I have wanted to start playing with microcontrollers for a while now. I ended up selecting the Parallax Propeller chip because of its ease of use, and I liked its COG design with 8 32-bit cores working together.

My first test was connecting a 4×20 line LCD and a few DS18S20 1-wire temp sensors to the Propeller chip. Everything was very easy to learn: the LCD was interfaced with no external components, and the 1-wire bus only required a 4.7K pull-up resistor.

Propeller PDB with LCD and 1-wire bus


Central Air Pool Heat-Cool

by Nathan Stratton on Apr.24, 2010, under Projects

Worked today on my central air pool heat/cool system. The goal for the new system is to be able to operate in normal house cool mode, pool heat mode, and pool cool mode, all in one HVAC system. The new system uses a reversing valve to pump down unused portions of the system, dramatically cutting down on the amount of refrigerant needed.

hvac-house_cool

Check out the project page for more info:


Infiniband

by Nathan Stratton on Jul.30, 2009, under Hardware

Infiniband is an often overlooked technology outside of the supercomputer / clustering space. I think that is a shame given some of the amazing aspects of this technology. Infiniband is a serial connection with a raw full duplex data rate of 2.5 Gbit/s, known as 1x single data rate (SDR) mode. In addition to double data rate (DDR) and quad data rate (QDR) modes, links can be aggregated in units of 4 or 12 paths, yielding up to 120 Gbit/s in 12X QDR mode. In a day when server motherboards are just starting to see 10 Gbit/s Ethernet cards, the most common “low speed” Infiniband option is a 10 Gbit/s 4X SDR card. Infiniband uses remote direct memory access (RDMA) for data transfer, allowing data to be moved between hosts directly without burning CPU cycles. All of this happens at about 1/4th the port to port latency of 10 Gbit/s Ethernet!

The part I like best about Infiniband is the price, especially on the used market. Let's take a look at a common setup on eBay. There are lots of switch options, but I like the TopSpin 120, also known as the Cisco 7000p. This is a 24 port 4X SDR 10 Gbit/s switch that runs $750 – $1500 depending on the used source. There are even more options for Infiniband cards; I tend to stick with Mellanox chipset based cards, and they can be found for as little as $40 for PCI-X and around $125 for PCI Express. The only thing that is going to cost you more with Infiniband is the cables; they will run you $20 – $50 each.

Applications that support native Infiniband RDMA are going to get the best performance, but with IP over Infiniband (IPoIB) you can use standard TCP/IP! With IPoIB your Infiniband card shows up as a normal network interface and you can run DHCP or a static IP on it.
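
For example, on a typical Linux host (module and device names can vary with your distro and HCA, so treat this as a sketch), bringing up IPoIB looks just like any other NIC:

# load the IPoIB driver; the HCA ports show up as ib0, ib1, ...
modprobe ib_ipoib
# give it a static address...
ifconfig ib0 10.0.0.5 netmask 255.255.255.0 up
# ...or just run DHCP on it
dhclient ib0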

Cisco 7000P

Mellanox MHEL-CF128-T

GORE 4X Infiniband Cable


APC Matrix 5000 Hack

by Nathan Stratton on Jul.19, 2009, under Hardware

I use a lot of power in my office, so much that the four 1500 VA UPS units I have only last me a few minutes. I needed something bigger, so I went on eBay and found two APC Matrix 5000 UPS units for $450 each including shipping. There was only one downside: there were no batteries, and new batteries would have cost me several thousand dollars.

The solution? I picked up 8 marine batteries at the auto parts store and wired them up (yes with fuses) into two 48 volt strings connected in parallel.

IMG_0324

Note: If you try this, you want to use marine or, better yet, deep-cycle batteries rather than car starting batteries. Car batteries are designed to give very high bursts of current and should only be discharged by about 5%. Their very thin plates would be destroyed after a few hundred deep discharges, rather than the thousands of cycles you would get from a deep-cycle battery.

P.S. Yes, I built a cover for it!


BlinkMind This Week at InfoComm

by Nathan Stratton on Jun.16, 2009, under Uncategorized

BlinkMind will be at InfoComm 09 in Orlando, FL. If you're in the area, stop by and check us out in booth 3792 in the Video Conferencing Pavilion. We will be showing off our 16 active party video conferencing system, SIP video server, and IPTV system. We are also showing interoperability with a range of systems: the Grandstream GXV3006, Polycom VVX 1500, Creative InPerson, Tandberg E20, and the Linphone soft client with the BlinkMind contributed H.264 code.

InfoComm Booth Under Construction


If you need help finding our booth, check out the Hall B floor plan below:

Hall B Floor Plan



Converting Citrix .xva to Xen.org .img

by Nathan Stratton on Jun.06, 2009, under Software

Xen is one of the coolest virtualization technologies out there. It comes in many flavors, the two largest being the bleeding edge xen.org open source project and the commercial (Citrix) version. There are things I love about the commercial version, but they lost me by only supporting Windows in their XenCenter administration interface.

The file formats of the commercial and open source Xen are totally different. The open source version uses a standard image file: you can mount it, fdisk it, whatever you would like. The Citrix Xen Virtual Appliance .XVA file is quite different. It is actually a tar file with ova.xml metadata and directories starting with Ref, full of 1 MB files that make up the drive volumes of the virtual image.

To convert a .xva to a xen.org .img file, first untar the image:

tar -xvf {image}.xva

Then grab this handy utility and run it on your untarred data, for example:

python xenmigrate.py --convert=Ref:3 {image}.img

This will paste all of those chunk files back together, starting at 00000000. Note that I have had problems running this script on CentOS 5.x.
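
If you are curious what the script is doing, in the simple case it is roughly equivalent to concatenating the numbered 1 MB chunks in order. This is a hedged sketch only, not a replacement for the utility, since exports can skip chunks (which then need to be zero-filled) and may include extra checksum files:

# only valid if no chunk numbers are missing from Ref:3
cd Ref:3
cat $(ls | grep -v '\.' | sort) > ../{image}.img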

