Wednesday, January 21, 2015

TDR updates

A few months ago, I wrote about a project I had been thinking of for a while but not had time to work on: a time-domain reflectometer (TDR) for testing twisted pair Ethernet cables.

TDR background

The basic theory of operation is simple: Send a pulse down a transmission line and measure the reflected voltage over time to get a plot of impedance discontinuities over time. Unfortunately, doing this with sufficient temporal resolution (sub-nanosecond) requires extremely high analog sampling rates, and GHz A/D converters are (to say the least) not cheap: the least expensive 1 GSA/s ADC on Digi-Key is the HMCAD1520, which sells for $120 each at the time of this writing. Higher sampling rates cost even more, the 1.5 GSa/s ADC081500CIYB is listed at $347.

One possible architecture would consist of a pre-amplifier for each channel, a 4:1 RF mux, and a single high-speed ADC sampled by an FPGA. This would work, but seemed quite expensive and I wanted to explore lower-cost options.

ADC architecture

After thinking about the problem for a while, I realized that the single most expensive component in a classical TDR was probably the ADC - but there was no easy way to make it cheaper. What if I could eliminate the ADC entirely?

I drew inspiration from the successive-approximation-register (SAR) ADC architecture, which essentially converts a DAC into an ADC by binary searching. The basic operating algorithm is as follows:
  • For each point T in time
    • vstart = 0
    • vend = Vref
    • Set DAC to (Vstart + Vend) / 2
    • Compare Vin against Vdac
    • If Vin > Vdac
      • Set current bit of sample to 1
      • Set Vstart to Vdac, update ADC, repeat
    • else
      • Set current bit of sample to 0
      • Set Vend to Vdac, update ADC, repeat
The problem with SAR for high speeds is that N-bit analog resolution at M samples per second requires a DAC that can run at O(M log N) samples per second - hardly an improvement!

In order to work around this problem, I began to think about ways to represent the data generated by a SAR ADC. I ended up modeling a simplified SAR ADC which performed a linear, rather than binary, search. We can represent the intermediate data as a matrix of 2^N rows by M columns, one for each of M data points.

The sampling algorithm for this simplified ADC works as follows:
  • For each point T in time 
    • Set DAC to 0
    • Compare Vin against Vdac
    • If Vin > Vdac
      • Set column[Vdac] to 1
      • Increment Vdac
      • Repeat comparison
    • Otherwise stop and capture the next sample
Once we have this matrix, we can simply sum the number of 1s in each column to calculate the corresponding sample value.

While this approach will clearly work, it is exponentially slower than the conventional SAR ADC since it requires 2^N samples instead, instead of N, for N-bit precision. So why is it useful?

Now consider what happens if we acquire the data from a transposed version of the same matrix:
  • For each Vdac from 0 to Vref
    • For each point T in time
      • Compare Vin against Vdac
      • If Vin > Vdac
        • Set row Vin of column T to 1
    • Increment Vref
    • Go back in time and loop over the signal again
This version clearly captures the same data, since matrix[T][V] is set to true iff sample T is less than V. We simply switch the inner and outer loops.

It also has a very interesting property for cost optimization: Since it only updates the DAC after sampling the entire signal, we can now use a much slower (and cheaper) DAC than with a conventional SAR. In addition, the comparator can now update at the sampling frequency instead of 2^N times the sampling frequency.

There's just one problem: It requires time travel! Why are we wasting our time analyzing a circuit that can't actually be built?

Well, as it turns out we can solve this problem too - with "parallel universes". Since the impedance of the cable is (hopefully) fairly constant over time, if we send multiple pulses they should return identical reflection waveforms. We can thus send out a pulse, test one candidate Vdac value against this waveform, then increment Vdac and repeat.

The end result is that with a cheap SPI DAC, a high-speed comparator, and an FPGA with a high-speed digital input we can digitize a repetitive signal to arbitrary analog precision, with sampling rate limited only by comparator bandwidth and FPGA input buffer performance!

Pulse generation

The first step in any TDR, of course, is to generate the pulse.

I spent a while looking over FPGAs and ended up deciding on the Xilinx Kintex-7 series, specifically the XC7K70T. The -1 speed grade can do 1.25 Gbps LVDS in the high-performance I/O banks (matching the -2 and -3 speed Artix-7 devices) and the higher speed grades can go up to 1.6 Gbps.

The pulse is generated by using the OSERDES of the FPGA to produce a single-cycle LVDS 1 followed by a long series of 0s. The resulting LVDS pulse is fed into a Micrel SY58603U LVDS-to-CML buffer. This slightly increases the amplitude of the output pulse and sharpens the rise time up to 80ps.

The resulting pulse is then sent through the RJ45 connector onto the cable being tested.

Output buffer

Input preamplifier

The reflected signal coming off the differential pair is AC coupled with a pair of capacitors to prevent bus fights between the (unequal) common-mode voltages of the output buffer and the preamplifier. It is then fed into a LMH6881 programmable-gain preamplifier.

This is by far the most pricey analog component I've used in a design: nearly $10 for a single amplifier. But it's a very nice amplifier (made on a SiGe BiCMOS process)- very high linearity, 2.4 GHz bandwidth, and gain from 6 to 26 dB programmable over SPI in 0.25 dB steps.

Input preamplifier
The optional external terminator (R25) is intended to damp out any reflections coming off of the preamplifier if they present a problem; during the initial assembly I plan to leave it unpopulated. Since this is my first high-speed mixed signal design I'm trying to make it easy to tweak if I screwed up somehow :)

The output of the preamplifier is a differential signal with 2.5V common mode offset.

Differential to single-ended conversion

The next step is to convert the differential voltage into a single-ended voltage that we can feed into the comparator. I use an AD8045 unity gain voltage feedback amplifier for this, configured to compute CH1_VDIFF = (CH1_BUF_P - CH1_BUF_N)  + 2.5V.


The single-ended voltage is compared against the DAC output (AFE_VREF) using one half of an LMH7322 dual comparator.

The output supply of the comparator is driven by a 2.5V supply to produce LVDS-compatible differential output voltage levels.

PCB layout

The board was laid out in KiCAD using the new push-and-shove router. All of the differential pairs were manually length-matched to 0.1mm or better.

The upper left corner of the board contains four copies of the AFE. The AD8045s are on the underside of the board because the pinout made routing easier this way. Hopefully the impedance discontinuities from the vias won't matter at these signal speeds...

AFE layout, front side
AFE layout, back side
The rest of the board isn't nearly as complex: the lower left has a second RJ45 connector and a RGMII PHY for interfacing to the host PC, the power supply is at the upper right, and the FPGA is bottom center.

The power supply is divided into two regions, digital and analog. The digital supply is on the far right side of the board, safely isolated from the AFE. It uses an LTC3374 to generate an 1.0V 4A rail for the FPGA core, a 1.2V 2A rail for the FPGA transceivers and Ethernet PHY, a 1.8V 1A rail for digital I/O, and a 2.5V rail for the CML buffers and Ethernet analog logic.

The analog supply was fairly close to the AFE so I put a guard ring around it just to be safe. It consists of a LTC3122 boost converter to push the 5V nominal input voltage up to 6V, followed by a 5V LDO to give a nice clean analog rail. I ran the output of the LDO through a pi filter just to be extra safe.

The TDR subsystem didn't use any of the four 6.6 Gbps GTX serial transceivers on the FPGA because they are designed to recover their clock from the incoming signal and don't seem to support use of an external reference clock. It seemed a shame to waste them, though, so I broke them (as well as 20 0.95 Gbps LVDS channels) out to a Samtec Q-strip header for use as high-speed GPIO.

Without further ado, here's the full layout. I could have made the board quite a bit smaller in the vertical axis but I needed to keep a constant 100mm high so it would fit in the card guides on my Eurocard rack.

The board is at fabs for quotes now and I'll make another post once the boards come back.

Layer 1
Layer 2 (ground)
Layer 3 (power)
Layer 4, flipped to make text readable

Thursday, January 15, 2015

GERALD microscope control system

While my optical microscopes are capable of sufficient resolution for imaging larger-process ICs, taking massive die images (this one, for a comparatively small 3.2mm^2 die, is about 0.6 gigapixels) has been beyond my capability because I have better things to do with my time than sit in the lab for a week turning the stage knob a little, clicking a button to take a picture, turning the knob a little...

John has a computer-controlled microscope that he uses for large imaging jobs and it seemed like it'd be a good idea to make one of my own. The first step was to write some control software. John's user interface is very much a "programmer's GUI" and I figured something with a bit more eye candy wouldn't be too hard to do.

The result was a system called GERALD. (I couldn't think of a name for it, asked my girlfriend for suggestions, and this was the first she came up with...)

The prototype is using my old AmScope microscope because I didn't want to take my main Olympus out of service for an extended period of time during development. Yes, I'm aware that the "support" structure under the stage isn't very rigid... this is a software development testbed and I won't be using this microscope in the final deployment so there's no sense wasting time machining nice aluminum brackets. I just have to be careful not to move around too much when using it ;)

Prototype GERALD system on my desk. The breadboarded MCU is generating step/direction pulses from USB-serial commands and sending them to the stepper controller under the textbook.

The camera in use is an AmScope MD1900. It uses a proprietary USB protocol and only has Windows driver support, and I try to run a 100% Linux shop. This was a problem... In keeping with John's unofficial motto "Open source by any means necessary", I corrected the problem ;)

A bit of Wireshark and libusb coding later, I had a basic working driver. The protocol is closely related (but not identical) to the MU800 that John reversed a few months ago, which helped me get my feet in the door. There's a bit of trivial obfuscation (XORing control transactions with 0x55aa or 0x5aa5) which I fail to understand... The average user isn't going to notice anything, and anyone with the skills to reverse engineer USB transactions or a kernel driver will see right through it, so why bother?

I plan to try making a kernel V4L2 driver in the future but for now it works so it's not a huge priority.

The current GERALD system is very much a WIP, most basic features are there but automated image capture and some other things aren't implemented yet.

GERALD UI overview
The basic UI layout is modeled in large part on the FEI Versa FIB's control system, which is currently my favorite control system for either optical or electron microscopes.

The upper left panel is an overview of the current camera feed, scaled down to fit. The upper right view, meanwhile, shows the center of the feed at native (1:1 pixel) resolution. Both views have a footer that includes a scale bar, date/time, the objective in use, and magnification. (The magnification is the actual value based on calibration of the camera and my 24" 1080p display.)

The lower right view is a webcam pointed at the sample stage. I haven't had the time to make a bracket to hold it in the right spot so it's just sitting on my desk for now. This is the equivalent of a SEM chamber cam and I hope to eventually use it to avoid crashing the sample into the objective in full-remote-control operation.

The lower left view is a navigation display showing the current sample. Unlike the Versa's nav-cam (a single static image taken of the stage when the sample is loaded) my navigation view is a composite of actual microscope images. Every time a new video frame comes in when the stage isn't moving (to avoid motion blur) it is plotted in the navigation view at the current physical position of the stage.

As of now the navigation view is non-interactive; all it does is show the current field of view on the sample. My plan is to support clicking to move the stage to a specific point, as well as drawing to define a rectangular area for step-and-repeat imaging.

In typical use I envision the user moving to the upper left and bottom right corner of the sample manually with the joystick, then selecting an objective and drawing a box around the sample to initiate a high-resolution imaging run.

In order for all of this to work properly, the system must be calibrated. The first step is to calibrate the camera so that it knows how large each pixel is. I've done this manually in the past by taking pictures of a calibration slide and counting pixels, but it's high time I automated the process.

GERALD pixel size calibration
The algorithm is quite simple and is designed to work with the pattern on my calibration slide (10um pitch short lines and 50um pitch long lines) only. As of now the are limited sanity checks so if there's no calibration slide in view the results can be somewhat strange :)
  • Convert the image to grayscale using NTSC color weights
  • Compute a gray-level histogram
  • Median-filter the histogram to smooth out spikes and find peaks
  • The distribution should be approximately bimodal (white lines on dark background). Take the mean of the peaks and threshold the image to binary.
  • Do a median filter on the binary image to smooth out noise
  • Scan across the image horizontally, one scan line at a time. Note the start and end locations of each white area. If the width is too small (less than two pixels) or too large (more than 10% of the screen) discard it. Otherwise save the width and centroid as a slice down a potential line.
  • For each slice, check if the next row down has a line slice within a couple of pixels. If so, add it to the growing polyline and remove it from the list of candidates.
  • Fit a line segment to the points in each polyline using least-squares, set its width to the mean of all slices used
  • Extend each line segment until it hits the edge of the image. If the projected line gets within a very short distance of BOTH endpoints of a second segment, the two are collinear so merge them. (This will smooth out gaps in the detected lines from dust specks etc.) The resulting lines are plotted in green in the calibration view.
  • Project the lines onto the X axis and sort them from left to right.
  • For each line from left to right, measure the perpendicular distance to the next one over (displayed in red in the calibration view).
  • Compute the median of these line lengths. This is considered to be the pitch of the lines, in pixels.
  • Find the median width of the lines.
  • If the pitch is less than three times the line width, we're looking at fine-pitch lines (10 μm pitch). Otherwise the fine pitch lines were too small to resolve and we're looking at coarse pitch (50 μm).
  • Compute the physical size of one camera pixel from the known physical pitch and pixel pitch.
Now that that the camera has been calibrated, we know how big a camera pixel is - but we don't have the scaling and rotation factors needed to transform between camera and stage coordinates. This is the second half of the calibration.

Rather than trying to compute a full transformation matrix, the current code simply represents a motor step for each motor as a 2-vector describing the distance (in nanometers, referenced to the camera axes) the stage moves during one motor step. We can then easily compute the distance moved by any number of motor steps as a linear combination of these two 2-vectors.

This algorithm is built on top of the same CV code used for the camera calibration.
  • Find the rightmost line on the calibration slide.
  • Locate the midpoint of it. (This is marked with a blue dot in the debug overlay.)
  • Check if this keypoint is in the left 1/4 of the camera's FOV.
  • If not, move left 50 steps, take a picture, and repeat until it is.
  • Record the position of the keypoint.
  • Move right, 50 steps at a time, until the keypoint is in the right 1/4 of the FOV.
  • Record the 2D distance the keypoint moved and divide by the number of steps taken to get the X axis step vector.
  • Repeat for the Y axis, except using 1/3 as the threshold instead of 2/3 since the Y axis FOV is smaller than X.

The system isn't quite finished but it's coming along nicely. Hopefully I'll have time in the next couple of months to finish the software, make a PCB for the control circuit, and machine brackets to hold all of the parts onto my Olympus scope.

FPGA cluster updates

Although the "raised floor" design for my FPGA cluster looked cool, it really didn't scale. My entire desk was full, there was very limited room for new hardware, and the boards kept getting dusty. To make matters worse, long wires were needed to connect everything and it was difficult to manage them all.

Original FPGA cluster

I ended up moving forward with the plan I came up with a few months ago and my FPGA cluster is now living on the 19" rack in my living room.

The first step was to laser-cut acrylic frames for each board (or several boards, if they were small enough) that would slide into the card guides.

In the photo below you can see nodes lx9mini0 and lx9mini2 (lx9mini1 was being used for something else at the time so I put it on another card later on) on a clear 1/16" acrylic sheet cut to standard Eurocard dimensions. The clear faceplate was later replaced with an opaque black one because I think it looks better that way.

Spartan-6 FPGA boards on a 3U blade card

My existing USB hubs weren't well suited to the rackmount form factor so I built a new one. This is a ten-port USB 2.0 hub with two front-panel ports (for keyboard and mouse) and eight back-side ports for internal connections. It consists of three 4-port Cypress hub chips in a tree, plus the associated PMICs. For extra fun I threw on an XC2C128 CPLD with a SPI headers so that I could potentially toggle power to individual ports remotely over SPI, but as of now this functionality isn't being used.

As with my previous hub designs all ports are overcurrent protected. I also added external ESD clamp diodes (RCLAMP0514M) to each data line after an killing a previous hub by zapping it while plugging my phone in to charge.

The port indicator LEDs are off in the idle state, green when a device has enumerated successfully, blinking green when a device is detected but fails to enumerate, and red for an overcurrent fault.

3U x 4HP USB hub blade

I went with my original plan of racking the Atlys boards in one of my empty 1U cases and the AC701 in another. The boards are screwed into standoffs which are attached to the case by cyanoacrylate adhesive.

Nodes atlys0 and atlys1 being installed in a 1U server case. JTAG and Ethernet cables are not yet installed.

I then installed my PDU board, the BeagleBone Black, the USB hub, and all of the smaller FPGA boards I was using for my research in the rack. Since I was moving the second 24-port switch from my desk to the rack I hooked the two together with a short run of multimode fiber. Fiber was overkill for the application but I had SFP ports sitting around unused and I was running out of copper interfaces...

The standard I've decided on for all new designs in the near future is as follows:
  • Boards are to be 100mm tall (3U Eurocard height)
  • Component keepouts along the top and bottom edges as specified by the Eurocard standard for card guides
  • Faceplate mounting holes on the front panel as specified by the Eurocard standard
  • 5V DC center-high power via standard 5.5mm barrel jack on the back edge
  • Network jacks and indicator LEDs on the front edge
  • JTAG, USB, and other I/O on the back edge

My current rack
The current FPGA cluster
Back side of the FPGA cluster midway through rack install. The USB cable coming out the bottom was temporary for testing until I could find a right-angle one.
All of the cards have nice shiny black faceplates except my 4-port switch prototype from a few years ago - the laser cutter on campus was unavailable the day I wanted to get the faceplate made and I never got around to doing it.

I still have quite a bit more gear to rack - several Spartan-6 boards I'm not actively using in my research, a Raspberry Pi, a Parallella, and a large number of PIC MCU development boards of various types. This will fill the current card rack and then some, so I left 3U of empty space below this one for expansion. The Parallella runs hot so I ordered a 1U fan tray which will be installed later today.

The 1U blank above the card rack is there for airflow reasons (the fan will blow upwards and exhaust air needs some space to blow out) but I can still fit stuff that doesn't block airflow. Early next week I'll be replacing it with a 16-port LC-LC patch panel so that I can terminate fiber runs to various devices on the rack (such as the AC701 board, which currently has a 2m run of multimode going around the side of the rack because there's no good way to get fiber from back to front).

New microscope bench

For a long time, my microscopes have lived on a folding plastic table in the corner of my lab. It was wobbling and causing blurry images, but I never got a chance to do something about it... until now.

It's been replaced it with a custom-built wooden workbench. I made it big and heavy on purpose to reduce vibrations: The tabletop is 3/4" flooring-grade plywood and the legs are 4x4" posts resting on top of rubber pads. (I actually made two of them, because one of my roommates liked the design and wanted one for himself. )

I don't have any photos of the early build process but here's one of the tabletop and legs before the shelving was installed. It was annoying having to stain and assemble it in the middle of my living room... I can't wait to have a garage or basement to work in!

EDIT: A friend who helped me with the build just sent me this one.
Test-fitting some of the lumber
New workbench, partially assembled
After installing shelves
In its final resting place

The stain turned out a little darker than I anticipated but it actually matched the paneling in my apartment surprisingly well so no complaints there. The 16 square feet of shelf space let me tidy up the lab and remove a lot of miscellaneous junk that had been sitting around with nowhere to go.

With equipment installed
It was a lot of fun to make - I had almost forgotten how much I enjoyed woodworking. Maybe my next piece of furniture will be something more pretty and less utilitarian in nature? (Making it out of hardwood instead of construction lumber, and having better tools, would help too...)

Hello 2015 - New job, new projects, and more!

A few months ago, I wrote about some of my pending projects. Several of them are actively running, some are done, and a few new ones cropped up :) I'll be making a bunch of short posts in the next day or so on some of them.

It's been a busy fall so I haven't had time to post anything lately. I'm working hard on my thesis and plan to defend this spring. It's 84 pages and counting...

After I graduate I'll be packing up my lab and moving from Troy, NY to somewhere near Seattle. I'll be working as a senior security consultant in the hardware lab at IOActive, doing board, firmware, and silicon level reversing and security auditing, as well as original research... so you can expect some posts by me on their lab blog in the coming months. My personal lab will likely have a bit of downtime due to the move, although I'd like to get back up and running by the end of the summer at the latest.

I've also submitted a talk based on my CPLD reverse engineering work to REcon. I hinted earlier about the project but there's a lot more to it which hasn't been released publicly... details to come if the talk is accepted ;)

Saturday, October 4, 2014

Why Apple's iPhone encryption won't stop NSA (or any other intelligence agency)

Recent news headlines have made a big deal of Apple encrypting more of the storage on their handsets, and claiming to not have a key. Depending on who you ask this is either a huge win for privacy, or a massive blow to intelligence collection and law enforcement capabilities. I'm going to try avoiding expressing any opinions of government policy here and focus on the technical details of what is and is not possible - and why disk encryption isn't as much of a major game-changer as people seem to think.

Matthew Green at Johns Hopkins wrote a very nice article on the subject recently, but there are a few points I feel it's worth going into more detail on.

The general case here is that of two people, Alice and Bob, communicating with iPhones while a third party, Eve, attempts to discover something about their communications.

First off, the changes in iOS 8 are encrypting data on disk. Voice calls, SMS, and Internet packets still cross the carrier's network in cleartext. These companies are legally required (by CALEA in the United States, and similar laws in other countries) to provide a means for law enforcement or intelligence to access this data.

In addition, if Eve can get within radio range of Alice or Bob, she can record the conversation off the air. Although the radio links are normally encrypted, many of these cryptosystems are weak and can be defeated in a reasonable amount of time by cryptanalysis. Numerous methods are available for executing man-in-the-middle attacks between handsets and cell towers, which can further enhance Eve's interception capabilities.

Second, if Eve is able to communicate with Alice or Bob's phone directly (via Wi-Fi, SMS, MITM of the radio link, MITM further upstream on the Internet, physical access to the USB port, or using spearphishing techniques to convince them to view a suitably crafted e-mail or website) she may be able to use an 0day exploit to gain code execution on the handset and bypass any/all encryption by reading the cleartext out of RAM while the handset is unlocked. Although this does require that Eve have a staff of skilled hackers to find an 0day, or deep pockets to buy one, when dealing with a nation/state level adversary this is hardly unrealistic.

Although this does not provide Eve with the ability to exfiltrate the device  encryption key (UID) directly, this is unnecessary if cleartext can be read directly. This is a case of the general trend we've been seeing for a while - encryption is no longer the weakest link, so attackers figure out ways to get around it rather than smash through.

Third, in many cases the contents of SMS/voice are not even required. If the police wish to geolocate the phone of a kidnapping victim (or a suspect) then triangulation via cell towers and the phone's GPS, using the existing e911 infrastructure, may be sufficient. If intelligence is attempting to perform contact tracing from a known target to other entities who might be of interest, then the "who called who when" metadata is of much more value than the contents of the calls.

There is only one situation where disk encryption is potentially useful: if Alice or Bob's phone falls into Eve's hands while locked and she wishes to extract information from it. In this narrow case, disk encryption does make it substantially more difficult, or even impossible, for Eve to recover the cleartext of the encrypted data.

Unfortunately for Alice and Bob, a well-equipped attacker has several options here (which may vary depending on exactly how Apple's implementation works; many of the details are not public).

If the Secure Enclave code is able to read the UID key, then it may be possible to exfiltrate the key using software-based methods. This could potentially be done by finding a vulnerability in the Secure Enclave (as was previously done with the TrustZone kernel on Qualcomm Android devices to unlock the bootloader). In addition, if Eve works for an intelligence agency, she could potentially send an NSL to Apple demanding that they write firmware, or sign an agency-provided image, to dump the UID off a handset.

In the extreme case, it might even be possible for Eve to compromise Apple's network and exfiltrate the certificate used for signing Secure Enclave images. (There is precedent for this sort of attack - the authors of Stuxnet appear to have stolen a driver-signing certificate from Realtek.)

If Apple did their job properly, however, the UID is completely inaccessible to software and is locked up in some kind of on-die hardware security module (HSM). This means that even if Eve is able to execute arbitrary code on the device while it is locked, she must bruteforce the passcode on the device itself - a very slow and time-consuming process.

In this case, an attacker may still be able to execute an invasive physical attack. By depackaging the SoC, etching or polishing down to the polysilicon layer, and looking at the surface of the die with an electron microscope the fuse bits can be located and read directly off the surface of the silicon.

E-fuse bits on polysilicon layer of an IC (National Semiconductor DMPAL16R). Left side and bottom right fuses are blown, upper right is conducting. (Note that this is a ~800nm process, easily readable with an optical microscope. The Apple A7 is made on a 28nm process and would require an electron microscope to read.) Photo by John McMaster, CC-BY
Since the key is physically burned into the IC, once power is removed from the phone there's no practical way for any kind of self-destruct to erase it. Although this would require a reasonably well-equipped attacker, I'm pretty confident based on my previous experience that I could do it myself, with equipment available to me at school, if I had a couple of phones to destructively analyze and a few tens of thousands of dollars to spend on lab time. This is pocket change for an intelligence agency.

Once the UID is extracted, and the encrypted disk contents dumped from the flash chips, an offline bruteforce using GPUs, FPGAs, or ASICs could be used to recover the key in a fairly short time. Some very rough numbers I ran recently suggest that an 6-character upper/lowercase alphanumeric SHA-1 password could be bruteforced in around 25 milliseconds (1.2 trillion guesses per second) by a 2-rack, 2500-chip FPGA cluster costing less than $250,000. Luckily, the iPhone uses an iterated key-derivation function which is substantially slower.

The key derivation function used on the iPhone takes approximately 50 milliseconds on the iPhone's CPU, which comes out to about 70 million clock cycles. Performance studies of AES on a Cortex-A8 show about 25 cycles per byte for encryption plus 236 cycles for the key schedule. The key schedule setup only has to be done once so if the key is 32 bytes then we have 800 cycles per iteration, or about 87,500 iterations.

It's hard to give exact performance numbers for AES bruteforcing on an FPGA without building a cracker, but if pipelined to one guess per clock cycle at 400 MHz (reasonable for a modern 28nm FPGA) an attacker could easily get around 4500 guesses per second per hash pipeline. Assuming at least two pipelines per FPGA, the proposed FPGA cluster would give 22.5 million guesses per second - sufficient to break a 6-character case-sensitive alphanumeric password in around half an hour. If we limit ourselves to lowercase letters and numbers only, it would only take 45 seconds instead of the five and a half years Apple claims bruteforcing on the phone would take. Even 8-character alphanumeric case-sensitive passwords could be within reach (about eight weeks on average, or faster if the password contains predictable patterns like dictionary words).

Wednesday, September 17, 2014

Threat modeling for FPGA software backdoors

I've been interested in the security of compilers and related toolchains ever since I first read about Ken Thompson's compiler backdoor many years ago. In a nutshell, this famous backdoor does two things:

  • Whenever the backdoored C compiler compiles the "login" command, it adds a second code path that accepts a hard-coded default password in addition to the user's actual password
  • Whenever the backdoored C compiler compiles the unmodified source code of itself, it adds the backdoor to the resulting binary.
The end result is a compiler that looks fine at the source level, silently backdoors a critical system file at compilation time, and reproduces itself.

Recently, there has also been a lot of concern over backdoors in integrated circuits (either added at the source code level by a malicious employee, or at the netlist/layout level by a third-party fab). DARPA even has a program dedicated to figuring out ways of eliminating or detecting such backdoors. A 2010 paper stemming from the CSAW Embedded Systems Challenge presents a detailed taxonomy of such hardware Trojans.

As far as I can tell, the majority of research into hardware Trojans has been focused on detecting them, assuming the adversary has managed to backdoor the design in some way that provides him with a tactical or strategic advantage. I have had difficulty finding detailed threat modeling research quantifying the capability of the adversary under particular scenarios.

When we turn our attention to FPGAs, things quickly become even more interesting. There are several major differences between FPGAs and ASICs from a security perspective which may grant the adversary greater or lesser capability than with an ASIC.

Attacks at the IC fab

The function of an ASIC is largely defined at fab time (except for RAM-based firmware) while FPGAs are extremely flexible. When trying to backdoor FPGA silicon the adversary has no idea what product(s) the chip will eventually make it into. They don't even know which pins on the die will be used as inputs and which as outputs.

I suspect that this places substantial bounds on the capability of an attacker "Malfab" targeting FPGA silicon at the fab (or pre-fab RTL/layout) level since the actual RTL being targeted does not even exist yet. To start, we consider a generic FPGA without any hard IP blocks:
  • Malfab does not know which flipflops/SRAM/LUTs will eventually contain sensitive data, nor what format this data may take.
  • Malfab does not know which I/O pins may be connected to an external communications interface useful for command-and-control.
As a result, his only option is to create an extremely generic backdoor. At this level, the only thing that makes sense is to connect all I/O pins (perhaps through scan logic) to a central malware logic block which provides the ability to read (and possibly modify) all state in the device. This most likely would require two major subsystems:
  • A detector, which searches I/Os for a magic sync sequence
  • A connection from that detector to the FPGA's internal configuration access port (ICAP), used for partial reconfiguration and readback.
The design of this protocol would be very challenging since the adversary does not know anything about the external interfaces the pin may be connected to. The FPGA could be in a PLC or similar device whose only external contact is RS-232 serial on a single input pin. Perhaps it is in a network router/switch using RGMII (4-bit parallel with double data rate signalling).

I am not aware of any published work on the feasibility of such a backdoor however I am skeptical that a sufficiently generic Trojan could be made simple enough to evade even casual reverse engineering of the I/O circuitry, and fast enough to not seriously cripple performance of the device.

Unfortunately for our defender Alice, modern FPGAs often contain hard IP blocks such as SERDES and RAM controllers. These present a far more attractive target to Malfab as their functionality is largely known in advance.

It is not hard to imagine, for example, a malicious patch to the RAM controller block which searches each byte group for a magic sync sequence written to consecutive addresses, then executes commands from the next few bytes. As long as Malfab is able to cause the target's system to write data of his choice to consecutive RAM addresses (perhaps by sending it as the payload of an Ethernet frame, which is then buffered in RAM) he can execute arbitrary commands on the backdoor engine. If one of these commands is "write data from SLICE_X37Y42.A5FF to RAM address 0xdeadbeef", and Malfab can predict the location of a transmit buffer of some sort, he now has the ability to exfiltrate arbitrary state from Alice's hardware.

I thus conjecture that the only feasible way to backdoor a modern FPGA at fab time is through hard IP. If we ensure that the JTAG interface (the one hard IP block whose use cannot be avoided) is not connected to attacker-controlled interfaces, use off-die SERDES, and use softcore RAM controllers on non-standard pins, it is unlikely that Malfab will be able to meaningfully affect the security of the resulting circuit.

Attacks on the toolchain

We now turn our attention to a second adversary, Maldev - the malicious development tool. Maldev works for the FPGA vendor, has compromised the source repository for their toolchain, has MITMed the download of the toolchain installer, or has penetrated Alice's network and patched the software on her computer.

Since FPGAs are inherently closed systems (more so than ASICs, in which multiple competing toolchains exist), Alice has no choice but to use the FPGA vendor's binary-blob toolchain. Although it is possible in theory for a dedicated team with sufficient time and budget to reverse engineer the FPGA and/or toolchain and create a trusted open-source development suite, I discount the possibility for the scope of this section since a fully trusted toolchain is presumably free of interesting backdoors ;)

Maldev has many capabilities lacking to Malfab since he can execute arbitrary code on Alice's computer. Assuming that Alice is (justifiably) paranoid about the provenance of her FPGA software and runs it on a dedicated machine in a DMZ (so that it cannot infect the remainder of her network), this is equivalent to having full access to her RTL and netlist at all stages of design.

If Alice gives her development workstation Internet access, Maldev now has the ability to upload her RTL source and/or netlist, modify it at will on his computer, and then push patches back. This is trivially equivalent to a full defeat of the entire system.

Things become more interesting when we cut off command-and-control access. This is a realistic scenario if Alice is a military/defense user doing development on a classified network with no Internet connection.

The simplest attack is for Maldev to store a list of source file hashes and patches in the compromised binary. While this is very limited (custom-developed code cannot be attacked at all), many design teams are likely to use a small set of stock communications IP such as the Xilinx Tri-Mode Ethernet MAC, so patching these may be sufficient to provide him with an attack vector on the target system. Looking for AXI interconnect IP provides Maldev with a topology map of the target SoC.

Another option is graph-based analytics on the netlist at various stages of synthesis. For example, by looking for a 32-bit register initialized to 0x67452301, which is in a strongly connected component with three other registers initialized to 0xefcdab89, 0x98badcfe, and 0x10325476, Maldev can say with a high probability that he has found an implementation of MD5 and located the state registers. By looking for a 128-bit comparator between these values and another 128-bit value, a hash match check has been found (and a backdoor may be inserted). Similar techniques may be used to look for other cryptography.


If FPGA development is done using silicon purchased before the start of the project, on an air-gapped machine, and without using any pre-made IP, then some bounds can clearly be placed on the adversary's capability.

I have not seen any formal threat modeling studies on this subject, although I haven't spent a ton of time looking for them due to research obligations. If anyone is aware of published work in this field I'm extremely interested.