Monday, March 31, 2014

Getting my feet wet with invasive attacks, part 2: The attack

This is part 2 of a 2-part series. Part 1, Target Recon, is here.

Once I knew what all of the wires in the ZIA did, the next step was to plan an attack to read signals out.

I decapped an XC2C32A with concentrated sulfuric acid and soldered it to my dev board to verify that it was alive and kicking.

Simple CR-II dev board with integrated FTDI USB-JTAG
After testing I desoldered the sample and brought it up to campus to introduce it to some 30 keV Ga+ ions.

I figured that all of the exposed packaging would charge, so I'd need to coat the sample with something. I normally used sputtered Pt but this is almost impossible to remove after deposition so I decided to try evaporated carbon, which can be removed nicely with oxygen plasma among other things.

I suited up for the cleanroom and met David Frey, their resident SEM/FIB expert, in front of the Zeiss 1540 FIB system. He's a former Zeiss engineer who's very protective of his "baby" and since I had never used a FIB before there was no way he was going to let me touch his, so he did all of the work while I watched. (I don't really blame him... FIB chambers are pretty cramped and it's easy to cause expensive damage by smashing into something or other. Several SEMs I've used have had one detector or another go offline for repair after a more careless user broke something.)

The first step was to mill a hole through the 900 nm or so of silicon nitride overglass using the ion beam.

Newly added via, not yet filled
Once the via was drilled and it appeared we had made contact with the signal trace, it was time to backfill with platinum. The video below is sped up 10x to avoid boring my readers ;)


Metal deposition in a FIB is basically CVD: a precursor gas is injected into the chamber near the sample and it decomposes under the influence of beam-generated secondary electrons.

Once the via was filled we put down a large (20 μm square) pad we could hit with an electrical probe needle.

Probe pad
Once everything was done and the chamber was vented I removed the carbon coating with oxygen plasma (the cleanroom's standard photoresist removal process), packaged up my sample, went home, and soldered it back to the board for testing. After powering it up... nothing! The device was as dead as a doornail; I couldn't even get a JTAG IDCODE from it.

I repeated the experiment a week or two later, this time soldering bare stub wires to the pins so I could test by plugging the chip into a breadboard directly. This failed as well, but watching my benchtop power supply gave me a critical piece of information: while VCCINT was drawing the expected current (essentially zero), VCCIO was drawing upwards of 20 mA.

This ruled out beam-induced damage as I had not been hitting any of the I/O circuitry with the ion beam. Assuming that the carbon evaporation process was safe (it's used all the time on fragile samples, so this seemed a reasonably safe assumption for the time being), this left only the plasma clean as the potential failure point.

I realized what was going on almost instantly: the antenna effect. The bond wire and leadframe connected to each pad in the device were acting as an antenna and coupling some of the 13.56 MHz RF energy from the plasma into the input buffers, blowing out the ESD diodes and input transistors, and leaving me with a dead chip.

This left me with two possible ways to proceed: removing the coating by chemical means (a strong oxidizer could work), or not coating at all. I decided to try the latter since there were fewer steps to go wrong.

Somewhat surprisingly, the cleanroom staff had very limited experience working with circuit edits - almost all of their FIB work was process metrology and failure analysis rather than rework, so they usually coated the samples.

I decided to get trained on RPI's other FIB, the brand-new FEI Versa 3D. It's operated by the materials science staff, who are a bit less of the "helicopter parent" type and were actually willing to give me hands-on training.

FEI Versa 3D SEM/FIB
The Versa can do almost everything the older 1540 can do, in some cases better. Its one limitation is that it only has a single-channel gas injection system (platinum) while the 1540 is plumbed for platinum, tungsten, SiO2, and two gas-assisted etches.

After a training session I was ready to go in for an actual circuit edit.

FIB control panel
The Versa is the most modern piece of equipment I've used to date: it doesn't even have the classical joystick for moving the stage around. Almost everything is controlled by the mouse, although a USB-based knob panel for adjusting magnification, focus, and stigmators is still provided for those who prefer to turn something with their fingers.

Its other nice feature is the quad-image view which lets you simultaneously view an ion beam image, an e-beam image, the IR camera inside the chamber (very helpful for not crashing your sample into a $10,000 objective lens!), and a navigation camera which displays a top-down optical view of your sample.

The nav-cam has saved me a ton of time. On RPI's older JSM-6335 FESEM, the minimum magnification is fairly high so I find myself spending several minutes moving my sample around the chamber half-blind trying to get it under the beam. With the Versa's nav-cam I'm able to set up things right the first time.

I brought up both of the beams on the aluminum sample mounting stub, then blanked them to try a new idea: Move around the sample blind, using the nav-cam only, then take single images in freeze-frame mode with one beam or the other. By reducing the total energy delivered to the sample I hoped to minimize charging.

This strategy was a complete success: I had some (not too severe) charging from the e-beam but almost no visible charging in the I-beam.

The first sample I ran on the Versa was electrically functional afterwards, but the probe pad I deposited was too thin to make reliable contact with. (It was also an XC2C64A since I had run out of 32s.) Although not a complete success, it did show that I had a working process for circuit edits.

After another batch of XC2C32As arrived, I went up to campus for another run. The signal of interest was FB2_5_FF: the flipflop for function block 2 macrocell 5. I chose this particular signal because it was the leftmost line in the second group from the left and thus easy to recognize without having to count lines in a bus.

The drilling went flawlessly, although it was a little tricky to tell whether I had gone all the way to the target wire or not in the SE view. Maybe I should start using the backscatter detector for this?

Via after drilling before backfill
I filled in the via and made sure to put down a big pile of Pt on the probe pad so as to not repeat my last mistake.

The final probe pad, SEM image
Seen optically, the new pad was a shiny white with surface topography and a few package fragments visible through it.

Probe pad at low mag, optical image
At higher magnification a few slightly damaged CMP filler dots can be seen above the pad. I like to use filler metal for focusing and stigmating the ion beam at milling currents before I move to the region of interest because it's made of the same material as my target, it's something I can safely destroy, and it's everywhere - it's hard to travel a significant distance on a modern IC without bumping into at least a few pieces of filler metal.

Probe pad at higher magnification, optical image. Note damaged CMP filler above pad.
I soldered the CPLD back onto the board and was relieved to find out that it still worked! The next step was to write some dummy code to test it out:

`timescale 1ns / 1ps
module test(clk_2048khz, led);

 //Clock input
 (* LOC = "P1" *) (* IOSTANDARD = "LVCMOS33" *)
 input wire clk_2048khz;
 
 //LED out
 (* LOC = "P38" *) (* IOSTANDARD = "LVCMOS33" *)
 output reg led = 0;
 
 //Don't care where this is placed
 reg[17:0] count = 0;
 always @(posedge clk_2048khz)
  count <= count + 1;
  
 //Probe-able signal on FB2_5 FF at 2x the LED blink rate
 (* LOC = "FB2_5" *) reg toggle_pending = 0;
 always @(posedge clk_2048khz) begin
  if(count == 0)
   toggle_pending <= !toggle_pending;
 end
 
 //Blink the LED
 always @(posedge clk_2048khz) begin
  if(toggle_pending && (count == 0))
   led <= !led;
 end

endmodule


Together, the 18-bit count register, toggle_pending, and led form a 20-bit counter that blinks an LED at ~2 Hz from the 2048 kHz clock on the board. The second-to-last stage of that counter (toggle_pending, so ~4 Hz) is constrained to FB2_5, the signal we're probing.
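
The blink rates follow directly from the divider chain in the Verilog: an 18-bit count register, then toggle_pending, then led. Here is a quick sanity check in Python (not part of the original design; the function and variable names are just for illustration):

```python
# Sanity check of the divider chain in the test design.
# Clock frequency and register width come from the Verilog above.
CLK_HZ = 2_048_000      # 2048 kHz board clock
COUNT_BITS = 18         # reg[17:0] count

def square_wave_hz(toggle_events_per_second):
    """A signal that toggles N times per second is an N/2 Hz square wave."""
    return toggle_events_per_second / 2

# count wraps to zero once every 2**18 clocks
wraps_per_second = CLK_HZ / 2**COUNT_BITS

# toggle_pending flips on every wrap
toggle_pending_hz = square_wave_hz(wraps_per_second)

# led flips only on wraps where toggle_pending was already high,
# i.e. on every second wrap
led_hz = square_wave_hz(wraps_per_second / 2)

print(f"count wraps:    {wraps_per_second:.4f} per second")
print(f"toggle_pending: {toggle_pending_hz:.4f} Hz")
print(f"led:            {led_hz:.4f} Hz")
```

So the probed flipflop should show a square wave just under 4 Hz, with the LED blinking at half that.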

After making sure things still worked I attached the board's plastic standoffs to a 4" scrap silicon wafer with Gorilla Glue to give me a nice solid surface I could put on the prober's vacuum chuck.

Test board on 4" wafer
Earlier today I went back to the cleanroom. After dealing with a few annoyances (for example, the prober with a wide range of Z axis travel, necessary for this test, was plugged into the electrical test station with curve tracing capability but no oscilloscope card) I landed a probe on the bond pad for VCCIO and one on ground to sanity check things. 3.3V... looks good.

Moving carefully, I lifted the probe up from the 3.3V bond pad and landed it on my newly added probe pad.

Landing a probe on my pad. Note speck of dirt and bent tip left by previous user. Maybe he poked himself mounting the probe?
It took a little bit of tinkering with the test unit to figure out where all of the trigger settings were, but I finally saw a ~1.8V, 4 Hz squarewave. Success!

Waveform sniffed from my probe pad
There's still a bit of tweaking needed before I can demo it to my students (among other things, the oscilloscope subsystem on the tester insists on trying to use the 100V input range, so I only have a few bits of ADC precision left to read my 1.8V waveform) but overall the attack was a success.

Getting my feet wet with invasive attacks, part 1: Target recon

This is part 1 of a 2-part series. Part 2, The Attack, is here.

One of the reasons I've gone a bit dark lately is that running CSCI 6974, RPI's experimental hardware reverse engineering class, has been eating up a lot of my time.

I wanted to make the final lab for the course a nice climax to the semester and do something that would show off the kinds of things that are possible if you have the right gear, so it had to be impressive and technically challenging. The obvious choice was a FIB circuit edit combined with invasive microprobing.

After slaving away for quite a while (this was started back in January or so) I've managed to get something ready to show off :) The work described here will be demonstrated in front of my students next week as part of the fourth lab for the class.

The first step was to pick a target. I was interested in the Xilinx XC2C32A for several reasons and was already using other parts of the chip as a teaching subject for the class. It's a pure-digital CMOS CPLD (no analog sense amps and a fairly regular structure) made on a relatively modern process (180 nm 4-metal UMC) but not so modern as to be insanely hard to work with. It was also quite cheap ($1.25 a pop for the slowest speed grade in VQG44 package on DigiKey) so I could afford to kill plenty of them during testing.

The next step was to decap a few, label interesting pins, and draw up a die floorplan. Here's a view of the die at the implant layer after Dash etch; P-type doping shows up as brown. (John did all of the staining work and got great results. Thanks!)

XC2C32A die floorplan after Dash etch
The bottom half of the die is support infrastructure with EEPROM banks for storing the configuration bitstream toward the center and JTAG/configuration stuff in a U-shape below and to either side of the memory array. (The EEPROM is mislabeled "flash" in this image because I originally assumed it was 1T NOR flash. Higher magnification imaging later showed this to be wrong; the bit cells are 2T EEPROM.)

The top half of the die is the actual programmable logic, laid out in a "butterfly" structure. The center spine is the ZIA (global routing, also referred to as the AIM in some datasheets), which takes signals from the 32 macrocell flipflops and 33 GPIO pins and routes them into the function blocks. To either side of the spine are the two FBs, which consist of an 80 x 56 AND array (simplifying a bit... the actual structure is more like 2 blocks x 20 rows x 2 interleaved cells x 56 columns), a 56 x 16 OR array, and 16 macrocells.

I wanted some interesting data to show my students so there were two obvious choices. First, I could try to defeat the code protection somehow and read bitstreams out of a locked device via JTAG. Second, I could try to read internal device state at run time. The second seemed a bit easier so I decided to run with it (although defeating the lock bits is still on my longer-term TODO.)

The obvious target for probing internal runtime state is the ZIA, since all GPIO inputs and flipflop states have to go through here. Unfortunately, it's almost completely undocumented! Here's the sum total of what DS090 has to say about it (pages 5-6):
"The Advanced Interconnect Matrix is a highly connected low power rapid switch. The AIM is directed by the software to deliver up to a set of 40 signals to each FB for the creation of logic. Results from all FB macrocells, as well as, all pin inputs circulate back through the AIM for additional connection available to all other FBs as dictated by the design software. The AIM minimizes both propagation delay and power as it makes attachments to the various FBs."
Thanks for the tidbit, Xilinx, but this really isn't gonna cut it. I need more info!

The basic ZIA structure was pretty obvious from inspection of the implant layer: 20 identical copies of the same logic. Since the ZIA delivers up to 40 signals to each FB, this suggested that each row was responsible for feeding two signals left and two right.

SEM imaging of the implant layer showed the basic structure to be largely, but not entirely, symmetric about the left-right axis. At the far outside a few cells of the PLA AND array can be seen. Moving toward the center is what appears to be a 3-stage buffer, presumably for driving the row's output into the PLA. The actual routing logic is at center.

The row appeared entirely symmetric top-to-bottom so I focused my future analysis on the upper half.

Single row of the ZIA seen at the implant layer after Dash etch. Light gray is P-type doping, medium gray is N-type doping, dark gray is STI trenches.
Looking at the top metal layer revealed the expected 65 signals.

Single row of the ZIA seen on metal 4
The signals were grouped into six groups with 11, 11, 11, 11, 11, and 10 signals in them. This led me to suspect that there was some kind of six-fold structure to the underlying circuitry, a suspicion which was later proven correct.

Inspection of the configuration EEPROM for the ZIA showed it to be 16 bits wide by 48 rows high.

ZIA configuration EEPROM (top few rows)
Since the global configuration area in the middle of the chip was 8 rows high this suggested that each of the 40 remaining EEPROM rows configured the top or bottom half of a ZIA row.

Of the 16 bits in each row, 8 bits presumably controlled the left-hand output and 8 controlled the right. This didn't make a lot of sense at first: dense binary coding would require only 7 bits for 65 channels and one-hot coding would need 65 bits.
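
The mismatch is easy to see numerically. Here is a small Python check using only the figures quoted above (65 bus signals, a 16 x 48 EEPROM with 8 rows of global configuration, 20 ZIA rows); the variable names are mine:

```python
SIGNALS = 65              # wires on the M4 bus
EEPROM_ROWS = 48
GLOBAL_CONFIG_ROWS = 8
EEPROM_BITS_PER_ROW = 16
ZIA_ROWS = 20

# Rows left over for the ZIA: one per top/bottom half of each ZIA row
zia_config_rows = EEPROM_ROWS - GLOBAL_CONFIG_ROWS
assert zia_config_rows == 2 * ZIA_ROWS

# Each EEPROM row splits into a left-hand and a right-hand output
bits_per_output = EEPROM_BITS_PER_ROW // 2

# Dense binary select: enough bits to count from 0 to SIGNALS - 1
binary_bits = (SIGNALS - 1).bit_length()
# One-hot select: one bit per input
one_hot_bits = SIGNALS

print(bits_per_output, binary_bits, one_hot_bits)   # prints: 8 7 65
```

Eight bits per output is one more than dense binary needs and far fewer than one-hot needs, so neither simple encoding fits.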

Reading documentation for related device families sometimes helps to shed some light on how a part was designed, so I took a look at some of the whitepapers for the older 350 nm CoolRunner XPLA3 series. They went into some detail on how full crossbar routing was wasteful of chip area and often not necessary to get sufficient routability. You don't need to be able to generate every one of the 40! permutations of a given subset of signals as long as you can route every signal somehow. Instead, the XPLA3's designers connected only a handful of the inputs to each row and varied the input selection for each row so as to allow almost every possible subset to be selected somehow.

This suggested a 2-level hierarchy to the ZIA mux. Instead of being a 65:1 mux it was a 65:N hard-wired mux followed by an N:1 programmable mux feeding left and another N:1 feeding right. 6 seemed to be a reasonable guess for N, given the six groups of wires on metal 4.

ZIA mux structure
This hypothesis was quickly confirmed by looking at M3 and M3-M4 vias: Each row had six short wires on M3, one under each of the six groups of wires in the bus. Each of these short lines was connected by one via to one of the bus lines on M4. The via pattern varied from row to row as expected.

ZIA M3-M4 vias
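
The two-level structure can be modeled in a few lines of Python. This is only a sketch: the via positions are made up for illustration (the real pattern differs from row to row and is not reproduced here), bit polarity is ignored, and `GROUP_SIZES`, `make_row`, and `route` are hypothetical names.

```python
# Sketch of one ZIA output: a mask-programmed 65:6 stage (one M3-M4 via
# per group of bus wires) feeding a programmable one-hot 6:1 mux.
GROUP_SIZES = [11, 11, 11, 11, 11, 10]   # wire groups seen on metal 4

def make_row(via_offsets):
    """Turn one via offset per group into absolute bus-wire indices."""
    assert len(via_offsets) == len(GROUP_SIZES)
    taps, base = [], 0
    for offset, size in zip(via_offsets, GROUP_SIZES):
        assert 0 <= offset < size, "via must land inside its group"
        taps.append(base + offset)
        base += size
    return taps

def route(bus, taps, select):
    """Return the bus value picked by a one-hot select, or None if idle."""
    assert sum(select) <= 1, "select must be one-hot (or all zero)"
    for tap, enabled in zip(taps, select):
        if enabled:
            return bus[tap]
    return None

# Hypothetical via pattern for a single row (NOT taken from the real die)
taps = make_row([3, 0, 7, 10, 5, 9])
bus = list(range(65))                         # label each wire with its index
print(taps)                                   # [3, 11, 29, 43, 49, 64]
print(route(bus, taps, [0, 0, 1, 0, 0, 0]))   # 29
```

Each row can only reach six of the 65 wires, so routability depends on the via pattern varying between rows.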

I extracted the full via pattern by copying a tracing of M4 over the M3 image and using the power vias running down the left side as registration marks. (Pro tip: Using a high accelerating voltage, like 20 kV, in a SEM gives great results on aluminum processes with tungsten via plugs. You get backscatters from vias through the metal layer that you can use for aligning image stacks.) A few of the rows are shown above.

At this point I felt I understood most of the structure so the next step was full circuit extraction! I had John CMP a die down to each layer and send them to me for high-res imaging in the SEM.

The output buffers were fairly easy. As I expected they were just a 3-stage inverter cascade.

Output buffer poly/diffusion/contact tracing

Output buffer M1 tracing
Output buffer gate-level schematic

Individual cell schematics
Nothing interesting was present on any of the upper layers above here, just power distribution.

The one surprising thing about the output buffer was that the NMOS on the third stage had a substantially wider channel than the PMOS. This is probably something to do with optimizing output rise/fall times.

Looking at the actual mux logic showed that it was mostly tiles of the same basic pattern (a 6T SRAM cell, a 2-input NOR gate, and a large multi-fingered NMOS pass transistor) except for the far left side.

Gate-level layout of mux area

Left side of mux area, gate-level layout
The same SRAM-feeding-NOR2 structure is seen, but this time the output is a small NMOS or PMOS pass transistor.

After tracing M1, it became obvious what was going on.

Left side of mux area, M1

The upper and lower halves control the outputs to function blocks 1 and 2 respectively. The two SRAM bits allow each output (labeled MUXOUT_FBx) to be pulled high, pulled low, or left floating. A global reset line of some sort, labeled OGATE, is used to gate all logic in the entire ZIA (and presumably the rest of the chip); when OGATE is high the SRAM bits are ignored and the output is forced high.

Here's what it looks like in schematic:

Gate-level schematics of pullup/pulldown logic
Cell schematics
In the schematics I drew the NOR2v0x1 cell as its de Morgan dual (AND with inverted inputs) since this seemed to make more sense in the context of the circuit: the output is turned on when the active-low input is low and OGATE is turned off.

It's interesting to note that while almost all of the config bits in the circuit are active-low, PULLUP is active-high. This is presumably done to allow the all-ones state (a blank EEPROM array) to put the muxes in a well-defined state rather than floating.

Turning our attention to the rest of the mux array shows a 6:1 one-hot-coded mux made from NMOS pass transistors. This, combined with the 2 bits needed for the pull-high/pull-low module, adds up to the expected 8. The same basic pattern shown below is tiled three times.
Basic mux tile, poly/implant
Basic mux tile, M1
(Sorry for the misalignment of the contact layer, this was a quick tracing and as long as I was able to make sense of the circuit I didn't bother polishing it up to look pretty!)

The resulting schematic:

Schematic of muxes
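
Putting the pieces together, the 8 configuration bits for one output can be modeled roughly as follows. This is a behavioral sketch, not extracted logic. The argument layout, the `pulldown_n` name, the assumption that the pull-down bit is active-low, and the handling of more than one enabled driver are all mine. Only the active-low mux selects, the active-high PULLUP, and OGATE forcing the output high come from the description above.

```python
def decode_output(select_n, pullup, pulldown_n, ogate=False):
    """Describe what drives one MUXOUT line.

    select_n   -- six active-low one-hot mux select bits
    pullup     -- active-high pull-up enable
    pulldown_n -- pull-down enable, ASSUMED active-low
    ogate      -- global gate; when high the config bits are ignored
    """
    if ogate:
        return "forced high"
    selected = [i for i, bit in enumerate(select_n) if bit == 0]
    drivers = [f"mux input {i}" for i in selected]
    if pullup:
        drivers.append("pulled high")
    if pulldown_n == 0:
        drivers.append("pulled low")
    if not drivers:
        return "floating"
    if len(drivers) > 1:
        # How the real circuit behaves here is not known; just flag it
        return "multiple drivers: " + ", ".join(drivers)
    return drivers[0]

# Blank (all-ones) EEPROM: every active-low bit is off, PULLUP is on
print(decode_output([1] * 6, pullup=1, pulldown_n=1))             # pulled high
# Route input 2 through: one select low, both pulls disabled
print(decode_output([1, 1, 0, 1, 1, 1], pullup=0, pulldown_n=1))  # mux input 2
```

The first call shows why an active-high PULLUP is convenient: an unprogrammed array leaves every output in a defined state.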

M2 was used for some short-distance routing as well as OGATE, power/ground busing, and the SRAM bit lines.

M2 and M2-M3 vias


M3 was used for OGATE, power busing, SRAM word lines, the mask-programmed muxes, and the tri-state bus within the final mux.



M3 and M3-M4 vias

And finally, M4. I never found out what the leftmost power line went to; it didn't appear to be VCCINT or ground but was obviously power distribution. There's no reason for VCCIO to be running down the middle of the array so maybe VCCAUX? Reversing the global config logic may provide the answer.

M4
A bit of trial and error poking bits in bitstreams was sufficient to determine the ordering of signals. From right to left we have FB1's GPIO pins, the input-only pin, FB2's GPIO pins, then FB1's flipflops and finally FB2's flipflops.
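
The block ordering can be written down as a small lookup table in Python. Two things in this sketch are assumptions rather than findings: that numbering within each block increases from right to left, and that the 10-wire group sits at the right-hand end of the bus. The `_IO` and `INPUT_ONLY` names are also made up; only the `FBx_y_FF` style comes from the text.

```python
# Block order from right to left, as determined from bitstream experiments.
# Ordering *within* each block is an assumption (increasing right to left).
BLOCKS_RIGHT_TO_LEFT = [
    ("FB1_{}_IO", 16),
    ("INPUT_ONLY", 1),
    ("FB2_{}_IO", 16),
    ("FB1_{}_FF", 16),
    ("FB2_{}_FF", 16),
]

# Assumed: the single 10-wire group is the rightmost one
GROUP_SIZES_LEFT_TO_RIGHT = [11, 11, 11, 11, 11, 10]

def bus_left_to_right():
    """List all bus wire names as seen from left to right."""
    wires = []
    for template, count in BLOCKS_RIGHT_TO_LEFT:
        for n in range(1, count + 1):
            wires.append(template.format(n))
    wires.reverse()
    return wires

def leftmost_of_group(group_index):
    """Name of the leftmost wire in the given group (0 = leftmost group)."""
    start = sum(GROUP_SIZES_LEFT_TO_RIGHT[:group_index])
    return bus_left_to_right()[start]

print(len(bus_left_to_right()))   # 65
print(leftmost_of_group(1))       # FB2_5_FF
```

Under both assumptions the leftmost wire of the second group from the left comes out as FB2_5_FF, the signal targeted in part 2. Treat that as a consistency check on the assumptions, not as confirmation of the exact ordering.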

Now that I had good intel on the target, it was time to plan the strike!

Part 2, The Attack, is here.

Monday, March 24, 2014

Microchip PIC32MZ process vs PIC32MX

Those of you keeping an eye on the MIPS microcontroller world have probably heard of Microchip's PIC32 series parts: MIPS32 CPU cores licensed from MIPS Technologies (bought by Imagination Technologies recently) paired with peripherals designed in-house by Microchip.
Although they're sold under the PIC brand name they have very little in common with the 8/16 bit PIC MCUs. They're fully pipelined processors with quite a bit of horsepower.

The PIC32MX family was the first to be introduced, back in 2009 or so. They're a MIPS M4K core at up to 80 MHz and max out at 128 KB of SRAM and 512 KB of NOR flash plus a fairly standard set of peripherals.

PIC32MX microcontroller

Somewhat disappointingly, the PIC32MX MMU is fixed mapping and there is no external bus interface. Although there is support for user/kernel privilege separation, all userspace code shares one address space. Another minor annoyance is that all PIC32MX parts run from a fixed 1.8V on-die LDO which normally cannot (the 300 series is an exception) be disabled or bypassed to run from an external supply.

The PIC32MZ series is just coming out now. They're so new, in fact, that they show as "future product" on Microchip's website and you can only buy them on dev boards, although I'm told by around Q3-Q4 of this year they'll be reaching distributors. They fix a lot of the complaints I have with PIC32MX and add a hefty dose of speed: 200 MHz max CPU clock and an on-die L1 cache.

PIC32MZ microcontroller

On-chip memory in the PIC32MZ is increased to up to 512 KB of SRAM and a whopping 2 MB of flash in the largest part. The new CPU core has a fully programmable MMU and support for an external bus interface capable of addressing up to 16MB of off-chip address space.

I'm a hacker at heart, not just a developer, so I knew the minute I got one of these things I'd have to tear it down and see what made it tick. I looked around for a bit, found a $25 processor module on Digikey, and picked it up.

The board was pretty spartan, which was fine by me as I only wanted the chip.

PIC32MZ processor module
Less than an hour after the package had arrived, I had the chip desoldered and simmering away in a beaker of sulfuric acid. I had done a PIC32MX340F512H a few days previously to provide comparison shots.

Without further ado, here are the top metal shots:

PIC32MX340F512H
PIC32MZ2048ECH
These photos aren't to scale; the MZ is huge (about 31.9 mm²). By comparison the MX is around 20 mm².

From an initial impression, we can see that although both run at the same core voltage (1.8V) the MZ is definitely made on a newer, significantly smaller fab process. While the top layer of the MX is fine-pitch signal routing, the top layer of the MZ is (except in a few blocks which appear to contain analog circuitry) completely filled with power distribution routing.

Top layer closeups of MZ (left), MX (right), same scale

Thick power distribution wiring on the top layer is a hallmark of deep-submicron processes, 130 nm and below. Most 180 nm or larger devices have at least some signal routing on the top layer.

Looking at the mask revision markings gives a good hint as to the layer count and stack-up.

Mask rev markings on MZ (left), MX (right), same scale
The MZ appears to be one thick aluminum layer and five thin copper layers for a total of six, while the MX is four layers and probably all aluminum.

Enough with the top layer... time to get down! Both samples were etched with HF until all metal and poly were removed.

The first area of interest was the flash.

NOR flash on MZ (left), MX (right), different scales
Both arrays appear to be the same standard NOR structure, although the MZ's array is quite a bit denser: the bit cell pitch is 643 x 270 nm (0.173 μm²/bit) while the MX's is 1015 x 676 nm (0.686 μm²/bit). The 3.96x density increase suggests a roughly 2x process shrink.

The white cylinders littering the MX die are via plugs, most likely tungsten, left over after the HF etch. The MZ appears to use a copper damascene process without via plugs, although since no cross section was performed details of layer thicknesses etc are unavailable.

The next target was the SRAM.

6T SRAM on MZ (left), MX (right), different scales
Here we start to see significant differences. The MX uses a fairly textbook 6T "doughnut + H" SRAM structure while the MZ uses a more modern lithography-optimized pattern made of all straight lines with no angles, which is easier to etch. This kind of bit cell is common in leading-edge processes but this is the first time I've seen it in a commodity MCU.

Cell pitch for the MZ is 1345 x 747 nm (1.00 μm²/bit) while the MX is 1895 x 2550 nm (4.83 μm²/bit). This is roughly a 4.8x increase in density.
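
The cell-pitch figures quoted for the flash and SRAM arrays can be cross-checked with a few lines of Python. This uses only the pitches given in the text; the function and dictionary names are mine. Results can differ from the quoted areas and ratios in the last digit, since the pitches themselves are rounded measurements.

```python
def cell_area_um2(width_nm, height_nm):
    """Bit cell area in square microns from its pitch in nanometres."""
    return width_nm * height_nm / 1e6

# Bit cell pitch (width, height) in nm, as measured above
cells = {
    "flash": {"MZ": (643, 270), "MX": (1015, 676)},
    "sram":  {"MZ": (1345, 747), "MX": (1895, 2550)},
}

for name, pitch in cells.items():
    mz = cell_area_um2(*pitch["MZ"])
    mx = cell_area_um2(*pitch["MX"])
    ratio = mx / mz
    # The linear shrink is the square root of the area ratio
    print(f"{name}: MZ {mz:.3f} um^2, MX {mx:.3f} um^2, "
          f"density x{ratio:.2f}, linear shrink x{ratio ** 0.5:.2f}")
```

Both arrays come out near a 2x linear shrink, in line with the measured channel lengths.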

The last area of interest was the standard cell array for the CPU.

Closeup of standard cells on MZ (left), MX (right), different scales
Channel length was measured at 125-130 nm for the MZ and 250-260 nm for the MX.

Both devices also had a significant number of dummy cells in the gate array, suggesting that the designs were routing-constrained.

Dummy cells in MZ
Dummy cells in MX

In conclusion, the PIC32MZ is a significantly more powerful 130 nm upgrade to the slower 250 nm PIC32MX family. If Microchip fixes most of the silicon bugs before they launch I'll definitely pick up a few and build some stuff with them.

I wasn't able to positively identify the fab either device was made on. However, the fill patterns and power distribution structure on the MZ are very similar to those of the TI AM1707, which is fabricated by TSMC, so they're my first guess.

For more info and die pics check out the SiliconPr0n pages for the two chips: