Thursday, April 24, 2008

[HiPerCoPS] 562 Project

For most of this semester, I've been working on a simulator for our architecture. I developed an algorithm to determine the cell types in multiplier modules of arbitrary size and granularity. Then I worked on simulating the cell interconnects so the multipliers would actually function. There are still a few bugs in the overall system, but it works pretty well for the most part.

Now, as part of my Cpt S 562: Fault Tolerant Computing class, I'm working on a mechanism that maps these multipliers to a reconfigurable device. The algorithms are pretty simple so far, but they're also a bit clunky. I'm looking into ways to make the process more elegant and efficient, but the real focus is on making it fault tolerant. My goal is to add fault recovery and avoidance capabilities to the device.

So far, I can define a device with certain dimensions and place modules on it. There's also a function that induces permanent faults in the cells. Fault placement is random, but the number of faults to induce is taken as a parameter. My module placer knows that it can't map modules over faulty cells, so it looks for fault-free areas that are big enough for the "footprint" of a module.
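To make the idea concrete, here is a rough Python sketch of a device grid, random fault injection, and a first-fit footprint search. This is not the actual simulator code: the names (`make_device`, `induce_faults`, `place_module`), the rectangular footprint, and the row-by-row scan order are all assumptions made for illustration.

```python
import random

# Cell states (hypothetical encoding, chosen for this sketch).
FREE, FAULTY, USED = 0, 1, 2

def make_device(rows, cols):
    """Create a rows x cols grid of fault-free, unused cells."""
    return [[FREE] * cols for _ in range(rows)]

def induce_faults(device, num_faults, rng=random):
    """Mark num_faults distinct, randomly chosen cells as permanently faulty."""
    rows, cols = len(device), len(device[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    for r, c in rng.sample(cells, num_faults):
        device[r][c] = FAULTY

def fits(device, top, left, height, width):
    """True if every cell under the footprint is free (not faulty, not used)."""
    return all(device[r][c] == FREE
               for r in range(top, top + height)
               for c in range(left, left + width))

def place_module(device, height, width):
    """Scan row by row for the first area that fits the footprint.

    Returns the (top, left) origin and marks the cells as used,
    or returns None if no fault-free area is large enough.
    """
    rows, cols = len(device), len(device[0])
    for top in range(rows - height + 1):
        for left in range(cols - width + 1):
            if fits(device, top, left, height, width):
                for r in range(top, top + height):
                    for c in range(left, left + width):
                        device[r][c] = USED
                return (top, left)
    return None
```

For example, after `device = make_device(8, 8)` and `induce_faults(device, 5)`, a call like `place_module(device, 3, 3)` either returns an origin for a 3x3 module or `None` when the faults leave no room for it.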

Sometimes, if there are too many faults or the modules are too large, placement will be impossible. My system is smart enough to realize this and report an error whenever it happens. There are a number of ways to maximize the likelihood of being able to place 100% of the modules for a given system. First of all, the module placer can attempt to map the modules in order of size, from biggest to smallest. This way, the modules that are least likely to have enough space (the largest modules) have a better chance of finding unused real estate on the device. The smaller modules, then, can "settle into the cracks" that remain. Second, it is possible that faulty cells can still be used in some contexts. Since the atomic unit of our architecture is medium grain, perhaps a single burnt-out transistor won't cause the entire cell to fail. In the event that a subset of the functionality remains, maybe certain types of cells could still be mapped to this location. This is an example of graceful performance degradation, which could greatly increase the lifetime of the device.
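The biggest-first ordering and the error report can be sketched in a few lines of Python. Again, this is only an illustration under assumptions: modules are rectangles described as `(height, width)`, "size" means footprint area, and `PlacementError`, `first_fit`, and `place_all` are made-up names rather than anything from the real simulator.

```python
class PlacementError(Exception):
    """Raised when a module cannot be placed anywhere on the device."""

def first_fit(blocked, rows, cols, height, width):
    """Return (origin, cells) for the first footprint that avoids blocked cells."""
    for top in range(rows - height + 1):
        for left in range(cols - width + 1):
            cells = {(r, c)
                     for r in range(top, top + height)
                     for c in range(left, left + width)}
            if not cells & blocked:
                return (top, left), cells
    return None

def place_all(rows, cols, faulty, modules):
    """Place every module, largest footprint area first.

    faulty  -- set of (row, col) cells that must not be used
    modules -- dict mapping module name to (height, width)
    Returns a dict mapping module name to its (top, left) origin.
    """
    blocked = set(faulty)          # faulty cells plus cells already taken
    placement = {}
    by_area = sorted(modules,
                     key=lambda name: modules[name][0] * modules[name][1],
                     reverse=True)
    for name in by_area:
        height, width = modules[name]
        found = first_fit(blocked, rows, cols, height, width)
        if found is None:
            raise PlacementError(f"no fault-free area for module {name!r}")
        origin, cells = found
        blocked |= cells
        placement[name] = origin
    return placement
```

Calling `place_all(4, 4, {(0, 0)}, {"small": (1, 1), "big": (3, 3)})` places `big` before `small`, so the large module gets first pick of the fault-free area and the small one fills in whatever is left.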

One of the things I'd like to do is consider using a simulated annealing algorithm in the module placer. This approach is used in most FPGA placers. Although there are some fundamental differences between FPGAs and our architecture, I still think that simulated annealing could be a promising option for more efficient module placement.
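As a rough idea of what that could look like, here is a small Python sketch of simulated annealing over module positions. Everything specific in it is an assumption for illustration: the cost function (Manhattan distance between the origins of connected modules, in the spirit of FPGA wirelength costs), the move (relocating one module to a random spot), the geometric cooling schedule, and all of the names and default parameters. It also assumes the starting placement is already legal.

```python
import math
import random

def cells_of(origin, size):
    """Set of (row, col) cells covered by a module at origin with size (h, w)."""
    (top, left), (h, w) = origin, size
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}

def is_legal(placement, sizes, rows, cols, faulty):
    """True if all modules are in bounds, off faulty cells, and non-overlapping."""
    seen = set(faulty)
    for name, (top, left) in placement.items():
        h, w = sizes[name]
        if top < 0 or left < 0 or top + h > rows or left + w > cols:
            return False
        cells = cells_of((top, left), (h, w))
        if cells & seen:
            return False
        seen |= cells
    return True

def wirelength(placement, nets):
    """Hypothetical cost: Manhattan distance between connected module origins."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)

def anneal(placement, sizes, nets, rows, cols, faulty,
           steps=2000, t_start=5.0, t_end=0.01, seed=0):
    """Return (best_placement, best_cost) found by simulated annealing."""
    rng = random.Random(seed)
    current = dict(placement)
    cost = wirelength(current, nets)
    best, best_cost = dict(current), cost
    names = list(current)
    for i in range(steps):
        # Geometric cooling from t_start down toward t_end.
        temp = t_start * (t_end / t_start) ** (i / steps)
        # Move: relocate one randomly chosen module to a random origin.
        name = rng.choice(names)
        h, w = sizes[name]
        candidate = dict(current)
        candidate[name] = (rng.randrange(rows - h + 1),
                           rng.randrange(cols - w + 1))
        if not is_legal(candidate, sizes, rows, cols, faulty):
            continue  # reject moves onto faulty or occupied cells
        delta = wirelength(candidate, nets) - cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temp), which shrinks as temp cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, cost = candidate, cost + delta
            if cost < best_cost:
                best, best_cost = dict(current), cost
    return best, best_cost
```

The part that differs from a typical FPGA flow is the legality check: faulty cells are simply treated as permanently occupied, so the annealer never proposes a placement that covers them.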
