# The Fast Evolving Landscape of On-Chip Communication: Selected Future Challenges and Research Avenues

Davide Bertozzi · Giorgos Dimitrakopoulos · José Flich · Sören Sonntag

- Davide Bertozzi, University of Ferrara, Ferrara, Italy. E-mail: davide.bertozzi@unife.it
- Giorgos Dimitrakopoulos, Democritus University of Thrace, Xanthi, Greece. E-mail: dimitrak@ee.duth.gr
- José Flich, Universitat Politècnica de València, Valencia, Spain. E-mail: jflich@disca.upv.es
- Sören Sonntag, Intel Corp., Munich, Germany. E-mail: soeren.sonntag@intel.com

Abstract

As multi-core systems transition to the many-core realm, the pressure on the interconnection network is substantially elevated. The Network-on-Chip (NoC) is expected to meet the expanding demands of ever-increasing numbers of processing elements while, at the same time, technological and application constraints push for higher performance and efficiency with limited resources. Although NoC research has evolved significantly over the last decade, essential questions remain unanswered and call for fresh research ideas and innovative solutions. In this paper, we summarize a selected set of NoC-related research challenges, with the hope to guide future development and trigger high-impact research progress.

#### 1 Introduction

Modern integrated multicore platforms have adopted a Network-on-Chip (NoC) technology that brings interconnect architectures inside the chip. The NoC paradigm tries to find a scalable solution to the tough integration challenge of modern SoCs by applying well-established networking principles at the silicon chip level, after suitably adapting them to the silicon chip characteristics and to application demands [1], [2], [3], [4]. This approach was originally adopted to tackle the physical integration complexity, clocking scalability, timing closure and verification problems of state-of-the-art SoCs [5].
While the seminal idea of applying networking technology to address the chip-level interconnect problem has been shown to be adequate for current systems, the complexity of future computing platforms demands new architectures that go beyond physical-related requirements and equally participate in delivering high performance, quality of service and dynamic adaptivity at the minimum energy and area overhead [6]. Scalable interconnect architectures form the solid base on top of which heterogeneous computing platforms and their unifying programming environments will be developed. Parallelism is all about cooperation, which cannot be achieved without the efficient communication offered by the interconnect. The interconnect implements the physical and logical medium for any kind of data transfer, and its latency, bandwidth and energy efficiency directly affect overall system performance. Interconnect design is a multidimensional problem involving hardware and software components such as network interfaces, switches, topologies, routing algorithms and communication library APIs. We expect system networks to achieve ultra-fast end-to-end message delivery and hundreds of Gbytes per second of link bandwidth under demanding physical, architectural and technological constraints, which translate to contradictory objectives and most commonly to very tight energy budgets.

In this paper, we revisit most of the aspects of NoC design, beginning from microarchitecture and design methodologies and moving to physical integration challenges (clocking strategies, runtime adaptivity), as well as network partitioning, reconfiguration and virtualization, and identify the challenges involved in future NoC design. At the same time, the adoption of emerging interconnect technologies such as optical and wireless interconnects is thoroughly discussed, and the associated challenges in system architecture, circuit design, device fabrication and CAD tool development are analyzed.
The selected research avenues are also supported by an industrial viewpoint that discusses the NoC design arena of current and future SoCs for mobile platforms.

#### 2 NoC Microarchitecture Challenges

All aspects of Network-on-Chip architecture, starting from topology and routing algorithms [7], [8], [11], [12], [13] and covering router and network interface microarchitecture, have evolved significantly over the last decade. Although topologies and routing algorithms are defined in a generic manner [9], [10], allowing several design customization and specialization decisions, the same does not hold for router microarchitecture. A customizable model that covers in a unified manner all microarchitectural alternatives, such as control and data path pipelines [19], [16], [26], [27], speculation [14], [15], buffering architecture [20], [21], [22], [24] and allocation policies [17], [18], is still missing. The derivation of such a model would allow for rapid, safe architectural changes in order to find the globally optimum architectures, bridging the gap between architecture exploration, microarchitecture fine tuning and physical implementation. The derived customizable model should be smart enough to differentiate its microarchitecture depending on switch radix, centralized or distributed physical placement [23], [25] and available silicon area. Currently, router design involves assembling hardware blocks from a component library of varying granularity and complexity. Although such an approach has so far provided efficient architectures, its efficiency is limited by the efficiency of the independent blocks, the designer's potential for delivering efficient compositions, and the depth of the design space exploration.
At the same time, every new synthesis of components would require a separate verification effort to guarantee correctness (such as deadlock and starvation avoidance), together with performance validation, separate physical implementation characterization and integration with the remaining components of the system. A predefined customizable model would significantly reduce verification and validation effort, since its generic definition and precharacterization is expected to cover by construction every customized instance of it.

At the same time, on-chip network interfaces, which in the eyes of the programmer decouple computation from communication, have evolved in recent years from primitive protocol bridges and packetizing-buffering modules to sophisticated NoC units. Besides physical integration, including clock domain synchronization and power domain interfacing, the network interfaces are responsible for many other higher-level-abstraction structures close to the programming model and to the communication protocol stack. The architecture of network interfaces does not follow a standard approach, and many ad hoc alternatives supporting various application domain features have appeared recently [28], [29], [30], [31]. The trend in network interface design is to integrate as many networking features as possible directly in hardware, while keeping a balance between the hardware complexity of the interfaces (in terms of area and latency relative to the connected cores) and the complexity of the rest of the network. The architectural features that every network interface should support include: (a) protocol and bandwidth adaptivity, (b) buffering, (c) error management, (d) quality of service, (e) memory address protection and isolation (security), (f) out-of-order transaction handling, and (g) programming interface support [32] that should follow a standard form such as MCAPI or any other generic interface that would appear in the future.
Although there is a growing consensus on what a network interface should support, the same does not hold for how it should be implemented. What is missing is a clear fixed architecture, possibly programmable through a small instruction set architecture, that would allow easy customizations and versatile operation via software-like reprogrammability. Part of the communication libraries and application-level abstractions could also be implemented easily in software via the network interface's instruction set.

Besides improving NoC architectures either at the nodes of the network or at the interfaces, holistic design methodologies should also evolve, possibly moving to a true hardware-software interconnect co-design. A clear asset in this direction would be the definition of interconnect-specific architecture description languages and of automatic, correct-by-construction model-to-RTL synthesis methodologies (like a network-specific high-level synthesis methodology [34], [35]). These would allow for faster exploration of networking architectures and hold promise for more efficient designs by allowing joint, across-layer optimizations [36]. Up to now, interconnect design has followed a layered approach that encapsulates networking functions in a hierarchical stack of per-layer operations (e.g. link, transport, network layer). Designers focus on a particular layer, which hinders possible across-layer optimizations and maintains unnecessary overhead hidden inside each layer. Although concrete studies are still missing on how the network intelligence should be split between interfaces and network nodes, the definition of a network-centric design methodology at the architecture level would open up new unexplored alternatives. Interconnect-centric design technologies need to be supported by new efficient usage strategies of the network as a whole.
Traditional network usage deals with bandwidth and latency guarantees, trying either to satisfy hard communication deadlines or to offer equality of service [33]. Although such techniques have not reached the highest plateau of efficiency or ease of implementation, new usage strategies should evolve that consider energy efficiency and energy fairness, taking into account how critical the delivery of certain packets is, relative to the rest, for application execution time.

The wide adoption of NoC technology in today's high-end SoCs is expected to extend in the coming years even to ultra-low-power SoC platforms, which are gradually becoming multicores. Such systems treat maximum energy efficiency as the most important quality factor and thus operate at low voltages (and possibly at low speeds), trying to keep power consumption close to the minimum possible. The application of NoC technology, as we know it today, in such platforms would require a complete redesign, from the circuit up to the architectural level, of all traditional NoC components (buffers, crossbar, arbiters) and topologies, making them truly voltage scalable and adaptive to changing operating conditions [37]. Low-cost on-line fault tolerance features may help in this direction when applied appropriately, so as not to eat back the energy saved by voltage scaling.

#### 3 The Compositional Challenge

Power dissipation continues to be a primary design constraint in the multi- and many-core chip era. Increasing power consumption not only results in increasing energy costs, but also results in high die temperatures that affect chip reliability, performance, and packaging cost. From the performance standpoint, current and future largely integrated systems will have to carefully constrain application performance to stay within power envelopes [40].
Fortunately, multi-core systems host applications that exhibit runtime variability in their performance requirements, which can be exploited to optimize throughput while staying within the system power envelope. Dynamic voltage and frequency scaling (DVFS) schemes seek to exploit runtime variability in application behavior to achieve maximum energy savings with minimal performance degradation. The granularity of adaptive voltage and frequency control is currently still an open issue. The milestone Intel Single-Chip Cloud Computer implementation exhibits 24 frequency islands with 15 speed settings from 100 to 800 MHz, and 7 voltage islands with 7 voltage levels from 0.7V to 1.3V in steps of 0.1V [105]. The on-chip network is a voltage and frequency island on its own. Considering that the chip consists of 48 cores structured into 24 tiles, it can be considered a relevant example of fine-grained power management enabled by on-package voltage regulators.

In practice, even though the performance advantages of per-core DVFS in multi-core systems have been suggested [38,41], providing per-core, independent voltage control can be overly expensive [38]. In contrast, when DVFS is applied across multiple cores, determining a single optimal DVFS setting that simultaneously satisfies all cores will be extremely difficult; some applications will suffer performance loss or power overheads. This problem worsens as the number of cores and running applications increases in future systems. There is currently no general consensus on the above tradeoffs; however, the relentless improvement of (integrated) voltage regulator technology as a standard homogeneous CMOS component holds promise of proliferating DVFS domains for maximum energy benefits [42]. As a consequence, a future scenario can be reasonably envisioned where the homogeneous cores of a regular tile-based architecture will deliver heterogeneous power-performance operating conditions with fine granularity, and with runtime tuning capability.
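The cost of sharing one DVFS setting among several cores can be illustrated with a minimal Python sketch. It assumes the textbook dynamic power model (power proportional to V²·f, with capacitance normalized to 1) and a linear voltage/frequency relation over the ranges quoted above for the Single-Chip Cloud Computer. The linear relation, the function names and the per-core frequency demands are illustrative assumptions, not data from [105].

```python
# Illustrative model only: dynamic power ~ C * V^2 * f, with C normalized to 1.
F_MIN, F_MAX = 100.0, 800.0   # MHz, frequency range quoted in the text
V_MIN, V_MAX = 0.7, 1.3       # V, voltage range quoted in the text


def voltage_for(freq_mhz):
    """Assumed linear voltage/frequency relation (hypothetical)."""
    frac = (freq_mhz - F_MIN) / (F_MAX - F_MIN)
    return V_MIN + frac * (V_MAX - V_MIN)


def dynamic_power(freq_mhz):
    """Normalized dynamic power of one core running at freq_mhz."""
    v = voltage_for(freq_mhz)
    return v * v * freq_mhz


def per_core_power(demands):
    """Every core runs exactly at the frequency it needs."""
    return sum(dynamic_power(f) for f in demands)


def shared_domain_power(demands):
    """One setting for all cores: the fastest demand wins, so no core slows down."""
    f = max(demands)
    return len(demands) * dynamic_power(f)


# Hypothetical per-core frequency demands in MHz.
demands = [100.0, 200.0, 400.0, 800.0]
overhead = shared_domain_power(demands) / per_core_power(demands)
print(f"shared-domain power is {overhead:.2f}x the per-core DVFS power")
```

Choosing the maximum demand protects performance at the price of power; choosing a lower shared setting would instead trade performance of the most demanding core for power, which is the dilemma described above.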
Last but not least, future many-core programmable accelerators are likely to host a large number of concurrent applications to enable effective exploitation of the hardware resources, thus pushing the partitioning concept [43]. The virtualization paradigm, which is becoming mainstream also in the embedded computing domain, is fostering this trend, since virtual machines might be easily allocated a subset (possibly changing over time) of the parallel computer architecture, with trustworthy isolation strategies [44]. Overall, at any given point in time portions of the many-core architecture might go unused, hence they could be effectively powered off [45].

The above requirements have relevant implications on hardware design. In particular, they cause design-time homogeneous architectures to become highly heterogeneous at runtime, given the diversity of core power states across the platform. This diversity should be absorbed at the component boundary, where IP cores are interconnected with the system integration framework, that is, with the on-chip communication architecture. Networks-on-Chip should therefore be ready to deliver communication paths at runtime that potentially cross areas with highly heterogeneous operating conditions.

Synchronizer-based design is the traditional answer to this challenge [47], since it absorbs clock phase and frequency setting differences across communicating frequency islands at some synchronization overhead, especially in latency and power. Unfortunately, the latency overhead caused in NoC links in turn calls for larger buffering requirements at the receiver end, in order to preserve the capability for maximum throughput operation [48]. Synchronizers are extensively used in embedded systems, especially dual-clock FIFOs (like for instance [49]); however, they currently cope only with coarse-grained splitting into frequency domains in most cases.
As finer-grain splitting gains momentum, optimization techniques for them will become mandatory to fit the tight resource budgets [46]. The literature is currently ready to deliver novel proposals and interesting ideas, although architectural optimizations and silicon validations are far from complete [51]. As an example, a technique merging synchronizers with NoC building blocks was proved capable of reusing expensive buffering resources for multiple purposes, including synchronization, performance buffering and flow control [50]. Nonetheless, the distinctive challenges for the industrialization of synchronizer-based technology encompass:

- The careful engineering of timing constraints across links, since the need to deliver flow control in NoCs causes non-trivial round-trip channel dependencies even in source-synchronous communication.
- The continued development of bundled data routing methodologies in sub-40 nm technologies, capable of keeping relative delay mismatches between link wires under control.
- The implementation of clock gating techniques capable of cutting down on idle power, especially when source-synchronous clock signals need to be routed to the receiver end for the sake of signal resynchronization.
- The implementation of reliable reset mechanisms, safeguarding operation of those synchronizers that require precise alignment between their front-ends (in one clock domain) and their back-ends (in another clock domain) with respect to timing uncertainties in reset deassertion across clock domains.
- The proper (over) sizing of the number of cascaded stages in brute-force synchronizers to counter the degradation of the resolution time constant of synchronizers as technology scales deeper into the nanoscale regime. This has non-negligible implications for the performance and power overhead of dual-clock FIFOs, which would in many cases require slot overprovisioning to preserve the full throughput operation capability.
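The last item in the list can be sketched numerically. The Python fragment below uses the classic metastability model, MTBF = exp(t_res / τ) / (T_w · f_clk · f_data), to find how many cascaded flip-flops a brute-force synchronizer needs for a target MTBF, and then derives a crude FIFO slot estimate. The model simplifications (each stage after the first contributes one full clock period of resolution time), the slot rule of thumb, the function names and all numeric parameters are assumptions made for illustration; they are not taken from the cited works.

```python
import math


def log_mtbf(stages, f_clk, f_data, tau, t_w):
    """Natural log of the MTBF (in seconds) of an N-flop brute-force synchronizer.

    Assumed model: MTBF = exp(t_res / tau) / (t_w * f_clk * f_data), where each
    stage beyond the first adds one clock period of resolution time (setup and
    clock-to-q delays are ignored). Working in the log domain avoids overflow.
    """
    t_res = (stages - 1) / f_clk
    return t_res / tau - math.log(t_w * f_clk * f_data)


def stages_needed(target_mtbf_s, f_clk, f_data, tau, t_w, max_stages=16):
    """Smallest number of cascaded flops that meets the target MTBF."""
    for stages in range(2, max_stages + 1):
        if log_mtbf(stages, f_clk, f_data, tau, t_w) >= math.log(target_mtbf_s):
            return stages
    raise ValueError("target MTBF not reachable within max_stages")


def fifo_slots(stages, datapath_cycles=2):
    """Crude slot estimate for full throughput in a dual-clock FIFO.

    Assumption: a pointer crosses an N-stage synchronizer in each direction,
    plus a couple of cycles for the write and read datapaths.
    """
    return 2 * stages + datapath_cycles


# Hypothetical parameters: 1 GHz receiver clock, 100 MHz data toggle rate,
# 20 ps metastability window, target MTBF of 1e9 seconds.
common = dict(target_mtbf_s=1e9, f_clk=1e9, f_data=1e8, t_w=20e-12)
for tau in (20e-12, 60e-12):  # a larger tau models a degraded time constant
    n = stages_needed(tau=tau, **common)
    print(f"tau = {tau * 1e12:.0f} ps: {n} stages, about {fifo_slots(n)} FIFO slots")
```

Under these assumptions, tripling τ raises the stage count from 2 to 4, and the slot estimate grows with it, which is the overprovisioning effect the list item refers to.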
There is no doubt that reliable and energy-efficient many-core system design will only be feasible under relaxed synchronization assumptions in the future. An indirect confirmation comes from the similar synchronization paradigms used in two relevant Intel test chips. On one hand, the early 80-core Intel Polaris chip relied on mesochronous clocking to implement simpler and less power-hungry clock distribution networks, replacing a complicated H-tree with something simpler and shorter like a grid [57]. On the other hand, the latest demonstration of a 256-node Intel NoC on 22nm Tri-Gate CMOS relies on the principle of source-synchronous communication, which is effectively coupled with a hybrid packet/circuit switching architecture built on top of it, thus yielding 20.2 Tbps among the nodes and 18.2 Tbps/W efficiency when running at 430 mV in near-threshold voltage operation [58].

An appealing alternative to synchronizer-based design consists of clockless handshaking [52]. When applied to inter-domain communication, it holds promise of average-case instead of worst-case performance, no clock tree switching power (especially in the idle state), robustness to process/voltage/temperature variations, and efficient delivery of differentiated per-link performance [59]. Counterintuitively, such potential benefits are not reflected in adequate industrial exploitation, which includes asynchronous Ethernet routers and high-speed FPGAs, and only marginally on-chip interconnect sub-systems. The traditional explanation is the poor CAD tool support for designing asynchronous systems.
Many efforts are underway to get predictable and fast-converging designs by means of ad-hoc tools [61] and/or scripting and methodologies on top of mainstream CAD tools [60]. However, they are not yet capable of avoiding extensive manual intervention, the deactivation of relevant optimization capabilities of such tools [54], and the technology-specific description of some components in abstract specifications, nor of enabling the flexibility required for the design and synthesis of soft macros. As a consequence, prototype designs and real products still largely rely on hard macros and full custom design when it comes to the asynchronous components [56], with some noticeable exceptions [53, 55].

However, it should be observed that existing industrial prototypes do not often represent an incentive for further development of the tooling support. Existing asynchronous interconnect fabrics can easily prove relevant savings on application total power; however, they feature a larger energy-per-flit than their synchronous counterparts. As a result, power savings can be explained only in terms of the poor utilization of the system interconnect, since the idle power figure is clearly in favour of the asynchronous implementation. The ultimate reason for this trend is that asynchronous NoCs are typically designed with 4-phase communication protocols and delay-insensitive data encoding. The former choice implies two complete round-trip channel communications per transaction, which becomes unaffordable in the presence of long links. The latter choice guarantees high timing robustness, since circuit functionality and operation are guaranteed by construction in the face of delay variations during the fabrication process. Unfortunately, this comes with high area occupancy, low coding density and high energy per bit.
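The two costs named above can be put into a small first-order model. The Python sketch below counts wire traversals per transaction for a 4-phase (return-to-zero) and a 2-phase (transition-signalling) handshake, and counts link wires for dual-rail delay-insensitive encoding against a single-rail baseline with explicit request and acknowledge wires. The delay model, the function names and the numeric values are illustrative assumptions; real links add pipeline, driver and completion-detection overheads that are lumped here into one `logic_ps` term.

```python
def four_phase_cycle_ps(wire_delay_ps, logic_ps):
    """4-phase handshake: req up, ack up, req down, ack down.

    Four wire traversals, i.e. two complete round trips per transaction.
    """
    return 4 * wire_delay_ps + logic_ps


def two_phase_cycle_ps(wire_delay_ps, logic_ps):
    """2-phase handshake: one req toggle and one ack toggle (one round trip)."""
    return 2 * wire_delay_ps + logic_ps


def dual_rail_wires(data_bits):
    """Dual-rail delay-insensitive code: two wires per bit plus one acknowledge."""
    return 2 * data_bits + 1


def single_rail_wires(data_bits):
    """Single-rail data plus separate request and acknowledge wires."""
    return data_bits + 2


# Hypothetical link: 32-bit flits, 200 ps logic overhead per transaction.
for wire_delay in (50, 500):  # short versus long link, in picoseconds
    t4 = four_phase_cycle_ps(wire_delay, 200)
    t2 = two_phase_cycle_ps(wire_delay, 200)
    print(f"wire delay {wire_delay} ps: 4-phase {t4} ps, 2-phase {t2} ps")
print(f"32-bit link: {dual_rail_wires(32)} wires dual-rail, "
      f"{single_rail_wires(32)} wires single-rail")
```

In this model the 4-phase penalty grows linearly with wire delay, which is why the protocol becomes expensive on long links, and the dual-rail link needs almost twice the wires of the single-rail one.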
More recently, these quality metrics are raising interest in an alternative design style relying on bundled data [62–64]. The key rationale is that if the lower timing robustness of this data encoding can be kept under control by means of efficient CAD tools and guardbands on relative timing constraints, the benefits of reduced area, reduced wires per link and reduced energy per bit can be materialized. The success of this design style will probably depend on the availability of efficient tunable delay lines to make the performance penalty of the guardbands link-specific. Whether this is a viable solution for high-performance designs or not, bundled data clearly represents the way to go for the desynchronization of low-power, low-end designs in the embedded computing domain. One possible obstacle is that latch performance might not be that good in low-power technology libraries, which questions the typical design practice of delivering high-performance asynchronous designs via extensive utilization of pipelining techniques in the presence of specific pipeline design styles (e.g., MOUSETRAP pipelines).

Overall, the question about the role of asynchronous interconnect technology for the solution of the compositional challenge in manycore systems is far from getting stable answers. The key driver is the possibility to connect domains regardless of their specific and runtime-varying operating conditions, like in [65,66]. The literature keeps documenting remarkable and trustworthy power savings whenever the technology is applied. It should, however, be brought to the stage where its development for exploitation within an industry-standard methodology and tool-flow becomes cost-effective with respect to the further evolution of current design methods.

#### 4 The Resource Sharing Challenge

From previous sections we have seen many challenges ahead to be addressed for proper NoC design.
However, the fact that the NoC is a shared resource within the chip makes its design much more critical than expected. If we think of a possible multicore system, where tens or hundreds of processor cores communicate among themselves and with caches and memory controllers, we easily see that the medium used for that communication is always the same: the NoC. Thus, the way we design the NoC will heavily influence not only the overall system performance but also the way we share resources (cores, caches, memory controllers, accelerators, ...). This imposes a challenge orthogonal to the previous ones, since an unbalanced use of resources will lead to poor performance numbers, or even to unattainable QoS levels.

Indeed, future systems with hundreds, or perhaps thousands, of cores will inherit a structural problem. Applications running on such systems will simply not scale, or will scale poorly, thus not benefiting from the theoretical peak performance of those systems. To address this issue, we can see high-performance computing systems divided into two categories [111]. In capability computing, HPC infrastructures (supercomputers) are used to solve a single and highly complex problem in the shortest possible amount of time. In capacity computing, however, a compute system solves as many problems as possible in parallel at the lowest possible cost. This refers to, for instance, data centers receiving millions of requests per time unit.

If we apply the *capacity computing* approach to multicore systems, we can think of applications (or tasks) running on the same system but using disjoint sets of resources (cores, memories, ...). This is indeed an appealing approach, since it maximizes the utilization of system resources. This approach (sharing resources between different applications or tasks) is also emerging in the embedded domain with the concept of mixed-criticality systems (MCS) and virtualization.
In MCS systems, a single multicore chip must be able to run different applications with different criticalities and must guarantee that failures or perturbations of one application do not affect the performance of the other applications. Virtualization of chip resources is also being promoted lately and also requires an effective resource sharing policy in order to decouple each application's performance from the rest. In the near future, multicores will invade every domain (e.g. aerospace, automotive) and will demand efficient policies to manage the resources in a structured and safe manner. Unfortunately, the NoC is a shared resource and lies in the middle of the problem. So, it is clear we need to design the NoC with these new requirements in mind.

This challenge is aggravated by the fact that a coherence protocol, which guarantees memory access consistency, may be running on top of the system. In such a scenario the NoC will face mainly the traffic generated by the coherence protocol. This kind of traffic has its own characteristics, such as traffic distribution, traffic burstiness, and communication types (unicast, collective, gather operations). A NoC cannot be designed without taking such traffic characteristics into account.

Support for efficient resource sharing policies and coherence protocols demands efficient NoC designs in the following directions. First, the communication particularities of coherence protocols need to be supported natively by the NoC. Examples are broadcast support in the NoC allowing efficient communication of coherence protocol commands, gathering operations support in the NoC allowing efficient acknowledgments of multiple cores to the same memory block, and efficient support of synchronization primitives in the NoC allowing fast synchronization operations of the processes running on the system.
These kinds of optimizations (indeed, an efficient co-design of the NoC and the coherence protocols) will allow such protocols to scale, or at least scale better, thus delivering higher capacities and performance numbers. Second, the NoC needs to be designed with built-in mechanisms and methods to guarantee flexible, runtime partitioning schemes, which will enable effective isolation of applications or tasks in the same chip. This mainly affects the design and properties of routing algorithms. Topology-agnostic routing algorithms (like up\*/down\* or segment-based routing) allow the building of partitions with any shape, thus promoting the partitioning capability. Third, the NoC needs to be designed with reconfiguration capabilities. Indeed, if we plan to map different applications on the system, these applications will be continuously entering and leaving the system, thus needing different numbers and types of resources and demanding a proper chip reconfiguration. A NoC with transparent reconfiguration (neither affecting nor stopping the current traffic) is required for the support of such systems. The final direction to take is a complete and transparent exposure of the configuration and partitioning capabilities of the NoC to the software stack, mainly the operating system and in particular the hypervisor. This module will be in charge of customizing the system to the current demands of the applications requesting service from the system.

All this support has its center of gravity in the NoC, since it is the shared resource and the one used to connect all the system components within the chip. Proper design of the NoC to support partitioning while optimizing coherence protocol support will become mandatory for future chips based on NoCs.

#### 5 Emerging Interconnect Technologies

According to the ITRS roadmap [70], interconnect innovation is the key to satisfying performance, reliability, and power requirements in the long term.
Future interconnect technologies must support ultra-high data rates (e.g., greater than 100 Gbps/pin), be scalable enough to support tens to hundreds of concurrent communication streams, and involve fabrication techniques that are compatible with mainstream MPSoC and system-in-package (SiP) technologies. An overview of the fundamentals and ongoing research challenges for two revolutionary interconnect technologies, namely silicon nanophotonics and RF/wireless interconnects, is reported in [71, 72].

Optical links are already pervasive in data centers because of their ability to improve the bandwidth density over copper cables, to the point that optical switching represents the next step in order to overcome the overhead of frequent domain conversions [69]. There is instead no consensus on the use of silicon photonics for on-chip communication, where optimized electronic links, and their evolutions, are competitive. Yet, emerging nanophotonic technology has yielded a rich design space for on-chip optical-electrical architectures [73–77]. Early studies such as those in [73, 75, 76, 78–81] made the point for the performance and power properties of photonic interconnection networks in isolation from the rest of the system. System-scale analysis was instead made affordable by [82], with a trade-off between accuracy and simulation speed in favour of the latter. The need to come up with compelling cases for silicon nanophotonic technology has motivated the quest for higher accuracy, for instance by considering communication workloads [83], or the network interface overhead [84]. As the level of detail in comparative analysis between electrical and optical fabrics increases, it is becoming evident that while the optical interconnect fabric is not more energy efficient per se [86], the optically-augmented system is, since it can burn power for a shorter amount of time due to the lower execution times that optical links enable [85].
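The last argument is an energy = power × time argument, and a small Python sketch makes it explicit. All power and execution-time figures below are hypothetical numbers chosen only to illustrate the reasoning; they are not results from [85] or [86], and the function names are likewise made up for this example.

```python
def energy_joules(power_watts, seconds):
    """Energy drawn at a constant power over a time interval."""
    return power_watts * seconds


def system_energy(core_power_w, fabric_power_w, exec_time_s):
    """Total energy when cores and fabric draw power for the whole run."""
    return energy_joules(core_power_w + fabric_power_w, exec_time_s)


# Hypothetical scenario: the optical fabric draws more power than the
# electrical one, but the application finishes sooner.
electrical_fabric = energy_joules(5.0, 10.0)
optical_fabric = energy_joules(8.0, 8.0)
electrical_system = system_energy(core_power_w=50.0, fabric_power_w=5.0, exec_time_s=10.0)
optical_system = system_energy(core_power_w=50.0, fabric_power_w=8.0, exec_time_s=8.0)

print(f"fabric only: electrical {electrical_fabric} J, optical {optical_fabric} J")
print(f"whole system: electrical {electrical_system} J, optical {optical_system} J")
```

With these assumed inputs the optical fabric alone consumes more energy, yet the whole system consumes less, because the cores stay powered for a shorter time.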
Another alternative for future on-chip communication consists of NoCs with multi-band RF interconnects (RF-I) [89]. In this particular NoC, instead of depending on the charging/discharging of wires for sending data, electromagnetic (EM) waves are guided along on-chip transmission lines created by multiple layers of metal and dielectric stack. As the EM waves travel at the effective speed of light, low latency and high bandwidth communication can be achieved. Though RF-I NoCs can be built using existing CMOS technology, they require laying of long on-chip transmission lines to serve as wave guides, without eliminating any existing links. Recently, the design of a wireless NoC based on CMOS Ultra Wideband (UWB) technology was proposed [90]. In [91], the feasibility of designing on-chip wireless communication networks with miniature antennas and simple transceivers that operate at the sub-THz range of 100-500 GHz has been demonstrated. If the transmission frequencies can be increased to THz/optical range then the corresponding antenna sizes decrease, occupying much less chip real estate. One possibility is to use nanoscale antennas based on CNTs operating in the THz/optical frequency range [92]. Consequently, building an on-chip wireless interconnection network (WiNoC) using THz frequencies for inter-core communications becomes feasible. On-chip wireless communication links not only alleviate the latency and energy dissipation issues of conventional technologies but also eliminate complex interconnect routing and layout problems arising in some of the alternative technologies. Hence, such interconnects enable design of novel and efficient architectures which mitigate the multi-hop communication of traditional NoCs to achieve significant performance gains. A detailed survey regarding the promises and design challenges of this emerging paradigm is reported in [93]. 
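The link between carrier frequency and antenna size follows from λ = c / f. The Python sketch below computes a quarter-wavelength antenna length at several frequencies. Quarter-wave sizing, the optional effective-permittivity correction and the function names are assumptions made for illustration; actual on-chip antenna dimensions depend on the antenna type and on the dielectric stack.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0


def wavelength_mm(freq_hz, eps_eff=1.0):
    """Wavelength in millimetres; eps_eff = 1.0 corresponds to free space."""
    return SPEED_OF_LIGHT_M_S / (freq_hz * math.sqrt(eps_eff)) * 1e3


def quarter_wave_mm(freq_hz, eps_eff=1.0):
    """Length of a quarter-wavelength antenna (assumed sizing rule)."""
    return wavelength_mm(freq_hz, eps_eff) / 4.0


# Sub-THz range quoted in the text (100-500 GHz) plus a 1 THz point.
for freq_ghz in (100, 500, 1000):
    length = quarter_wave_mm(freq_ghz * 1e9)
    print(f"{freq_ghz:>5} GHz: quarter-wave length of about {length:.3f} mm")
```

Under the free-space assumption, moving from 100 GHz to 1 THz shrinks the quarter-wave length by a factor of ten, which is the scaling that makes THz-range on-chip antennas attractive in terms of area.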
The development of emerging interconnect technologies implies that three fundamental gaps need to be addressed by researchers at different levels of abstraction, and with a cross-layer approach to design and optimization. Among them, the *physical design gap* is certainly the most evident issue. As regards optical NoCs, high-speed, low-power, and small feature-size electro-optical modulators and photo-detector receivers need to be developed, since their quality metrics, together with the overhead of laser sources, will determine the threshold required to be advantageous over electrical interconnects. In particular, high-speed, electrically-driven monolithic light sources have remained elusive so far, thus calling for profound innovations in the field of integrated on-chip light sources. Finally, on-chip optical interconnect modules are very sensitive to process and thermal variations. Designers need to ensure active or passive optical control methods to maintain reliable device operation [87]. Similarly, the effectiveness of WiNoCs strongly depends on the design of the physical layer. In turn, the miniaturized on-chip antennas and the wireless transceivers influence the performance of the physical layer. Characteristics of the antennas and the transceivers also depend on the adopted frequency range of communication (ultra wide band, millimeter-wave, sub-terahertz, or terahertz). All physical layers designed in different frequency bands have antenna and transceiver area and power overheads [93].
Thus, innovations such as [94–96] are required to achieve the best performance-overhead trade-off and fully exploit the advantages of wireless links.

Technology maturity is not the only gap that separates emerging technologies from their industrial uptake. An *architectural gap* in fact arises on top of the physical one, although they end up being tightly intertwined. For optical NoCs, building a communication architecture out of a specific optical toolbox is a complex task that spans several design concerns. After all, from a functional viewpoint an optical network is nothing but a non-blocking crossbar, due to the lack of buffering technology of practical relevance. Hence, the control complexity is entirely moved to the boundaries. There, key aspects such as flow control, synchronization, buffering architecture, resource sharing techniques, serialization, etc. should be taken care of, and may determine the threshold beyond which an optical interconnect is better than another one, or than an electrical counterpart. A key architecture-level design decision concerns the choice between space-routed optical NoCs, which devote the available WDM (wavelength-division multiplexing) link bandwidth to peer-to-peer communications, and wavelength-routed NoCs, which exploit the same bandwidth for the sake of delivering global, contention-free communication, while decreasing the available bandwidth for the specific communication flows. This decision tightly depends on the application requirements, since wavelength-routed NoCs are well suited for latency-critical applications, while space-routed ones are the best choice for throughput-intensive applications, especially for long-lasting connections. As pointed out in [88], there are a number of intermediate solutions between the two extreme cases, which yield photonic bus variants.
With the single writer multiple reader (SWMR) paradigm, additional signaling is required for the sake of tuning the filters of the intended receiver, while in the multiple writer single reader (MWSR) paradigm a global arbitration is needed to select the sender that injects into the photonic bus. The multiple reader multiple writer scenario is also feasible. The chosen scheme then has serialization and scalability implications, which are especially constraining for wavelength-routed networks due to the large amount of needed resources. Currently, an extensive comparison of architectural solutions in a homogeneous experimental setting is still missing, and so is a study relating them to the requirements of realistic workloads.

Clock resynchronization is another architecture-level concern. Some parts of the optical network interface in fact need to work at very high speed (e.g., 10 GHz, associated with the modulation rate of the optical medium and with the serialization ratio), or with multi-phase clock signals. This is not only a physical design issue, since signals converted back from the optical to the electronic domain should be resynchronized in the target clock domain [84]. The resynchronization architecture and circuitry are still a largely unresolved issue for optical NoCs, although source-synchronous schemes seem to be the preferred option. Unfortunately, they require the transmission of clock signals across the optical domain, which has therefore become an active research field [107].

Addressing the design predictability gap is mandatory when selecting the target topology for the optical NoC. Although not explicitly stated, logic topologies are tied to implicit placement constraints for initiator and target interfaces. The real positioning of such interfaces on the layout of the system at hand may cause a radical change of the physical routing paths. Side effects are an increased length of the waveguides or an unexpected number of additional waveguide crossings.
Placement constraints are especially severe in a 3D stacked environment, as proven in [109], thus justifying their consideration upfront in the design process [108].

WiNoC architectures can be assembled by overlaying a regular wired mesh-based NoC with wireless links [95, 97]. However, there is currently an immense interest in creating novel architectures aided by on-chip wireless communication [96]. In this direction, different hierarchical small-world wireless NoC architectures incorporating THz and mm-wave wireless links are explored in [96] and [98], respectively. These works have demonstrated that, by using wireless links as long-range communication channels between widely separated cores along with wired interconnects connecting adjacent cores, it is possible to obtain significant gains in achievable bandwidth and to improve the energy dissipation profile without introducing significant hardware overhead. Research in this domain is far from being consolidated. Moreover, to attain the desired performance benefits using WiNoCs, the available communication resources should be utilized optimally. Therefore, an efficient media access mechanism [90, 95, 98, 99], along with an optimum routing protocol [90, 95, 96, 100], is crucial for efficient utilization of the wireless channels. Since all solutions improve the achievable bandwidth at an area and power overhead, a comprehensive study quantifying merits and limitations of these techniques, and their implementation challenges, needs to be carried out for an informative comparative analysis. The MAC and routing protocols for WiNoCs need to be complemented by suitable flow control mechanisms to enable optimum utilization of the wireless medium [101, 102]. Last but not least, challenges in reliability and integration demand radically different architectural design to make this emerging interconnect paradigm viable for large-scale adoption.
Although architectural innovations such as [103] may enable resilience against permanent failures, wireless channels are inherently more prone to transient errors than their wireline counterparts. In this direction, it is demonstrated in [104] that with carefully designed error control coding (ECC) schemes in the WiNoC it is possible to achieve high gains in performance due to the wireless links while maintaining reliability comparable to that of a traditional wireline NoC. However, the application of ECC also introduces timing and area overhead, which gives rise to an interesting trade-off to explore.

In addition to the physical and architectural gaps, the system as a whole should be optimized around an optical transport medium, which implies the codesign of components to meet system-level requirements (i.e., the systemability gap). This includes for instance the codesign of the fabric with the cache coherence protocol, the routing path selection policy in case of hybrid interconnect fabrics, the differentiated service of latency- vs. throughput-critical traffic, the codesign of the on-chip network with the processor-memory (including off-chip) network, and the avoidance of message-dependent deadlock. At this level, compiler and software optimizations should also be considered, including for instance the optimization of the dynamic behaviour of the application and the exploitation at runtime of the available degrees of freedom in the communication fabric.

Finally, each new technology should come with its own design technology support from the ground up, in order to bridge the gap between physical designers and system designers, who need to design with the new technology. This encompasses abstract models, design methodologies, tools and toolflows. For instance, abstraction layers and associated description tools should be redefined for optical NoCs, thus matching the electronic definitions of behavioural views, RTL ones, etc.
Moreover, new tools for placement and routing such as [110] are needed, due to the inherently different optimization metrics that optical NoCs require with respect to mainstream CAD tools for electronic design (e.g., number of waveguide crossings, waveguide length, or both).

Overall, the most daunting challenge for the next few years will be to come up with compelling cases for silicon nanophotonic as well as for wireless networks, thus possibly justifying the definition of roadmaps and investments in technology development based on solid experimental evidence. Assessing the implications of an emerging interconnect technology over the quality metrics of real-life devices such as GPUs or programmable accelerators is part of this needed validation framework. In this respect, the experience of researchers starts to put together a few basic rules for trustworthy cross-benchmarking between NoCs built on top of emerging technologies and their electrical counterparts. Such rules are reported next, derived from the converging conclusions of [88] and [86]. They are tailored to optical interconnection networks, although the inspiring principles behind them could easily drive WiNoC research as well in the future:

- clearly specify the logic topology. In many papers, logic topologies are hardwired with their physical implementations, hence preventing a true distinction of design points, and the application of well-known optimization principles from interconnection network theory.
- explore the space of mapping options to nanophotonic devices. For a given logic topology, different technology mappings do exist, characterized by the use of a different mix of photonic devices (e.g., 1x2, 2x2 or higher-order photonic switching elements), or a different filtering order of WDM signals.
- account for place&route constraints. The actual gap between logic topologies and their physical implementation under the place&route constraints of the layout at hand should be quantified. Also, design techniques/choices should be investigated/made to minimize such a gap.
- keep it simple. Simple interconnection solutions, starting from topology selection and from the choice of basic building blocks, are a must in order to minimize the adoption risk of a new technology.
- design the network interface architecture. Interfacing the electrical and the optical domains is not just an issue of bringing optoelectronic devices into the design, but it is an architecture-level effort too, where networking design issues (buffering, flow control, deadlock avoidance, etc.) should be addressed.
- use an aggressive electrical counterpart. Previous work often reports orders of magnitude better performance and power of the optical fabric also because the electrical counterpart is built on top of naive assumptions. A trustworthy cross-benchmarking should consider state-of-the-art electrical NoC architectures, which should undergo synthesis on top of industrial technology libraries.
- assume a broad range of device parameters. In the presence of a fast evolving technology, it does not make sense to tie conclusions to specific parameters for the silicon photonic devices. Rather, parametric studies should define the requirements for physical designers in order for their devices to be mature enough for practical exploitation.
- carefully consider static power overhead. Optical interconnect technology is static-power dominated, while materializing excellent dynamic power savings. Therefore, previous studies suggested not to use it for short-range communications, but rather to aggregate injecting cores into optical network interfaces, while performing short-range communications still in electronics. This avoids the proliferation of domain converters, which would consolidate the static power dominance.
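The last two rules (parametric studies and static power overhead) can be illustrated with a small sweep. The sketch below uses a deliberately simplified first-order model that is an assumption made here, not a model taken from [86] or [88]: the optical link burns a fixed static power that is amortized only over the bits actually sent, while the electrical link costs a fixed energy per bit per millimetre. Every parameter value is a hypothetical placeholder.

```python
def optical_energy_per_bit_pj(static_power_mw: float, bandwidth_gbps: float,
                              utilization: float, dynamic_pj_per_bit: float) -> float:
    """Energy per delivered bit of an optical link, in pJ/bit.

    Static power is consumed regardless of traffic, so it is divided by the
    bandwidth actually used. Note that 1 mW / 1 Gbps = 1 pJ/bit.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    delivered_gbps = bandwidth_gbps * utilization
    return static_power_mw / delivered_gbps + dynamic_pj_per_bit


def break_even_length_mm(optical_pj_per_bit: float,
                         electrical_pj_per_bit_per_mm: float) -> float:
    """Link length beyond which the optical link costs less energy per bit."""
    return optical_pj_per_bit / electrical_pj_per_bit_per_mm


# Hypothetical device parameters; a real study would sweep measured ranges.
STATIC_POWER_MW = 2.0
BANDWIDTH_GBPS = 10.0
OPTICAL_DYNAMIC_PJ_PER_BIT = 0.1
ELECTRICAL_PJ_PER_BIT_PER_MM = 0.2

for utilization in (0.05, 0.25, 1.0):
    optical_pj = optical_energy_per_bit_pj(
        STATIC_POWER_MW, BANDWIDTH_GBPS, utilization, OPTICAL_DYNAMIC_PJ_PER_BIT
    )
    length_mm = break_even_length_mm(optical_pj, ELECTRICAL_PJ_PER_BIT_PER_MM)
    print(f"utilization {utilization:4.2f}: optical {optical_pj:.2f} pJ/bit, "
          f"break-even length {length_mm:.1f} mm")
```

In this toy model, a lightly used optical link only pays off over long distances, which is consistent with the suggestion to keep short-range traffic in electronics and to share optical network interfaces among several cores.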
Obviously, the above rules should be followed not only to determine under which operating conditions switching to a new interconnect technology is preferable to the further evolution of current electrical links, but also to compare emerging technologies with one another, thus complementing seminal works in this field [106].

## 6 Particular Challenges for Mobile Platforms

The heart of mobile platforms like tablets and mobile phones are heterogeneous MPSoCs. They do not simply comprise a number of identical processing nodes, as mobile platforms run on batteries and all processing and communication tasks need to be executed very efficiently. An increasingly large number of specialized hardware accelerators supports the general-purpose processors in order to achieve high efficiency, i.e., executing the desired operations and tasks faster and/or more energy-efficiently than the main CPUs could. Furthermore, these platforms comprise a zoo of different specialized components such as display controllers, camera interfaces, sensors, connectivity modules such as Bluetooth, WiFi, FM radio, GNSS (Global Navigation Satellite System), and multimedia subsystems. Most of them provide local intelligence such as an integrated processing core or a DSP. All of these subsystems and components communicate with each other and with memories such as on-chip SRAMs and SDRAMs. A sophisticated on-chip interconnection network is required, which needs to be tailored to the communication requirements of each individual component. Conventional 2D mesh networks or simple rings are not appropriate here. In fact, these networks need to be as heterogeneous as the rest of the SoC. This complicates both architecture and topology decisions, and poses many practical challenges on top of typical challenges from academia such as minimum hop-count or deadlock avoidance. Battery life time and responsiveness define the user experience.
Both can only be achieved at the same time by highly optimized architectures. This implies extensive power saving features such as DVFS and scenario-based sleep modes for parts of the system. Based on the given scenario, portions of the system are put into low-power modes. This includes reduced frequencies, lower voltages or the complete power-down of a subsystem. The chosen power-saving measure depends on the wake-up time that is needed to restore the subsystem to its previous state. Dynamic clock frequency scaling can be used easily, as the wake-up time is in the range of a few clock cycles. The subsystem is still able to work, although slower, and will retain its state. A complete power-down of a subsystem, however, is only initiated if the power-down phase is long enough compared to the phases of powering down and waking up. In some cases the start of the next active phase cannot be determined in advance. It could happen that the subsystem is requested to be active immediately after initiating the power-down procedure, which is then immediately followed by the wake-up procedure. Both procedures together might use more energy than keeping the subsystem in the active (idle) state. So, the power-down phase must be long enough to pay off from both timing and energy-saving perspectives.

From a physical design point of view, the increasing complexity of the computing and communication architectures must be handled properly. Clock distribution and synchronization inside the chip become prohibitively costly. Long clock signal wires running across the chip and toggling at high frequencies are difficult to balance and consume lots of energy. Mesochronous and asynchronous islands arise to overcome the need for long clock wires. However, these islands have to be integrated seamlessly. While this sounds trivial to accomplish, in practice many things need to be taken into account.
Re-synchronization between clock domains needs to be performed, which causes additional latency and might even reduce throughput on control and data paths. A sophisticated design flow is needed, but also architectural awareness, in order to do more good than harm. Typically, these data and control paths will be hand-optimized by experienced engineers to squeeze the last drop of performance out of these connections.

The next step in mobility is the Internet of Things (IoT) [112], where everyday devices, objects and physical assets are equipped with computing power, sensors and wireless connections that enable them to communicate with each other and with the Internet. These connected objects become more and more intelligent with tiny integrated processors such as the Intel Quark processor, or may even be complete computers such as the Intel Edison, a PC the size of an SD card including Bluetooth and WiFi communication capabilities. In the end, these SoCs with their on-chip interconnects will be connected, forming large wireless off-chip networks with an ever increasing amount of data to be transmitted.

#### References

1. J. Handy, "NoC interconnect improves SoC economics," Objective Analysis - Semiconductor Market Research, 2011.
2. J. Browne, "On-Chip Communications Network," in Sonics, 2012.
3. W. J. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," in *Proc. of the 38th Design Automation Conference (DAC)*, Jun. 2001.
4. D. Wentzlaff et al., "On-Chip Interconnection Architecture of the Tile Processor," *IEEE Micro*, pp. 15–31, Sep./Oct. 2007.
5. L. Benini and G. De Micheli, "Networks on chips: a new SoC paradigm," IEEE Computer, vol. 35, no. 1, pp. 70–78, 2002.
6. G. De Micheli, C. Seiculescu, S. Murali, L. Benini and F. Angiolini et al., "Networks on Chips: From Research to Products," 47th Design Automation Conference (DAC 2010), 2010.
7. J. Kim, J. Balfour, and W. J.
Dally, "Flattened butterfly topology for on-chip networks," in Proc. IEEE/ACM Intern. Symp. on Microarchitecture (MICRO), 2007.
8. A. K. Mishra, N. Vijaykrishnan, and C. R. Das, "A case for heterogeneous on-chip interconnects for CMPs," in *Proc. of the Intern. Symp. on Computer Architecture*, 2011, pp. 389–400.
9. J. Balfour and W. J. Dally, "Design tradeoffs for tiled CMP on-chip networks," in *Proc. of the 20th ACM Intern. Conf. on Supercomputing (ICS)*, Jun. 2006.
10. Giorgos Passas, Manolis Katevenis, Dionisios Pnevmatikatos, "Crossbar NoCs Are Scalable Beyond 100 Nodes," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), ISSN: 0278-0070, vol. 31, issue 4, April 2012, pp. 573–585.
11. J. Flich, A. Mejia, P. Lopez, and J. Duato, "Region-based routing: An efficient routing mechanism to tackle unreliable hardware in networks on chip," in *Intern. Symp. on Networks on Chip (NOCS)*, 2007.
12. S. Ma, N. Enright Jerger, and Z. Wang, "Whole Packet Forwarding: Efficient Design of Fully Adaptive Routing Algorithms for Networks-on-Chip," in *Proc. of the Intern. Symp. on High Performance Computer Architecture*, Feb. 2012, pp. 467–478.
13. Daeho Seo, Akif Ali, Won-Taek Lim, Nauman Rafique, and Mithuna Thottethodi, "Near-Optimal Worst-Case Throughput Routing for Two-Dimensional Mesh Networks," in Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA '05), IEEE Computer Society, Washington, DC, USA, 2005, pp. 432–443.
14. L.-S. Peh and W. J. Dally, "A delay model and speculative architecture for pipelined routers," in Proc. of the 7th Intern. Symp. on High-Performance Computer Architecture (HPCA-7), 2001.
15. R. D. Mullins, A. F. West, and S. W. Moore, "Low-latency virtual-channel routers for on-chip networks," in Proc. of the Intern. Symp. on Computer Architecture, 2004, pp. 188–197.
16. M. Azimi, D. Dai, A. Mejia, D. Park, R. Saharoy, and A. S.
Vaidya, "Flexible and adaptive on-chip interconnect for tera-scale architectures," Intel Technology Journal, vol. 13, no. 4, pp. 62–77, 2009.
17. G. Michelogiannakis, N. Jiang, D. Becker, and W. J. Dally, "Packet chaining: Efficient single-cycle allocation for on-chip networks," in *Proc. IEEE/ACM Int. Symp. on Microarchitecture (MICRO)*, 2011, pp. 83–94.
18. G. Dimitrakopoulos et al., "Merged Switch Allocation and Traversal in Network-On-Chip Switches," in *IEEE Trans. on Computers*, Feb. 2013.
19. J. Kim, "Low-cost router microarchitecture for on-chip networks," in *Intern. Symp. on Microarchitecture*, Dec. 2009.
20. A. T. Tran and B. M. Baas, "RoShaQ: High-performance on-chip router with shared queues," in *IEEE ICCD*, 2011, pp. 232–238.
21. D. U. Becker, "Adaptive backpressure: Efficient buffer management for on-chip networks," in IEEE ICCD, 2012.
22. S. M. Hassan and S. Yalamanchili, "Centralized buffer router: A low latency, low power router for high radix NoCs," in *IEEE/ACM Intern. Symp. on Network on Chip*, April 2013.
23. A. Roca, C. Hernandez, J. Flich, F. Silla, and J. Duato, "Silicon-aware distributed switch architecture for on-chip networks," *Journal of Systems Architecture*, vol. 59, no. 7, pp. 505–515, 2013.
24. I. Seitanidis, A. Psarras, G. Dimitrakopoulos, and C. Nicopoulos, "Elastistore: An elastic buffer architecture for network-on-chip routers," in *Proc. of Design Automation and Test in Europe (DATE)*, Mar. 2014.
25. A. Balkan, G. Qu, and U. Vishkin, "Mesh-of-trees and alternative interconnection networks for single-chip parallelism," *IEEE Transactions on VLSI Systems*, vol. 17, no. 10, pp. 1419–1432, Oct. 2009.
26. P. Salihundam et al., "A 2Tb/s 6x4 Mesh Network with DVFS and 2.3Tb/s/W router in 45nm CMOS," in Symp. VLSI Circuits, 2010.
27. S. R. Vangal et al., "An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS," *IEEE Journal of Solid-State Circuits*, vol. 43, pp. 6–20, Jan. 2008.
28. S. Saponara, T. Bacchillone, E. Petri, L. Fanucci, R.
Locatelli, and M. Coppola, "Design of a NoC Interface Macrocell with Hardware Support of Advanced Networking Functionalities," in IEEE Transactions on Computers.
29. X. Yang, Z. Qing-li, F. Fang-fa, Y. Ming-yan, and L. Cheng, "NISAR: An AXI compliant on-chip NI architecture offering transaction reordering processing," in Proc. 7th Int. Conf. ASIC (ASICON '07), 2007, pp. 890–893.
30. A. Radulescu, J. Dielissen, S. G. Pestana, O. P. Gangwal, E. Rijpkema, P. Wielage, and K. Goossens, "An efficient on-chip NI offering guaranteed services, shared-memory abstraction, and flexible network configuration," vol. 24, no. 1, pp. 4–17, 2005.
31. M. Ebrahimi, M. Daneshtalab, P. Liljeberg, J. Plosila, and H. Tenhunen, "A high-performance network interface architecture for NoCs using reorder buffer sharing," in Proc. 18th Euromicro Int. Parallel, Distributed and Network-Based Processing (PDP) Conf., 2010, pp. 546–550.
32. S. Kavadias, M. Katevenis, D. Pnevmatikatos, "Network Interface Design for Explicit Communication in Chip Multiprocessors," chapter 10 (pp. 325–351) in the book: Designing Network-on-Chip Architectures in the Nanoscale Era, J. Flich and D. Bertozzi (Eds.), CRC Press, Taylor & Francis Group, ISBN: 978-1-4398-3710-8, 2011.
33. Gwangsun Kim, Michael M. Lee, John Kim, Jae W. Lee, Dennis Abts, Mike Marty, "Low-overhead Network-on-Chip Support for Location-oblivious Task Placement," IEEE Transactions on Computers, vol. 99, no. PrePrints, p. 1, 2014.
34. M. Fingeroff, High-level synthesis Blue Book.
35. Philippe Coussy, Adam Morawiec, High-Level Synthesis: from Algorithm to Digital Circuit.
36. O. Shacham, O. Azizi, M. Wachs, S. Richardson, M. Horowitz, "Rethinking Digital Design: Why Design Must Change," IEEE Micro, 30(6), pp. 9-24.
37. K. Itoh, "Adaptive Circuits for the 0.5-V Nanoscale CMOS Era", ISSCC 2009.
38. W. Kim, M. S. Gupta, G. Y. Wei and D. Brooks, "System Level Analysis of Fast, Per-Core DVFS using On-Chip Switching Regulators", Int. Symp.
on High-Performance Computer Architecture, 2008.
39. S. Dinghe, S. Gupta, V. De, S. Vangal, N. Borkar, S. Borkar, K. Roy, "A 45 nm 48-core IA Processor with Variation-Aware Scheduling and Optimal Core Mapping", 2011 Symposium on VLSI Circuits, pp. 205-251.
40. "International Technology Roadmap for Semiconductors 2011", System Drivers, Figure SYSD3.
41. C. Isci, A. Buyuktosunoglu, C. Cher, P. Bose, and M. Martonosi, "An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget", International Symposium on Microarchitecture, pp. 347-358, 2006.
42. R. Jain, B. Geuskens, S. Kim, M. Khellah, J. Kulkarni, J. Tschanz, V. De, "A 0.45-1V Fully-Integrated Distributed Switched Capacitor DC-DC Converter With High Density MIM Capacitor in 22 nm Tri-Gate CMOS", IEEE Journal of Solid-State Circuits, Volume: PP, Issue: 99, pp. 1-11, 2014.
43. Robert Hilbrich, J. Reinier van Kampenhout, "Partitioning and Task Transfer on NoC-based Many-Core Processors in the Avionics Domain", Softwaretechnik-Trends 31(3), 2011.
44. Francisco Triviño, José L. Sánchez, Francisco J. Alfaro, José Flich, "Network-on-Chip virtualization in Chip-Multiprocessor Systems", Journal of Systems Architecture - Embedded Systems Design 58(3-4): 126-139 (2012).
45. F. O. Sem-Jacobsen, S. Rodrigo Mocholi, A. Strano, T. Skeie, D. Bertozzi, and F. Gilabert, "Enabling Power Efficiency through Dynamic Rerouting On-Chip", ACM Transactions on Embedded Computing Systems (TECS), 12(4), 2013.
46. H. F. Tatenguem et al., "Contrasting multi-synchronous MPSoC design styles for fine-grained clock domain partitioning: the full-HD video playback case study", Proceedings of the 4th International Workshop on Network on Chip Architectures, pp. 37-42, 2011.
47. A. Strano, D. Ludovici, V. Pavlidis, F. Angiolini, M. Krstic, D.
Bertozzi, "The Synchronization Challenge", Chapter 6 in "Designing Network-on-Chip Architectures in the Nanoscale Era", edited by J. Flich and D. Bertozzi, Chapman and Hall/CRC Press; London: Taylor and Francis [distributor], 2011.
48. I. Loi, F. Angiolini, L. Benini, "Developing mesochronous synchronizers to enable 3D NoCs", Proceedings of the Design, Automation and Test in Europe Conference, pp. 1414-1419, 2008.
49. S. Saponara, T. Cecchini, F. Sechi, L. Fanucci, "Pin-limited frequency converter IP bridge for efficient communication of automotive IC sensors with off-chip ECUs", IEEE International Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, pp. 167-171, 2009.
50. D. Ludovici, A. Strano, D. Bertozzi, "Architecture design principles for the integration of synchronization interfaces into network-on-chip switches", 2nd International Workshop on Network on Chip Architectures, 2009, pp. 31-36.
51. Milos Krstic et al., "Evaluation of GALS Methods in Scaled CMOS Technology: Moonrake Chip Experience", IJERTCS 3(4): 1-18 (2012).
52. Steven M. Nowick, Montek Singh, "High-Performance Asynchronous Pipelines: An Overview", IEEE Design & Test of Computers, 28(5): 8-22 (2011).
53. P. A. Beerel, G. D. Dimou, A. M. Lines, "Proteus: An ASIC Flow for GHz Asynchronous Designs", IEEE Design & Test of Computers, vol. 28, no. 5, pp. 36-51, 2011.
54. L. A. Plana et al., "SpiNNaker: Design and Implementation of a GALS Multicore System-on-Chip," ACM JETC, vol. 7, issue 4, pp. 17:1-17:18, 2011.
55. Y. Thonnart, E. Beigne, P. Vivet, "A Pseudo-Synchronous Implementation Flow for WCHB QDI Asynchronous Circuits", 18th IEEE Int. Symp. on Asynchronous Circuits and Systems (ASYNC), pp. 73-80, 2012.
56. Y. Thonnart, P. Vivet, F. Clermidy, "A Fully-Asynchronous Low-Power Framework for GALS NoC Integration", DATE 2010, pp. 33-38.
57. S. Vangal et al., "An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS", ISSCC 2007, pp. 98-589.
58. Vivek De et al., "A 340mV-to-0.9V 20.2Tb/s Source-Synchronous Hybrid Packet/Circuit-Switched 16×16 Network-on-Chip in 22nm Tri-Gate CMOS", ISSCC 2014.
59. A. Yakovlev, P. Vivet, M. Renaudin, "Advances in asynchronous logic: From principles to GALS & NoC, recent industry applications, and commercial CAD tools", Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013, pp. 1715-1724.
60. William Lee, Vikas S. Vij, Anthony R. Thatcher, Kenneth S. Stevens, "Design of low energy, high performance synchronous and asynchronous 64-point FFT", DATE '13 Proceedings of the Conference on Design, Automation and Test in Europe, pp. 242-247.
61. M. T. Moreira, F. G. Magalhaes, M. Gibiluka, F. P. Hessel, N. L. V. Calazans, "BaBaNoC: An asynchronous network-on-chip described in Balsa", 2013 International Symposium on Rapid System Prototyping (RSP), pp. 37-43.
62. D. Gebhardt, Junbok You, K. S. Stevens, "Design of an Energy-Efficient Asynchronous NoC and Its Optimization Tools for Heterogeneous SoCs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 9, pp. 1387-1399.
63. M. Imai, T. Yoneda, "Improving Dependability and Performance of Fully Asynchronous On-chip Networks", 17th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), 2011, pp. 65-76.
64. Alberto Ghiribaldi, Davide Bertozzi, Steven M. Nowick, "A transition-signaling bundled data NoC switch architecture for cost-effective GALS multicore systems", Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013, pp. 332-337.
65. F. Clermidy et al., "MAGALI: A Network-on-Chip based multi-core system-on-chip for MIMO 4G SDR", IEEE International Conference on IC Design and Technology (ICICDT), 2010, pp. 74-77.
66. E. Beigne et al., "An Asynchronous Power Aware and Adaptive NoC Based Circuit", IEEE Journal of Solid-State Circuits, vol. 44, no. 4, pp. 1167-1177, 2009.
67. D. Lattard et al., "A Telecom Baseband Circuit based on an Asynchronous Network-on-Chip", ISSCC 2007, pp. 258-601.
68. B. Krzanich, "CES 2014 Keynote", http://www.intel.com/content/www/us/en/events/intel-ces-keynote.html
69. Christoforos Kachris, Ioannis Tomkos, "A Survey on Optical Interconnects for Data Centers", IEEE Communications Surveys and Tutorials 14(4): 1021-1036 (2012).
70. "International Technology Roadmap for Semiconductors 2011", Interconnect.
71. S. Pasricha, N. Dutt, "Trends in Emerging On-Chip Interconnect Technologies", IPSJ Trans. on System LSI Design Methodology, vol. 1, pp. 2-7, 2008.
72. L. P. Carloni et al., "Networks-on-Chip in Emerging Interconnect Paradigms: Advantages and challenges", in Proc. 3rd ACM/IEEE Int. Symp. Networks-on-Chip, 2009, pp. 93-102.
73. N. Kirman et al., "Leveraging Optical Technology in Future Bus-based Chip Multiprocessors", in MICRO, 2006.
74. C. Batten et al., "Building manycore processor-to-DRAM networks with monolithic silicon photonics", in Hot Interconnects, Aug. 2008, pp. 21-30.
75. Y. Pan, P. Kumar, J. Kim, G. Memik, Y. Zhang, and A. N. Choudhary, "Firefly: illuminating future network-on-chip with nanophotonics", in ISCA, 2009, pp. 429-440.
76. D. Vantrease, N. L. Binkert, R. Schreiber, and M. H. Lipasti, "Light speed arbitration and flow control for nanophotonic interconnects", in MICRO'09, 2009, pp. 304-315.
77. G. Kurian, J. E. Miller, J. Psota, J. Eastep, J. Liu, J. Michel, L. C. Kimerling, and A. Agarwal, "ATAC: a 1000-core cache-coherent processor with on-chip optical network", in PACT'10, 2010, pp. 477-488.
78. D. Vantrease et al., "Corona: System Implications of Emerging Nanophotonic Technology", in ISCA, 2008.
79. M. J. Cianchetti, J. C. Kerekes, and D. H. Albonesi, "Phastlane: a Rapid Transit Optical Routing Network", SIGARCH Comput. Archit. News, vol. 37, pp. 441-450, June 2009.
80. Y. Pan, J. Kim, and G. Memik, "Flexishare: Channel Sharing for an Energy-Efficient Nanophotonic Crossbar," in HPCA, 2010, pp. 1-12.
- A. Shacham, B.G. Lee, A. Biberman, K. Bergman, and L.P. Carloni, "Photonic NoC for DMA Communications in Chip Multiprocessors," in Hot Interconnects, Aug 2007. - 82. J. Chan, G. Hendry, A. Biberman, K. Bergman and L.P. Carloni, "Phoenixsim: A Simulator for Physical-Layer Analysis of Chip-Scale Photonic Interconnection Networks", DATE, 2010. - 83. G. Hendry et al., "Analysis of Photonic Networks for a Chip Multi-Processor Using Scientific Applications, DATE, 2010. Proceedings of the Third International Symposium on Networks-on-Chip (NOCS), 2009. - 84. Marta Ortn Obn, Luca Ramini, Herve Tatanguem Fankem, Victor Vinals-Yufera and Davide Bertozzi, "A Complete Electronic Network Interface Architecture for Global Contention-Free Communication over Emerging Optical Networks-on-Chip" Proceedings of GLSVLSI Symposium, 2014. - 85. G.Kurian et al., "Cross-Layer Energy and Performance Evaluation of a Nanophotonic Manycore Processor System using Real Application Workloads", IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS), Page(s): 1117 - 1130, 2012. - 86. Luca Ramini, Paolo Grani, Herve Tatenguem Fankem, Alberto Ghiribaldi, Sandro Bartolini and Davide Bertozzi "Assessing the Energy Break-Even Point between an Optical NoC Architecture and an Aggressive Electronic Baseline", Proceedings of DATE 2014. - Weiss, S.M., Molinari, M. and Fauchet, P.M.: "Temperature stability for silicon-based photonic band-gap structures", Appl. Phys. Lett., Vol.83, No.10, pp.1980-1982, Sep. 2003 - C.Batten, A.Joshi, V.Stojanovic, K.Asanovic, "Designing Chip-Level Nanophotonic Interconnection Networks", IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol.2, no.2, June 2012. - M. F. Chang et al., "CMP Network-on-Chip Overlaid With Multi-Band RF-Interconnect", Proceedings of IEEE International Symposium on High-Performance Computer Architecture (HPCA), 16-20 February, 2008, pp. 191-202. - D. Zhao and Y. 
Wang, "SD-MAC: Design and Synthesis of A Hardware-Efficient Collision-Free QoS-Aware MAC Protocol for Wireless Network-on-Chip", IEEE Transactions on Computers, vol. 57, no. 9, September 2008, pp. 1230-1245. - 91. S. B. Lee et al., "A Scalable Micro Wireless Interconnect Structure for CMPs", Proceedings of ACM Annual International Conference on Mobile Computing and Networking (MobiCom), September, 2009, pp. 20-25. - K. Kempa, et al., "Carbon Nanotubes as Optical Antennae," Advanced Materials, vol. 19, 2007, pp. 421-426. - 93. S.Deb, A.Ganguly, P.P.Pande, B.Belzer, D.Heo, "Wireless NoC as Interconnection Backbone for Multicore Chips: Promises and Challenges", IEEE Journal on Emerging and Selected Topics in circuits and Systems, vol.2, no.2, June 2012. - 94. X. Yu et al., "A Wideband Body-Enabled Millimeter-Wave Transceiver for Wireless Network-on-Chip", in Proc. 54th IEEE Midwest Symp. Circuits Syst., Aug. 2011, pp. 1-4. - 95. S. B. Lee et al., "A Scalable Micro Wireless Interconnect Structure for CMPs", in Proc. ACM Annu. Int. Con. Mobile Comput. Network. (MobiCom), 2009, pp. 20-25 - A. Ganguly et al., "Scalable Hybrid Wireless Networkon-Chip Architectures for Multi-Core Systems" IEEE Trans. Computers, vol. 60, no.10, pp. 1485-1502. - 97. D. DiTomaso et al., "iWise: Inter-routerwireless scalable express channels for Network-on-Chips (NoCs) architecture," in Proc. Annu. Symp. High Performance Interconnects, 2011, pp. 11-18. - S. Deb et al., "Enhancing performance of Network-on-Chip architectures with millimeter-wave wireless interconnects," in Proc. IEEE Int. Conf. ASAP, 2010, pp. 73-80. - D. Zhao et al., "Design of multi-channel wireless NoC to improve on-chip communication capacity," in Proc. 5th ACM/IEEE Int. Symp. Networks-on-Chip, 2011, pp. 177-184. - 100. C.Wang et al., "A wireless Network-on-Chip design for multicore platforms," in Proc. 19th Int. Euromicro Conf. Parallel, Distributed Network-Based Process., 2011, pp. 409-416. - 101. 
K.Chang et al., "Performance evaluation and design trade-offs for wireless network-on-chip architectures", - ACM Journal on Emerging Technologies in Computing Systems, Volume 8, Issue 3, August 2012. - 102. S. Deb et al., "Design of an efficient NoC architecture using millimeter-wave wireless links", in Proc. IEEE Int. Symp. Quality Electron. Design (ISQED), Mar. 2012, pp. 165-172. - 103. A. Ganguly et al., "Complex network inspired fault-tolerant NoC architectures with wireless links," in Proc. 5th ACM/IEEE Int. Symp. Networks-on-Chip, 2011, pp. 1485-1502. - 104. A. Ganguly et al., "A unified error control coding scheme to enhance the reliability of a hybrid wireless Network-on-Chip," in Proc. IEEE Int. Symp. Defect Fault Tolerance VLSI Nanotechnol. Syst., 2011, pp.277-285. - J.Howard et al., "A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS", ISSCC, pp.108-109, 2010. - 106. S.Deb, K.Chang, A.Ganguly, P.Pande, "Comparative Performance Evaluation of Wireless and Optical NoC Architectures", SoCC 2010: 487-492. - 107. J.C. Leu and V. Stojanovic, "Injection-Locked Clock Receiver for Monolithic Optical Link in 45nm," Asian Solid-State Circuits Conference, Jeju, Korea, pp. 149-152, November 2011. - 108. Sebastien Le Beux, Jelena Trajkovic, Ian O'Connor, Gabriela Nicolescu: "Layout guidelines for 3D architectures including Optical Ring Network-on-Chip (ORNoC)." VLSI-SoC 2011: 242-247. - 109. Luca Ramini, Paolo Grani, Sandro Bartolini, Davide Bertozzi: "Contrasting wavelength-routed optical NoC topologies for power-efficient 3D-stacked multicore processors using physical-layer analysis." DATE 2013: 1589-1594. - Anja Boos, Luca Ramini, Ulf Schlichtmann, Davide Bertozzi: "PROTON: an automatic place-and-route tool for optical Networks-on-Chip", ICCAD 2013: 138-145. - 111. ETP4HPC Strategic Research Agenda, available at http://www.etp4hpc.eu/ - 112. B. Krzanich, "CES 2014 Keynote", http://www.intel.com/content/www/us/en/events/intel-ces-keynote.html