TLM 415 Case Study 2.1 Question 2
The design of Systems-on-Chip (SoCs) includes both hardware and software development. The embedded software depends on the hardware interfaces, but the hardware design is the most critical step of the flow, because modifying it is very costly. For this reason, a model of the hardware has to be developed. TLM (Transaction Level Modeling) models are highly abstract: they make it possible both to check hardware functionality in terms of data and control exchanges and to debug the embedded software before the detailed microarchitecture is available. Within TLM, multiple coding styles exist, depending on the level of detail needed. If the purpose is to have the earliest and fastest opportunity to run a transaction-accurate simulation of the hardware, for example in the case of a non-critical SoC, then the LT (Loosely Timed) style is a relevant choice. Indeed, a TLM/LT simulation runs orders of magnitude faster than an RTL (Register Transfer Level) one.
SystemC is a C++ hardware modeling library which enables TLM, and also an IEEE standard. Since hardware systems are intrinsically parallel, the API offered by SystemC supports parallel semantics. The hardware behavior is modeled by SystemC threads and methods. They are executed by a scheduler, which guarantees that the order of their execution respects the constraints specified in the model. According to the SystemC standard, a scheduler must behave as if it were implementing coroutine semantics. This means that, for each execution order, there must exist a sequential scheduling that reproduces it. The SystemC simulation kernel provided by the Accellera Systems Initiative (ASI, formerly OSCI) is a sequential implementation. An advantage of a sequential implementation is that it makes the determinism of executions easier to achieve and eases the reproducibility of errors. An obvious drawback is that it does not exploit the parallelism of the host machine. With the increasing size of models, simulation time is the major bottleneck of complex hardware simulation. The parallelization of SystemC simulations is not straightforward, and is a major research concern.
For more than a decade now, there have been several proposals for SystemC parallelization. An approach chosen in [2,3,4] is to run multiple processes concurrently inside a delta cycle, with a synchronization barrier at the end of each one. Parallel discrete event simulation (PDES) has also been exploited, first with a conservative approach [5,6,7,8], where all the time constraints are strictly fulfilled, then with a more optimistic one, relaxing the synchronization with a time quantum [9,10]. Optimistic approaches may need a rollback mechanism in case the simulation went through an invalid path. Another work combined different methods: parallelization inside delta cycles with relaxed synchronizations. To conclude this panorama, sc_during allows specifying that some parts of the simulation can be run concurrently with the rest of the platform.
Each of these approaches has been shown experimentally to be efficient on some benchmarks, but the representativeness of these benchmarks compared to industrial case studies is questionable. Indeed, few of the works above target LT simulations, although such models are commonly used for fast and early simulation. One difficulty is that real case studies are often confidential, and hardly available to the research community working on parallel simulation. Conversely, most research tools are not publicly available, hence a fair comparison on case studies is not possible. Our claim is that the challenges raised by the parallelization of LT SystemC models are fundamentally different from those in cycle-accurate or other fine-granularity models. As a consequence, many of the existing approaches cannot work on LT models.
To support this claim, we provide measurements performed on a case study from STMicroelectronics. Since SystemC is a C++ library, usual profilers for C++, like gprof or valgrind + kcachegrind, can be used. They will however miss important aspects of the execution of a SystemC program, such as how much time is spent in the SystemC kernel as opposed to the user-written parts, per-process statistics, and visualization based on simulated time. To the best of our knowledge, there is no turnkey application available to obtain this information from a SystemC simulation. We have developed SycView, a profiling and visualization tool for SystemC, and we present the results on an industrial LT platform from STMicroelectronics. With these measurements, we show that some approaches cannot work on the model we want to parallelize, and thus that any implementation using such a technique will not be efficient.
We believe this paper provides a better understanding of the potential bottlenecks of various parallelization approaches on such platforms. It should help both the design of efficient parallelization solutions and the design of representative benchmarks. We also propose a comprehensive survey of the existing solutions with a critical analysis. Section 2 gives some background information. The problem is then described in Section 3. Section 4 presents SycView, a visualization and profiling tool we developed to obtain information about a simulation. The results obtained with SycView on an industrial test case from STMicroelectronics are shown in Section 5. Finally, in Section 6 we describe the panorama of existing work on the parallelization of SystemC simulations.
2.2. SystemC Scheduling
As a reminder, we first present in Figure 1 an abstract view of the SystemC scheduler behavior, as stated in the SystemC standard.
The scheduler starts with an initialization phase that we do not detail here. In the evaluation phase, the runnable processes are started or resumed in no particular order. The immediate notifications produced by these transitions are then triggered. If there are runnable processes after the notifications, the evaluation phase continues. Otherwise, the scheduler moves to the update phase, followed by the delta notification phase. The immediate notification loop is implicit in the figure, within the evaluation phase box. A delta cycle corresponds to the loop: evaluation, update and delta notification. At the end of a delta cycle, if there are no runnable processes, the scheduler checks timed events. If there are timed events, it picks the earliest one, sets the current simulation time to its time, and notifies the time change. A timed cycle corresponds to the loop: evaluation, update, delta notification and time notification. For concision, we did not represent the other sets involved in the scheduling algorithm.
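The delta-cycle loop just described can be illustrated with a miniature event kernel. The following is a plain C++ sketch of the evaluation/update/delta-notification loop, not the Accellera kernel; all names (MiniKernel, etc.) are ours, and notifications here only make processes runnable again.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical miniature kernel illustrating the delta-cycle loop:
// evaluation, update, delta notification. One iteration of run()'s
// outer loop is one delta cycle.
struct MiniKernel {
    std::vector<std::function<void()>> runnable;      // evaluation phase
    std::vector<std::function<void()>> update_reqs;   // update phase
    std::vector<std::function<void()>> delta_notifs;  // delta notification

    int delta_cycles = 0;

    void run() {
        // Loop until no process becomes runnable again.
        while (!runnable.empty()) {
            // Evaluation: run every runnable process (order unspecified).
            auto procs = std::move(runnable);
            runnable.clear();
            for (auto& p : procs) p();
            // Update: apply pending update requests (e.g., signal writes).
            for (auto& u : update_reqs) u();
            update_reqs.clear();
            // Delta notification: may make processes runnable again,
            // which starts the next delta cycle.
            auto notifs = std::move(delta_notifs);
            delta_notifs.clear();
            for (auto& n : notifs) n();
            ++delta_cycles;
        }
    }
};
```

A process that posts a delta notification waking another process thus produces a simulation of two delta cycles, mirroring the loop in Figure 1.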
2.3. Time Modeling
In a TLM/LT model, the microarchitecture is not modeled, so the computation may be performed differently from the actual hardware, even if functionally equivalent. As a consequence, it makes no sense for a process to yield the control to another process in the middle of a computation: the state that the other process could observe would not be relevant with respect to the hardware [13,14]. Thus, a computation that takes time is usually modeled with a sequence like in Figure 2.
A construction commonly used in TLM/LT models is temporal decoupling. It was added in the TLM-2 standard, but was in fact already used before in SystemC models. It consists of defining a local time for each process, which can be increased during the execution, getting ahead of the SystemC time. The increase is called a low-cost timing annotation because it only operates on a local variable and induces no SystemC kernel operation. To keep time consistency, a time purge is defined. The purge of the local time happens notably when reaching synchronization points (synchronization-on-demand in TLM-2). This avoids synchronizing at every annotation; instead, the purge is performed on essential events of the system. Using temporal decoupling, the sequence in Figure 2 becomes the one in Figure 3.
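The mechanism can be sketched in a few lines of plain C++ (this is not the TLM-2 quantum-keeper API; the names and the kernel_time stand-in are ours). Annotations only touch a local variable; the kernel is involved only at the purge.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of temporal decoupling: timing annotations only increase a
// process-local offset; the (hypothetical) kernel is involved only when
// the offset is purged at a synchronization point.
struct DecoupledProcess {
    uint64_t kernel_time = 0;   // time known to the simulation kernel
    uint64_t local_offset = 0;  // how far this process ran ahead
    int kernel_calls = 0;       // counts expensive kernel operations

    // Low-cost annotation: a plain addition, no kernel operation.
    void annotate(uint64_t duration) { local_offset += duration; }

    // Purge: hand the accumulated offset back to the kernel, as done at
    // synchronization points (synchronization-on-demand in TLM-2).
    void sync() {
        kernel_time += local_offset;  // stands in for wait(local_offset)
        local_offset = 0;
        ++kernel_calls;
    }
};
```

Three annotations followed by one purge thus cost a single kernel operation instead of three, which is the point of the construction.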
Precise timing information is not available in early models. Because of this, to avoid over-specification, time ranges are used in loosely timed models. Using time ranges instead of time values means that two values (the bounds of the range) must be specified when annotating, instead of one. For the synchronization, a value within the current time interval must be chosen, since the SystemC kernel needs a time value. The choice of the value within the range is implementation-defined. The current implementation in use at STMicroelectronics picks a random value within the interval. We will see in Section 5.3 that even if time ranges could be better used, the benefits in our test case are barely perceptible. This illustrates the claim we made in Section 1.
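The range-based annotation scheme can be sketched as follows (a plain C++ illustration under our own names, not the STMicroelectronics implementation): each annotation widens the accumulated bounds, and a single concrete value is drawn from the interval only when the kernel needs one.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Sketch of loosely timed annotation with time ranges: annotations add an
// interval [min, max] to the process-local bounds; a concrete value is
// chosen (here, uniformly at random) only at a synchronization point.
struct RangeAnnotator {
    uint64_t lo = 0, hi = 0;  // accumulated bounds of the local offset

    // Two values instead of one must be given when annotating.
    void annotate(uint64_t min_d, uint64_t max_d) {
        lo += min_d;
        hi += max_d;
    }

    // Pick one value within [lo, hi] for the kernel, then reset.
    uint64_t sync(std::mt19937& rng) {
        std::uniform_int_distribution<uint64_t> dist(lo, hi);
        uint64_t chosen = dist(rng);
        lo = 0;
        hi = 0;
        return chosen;
    }
};
```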
2.4. Communication Between SystemC Processes
In TLM, the different processes mostly communicate through TLM sockets. In practice, TLM sockets forward function calls from the source to the target until the final target is reached, where the function is defined. That means that a process emitting a transaction through a TLM socket will be executing a piece of code in the target module, i.e., code which can use variables shared with other processes of the same module.
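The forwarding can be sketched in plain C++ (the interface and names below are illustrative, not the real TLM-2 socket API): a "socket" is just a pointer to the next component, and the call is forwarded until the final target, whose code runs on the initiator's process.

```cpp
#include <cassert>
#include <string>

// Minimal transport interface standing in for a TLM socket call chain.
struct Transport {
    virtual ~Transport() = default;
    virtual void b_transport(std::string& payload) = 0;
};

// An interconnect forwards the call to whatever its socket is bound to.
struct Interconnect : Transport {
    Transport* bound = nullptr;
    void b_transport(std::string& payload) override {
        bound->b_transport(payload);
    }
};

// The final target defines the function; the initiator's process executes
// this code, touching the target module's own state.
struct Memory : Transport {
    int writes = 0;
    void b_transport(std::string& payload) override {
        ++writes;  // target-side state, reachable from any initiator
        payload = "handled-by-memory";
    }
};
```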
This is fundamentally different from sc_signal communication, which is the basic communication medium in cycle-accurate models. Indeed, signals contain an isolation mechanism between their current value (i.e., the value returned by a read on the signal) and their future value (i.e., the value written to the signal). The future value is assigned to the current value during the update phase, so there is no race condition between a write and one or several reads during the evaluation phase (even if the execution semantics is relaxed to allow parallel execution during the evaluation phase). This isolation between readers and writers gives a convenient opportunity to have them running in parallel at negligible extra cost.
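The isolation mechanism amounts to double buffering, sketched below in plain C++ (this is an illustration of the principle, not the real sc_signal class): reads see the current value, writes go to the future value, and the update phase copies future into current.

```cpp
#include <cassert>

// Double-buffered signal illustrating sc_signal's current/future split.
template <typename T>
struct Signal {
    T current{};           // value seen by readers
    T future{};            // value set by writers
    bool pending = false;  // was a write requested this cycle?

    T read() const { return current; }  // always the current value
    void write(const T& v) {            // never visible immediately
        future = v;
        pending = true;
    }
    void update() {                     // called in the update phase
        if (pending) {
            current = future;
            pending = false;
        }
    }
};
```

Because a write never touches the value readers observe, readers and the writer could run in parallel within one evaluation phase without racing, which is the opportunity mentioned above.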
3. Problem Statement
The parallelization of LT SystemC/TLM simulations requires solving several challenges. We do not present classic issues inherent to software parallelization; we focus instead on the ones induced specifically by such models, in particular in an industrial context.
A SystemC parallelization solution must not introduce race conditions. As stated in Section 2.4, in TLM models, communication is done by function calls, which involves shared resources. For example, two initiators (e.g., CPUs) that concurrently access the same target (e.g., a RAM) will concurrently call the same function of this target. That makes the target component itself a shared resource, introducing a race condition if not protected.
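The situation can be made concrete with a plain C++ sketch (all names are illustrative): two initiator threads call the same function of one target, so the target's state is shared. Here a mutex protects the access; under a parallel scheduler without such protection, the increment would be a data race.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>

// A target (e.g., a RAM model) whose transport function mutates shared
// state. The mutex stands in for whatever protection a parallel
// scheduler would have to add.
struct Target {
    std::mutex m;
    long accesses = 0;

    // Stands in for the function reached through the TLM socket.
    void transport() {
        std::lock_guard<std::mutex> lock(m);
        ++accesses;  // shared state mutated by every initiator
    }
};

// One initiator (e.g., a CPU model) issuing n transactions to the target.
void initiator(Target& t, int n) {
    for (int i = 0; i < n; ++i) t.transport();
}
```

With the lock, two concurrent initiators of 1000 transactions each always leave the counter at 2000; removing it makes the final count nondeterministic, which is exactly the race a parallelization solution must rule out.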
In the industry, there is a need to support heterogeneous simulations, in which some parts of a model are designed with a different technology. For example, a vendor may provide only the RTL model of a hardware block, while the rest of the model is written in SystemC/TLM. In this case, the RTL model can be simulated, or can be run on an FPGA device in co-simulation with the TLM model. One part of the model can even be a real hardware component (e.g., a prototype). An industrially compliant parallelization solution must be able to integrate this heterogeneity.
Another major challenge for SystemC parallelization is adaptability to existing platforms. Indeed, as for every technology change, the migration has a cost. This cost must be put in perspective with the time saved once the parallelization solution is in production: a solution that requires an important migration effort may not be profitable even if it shows substantial performance benefits.
To conclude this section, we note a very important point regarding the design of a parallelization technique: the knowledge of the profile of a simulation. By analogy, parallelizing huge independent computations on matrices is not done the same way as parallelizing a shortest-path algorithm. In our case, a simulation is mostly characterized by the model and not by the simulation kernel. In other words, a major interest in our research is the ability to characterize a simulation. One interesting measurement for that purpose is the number of runnable SystemC processes at each simulation cycle. Others include the wall-clock time consumption per transition, or per SystemC process. Put in perspective together, they give strong clues about the applicability of a parallelization option in a given case.
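The per-process measurements mentioned above can be sketched as follows (a plain C++ illustration under our own names and structure, not the SycView API): each transition is wrapped in wall-clock timers and the result is accumulated per process.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <map>
#include <string>

// Accumulates wall-clock time and transition counts per process name,
// as a profiler hooked into the kernel's evaluation phase could do.
struct Profiler {
    std::map<std::string, double> wall_seconds;  // time per process
    std::map<std::string, long>   transitions;   // transitions per process

    // Run one transition of the named process under a timer.
    void run_transition(const std::string& proc,
                        const std::function<void()>& body) {
        auto start = std::chrono::steady_clock::now();
        body();
        std::chrono::duration<double> d =
            std::chrono::steady_clock::now() - start;
        wall_seconds[proc] += d.count();
        transitions[proc] += 1;
    }
};
```

Combining these counters with the number of runnable processes per cycle gives the kind of profile used in Section 5 to judge the applicability of each parallelization option.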
Case study: Eagersaver.com was established in 2005 by the CEO Colette Bevan as an online comparison site primarily focused on car insurance and related products. Since then it has grown, both organically and by acquisition of other companies, into an organization that now compares home insurance, legal insurance, pet insurance, travel insurance, life insurance and accident insurance. It has diversified into other comparison services in financial products, travel services and utilities. It has moved offline with the opening of call center activities and a TV shopping channel. The company's turnover is £100m. The Managing Director Dirk Bradfield now wishes to float the company on the stock exchange and, following a due diligence exercise by Colette's corporate advisors, she has been advised to 'professionalize the procurement activities throughout the group.' The due diligence uncovered the following facts:
I. There are five locations within the UK, situated at Chester, Edinburgh, Sheffield, Bristol and Cardiff. They work independently and only one location has a Purchasing Manager (TV shopping channel).
II. The largest spend across the group is on marketing (£12m online and £8m TV).
III. Marketing is centrally managed by the Marketing Director.
IV. Most other procurement is undertaken by service heads including IT and agency staff.