Addressable Memory

Architecture

David Money Harris, Sarah L. Harris, in Digital Design and Computer Architecture (Second Edition), 2013

Memory

If registers were the only storage space for operands, we would be confined to simple programs with no more than 32 variables. However, data can also be stored in memory. When compared to the register file, memory has many data locations, but accessing it takes a longer amount of time. Whereas the register file is small and fast, memory is large and slow. For this reason, commonly used variables are kept in registers. By using a combination of memory and registers, a program can access a large amount of data fairly quickly. As described in Section 5.5, memories are organized as an array of data words. The MIPS architecture uses 32-bit memory addresses and 32-bit data words.

MIPS uses a byte-addressable memory. That is, each byte in memory has a unique address. However, for explanation purposes only, we first introduce a word-addressable memory, and later describe the MIPS byte-addressable memory.

Figure 6.1 shows a memory array that is word-addressable. That is, each 32-bit data word has a unique 32-bit address. Both the 32-bit word address and the 32-bit data value are written in hexadecimal in Figure 6.1. For example, data 0xF2F1AC07 is stored at memory address 1. Hexadecimal constants are written with the prefix 0x. By convention, memory is drawn with low memory addresses toward the bottom and high memory addresses toward the top.

Figure 6.1. Word-addressable memory

MIPS uses the load word instruction, lw, to read a data word from memory into a register. Code Example 6.6 loads memory word 1 into $s3.

The lw instruction specifies the effective address in memory as the sum of a base address and an offset. The base address (written in parentheses in the instruction) is a register. The offset is a constant (written before the parentheses). In Code Example 6.6, the base address is $0, which holds the value 0, and the offset is 1, so the lw instruction reads from memory address ($0 + 1) = 1. After the load word instruction (lw) is executed, $s3 holds the value 0xF2F1AC07, which is the data value stored at memory address 1 in Figure 6.1.

Code Example 6.6

Reading Word-Addressable Memory

Assembly Code

# This assembly code (unlike MIPS) assumes word-addressable memory

  lw $s3, 1($0)   # read memory word 1 into $s3

Code Example 6.7

Writing Word-Addressable Memory

Assembly Code

# This assembly code (unlike MIPS) assumes word-addressable memory

  sw   $s7, 5($0)   # write $s7 to memory word 5

Similarly, MIPS uses the store word instruction, sw, to write a data word from a register into memory. Code Example 6.7 writes the contents of register $s7 into memory word 5. These examples have used $0 as the base address for simplicity, but remember that any register can be used to supply the base address.

The previous two code examples have shown a computer architecture with a word-addressable memory. The MIPS memory model, however, is byte-addressable, not word-addressable. Each data byte has a unique address. A 32-bit word consists of four 8-bit bytes. So each word address is a multiple of 4, as shown in Figure 6.2. Again, both the 32-bit word address and the data value are given in hexadecimal.

Figure 6.2. Byte-addressable memory

Code Example 6.8 shows how to read and write words in the MIPS byte-addressable memory. The word address is four times the word number. The MIPS assembly code reads words 0, 2, and 3 and writes words 1, 8, and 100. The offset can be written in decimal or hexadecimal.

The MIPS architecture also provides the lb and sb instructions that load and store single bytes in memory rather than words. They are similar to lw and sw and will be discussed further in Section 6.4.5.

Byte-addressable memories are organized in a big-endian or little-endian fashion, as shown in Figure 6.3. In both formats, the most significant byte (MSB) is on the left and the least significant byte (LSB) is on the right. In big-endian machines, bytes are numbered starting with 0 at the big (most significant) end. In little-endian machines, bytes are numbered starting with 0 at the little (least significant) end. Word addresses are the same in both formats and refer to the same four bytes. Only the addresses of bytes within a word differ.

Figure 6.3. Big- and little-endian memory addressing

Code Example 6.8

Accessing Byte-Addressable Memory

MIPS Assembly Code

lw   $s0,   0($0)   # read data word 0 (0xABCDEF78) into $s0

lw   $s1,   8($0)   # read data word 2 (0x01EE2842) into $s1

lw   $s2,   0xC($0)   # read data word 3 (0x40F30788) into $s2

sw   $s3,   4($0)   # write $s3 to data word 1

sw   $s4,   0x20($0)   # write $s4 to data word 8

sw   $s5,   400($0)   # write $s5 to data word 100

Example 6.2

Big- and Little-Endian Memory

Suppose that $s0 initially contains 0x23456789. After the following program is run on a big-endian system, what value does $s0 contain? In a little-endian system? lb $s0, 1($0) loads the data at byte address (1 + $0) = 1 into the least significant byte of $s0. lb is discussed in detail in Section 6.4.5.

sw $s0, 0($0)

lb $s0, 1($0)

Solution

Figure 6.4 shows how big- and little-endian machines store the value 0x23456789 in memory word 0. After the load byte instruction, lb $s0, 1($0), $s0 would contain 0x00000045 on a big-endian system and 0x00000067 on a little-endian system.

Figure 6.4. Big-endian and little-endian data storage
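The same effect can be checked in software. The following C sketch (added here for illustration, not part of the original example) stores the 32-bit value and then reads back the byte at offset 1, which yields 0x45 on a big-endian machine and 0x67 on a little-endian one.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t word = 0x23456789;          /* value stored to memory word 0 */
    uint8_t *bytes = (uint8_t *)&word;   /* view the word as four bytes   */

    /* bytes[1] corresponds to byte address 1 within the word:
       0x45 on a big-endian system, 0x67 on a little-endian system. */
    printf("byte 1 = 0x%02X\n", (unsigned)bytes[1]);
    return 0;
}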

IBM's PowerPC (formerly found in Macintosh computers) uses big-endian addressing. Intel's x86 architecture (found in PCs) uses little-endian addressing. Some MIPS processors are little-endian, and some are big-endian. 1 The choice of endianness is completely arbitrary but leads to hassles when sharing data between big-endian and little-endian computers. In examples in this text, we will use little-endian format whenever byte ordering matters.

The terms big-endian and little-endian come from Jonathan Swift's Gulliver's Travels, first published in 1726 under the pseudonym of Isaac Bickerstaff. In his stories the Lilliputian king required his citizens (the Little-Endians) to break their eggs on the little end. The Big-Endians were rebels who broke their eggs on the big end.

The terms were first applied to computer architectures by Danny Cohen in his paper "On Holy Wars and a Plea for Peace" published on April Fools Day, 1980 (USC/ISI IEN 137). (Photo courtesy of The Brotherton Collection, Leeds University Library.)

In the MIPS architecture, word addresses for lw and sw must be word aligned. That is, the address must be divisible by 4. Thus, the instruction lw $s0, 7($0) is an illegal instruction. Some architectures, such as x86, allow non-word-aligned data reads and writes, but MIPS requires strict alignment for simplicity. Of course, byte addresses for load byte and store byte, lb and sb, need not be word aligned.
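The alignment rule is easy to express in C; a minimal sketch (added for illustration): an address is word aligned exactly when its two least significant bits are zero.

#include <stdint.h>
#include <stdbool.h>

/* A word address is aligned when it is divisible by 4,
   i.e., its two least significant bits are zero. */
static bool is_word_aligned(uint32_t address) {
    return (address & 0x3) == 0;
}

/* is_word_aligned(8) -> true:  lw/sw are allowed
   is_word_aligned(7) -> false: lw $s0, 7($0) would be illegal */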

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123944245000069

Introduction

William J. Buchanan BSc, CEng, PhD, in Software Development for Engineers, 1997

13.6 Memory addressing size

The size of the address bus indicates the maximum addressable number of bytes. Table 13.4 shows the size of addressable memory for a given address bus size. For example:

Table 13.4. Addressable memory (in bytes) related to address bus size

Address bus size Addressable memory (bytes)
1 2
2 4
3 8
4 16
5 32
6 64
7 128
8 256
9 512
10 1K *
11 2K
12 4K
13 8K
14 16K
15 32K
16 64K
17 128K
18 256K
19 512K
20 1M
21 2M
22 4M
23 8M
24 16M
25 32M
26 64M
32 4G
64 16GG
*
1K represents 1024
1M represents 1 048 576 (1024 K)
1G represents 1 073 741 824 (1024 M)

A 1-bit address bus can address up to two locations (that is 0 and 1).

A 2-bit address bus can address 2² or 4 locations (that is 00, 01, 10 and 11).

A 20-bit address bus can address up to 2²⁰ addresses (1 MB).

A 24-bit address bus can address up to 16 MB.

A 32-bit address bus can address up to 4 GB.
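The values in the table follow directly from the relation that an n-bit bus addresses 2ⁿ bytes. A short C sketch (added here for illustration, not from the original text) that reproduces the table:

#include <stdio.h>

int main(void) {
    /* Addressable locations = 2^n, where n is the address bus width. */
    for (int n = 1; n <= 32; n++) {
        unsigned long long locations = 1ULL << n;
        printf("%2d-bit address bus: %llu bytes addressable\n", n, locations);
    }
    return 0;
}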

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B978034070014350058X

Embedded Software

Colin Walls, in Embedded Software (Second Edition), 2012

1.3.2 Flat Single-Space Memory

Flat memory is conceptually the easiest architecture to appreciate. Each memory location has an address, and each address refers to a single memory location. The maximum size of addressable memory has a limit, which is most likely to be defined by the word size of the chip. Examples of chips applying this scheme are the Freescale ColdFire and the Zilog Z80.

Typically, addresses start at zero and go up to a maximum value. Sometimes, particularly with embedded systems, the sequence of addresses may be discontinuous. As long as the programmer understands the architecture and has the right development tools, this discontinuity is not a problem.

Most programming languages, like C, assume a flat memory. No special memory-handling facilities need be introduced into the language to fully utilize flat memory. The only possible issues are the use of address zero, which represents a null pointer in C, or high addresses, which may be interpreted as negative values if care is not exercised.
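A brief C sketch (added for illustration; the addresses are hypothetical) of how flat memory is typically touched from C, and why address zero and high addresses deserve care:

#include <stdint.h>

/* In a flat memory map, a RAM or peripheral location is often accessed
   by casting a literal address to a pointer (address chosen arbitrarily
   here for illustration only). */
#define STATUS_REG  ((volatile uint32_t *)0x4000u)

void example(void) {
    uint32_t status = *STATUS_REG;      /* read a memory-mapped location */
    (void)status;

    /* Address 0 is the null pointer in C, so dereferencing it is
       undefined behaviour even if the hardware maps something there. */

    /* High addresses such as 0xFFFF0000 look negative if stored in a
       signed integer, so unsigned types should be used for addresses. */
    uint32_t high_addr = 0xFFFF0000u;
    (void)high_addr;
}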

Linkers designed for embedded applications that support microprocessors with flat memory architectures usually accommodate discontinuous memory space by supporting scatter loading of program and data sections. The flat address memory architecture is shown in Figure 1.2.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780124158221000015

Computer architecture

A.C. Fischer-Cripps, in Newnes Interfacing Companion, 2002

2.2.9 Microprocessor unit (MPU/CPU)

The central processing unit organises and orchestrates all activities within the microcomputer. Each operation within the CPU is actually a very simple task involving the interaction of binary numbers and Boolean algebra. A large number of these simple tasks combine to form a particular function which may appear to be dauntingly complex.

The "MPU" is essentially the same thing as the more familiar and general term "CPU" (CPU applies to any computer, and not just a microcomputer).

The CPU is responsible for initiating the transfer of data to and from memory and input/output devices, performing arithmetic and logical operations, and controlling the sequencing of all activities. Within the CPU are various subcomponents such as the arithmetic logic unit (ALU), the instruction decoder, internal registers and various control circuits which synchronise the timing of various signals on the buses.

CPU
Instruction decoder   Address registers
Arithmetic logic unit   Pointers
Registers   Flags
Instruction pointer

80X86 CPU development
1972 Intel introduces the 4004 with a 4-bit data bus, 10,000 transistors.
1974 8080 CPU has 8-bit data bus and 64 kB addressable memory (RAM).
1978 8086 with a 16-bit data bus and 1 MB addressable memory, 4 MHz clock.
1979 8088 with 8-bit external data bus, 16-bit internal bus.
1982 80286, 24-bit address bus, 16 MB addressable memory, 6 MHz clock.
1985 80386DX with 32-bit data bus, 10 MIPS, 33 MHz clock, 275 × 10³ transistors.
1989 80486DX 32-bit data bus, internal maths coprocessor, >1 × 10⁶ transistors, 30 MIPS, 100 MHz clock, 4 GB addressable memory.
1993 Pentium, 64-bit PCI data bus, 32-bit address bus, superscalar architecture allows more than one instruction to execute in a single clock cycle, hardwired floating point, >3 × 10⁶ transistors, 100 MIPS, >200 MHz clock, 4 GB addressable memory.
1995 Pentium Pro, 64-bit system bus, 5.5 × 10⁶ transistors, dynamic execution uses a speculative data flow analysis method to decide which instructions are ready for execution, 64 GB addressable memory.
1997 Pentium II, 7.5 × 10⁶ transistors with MMX technology for video applications, 64 GB addressable memory.
1999 Pentium III, 9.5 × 10⁶ transistors, 600 MHz to 1 GHz clock.
2000 Pentium 4, 42 × 10⁶ transistors, 1.5 GHz clock.
2001 Xeon, Celeron processors, 1.2 GHz, 55 × 10⁶ transistors.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780750657204501091

Configuring Windows Server Hyper-V and Virtual Machines

Aaron Tiensivu, in Securing Windows Server 2008, 2007

Understanding the Components of Hyper-V

Hyper-V has greater deployment capabilities than past versions and provides more options for your specific needs because it utilizes specific 64-bit hardware and a 64-bit operating system. Additional processing power and a larger addressable memory space are gained by the utilization of a 64-bit environment. Hyper-V has three main components: the hypervisor, the virtualization stack, and the new virtualized I/O model. The hypervisor, also known as the virtual machine monitor, is a very small layer of software that is present directly on the processor, which creates the different "partitions" that each virtualized instance of the operating system will run within. The virtualization stack and the I/O components act to provide a go-between with Windows and with all the partitions that you create. All three of these components of Hyper-V work together as a team to allow virtualization to occur. Hyper-V hooks into threads on the host processor, which the host operating system can then use to efficiently communicate with multiple virtual machines. Because of this, these virtual machines and multiple virtual operating systems can all be running on a single physical processor. You can see this model in Figure 8.1.

Figure 8.1. Viewing the Components of Hyper-V

Tools & Traps…

Understanding Hypercalls in Hyper-V

In order to better understand and distinguish the basis for Hyper-V virtualization, let's try to get a better idea of how hypercalls work. The hypervisor uses a calling system for a guest that is specific to Hyper-V. These calls are called hypercalls. A hypercall defines each set of input or output parameters between the host and guest. These parameters are referred to in terms of a memory-based data structure. All aspects of input and output data structures are padded to natural boundaries up to 8 bytes. Input and output data structures are placed in memory on an 8-byte boundary. These are then padded to a multiple of 8 bytes in size. The values within the padding areas are disregarded by the hypervisor.

There are two kinds of hypercalls, referred to as simple and repeat. A simple hypercall attempts a single act or operation. It contains a fixed-size set of input and output parameters. A repeat hypercall conducts a complicated series of simple hypercalls. Besides having the parameters of a simple hypercall, a repeat hypercall uses a list of fixed-size input and output elements.
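As a rough illustration of the padding rules described above, a hypothetical C layout is sketched below; it is not an actual Hyper-V structure definition, only an example of fields padded to an 8-byte multiple.

#include <stdint.h>

/* Hypothetical hypercall input block: fields are padded to natural
   boundaries up to 8 bytes, the whole structure starts on an 8-byte
   boundary, and its size is a multiple of 8 bytes. */
typedef struct {
    uint64_t partition_id;   /* naturally aligned to 8 bytes             */
    uint32_t flags;          /* 4-byte field                             */
    uint32_t padding;        /* explicit pad so the size is a multiple   */
                             /* of 8; padding is ignored by the hypervisor */
} ExampleHypercallInput;

/* sizeof(ExampleHypercallInput) == 16, an 8-byte multiple. */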

You can issue a hypercall only from the most privileged guest processor mode. For an x64 environment, this means protected mode with a Current Privilege Level (CPL) of zero. Hypercalls are never allowed in real mode. If you attempt to issue a hypercall within an illegal processor mode, you will receive an undefined operation exception.

All hypercalls should be issued via the architecturally defined hypercall interface. Hypercalls issued by other means, including copying the code from the hypercall code page to an alternate location and executing it from there, could result in an undefined operation exception. You should avoid doing this altogether because the hypervisor is not guaranteed to deliver this exception.

The hypervisor creates partitions that are used to isolate guests and host operating systems. A partition is comprised of a physical address space and one or more virtual processors. Hardware resources such as CPU cycles, memory, and devices can be assigned to the partition. A parent partition creates and manages child partitions. It contains a virtualization stack, which controls these child partitions. The parent partition is on most occasions also the root partition. It is the first partition that is created and owns all resources not owned by the hypervisor. As the root partition it will handle the loading and the booting of the hypervisor. It is also required to deal with power management, plug-and-play, and hardware failure events.

Partitions are named with a partition ID. This 64-bit number is delegated by the hypervisor. These ID numbers are guaranteed by the hypervisor to be unique IDs. These are not unique with respect to power cycles, however. The same ID may be generated across a power cycle or a reboot of the hypervisor. The hypervisor does guarantee that all IDs within a single power cycle will be unique.

The hypervisor also is designed to provide availability guarantees to guests. A group of servers that have been consolidated onto a single physical machine should not hinder each other from making progress, for example. A partition should be able to be run that provides telephony support such that this partition continues to perform all of its duties regardless of the potentially adverse actions of other partitions. The hypervisor takes many precautions to assure this occurs flawlessly.

For each partition, the hypervisor maintains a memory pool of RAM SPA pages. This pool acts just like a checking account. The number of pages in the pool is called the balance. Pages are deposited into or withdrawn from the memory pool. When a hypercall that requires memory is made by a partition, the hypervisor withdraws the required memory from the total pool balance. If the balance is insufficient, the call fails. If such a withdrawal is made by a guest for another guest in another partition, the hypervisor attempts to draw the requested amount of memory from the pool of the latter partition.

Pages within a partition's memory pool are managed by the hypervisor. These pages cannot be accessed through any partition's guest physical address (GPA) space. That is, in all partitions' GPA spaces, they must be inaccessible (mapped such that no read, write or execute access is allowed). In general, the only partition that can deposit into or withdraw from a partition's pool is that partition's parent.

Warning

Remember not to confuse partitions with virtual machines. You should think of a virtual machine as comprising a partition together with its state. Many times partitioning can be mistaken for the act of virtualization when dealing with Hyper-V.

We should note that Microsoft will continue to support Linux operating systems with the production release of Hyper-V. Integration components and technical support will be provided for customers running certain Linux distributions as guest operating systems within Hyper-V. Integration components for Beta Linux are now available for Novell SUSE Linux Enterprise Server (SLES) 10 SP1 x86 and x64 Editions. These components enable Xen-enabled Linux to take advantage of the VSP/VSC architecture. This will help to provide improved performance overall. Beta Linux Integration components are available for immediate download through http://connect.microsoft.com. Another noteworthy feature is that, as of this writing, Red Hat Fedora 8 Linux and the alpha version of Fedora 9 Linux are both compatible with and supported by Hyper-V. The full list of supported operating systems will be announced prior to RTM.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9781597492805000080

GPU Application Development, Debugging, and Performance Tuning with GPU Ocelot

Andrew Kerr, ... Sudhakar Yalamanchili, in GPU Computing Gems Jade Edition, 2012

30.2.1 PTX Functional Simulation

Ocelot's PTX emulator models a virtual architecture illustrated in Figure 30.2. This backend implements the NVIDIA PTX execution model and emulates the execution of a kernel on a GPU by interpreting instructions for each active thread of a warp before fetching the next instruction. This corresponds to execution by an arbitrarily wide single-instruction multiple-data (SIMD) processor and is similar to how hardware implementations such as NVIDIA's GeForce 400 series GPUs execute CUDA kernels. Blocks of memory store values for the virtual register file as well as the addressable memory spaces. The emulator interprets each instruction according to opcode, data type, and modifiers such as rounding or clamping modes, updating the architectural state of the processor with computed results.

Figure 30.2. PTX emulator virtual architecture.

Kernels executed on the PTX emulator present the entire observable state of a virtual GPU to user-extensible instruction trace generators. These are objects implementing an interface that receives the complete internal representation of a PTX kernel at the time it is launched for initial analysis. Then, as the kernel is executed, a trace event object is dispatched to the collection of active trace generators after each instruction completes. This trace event object includes the instruction's internal representation and PC, the set of memory addresses referenced, and the thread ID. At this point, the instruction trace generator has the opportunity to inspect the register file and memory spaces accessible by the GPU such as shared and local memory. Practically any observable behavior may be measured using this approach. In the next section, we will discuss Ocelot's interfaces for user-extended trace generators that compute custom metrics.
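To make the flow concrete, here is a rough C sketch of the pattern described above. The type and member names are illustrative assumptions, not Ocelot's actual interface, which is covered in the next section.

#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch only: a trace event carries the information
   described in the text (instruction, PC, addresses, thread ID). */
typedef struct {
    uint32_t pc;                 /* program counter of the instruction  */
    uint32_t thread_id;          /* active thread that executed it      */
    const void *instruction;     /* internal representation (opaque)    */
    const uint64_t *addresses;   /* memory addresses referenced         */
    size_t num_addresses;
} TraceEvent;

/* A trace generator is a pair of callbacks: one invoked when the kernel
   is launched, one dispatched after each instruction completes. */
typedef struct {
    void (*initialize)(const void *kernel_ir);   /* inspect kernel IR      */
    void (*event)(const TraceEvent *e);          /* per-instruction hook   */
} TraceGenerator;

/* A custom metric, e.g., counting dynamic instructions per thread,
   would be accumulated inside the event() callback. */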

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123859631000307

Trinity workloads

Jim Jeffers, ... Avinash Sodani, in Intel Xeon Phi Processor High Performance Programming (Second Edition), 2016

Running Trinity on Knights Landing: Quadrant-Flat vs Quadrant-Cache

Let us now consider flat memory mode with quadrant cluster mode (quadrant-flat for short), and how that compares to quadrant-cache. To change to this mode, we have to reboot with modified BIOS options. The cluster mode remains the same, but now MCDRAM is treated as addressable memory and can be accessed via a separate NUMA node. If we run the command numactl --hardware, two NUMA nodes will be shown (see Fig. 25.8). The CPUs and DDR memory are in node 0, and MCDRAM is in node 1. If we run the workload without using numactl, for example, mpirun -np 68 ./myKNLwkld, the workload uses memory from the nearest NUMA node first (node 0 in this case, i.e., DDR memory). Defining the NUMA nodes such that the NUMA distance between the MCDRAM and the CPUs is greater than the distance between the DDR and the CPUs is done so that we have to explicitly specify when MCDRAM is used. There are a few different ways to use MCDRAM in flat mode.

Fig. 25.8. Flat mode showing two separate NUMA nodes when executing numactl --hardware.

Use numactl --membind 1 to bind the workload to NUMA node 1

For example: numactl --membind 1 ./myScriptToRunOnKNL.sh

By binding the workload, the workload is forced to use only the memory designated by that NUMA node (in this example, MCDRAM). If you need more than the MCDRAM memory size, the workload will crash. For example, GTC requires 32 GB to run on a single node. Thus, GTC cannot be run just out of MCDRAM.

Use numactl --preferred 1 to prefer NUMA node 1

Now the workload will attempt to use MCDRAM first but will not crash if the memory required is greater than the MCDRAM size; this provides a safety net in case the memory footprint becomes slightly larger than MCDRAM capacity.

memkind library (hbwmalloc)—this library enables users to specify which arrays or other blocks of memory should be allocated in MCDRAM; a brief sketch follows below.
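A minimal sketch of the hbwmalloc interface from the memkind library (assuming the library and its hbwmalloc.h header are installed, and linking with -lmemkind; error handling kept to a minimum):

#include <hbwmalloc.h>   /* hbwmalloc interface from the memkind library */
#include <stdio.h>

int main(void) {
    size_t n = 1000000;

    /* Allocate the bandwidth-critical array in MCDRAM. By default
       hbw_malloc prefers high-bandwidth memory but may fall back to DDR;
       the policy can be changed with hbw_set_policy(). */
    double *a = hbw_malloc(n * sizeof(double));
    if (a == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    for (size_t i = 0; i < n; i++)
        a[i] = (double)i;

    printf("a[42] = %f\n", a[42]);
    hbw_free(a);
    return 0;
}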

Tools can help with flat mode:

Intel® VTune™ Amplifier Memory Object Analysis—determines dynamic and static memory objects and their performance impact to identify candidate data structures for MCDRAM allocation.

autohbw—interposer library that comes with memkind. Finds memory allocation calls and automatically allocates memory in MCDRAM if the allocation is greater than a given threshold.

Figs. 25.9 and 25.10 show performance of the same five workloads previously plotted in Fig. 25.4; dashed lines indicate quadrant-cache results, and solid lines indicate quadrant-flat results using numactl --preferred 1.

Fig. 25.9. MiniFE and MiniGhost performance in quadrant-flat and quadrant-cache.

Fig. 25.10. AMG, UMT, and SNAP performance in quadrant-flat vs quadrant-cache.

MiniGhost performance is very interesting; at lower problem sizes (< 16 GB) flat mode is better, but at larger problem sizes (> 16 GB), cache mode is better. This performance drop is due to limitations of numactl --preferred 1; using other methods to allocate specific memory blocks into MCDRAM, such as the memkind library, may improve performance. MiniGhost bandwidth measurements show that the larger problem sizes in quadrant-flat make minimal use of MCDRAM when run with numactl --preferred 1.

Memory bandwidth was measured to show how bandwidth changes for MiniGhost and MiniFE at different problem sizes. At the 16 GB problem size, MiniGhost in quadrant-cache has MCDRAM bandwidth of 312 GB/s and DDR bandwidth of 15 GB/s. At the 64 GB problem size, MiniGhost in quadrant-cache has MCDRAM bandwidth of 214 GB/s and DDR bandwidth of 65 GB/s. In contrast, MiniGhost in quadrant-flat at the 64 GB problem size has MCDRAM bandwidth of 0 GB/s and DDR bandwidth of 88 GB/s. Although we called MiniGhost cache unfriendly due to the "sweet spot" behavior, the workload is still benefiting from MCDRAM cache at large problem sizes in quadrant-cache. MiniFE at large problem sizes does not benefit from MCDRAM as cache. At the 16 GB problem size, MiniFE has 267 GB/s MCDRAM bandwidth and 31 GB/s DDR bandwidth. However, at the 57 GB problem size, MiniFE in quadrant-cache has 16 GB/s MCDRAM bandwidth and 78 GB/s DDR bandwidth. MiniFE performance at large problem sizes is limited by DDR memory bandwidth.

UMT and AMG are very cache-friendly workloads at small and large problem sizes. However, performance drops at larger problem sizes when using flat mode, as the workload's allocated data no longer fits in MCDRAM. SNAP performs better with flat mode at larger problem sizes. SNAP suffers from MCDRAM cache aliasing at large problem sizes, similar to MiniGhost at small problem sizes. However, in flat mode there is no aliasing, enabling SNAP to perform better at large problem sizes and MiniGhost to perform better at small problem sizes compared to cache mode.

Consider using flat mode under the following conditions:

Problem size does not fit in the 16-GB MCDRAM and the workload is latency bound instead of being bandwidth limited: DDR latency in flat mode will be lower than MCDRAM latency in cache mode if there are excessive MCDRAM cache misses (see Chapters 4 and 6).

Workload uses full cache-line streaming stores or partial cache-line (e.g., SSE/AVX ISA) streaming stores.

Streaming stores bypass core CPU caches and update memory directly, but MCDRAM is a memory-side cache (whenever it is operated as a cache) and cannot be bypassed. MCDRAM cache incurs additional overhead to handle streaming stores: it requires one extra memory read to determine whether a line is already present in MCDRAM (a partial cache-line write may also require an extra memory write to fill the line from memory in case of a miss). Hence, if a workload is likely to have a large number of streaming stores, it may be better to configure MCDRAM as flat, since MCDRAM as flat memory does not have such overheads because it is directly addressable memory, the same as DDR memory.

Workload is significantly affected by cache aliasing in cache mode (examples: SNAP and MiniGhost). MCDRAM is a direct-mapped cache, which means memory accesses that map to the same MCDRAM cache line will conflict, causing evictions of data being used and increased conflict misses, which will affect performance (see Fig. 25.11).

With a direct-mapped cache, capacity misses will occur when the working set size is greater than MCDRAM capacity (as in the SNAP case). However, conflict misses can also occur with small problem sizes (less than MCDRAM capacity) depending on where data is allocated in physical memory (as we observed in the MiniGhost case).

Fig. 25.11. Cache line conflicts occur on Knights Landing with a 16 GB MCDRAM cache if bits 6 to 33 of the physical addresses match. Increased MCDRAM cache misses due to conflicts will impact performance.

A performance comparison between cache mode and flat mode with respect to thread scaling is shown in Fig. 25.12. AMG and UMT are two of our cache-friendly workloads; as such, these workloads have better scaling in cache mode than in flat mode. SNAP and GTC in contrast have better scaling in flat mode.

Fig. 25.12. Multiple threads per core scaling in quadrant-flat vs quadrant-cache; the Y-axis is speedup over the corresponding mode's single-thread performance.

Fig. 25.13 includes the best performance of all eight Trinity workloads, where the best performance was selected between quadrant-cache and quadrant-flat, varying problem size, and varying hardware threads per core. Some workloads perform better in quadrant-cache and others in quadrant-flat.

Fig. 25.13. Best performance of all 8 Trinity workloads when considering quadrant-cache and quadrant-flat.

Hybrid mode

Knights Landing's hybrid mode combines cache mode with flat mode (see Fig. 25.14). This enables advanced optimizations where users can specify which data should be allocated in MCDRAM in flat mode via the memkind library (hbwmalloc calls, see Chapter 3), and it also enables a smaller memory-side cache for the remaining data. The baseline versions of the Trinity workloads are not optimized for this mode as no source code changes were made. However, keep in mind that this is an option to consider when optimizing workloads for Knights Landing.

Fig. 25.14. Hybrid mode: part of MCDRAM is addressable memory and part is memory-side cache.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128091944000259

MEMORY PROTECTION UNITS

ANDREW N. SLOSS, ... CHRIS WRIGHT, in ARM System Developer's Guide, 2004

13.3.2 ASSIGNING REGIONS USING A MEMORY MAP

The last column of Table 13.10 shows the four regions we assigned to the memory areas. The regions are defined using the starting address listed in the table and the size of the code and data blocks. A memory map showing the region layout is provided in Figure 13.7.

Figure 13.7. Region assignment and memory map of the demonstration protection system.

Region 1 is a background region that covers the entire addressable memory space. It is a privileged region (i.e., no user mode access is permitted). The instruction cache is enabled, and the data cache operates with a writethrough policy. This region has the lowest region priority because it is the region with the lowest assigned number.

The main function of region 1 is to restrict access to the 64 KB space between 0x0 and 0x10000, the protected system area. Region 1 has two secondary functions: it acts as a background region and as a protection region for dormant user tasks. As a background region it ensures the entire memory space by default is assigned system-level access; this is done to prevent a user task from accessing spare or unused memory locations. As a user task protection region, it protects dormant tasks from misconduct by the running task (see Figure 13.7).

Region 2 controls access to shared system resources. It has a starting address of 0x10000 and is 64 KB in length. It maps directly over the shared memory space of the shared system code. Region 2 lies on top of a portion of protected region 1 and will take precedence over protected region 1 because it has a higher region number. Region 2 permits both user and system level memory access.

Region 3 controls the memory area and attributes of a running task. When control transfers from one task to another, as during a context switch, the operating system redefines region 3 so that it overlays the memory area of the running task. When region 3 is relocated over the new task, it exposes the previous task to the attributes of region 1. The previous task becomes part of region 1, and the running task is a new region 3. The running task cannot access the previous task because it is protected by the attributes of region 1.

Region 4 is the memory-mapped peripheral system space. The primary purpose of this region is to establish the area as not cached and not buffered. We don't want input, output, or control registers subject to the stale data problems caused by caching, or the time or sequence issues involved when using buffered writes (see Chapter 12 for details on using I/O devices with caches and write buffers).
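A conceptual C sketch of the region assignment above; the structure, and the base/size values for regions 3 and 4, are illustrative placeholders, not the hardware register format (an actual implementation programs the protection unit's CP15 registers as described elsewhere in the chapter).

#include <stdint.h>
#include <stdbool.h>

/* Illustrative region descriptor only; not the hardware encoding. */
typedef struct {
    uint32_t base;          /* starting address                       */
    uint32_t size;          /* region size in bytes                   */
    bool     user_access;   /* user mode access permitted?            */
    bool     cached;        /* caching enabled?                       */
    bool     buffered;      /* write buffering enabled?               */
} Region;

/* Regions of the demonstration protection system; higher-numbered
   regions take precedence where they overlap lower-numbered ones. */
static const Region regions[] = {
    /* 1: background, privileged, covers all addressable memory        */
    { 0x00000000u, 0xFFFFFFFFu, false, true,  true  },
    /* 2: shared system resources, user + system access (from text)    */
    { 0x00010000u, 0x10000u,    true,  true,  true  },
    /* 3: running task; base/size placeholders, rewritten by the OS
          on each context switch                                       */
    { 0x00020000u, 0x10000u,    true,  true,  true  },
    /* 4: memory-mapped peripherals, not cached, not buffered
          (base/size placeholders)                                     */
    { 0x10000000u, 0x10000u,    false, false, false },
};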

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9781558608740500149

Busses, Interrupts and PC Systems

William Buchanan BSc (Hons), CEng, PhD, in Computer Busses, 2000

2.1.3 Address bus

The address bus is responsible for identifying the location into which the data is to be passed. Each location in memory typically contains a single byte (8 bits), but could also be arranged as words (16 bits), or long words (32 bits). Byte-oriented memory is the most flexible as it also enables access to any multiple of 8 bits. The size of the address bus thus indicates the maximum addressable number of bytes. Table 2.1 shows the size of addressable memory for a given address bus size. The number of addressable bytes is given by:

Addressable locations = 2ⁿ bytes

Addressable locations for a given address bus size

where n is the number of bits in the address bus. For example:

A 1-bit address bus can address up to two locations (that is 0 and 1).

A 2-bit address bus can address 2² or 4 locations (that is 00, 01, 10 and 11).

A 20-bit address bus can address up to 2²⁰ addresses (1 MB).

A 32-bit address bus can address up to 2³² addresses (4 GB).

The units used for computers for defining memory are B (Bytes), kB (kiloBytes), MB (megaBytes) and GB (gigaBytes). These are defined as:

kB (kiloByte). This is defined as 2¹⁰ bytes, which is 1024 B.

MB (megaByte). This is defined as 2²⁰ bytes, which is 1024 kB, or 1 048 576 bytes.

GB (gigaByte). This is defined as 2³⁰ bytes, which is 1024 MB, or 1 048 576 kB, or 1 073 741 824 B.
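These unit definitions are straightforward to express as powers of two; a short C sketch (added here for illustration, not from the original text):

#include <stdio.h>

int main(void) {
    /* Memory units defined as powers of two. */
    unsigned long long kB = 1ULL << 10;   /* 1 024 bytes          */
    unsigned long long MB = 1ULL << 20;   /* 1 048 576 bytes      */
    unsigned long long GB = 1ULL << 30;   /* 1 073 741 824 bytes  */

    /* A 20-bit address bus addresses 2^20 bytes = 1 MB,
       a 32-bit address bus addresses 2^32 bytes = 4 GB. */
    printf("kB = %llu, MB = %llu, GB = %llu bytes\n", kB, MB, GB);
    printf("2^20 = %llu bytes (%llu MB)\n", 1ULL << 20, (1ULL << 20) / MB);
    printf("2^32 = %llu bytes (%llu GB)\n", 1ULL << 32, (1ULL << 32) / GB);
    return 0;
}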

Table 2.1 gives the addressable space for given address bus sizes.

Table 2.1. Addressable memory (in bytes) related to address bus size

Address bus size Addressable memory (bytes) Address bus size Addressable memory (bytes)
1 2 15 32 K
2 4 16 64 K
3 8 17 128 K
4 16 18 256 K
5 32 19 512 K
6 64 20 1 M
7 128 21 2 M
8 256 22 4 M
9 512 23 8 M
10 1 K* 24 16 M
11 2 K 25 32 M
12 4 K 26 64 M
13 8 K 32 4 G
14 16 K 64 16 GG
*
1 K represents 1024
1 M represents 1 048 576 (1024 K)
1 G represents 1 073 741 824 (1024 M)

Data handshaking

Handshaking lines are also required to allow the orderly flow of data. This is illustrated in Figure 2.4. Normally there are several different types of busses which connect to the system; these different busses are interfaced to with a bridge, which provides for the conversion between one type of bus and another. Sometimes devices connect directly onto the processor's bus; this is called a local bus, and is used to provide a fast interface with direct access without any conversions.

Figure 2.4. Computer bus connections

The most basic type of handshaking has two lines:

Sending identification line – this identifies that a device is ready to send data.

Receiving identification line – this identifies that a device is ready to receive data, or not.

Figure 2.5 shows a simple form of handshaking of data, from Device 1 to Device 2. The sending status is identified by READY? and the receiving status by STATUS. Normally an event is identified by a signal line moving from one state to another; this is described as edge-triggered (rather than level-triggered, where the actual level of the signal identifies its state). In the case in Figure 2.5, initially Device 1 puts data on the data bus, and identifies that it is ready to send data by changing the READY? line from a LOW to a HIGH level. Device 2 then identifies that it is reading the data by changing its STATUS line from a LOW to a HIGH. Next it identifies that it has read the data by changing the STATUS line from a HIGH to a LOW. Device 1 can then put new data on the data bus and start the cycle again by changing the READY? line from a LOW to a HIGH.

Figure 2.5. Simple handshaking of data
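A small C sketch modelling the sequence just described (a conceptual simulation only; the signal names follow Figure 2.5, and the functions are hypothetical):

#include <stdio.h>
#include <stdbool.h>

/* Conceptual model of the two-line handshake from Figure 2.5. */
static bool ready  = false;   /* READY?: driven by Device 1 (sender)   */
static bool status = false;   /* STATUS: driven by Device 2 (receiver) */
static unsigned data_bus;

static void device1_send(unsigned value) {
    data_bus = value;   /* put data on the data bus          */
    ready = true;       /* READY? rises: data is valid       */
}

static void device2_receive(void) {
    if (ready) {
        status = true;                        /* STATUS rises: reading data  */
        printf("received 0x%X\n", data_bus);
        status = false;                       /* STATUS falls: data consumed */
    }
}

int main(void) {
    device1_send(0x5A);
    device2_receive();
    ready = false;      /* sender may now start the next cycle */
    return 0;
}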

This type of communication only allows communication in one direction (from Device 1 to Device 2) and is known as simplex communication. The main types of communication are:

Simplex communication. Only one device can communicate with the other, and thus it only requires handshaking lines for one direction.

Half-duplex communication. This allows communications from one device to the other, in either direction (but not at the same time), and thus requires handshaking lines for either direction.

Full-duplex communication. This allows communication from one device to another, in either direction, at the same time. A good example of this is a telephone system, where a caller can send and receive at the same time. This requires separate transmit and receive data lines, and separate handshaking lines for either direction.

Control lines

Control lines define the operation of the data transaction, such as:

Data flow direction – this identifies that data is either being read from a device or written to a device.

Memory addressing type – this is typically either identifying that the address access is direct memory accessing or indirect memory accessing. This identifies that the address on the bus is either a real memory location or an address tag.

Device arbitration – this identifies which device has control of the bus, and is typically used when there are many devices connected to a common bus, and any of the devices are allowed to communicate with any other of the devices on the bus.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B978034074076750002X

Mass Storage

Lucio Di Jasio, in Programming 16-Bit PIC Microcontrollers in C (Second Edition), 2012

Using the SD/MMC Interface Module

Whether you believe it or not, the six small routines we have just developed are all we need to gain access to the seemingly unlimited amount of storage space offered by SD/MMC memory cards. For example, a 512 Mbyte card would provide us with approximately 1,000,000 (yes, that is one million) individually addressable memory blocks (sectors), each 512 bytes large. Note that, as of this writing, SD/MMC cards of this capacity are commonly offered for retail in the USA for less than $5!

Let's develop a small test program to demonstrate the use of the SD/MMC module. The idea is to simulate a somewhat typical application that is required to save some large amount of data on the SD/MMC memory card. A fixed number of blocks of data will be written in a predetermined range of addresses and then read back to verify the successful completion of the process.

Let'south create a new source file that nosotros will call SDMMCTest.c and start by adding the usual config.h file followed by the SDMMC.h include file.

/*

** SDMMCTest.c

**

** Read/write Test

*/

#include <config.h>

#include <EX16.h>

#include <SDMMC.h>

Then let's define two byte arrays, each the size of a default SD/MMC memory block, that is, 512 bytes.

#define B_SIZE   512   // sector/data block size

unsigned char data   [B_SIZE];

unsigned char buffer [B_SIZE];

The test program will fill the first with a specific and easy to recognize pattern, and will repeatedly write its contents onto the memory card. The chosen address range will be defined by two constants:

#define START_ADDRESS   10000   // start block address

#define N_BLOCKS        1000    // number of blocks

The LEDs on PORTA of the Explorer16 demonstration board will provide us with visual feedback about the correct execution of the program and/or any error encountered.

The first few lines of the main program can now be written to initialize the I/Os required by the SD/MMC module and the PORTA pins connected to the row of LEDs.

main(void)

{

LBA addr;

int i, r;

// I/O initializations

TRISA = 0;// initialize PORTA LED outputs

InitSD();  // initialize I/Os required for the SD/MMC card

// fill up the buffer with "data"

for(i=0; i<B_SIZE; i++)

  data[i] = i;

The next code segment will have to check for the presence of the SD card in the slot/connector. We will wait in a loop for the card detection switch if necessary, and we will provide an additional delay for the contacts to properly debounce.

// wait for card to be inserted

while(!DetectSD())// assumes SDCD pin is by default an input

  Delayms(100);// wait for card contacts de-bounce and power up

We will be generous with the debouncing delay as we want to make sure that the card connection is stable before we start firing write commands that could otherwise potentially corrupt other data present on the card. A 100 ms delay is a reasonable delay to use, and the Delayms() function is taken from the EX16.h library module defined in earlier chapters.

Keeping the de-bouncing delay function separate from the DetectSD() function and the SD/MMC module in general is important, as this will let different applications pick and choose the best timing strategy and optimize the resource allocation.

Once we are sure that the card is present, we can proceed with its initialization, calling the InitMedia() function.

// initialize the memory card (returns 0 if successful)

r = InitMedia();

if (r)    // could not initialize the card

{

  PORTA = r;  // show error code on LEDs

  while(1);  // halt here

}

The function returns an integer value, which is zero for a successful completion of the initialization sequence, or a specific error code otherwise. In our test program, in the case of an initialization error we will simply publish the error code on the LEDs and halt the execution, entering an infinite loop. The codes 0x84 and 0x85 will indicate that InitMedia() steps 4 or 5, respectively, have failed, corresponding to an incorrect execution of the card RESET command or the card INIT command (failure or timeout).

If all goes well, we will be able to proceed with the actual data writing phase.

else

{

  // fill N_BLOCKS blocks/SECTORs with the contents of the data buffer

  addr = START_ADDRESS;

  for(i=0; i<N_BLOCKS; i++)

  if (!WriteSECTOR(addr+i, data))

  {// writing failed

    PORTA = 0x0f;

    while(1);// halt here

  }

The simple for loop repeatedly performs the WriteSECTOR() function over the address range from block 10,000 to block 10,999, copying the same data block over and over and verifying at each step that the write command is performed successfully. If any of the block write commands returns an error, a unique code (0x0f) will be presented on the LEDs and the execution will be halted. In practice this will be equivalent to writing a file of 512,000 bytes.

  // verify the contents of each block/SECTOR written

  addr = START_ADDRESS;

  for(i=0; i<N_BLOCKS; i++)

  {// read back one block at a time

    if (!ReadSECTOR(addr+i, buffer))

    {// reading failed

    PORTA = 0xf0;

    while(1);// halt here

  }

  // verify each block content

    if (memcmp(data, buffer, B_SIZE))

    {// mismatch

    PORTA = 0x55;

    while(1); // halt here

    }

  } // for each block

Next we will start a new loop to read back each data block into the second buffer, and we will compare its contents with the original pattern still available in the first buffer. If the ReadSECTOR() function should fail, we will present an error code (0xf0) on the LED display and stop the test. Otherwise, the standard C library function memcmp() will help us perform a fast comparison of the buffer contents, returning an integer value that is zero if the two buffers are identical, as we hope, and nonzero otherwise. Once more, a new unique error indication (0x55) will be provided if the comparison should fail. To gain access to the memcmp() function, we need to include the standard C string.h library header.

We can now complete the main program with a final indication of successful execution, lighting up all LEDs on PORTA.

} // else media initialized

// indicate successful execution

PORTA = 0xFF;

// main loop

while(1);

} // main

If you have added all the required source files SDMMC.h, EX16.h, EX16.c, SDMMC.c and SDMMCTest.c to the project, you can now launch the project by using the Run>Run Project command. You will need a daughter board with the SD/MMC connections as described at the beginning of the lesson to actually perform the test. But the effort of building one (or the expense of purchasing one) will be more than compensated for by the joy of seeing the PIC24 perform the test flawlessly in a fraction of a second. The amount of code required was also impressively small (Figure 13.7).

Figure 13.7. MPLAB® C compiler memory usage report

Altogether, the test program and the SDMMC access module have used up just 1,374 bytes of the processor FLASH program memory, that is, less than 1% of the total memory available, and 1,104 bytes of RAM (2×512 buffers + stack), which is less than 15% of the total RAM memory available. As in all previous lessons, this result was obtained with the compiler optimization options set to level 1, available in the free evaluation version of the compiler.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9781856178709000130