CMP EMBEDDED.COM


Understanding Crypto Performance in Embedded Systems: Part 2
Standards & Industry Practices for Measuring Cryptographic Performance



Embedded.com
Part 1 of this series discussed hardware and software variables impacting system-level cryptographic performance. Now in Part 2 we will focus on two methodologies for measuring the performance of a high-level look-aside accelerator: 1) driver level accelerator testing to identify accelerator or SoC memory bandwidth constraints, and 2) application/protocol stack level testing which includes a full packet ingress to egress path.

Accelerator Measurement Methodology
The most common method of measuring a look-aside accelerator's performance is a driver level benchmark. In this test, a set of test data is loaded into the main memory of the SoC hosting the accelerator.

Software running on the SoC's CPU creates descriptors or otherwise causes the accelerator to perform cryptographic operations on the test data. Using a timer, the driver level test calculates the accelerator's throughput by dividing the number of bytes processed by the time taken.
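As a minimal sketch of that calculation, the function below converts a byte count and an elapsed time into megabits per second. The function name and the sample figures are illustrative assumptions, not taken from any vendor's benchmark code.

```python
def throughput_mbps(bytes_processed: int, elapsed_seconds: float) -> float:
    """Return throughput in megabits per second (1 Mbps = 10**6 bits/s)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return bytes_processed * 8 / elapsed_seconds / 1e6


# Hypothetical run: 50,000 operations on 64-byte chunks finishing in 0.5 s.
print(throughput_mbps(50_000 * 64, 0.5))
```

Reporting in bits per second rather than bytes per second is a choice made here because networking throughput is usually quoted that way; either unit works as long as compared results use the same one.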

While this seems simple enough, many of the variables described in Part 1 of this article can still influence the results. When evaluating (or creating) a driver level benchmark, the evaluator needs to consider the following:

Data size. Is the test encrypting a small or large chunk of data? The smaller the data size, the more the results will be influenced by the accelerator's memory latency, the descriptor size and other "non-data" context it must fetch to perform the operation, and the timer accuracy.

Understanding memory latency and DMA overheads to read context and data is important, but timer resolution can be the dominant variable when measuring the performance of a single operation on a small chunk of data.

Iteration. A good way to include accelerator DMA overheads and memory latency in a small data size measurement while reducing timer resolution as a variable is to construct the test so that the timer starts before iteration 1 and stops after the nth iteration, where n is a fairly large number.
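The effect of iterating can be illustrated with a rough bound. The sketch below assumes the timer can misreport the measured interval by up to one tick, and that every operation takes the same time; the tick and per-operation times are made-up values chosen only to show the trend.

```python
def timer_error_fraction(tick_s: float, op_time_s: float, iterations: int) -> float:
    """Worst-case share of the measured interval attributable to one timer tick."""
    if op_time_s <= 0 or iterations < 1:
        raise ValueError("need a positive operation time and at least one iteration")
    return tick_s / (op_time_s * iterations)


# Hypothetical values: 1 microsecond timer tick, 2 microsecond operation.
for n in (1, 100, 50_000):
    print(n, timer_error_fraction(1e-6, 2e-6, n))
```

With a single iteration, the tick in this example is half of the interval being measured; with many iterations it becomes a negligible fraction, which is the reason for timing the whole loop rather than each operation.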

Freescale typically measures driver level performance using tests with 50,000 iterations. Iterations can introduce additional variables such as caching, pre-work, interrupts and checking.

1. Caching. Does the test repetitively encrypt the same data n times, using the same keys and context? Or does it encrypt n unique chunks of data with n unique keys? Assuming the accelerator operates within the SoC's cache coherency scheme (have fun if it doesn't!), repetitively encrypting the same data can lead to the data being read from on-chip cache memory rather than main memory. Freescale's driver benchmarks repetitively use the same keys on 50K unique chunks of data, mimicking the behavior of the accelerator in a single tunnel packet processing scenario.

2. Pre-work and interrupts. If a high-level accelerator iteratively encrypts 50K unique chunks of data, that implies software builds and launches 50K unique descriptors. Two major variables, particularly at small packet sizes, are: When does software build the descriptors, and when does it dispatch them to the accelerator?

The optimized benchmarking scenario is for software to build all these descriptors at the same time the test data is created, followed by software dispatching all the descriptors to the accelerator as soon as the timer is started.

In this scenario, software doesn't monitor the completion of individual descriptors via polling or interrupts. It waits for an interrupt from the final descriptor and stops the timer in the interrupt service routine.

A second approach, "build and dispatch descriptor n+1 after descriptor n completes," is more representative of real-world packet processing scenarios. The "dispatch all at once" approach provides small data size performance that is approximately four times better than the "build and dispatch descriptor n+1 after descriptor n completes" approach.

Freescale's driver performance tests, which are provided along with the reference driver, demonstrate the second approach ("build and dispatch descriptor n+1 after descriptor n completes"), which is more representative of integrating the driver within a packet processing application.

3. Checking. The driver level test might check the output of each descriptor, or it may assume the output is always good. If the test checks the outputs, it may do those checks while the timer is running, or it may wait until the final job completes, then go back and check for expected results. The more checking done within the driver test, the less the results reflect the accelerator's raw performance.
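The structure of such a test can be sketched in a few lines. The example below is only a software model: it uses Python's standard `hmac` module as a stand-in for the accelerator, so the number it measures is CPU performance, not accelerator performance. The function name `run_benchmark` and its parameters are hypothetical. What it illustrates is the shape of the test: one key, n unique chunks, all test data prepared before the timer starts, a timer that brackets the whole loop, and output checking deferred until after the timer stops.

```python
import hashlib
import hmac
import os
import time


def run_benchmark(n: int, size: int, check: bool = True) -> dict:
    """Time n HMAC-SHA-1 operations on n unique chunks that share one key."""
    key = os.urandom(20)                              # one key, as in a single tunnel
    chunks = [os.urandom(size) for _ in range(n)]     # n unique chunks, built up front
    outputs = [b""] * n

    start = time.perf_counter()                       # timer starts before iteration 1
    for i, chunk in enumerate(chunks):
        outputs[i] = hmac.digest(key, chunk, "sha1")  # stand-in for the accelerator
    elapsed = time.perf_counter() - start             # timer stops after iteration n

    # Checking runs only after the timer has stopped, so it does not
    # affect the reported time.
    failures = 0
    if check:
        for chunk, out in zip(chunks, outputs):
            reference = hmac.new(key, chunk, hashlib.sha1).digest()
            if not hmac.compare_digest(out, reference):
                failures += 1

    return {"bytes": n * size, "seconds": elapsed, "failures": failures}


result = run_benchmark(n=1000, size=64)
print(result["bytes"], result["failures"])
```

Moving the checking loop inside the timed region, or generating each chunk inside the loop, would change what the benchmark measures, which is why two tests that look similar can report quite different numbers.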

Algorithm or ciphersuite. Data size and method of iteration are the dominant variables, but the more the driver benchmark focuses on true hardware performance, the more dominant algorithms or ciphersuites become in measured performance.

Some algorithms, and even modes of algorithms, have higher raw performance than others. Single-pass decryption + message integrity checking may be faster than single-pass encryption + message integrity generation.

Figure 1 below provides a comparison of ciphersuites using the "build and dispatch descriptor n+1 after descriptor n completes" approach, along with a single AES-HMAC-SHA-1 benchmark using the "dispatch all at once" approach.

Figure 1: Driver benchmarks

Figure 1 shows that at small packet sizes, all the "build and dispatch descriptor n+1 after descriptor n completes" results are nearly identical, because the software overheads of descriptor building, dispatching and monitoring descriptor completions overwhelm algorithm-specific performance differences of the accelerator.

Only at larger data sizes does the hardware algorithm performance difference become an observable variable. The single AES-HMAC-SHA-1 test in which all the descriptors are built in advance, and dispatched at the maximum rate the accelerator can accept them, shows approximately four times greater performance at small data sizes. However, by 1KB data size, the raw performance of the hardware is the dominant variable.
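A simple timing model shows why the gap narrows as data size grows. The sketch below assumes each operation costs a fixed hardware overhead plus a transfer time proportional to its size, and that the serialized approach adds a fixed software cost per operation while the "dispatch all at once" approach adds none. All parameter values are hypothetical and are not fitted to Figure 1; only the trend is meaningful.

```python
def modeled_throughput_mbps(size_bytes: int, raw_mbps: float,
                            hw_fixed_us: float, sw_per_op_us: float) -> float:
    """Throughput when each operation pays fixed costs plus size-proportional time."""
    bits = size_bytes * 8
    transfer_us = bits / raw_mbps          # 1 Mbps is 1 bit per microsecond
    return bits / (transfer_us + hw_fixed_us + sw_per_op_us)


def dispatch_ratio(size_bytes: int, raw_mbps: float = 1000.0,
                   hw_fixed_us: float = 0.5, sw_per_op_us: float = 3.0) -> float:
    """Modeled 'dispatch all at once' throughput divided by serialized throughput."""
    all_at_once = modeled_throughput_mbps(size_bytes, raw_mbps, hw_fixed_us, 0.0)
    serialized = modeled_throughput_mbps(size_bytes, raw_mbps, hw_fixed_us, sw_per_op_us)
    return all_at_once / serialized


for size in (64, 256, 1024, 4096):
    print(size, round(dispatch_ratio(size), 2))
```

In this model the ratio is large when the fixed software cost outweighs the transfer time and approaches 1 as the size-proportional term takes over, matching the qualitative behavior described above.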

It is important for users of vendor-supplied benchmarking code to understand what exactly the benchmarking code does before comparing the results to another vendor's benchmarking code.

Different vendors may have different philosophies with regard to showing nearly raw accelerator performance vs. accelerator performance in a more realistic use case.
