# Toward 1.6T Low-Power Coherent DSP: Challenges, and Lessons Learned from Preceding Generations

## Shu Hao Fan, Ray L. Nguyen, Jose Luis Correa Lust, Hungchang Chien, Shih-Cheng Wang

Marvell, Santa Clara, CA 94087, USA

fshuhao@marvell.com

**Abstract:** We review the progression of coherent DSP ASIC technology since 40nm silicon and identify the critical path toward beyond-terabit-per-wavelength pluggable modules. Challenges in various aspects of ASIC design and optical components are explored. © 2024 Marvell

# 1. Introduction

As demand for Internet capacity continues to grow, coherent optics has emerged as a transformative solution for ultra-high-capacity, long-distance interconnect systems. It was not until 2008 that CMOS and III-V optical technology matured enough to commercialize coherent communications. Today, coherent pluggable modules are in high demand across a variety of network infrastructures. Historically, the power consumption of coherent DSP (CDSP) ASICs has been one of the most challenging barriers to pluggable applications, which require total module power below 30W. As shown in Table 1, Marvell's move to the 16nm silicon node finally brought CDSP ASIC power below 20W. This opens a wide range of opportunities for CDSP, which can now be deployed beyond Metro/Long-Haul applications.

| PRODUCT               | SI NODE | DSP          | ADD. FEATURES | DISTANCE | POWER CONSUMPTION |
|-----------------------|---------|--------------|---------------|----------|-------------------|
| 1<sup>ST</sup> GEN    | 40nm    | 50G QPSK     |               | 4,000km  | 40W               |
| 2<sup>ND</sup> GEN    | 28nm    | 200G QAM16   | LDPC          | 15,000km | 40 - 100W         |
| 3<sup>RD</sup> GEN    | 16nm    | 200G QAM16   | LDPC, OTN/ETH | 2,500km  | 10 - 20W          |
| CANOPUS<sup>TM</sup>  | 7nm     | 400G QAM16PS | LDPC, OTN/ETH | 2,000km  | 10 - 20W          |
| ORION<sup>TM</sup>    | 5nm     | 800G QAM16PS | LDPC, OTN/ETH | 2,000km  | 10 - 20W          |

Table 1. Generations of Marvell products

Today, the industry anticipates 1.6T CDSP for datacenter and cloud-AI computation in the next-generation 1-Pbps network switching infrastructure. The 1.6T CDSP line rate requires a symbol rate above 240GBaud, including framing and error-correction overhead. At such speeds, even a minor imperfection can degrade overall system performance. We review the challenges from our experience and examine the most critical factors on the path to 1.6T and beyond, including analog amplifier peaking nonlinearity, analog front-end (AFE) optimization, and DSP optimization.
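The relation between net data rate and symbol rate can be checked with a short calculation. The sketch below is illustrative only: the modulation format (dual-polarization 16QAM at 4 bits per symbol per polarization) and the 20% combined framing and FEC overhead are assumptions chosen to land near the 240GBaud figure, not values stated for the 1.6T design.

```python
def required_baud(net_bit_rate, bits_per_symbol_per_pol, overhead, polarizations=2):
    """Symbol rate (Baud) needed to carry net_bit_rate once overhead is added."""
    line_rate = net_bit_rate * (1.0 + overhead)
    return line_rate / (polarizations * bits_per_symbol_per_pol)

# Assumed: dual-polarization 16QAM, 20% total framing + FEC overhead.
baud = required_baud(1.6e12, bits_per_symbol_per_pol=4, overhead=0.20)
print(f"{baud / 1e9:.0f} GBaud")
```

Probabilistic shaping, as used in the QAM16PS products of Table 1, lowers the information carried per symbol, so a shaped format would need a higher symbol rate for the same overhead.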

# 2. DSP and Optics Co-optimization

## 2.1. Amplifier Nonlinearity

Characterizing TIA/driver nonlinearity is difficult, especially at high frequency, and differences between specifications and measurements are often observed. Standard nonlinearity metrics, such as 3<sup>rd</sup>-harmonic distortion, are measured with a sinusoidal input below 30GHz so that the harmonic stays within the 100GHz limit of the test equipment. However, the majority of 100GBaud amplifiers incorporate peaking near the Nyquist frequency, consequently exhibiting nonlinearity over 50GHz. To characterize this nonlinearity, we devised a two-tone method that creates an amplitude envelope. As shown in Fig. 1(a), TIA output clipping is clearly observable at 45GHz, which is not possible with a traditional single-tone input. The optimal TIA output voltages for best SNDR at various frequencies are shown in Fig. 1(b). Clipping at a particular frequency restricts the overall TIA output and affects ADC performance. Now that we can characterize nonlinearity over the full frequency range, we can optimize ADC design parameters, including bandwidth roll-off, ENoB, and voltage range. In the meantime, we can work with driver and TIA vendors, using a more accurate model, to determine the right amount of peaking for the nonlinearity/bandwidth tradeoff.



Figure 1. (a) TIA output at 45GHz with high and low gains. (b) TIA output maximal and SNDR-optimal voltage
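The principle of a two-tone measurement can be sketched numerically. Two closely spaced tones around 45GHz beat into an amplitude envelope, and a saturating stage produces third-order intermodulation products that fall next to the tones rather than at the 135GHz third harmonic. The code below is a generic illustration, not the measurement procedure used here: the `tanh` amplifier model, the tone spacing, and the sample rate are all assumptions.

```python
import numpy as np

def two_tone(fc, df, amplitude, fs, n):
    """Two equal tones at fc +/- df/2; their sum has a beat envelope peaking at `amplitude`."""
    t = np.arange(n) / fs
    return 0.5 * amplitude * (np.cos(2 * np.pi * (fc - df / 2) * t)
                              + np.cos(2 * np.pi * (fc + df / 2) * t))

def soft_clip(x, v_sat):
    """Hypothetical memoryless saturating amplifier with unit small-signal gain."""
    return v_sat * np.tanh(x / v_sat)

def tone_level_db(y, fs, f):
    """Level (dB) of the FFT bin nearest to frequency f."""
    spec = np.abs(np.fft.rfft(y)) / len(y)
    k = int(round(f * len(y) / fs))
    return 20 * np.log10(spec[k] + 1e-300)

def im3_dbc(amplitude, v_sat=1.0, fs=512e9, n=4096, fc=45e9, df=1e9):
    """Third-order intermodulation product (at 2*f1 - f2) relative to the lower tone."""
    y = soft_clip(two_tone(fc, df, amplitude, fs, n), v_sat)
    fundamental = tone_level_db(y, fs, fc - df / 2)
    im3 = tone_level_db(y, fs, fc - 1.5 * df)
    return im3 - fundamental

# fs/n = 125 MHz, so every tone and product sits exactly on an FFT bin.
for drive in (0.1, 0.5, 1.0):
    print(f"envelope peak {drive:.1f} x v_sat: IM3 = {im3_dbc(drive):.1f} dBc")
```

The intermodulation level rises steeply as the envelope peak approaches the assumed saturation voltage, which is how clipping becomes visible at frequencies where a single-tone harmonic would be out of reach.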

## 2.2. DSP and Laser Frequency Stability

Laser frequency stability provides another example of DSP and optics co-optimization. Laser linewidth is one of the key specifications that enable long-distance transmission. However, we found that linewidth alone is not sufficient, because abrupt laser frequency surges cause intermittent timing errors. With a better method for measuring laser characteristics, described in detail in our previous work [1], we can design codeword interleaving and timing recovery that accommodate these errors, which lowers the cost and power of the lasers.
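To illustrate why codeword interleaving helps with intermittent error bursts, the sketch below uses a generic row-column block interleaver. It is not the interleaver used in the product, and the depth, codeword length, and burst length are hypothetical values chosen for the example.

```python
import numpy as np

def interleave(codewords):
    """Input shape (depth, length); symbols are sent column by column."""
    return np.asarray(codewords).T.reshape(-1)

def deinterleave(stream, depth, length):
    """Inverse of interleave(): recover the (depth, length) codeword array."""
    return np.asarray(stream).reshape(length, depth).T

depth, length, burst = 8, 64, 16           # hypothetical parameters

# Mark a burst of consecutive symbol errors on the channel.
errors = np.zeros(depth * length, dtype=int)
errors[100:100 + burst] = 1

# With interleaving, the burst is spread across all codewords.
per_codeword_interleaved = deinterleave(errors, depth, length).sum(axis=1)
# Without interleaving, codewords are sent one after another.
per_codeword_plain = errors.reshape(depth, length).sum(axis=1)

print("errors per codeword, interleaved:", per_codeword_interleaved)
print("errors per codeword, plain:      ", per_codeword_plain)
```

A burst of B symbols touches each codeword at most ceil(B / depth) times, so the interleaving depth can be sized against the measured duration of the frequency surges and the correction capability of the code.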

## 2.3. Bandwidth and ADC Sampling Ratio

To lower DSP power, it is essential to minimize the DAC/ADC oversampling ratio while maintaining performance. The primary source of performance degradation is signal aliasing, which can be alleviated only by increasing the oversampling ratio or by sharpening the roll-off of the analog bandwidth. At low oversampling ratios, the conventional 3-dB bandwidth is no longer a sufficient predictor of performance when assessing aliasing; how the roll-off spectrum behaves at the 6-, 8-, 10-, and 20-dB thresholds is also crucial as a design reference. With better full-chain models (combining PD, TIA, PCB, packaging, and ASIC), every aspect of the component design can be finely adjusted to balance power and performance.
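The point about the 3-dB bandwidth can be made concrete with a simple model. The sketch below assumes a Butterworth-type magnitude response, which is only a stand-in for a real full-chain response, and compares two responses with the same 3-dB bandwidth but different roll-off orders. The 100GHz bandwidth, the orders, and the oversampling ratios are hypothetical.

```python
import math

def x_db_bandwidth(f3db, order, x_db):
    """Frequency at which a Butterworth-type response of given order is x_db below DC."""
    return f3db * (10 ** (x_db / 10) - 1) ** (1 / (2 * order))

def attenuation_db(f, f3db, order):
    """Attenuation (dB) of the same response at frequency f."""
    return 10 * math.log10(1 + (f / f3db) ** (2 * order))

def fold_attenuation_db(baud, osr, f3db, order):
    """Attenuation at fs - baud/2, the frequency that aliases onto the band edge baud/2."""
    fs = osr * baud
    return attenuation_db(fs - baud / 2, f3db, order)

baud, f3db = 240e9, 100e9                  # hypothetical values
for order in (2, 5):
    widths = [x_db_bandwidth(f3db, order, x) / 1e9 for x in (6, 8, 10, 20)]
    print(f"order {order}: 6/8/10/20-dB bandwidths (GHz) =",
          ", ".join(f"{w:.0f}" for w in widths))
    for osr in (1.125, 1.25):
        a = fold_attenuation_db(baud, osr, f3db, order)
        print(f"  oversampling {osr}: {a:.1f} dB at the folding frequency")
```

Both responses share the same 3-dB point, yet the suppression of content that folds back into the signal band differs by tens of dB, which is why the deeper thresholds matter when the oversampling ratio is low.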

# 3. Analog Front-End Optimization

Though silicon technology scaling has greatly benefited digital circuitry in density and power consumption, AFE circuits see diminishing benefits at smaller geometries. The small increase in fT of FinFETs is offset by relatively larger device parasitics, which are further complicated by more stringent design rules. While Si devices continue to scale down, the minimum ESD requirements do not, and this limits the maximum achievable bandwidth. Furthermore, although pre-emphasis and higher power can be spent on the transmit side to precondition the signal and maximize transmit power, the low signal level at the receiver input imposes a minimum SNR on the design, which sets a lower bound on power consumption. In the meantime, as baud rate increases, analog bandwidth must scale up and jitter must scale down proportionally. Fig. 2 plots the published achievable transceiver throughput in Gb/s against CMOS process geometry; it indicates a roughly 2x increase at every halving of the CMOS node, which occurs every 4.7 years on average. If this trend continues, 1.6T would be expected by 2025.
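The coupling between signal frequency and the jitter budget can be illustrated with the standard jitter-limited SNR expression for a sampled sinusoid, SNR = -20 log10(2π f σ). The sketch below applies it to example numbers; the 60GHz and 120GHz input frequencies and the 50fs rms jitter are assumptions for illustration, not design values.

```python
import math

def jitter_limited_snr_db(f_in_hz, rms_jitter_s):
    """SNR ceiling set by rms sampling-clock jitter for a sinusoid at f_in_hz."""
    return -20 * math.log10(2 * math.pi * f_in_hz * rms_jitter_s)

def max_jitter_for_snr(f_in_hz, snr_db):
    """Largest rms jitter that still allows snr_db at f_in_hz."""
    return 10 ** (-snr_db / 20) / (2 * math.pi * f_in_hz)

base = jitter_limited_snr_db(60e9, 50e-15)        # assumed operating point
doubled = jitter_limited_snr_db(120e9, 50e-15)    # same jitter, twice the frequency
print(f"60 GHz, 50 fs:  {base:.1f} dB")
print(f"120 GHz, 50 fs: {doubled:.1f} dB")
print(f"jitter needed at 120 GHz for the same SNR: "
      f"{max_jitter_for_snr(120e9, base) * 1e15:.0f} fs")
```

Under this model, doubling the input frequency with unchanged jitter costs about 6dB of SNR, so holding the SNR constant requires halving the rms jitter.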

Many complex bandwidth extension techniques, such as series/shunt inductive peaking, T-coil peaking, and active peaking, have been used extensively over the years, leading to an achievable analog bandwidth of 60+ GHz at the time of this writing. However, stacking many inductive elements can lead to a sharper roll-off response with in-band ripple, complex reflections, nonlinear phase response, and frequency-dependent distortion, all of which ultimately require far more complex equalization in the DSP. To properly account for the complete link budget, co-simulation and co-optimization among AFE, optics, and DSP have thus become a necessity.

For example, we slightly relaxed the low-frequency SNR requirement to reduce the load seen by the input bumps and re-invested that power into bandwidth, linearity, and jitter optimization. Our RX full-scale range was also pushed as high as possible into the optimal amplification range of the TIA, where signal amplification does not worsen overall SNR. Additionally, voltage rails and regulators were carefully separated to reduce supply-induced coupling and deterministic jitter. Sophisticated calibration schemes were also devised to remove high-frequency and nonlinear effects not seen at lower speeds. However, this effort is partially gated by the quality of scalable models of optical components and by the availability of CAD tools that can provide end-to-end verification.



Figure 2. Transceiver scaling trend over the years, showing a doubling of speed at every node halving, or every 4.7 years.

# 4. Rx DSP Optimization

Receiver DSP is estimated to consume 50% of the power of a 1.6T coherent pluggable module. Because of clocking limitations, the maximum sampling frequency and the required DSP throughput usually define the overall parallelization factor. A proper balance between clock rate and parallelism helps the design trade power against area. Additionally, the modulation format must be carefully selected to reduce the line rate; a thorough link-budget analysis is needed to find the format that minimizes throughput and parallelism.
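The parallelization factor follows directly from the sample rate and the digital clock frequency. The sketch below shows the arithmetic for one ADC lane; the 1.25 oversampling ratio and the candidate clock frequencies are hypothetical values, not figures from an actual design.

```python
import math

def parallelism(baud, oversampling, clock_hz):
    """Samples per ADC lane that the datapath must process in each clock cycle."""
    return math.ceil(baud * oversampling / clock_hz)

baud, oversampling = 240e9, 1.25           # assumed symbol rate and oversampling
for clock_hz in (0.5e9, 1.0e9, 1.5e9):     # hypothetical DSP clock frequencies
    p = parallelism(baud, oversampling, clock_hz)
    print(f"{clock_hz / 1e9:.1f} GHz clock -> {p} samples per cycle")
```

A faster clock shrinks the datapath width and hence area, but typically costs more power per operation, which is the balance the paragraph above refers to.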

As parallelism grows, it becomes more effective to implement some DSP functions in the frequency domain instead of the time domain. Novel Fast Fourier Transform (FFT) architectures are an important part of the optimized design, since they can support transform sizes that are not necessarily powers of two (2^N) without losing efficiency. Different implementation alternatives can be explored, such as folded FFT [2], pipelined FFT [3], and cascaded FFT.
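As an illustration of frequency-domain processing with a transform size that is not a power of two, the sketch below implements overlap-save FIR filtering with a 384-point FFT (3 x 2^7) and checks it against direct convolution. This is a textbook software model under assumed sizes, not the hardware architecture discussed above; the filter length and block size are hypothetical.

```python
import numpy as np

def overlap_save(x, h, nfft):
    """Filter x with FIR taps h using nfft-point FFT blocks (overlap-save)."""
    m = len(h)
    step = nfft - (m - 1)                          # new input samples per block
    H = np.fft.fft(h, nfft)                        # zero-padded filter spectrum
    n_blocks = -(-len(x) // step)                  # ceiling division
    padded = np.concatenate([
        np.zeros(m - 1, dtype=complex),            # history for the first block
        np.asarray(x, dtype=complex),
        np.zeros(n_blocks * step - len(x), dtype=complex),
    ])
    out = np.empty(n_blocks * step, dtype=complex)
    for b in range(n_blocks):
        block = padded[b * step: b * step + nfft]
        y = np.fft.ifft(np.fft.fft(block) * H)
        out[b * step:(b + 1) * step] = y[m - 1:]   # drop circularly wrapped samples
    return out[:len(x)]

rng = np.random.default_rng(1)
x = rng.standard_normal(5000) + 1j * rng.standard_normal(5000)
h = rng.standard_normal(65)                        # hypothetical 65-tap equalizer
y_fd = overlap_save(x, h, nfft=384)                # 384 = 3 * 2**7
y_td = np.convolve(x, h)[:len(x)]
print("max difference vs. time-domain filtering:", np.max(np.abs(y_fd - y_td)))
```

Freeing the block size from powers of two lets the overlap (filter length minus one) and the number of new samples per block be matched to the parallelization factor rather than rounded up to the next 2^N.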

Finally, because the processing speed has increased faster than the speed of the dynamic effects of the channel, many DSP algorithms, such as polarization or carrier frequency tracking, can be subsampled and reformatted to save power and complexity [4].
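The idea of subsampled tracking can be sketched with a simple carrier-phase example: when the phase drifts slowly relative to the symbol rate, the estimate can be refreshed on only a fraction of the blocks and held in between. The code below uses a generic 4th-power (Viterbi-Viterbi) estimator on QPSK and is not the algorithm of [4]; the block length, update interval, drift, and noise level are assumptions.

```python
import numpy as np

def vv_phase(block):
    """4th-power phase estimate for QPSK on {1, j, -1, -j}, valid for |phase| < pi/4."""
    return np.angle(np.sum(block ** 4)) / 4.0

def track_phase(rx, block_len, update_every):
    """Estimate phase on one block out of `update_every` and hold it in between."""
    est = np.zeros(len(rx))
    current, n_updates = 0.0, 0
    for b in range(len(rx) // block_len):
        sl = slice(b * block_len, (b + 1) * block_len)
        if b % update_every == 0:
            current = vv_phase(rx[sl])
            n_updates += 1
        est[sl] = current
    return est, n_updates

rng = np.random.default_rng(0)
n = 8192
symbols = np.exp(1j * (np.pi / 2) * rng.integers(0, 4, n))
true_phase = np.linspace(0.0, 0.2, n)              # slow drift (assumed)
noise = np.sqrt(0.005) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
rx = symbols * np.exp(1j * true_phase) + noise     # about 20 dB SNR

est_full, updates_full = track_phase(rx, block_len=64, update_every=1)
est_sub, updates_sub = track_phase(rx, block_len=64, update_every=8)
print("updates:", updates_full, "vs", updates_sub)
print("max error (rad):", np.max(np.abs(est_full - true_phase)),
      "vs", np.max(np.abs(est_sub - true_phase)))
```

In this toy case the subsampled tracker runs the estimator one eighth as often while the residual phase error stays small, because the drift between updates is much smaller than the estimation noise budget.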

# 5. Conclusions

Achieving a comprehensive CDSP solution for pluggable applications requires a wide variety of improvements. As line rates rise, a generic DSP solution for all types of optical components may prove insufficient. To overcome this challenge, collaborative development spanning DSP, analog design, and optical components is essential to reduce cost and power. Tight cooperation among CDSP, module, and optics teams can ensure performance and power efficiency as we move forward.

# 6. References

[1] H. Xu, M. O. Rebellato and S.-C. Wang, "System Impact of Laser Phase Noise On 400G And Beyond Coherent Pluggables," 2023 Optical Fiber Communications Conference and Exhibition (OFC), San Diego, CA, USA, 2023, pp. 1-3, doi: 10.1364/OFC.2023.Th1E.1.

[2] P. Zode, A. Thor and A. Y. Deshmukh, "Folded FFT architecture for real-valued signals based on Radix-2<sup>3</sup> algorithm," 2014 2nd International Conference on Devices, Circuits and Systems (ICDCS), Coimbatore, India, 2014, pp. 1-4, doi: 10.1109/ICDCSyst.2014.6926178.

[3] S. L. M. Hassan, N. Sulaiman, I. S. A. Halim, A. A. Ab Rahim and N. E. Abdullah, "Pipelined Fast Fourier Transform (FFT) Processor Power Optimization," 2019 IEEE 7th Conference on Systems, Process and Control (ICSPC), Melaka, Malaysia, 2019, pp. 127-130, doi: 10.1109/ICSPC47137.2019.9068069.

[4] R. Nagarajan, I. Lyubomirsky and O. Agazzi, "Low Power DSP-Based Transceivers for Data Center Optical Fiber Communications (Invited Tutorial)," in Journal of Lightwave Technology, vol. 39, no. 16, pp. 5221-5231, Aug. 15, 2021, doi: 10.1109/JLT.2021.3089901.

M2H.1