NVJiangShao

@StudyingShao

NVIDIA

Language Breakdown

Lines of code distribution across 2 owned repositories

990K Total LOC

Python: 936,133 lines (94.6%)

Cuda: 30,866 lines (3.1%)

C++: 16,171 lines (1.6%)

CMake: 4,330 lines (0.4%)

2,095 lines (0.2%)

I-Shaped Developer

Specialist: deep expertise in Python

Python

Cuda

C++

CMake

Collaboration Network

0 active collaborators

Growth: +18%

Top Collaborators

No collaborator data yet.

Coding Streak

Contribution activity over the past year

1 day

Based on GitHub activity

Followers 16

Barry Kang

@Barry-Delaney

NayezPasPeur

@NayezPasPeur

Ahmad Al-qaisi

@ahmad-alqaisi215

Darragh Hanley

@darraghdog

papa

@woshipapa

Following

10 total

Darragh Hanley

@darraghdog

Peng Zhang

@AniZpZ

Yuxi Chi

@cherichy

ZZK

@MARD1NO

Barry Kang

@Barry-Delaney

Top Repositories

TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

4 0

C++

cutlass

CUDA Templates for Linear Algebra Subroutines

2 1

C++

flashinfer

FlashInfer: Kernel Library for LLM Serving

0 0

Python

LowLatencyGroupedGEMM

0 1

Cuda

tensorrtllm_backend

The Triton TensorRT-LLM Backend

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

0 0

Python

QuaRot

Code for QuaRot, an end-to-end 4-bit inference of large language models.

0 0

Python

TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

0 0

Python

marlin

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

0 0

Python

Open Source Impact

Contributions to external projects

8 merged PRs

flashinfer-ai/flashinfer

5795

NVIDIA/TensorRT-LLM

13878

Contributed to 2 repositories