Large Model Inference Acceleration Engineer 4006

San Jose, United States
Posted 13 hours ago
About the Company

This company pioneers short-form video creation and social engagement, with a vast, highly engaged user base. Its platform gives users creative tools, filters, and effects, and its diverse content ecosystem makes it a hub of creativity and expression. A proprietary recommendation algorithm delivers personalized content feeds, driving user engagement and satisfaction. The company's influence in digital media makes it a valuable partner for innovative collaborations and marketing initiatives.


About the Team

We are an applied research team focused on Generative AI and Multimodal Understanding. The group works on advanced generative technologies across image, video, and multimodal systems, enabling scalable and practical AI creation tools. Research areas include generative modeling, image and video synthesis, intelligent editing, and virtual human technologies. The team emphasizes translating cutting-edge research into production-ready, efficient model systems.


Alongside this research, the team builds large-scale, end-to-end AI production pipelines covering model training, optimization, deployment, and real-world applications, with a focus on scalable AI infrastructure and efficiency technologies that support high-volume generative AI and multimodal systems in production environments.


Role Overview

We are seeking an experienced AI model optimization engineer specializing in large model inference acceleration. This role focuses on optimizing inference performance, scalability, and deployment efficiency for large-scale generative and foundation models across heterogeneous hardware environments.


Responsibilities

• Design and optimize large model inference pipelines for low-latency and high-throughput production deployments
• Apply high-performance optimization techniques across diverse hardware architectures
• Benchmark and profile deep learning models to identify performance bottlenecks
• Optimize compute, memory, and kernel performance for large model inference
• Develop distributed inference and acceleration strategies for serving large models at scale
• Collaborate with infrastructure and production engineering teams to integrate optimized models into production systems

Minimum Qualifications

• Master’s or PhD in Computer Science, Electrical Engineering, AI, or a related field
• Strong software engineering skills in Python and C++
• Strong CUDA programming experience
• 5+ years of experience in AI model inference optimization or acceleration
• Experience with ML compilers and performance optimization techniques
• Experience with parallel computing, graph fusion, and kernel optimization
• Hands-on experience with inference acceleration frameworks and libraries such as TensorRT, Triton, or CUTLASS
• Solid understanding of transformer and diffusion model architectures
• Strong system-level performance debugging skills


Language Requirement
• Professional working proficiency in Mandarin and English is required for cross-regional technical collaboration


Preferred Qualifications

• Experience optimizing large generative or multimodal models in production
• Experience with distributed inference systems
• Experience with hardware-aware model optimization
• Experience working closely with AI infrastructure or ML systems teams


Equal Opportunity Statement

We are an equal opportunity employer and consider qualified applicants in accordance with applicable laws. Reasonable accommodations are available during the recruitment process when needed.

Job Features

Job Category: AI Research
Seniority: Senior IC / Tech Lead
Recruiter: nina.li@ocbridge.ai