Senior Software Engineer, ML Performance & Systems Job in San Francisco, CA

Join our team at fal, where we are dedicated to pushing the boundaries of model performance for generative media models. You will play a vital role in designing and implementing innovative model serving architectures using our proprietary inference engine, with a clear focus on maximizing throughput while reducing latency and resource consumption.

As a key contributor, you will develop performance monitoring and profiling tools to pinpoint bottlenecks and uncover optimization opportunities. You will work closely with our Applied ML team and with customers at frontier labs in the media space to ensure their workloads are optimized for our accelerator.

Key Responsibilities:

Drive the advancement of model performance for generative media models at fal.
Architect and implement cutting-edge solutions for model serving on our in-house inference engine, prioritizing throughput, latency, and resource efficiency.
Create tools for performance monitoring and profiling to detect bottlenecks and enhance optimization strategies.
Collaborate closely with our Applied ML team and customers, ensuring they derive maximum benefit from our accelerator solutions.

Requirements:

Robust background in systems programming with a proven track record of identifying and resolving performance bottlenecks.
Extensive knowledge of the latest ML infrastructure, including but not limited to PyTorch, TensorRT, TransformerEngine, and Nsight, with a keen interest in staying updated with developments in these areas.
Strong understanding of the underlying hardware (currently Nvidia-based systems) and the ability to dive deep into the stack to troubleshoot and optimize, including writing custom GEMM kernels with CUTLASS for common matrix shapes.
Experience with Triton (or a strong willingness to learn it), along with comparable expertise in lower-level accelerator programming.
Familiarity with multi-dimensional model parallelism that combines methods such as tensor parallelism and context/sequence parallelism.
Understanding of the internals of Ring Attention, FA3, and FusedMLP implementations.

Compensation:

$180,000 - $500,000 + equity + comprehensive benefits package

Location: San Francisco, CA

What we offer at fal:

Engaging and challenging projects.
Emphasis on work-life balance.
Attractive salary and equity options.
Employee-friendly equity terms, including early and extended exercise options.
Opportunity to work in our downtown San Francisco office, with remote options available for exceptional candidates.
Visa sponsorship available to assist with relocation to San Francisco.
Comprehensive health, dental, and vision insurance (US).
Regular team events and offsites.
Generous paid vacation policy of 4 weeks.