Senior Software Engineer, ML Performance & Systems

fal San Mateo, CA
Join our team at fal, where we are dedicated to pushing the boundaries of model performance for generative media models. You will play a vital role in designing and implementing innovative model serving architectures using our proprietary inference engine, with a clear focus on maximizing throughput while reducing latency and resource consumption.

As a key contributor, you will develop performance monitoring and profiling tools to pinpoint bottlenecks and discover optimization opportunities. Collaboration will be essential as you work closely with our Applied ML team and our customers in frontier labs within the media space, ensuring their workloads are optimized for our accelerator.

Key Responsibilities:
  • Drive the advancement of model performance for generative media models at fal.
  • Architect and implement cutting-edge solutions for model serving on our in-house inference engine, prioritizing throughput, latency, and resource efficiency.
  • Create tools for performance monitoring and profiling to detect bottlenecks and enhance optimization strategies.
  • Collaborate closely with our Applied ML team and customers, ensuring they derive maximum benefit from our accelerator solutions.
Requirements:
  • Robust background in systems programming with a proven track record of identifying and resolving performance bottlenecks.
  • Extensive knowledge of current ML infrastructure, including but not limited to PyTorch, TensorRT, TransformerEngine, and Nsight, and a keen interest in keeping up with developments in these areas.
  • Strong understanding of the underlying hardware (currently NVIDIA-based systems) and the ability to dive deep into the stack to troubleshoot and optimize, including writing custom GEMM kernels with CUTLASS for common matrix shapes.
  • Experience with Triton (or a strong willingness to learn it), along with comparable expertise in lower-level accelerator programming.
  • Familiarity with multi-dimensional model parallelism, combining methods such as tensor parallelism and context/sequence parallelism.
  • Understanding of the internals of Ring Attention, FA3, and FusedMLP implementations.
Compensation:
  • $180,000 - $500,000 + equity + comprehensive benefits package
  • Location: San Francisco, CA
What we offer at fal:
  • Engaging and challenging projects.
  • Emphasis on work-life balance.
  • Attractive salary and equity options.
  • Employee-friendly equity terms, including early and extended exercise options.
  • Opportunity to work in our downtown San Francisco office, with remote options available for exceptional candidates.
  • Visa sponsorship available to assist with relocation to San Francisco.
  • Comprehensive health, dental, and vision insurance (US).
  • Regular team events and offsites.
  • Generous paid vacation policy of 4 weeks.

Date Posted: February 19, 2025
Located In: San Mateo, CA