class: center, middle

# Serverless Data Science: Scaling algorithms made simple

## Nabarun Pal

---

# About Me

- Platform Engineer at rorodata
- Optimizing development time through simple abstractions/tooling
- Venturing into Container Orchestration and Serverless Computing
- Contributor to the Kubernetes ecosystem

.footnote.right.bottom[
www.nabarun.in
github.com/palnabarun
twitter.com/_palnabarun
]

---

# Agenda

- Genesis
- Current State
- The Abstraction
- Demo
- Performance Metrics
- Limitations and Future Improvements

---

class: center, middle

# Genesis

---

# Multithreading

### Pros

- Lightweight
- Shared state between multiple threads
- Works well for I/O-bound applications

### Cons

- Subject to the Global Interpreter Lock
- Context switching overhead
- Code prone to race conditions
- Does not speed up CPU-bound tasks

---

# Multiprocessing

### Pros

- Isolation of memory space
- Leverages multiple processors and cores
- GIL limitations don’t apply
- Synchronization primitives like locks are unnecessary unless sharing data
- Works well for CPU-bound tasks

### Cons

- Sharing data between processes is more complicated
- Bulky memory footprint
- Limited scaling

---

# Can Kubernetes help?

## Pros

- Abstracts out infrastructure
- Simple interface
- Can scale based on workload

## Cons

- Layer on top of VMs
- Slow to scale up/down
- Autoscaling is not a core functionality
- Needs dedicated time to manage

---

# What about Serverless?

- Zero Infrastructure Management
- Near Infinite Scaling
- High Availability
- No Idle Resources
- Suitable for short-lived workloads

---

class: center, middle

# “Any problem in computer science can be solved by introducing an extra level of indirection.”

## David Wheeler

---

class: center, middle

# “LambdaPool”

#### The Indirection

---

# Requirements

- Minimal overhead for users
- Simple way to create, delete, list and update lambda functions
- Coherent ways to invoke the lambda function
- Easy to use interface

---

# Features

- CLI to create, list, update and delete functions
- Support for specifying function layers and listing the layers used by each function
- LambdaPool interface
- LambdaExecutor interface

---

# Source Code

https://github.com/rorodata/lambdapool

# Installation

```bash
$ pip install --user https://lambdapool-releases.s3.amazonaws.com/lambdapool-0.9.7.tar.gz
```

---

# CLI

```bash
$ lambdapool
Usage: lambdapool [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  create  Create a new function
  delete  Delete a function
  list    List all deployed functions
  update  Update an existing function
```

---

# Project Structure

```bash
examples $ tree algorithms/
algorithms
├── functions.py
└── __init__.py

examples $ cat algorithms/functions.py
def fibonacci(n):
    '''A naive implementation of computing n'th fibonacci number
    '''
    if n == 0:
        return 0
    if n == 1:
        return 1
    return fibonacci(n-1) + fibonacci(n-2)
```

---

# Creating a function

```bash
examples $ lambdapool create algorithms algorithms/ --timeout=300 --memory=128
=== Creating lambdapool function ===
=== Copying all specified files and directories ===
Copying algorithms.py...
...
=== Succesfully created lambdapool function algorithms ===
```

---

# LambdaPool

- Implements the same interface as `ThreadPool` and `ProcessPool`

```python
>>> from lambdapool import LambdaPool
>>> pool = LambdaPool(workers=10, lambda_function='algorithms')
>>> from algorithms.functions import fibonacci
>>> pool.apply(fibonacci, args=(10,))
55
>>> pool.map(fibonacci, range(100))
[0, 1, 1, 2, 3, 5, 8, ...]
```

---

# LambdaExecutor

- Implements the same interface as `ThreadPoolExecutor` and `ProcessPoolExecutor`

```python
>>> from lambdapool import LambdaExecutor
>>> from algorithms.functions import fibonacci
>>> with LambdaExecutor(lambda_function='algorithms') as executor:
...     futures = [executor.submit(fibonacci, n) for n in range(100)]
...     fibonaccis = [f.result() for f in futures]
>>> print(fibonaccis)
[0, 1, 1, 2, 3, 5, 8, ...]
```

---

class: center, middle

# Demo

---

# Benefits

- Compute time
- Compute costs
- Developer time

---

# Current Limitations

- Serialization of the payload is a hurdle
- Decoupling between function provisioning and invocation
- Size of execution environment

### Inherent to Serverless

- Cold start issues
- Additional network overhead
- Not suitable for long-running workloads
- Troubleshooting is hard
- Local testing

---

# Future Goals

- Distribute lambdapool through PyPI
- Permissions management system
- System to fetch execution logs
- Better layer management
- Make the function update process intelligent

---

class: center, middle

# Thank You!

####
Contact Us:
opensource@rorodata.com
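
---

# Appendix: The executor pattern, locally

The LambdaExecutor slide says it follows the `ThreadPoolExecutor` / `ProcessPoolExecutor` interface. As a rough local baseline, here is the same submit-and-collect pattern using only the standard library. `run_with` is a hypothetical helper made up for this sketch, not part of lambdapool; the assumption is that any executor exposing `submit()` and the context-manager protocol could be dropped in.

```python
from concurrent.futures import ThreadPoolExecutor


def fibonacci(n):
    """Naive recursive Fibonacci, mirroring algorithms/functions.py."""
    if n == 0:
        return 0
    if n == 1:
        return 1
    return fibonacci(n - 1) + fibonacci(n - 2)


def run_with(executor_factory, inputs):
    """Hypothetical helper: submit one task per input, collect in input order."""
    with executor_factory() as executor:
        futures = [executor.submit(fibonacci, n) for n in inputs]
        return [f.result() for f in futures]


# Local baseline with standard-library threads (no AWS involved).
results = run_with(lambda: ThreadPoolExecutor(max_workers=4), range(10))
print(results)
```

Because of the GIL, threads will not speed up this CPU-bound function; the point is only the calling pattern, which is what the deck proposes to keep while moving the work to Lambda.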