dennyglee.github.io

Hi there, my name is Denny Lee, and I am a developer advocate at Databricks, a long-time contributor to Apache Spark™ and MLflow, Delta Lake maintainer, LLM Avalanche creator, Data Brew by Databricks caster (Spotify, Apple Podcasts, YouTube), and long-time Seattle-ite. In my past life, I was part of Microsoft Engineering for SQL Server, Cosmos DB, and Bing, and part of the Project Isotope incubation team that brought Apache Hadoop into Microsoft.

Here are some of my presentations and videos on YouTube @dennyglee; I’m also the (co-)author of several books, including Delta Lake: The Definitive Guide (Early Release) and Learning Spark, 2nd Edition. If you’re interested in posts about coffee, food, travel, cycling, and data, check out my personal blog.

I am the (co-)author of the following posts, links, and assets, listed in reverse chronological order. Some posts were originally on my WordPress site but have been moved over to GitHub, blemishes and all, for posterity.

This blog is inspired by Frank McSherry’s musings, which I highly recommend you follow, especially if you’re a fan of Rust like I am.


date topics source description
2023/08/31 Delta   What is the Delta Lake Transaction Log?
2023/08/25 Spark, Delta   Why Structured Streaming and Delta Lake for Batch ETL?
2023/07/27 LLMs   Quick Start with llama.cpp with Llama 2 and Macbook M2 Air
2023/06/29 LLMs, Spark Databricks Introducing English as the New Programming Language for Apache Spark
2023/06/29 Delta Lake Databricks Announcing Delta Lake 3.0 with New Universal Format and Liquid Clustering
2023/06/26 LLMs site LLM Avalanche: Over 40 speakers and 900 people attended this LLM conference-within-a-conference to kick start Data + AI Summit 2023
2023/03/20 Spark, Delta   Why does altering a Delta Lake table schema not show up in the Spark DataFrame?
2022/12/13 Delta Lake delta.io Building a more efficient data infrastructure for machine learning with Open Source using Delta Lake, Amazon SageMaker, and EMR
2022/11/10 community Integration Developer News How Developers Can Manage and Contribute to Successful Open-Source Projects
2022/08/11 Delta Lake delta.io Apache Flink Source Connector for Delta Lake tables
2022/08/02 Delta Lake delta.io Delta 2.0 - The Foundation of your Data Lakehouse is Open
2022/06/15 Databricks Databricks Defining the Future of Data & AI: Announcing the Finalists for the 2022 Databricks Data Team OSS Award
2022/05/18 Delta Lake delta.io Multi-cluster writes to Delta Lake Storage in S3
2022/05/05 Delta Lake delta.io Delta Lake 1.2 - More Speed, Efficiency and Extensibility Than Ever
2022/04/27 Delta Lake delta.io Writing to Delta Lake from Apache Flink
2022/03/24 Delta Lake, Trino Starburst Starburst and Databricks Collaborate on the Trino Delta Lake Connector
2022/03/16 Delta Lake Databricks Extending Delta Sharing to Google Cloud Storage
2022/03/12 Delta Lake, PrestoDB PrestoDB Native Delta Lake Connector for Presto
2022/01/31 Delta Lake Databricks Make Your Data Lakehouse Run, Faster With Delta Lake 1.1
2022/01/28 Delta Lake Databricks The Ubiquity of Delta Standalone: Java, Scala, Hive, Presto, Trino, Power BI, and More!
2022/01/21 Delta Lake Databricks Extending Delta Sharing for Azure
2021/12/01 Delta Lake Databricks The Foundation of Your Lakehouse Starts With Delta Lake
2021/04/23 podcasts Databricks How We Launched a Podcast: Lessons, (Minor) Mishaps & Key Takeaways
2021/04/21 Delta Lake Databricks Attack of the Delta Clones (Against Disaster Recovery Availability Complexity)
2021/02/10 Delta Lake Databricks Automatically Evolve Your Nested Column Schema, Stream From a Delta Table Version, and Check Your Constraints
2020/12/22 Delta Lake Databricks Natively Query Your Delta Lake With Scala, Java, and Python
2020/11/20 Delta Lake Databricks How Scribd Uses Delta Lake to Enable the World’s Largest Digital Library
2020/09/29 Delta Lake Databricks Diving Into Delta Lake: DML Internals (Update, Delete, Merge)
2020/08/27 Delta Lake Databricks Enabling Spark SQL DDL and DML in Delta Lake on Apache Spark 3.0
2020/06/18 Delta Lake Databricks Time Traveling with Delta Lake: A Retrospective of the Last Year
2020/05/19 Delta Lake Databricks Schema Evolution in Merge Operations and Operational Metrics in Delta Lake
2020/04/14 health Databricks COVID-19 Datasets Now Available on Databricks: How the Data Community Can Help
2020/01/29 Delta Lake Databricks Query Delta Lake Tables from Presto and Athena, Improved Operations Concurrency, and Merge performance
2019/11/05 ML Databricks Using AutoML Toolkit’s FamilyRunner Pipeline APIs to Simplify and Automate Loan Default Predictions
2019/10/03 Delta Lake Databricks Simple, Reliable Upserts and Deletes on Delta Lake Tables using Python APIs
2019/09/24 Delta Lake Databricks Diving Into Delta Lake: Schema Enforcement & Evolution
2019/09/10 ML Databricks Using AutoML Toolkit to Automate Loan Default Predictions
2019/08/21 Delta Lake Databricks Diving Into Delta Lake: Unpacking The Transaction Log
2019/08/14 Delta Lake, ML Databricks Productionizing Machine Learning with Delta Lake
2019/06/18 Delta Lake, Streaming Databricks Simplifying Streaming Stock Analysis using Delta Lake and Apache Spark: On-Demand Webinar and FAQ Now Available!
2019/05/02 ML Databricks Detecting Financial Fraud at Scale with Decision Trees and MLflow on Databricks
2019/04/30 ML, MLflow Databricks Using Dynamic Time Warping and MLflow to Detect Sales Trends
2019/04/30 ML, MLflow Databricks Understanding Dynamic Time Warping
2018/11/13 ML Databricks Applying your Convolutional Neural Network: On-Demand Webinar and FAQ Now Available!
2018/10/29 Delta Databricks Simplifying Change Data Capture with Databricks Delta
2018/10/22 ML Databricks Training your Neural Network: On-Demand Webinar and FAQ Now Available!
2018/10/03 ML, MLflow Databricks MLflow v0.7.0 Features New R API by RStudio
2018/10/01 ML Databricks Introduction to Neural Networks: On-Demand Webinar and FAQ Now Available!
2018/09/18 ML, Spark Databricks Simplify Market Basket Analysis using FP-growth on Databricks
2018/09/13 ML Databricks Identify Suspicious Behavior in Video with Databricks Runtime for Machine Learning
2018/09/12 ML, MLflow Databricks MLflow On-Demand Webinar and FAQ Now Available!
2018/09/09 Delta Lake Databricks Building a Real-Time Attribution Pipeline with Databricks Delta
2018/09/09 ML Databricks Loan Risk Analysis with XGBoost and Databricks Runtime for Machine Learning
2018/08/08 MLflow Databricks MLflow 0.4.2 Released
2018/07/19 Spark Databricks Simplify Advertising Analytics Click Prediction with Databricks Unified Analytics Platform
2018/07/19 Spark, Delta Databricks Simplify Streaming Stock Data Analysis Using Databricks Delta
2018/07/19 Streaming, Spark, Delta Databricks Make Your Oil and Gas Assets Smarter by Implementing Predictive Maintenance with Databricks
2018/07/09 Spark Databricks Analyze Games from European Soccer Leagues with Apache Spark and Databricks
2018/07/02 Spark, Streaming Databricks Build a Mobile Gaming Events Data Pipeline with Databricks Delta
2018/06/27 R Databricks Announcing RStudio and Databricks Integration
2017/11/07 CosmosDB github Lambda Architecture with Azure Cosmos DB and HDInsight (Apache Spark)
2017/07/01 Spark O’Reilly Introduction to Apache Spark 2.0
2017/02/18 Spark book Learning PySpark: Build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0
2016/12/02 Spark github How Apache Spark performs a fast count using the parquet metadata
2016/06/30 Spark Databricks Introducing Getting Started with Apache Spark on Databricks
2016/06/22 Spark Databricks, KDNuggets Apache Spark Key Terms, Explained
2016/06/08 Spark Databricks Another Record-Setting Spark Summit
2016/05/28 Spark   On-Time Flight Performance with GraphFrames for Apache Spark
2016/05/24 Spark, Genomics Databricks Predicting Geographic Population using Genome Variants and K-Means
2016/05/24 Spark, Genomics Databricks Parallelizing Genome Variant Analysis
2016/05/24 Spark, Genomics Databricks Genome Sequencing in a Nutshell
2016/03/16 Spark, Graph Databricks On-Time Flight Performance with GraphFrames for Apache Spark
2016/02/11 Spark, ML InfoWorld Why you should use Spark for machine learning
2016/02/11 Spark   Presentation: Jump Start into Apache® Spark™ 2.0
2016/02/02 Spark Databricks An Illustrated Guide to Advertising Analytics
2015/12/19 community Databricks Databricks launches Meetup-in-a-box for Apache Spark Meetup Organizers
2015/11/09 Spark insideBIGDATA Apache Spark is the Smartphone of Big Data
2015/09/24 Spark Databricks Spark Survey 2015 Results are now available
2015/08/31 Spark Databricks Data Exploration with Databricks
2015/06/09 Spark Databricks Introduction to Databricks
2015/06/04 Spark, ML Databricks Simplify Machine Learning on Apache Spark with Databricks
2014/01/06 HDFS, pig   Quick Tip for Compressing Many Small Text Files within HDFS via Pig
2013/09/30 SSAS   Analysis Services Multidimensional: It is the Order of Things
2013/05/14 random   In the context of quantum entanglement and time travel – Stargate may be more correct than Star Trek
2013/04/26 Hive   Optimizing Joins running on HDInsight Hive on Azure at GFS
2013/03/18 blob   Why use Blob Storage with HDInsight on Azure
2013/03/12 Avro, Hadoop   Using Avro with HDInsight on Azure at 343 Industries
2013/02/04 Spark   Installing Spark 0.6.1 Standalone on OSX Mountain Lion (10.8)
2012/12/03 Hadoop, pig   Getting your Pig to eat ASV blobs in Windows Azure HDInsight
2012/09/26 SSAS, Hive Microsoft SQL Server Analysis Services to Hive (backup)
2012/09/03 random   In the context of quantum entanglement and teleportation – Stargate may be more correct than Star Trek
2012/06/28 SSAS Microsoft Microsoft SQL Server Analysis Services Multidimensional Performance and Operations Guide
2012/05/08 Hadoop   Installing Hadoop on OSX Lion (10.7)
2012/03/01 Hadoop, BI   BI and Big Data–the best of both worlds!
2012/02/17 Hadoop, JS   Hadoop JavaScript– Microsoft’s VB shift for Big Data
2012/01/31 big data   Moving data to compute or compute to data? That is the Big Data question
2012/01/24 big data   Scale Up or Scale Out your Data Problems? A Space Analogy
2012/01/21 PowerPivot, Hadoop   Connecting PowerPivot to Hadoop on Azure – Self Service BI to Big Data in the Cloud
2012/01/12 Hadoop, Azure   A funky way to do Hive and Hadoop … on Azure
2011/12/15 Hadoop, Azure   An Azure Elephant Never Forgets…
2011/10/01 MS-SQL Microsoft SQL Server 2008 R2: Analysis Services Performance Guide (backup)
2010/12/10 MS-SQL Microsoft Measuring and Understanding the Performance of Your SSIS Packages in the Enterprise (SQL Server Video)
2010/07/01 MS-SQL Microsoft Analysis Services ROLAP for SQL Server Data Warehouses (backup)
2010/06/01 MS-SQL Microsoft Scale-Out Querying for Analysis Services with Read-Only Databases (backup)
2009/12/22 Healthcare book Transforming Health Care Through Information: Case Studies (Health Informatics)
2009/12/16 MS-SQL book Professional Microsoft SQL Server Analysis Services 2008 with MDX
2009/05/12 MS-SQL Microsoft Disk Partition Alignment Best Practices for SQL Server
2008/11/05 MS-SQL Microsoft Reaching Compliance: SQL Server 2008 Compliance Guide (backup)
2008/04/17 MS-SQL Microsoft Analysis Services Distinct Count Optimization (backup)
2007/09/24 Privacy   Analyzing Data while Protecting Privacy – A Differential Privacy Case Study
2007/09/01 MS-SQL Microsoft SQL Server 2005: Precision Considerations for Analysis Services Users (backup)
2006/03/02 Research paper (acknowledgement) Early establishment of a pool of latently infected, resting CD4+ T cells during primary HIV-1 infection
2001/10/01 MS-SQL book Professional SQL Server 2000 Data Warehousing with Analysis Services