Reddit logo

Reddit

Welcome to the Reddit Tech Blog. We'll be talking about how we contribute to Reddit's mission to bring community and belonging to everyone. Read...

Fredrick Lee (Reddit CISO) Answers Your Questions!

A compiled AMA with Fredrick Lee (Reddit’s CISO) where he answers community questions about his path into cybersecurity, transitioning to leadership, staying current (newsletters, hands-on platforms), career advice (homelabs, programming, Kubernetes, communication), technical security topics (PKI, mTLS, TLS interception, dependency/supply-chain risk, WAFs, eBPF telemetry), vendor evaluation and buy-vs-build tradeoffs, prioritizing security against business goals, and the influence of AI on platform security.

Evolving Signals-Joiner with Custom Joins in Apache Flink

Reddit engineers describe re-architecting Signals-Joiner, a Flink-based streaming enrichment pipeline, to replace tumbling-window CoGroup joins with per-key custom windowing implemented via CoProcessFunction. They unified signal streams into a single Signal class, used keyed state (RocksDB), timers, and buffered early-arriving signals to improve signal enrichment rates (pushing toward 100%) while preserving performance, reliability, and maintainability in production.
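
The per-key buffering idea in this summary can be illustrated outside Flink. The following is a minimal sketch in plain Python, not Flink's CoProcessFunction API: the `KeyedJoiner` class, its method names, the event shapes, and the wait time are all hypothetical, and two dictionaries stand in for keyed state and timers.

```python
from collections import defaultdict


class KeyedJoiner:
    """Toy per-key join: buffers signals per key and enriches the event on timer expiry.

    Hypothetical stand-in for keyed state plus timers; this is not Flink code.
    """

    def __init__(self, wait_ms):
        self.wait_ms = wait_ms
        self.signals = defaultdict(dict)  # key -> {signal_name: value}
        self.events = {}                  # key -> (event, deadline_ms)

    def on_signal(self, key, name, value):
        # Signals may arrive before or after the event; buffer them per key.
        self.signals[key][name] = value

    def on_event(self, key, event, now_ms):
        # "Register a timer" by recording a deadline for this key.
        self.events[key] = (event, now_ms + self.wait_ms)

    def on_time(self, now_ms):
        """Fire expired timers: emit enriched events and clear their state."""
        expired = [k for k, (_, deadline) in self.events.items() if deadline <= now_ms]
        out = []
        for key in expired:
            event, _ = self.events.pop(key)
            out.append({**event, "signals": self.signals.pop(key, {})})
        return out


joiner = KeyedJoiner(wait_ms=100)
joiner.on_signal("post-1", "spam_score", 0.1)          # arrives before the event
joiner.on_event("post-1", {"id": "post-1"}, now_ms=0)
joiner.on_signal("post-1", "lang", "en")               # arrives within the wait
enriched = joiner.on_time(now_ms=100)
print(enriched)
```

Unlike a tumbling window, the deadline here is relative to each key's own event, so a signal that arrives just before or after the event is still joined. The sketch never expires signals whose event does not arrive; a real pipeline would need a cleanup timer for that case.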

Pragmatic, Compliant AI: Reddit’s Journey to adopt AI in Enterprise Applications

Reddit’s Enterprise Applications team describes a principles-driven, pragmatic approach to adopting AI for accounting/finance: enforcing accuracy, privacy, and auditability; using a hybrid architecture (deterministic core + AI for unstructured data) for AR cash application; leveraging orchestration platforms and copilots; and experimenting with agentic AI while maintaining SOX/GRC controls.

Meet the Team Behind r/RedditEng

A people-and-culture post introducing the volunteer team behind r/RedditEng, summarizing each member's role, tenure, favorite subreddits/memes/games, and outside-of-work interests.

Optimizing Go's Garbage Collector for Kubernetes Workloads: A Dynamic Tuning Approach

The post describes a runtime library and algorithm to dynamically tune Go's garbage collector (using GOMEMLIMIT/GOGC) for Kubernetes workloads. By monitoring memory usage and GC CPU percent and adjusting memory targets, the approach trades allocated container memory for reduced GC CPU overhead, lowering GC CPU from ~10-20% to ~1% and improving cluster efficiency.

We're Making Sure You Get The Message

This post details Reddit’s engineering effort to retire its legacy Private Messages system by centralizing all messaging into a new microservice-based Announcements channel and an enhanced Chat platform. The team mapped every PM use case, built a scalable Announcements backend with public APIs, audit logging and unsubscribe controls, and extended the Chat stack with new chat types, modmail integration, and accessibility improvements to ensure zero disruption for moderators and third-party bots.

Bringing Shortcuts back to Reddit

A Reddit web engineer describes reintroducing keyboard shortcuts to the redesigned site: defining a TypeScript data model, building a ShortcutsController and registry, handling contextual vs global shortcuts, choosing PubSub over DOM events, lazy-loading the shortcuts modal, and dealing with infinite-scroll/virtualized traversal — with attention to accessibility and integration into the platform.

Houston, We Have a Process: A Guide to Control Maturity

Reddit describes building its GRC program and control framework—starting with SOX and ISO 27001, centralizing common controls in a GRC tool, mapping to SOC 2 and NIST CSF, reducing control count, and maturing controls from ad-hoc processes to automated checks and evidence collection. The post emphasizes automation, a "GRC engineering" approach, organizational roadmaps, and future plans (including AI risk management).

The Five Unsolved Problems of GraphQL

Reddit's GraphQL team outlines five persistent problems of running GraphQL at scale (minimizing overhead, balancing performance vs distributed ownership, ensuring contributors follow best practices, connecting clients to backends, and governing schema growth). The post describes Reddit's architecture (Golang Gateway, Apollo Router, GraphQL-Py/Go subgraphs), a migration from Python to Go for latency and cost benefits, trade-offs around Federation vs monolith, contributor tooling (SDK, linters, snapshot tests, GraphQL Ambassadors), and strong observability practices (a "Golden Metric" and tailored dashboards).

Analytics Engineering @ Reddit

Overview of Reddit’s Analytics Engineering team and their Curated Data strategy: centrally owned, entity-centric datasets (Aggregates and Segments), data-building patterns like smart accumulation, use of HyperLogLog sketches for distinct counts, and workload optimizations that materially reduce compute/time. The team’s work is intended to enable analytics, metrics, anomaly detection, and ML/AI use cases across Reddit.
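
To show why HyperLogLog sketches suit accumulated distinct counts, here is a minimal, self-contained sketch of the technique. It is an illustration under stated assumptions rather than Reddit's implementation: the `HyperLogLog` class, the SHA-1 hashing, and the precision `p=12` are choices made for this example. The key property is that two sketches merge by taking the per-register maximum, so daily sketches can be combined without rescanning raw data.

```python
import hashlib
import math


class HyperLogLog:
    """Small HyperLogLog with 2**p registers (illustrative, not production-grade)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of the leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # The union of two sketches is the element-wise maximum of their registers.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # small-range correction
        return raw


# Hypothetical user IDs: two overlapping days of activity.
day1, day2 = HyperLogLog(), HyperLogLog()
for user in range(0, 6000):
    day1.add(user)
for user in range(4000, 10000):
    day2.add(user)

estimate = day1.merge(day2).count()  # true distinct count across both days is 10,000
print(round(estimate))
```

With 4,096 registers the typical relative error is around 1.6%, and each sketch has a fixed size regardless of how many items it has seen.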

From Outage to Opportunity: How We Rebuilt DaemonSet Rollouts

Reddit describes a Nov 2024 outage caused by unpaced first-time DaemonSet rollouts that flooded the kube-apiserver with concurrent LIST requests. They built ProgressiveDaemonSet — a mutating webhook + rollout controller using Kubernetes Pod Scheduling Gates — to rate-limit initial scheduling, expose Prometheus metrics, and allow live tuning via annotations. The solution and alternatives are discussed and the code is open-source.

Our Buildkite Brings All the Devs to the Yard: (Re)Building Reddit Mobile CI in 2025

Reddit’s Mobile Client Platform team evaluated multiple CI providers (GitHub Actions, Buildkite, TeamCity, Drone), built POCs (self-hosted Kubernetes runners and hosted options), and chose Buildkite for its dynamic pipelines, build choreography, and hosted/self-hosted flexibility. They migrated mobile CI for Android and iOS, introduced pipeline wrappers and caching, resolved emulator hosting trade-offs, and achieved large improvements in build times, queueing, and developer experience.

Modernizing Reddit's Comment Backend Infrastructure

Reddit migrated its Comments core model from a legacy Python monolith to a domain-specific Go microservice. For reads they used tap-compare testing; for risky write endpoints they created isolated "sister" datastores (Postgres, Memcached, Redis) that the new service wrote to so it could compare results against production writes without duplicating keys. The migration preserved CDC guarantees, uncovered issues like Go/Python serialization differences, ORM-related DB pressure, and race conditions in tap-compare logic. The move completed without user disruption and halved p99 latency on the migrated write endpoints. The team plans stronger local testing and continued migrations of other core models.
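
Tap-compare testing for reads can be sketched in a few lines. This is a generic illustration of the pattern, not Reddit's code: `tap_compare`, the handler signatures, and the mismatch reporter are hypothetical names. The caller always receives the legacy response, while the new implementation runs in the shadow path and any difference is recorded.

```python
def tap_compare(request, legacy_handler, new_handler, report_mismatch):
    """Serve the legacy response while shadow-checking the new implementation."""
    expected = legacy_handler(request)
    try:
        actual = new_handler(request)
        if actual != expected:
            report_mismatch(request, expected, actual)
    except Exception as exc:  # the shadow path must never affect the caller
        report_mismatch(request, expected, exc)
    return expected


# Hypothetical handlers that disagree on how an ID is serialized.
mismatches = []
legacy = lambda req: {"id": str(req["id"])}
candidate = lambda req: {"id": req["id"]}

response = tap_compare({"id": 42}, legacy, candidate,
                       lambda *args: mismatches.append(args))
print(response, len(mismatches))
```

The same comparison is unsafe for writes, because running both paths against one datastore would apply the write twice. That is the problem the isolated sister datastores address.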

Evolution of Reddit's In-house P0 Media Detection

Reddit’s Safety Signals team describes improvements to their on‑prem P0 media detection: onboarding multiple third‑party hashsets, integrating Meta’s Hasher‑Matcher‑Actioner (HMA), building an internal hash database to store Ops review decisions and auto‑block known CSAM at upload, adding instrumentation to identify problematic hashes, and plans to migrate all hashsets to HMA and test AI‑based detection.

A Day In The Life of a S.P.A.C.E SWE Intern at Reddit

The author describes their work as a software engineering intern on Reddit’s S.P.A.C.E. team, building a Python backend for a new talent and performance management application. A key technical challenge was a severe N+1 query issue in Django that caused 43,000 database calls; refactoring the filtering logic reduced this to seven queries. The optimization drastically improved the system’s performance and highlighted the importance of precise data-fetching strategies.
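
The general shape of an N+1 query problem, and of a batched alternative, can be shown with the standard-library `sqlite3` module. This is a sketch of the pattern only: the post's fix was made in Django's filtering logic, and the tables, columns, and `run` helper below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE review (id INTEGER PRIMARY KEY, employee_id INTEGER, rating INTEGER);
""")
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [(i, f"emp{i}") for i in range(100)])
conn.executemany("INSERT INTO review (employee_id, rating) VALUES (?, ?)",
                 [(i, i % 5) for i in range(100)])

counter = {"queries": 0}


def run(sql, params=()):
    """Execute a query and count it, so the two strategies can be compared."""
    counter["queries"] += 1
    return conn.execute(sql, params).fetchall()


# N+1: one query for the parent rows, then one more query per parent row.
counter["queries"] = 0
slow = {}
for emp_id, name in run("SELECT id, name FROM employee"):
    rows = run("SELECT rating FROM review WHERE employee_id = ?", (emp_id,))
    slow[name] = [rating for (rating,) in rows]
n_plus_one_queries = counter["queries"]

# Batched: a single JOIN fetches the same data in one round trip.
counter["queries"] = 0
fast = {}
for name, rating in run(
    "SELECT e.name, r.rating FROM employee e JOIN review r ON r.employee_id = e.id"
):
    fast.setdefault(name, []).append(rating)
batched_queries = counter["queries"]

print(n_plus_one_queries, batched_queries)
```

The query count in the first loop grows with the number of rows, while the batched version stays constant. In Django, comparable batching is usually expressed with `select_related` or `prefetch_related`.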

iOS Automation Accessibility testing at Reddit

Reddit describes their multi-layered automated accessibility testing approach for iOS: enforcing accessibility rules via SwiftLint, capturing accessibility trees with AccessibilitySnapshot, using Xcode 15’s performAccessibilityAudit() in UI tests, and augmenting audits with Deque’s axe DevTools — all integrated into UI tests and CI to catch regressions early.

When a One-Character Kernel Change Took Down the Internet (Well, Our Corner of It)

An AMI kernel regression (a typo in netfilter MARK registration) caused ip6tables-restore to fail, breaking kube-proxy atomic rule updates and triggering cluster-wide networking outages. The team rolled back the AMI, suspended automated rotations, cordoned bad nodes, scaled networking components, and pulled a patched kernel; lessons emphasize safer AMI rollout and testing in high-churn environments.

Query Autocomplete from LLMs

Reddit engineers built a query autocomplete feature by feeding cleaned query datasets into an LLM to normalize and sanitize suggestions, mapping those outputs into a hashmap for low-latency typeahead lookups, integrating via server-driven UI, and scaling the hackathon prototype to production with measurable improvements in latency and user metrics.
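
The hashmap lookup step can be sketched as a precomputed prefix index. This is a minimal illustration that assumes the LLM normalization has already produced a clean set of queries; the `build_prefix_index` function, the scores, and the limits are hypothetical and not taken from the post.

```python
from collections import defaultdict


def build_prefix_index(scored_queries, max_suggestions=3, max_prefix_len=10):
    """Map every prefix to its top-scoring queries so lookups are a single dict access."""
    index = defaultdict(list)
    # Visit queries from highest to lowest score so each bucket fills in rank order.
    for query, _score in sorted(scored_queries.items(), key=lambda kv: -kv[1]):
        for end in range(1, min(len(query), max_prefix_len) + 1):
            bucket = index[query[:end]]
            if len(bucket) < max_suggestions:
                bucket.append(query)
    return dict(index)


# Hypothetical cleaned queries with popularity scores.
scored = {"cats": 50, "cat memes": 90, "car repair": 70, "cooking": 30}
index = build_prefix_index(scored)

print(index.get("ca", []))
print(index.get("cat", []))
print(index.get("dog", []))
```

All ranking work happens when the index is built, so serving a keystroke costs one hash lookup. The trade-off is memory, which the prefix-length and suggestion caps keep bounded.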

"Pest control": eliminating Python, RabbitMQ and some bugs from Notifications pipeline

Reddit replaced a fragile RabbitMQ + Python notifications pipeline with a Kafka-backed custom queue (Kafqueue) and a consolidated Go-based Notification Platform. The change addressed RabbitMQ backpressure and instability, enabled much higher throughput and efficiency, improved developer experience via a plugin-based architecture and templates, and introduced a testing framework to compare legacy and new pipelines during migration.

Risky Business - De-Splunkifying our SIEM

Reddit describes replacing an unstable ELK-based SIEM, first with Splunk (using Cribl for aggregation) and then with a custom, in-house security observability pipeline: Cribl -> Kafka (Strimzi) -> BigQuery, with Airflow and Kubernetes orchestration. The post covers architecture decisions (raw JSON storage in BigQuery + views for extraction), ingestion patterns (HEC, S3, vendor push/pull), operational observability (Prometheus, consumer lag, MTTI/MTTD), and future direction (streaming detections, LLM-assisted detection authoring).