CPU Cache-Friendly Data Structures in Go: 10x Speed with Same Algorithm

Key Takeaways

  • Cache misses can slow down your code by 60x compared to L1 cache hits
  • False sharing occurs when multiple cores update different variables in the same cache line
  • Proper data structure padding can improve performance by 5-10x in specific scenarios
  • Data-oriented design beats object-oriented for high-performance systems
  • Always measure with benchmarks - cache effects are hardware-specific

The Memory Hierarchy

Level        Latency       Time        Typical size
L1 Cache     4 cycles      (~1ns)      32KB
L2 Cache     12 cycles     (~3ns)      256KB
L3 Cache     40 cycles     (~10ns)     8MB
RAM          200+ cycles   (~60ns)     32GB

Cache line size: 64 bytes (on x86_64)

Reading from RAM is roughly 60x slower than hitting L1 cache: one cache miss costs about as much as 60 L1 hits. This is why cache-friendly code can run dramatically faster, often 5-10x in specific scenarios.
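
To make the gap concrete, you can benchmark the same summation with two traversal orders. This is a minimal sketch (the names and the 32MB size are ours, chosen to exceed a typical L3 cache): the arithmetic is identical, only the access pattern changes.

import (
	"math/rand"
	"testing"
)

const numElems = 1 << 22 // 4M int64s = 32MB, larger than a typical L3 cache

// Sequential traversal: the hardware prefetcher streams data into cache
// ahead of the loop.
func BenchmarkSequentialSum(b *testing.B) {
	data := make([]int64, numElems)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var sum int64
		for j := range data {
			sum += data[j]
		}
		_ = sum
	}
}

// Random traversal: the same additions, but nearly every load misses cache.
func BenchmarkRandomSum(b *testing.B) {
	data := make([]int64, numElems)
	order := rand.Perm(numElems) // random visit order defeats the prefetcher
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var sum int64
		for _, j := range order {
			sum += data[j]
		}
		_ = sum
	}
}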

False Sharing: The Silent Killer

False sharing occurs when multiple CPU cores modify different variables that happen to share the same cache line. This forces cache line invalidation across cores, causing significant performance degradation.

The problem is subtle: your variables might be logically independent, but if they're physically adjacent in memory (within 64 bytes), updating one invalidates the cache for all others on that line.

In our metrics collection system, we noticed 10x slower performance during high concurrency. The issue was multiple goroutines updating different counters that were packed in the same cache line.

Detection requires careful benchmarking with concurrent access patterns. The performance drop isn't visible in single-threaded tests, only under parallel load.


// Bad: all four counters are packed into 32 bytes, so they share a single
// 64-byte cache line. Concurrent atomic updates from different cores keep
// invalidating that line for each other.
type CountersUnpadded struct {
	requests uint64
	errors   uint64
	latency  uint64
	timeouts uint64
}

// Good: 56 bytes of padding after each 8-byte counter pushes every field
// onto its own 64-byte cache line, so cores no longer contend.
type CountersPadded struct {
	requests uint64
	_        [56]byte
	errors   uint64
	_        [56]byte
	latency  uint64
	_        [56]byte
	timeouts uint64
	_        [56]byte
}
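
If you'd rather not hand-count padding bytes, golang.org/x/sys/cpu exports a CacheLinePad type sized to the target architecture's cache line. A sketch of the same struct using it:

import "golang.org/x/sys/cpu"

// CacheLinePad is sized per architecture, so the layout stays correct on
// CPUs whose cache lines are not 64 bytes.
type CountersPortable struct {
	requests uint64
	_        cpu.CacheLinePad
	errors   uint64
	_        cpu.CacheLinePad
	latency  uint64
	_        cpu.CacheLinePad
	timeouts uint64
	_        cpu.CacheLinePad
}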

Measure Cache Misses (Linux)

# Profile cache misses
perf stat -e cache-misses,cache-references ./myapp

# Detailed cache analysis
perf record -e cache-misses ./myapp
perf report

# Go benchmark with perf (compile the test binary first so perf
# measures it directly, not the go test driver)
go test -c -o bench.test
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./bench.test -test.bench=. -test.benchtime=10s

Data-Oriented Design: Array of Structs vs Struct of Arrays


// Array of Structs (AoS): updating positions drags every Entity field
// through the cache, even though only X and Velocity are touched.
type Entity struct {
	ID       uint64
	X, Y, Z  float64
	Velocity float64
	Health   int
	Type     string
}

type World struct {
	Entities []Entity
}

func (w *World) UpdatePositions(dt float64) {
	for i := range w.Entities {
		w.Entities[i].X += w.Entities[i].Velocity * dt
	}
}


// Struct of Arrays (SoA): positions and velocities live in their own dense
// slices, so the update streams through exactly the bytes it needs.
type WorldSoA struct {
	IDs        []uint64
	Positions  [][3]float64
	Velocities []float64
	Healths    []int
	Types      []string
}

func (w *WorldSoA) UpdatePositions(dt float64) {
	for i := range w.Positions {
		w.Positions[i][0] += w.Velocities[i] * dt
	}
}
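
A pair of benchmarks makes the two layouts directly comparable. This sketch assumes the World and WorldSoA types above; the entity count and timestep are picked only for illustration.

func BenchmarkUpdateAoS(b *testing.B) {
	w := &World{Entities: make([]Entity, 1_000_000)}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		w.UpdatePositions(0.016) // ~60 FPS timestep
	}
}

func BenchmarkUpdateSoA(b *testing.B) {
	w := &WorldSoA{
		Positions:  make([][3]float64, 1_000_000),
		Velocities: make([]float64, 1_000_000),
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		w.UpdatePositions(0.016)
	}
}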

Prefetching: Help the CPU Predict


// Linear access: the hardware prefetcher recognizes the stride and loads
// data before the loop needs it.
func SumLinear(data []int) int {
	sum := 0
	for i := 0; i < len(data); i++ {
		sum += data[i]
	}
	return sum
}

// Random access: every lookup is a potential cache miss the prefetcher
// cannot predict.
func SumRandom(data []int, indices []int) int {
	sum := 0
	for _, idx := range indices {
		sum += data[idx]
	}
	return sum
}

// Manual "prefetch": touching a node a few iterations ahead hints its value
// into cache. Go has no prefetch intrinsic, so a plain load is the closest
// portable approximation. (Node and expensive are assumed to be defined
// elsewhere.)
func ProcessWithPrefetch(nodes []*Node) {
	for i := range nodes {
		if i+4 < len(nodes) {
			_ = nodes[i+4].Value
		}
		expensive(nodes[i])
	}
}

Hot/Cold Data Splitting


// Before: sorting by Score drags Name, Email, Address, and Bio through the
// cache, even though the comparison never reads them.
type User struct {
	ID      uint64
	Score   int
	Name    string
	Email   string
	Address string
	Bio     string
}

func TopUsers(users []User) []uint64 {
	sort.Slice(users, func(i, j int) bool {
		return users[i].Score > users[j].Score
	})
	ids := make([]uint64, len(users))
	for i, u := range users {
		ids[i] = u.ID
	}
	return ids
}

// After: the hot fields fit in 24 bytes; cold fields live behind a pointer
// and are only pulled in when actually needed.
type UserHot struct {
	ID    uint64
	Score int
	Cold  *UserCold
}

type UserCold struct {
	Name    string
	Email   string
	Address string
	Bio     string
}
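
For comparison, the same query over the split layout; during the sort only the 24-byte hot records move through the cache. A sketch assuming the types above:

func TopUsersHot(users []UserHot) []uint64 {
	// The comparator reads only hot fields, so far more records fit per
	// cache line than with the combined User struct.
	sort.Slice(users, func(i, j int) bool {
		return users[i].Score > users[j].Score
	})
	ids := make([]uint64, len(users))
	for i, u := range users {
		ids[i] = u.ID
	}
	return ids
}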


NUMA-Aware Allocation

// Pin the calling goroutine's OS thread to one CPU so its memory stays on
// that CPU's local NUMA node (Linux only; requires golang.org/x/sys/unix).
func PinToCPU(cpuID int) {
	runtime.LockOSThread()
	var cpuSet unix.CPUSet
	cpuSet.Zero()
	cpuSet.Set(cpuID)
	tid := unix.Gettid()
	unix.SchedSetaffinity(tid, &cpuSet)
}

// NUMAPool routes tasks to workers pinned near the data they handle.
type NUMAPool struct {
	workers [][]chan Task // workers[numaNode][workerID]
}

func (p *NUMAPool) Submit(task Task) {
	// Keep a given key on the same NUMA node; pick any worker on that node.
	// (hash is assumed to be defined elsewhere.)
	node := hash(task.Key) % len(p.workers)
	worker := rand.Intn(len(p.workers[node]))
	p.workers[node][worker] <- task
}

func (p *NUMAPool) StartWorker(numaNode, workerID int) {
	// Assumes 24 cores per NUMA node; adjust for your topology.
	cpuID := numaNode*24 + workerID
	PinToCPU(cpuID)
	for task := range p.workers[numaNode][workerID] {
		processTask(task)
	}
}

Branch Prediction Friendly Code


// Unpredictable branch: with random data, the CPU mispredicts roughly half
// the time.
func CountCondition(data []int) int {
	count := 0
	for _, v := range data {
		if v >= 128 {
			count++
		}
	}
	return count
}

// Sorted input: the branch is false for one long run, then true for another,
// so the predictor is almost always right.
func CountConditionSorted(data []int) int {
	sort.Ints(data)
	count := 0
	for _, v := range data {
		if v >= 128 {
			count++
		}
	}
	return count
}

// Branchless: bit 7 is set exactly when v >= 128 (assuming 0 <= v < 256),
// so there is no branch to mispredict.
func CountConditionBranchless(data []int) int {
	count := 0
	for _, v := range data {
		count += int((uint(v) >> 7) & 1)
	}
	return count
}
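
A benchmark sketch to observe the effect, assuming the functions above and byte-range random data:

import "math/rand"

func BenchmarkBranchPrediction(b *testing.B) {
	data := make([]int, 1<<20)
	for i := range data {
		data[i] = rand.Intn(256) // [0, 256), matching the bit trick's assumption
	}
	b.Run("branchy_random", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			CountCondition(data)
		}
	})
	b.Run("branchless", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			CountConditionBranchless(data)
		}
	})
}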

Cache-Conscious Hash Table


// Go's built-in map is general-purpose; its buckets chase pointers and
// scatter entries across memory.
m := make(map[uint64]uint64)

// An open-addressing Robin Hood table keeps every entry in one flat slice,
// so probes walk adjacent cache lines instead of following pointers.
type RobinHoodMap struct {
	buckets []bucket
	mask    uint64 // len(buckets)-1; table size is a power of two
}

type bucket struct {
	key      uint64
	value    uint64
	distance uint8 // how far this entry sits from its ideal slot
	occupied bool
}

func (m *RobinHoodMap) Get(key uint64) (uint64, bool) {
	idx := key & m.mask
	distance := uint8(0)
	for {
		b := &m.buckets[idx]
		if !b.occupied {
			return 0, false
		}
		if b.key == key {
			return b.value, true
		}
		// Robin Hood invariant: if the resident entry is closer to its home
		// than we are to ours, our key cannot be further along the probe.
		if b.distance < distance {
			return 0, false
		}
		idx = (idx + 1) & m.mask
		distance++
	}
}
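
Get depends on an invariant that insertion must maintain: no entry sits further from its home slot than any entry it probed past. A minimal Put sketch under that invariant (no resizing; assumes the load factor is kept low enough that the table never fills):

func (m *RobinHoodMap) Put(key, value uint64) {
	idx := key & m.mask
	distance := uint8(0)
	for {
		b := &m.buckets[idx]
		if !b.occupied {
			*b = bucket{key: key, value: value, distance: distance, occupied: true}
			return
		}
		if b.key == key {
			b.value = value
			return
		}
		if b.distance < distance {
			// Robin Hood swap: deposit the "poorer" entry here and keep
			// probing with the evicted "richer" one.
			key, b.key = b.key, key
			value, b.value = b.value, value
			distance, b.distance = b.distance, distance
		}
		idx = (idx + 1) & m.mask
		distance++
	}
}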



Real Production Example: Analytics Pipeline


// Before: each Event carries a string header and a map pointer, and the
// purchase filter wades through all of it.
type Event struct {
	Timestamp time.Time
	UserID    uint64
	Action    string
	Value     float64
	Tags      map[string]string
}

func ProcessEvents(events []Event) {
	for _, e := range events {
		if e.Action == "purchase" {
			updateRevenue(e.Value)
		}
	}
}

// After: columnar batches. Actions compress to one byte each, so the filter
// scans a dense byte slice; values are only touched on a match.
// (ActionPurchase is an assumed uint8 action code, e.g. from a const block.)
type EventBatch struct {
	Timestamps []int64
	UserIDs    []uint64
	Actions    []uint8
	Values     []float64
	TagIndices []uint32
	TagKeys    []string
	TagValues  []string
}

func ProcessEventsBatch(batch *EventBatch) {
	for i, action := range batch.Actions {
		if action == ActionPurchase {
			updateRevenue(batch.Values[i])
		}
	}
}




SIMD-Friendly Layouts


// Padding Vec3 to 16 bytes gives a power-of-two stride, which keeps elements
// from straddling cache lines and suits 128-bit SIMD loads.
type Vec3 struct {
	X, Y, Z float32
	_       float32 // pad 12 -> 16 bytes
}

// Simple element-wise loops over densely packed data are the easiest for the
// compiler and CPU to execute efficiently.
func AddVectors(a, b []Vec3, result []Vec3) {
	for i := 0; i < len(a); i++ {
		result[i].X = a[i].X + b[i].X
		result[i].Y = a[i].Y + b[i].Y
		result[i].Z = a[i].Z + b[i].Z
	}
}

// Go has no alignment directive: float64 fields are 8-byte aligned, and large
// heap allocations are in practice well-aligned, but 64-byte alignment is not
// guaranteed by the language.
type AlignedBuffer struct {
	data [1024]float64
}

var buffer = new(AlignedBuffer)

Benchmarking Cache Performance

// Sweep the access stride: once the stride exceeds 8 int64s (64 bytes),
// every access lands on a fresh cache line and throughput drops sharply.
func BenchmarkCacheLineSize(b *testing.B) {
	data := make([]int64, 1024*1024)
	for stride := 1; stride <= 128; stride *= 2 {
		b.Run(fmt.Sprintf("stride_%d", stride), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				for j := 0; j < len(data); j += stride {
					data[j]++
				}
			}
		})
	}
}

Rules for Cache-Friendly Go

  1. Pack hot data together: Frequently accessed fields in same cache line
  2. Pad for concurrent access: Separate goroutine data by cache lines
  3. Prefer arrays over linked lists: Sequential > random access
  4. Use smaller types: int32 vs int64 if range allows
  5. Sort before processing: Predictable branches and access patterns
  6. Pool allocations: reused memory is more likely to still be cache-hot (see the sketch after this list)
  7. Profile, don't guess: Use perf, pprof, and benchmarks
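
For rule 6, sync.Pool from the standard library is the usual tool; a minimal sketch (the buffer type and handler are illustrative):

import (
	"bytes"
	"sync"
)

var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func handle(payload []byte) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	buf.Write(payload) // a recycled buffer's backing array is likely still warm
	// ... process buf.Bytes() ...
	bufPool.Put(buf)
}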

The Performance Optimization Recipe

1. Profile: Find cache misses with perf
2. Restructure: AoS → SoA for hot paths
3. Pad: Eliminate false sharing
4. Pack: Group frequently accessed data
5. Prefetch: Linear access patterns
6. Measure: Verify with benchmarks

Real results from production (specific workloads):
- Analytics pipeline: Up to 14x faster for batch processing
- Game physics engine: 8x faster for collision detection
- Database indexing: 11x faster for sequential scans
- These gains are workload-specific and vary by hardware

Remember: Modern CPUs are fast. Memory is slow. The gap grows every year. Cache-friendly code isn't premature optimization—it's necessary optimization.

Security Considerations:

  • Be careful with memory alignment tricks - they can expose timing side channels
  • Cache-based attacks (Spectre, Meltdown) exploit predictable memory access patterns
  • Consider security vs performance trade-offs in sensitive applications
  • Use constant-time algorithms for cryptographic operations (see the sketch after this list)
  • Validate all array bounds to prevent buffer overflows
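
On the constant-time point, the standard library's crypto/subtle covers the common comparison case; a minimal sketch:

import "crypto/subtle"

// Comparing secrets with == or bytes.Equal can short-circuit on the first
// differing byte, leaking timing information. ConstantTimeCompare examines
// every byte regardless of where the mismatch occurs.
func tokensEqual(a, b []byte) bool {
	return subtle.ConstantTimeCompare(a, b) == 1
}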

Testing Cache Optimizations:

  • Use micro-benchmarks to measure specific optimizations
  • Test on different CPU architectures (Intel, AMD, ARM)
  • Measure with realistic data sizes and access patterns
  • Use performance counters to verify cache hit rates
  • Run tests with different core counts to detect false sharing
func TestCacheFriendlyStructure(t *testing.T) {
	tests := []struct {
		name     string
		size     int
		expected time.Duration
	}{
		{"small_data", 1000, 100 * time.Microsecond},
		{"medium_data", 100000, 10 * time.Millisecond},
		{"large_data", 10000000, 1 * time.Second},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			// generateTestData, processCacheFriendly, and
			// verifyCacheFriendlyResult are project helpers; assert is
			// github.com/stretchr/testify/assert.
			data := generateTestData(tt.size)
			start := time.Now()
			processCacheFriendly(data)
			elapsed := time.Since(start)
			if elapsed > tt.expected {
				t.Errorf("Processing took %v, expected < %v", elapsed, tt.expected)
			}
			result := verifyCacheFriendlyResult(data)
			assert.True(t, result.IsValid())
		})
	}
}

// Each parallel goroutine increments a different counter. With the unpadded
// layout the counters share a cache line, so the cores contend even though
// no variable is actually shared; with padding, the contention disappears.
func BenchmarkFalseSharing(b *testing.B) {
	b.Run("with_false_sharing", func(b *testing.B) {
		c := &CountersUnpadded{}
		fields := []*uint64{&c.requests, &c.errors, &c.latency, &c.timeouts}
		var next uint32
		b.RunParallel(func(pb *testing.PB) {
			f := fields[atomic.AddUint32(&next, 1)%4]
			for pb.Next() {
				atomic.AddUint64(f, 1)
			}
		})
	})
	b.Run("with_padding", func(b *testing.B) {
		c := &CountersPadded{}
		fields := []*uint64{&c.requests, &c.errors, &c.latency, &c.timeouts}
		var next uint32
		b.RunParallel(func(pb *testing.PB) {
			f := fields[atomic.AddUint32(&next, 1)%4]
			for pb.Next() {
				atomic.AddUint64(f, 1)
			}
		})
	})
}
Optimization Technique                 Typical Improvement   Best For              Complexity
False Sharing Elimination              2-10x                 Concurrent counters   Low
Data Structure Packing                 1.5-3x                Hot data paths        Medium
Array of Structs → Struct of Arrays    3-15x                 Batch processing      High
Prefetching                            1.2-2x                Predictable access    Low
Branch Prediction                      2-5x                  Conditional logic     Medium

These optimizations aren't unique to Go, but Go's runtime characteristics affect their impact:

Optimization        Go Impact             C/C++ Impact            Java Impact               Rust Impact
False Sharing       High (GC scanning)    Medium                  High (GC pressure)        Medium
Data Packing        High (GC overhead)    Low (manual memory)     High (object overhead)    Low (zero-cost)
Branch Prediction   Medium (similar CPU)  High (direct control)   Low (JIT optimizes)       High (LLVM)

Go's garbage collector adds overhead to memory access patterns. Cache-friendly code reduces GC pressure by improving locality, making the collector's job easier. This double benefit explains why these optimizations often have more impact in Go than in languages with manual memory management.
