驗證 Concurrent Algorithm

Posted on 2017-09-21 Edited on 2025-03-25

設計 concurrent algorithm 後當然要確認正確性。在《Principles of Concurrent and Distributed Programming》中 Ben-Ari 定義了為驗證 concurrent algorithm 各種特性的抽象結構（concurrency abstraction），可分為以下四個部份：

Programs are the execution of atomic statements.
Concurrent programs are the interleavings of atomic statements from two or more threads.
All possible interleavings of atomic statements must be shown to retain whatever property we are hoping to verify within a concurrent algorithm.
No thread’s statement may be (unfairly) excluded from any arbitrary interleavings.

一個個看。

Programs are the execution of atomic statements.

atomic statement 是無法再切成更小或無法中斷的 statement，語言層級從組合語言到高階語言都有。分析演算法時選擇可適合語言層級，通常用較 high-level 的語言，必要時才會用組合語言來分析。（用組語分析複雜的演算法應該會想瘋掉）

Concurrent programs are the interleavings of atomic statements from two or more threads.

concurrent 程式由多個 thread 所執行的 statement 交錯而成。（不知道除了翻譯還能說什麼…）

All possible interleavings of atomic statements must be shown to retain whatever property we are hoping to verify within a concurrent algorithm.

由於無法知道執行時 OS 如何做 scheduling，不能預設 thread 的 statement 會以何種順序執行，也不能假設每次執行的順序都相同。因此要驗證 concurrent algorithm 的特性，必須能說明 statement 以任何順序交錯執行，都能保有該特性，例如正確性。

在兩個 thread 的情況下有四五個 statement 互相交錯就會有非常多種可能。通常證明演算法擁有某個特性只需要考慮多種交錯中的幾種情況，但需要說明為什麼不用驗證其他交錯情況。

No thread’s statement may be (unfairly) excluded from any arbitrary interleavings.

簡單講就是不會有 thread 被餓死。

在各種交錯順序執行中，每個 thread 的 statement 總會被執行到，不會有 thread 的 statement 在某種執行順序情況下沒被跑到。OS 在分配執行時間給各 process 或 thread 時是公平的，每個 process 或 thread 都會分配到時間執行。

至於程式執行過程中某些時候有些 thread 會無法執行，例如要等其他人做完，這是演算法特性，而非 concurrency abstraction 考量的問題。

驗證 concurrent algorithm 跟找 concurrent 程式 bug（機率性、只在某些特定狀況下發生或者只發生一次就沒了）的基本方式一樣，只是驗證要完整。找 bug 至少要「找到出錯的那種執行順序」（更好是找完 bug 順便驗證一下），驗證正確性則要考慮所有情況下都是對的。所有執行順序交錯的情況太多，但可以分析演算法歸納出幾種情形，驗證這幾種情形都是正確的。

具體作法是用腦袋或紙筆模擬兩個或多個 thread 以不同順序執行某些 statement 時會不會有問題，這些 statement 可以是多個 function 也可以是一個 function 內的某一段，通常問題出現在多 thread 共用 data（各種 container、state 等等）的 critical section，所以要看不同執行順序下，共用 data 會不會錯或者造成執行結果不如預期。

燒腦(?)燒出了個可能有問題的執行順序後，如果不安心(?)要更確認，有時候可以用些~~奇怪的~~小技巧來複製或確認問題，例如用 breakpoint 之類的方式控制 thread 的執行順序、記錄 log（環境上無法複製或者就只出現那麼一兩次的很有用）、故意讓有問題的 code 跑很多次（loop 下去炸就對了，出現機率一般的 bug 通常就會壞了）。（絕對不承認有時候我以炸 code 為樂）

這邊雖然說 thread，但考慮程式交錯執行不限於狹義上的 thread，比較像考慮多個 executing thread，所以多個 process 會同時 access 相同檔案或資料庫也能用交錯執行來分析。（資料庫是否「共用」要看資料庫本身對資料處理細到什麼程度，這類共用問題我只在遙遠的記憶中有個 keyword 叫 transaction，比較不清楚其他方式）

這篇講的是驗證，但我目前為止還沒有很正式的去驗證過一個 concurrent algorithm，最多是找 bug 或者感應(?)一下會不會有問題。呃，感應，應該說需要一點經驗跟練習，會直覺大概知道有沒有問題，不過直覺當然不能當成正式驗證。