A Summary of Distributed Transaction Implementation
30 April 2016
A fairly dense summary of distributed transactions and consistency is attached below; I will tidy it up further when I have time.
1. Reading, investigation & summary: distributed transactions, trading systems, and internet payment system design
1. Six approaches to keeping data consistent in distributed systems (in practice this is about the design of internet trading systems, plus eventually consistent distributed transactions)
https://mp.weixin.qq.com/s?__biz=MzAwMDU1MTE1OQ==&mid=2653546976&idx=1&sn=c3fb2338389a41e7ab998c0c21bd3e5d
1. A message may be delivered once or several times, but it is delivered at least once.
1. Message sending should not be a one-off action but a propagation of state. Typically this is transparent to the business logic: middleware watches changes in a local database transaction table, sends the messages automatically, and guarantees they are eventually sent.
Idempotency: order/trade execution should be idempotent, i.e. a given transaction id is executed at most once; this can be enforced with a conditional SQL statement or the DB's compare-and-swap.
Rollback is possible, and happens at most once (both of the above can be implemented with a dedicated applied_transaction table).
2. Given (1), the transactions of multiple subsystems can be composed into one large distributed transaction.
A central controller node keeps checking until the transaction has been fully executed or fully rolled back. # Note: this is weak consistency, because some subsystem may not yet have executed its part successfully.
The basic idea resembles keeping a WAL (write-ahead logging) table inside the DB.
3. With MVCC, the sequence version compare -> flush the log to disk (the log entry may be prepared in advance and simply swapped in) -> write the new version into memory -> bump the version number must complete atomically.
On a single machine this can be attempted with compare-and-swap; with Paxos it can essentially be implemented with a lock.
4. The basic idea in e-commerce is to avoid distributed transactions altogether, as they are too heavyweight. A big transaction is split into N small per-subsystem transactions; some run immediately (the parts that must be real-time) and the rest run asynchronously via message queues.
When each small transaction runs is non-deterministic, so the overall guarantee is eventual consistency.
If some small transactions must run in order, say A->B->C->D, then A messages B only after it finishes, B messages C after it finishes, and so on. If any step fails, a global cancellation message is sent and A, B, C, D each roll back.
Execution and rollback of every sub-transaction are idempotent.
To summarize: the big distributed transaction is split into sub-transactions, sub-transaction execution is idempotent, and messages are delivered at least once; the order state is therefore guaranteed to propagate gradually through the whole system, giving eventual consistency.
5. If a central coordinator monitors the execution of every part of the transaction, its own DB table effectively serves as a WAL.
6. NoSQL databases have no transaction support, but an atomic conditional update should be enough to implement sub-transaction-equivalent semantics,
e.g. by adding a version-number field to the data (a sketch of the conditional-update idea follows).
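A minimal sketch of the applied_transaction / conditional-update idea above, using SQLite; the applied_tx and account tables and their columns are made up for illustration:
```
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER);
CREATE TABLE applied_tx (tx_id TEXT PRIMARY KEY);  -- dedup table: one row per applied transaction
INSERT INTO account VALUES ('A', 100);
""")

def apply_once(tx_id: str, account_id: str, delta: int) -> bool:
    """Apply the balance change at most once per tx_id (idempotent)."""
    try:
        with conn:  # one local DB transaction
            # INSERT fails if tx_id was already applied -> the whole block rolls back
            conn.execute("INSERT INTO applied_tx (tx_id) VALUES (?)", (tx_id,))
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (delta, account_id))
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: already applied, do nothing

apply_once("tx-42", "A", -30)
apply_once("tx-42", "A", -30)  # retry / duplicate message: no effect
print(conn.execute("SELECT balance FROM account WHERE id='A'").fetchone())  # (70,)
```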
2. Distributed transactions in Dubbo
1. http://javatar.iteye.com/blog/981787
http://www.iteye.com/magazines/103#250
http://www.dewen.net.cn/q/17502/Dubbo%E6%A1%86%E6%9E%B6%E4%B8%8D%E6%94%AF%E6%8C%81%E4%BA%8B%E5%8A%A1%EF%BC%8C%E9%82%A3%E9%9C%80%E8%A6%81%E6%80%8E%E4%B9%88%E6%A0%B7%E6%89%8D%E8%83%BD%E5%AE%9E%E7%8E%B0%E4%BA%8B%E5%8A%A1%E6%94%AF%E6%8C%81%E3%80%82%E8%A6%81%E6%80%8E%E4%B9%88%E6%89%A9%E5%B1%95%EF%BC%9F
http://dubbo.io/User+Guide-zh.htm#UserGuide-zh-%E5%88%86%E5%B8%83%E5%BC%8F%E4%BA%8B%E5%8A%A1
https://groups.google.com/forum/#!topic/dubbo/OCNBBzEQqzg
http://my.oschina.net/alexgaoyh/blog/519237
http://itindex.net/detail/52377-%E4%BB%A3%E7%90%86-spring-service
https://github.com/dangdangdotcom/dubbox/issues/60
http://www.cfanz.cn/index.php?c=uc/topic&a=read&id=15406
2. Some materials on distributed transaction processing
1. http://tech.dezai.cn/Detail.Aspx?AI=90001
2. http://wenku.baidu.com/view/3e2a8ff79e31433239689358.html
n. summary
1. Dubbo does not support distributed transactions itself, but a Dubbo call can be managed by a Java TransactionManager and become part of a transaction. Not very practical.
3. Write-ahead logging and how its idempotency is achieved
1. http://work.tinou.com/2012/09/write-ahead-log.html
2. http://pages.cs.wisc.edu/~travitch/notes/cs764-notes-final.pdf
3. https://courses.cs.washington.edu/courses/cse444/12sp/lectures/lecture18-19-transactions-aries.pdf
4. https://parasol.tamu.edu/people/welch/teaching/310.f07/s22.ppt
5. https://www.cs.columbia.edu/~du/ds/assets/lectures/lecture15.pdf
n. summary
1. For a DB, replaying a log entry is idempotent: the entry records the final value (redo log) or the original value (undo log), not the operation to perform (otherwise replay could not be idempotent).
Since the log entry must record the final value, a lock must be held from the read onward while the entry is generated, so the recorded final value is actually valid:
read (A, B) -> operate on (A, B) -> commit a log entry holding the final values of (A, B) -> update (A, B) in memory; during this window (A, B) must not be touched by other transactions.
For MVCC, presumably the sequence is:
record version -> read (A, B) -> operate (A, B) -> prepare log of (A, B) final value -> prepare update of (A, B) in memory -> compare version and swap
The compare-and-swap, or whatever atomic operation is used (a double-checked log?), must guarantee that the log entry and the in-memory data are swapped in atomically.
My guess: the transaction writes its entry into the log stream up front, but the entry is only marked valid once the version compare confirms the commit. Because disk I/O is involved, a short-lived write lock is probably still needed. A toy replay sketch follows the next point.
2. ARIES is the famous write-ahead logging algorithm
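A toy sketch of why logging final values makes replay idempotent; the in-memory store and LSN bookkeeping are illustrative assumptions, not ARIES itself:
```
# A toy redo log: each entry stores the item's FINAL value, so replaying the
# log any number of times converges to the same state (idempotent replay).
redo_log = [
    {"lsn": 1, "item": "A", "final_value": 70},
    {"lsn": 2, "item": "B", "final_value": 130},
]

store = {"A": 100, "B": 100}   # state as of the last checkpoint
applied_lsn = 0                # highest LSN already applied

def recover(log, store, applied_lsn):
    """Replay committed entries; safe to call repeatedly after crashes."""
    for entry in log:
        if entry["lsn"] <= applied_lsn:
            continue                                  # already applied: skip
        store[entry["item"]] = entry["final_value"]   # overwriting a final value is idempotent
        applied_lsn = entry["lsn"]
    return applied_lsn

applied_lsn = recover(redo_log, store, applied_lsn)
applied_lsn = recover(redo_log, store, applied_lsn)   # a second replay changes nothing
print(store)   # {'A': 70, 'B': 130}
```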
4. Implementing idempotency in internet payment and transaction systems
1. http://codebetter.com/gregyoung/2010/08/12/idempotency-vs-distibuted-transactions/
2. http://hans-study.googlecode.com/svn/docs/%E5%B9%82%E7%AD%89_%E9%87%8D%E5%A4%8D%E6%8F%90%E4%BA%A4_%E4%B9%90%E8%A7%82%E9%94%81.pdf
3. http://yongpoliu.com/idempotent/
4. http://itindex.net/detail/40761-%E7%94%B5%E5%95%86-%E5%B9%82%E7%AD%89
5. http://read01.com/0GENL.html
6. http://www.cnblogs.com/weidagang2046/archive/2011/06/04/idempotence.html
7. http://www.infoq.com/cn/news/2013/05/idempotent
8. http://csrd.aliapp.com/?p=671
n. summary
1. With a DB, use an applied_transaction table and wrap "if not applied, do it and mark it applied" inside one transaction. (A deduplication table is a similar idea; it can record either transactions or messages.)
2. With a DB, use a conditional update, a.k.a. DB compare-and-update. Like the previous approach, except no transaction is needed, only a single atomic conditional-update SQL statement.
3. Without a DB, use a lock: test the transaction state -> modify the data -> update the transaction state (the data modification itself must additionally be made atomic).
Or use MVCC: record the version -> test the transaction state -> modify the data -> update the transaction state -> test the version -> retry, or atomic(check version + commit data + bump version) (again, the data modification itself must be atomic).
This looks a bit harder; it can be implemented with Paxos, and a short-lived (write) lock is still needed.
4. At the API layer, reshape every operation into "create a transaction -> execute the transaction", where execution is idempotent (e.g. a cash withdrawal identified by a ticket number).
5. Message sending can guarantee at-least-once delivery (possibly more than once) by tracking state in a message_sent table (or directly in the business table) and periodically re-sending unacknowledged messages; a sketch follows below.
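A minimal sketch of point 5 above: a message_sent table written in the same local transaction as the business change, plus a periodic re-send loop. The publish() function and the acked flag are hypothetical stand-ins for a real broker:
```
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, state TEXT);
CREATE TABLE message_sent (msg_id TEXT PRIMARY KEY, payload TEXT, acked INTEGER DEFAULT 0);
""")

def place_order(order_id: str):
    # The business change and the outgoing message are written in ONE local
    # transaction, so the message cannot be lost if the order exists.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'created')", (order_id,))
        conn.execute("INSERT INTO message_sent (msg_id, payload) VALUES (?, ?)",
                     (f"order-created-{order_id}", order_id))

def publish(msg_id: str, payload: str) -> bool:
    """Hypothetical broker call; it may fail and be retried, hence at-least-once."""
    print("publish", msg_id, payload)
    return True

def relay_loop_once():
    # Periodically scan for unacknowledged messages and (re)send them.
    rows = conn.execute("SELECT msg_id, payload FROM message_sent WHERE acked = 0").fetchall()
    for msg_id, payload in rows:
        if publish(msg_id, payload):
            with conn:
                conn.execute("UPDATE message_sent SET acked = 1 WHERE msg_id = ?", (msg_id,))

place_order("o-1")
relay_loop_once()   # consumers must be idempotent, since a crash after publish()
relay_loop_once()   # but before the UPDATE would cause a duplicate send
```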
5. MVCC and transaction
1. https://en.wikipedia.org/wiki/Multiversion_concurrency_control#Databases_with_MVCC
2. https://devcenter.heroku.com/articles/postgresql-concurrency
3. http://stackoverflow.com/questions/27499/database-what-is-multiversion-concurrency-control-mvcc-and-who-supports-it
4. http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/
5. http://coolshell.cn/articles/6790.html
6. http://blog.notdot.net/2009/12/Damn-Cool-Algorithms-Log-structured-storage
7. https://github.com/cmu-db/peloton/wiki/Write-Ahead-Logging
8. http://www.enterprisedb.com/postgres-plus-edb-blog/amit-kapila/well-known-databases-use-different-approaches-mvcc
n. summary
1. The basic idea of MVCC: do the work first, then compare the version; if someone else modified the data in the meantime, retry.
2. Write-ahead logging, ACID, caching, MVCC, and old-version purging are often intertwined to some extent; there are entire books about this:
http://www.amazon.com/Transaction-Processing-Concepts-Techniques-Management/dp/1558601902/?tag=xaprb-20
http://www.amazon.com/Transactional-Information-Systems-Algorithms-Concurrency/dp/1558605088/?tag=xaprb-20
3. DB MVCC does not necessarily mean compare-and-swap; my guess, per the analysis above, is that a short-lived write lock is still needed. Nothing in the material I found rules that out.
4. An idea for a DB MVCC + journal scheme, where the log write can be deferred until after the version check:
record version -> read data -> operate -> prepare the new in-memory data -> atomic(check version -> win: swap in the in-memory data and bump the version / lose: restart from the beginning) -> write the log to disk -> report commit success to the user
A newer write's log entry may end up ahead of an older write's log entry, so log entries must carry the version number to identify older writes (the older write still has to be applied, since it may touch more than one place). A single-process sketch of the basic MVCC retry loop follows.
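A minimal single-process sketch of the MVCC idea above (do first, compare the version, retry); the short lock inside try_commit stands in for the atomic compare-and-swap step discussed above:
```
import threading

class VersionedItem:
    """One data item with a version number; the commit step emulates an atomic
    compare-and-swap with a short-lived lock (as discussed above)."""
    def __init__(self, value):
        self.value, self.version = value, 0
        self._lock = threading.Lock()

    def read(self):
        return self.value, self.version

    def try_commit(self, expected_version, new_value):
        with self._lock:                   # small critical section standing in for CAS
            if self.version != expected_version:
                return False               # someone else committed first -> caller retries
            self.value = new_value
            self.version += 1
            # the journal write (value + new version) would go here, after winning the check
            return True

item = VersionedItem(100)

def add(delta):
    while True:                            # optimistic: do first, compare version, retry
        value, version = item.read()
        if item.try_commit(version, value + delta):
            return

threads = [threading.Thread(target=add, args=(1,)) for _ in range(50)]
[t.start() for t in threads]; [t.join() for t in threads]
print(item.value, item.version)            # 150 50
```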
6. MVCC and distributed transaction
1. http://www.nuodb.com/techblog/mvcc-part-4-distributed-mvcc
2. http://blog.csdn.net/zhang_shuai_2011/article/details/45673975
3. http://www.yankay.com/google-spanner%E5%8E%9F%E7%90%86-%E5%85%A8%E7%90%83%E7%BA%A7%E7%9A%84%E5%88%86%E5%B8%83%E5%BC%8F%E6%95%B0%E6%8D%AE%E5%BA%93/
4. http://stackoverflow.com/questions/18384883/why-is-googles-truetime-api-hard-to-duplicate
5. http://www.leafonsword.org/google-spanner/
n. summary
1. The MVCC idea also applies to distributed transactions; the cluster needs a consensus-backed version record and a consensus-backed lock service:
record version -> prepare the in-memory update on every node -> atomic(test version -> commit the in-memory update on every node and advance the version), or retry on failure -> flush a log entry carrying the version number to disk -> report commit success to the user
2. A plainer approach: elect a leader node as arbiter, and have writing nodes broadcast-before-commit (the leader node effectively plays the role of MVCC's central consensus).
3. Spanner's TrueTime API uses GPS and atomic clocks? How does it implement distributed transactions?
7. CockroachDB - a DB with MVCC-based distributed transactions (and cross-datacenter replication)
1. http://www.infoq.com/cn/news/2014/08/CockroachDB
2. http://thenewstack.io/cockroachdb-unkillable-distributed-sql-database/
3. https://www.cockroachlabs.com/blog/how-cockroachdb-distributes-atomic-transactions/
4. https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#lock-free-distributed-transactions
5. https://smazumder05.gitbooks.io/design-and-architecture-of-cockroachdb/content/architecture/lock-free_distributed_transactions.html
6. https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mapping-table-data-to-key-value-storage/
7. https://news.ycombinator.com/item?id=10160797
8. https://www.cockroachlabs.com/blog/living-without-atomic-clocks/
n. summary
1. CockroachDB speaks SQL; underneath is a distributed sorted key-value store, and it supports distributed transactions (MVCC, full ACID, strong consistency).
The local store is RocksDB (which is based on LevelDB).
2. CockroachDB is written in Go.
3. create switch -> stage -> filter -> flip -> unstage; access to the switch itself is exclusive, and the switch directs reads/writes to either the live data or the staged copy.
It seems the version is a timestamp taken from a Hybrid Logical Clock; I do not see an explicit lock, and the rest I do not fully understand yet.
Anyway, still to explore: https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#lock-free-distributed-transactions
4. Mapping a SQL table onto the key-value store: each cell is stored as tableID/primary/primaryKeyVal/columnID -> columnVal (since the KV store is sorted, fetching an entire row is easy).
A secondary index entry is stored as tableID/indexID/indexColumnVals/primaryKeyVal/columnID -> NULL (primaryKeyVal is part of the key because the indexed column may not be unique).
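A rough sketch of that mapping with plain string keys; real CockroachDB uses an order-preserving binary encoding, and the table/index IDs here are made up:
```
# Simplified sketch of the SQL-to-KV mapping described above.
kv = {}  # a sorted KV store would keep these keys ordered

def put_row(table_id, primary_key, row: dict, index_id=None, index_col=None):
    for col_id, val in row.items():
        kv[f"{table_id}/primary/{primary_key}/{col_id}"] = val
    if index_id and index_col:
        # secondary index entry; value is empty, the primary key is part of the key
        kv[f"{table_id}/{index_id}/{row[index_col]}/{primary_key}"] = None

put_row(table_id=51, primary_key="order-1",
        row={"1": "alice", "2": 30}, index_id="idx_user", index_col="1")

# Fetch the whole row with a prefix scan over the sorted keys:
prefix = "51/primary/order-1/"
print({k: v for k, v in sorted(kv.items()) if k.startswith(prefix)})
```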
5. Given a distributed NoSQL store that supports conditional updates, can MVCC distributed transactions be built on top of it?
A distributed transaction can be viewed as the composition of local transactions on the individual NoSQL nodes. Once the transaction record is written to the coordinator, we effectively have a WAL; after that we only have to wait for the state to propagate to every participating node. Eventual consistency, given enough waiting, becomes strong consistency.
The real danger is causally related transactions running out of order and conflicting. In the internet-company inventory increment/decrement scenario the sub-transactions of different big transactions apparently have no ordering requirement, but a distributed database does have to care; a timestamp for detecting transaction conflicts then becomes essential. Hence TrueTime, or CockroachDB's algorithm, to improve precision and work around clock uncertainty. That seems to be the story.
6. Other topics a blog post could cover
1. background: 2PC 3PC
paxos, zookeeper: locking, distributed transaction
idempotent: how to do it in paxos, in db transaction
transactional messaging & kafka
eventual consistency of transactions, BASE vs. ACID
propagate state rather than actions
asynchronous, scalable
reconciliation systems; central tracking and management of transaction state
payment systems, trading systems
WAL, redo/undo, ARIES, idempotency
MVCC, db, paxos
trading systems, internet-company order/trading/payment systems
DB table as the write-ahead log
the power of conditional update
8. More theory
1. Concurrency Control in Distributed Database Systems [1272 references, 1981]
http://www.cs.berkeley.edu/~brewer/cs262/concurrency-distributed-databases.pdf
1. Timestamp ordering takes a completely different approach to transactions: it serializes them by assigning each read/write, in advance, the timestamp at which it will execute.
Locking and retrying become far less necessary.
See 4.1 Basic T/O Implementation (a minimal sketch follows this list):
1. Every item has item.read_timestamp and item.write_timestamp, which record the largest timestamp of any read/write scheduled on it.
2. For an incoming read: if read.timestamp < item.write_timestamp, reject the read; otherwise accept it and update item.read_timestamp = max(item.read_timestamp, read.timestamp).
3. For an incoming write: if write.timestamp < item.read_timestamp, reject the write; otherwise accept it and update item.write_timestamp = max(item.write_timestamp, write.timestamp).
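A minimal sketch of the basic T/O checks just described (single item, no storage, no Thomas write rule); a rejected operation would abort the transaction, which then restarts with a new timestamp:
```
class Item:
    def __init__(self):
        self.read_timestamp = 0    # max timestamp of any accepted read
        self.write_timestamp = 0   # max timestamp of any accepted write
        self.value = None

def try_read(item: Item, ts: int) -> bool:
    if ts < item.write_timestamp:          # a younger write already happened: too late to read
        return False                       # reject -> abort/restart with a new timestamp
    item.read_timestamp = max(item.read_timestamp, ts)
    return True

def try_write(item: Item, ts: int, value) -> bool:
    # Basic T/O also rejects writes older than an accepted write; the note above
    # only mentions the read check (Thomas' write rule would skip such a write instead).
    if ts < item.read_timestamp or ts < item.write_timestamp:
        return False                       # reject -> abort/restart with a new timestamp
    item.write_timestamp = max(item.write_timestamp, ts)
    item.value = value
    return True

x = Item()
print(try_read(x, ts=5), try_write(x, ts=3, value="late"), try_write(x, ts=7, value="ok"))
# True False True
```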
2. Also see the Wikipedia article on timestamp-based concurrency control:
https://en.wikipedia.org/wiki/Timestamp-based_concurrency_control
3. timestamp ordering: more materials
1. Concurrency Control – Time Ordering
https://discovery.csc.ncsu.edu/Courses/csc742-S02/T14_CControl_TO_6.pdf
2. DBMS - Concurrency Control
http://www.tutorialspoint.com/dbms/dbms_concurrency_control.htm
1. There are two families of DB transaction/concurrency-control algorithms:
1. Lock based
1. Simplistic lock protocol: the DB obtains a lock on every object before a write
2. Pre-claiming Lock Protocol
3. Two-Phase Locking 2PL
4. Strict Two-Phase Locking
2. Timestamp Based
1. it is the most commonly used
2. Timestamp Ordering Protocol
3. http://www.cs.virginia.edu/~son/662.pdffiles/662.to.pdf
4. http://courses.cs.vt.edu/~cs5204/fall00/distributedSys/bto.html
5. http://www-i4.informatik.rwth-aachen.de/content/teaching/lectures/sub/vs/vsSS06/08_Transactions2_1P.pdf
6. https://www.youtube.com/watch?v=zzHjRWkL4_Y
7. https://www.youtube.com/watch?v=ompcyEsxkoc
8. https://www.quora.com/What-is-the-difference-between-timestamp-and-two-phase-locking-protocol-in-DBMS
9. Commitment ordering (CO)
https://en.wikipedia.org/wiki/Commitment_ordering
1. The commitment ordering solution for global serializability
2. The Vote ordering strategy for Global CO Enforcing
2. Snapshot Isolation and Serializable Snapshot Isolation
http://cs.nyu.edu/courses/fall14/CSCI-GA.2434-001/p729-cahill.pdf
1. Good explanation of SI problems. SI reads from a snapshot of committed data, so:
tx1_r(a_v1) -> tx2_r(a_v1) -> tx1_op(a_v2 = a_v1 + 1) -> tx2_op(a_v2 = a_v1 + 1) -> tx1_w(a) -> tx2_w(a)
in the end, incrementing a twice results only in a+1 (a lost update)
3. Calvin: Fast Distributed Transactions for Partitioned Database Systems
http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf
1. Calvin is layered on top of a non-transactional storage system to provide ACID
2. concurrency control is done by the sequencer layer
the locking protocol resembles strict two-phase locking
4. How Spanner manages distributed transactions with the TrueTime API
1. good explanations of how Spanner works
By Murat et al.: http://www.cse.buffalo.edu/~demirbas/publications/augmentedTime.pdf
By Murat: http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/murat-spanner.pdf
Ref to Murat: http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/slides/spanner.pdf
1. Read-write transactions use 2-phase locking and 2-phase commit
5. By Murat & colleagues: Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases - the HLC timestamp
http://www.cse.buffalo.edu/tech-reports/2014-04.pdf
1. combines physical time and logical time (similar to a vector clock)
2. enables easy identification of consistent snapshots
3. tracks causality-related event ordering
4. for more, see the conclusion section; a rough sketch of the HLC update rules follows
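A rough sketch of the HLC update rules from that paper; the millisecond now() source and the two-node usage at the end are illustrative assumptions:
```
import time

class HLC:
    """Sketch of a Hybrid Logical Clock: timestamps are (l, c), where l tracks
    the max physical time seen and c breaks ties when physical time stalls."""
    def __init__(self, now=lambda: int(time.time() * 1000)):
        self.l, self.c, self.now = 0, 0, now

    def send_or_local(self):
        pt = self.now()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1                    # physical clock has not advanced; bump counter
        return (self.l, self.c)

    def receive(self, m_l, m_c):
        pt = self.now()
        new_l = max(self.l, m_l, pt)
        if new_l == self.l == m_l:
            self.c = max(self.c, m_c) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == m_l:
            self.c = m_c + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)

a, b = HLC(), HLC()
t1 = a.send_or_local()     # event on node a
t2 = b.receive(*t1)        # node b receives a's message
assert t2 > t1             # causally later events get larger timestamps
```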
9. Other similar distributed transaction implementations
1. DynamoDB
https://github.com/awslabs/dynamodb-transactions/blob/master/DESIGN.md
2. Percolator
https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf
Spanner
http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/slides/spanner.pdf
http://research.google.com/archive/spanner-osdi2012.pdf
http://www.yankay.com/google-spanner%E5%8E%9F%E7%90%86-%E5%85%A8%E7%90%83%E7%BA%A7%E7%9A%84%E5%88%86%E5%B8%83%E5%BC%8F%E6%95%B0%E6%8D%AE%E5%BA%93/
http://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/Talks/spanner-osdi2012.pptx
3. Transactions in MongoDB, Cassandra, Zookeeper and others
http://rystsov.info/2012/09/01/cas.html
n. summary
1. These are basically in the 2PC family (CockroachDB can also be seen as 2PC) and need a coordinator. The coordinator's state lives in the underlying distributed NoSQL store, so it is not lost on failure.
The NoSQL store must provide atomic writes.
2. A simpler algorithm needs no version number at all, provided the underlying NoSQL store supports conditional updates (CAS).
Every KV item gets a tx field naming the transaction currently operating on it, plus an update field holding the value being written but not yet committed.
The algorithms below apply to both single-machine and distributed transactions; a toy sketch of the CAS primitive they rely on follows.
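A toy in-memory sketch of that conditional-update (CAS) primitive; the KVStore class and its cas() signature are made up for illustration (real stores expose something similar as conditional writes / compare-and-set):
```
import threading

class KVStore:
    """Toy in-memory KV store exposing the atomic conditional update (CAS)
    that the Type 1-13 pseudocode below assumes the NoSQL store provides."""
    def __init__(self):
        self._data, self._lock = {}, threading.Lock()

    def get(self, key, field, default=None):
        return self._data.get(key, {}).get(field, default)

    def cas(self, key, field, expected, new):
        """atomic( if item.field == expected, then item.field = new )"""
        with self._lock:
            item = self._data.setdefault(key, {})
            if item.get(field) != expected:
                return False
            item[field] = new
            return True

store = KVStore()
print(store.cas("item-1", "tx", expected=None, new="tx-7"))   # True: "lock" acquired
print(store.cas("item-1", "tx", expected=None, new="tx-8"))   # False: already held by tx-7
```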
1. Type 1: this algorithm effectively locks each item to be updated, one by one; later transactions must wait for the locks to be released before they can run.
Reads may observe partially committed data (a bank could not use this; then again, perhaps it could, since a bank transfer need not arrive instantly).
```
create a transaction record, referred to below as my_tx
for each item needs update # iterate items in a globally consistent order, to avoid deadlock or repeated preemption
atomic(if item.tx is null, then update item.tx = my_tx.id) else
abort & cleanup & retry # clean up the item.tx we set ourselves, similar to unlocking
item.update = my_item_value
my_tx.state = committing
for each item needs update
item.value = item.update # note: reads may see data of an in-flight transaction
write journal to disk
my_tx.state = committed
return to user with transaction succ
for each item needs update
item.tx = null
```
The transaction is executed by one or more coordinators; the NoSQL store keeps their data durable, so after a crash they simply restart.
2. Type 2: a later transaction can preempt an earlier one. Reads may observe partially committed data.
```
create my_tx in transaction table
for each item needs update # iterate items in a globally consistent order, to avoid deadlock or repeated preemption
loop1:
atomic ( if item.tx is null, then update item.tx = my_tx.id) else
its_tx = item.tx
if ( its_tx == null
|| its_tx.state == aborted
|| (its_tx.state < committing # if the earlier transaction has not started committing yet, it may be preempted
&& preempt_condition(its_tx))) # to keep two transactions from preempting each other forever, we need some priority comparison; after several retries a transaction is bumped to the top priority (which cannot be preempted)
atomic ( if item.tx == its_tx && its_tx.state < committing, then item.tx.state = abort ) else
goto loop1
atomic ( if item.tx == its_tx, then item.tx = my_tx) else # in case item.tx changed in the meantime
goto loop1
atomic ( if item.tx == my_tx, then item.update = my_item_update ) else
goto loop1
else
abort & retry as a new transaction # no need to clean up the item.tx we set, since it can be preempted; but do retry as a brand-new transaction
atomic ( if my_tx.state = in_process, then my_tx.state = committing ) else # once my_tx enters committing it can no longer be preempted
abort & retry as a new transaction
for each item needs update
item.value = item.update # note: reads may see data of an in-flight transaction
write journal to disk
my_tx.state = committed
return to user with transaction succ
for each item needs update
item.tx = null
```
The transaction is executed by one or more coordinators; the NoSQL store keeps their data durable, so after a crash they simply restart.
3. We now have three open problems:
0. The original question was whether transactions can preempt each other; the two algorithms above address that.
1. Reads can see partially committed data of an in-flight transaction, which breaks transaction semantics; a bank, for example, could not use this.
Solving this most likely requires introducing a version (or a timestamp).
2. Can we drop the item.value = item.update / item.value = my_tx.item.update step and let later reads resolve it themselves?
CockroachDB appears to do exactly that (https://github.com/cockroachdb/cockroach/blob/10dfc9deb99bcde726dc3c45b03a9ef86ca56dd7/docs/design.md ("Transaction execution flow".2))
3. Version and timestamp should be interchangeable. The point of replacing the version with a timestamp is presumably decentralization (to be verified).
However, timestamp drift/skew is what leads to Spanner's TrueTime API and CockroachDB's HLC scheme.
CockroachDB chose timestamps over versions.
4. Type 3: transactions cannot be preempted; reads may see partially committed data; no version is used; the writing transaction skips the item.value = item.update step.
```
create my_tx in transaction table
for each item needs update # iterate items in a globally consistent order, to avoid deadlock or repeated preemption
atomic(if item.tx is null, then update item.tx = my_tx.id) else
abort & cleanup & retry # clean up the item.tx we set ourselves, similar to unlocking
for each item needs update
item.value = my_item_value
write journal to disk
my_tx.state = committed # reads may see data of an in-flight transaction
return to user with transaction succ
for each item needs update
item.tx = null
```
For reads, since observing partially committed data is allowed:
```
read item.value
```
As the above shows, if reads are allowed to see partially committed data, the item.update field is not needed at all.
The meaning of a transaction is also broken: only the isolation of the write set is guaranteed. Reads can see anything, which makes the transaction effectively useless, because a transaction may base its own work on such reads.
5. Type 4: similar to the previous algorithm, except transactions may preempt each other. Note that because partially committed data remains readable, this is still not a valid transaction implementation.
```
create my_tx in transaction table
for each item needs update # iterate items in a globally consistent order, to avoid deadlock or repeated preemption
loop1:
atomic ( if item.tx is null, then update item.tx = my_tx.id) else
its_tx = item.tx
if ( its_tx == null
|| its_tx.state == aborted
|| (its_tx.state < committing # a transaction that has entered the committing phase has effectively won and can no longer be preempted
&& preempt_condition(its_tx))) # to keep two transactions from preempting each other forever, we need some priority comparison; after several retries a transaction is bumped to the top priority (which cannot be preempted)
atomic ( if item.tx == its_tx && its_tx.state < committing, then item.tx.state = abort ) else
goto loop1
atomic ( if item.tx == its_tx, then item.tx = my_tx) else # in case item.tx changed in the meantime
goto loop1
atomic ( if item.tx == my_tx, then item.update = my_item_update ) else
goto loop1
else if (its_tx.state == committed) # lazily move the committed transaction's item.update into item.value
atomic ( if item.tx == its_tx, then item.value = item.update ) else
goto loop1
atomic ( if item.tx == its_tx, then item.tx = null) else
goto loop1
goto loop1
else
abort & retry as a new transaction
atomic ( if my_tx.state = in_process, then my_tx.state = committing ) else # once my_tx enters committing it can no longer be preempted
abort & retry as a new transaction
write journal to disk
my_tx.state = committed
return to user with transaction succ
```
For reads, if observing a partially committed transaction is allowed:
```
read item.value
```
For reads, if observing partially committed data is not allowed, the read should be redirected based on the state of item.tx; the redirection is one-way:
```
loop1:
its_tx = item.tx
if (its_tx != null && its_tx.state == committed)
atomic ( if item.tx == its_tx, then read item.update ) else # how would a conditional read be implemented? it could instead be replaced by an atomic item.value = item.update
goto loop1
else
atomic ( if item.tx == its_tx, then read item.value) else
goto loop1
```
6. Most of the methods above do not give full transaction semantics, because reads may observe partially committed transactions. Fuller transaction semantics would be:
1) The write sets of two transactions must not interleave.
2) A transaction can be seen as a series of reads -> in-memory computation -> a series of writes; no other transaction's writes may be interleaved while it runs.
3) Reads must not observe a partially committed transaction: if a transaction's write set is not yet fully applied, reads should see none of it (isolation) (e.g. bank accounts with the invariant A + B = 100 during a transfer).
To support (3), and also (2), I suspect a version number is needed.
7. Type 5: to satisfy (3), i.e. given {item1=a_old, item2=b_old}, once a transaction {item1=a, item2=b} commits, how do we guarantee a reader can never see item1=a together with item2=b_old?
Use a version number.
```
# transaction begin
create my_tx in transaction table
my_tx.version = generate_from_global_monotone_increasing_number()
# read in transaction
for each item to read
my_tx.item.version = item.version
loop1:
its_tx = item.tx
if (its_tx != null && its_tx.state == committed)
atomic ( if item.tx == its_tx, then read item.update) else
goto loop1
else
atomic ( if item.tx == its_tx, then read item.value ) else
goto loop1
# write in transaction
for each item needs update or read # iterate items in a globally consistent order, to avoid repeated preemption # both read and write items are locked via item.tx = my_tx
loop2:
its_tx = item.tx
if ( its_tx == null
|| its_tx.state == aborted
|| (its_tx.state < committing # a transaction that has entered the committing phase has effectively won and can no longer be preempted
&& preempt_condition(its_tx))) # to keep two transactions from preempting each other forever, we need some priority comparison; after several retries a transaction is bumped to the top priority (which cannot be preempted)
atomic ( if item.tx == its_tx && its_tx.state < committing, then item.tx.state = abort ) else
goto loop2
atomic ( if item.tx == its_tx, then item.tx = my_tx) else # in case item.tx changed in the meantime
goto loop2
if item needs update
atomic ( if item.tx == my_tx, then item.update = my_item_update ) else
goto loop2
else if (its_tx.state == committed) # lazily move the committed transaction's item.update into item.value
atomic ( if item.tx == its_tx, then item.value = item.update ) else
goto loop2
atomic ( if item.tx == its_tx, then item.tx = null) else
goto loop2
goto loop2
else
abort & retry as a new transaction
# transaction commit
atomic ( if my_tx.state = in_process, then my_tx.state = committing ) else # once my_tx enters committing it can no longer be preempted
abort & retry as a new transaction
/* Represent my_tx as read(C, A) -> write(A, B), and consider a transaction tx2 that commits between my_tx's reads and its writes:
if tx2 is read(C) -> write(C), the version of an item my_tx read changes, but for serializability we may treat tx2 as happening after my_tx, so my_tx could ignore the version change
if tx2 is read(C, A) -> write(C, A), the only valid serialization puts tx2 before my_tx, so my_tx must restart
if tx2 is read(C, B) -> write(B), the only valid serialization puts tx2 before my_tx, but my_tx can still proceed
if tx2 is read(C) -> write(C, A), the only valid serialization puts tx2 before my_tx, so my_tx must restart
So: both read and write items must be locked (item.tx = my_tx), but only the versions of the read items need to be checked; my_tx only ever needs to restart when an item it read was written by tx2
*/
for each item I read
if item.version != my_tx.item.version
abort & retry as a new transaction
for each item I update
item.version = my_tx.version
write journal to disk
my_tx.state = committed
return to user with transaction succ
```
8. Type 6: in fact, with preemptible transactions, (3) and (2) can be achieved without any version number. The key is that writers preempt readers.
The principle: a reader may not read an item that is currently committing, and once a writer enters the committing state it aborts all readers.
```
# transaction begin
create my_tx in transaction table
# subroutine to read an item
def sub_read(item)
while (true)
its_tx = item.tx
if (its_tx.state == committing) # a reader must not read an item that is committing
wait_for_a_short_time()
continue
if (its_tx != null && its_tx.state == committed)
atomic ( if item.tx == its_tx, then read item.update) else
continue
else
atomic ( if item.tx == its_tx, then read item.value ) else
continue
atomic ( if item.tx == its_tx, then item.reader_txs.add(my_tx)) else
continue
break
# subroutine to write an item
def sub_write(item)
while (true)
its_tx = item.tx
if (its_tx == null || its_tx.state == aborted
|| (its_tx.state < committing && preempt_condition(its_tx) )) # transactions carry random priorities; a transaction preempted many times has its priority raised
atomic ( if item.tx == its_tx && its_tx.state < committing, then item.tx.state = abort ) else
continue
atomic ( if item.tx == its_tx, then item.tx = my_tx) else
continue
atomic ( if item.tx == my_tx, then item.update = my_item_update ) else
continue
else if (its_tx.state == committed)
atomic ( if item.tx == its_tx, then item.value = item.update ) else
continue
atomic ( if item.tx == its_tx, then item.tx = null) else
continue
continue
else
abort & retry as a new transaction
break
# read in transaction
for each item needs read
sub_read(item)
# write in transaction
for each item needs update
sub_write(item)
# transaction commit
atomic ( if my_tx.state = in_process, then my_tx.state = committing ) else # only a committing transaction may preempt another committing one
abort & retry as a new transaction
for each item I write
for each r_tx in item.reader_txs # once my_tx is in the committing state, no new reader can add itself to item.reader_txs
if my_tx.state == abort
abort & retry as a new transaction
atomic ( if r_tx.state < committing, then r_tx.state = abort)
else atomic ( if r_tx.state == committing, then r_tx.state = abort) # this guarantees that a preempted r_tx will block in its read phase
else atomic ( if r_tx.state == committed, then noop)
else noop
atomic ( if my_tx.state = committing, then write journal to disk, my_tx.state = committed)
return to user with transaction succ
```
9. Type 7: use timestamps instead of versions, assuming we can obtain exact timestamps that all nodes agree on (impossible in reality).
This algorithm is essentially identical to Type 5, with the version simply replaced by a timestamp.
```
# transaction begin
create my_tx in transaction table
my_tx.start_time = get_timestamp()
# subroutine to read
def sub_read(item)
atomic( # could be split into smaller atomics to suit a CAS implementation; no real difference from Type 5 here
if item.tx.state == committed
read item.update
else
read item.value
)
# subroutine to write & lock read
def sub_write_and_lock_read(item)
atomic ( if item.tx.state == committed, then item.value = item.update, item.tx = my_tx) else # no real difference from Type 5 here, just rewritten
atomic(
if (item.tx == null || item.tx == aborted
|| (item.tx < committed && preempt_condition(item.tx)) )
item.tx.state = abort; item.tx.abort_time = now()
item.tx = my_tx
else if (item.tx.state = committed)
item.value = item.update
item.tx = null
continue
else
abort & retry as a new transaction
)
if item need update
item.update = my_item_update
# read in transaction
for each item needs read
sub_read(item)
# write & lock read in transaction
for each item needs update or read # lock in a consistent order to avoid deadlock
sub_write_and_lock_read(item)
# transaction commit
atomic ( if my_tx.state = in_process, then my_tx.state = committing ) else # once in committing the transaction cannot be preempted, and nobody else may write its read or write items
abort & retry as a new transaction
for each item I read # per the Type 5 analysis, it suffices to check that the read items' versions are unchanged
if item.timestamp > my_tx.start_time
abort & retry as a new transaction
my_tx.commit_time = now()
for each item I write
item.timestamp = my_tx.commit_time
write journal to disk
my_tx.state = committed
return to user with transaction succ
```
10. Type 8: use timestamps instead of versions, but clocks have skew, so the true time lies somewhere in [timestamp-skew, timestamp+skew].
The algorithm is basically the same as Type 7, except the commit check allows for the skew; the collateral damage is large, though: a transaction starting within skew after a write will very likely be restarted.
How Spanner's and CockroachDB's ways of handling clock skew should be applied here, I don't know yet.
```
# transaction begin
create my_tx in transaction table
# subroutine to read
def sub_read(item)
atomic( if my_tx.start_time == null, then my_tx.start_time = now()) # record the time of the first read as the start time
atomic( # could be split into smaller atomics to suit a CAS implementation; no real difference from Type 5 here
if item.tx.state == committed
read item.update
else
read item.value
)
# subroutine to write & lock read
def sub_write_and_lock_read(item)
atomic ( if item.tx.state == committed, then item.value = item.update, item.tx = my_tx) else # no real difference from Type 5 here, just rewritten
atomic(
if (item.tx == null || item.tx == aborted
|| (item.tx < committed && preempt_condition(item.tx)) )
item.tx.state = abort; item.tx.abort_time = now()
item.tx = my_tx
else if (item.tx.state = committed)
item.value = item.update
item.tx = null
continue
else
abort & retry as a new transaction
)
if item need update
item.update = my_item_update
# read in transaction
for each item needs read
sub_read(item)
# write & lock read in transaction
for each item needs update or read # lock in a consistent order to avoid deadlock
sub_write_and_lock_read(item)
# transaction commit
atomic ( if my_tx.state = in_process, then my_tx.state = committing ) else # once in committing the transaction cannot be preempted, and nobody else may write its read or write items
abort & retry as a new transaction
for each item I read
if item.timestamp + skew > my_tx.start_time # must make sure no other transaction could have happened between my read and my write, allowing for skew; the false-abort window is large, so restarts will be frequent
abort & wait(skew) & retry as a new transaction
my_tx.commit_time = now()
for each item I write
item.timestamp = my_tx.commit_time # problem: the timestamp update and the commit becoming visible are not synchronized
write journal to disk
my_tx.state = committed;
return to user with transaction succ
```
11. Type 9: given an event's timestamp, its true time may be anywhere in [timestamp-skew, timestamp+skew]; Spanner's strategy is to wait out the skew before making the commit visible.
Tx1 finishes all of its modifications and records commit_time -> waits for skew -> makes the commit visible. If tx2 (with read/write items overlapping tx1's) starts after the commit becomes visible, then necessarily tx2.start_time > tx1.commit_time.
If tx2 starts before tx1's commit becomes visible, it is forced to retry and effectively starts after the commit is visible, so again tx2.start_time > tx1.commit_time.
The algorithm below is probably, roughly, how Spanner handles clock skew.
```
# transaction begin
create my_tx in transaction table
# subroutine to read
def sub_read(item)
atomic( if my_tx.start_time == null, then my_tx.start_time = now())
atomic( # if item.tx flips from committing to committed between the previous line and this one, start_time becomes slightly inaccurate; the result is a few more false aborts
if item.tx.state == committed
read item.update
if item.tx.state == committing # guarantees that a transaction falling between 'wait for skew' and 'make the commit visible' is forced to retry
abort & retry as a new transaction
else
read item.value
)
# subroutine to write & lock read
def sub_write_and_lock_read(item)
atomic ( if item.tx.state == committed, then item.value = item.update, item.timestamp = item.update_timestamp, item.tx = my_tx) else
atomic(
if (item.tx == null || item.tx == aborted
|| (item.tx < committed && preempt_condition(item.tx)) )
item.tx.state = abort
item.tx = my_tx
else if (item.tx.state = committed)
item.value = item.update
item.timestamp = item.update_timestamp # this also guarantees that whenever the timestamp is read (it is only needed in the committing phase) the latest timestamp is visible
item.tx = null
continue
else
abort & retry as a new transaction
)
if item need update
item.update = my_item_update
# read in transaction
for each item needs read
sub_read(item)
# write & lock read in transaction
for each item needs update or read # lock in a consistent order to avoid deadlock
sub_write_and_lock_read(item)
# transaction commit
atomic ( if my_tx.state = in_process, then my_tx.state = committing ) else # once in committing the transaction cannot be preempted
abort & retry as a new transaction
for each item I read
if item.timestamp >= my_tx.start_time # tx2.start_time > tx1.commit_time is already guaranteed, so this check suffices
abort & wait(skew) & retry as a new transaction
my_tx.commit_time = now()
for each item I write
item.update_timestamp = my_tx.commit_time # use update_timestamp so the timestamp change does not become visible immediately
write journal to disk # after a crash-restart we must also wait out skew again, to get past the clock-uncertainty window
wait(skew) # the commit becomes visible only after at least skew has passed, implementing the tx2.start_time > tx1.commit_time property above
my_tx.state = committed;
return to user with transaction succ
```
12. Type 10: identical to Type 9, except the wait(skew) before commit becomes a wait(skew) before reading, which achieves the same thing:
if tx2 starts after tx1 commits, then tx2.start_time > tx1.commit_time holds even with clock skew of up to skew.
The algorithm below is probably, roughly, how CockroachDB handles clock skew.
```
# transaction begin
create my_tx in transaction table
wait(skew) # implements the tx2.start_time > tx1.commit_time property above; though we should probably lock the items first, to drain the [now()-skew, now()] window
# subroutine to read
def sub_read(item)
atomic( if my_tx.start_time == null, then my_tx.start_time = now())
atomic(
if item.tx.state == committed
read item.update
else
read item.value
)
# subroutine to write & lock read
def sub_write_and_lock_read(item)
atomic ( if item.tx.state == committed, then item.value = item.update, item.timestamp = item.update_timestamp, item.tx = my_tx) else
atomic(
if (item.tx == null || item.tx == aborted
|| (item.tx < committed && preempt_condition(item.tx)) )
item.tx.state = abort
item.tx = my_tx
else if (item.tx.state = committed)
item.value = item.update
item.tx = null
continue
else
abort & retry as a new transaction
)
if item need update
item.update = my_item_update
# read in transaction
for each item needs read
sub_read(item)
# write & lock read in transaction
for each item needs update or read # lock in a consistent order to avoid deadlock
sub_write_and_lock_read(item)
# transaction commit
atomic ( if my_tx.state = in_process, then my_tx.state = committing ) else # once in committing the transaction cannot be preempted
abort & retry as a new transaction
for each item I read
if item.timestamp >= my_tx.start_time # tx2.start_time > tx1.commit_time is already guaranteed, so this check suffices
abort & wait(skew) & retry as a new transaction
my_tx.commit_time = now()
for each item I write
item.timestamp = my_tx.commit_time
write journal to disk
my_tx.state = committed;
return to user with transaction succ
```
13. Type 11: here we handle transactions in an entirely different way: timestamp ordering.
Each read/write of a transaction is assigned the timestamp at which it should execute; by assigning timestamps sensibly, serializability is guaranteed.
We assume for now that every timestamp obtained is exact, with no skew.
Also, atomic operations/CAS ensure that operations scheduled by timestamp on the same item never overlap one another.
```
item := {
value,
read_time, # max timestamp of all reads on this item (i.e. when it was last read, though that read may be scheduled for the future)
write_time, # max timestamp of all writes on this item
commit_time, # timestamp of the last commit on this item
tx, # the transaction of the last write scheduled onto this item
update,
}
# transaction begin
create my_tx in transaction table
# subroutines to read
def schedule_read(item)
while (true)
read_time = max(now(), item.write_time, item.commit_time) + 1
atomic( if read_time > item.write_time && read_time > item.commit_time, then item.read_time = read_time) else
continue
break
schedule_read_at(item, read_time)
my_tx.first_read_time = min(my_tx.first_read_time, read_time) # record the time of the first read
def sub_read(item)
atomic(
if item.tx.state == committed
read item.update
else
read item.value
)
# subroutines to write
def schedule_write_and_lock_read(item)
while (true)
write_time = max(now(), item.read_time, item.write_time, item.commit_time) + 1
atomic(
if (write_time > item.read_time && write_time > item.write_time && write_time > item.commit_time)
if ((item.tx == null || item.tx.state == abort
|| item.tx == my_tx) # if another transaction's write slipped in between my read and my write, we must restart
&& my_tx.first_read_time > item.commit_time) # no commit may fall between this transaction's read and its write
item.tx = my_tx
if item needs update
item.write_time = write_time
else if (item.tx < committing && preempt_condition(item.tx)) # preemption is not really recommended here; restarting and re-scheduling the read/write times is cheaper
item.tx.state = abort
continue
else if (item.tx == committed)
item.value = item.update
item.tx = null
continue
else
abort & retry as a new transaction
else
continue
)
break
if item needs update
schedule_write_at(item, write_time)
my_tx.last_write_time = max(my_tx.last_write_time, write_time) # record the time of the last write
def sub_write(item)
/* If a write has been scheduled, it may later fill in item.update; when item.update is finally moved into item.value, item.tx.state must be checked again, yet item.tx might already have been replaced by the tx of a later scheduled write.
In fact, that situation can never occur: scheduling a write requires the previous write to have committed or aborted, and if it committed we move item.value = item.update on its behalf. In other words, the next write cannot be scheduled until the previously scheduled one has actually run and committed, or aborted. Note that scheduling the commit does not release the item.tx lock; only completing the commit does.
A more elegant design would probably split item.tx into item.last_scheduled_write_tx and item.last_write_tx; I do not do that here because I want the overall algorithm to stay close to CockroachDB's.
*/
atomic( if my_tx.state == in_process, then item.update = my_item_update)
# read in transaction
for each item needs read
sub_read(item)
# write & lock read in transaction
for each item needs update or read # lock in a consistent order to avoid deadlock
schedule_write_and_lock_read(item)
# transaction commit
def schedule_commit() # only after all writes & read locks are successfully scheduled
for each item I read
if item.commit_time >= my_tx.first_read_time
abort & retry as a new transaction
for each item I read or update
commit_time = max(commit_time, now(), item.write_time, item.commit_time)
for each item I write
atomic( if item.commit_time < commit_time, then item.commit_time = commit_time)
schedule_commit_at(commit_time)
def commit()
atomic ( if my_tx.state = in_process, then my_tx.state = committing ) else # once in committing the transaction cannot be preempted
abort & retry as a new transaction
write journal to disk
my_tx.state = committed
return to user with transaction succ
```
item.commit_time exists purely to prevent tx2's read from landing between tx1's write and tx1's commit.
In fact, if each tx kept item.update in its own private space, the item.tx lock would no longer be needed, and only item.read_time and item.commit_time would have to remain; writes would not affect each other at all before commit, and the preemption mechanism would no longer be needed either.
14. Type 12: continuing from Type 11, to bring it closer to CockroachDB we mix timestamp ordering with direct, unscheduled reads/writes:
first try the read/write directly; if that is not possible, schedule it at a suitable timestamp. Note that the clock is still assumed to be perfectly accurate here.
```
item := {
value,
read_time, # max timestamp of all reads on this item (i.e. when it was last read, though that read may be scheduled for the future)
write_time, # max timestamp of all writes on this item
commit_time, # timestamp of the last commit on this item
tx, # the transaction of the last write scheduled onto this item
update,
}
# transaction begin
create my_tx in transaction table
# subroutines to read
def try_sub_read(item)
read_time = now()
atomic(
if item.tx == null || item.tx.state == abort
read item.value
else if item.tx.state == committed
read item.update
) else
schedule_read(item)
return
my_tx.first_read_time = min(my_tx.first_read_time, read_time)
atomic( if read_time > item.read_time, then item.read_time = read_time)
def schedule_read(item)
while (true)
read_time = max(now(), item.write_time, item.commit_time) + 1
atomic( if read_time > item.write_time && read_time > item.commit_time, then item.read_time = read_time) else
continue
break
schedule_read_at(item, read_time)
my_tx.first_read_time = min(my_tx.first_read_time, read_time) # record the time of the first read
def sub_read(item)
atomic(
if item.tx.state == committed
read item.update
else
read item.value
)
# subroutines to write
def try_schedule_write_and_lock_read(item)
write_time = now()
atomic(
if ((item.tx == null || item.tx.state == abort
|| item.tx == my_tx)
&& my_tx.first_read_time > item.commit_time)
item.tx = my_tx
else if (item.tx < committing && preempt_condition(item.tx))
item.tx.state = abort
item.tx = my_tx
else if (item.tx == committed)
item.value = item.update
item.tx = my_tx
continue
) else
/* As analyzed above, scheduling a write can only succeed once the previous write tx has finished and either committed or aborted,
so we simply merge scheduling with the actual execution of the write.
*/
schedule_at(
try_schedule_write_and_lock_read,
max(now(), item.read_time, item.write_time, item.commit_time) + 1
)
return
write_time = now()
if item needs update
item.update = my_item_value
my_tx.last_write_time = max(my_tx.last_write_time, write_time)
atomic(write_time > item.write_time, then item.write_time = write_time)
# read in transaction
for each item needs read
try_sub_read(item)
# write & lock read in transaction
for each item needs update or read # lock in a consistent order to avoid deadlock
try_schedule_write_and_lock_read(item)
# transaction commit
def schedule_commit() # only after all writes & read locks are successfully scheduled
for each item I read
if item.commit_time >= my_tx.first_read_time
abort & retry as a new transaction
for each item I read or update
commit_time = max(commit_time, now(), item.write_time, item.commit_time) # thanks to the item.tx lock, the commit will almost always be executed immediately
for each item I write
atomic( if item.commit_time < commit_time, then item.commit_time = commit_time)
schedule_commit_at(commit_time)
def commit()
atomic ( if my_tx.state = in_process, then my_tx.state = committing ) else # once in committing the transaction cannot be preempted
abort & retry as a new transaction
write journal to disk
my_tx.state = committed
return to user with transaction succ
```
15. Type 13: continuing from Type 12, to get closer to CockroachDB we now assume clocks are off by at most skew, so the true event time lies within [timestamp-skew, timestamp+skew].
To handle this skew, reads are constrained to happen at least skew after a commit. In Spanner it would instead be an extra wait of skew before the commit becomes visible.
https://www.cockroachlabs.com/blog/living-without-atomic-clocks/
```
item := {
value,
read_time, # max timestamp of all reads on this item (i.e. when it was last read, though that read may be scheduled for the future)
write_time, # max timestamp of all writes on this item
commit_time, # timestamp of the last commit on this item
tx, # the transaction of the last write scheduled onto this item
update,
}
# transaction begin
create my_tx in transaction table
# subroutines to read
def try_sub_read(item)
read_time = now()
atomic(
if read_time > item.commit_time + skew && read_time > item.write_time + skew # a read must always start at least skew after the commit
if item.tx == null || item.tx.state == abort
read item.value
else if item.tx.state == committed
read item.update
) else
schedule_read(item)
return
my_tx.first_read_time = min(my_tx.first_read_time, read_time)
atomic( if read_time > item.read_time, then item.read_time = read_time )
def schedule_read(item)
while (true)
read_time = max(now(), item.write_time + skew, item.commit_time + skew) + 1 # a read must always start at least skew after the commit, to enforce the tx2.start_time > tx1.commit_time constraint
atomic( if read_time > item.write_time && read_time > item.commit_time, then item.read_time = read_time) else
continue
break
schedule_read_at(item, read_time)
my_tx.first_read_time = min(my_tx.first_read_time, read_time) # record the time of the first read
def sub_read(item)
atomic(
if item.tx.state == committed
read item.update
else
read item.value
)
# subroutines to write
def try_schedule_write_and_lock_read(item)
write_time = now()
atomic(
if ((item.tx == null || item.tx.state == abort
|| item.tx == my_tx)
&& my_tx.first_read_time > item.commit_time)
item.tx = my_tx
else if (item.tx < committing && preempt_condition(item.tx))
item.tx.state = abort
item.tx = my_tx
else if (item.tx == committed)
item.value = item.update
item.tx = my_tx
continue
) else
/* As analyzed above, scheduling a write can only succeed once the previous write tx has finished and either committed or aborted,
so we simply merge scheduling with the actual execution of the write.
*/
schedule_at(
try_schedule_write_and_lock_read,
max(now(), item.read_time, item.write_time, item.commit_time) + 1
)
return
write_time = now()
if item needs update
item.update = my_item_value
my_tx.last_write_time = max(my_tx.last_write_time, write_time)
atomic(write_time > item.write_time, then item.write_time = write_time)
# read in transaction
for each item needs read
try_sub_read(item)
# write & lock read in transaction
for each item needs update or read # lock in a consistent order to avoid deadlock
try_schedule_write_and_lock_read(item)
# transaction commit
def schedule_commit() # only after all writes & read locks are successfully scheduled
for each item I read
if item.commit_time >= my_tx.first_read_time
abort & retry as a new transaction
for each item I read or update
commit_time = max(commit_time, now(), item.write_time, item.commit_time, item.read_time) # item.read_time is included here to guarantee that a read always happens after commit+skew, or else gets restarted
for each item I write
atomic( if item.commit_time < commit_time, then item.commit_time = commit_time)
schedule_commit_at(commit_time)
def commit()
atomic ( if my_tx.state = in_process, then my_tx.state = committing ) else # once in committing the transaction cannot be preempted
abort & retry as a new transaction
write journal to disk
# To mimic Spanner instead, we would wait for skew before the commit becomes visible, so that tx2.start_time > tx1.commit_time
# Note, however, that commit_time is already visible before the commit itself
# and a read that sees the committing state must abort, so that no new transaction slips in between 'wait for skew' and 'commit visible'
# wait(skew)
my_tx.state = committed
return to user with transaction succ
```
Compared with CockroachDB, though, what is the HLC clock actually for here? Presumably to obtain consistent snapshot cuts, for SSI, roughly.