MIT course "Security of computer systems". Lecture 10: “Symbolic Execution”, part 3

 3r33469. 3r3-31.

Massachusetts Institute of Technology. Lecture course # ???. "Security of computer systems." Nikolai Zeldovich, James Mykens. 2014


 3r33469. Computer Systems Security is a course on the development and implementation of secure computer systems. Lectures cover threat models, attacks that compromise security, and security methods based on the latest scientific work. Topics include operating system (OS) security, capabilities, information flow control, language security, network protocols, hardware protection and security in web applications.
 3r33469.
 3r33469. Lecture 1: "Introduction: threat models" 3r310. Part 1
/ Part 2 / Part 3
 3r33469. Lecture 2: “Controlling hacker attacks” 3r3-318. Part 1
/ Part 2 / Part 3
 3r33469. Lecture 3: "Buffer overflow: exploits and protection" 3r-326. Part 1
/ Part 2 / Part 3
 3r33469. Lecture 4: "The division of privileges" 3r3334. Part 1
/ Part 2 / Part 3
 3r33469. Lecture 5: “Where Security Errors Come From” Part 1 / Part 2
 3r33469. Lecture 6: “Opportunities” 3r348. Part 1
/ Part 2 / Part 3
 3r33469. Lecture 7: “Sandbox Native Client” Part 1 / Part 2 / Part 3
 3r33469. Lecture 8: "Network Security Model" Part 1 / Part 2 / Part 3
 3r33469. Lecture 9: "Web application security" 3r372. Part 1
/ Part 2 / Part 3
 3r33469. Lecture 10: “Symbolic Execution” 3r380. Part 1
/ Part 2 / Part 3 3r3386.
 3r33469.
 3r33469. Now, following the branch down, we see the expression t = y. Since we are considering one path at a time, we don’t have to introduce a new variable for t. We can simply say that t = y, it means that t is no longer equal to 0.
 3r33469.
 3r33469. We continue to move down and get to the point where we fall into another branch. What is the new assumption that we must make in order to follow further down this path? This is an assumption that t < y.
 3r33469.
 3r33469. What is t? If you look up the right branch, we will see that t = y. And in our table T = y and Y = y. From this it follows logically that our constraint looks like y < y, чего не может быть.
 3r33469.
 3r33469. 3r3104.
 3r33469.
 3r33469. Thus, we had everything in order, until we reached this point t < y. Пока мы не дошли до утверждения false, у нас все неравенства должны быть правильными. Но этого не получается, потому что при выполнении заданий правой ветви у нас появляется логическое несоответствие.
 3r33469.
 3r33469. We have what is often called the condition of the path. This condition must be true for the program to go this way. But we know that this condition cannot be fulfilled, therefore it is impossible for the program to go this way. So this path is now completely eliminated, and we know that this right path cannot be traversed.
 3r33469.
 3r33469. What about the other way? Let's try to go through the left branch in another way. What are the conditions for this path? Again, our character state begins with t = ? and X and Y are equal to variables x and y.
 3r33469.
 3r33469. MIT course "Security of computer systems". Lecture 10: “Symbolic Execution”, part 3  3r33469.
 3r33469. What is the path constraint in this case now? Denote the left branch as True, and the right one as False, and then consider the value t = x. As a result of the logical processing of the conditions t = x, x> y and t < y получаем, что у нас одновременно имеется то, что x > y and x < y.
 3r33469.
 3r33469.  3r33469.
 3r33469. It is quite clear that this condition of the path is unsatisfactory. We cannot have x, which is both larger and smaller than y. There is no variable assignment X, which satisfies both constraints. So this tells us that the other way is also unsatisfactory.
 3r33469.
 3r33469. It turns out that at this moment we explored all the possible ways in the program that could lead us to this state. We can actually establish and certify that there is no possible way that would lead us to asserting false.
 3r33469. Audience: in this example, you showed that you studied the course of the program in all possible branches. But one of the advantages of symbolic execution is that we do not need to study all possible exponential paths. So how to avoid this in this example?
 3r33469.
 3r33469. Professor: This is a very good question. In this case, you reach a compromise between character execution and how specific you want to be. So in this case, we are not so much using symbolic execution as the first time, when we looked at the course of the program on both branches at the same time. But thanks to this, our limitations have become very, very simple.
 3r33469.
 3r33469. The one-by-one individual constraints are very simple, but you have to do it again and again, studying all existing branches, and exponentially all possible paths.
 3r33469. There are exponentially many paths, but for each path as a whole, there is also an exponentially large set of input data that can go along this path. So it already gives you a big advantage, because instead of trying all possible inputs, you are trying to try every possible way. But can you do something better?
 3r33469.
 3r33469. This is one of the areas where many experiments have been done regarding symbolic execution, for example, the simultaneous execution of several paths. In the lecture materials, you met a heuristic and a set of strategies that experimenters used to make the search more decisive.
 3r33469.
 3r33469. For example, one of the things they do is that they explore one path after another, but they do not do it completely blindly. They check the conditions of the path after each step taken. Suppose that here in our program instead of the statement “false” there would be a complex tree of programs, a control flow graph.
 3r33469.
 3r33469. 3r3168.
 3r33469.
 3r33469. You do not need to wait until you reach the very end to check that this path is feasible. At that moment, when you reach the condition t < y, вы уже знаете, что этот путь неудовлетворителен, и вы никогда не пойдёте в этом направлении. Поэтому отсекание неверных ветвей в начале программы сокращает количество эмпирического труда. Разумное исследование пути предотвращает возможность неудачи программы в последующем. Множество практических инструментов, которые сегодня используются, в первую очередь начинают со случайного тестирования, чтобы получить начальный набор путей, после чего они начнут исследовать пути по соседству. Они обрабатывают множество вариантов возможного выполнения программы по каждой из ветвей, задаваясь вопросом, что происходит на этих путях.
 3r33469.
 3r33469. Especially useful if we have a good test suite. You run your test and discover that this piece of code is not being executed. Therefore, you can take the path that was the closest to the implementation of the code, and ask if it is possible to change this path so that it goes in the right direction?
 3r33469.
 3r33469. 3r3181.
 3r33469.
 3r33469. But the moment you try to do all the paths at the same time, constraints begin to become difficult to solve. Therefore, what you can do is perform one function at a time, and you can explore all the ways in a function together. If you try to make big blocks, then generally you will be able to explore all possible paths.
 3r33469.
 3r33469. The most important thing is that for each branch you check your limitations and set whether this branch can really go in both directions. If she cannot follow both ways, you save time and effort by not following the direction she cannot go. In addition, I do not remember the specific strategies that they use to find ways that are more likely to produce very good results. But cutting off the wrong branches at the initial stage is very important.
 3r33469.
 3r33469. Until now, we have mainly talked only about the “toy code”, about integer variables, about branches, about very simple things. But what happens when you have a more complex program? In particular, what happens when you have a program that includes a bunch?
 3r33469.
 3r33469. 3r3198.
 3r33469.
 3r33469. Historically, a bunch of hip has been the curse of all software analysis, because the clean and elegant things of the times of Fortran explode completely when you try to run them using C programs in which you allocate memory left and right. There you have overlaps and all the confusion that is associated with the work of the program with the allocated memory and with the arithmetic of pointers. This is one of the areas where symbolic execution has an outstanding ability to talk about programs.
 3r33469.
 3r33469. So how do we do this? Let's forget about the branches and the flow of control for a minute. We have a simple program here. It allocates some memory, resets it, and receives a new pointer y from the pointer x. Then it writes something to y and checks if the value stored in pointer y is equal to the value stored in pointer x?
 3r33469.
 3r33469. Based on the basic knowledge of the C language, you can see that this check is not performed because x is zeroed and y = 2? so x indicates a different location. So far we are doing well.
 3r33469. The way we model the heap, and the way the heap is modeled on most systems, uses the heap representation in C, where it is simply a giant address base, a giant array in which you can arrange your data.
 3r33469. This means that we can represent our program as a very large global dataset, which will be called MEM. This is an array that, in essence, is going to map addresses to values. The address is just a 64-bit value. And what will happen after you read any of this address? It depends on how you model the memory.
 3r33469.
 3r33469. If you model it at the level of bytes, it turns out to be a byte. If you model it at the word level, then the word is obtained. Depending on the type of errors you are interested in, and depending on whether you are concerned about memory allocation or not, you will model it a little differently, but usually memory is just an array from address to value.
 3r33469.
 3r33469.  3r33469.
 3r33469. So the address is just an integer. In a sense, it doesn't matter what C thinks about the address, it's just a whole 64-bit or 32-bit number, depending on your machine. This is just the value that is indexed in this memory. And what you can put in memory, then you can read from this memory.
 3r33469.
 3r33469. Therefore, things like pointer arithmetic simply become integer arithmetic. In practice, there are some difficulties, because in C, pointer arithmetic is aware of pointer types, and they will increase in proportion to size. So the result we get the following line:
 3r33469.
 3r33469. y = x + 10;  sizeof (int)
 3r33469.
 3r33469.  3r33469.
 3r33469. But what really matters is what happens when you write to memory and read from memory. Based on the pointer that you need to write 25 to y, I take an array of memory and index it with y. And I write 25 to this memory location.
 3r33469.
 3r33469. Then I turn to the statement MEM[y]= = MEM[x], read the value from the location y in the memory, read the value from the location x in the memory and compare them with each other. So I check if they match.
 3r33469.
 3r33469. This is a very simple assumption that allows you to move from a program that uses a heap to a program that uses this gigantic global array representing memory. This means that now, when reasoning about programs that manage the heap, you really do not need to reason about the programs that manage the heap. You'll be fine with the opportunity to talk about arrays, not heaps.
 3r33469.
 3r33469. Here is another simple question. What about the malloc function? You can simply take and use the malloc implementation in C, keep track of all the selected pages, track everything that has been released, just have a free list, and that is enough. It turns out that for many purposes and for many types of errors you do not need malloc to be complex.
 3r33469.
 3r33469. In fact, you can go from malloc, which looks like this: x = malloc (sizeof (int) * 100), to malloc of this form:
 3r33469.
 3r33469. POS = 1
 3r33469. Int malloc (int n) {
 3r33469. rv = POS
 3r33469. POS + = n;
 3r33469.}
 3r33469.
 3r33469. Which simply says: “I am going to save the counter for the next free space in the memory, and whenever someone asks for an address, I give him this location and increase this position, and then return rv”. In this case, what is malloc in the traditional sense is completely ignored.
 3r33469.
 3r33469.  3r33469.
 3r33469. In this case, there is no release of memory. The function simply continues to move through the memory further, and further, and further, and that is all and ends without any release. She is also not very worried about the fact that there are areas of memory in which you should not write, because there are special addresses that have special meaning reserved for the operating system.
 3r33469.
 3r33469. It does not model anything that makes writing malloc difficult, but only at a certain level of abstraction, when you try to reason about some complex code that manipulates the pointer.
 3r33469.
 3r33469. In this case, you do not care about the release of memory, but it does care whether the program is going, for example, to write outside of some buffer, in which case the malloc function can be quite good.
 3r33469.
 3r33469.  3r33469.
 3r33469. And this actually happens very, very often when you do symbolic execution of a real code. A very important step is the modeling of the functions of your library. How you model library functions will have a huge impact, on the one hand, on the performance and scalability of the analysis, but on the other hand, it will also affect accuracy.
 3r33469.
 3r33469. So if you have a malloc model like this, it will act very quickly, but there will be certain types of errors that you cannot notice. For example, in this model, I completely ignore the distribution, so I may get an error if someone gets access to unallocated space. Therefore, in real life, I will never use this mickey mouse malloc model.
 3r33469.
 3r33469. So it is always a balance between analysis accuracy and efficiency. And the more complex the model of standard functions, such as malloc get,the less scale their analysis. But for some classes of errors you will need these simple models. Therefore, various C libraries are of great importance, which are needed in order to understand what a program actually does.
 3r33469.
 3r33469. Therefore, we have reduced the problem of reasoning about the heap, reasoning about a program with arrays, but I actually did not tell you how to reason about a program with arrays. It turns out that most SMT solvers support array theory.
 3r33469.
 3r33469. 3r33333.
 3r33469.
 3r33469. The idea is that if a is a kind of array, then there are some notation that allows you to take this array and create a new array, where location i is updated to the value e. It's clear?
 3r33469.
 3r33469. Therefore, if I have an array a, and I do this update operation, and then I try to read the value of k, then this means that the value of k will be equal to the value of k in the array a, if k is different from i, and it will be equal to e, if k is i.
 3r33469.
 3r33469. Updating an array means taking an old array and updating it with a new array. Well, if you have a formula that includes the theory of arrays, so I started with a zero array, which is represented everywhere simply by zeros.
 3r33469.
 3r33469. 3r33336.
 3r33469.
 3r33469. Then I write down 5 at location i and 7 at location j, after which I read from k and check whether it is 5 or not. Then it can be extended using the definition for something that says, for example: “if k is equal to i and k is equal to y, while k is different from j, then yes it will be ? otherwise it will not be equal 5".
 3r33469.
 3r33469. And in practice, SMT solvers do not simply extend this to a variety of Boolean formulas, they use this strategy of moving back and forth between the SAT – solver and the engine, which is able to reason about array theory for the implementation of this work.
 3r33469.
 3r33469. It is important that relying on this array theory, using the same strategy we used to generate formulas for integers, you can actually generate formulas that include array logic, array updates, array axes, and array iteration. And as long as you fix your way, these formulas are very easy to generate.
 3r33469. If you do not correct your paths, but want to create a formula that corresponds to the passage of the program along all paths, then this is also relatively easy. The only thing you have to deal with is the special kind of cycles.
 3r33469.
 3r33469. Dictionaries and maps are also very easy to model using undefined functions. Actually, the array theory itself is just a special case of an undefined function. With such functions, more complex things can be done. In today's SMT solver, there is built-in support for reasoning about sets and set operations, which can be very useful if you reason about a program that includes the calculation of sets.
 3r33469.
 3r33469. When designing one of these tools, the modeling stage is very important. And it's not just how you model complex software functions down to your theories, for example, things like dropping heaps to arrays. The point is also what theories and solvers you use. There are a large number of theories and solvers with different relations, for which it is necessary to choose a reasonable compromise between efficiency and cost.
 3r33469.
 3r33469. 3r33333.
 3r33469.
 3r33469. Most practical tools adhere to the theory of bit vectors and, if necessary, can use array theory to simulate a heap. In general, practical tools try to avoid more complex theories, such as set theory. This is because they are usually less scalable in some cases, if you are not dealing with a program that really requires these kinds of tools to work.
 3r33469.
 3r33469. Audience: besides the study of symbolic execution, what are the developers focusing their attention on?
 3r33469.
 3r33469. Professor: Another very active area of ​​research is applications and the search for software models that will identify new types of errors. So, for example, Nikolai, Franz and Si Wan at last year’s lectures considered using symbolic execution for programs that could not be optimized with the help of the compiler. This concerned security checks that can be optimized without a compiler.
 3r33469.
 3r33469. Thus, it is very different from the question of whether the program will go this way or not. But there is a modeling step that allows you to move from a high-level conceptual question, is there any code in your program that can be compiled, to an algorithm based on symbolic execution that can easily determine whether a program can follow a particular path and whether such a specific path is possible.
 3r33469.
 3r33469. It turns out that applications are a large area, expanding to new classes of errors, to new features of different programming languages. For example, one of the things that is still quite difficult to model using symbolic execution is a very high level languages, such as jаvascript or Python, where you have many very dynamic language functions. But at the same time, almost any program can be adapted for symbolic execution, and this is definitely very good.
 3r33469.
 3r33469.  3r33469.
 3r33469. A couple of years ago, we had jobs that use symbolic execution for reasoning about errors in Python programming tasks. Thus, in the case of symbolic execution, part of the problem is that your character state does not allow you to simply say: “OK, I will execute this instruction, and then this instruction, and then this instruction”. There is no sequence.
 3r33469.
 3r33469. For example, several years ago we looked at very small pieces of code, but very responsible, containing a matching data structure within the operating system, or an unlocked data structure, or simulating the interaction between threads.
 3r33469. Basically, every time there is a variable that can be overwritten with some other value, you only replace its value with a new character value and create constraints that relate to these character values ​​and character values ​​in other threads.
 3r33469.
 3r33469. This was even used to talk about things like missing memory boundaries, while the complexity grows quite insignificantly. Relatively speaking, what you can no longer do on the Microsoft Word scale, you can do on the scale, for example, of a parallel data structure.
 3r33469.
 3r33469. There were other works in the context of parallelism, for example, is it possible to use symbolic execution to reconstruct misinterpretation based on knowledge of how the program behaved with it.
 3r33469.
 3r33469. This opens up many opportunities, as it allows you to ask specific questions about the path of the program. Symbolic execution allows you to ask what values ​​these things should have for the program to do something, or for something to happen. This is a very powerful feature, and there are many applications that have been tested on its basis. But this is quite a new technology, as is the technology for analyzing program execution.
 3r33469.
 3r33469. 3r33434. 3r33420. 3r33421. 3r33434. 3r33434. 3rr3465. 3rr3465. 3rr3465.
 3r33469. The full version of the course is available here .
 3r33469.
 3r33469. Thank you for staying with us. Do you like our articles? Want to see more interesting materials? Support us by placing an order or recommending a friend, 30% discount for users of Habr for a unique analogue of entry-level servers, which we invented for you: The whole truth about VPS (KVM) E5-2650 v4 (6 Cores) 10GB DDR???GB SSD 1Gbps from $ 20 or how to share the server? (Available options with RAID1 and RAID1? up to 24 cores and up to 40GB DDR4).
 3r33469.
 3r33469. VPS (KVM) E5-2650 v4 (6 Cores) 10GB DDR???GB SSD 1Gbps until December for free 3r3456. When paying for a period of six months, you can order 3r3343445. here 3r3458. .
 3r33469.
 3r33469. [b] Dell R730xd 2 times cheaper?
Only we have 2 x Intel Dodeca-Core Xeon E5-2650v???GB DDR4 6x480GB SSD 1Gbps 100 TV from $ 249 in the Netherlands and the USA! Read about How to build the infrastructure of the building. class c using servers Dell R730xd E5-2650 v4 worth 9000 euros for a penny? 3rr3465. 3r33469. 3r33469. 3r33462. ! function (e) {function t (t, n) {if (! (n in e)) {for (var r, a = e.document, i = a.scripts, o = i.length; o-- ;) if (-1! == i[o].src.indexOf (t)) {r = i[o]; break} if (! r) {r = a.createElement ("script"), r.type = "text /jаvascript", r.async =! ? r.defer =! ? r.src = t, r.charset = "UTF-8"; var d = function () {var e = a.getElementsByTagName ("script")[0]; e.parentNode.insertBefore (r, e)}; "[object Opera]" == e.opera? a.addEventListener? a.addEventListener ("DOMContentLoaded", d,! 1): e.attachEvent ("onload", d ): d ()}}} t ("//mediator.mail.ru/script/2820404/"""_mediator") () (); 3r33434. 3r33469. 3rr3465. 3r33469. 3r33469. 3r33469. 3r33469.
+ 0 -

Add comment