I suppose I've been living under a rock for the past 15 years but I had never heard of a "Rich Header" until fairly recently. The Rich Header (RH) is a structure of data found in PE files generated by Microsoft linkers since around 1998. My interest was spurred a few months back when I saw the following Tweet which linked to an excellent article about this undocumented artifact.
The Undocumented Microsoft "Rich" Header https://t.co/JzHZ8oeXlG - DirectoryRanger (@DirectoryRanger) April 16, 2018
I highly recommend reading it but the TL;DR is that at some point Microsoft introduced a function into their linker which embeds a "signature" in the DOS Stub, right after the DOS executable, but before the NT Header. You've probably seen it a thousand times when looking at files and never realized it existed. Back in the early 2000s, when the existence of the header had been known for a while, everyone originally assumed it included unique data to identify systems or people, such as with a GUID, and it spawned numerous conspiracy theories - they even nicknamed it "the Devil's Mark". Eventually someone got around to actually reverse engineering (RE) the linker and figured out how the structure of information was being generated and what it actually reflected. Turns out the tin-foils were half-right. The blog post linked above in the Tweet shows what is actually in them and, while not truly unique to a system or person, it can still serve for some identifying purposes to a certain degree. My interest was piqued!
Effectively, this structured data contains information about the tools used during the compilation of the program. So, for example, one entry in the array may indicate Visual Studio 2008 SP1 build 30729 was used to create 20 C objects. For each entry, a tool, an object type, and a "count" are represented. None of this is really super important at the moment but just know that on some level, there is a lightweight profile of the build environment that is encoded and stuffed into the PE file.
When I began reading more about this topic, I came across an article from Kaspersky stating they used Rich Header information to identify two separate pieces of malware with matching headers. This is exactly what I was hoping for so I read on. Back when the Olympic Destroyer campaign occurred, every vendor under the sun rushed to attribution - from Chinese APTs to Russian APTs and finally North Korean APTs. After the article throws some shade at Cisco Talos, they state the Rich Header found in one sample of the Olympic Destroyer wiper malware matched exactly with a previous sample used by Lazarus group, along with the fact that there was zero overlap across other known good or bad files. Sounded promising! When they dove into the actual contents of the Rich Header, though, an analyst at Kaspersky noted a major discrepancy: a reference to a DLL within the wiper malware that didn't yet exist if the Rich Header tools were to be believed. The (older) Rich Header in the Lazarus sample showed it was compiled with VB6, but this DLL was introduced in later versions of VB. They concluded the Rich Header was planted as part of a false-flag operation.
In this case the Rich Header was used to disprove actor attribution - the reverse of what I expected but fascinating nonetheless. It's an interesting read but the main point I want to make is that an adversary had the foresight to use this lightweight profile as part of multiple false-flag plants to throw off researchers and put them on the path of inaccurate attribution. That, to me, implied that on some level it's a viable attribution mechanism if an adversary is going out of their way to plant false ones. There is some merit to the technique and that is what prompted me to write this blog. What I plan to do then is take you along as I look at Rich Headers and try to determine their place in the analysis phase of hunting. To do this, I'll try to answer the following questions:
Alright, without further ado then...
It makes sense to start out by giving a quick overview of how the Rich Header is created. I won't be doing a deep-technical dive here because there are tons of great resources on it already, which I highly recommend reading.
So what does the Rich Header look like in the file?
Like I said above, you've probably seen it a thousand times and never realized it. Before diving in, here are a handful of key characteristics you should be aware of.
To parse this data out, I created a Python script (yararich.py) that will print out the structure while also generating YARA rules based off said structure. Below is an example of the parsed output.
In this example, which is from the Olympic Destroyer wiper malware looked at in Kaspersky's blog, you'll see the last entry in the array ("id11=9782, uses=3") displays three values. The ID of 9782 has been mapped to "Visual Basic 6.0 SP6", the object type of 11 indicates a C++ OBJ file created by the compiler, and the "uses" value of 3 is how many of those object files were created.
I haven't been able to find any kind of truly comprehensive listing for the Tool ID/value mappings, but across the websites linked in this post you'll find various tables for them that cover most cases I've seen. These have been reverse engineered from the compilers/linkers over the years and also enumerated through extensive testing by diligent individuals by compiling a program, changing things, re-compiling and noting the differences.
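To make the layout concrete, here is a minimal sketch of pulling the entries out of a file. It is not the yararich.py script itself, just an illustration that assumes the commonly documented layout: an XOR-encoded "DanS" marker, three padding DWORDs, a run of 8-byte entries, then the plain "Rich" marker followed by the XOR key. The function name and the field names `prod_id` and `build` are my own labels; for an output line like "id11=9782, uses=3" they would hold 11, 9782, and 3.

```python
import struct

def parse_rich_header(data: bytes):
    """Return (xor_key, [(prod_id, build, uses), ...]) or None if absent."""
    rich = data.find(b"Rich")
    if rich < 0 or rich + 8 > len(data):
        return None
    # The XOR key is the DWORD immediately after the "Rich" marker.
    (key,) = struct.unpack_from("<I", data, rich + 4)
    dans = struct.unpack("<I", b"DanS")[0] ^ key
    # Walk backwards one DWORD at a time until the encoded "DanS" marker.
    pos = rich - 4
    while pos >= 0 and struct.unpack_from("<I", data, pos)[0] != dans:
        pos -= 4
    if pos < 0:
        return None
    entries = []
    # Skip "DanS" plus three padding DWORDs, then decode 8-byte entries.
    for off in range(pos + 16, rich, 8):
        comp_id, uses = struct.unpack_from("<II", data, off)
        comp_id ^= key
        uses ^= key
        entries.append((comp_id >> 16, comp_id & 0xFFFF, uses))
    return key, entries
```

Calling `parse_rich_header(open(path, "rb").read())` on a PE file should give back the key and the ordered array; mapping the numbers to product names still needs one of the lookup tables mentioned above.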
Alright, now that you have an idea of what information is available, I can talk about using YARA to hunt Rich Headers and some current limitations of this method. There are really four key parts to each Rich Header: the array itself and associated values, the XOR key, the integrity of the XOR key, and the full byte string which make up the entire Rich Header structure. These are effectively what you can utilize in YARA, minus verifying the integrity of the XOR key.
Throughout my analysis, I ran across quite a few YARA related gotcha's that I wanted to detail up front.
1) YARA's PE module only parses out the Tool ID and the associated value (type of object). It does not take into account the "uses" count. This value, I feel, can be very important to the overall goal of using the Rich Header data as a profiling signature but there are cons to it as well. For example, there could be a major difference between a PE that used 4 source C files and 20 imports versus 200 source C files and 500 imports. Alternatively, if you used the count value you could miss related samples because the counts changed during the development of the malware, causing things like imports to change. I'll get into a concept of "complexity" for the Rich Headers later, but know that a higher complexity ensures a consistency across samples that means the coupling of only Tool ID/Object type is fine, whereas a "low" complexity Rich Header would be easier to hunt on if counts were included.
2) In YARA there is not a concise way to establish the order of entries within the array when using the PE module as they are defined under the "conditions" section in YARA, thus not providing a mechanism to refer to the "location" (first match > second match). When you hunt with YARA, since the "uses" data is missing, you are limited to using the array entries in the PE module and will run into two problems. First, the order matters. The order in which the linker inserts these entries into the array, which is based on the input, should weigh into the idea of trying to match the same environment between samples and, if they are out of order, then even if the values were the same, it could be entirely different source code and layout.
Since Kaspersky illustrated that the Rich Header is already on shaky ground in terms of how much you can trust it, making sure our rule is as precise as possible to match a specific environment takes a top priority.
Below is the output from a Hancitor malware sample.
The way this is represented in YARA follows.
Searching for this via VirusTotal will result in 527 hits (all of which are likely unrelated to Hancitor or the group behind it), but we can further refine this then by using the XOR encoded Tool ID/value entries, skipping every other 4-bytes (the "uses" value DWORD), and set the order via string match position in YARA.
This can be turned into the below YARA.
This is effectively the same data but will return 5 hits, which is more accurate, and they are all actually related to the Hancitor group; however, one major caveat to this method, and one that really makes it unfeasible in the end, is that the encoded values are of course tied to the XOR key. Womp womp. You'll recall that the XOR key is derived, in part, from the array and the DOS stub, thus if the DOS stub is different or the Rich Header array is different, then the XOR key will follow suit and these won't match at all. At that point, you're better off just searching for the XOR key or raw/encoded data. It would be nice if YARA was patched to allow for ordering of the decoded array entries.
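As a rough illustration of the encoded-string idea, the sketch below builds a YARA hex-string body from an ordered list of entries and an XOR key, with a `[4]` jump standing in for each skipped "uses" DWORD. The function name and the entry values in the usage line are hypothetical, not taken from the Hancitor sample.

```python
import struct

def encoded_hex_pattern(entries, xor_key):
    """entries: ordered list of (prod_id, build) pairs.

    Returns a YARA hex-string body matching the XOR-encoded comp.id
    DWORDs in order, jumping over the encoded "uses" DWORD between them.
    """
    parts = []
    for prod_id, build in entries:
        comp_id = ((prod_id << 16) | build) ^ xor_key
        raw = struct.pack("<I", comp_id)  # entries are little-endian on disk
        parts.append(" ".join(f"{b:02X}" for b in raw))
    # "[4]" is a YARA jump over exactly four bytes.
    return "{ " + " [4] ".join(parts) + " }"

# Hypothetical entries and key, for illustration only.
print(encoded_hex_pattern([(1, 2), (3, 4)], 0xFFFFFFFF))
```

Because the key is baked into every byte of the pattern, the same function called with a different key produces a completely different string, which is exactly the caveat described above.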
3) Next, the example above with Hancitor falls apart because of over-matching. Specifically, the above examples contain 5 entries. The last entry is always the linker and the one above that is usually the compiler entry, so we have 3 other entries related to the build environment. That's not a lot to go on and you'll frequently run into situations where samples with high entry counts can trigger a match simply due to having the same entries included. In general, I'd say the larger the array, the more likely it will be accurate and easier to search on but I'll prod this angle further during analysis.
I haven't extensively tested this next piece but a lot of the over-matching seemed to stem from DLLs. The following is the parsed array from one of the matches pulled from the 500+ samples returned via the first Hancitor YARA example.
You'll note that the entries are also out-of-order from the Hancitor sample but I digress. In addition to patching YARA to allow for specifying ordered array entries, another useful feature to address over-matching would be a way to specify bounds for the rule entries, something like Sample A matches if it only contains these Y entries.
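Until YARA offers something like that, the three matching behaviours are easy to compare outside of it. The sketch below is a hedged illustration with made-up helper names: `loose_match` mimics the "every entry appears somewhere" behaviour that causes over-matching, `ordered_match` adds the ordering constraint, and `exact_match` adds the bounds.

```python
def loose_match(rule, sample):
    """Every rule entry appears somewhere in the sample, in any order."""
    return set(rule) <= set(sample)

def ordered_match(rule, sample):
    """Rule entries appear in the same relative order (gaps allowed)."""
    remaining = iter(sample)
    # "entry in remaining" consumes the iterator up to the first hit,
    # so later rule entries can only match later sample entries.
    return all(entry in remaining for entry in rule)

def exact_match(rule, sample):
    """The sample contains exactly the rule entries, in the same order."""
    return list(rule) == list(sample)

# Hypothetical (prod_id, build) arrays, for illustration only.
rule = [(1, 10), (2, 20), (3, 30)]
big_dll = [(9, 90), (3, 30), (1, 10), (8, 80), (2, 20)]
print(loose_match(rule, big_dll), ordered_match(rule, big_dll), exact_match(rule, big_dll))
```

A larger, unrelated array passes the loose check simply by containing the same entries, while the ordered and exact checks reject it.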
4) The last item I'll touch on is searching by XOR key and the raw data. I clump these together because they, for the most part, represent the same thing even though the XOR key figures in the PE DOS header/stub, which can differ between samples, while the array remains the same.
Below is an example of using the XOR key and the raw data (hashed to avoid conflicting characters which may be in the raw data, such as quotes) in YARA.
Note that the hash must be lowercase in YARA. I haven't looked into the underlying code to validate but it appears this function is doing a string comparison and if you use uppercase in the hash it will fail to match. Either way, both of those will return the same 4 hashes, which isn't unexpected.
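If you are producing the hash yourself, Python's `hashlib` already emits lowercase hex, so the only risk is pasting a value from a tool that prints uppercase. A small sketch, with helper names of my own choosing:

```python
import hashlib

def md5_for_yara(blob: bytes) -> str:
    """MD5 of the Rich Header bytes as lowercase hex."""
    return hashlib.md5(blob).hexdigest()

def normalize_hash(text: str) -> str:
    """Normalize a hash copied from elsewhere before putting it in a rule."""
    return text.strip().lower()
```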
When doing any serious hunting you'll want to attack Rich Headers from multiple angles. A full YARA rule generated from my script, using the above Hancitor sample, is below.
I didn't have an easy way to directly access Rich Headers so I created a corpus of samples to test against. To start, I downloaded 21,996 samples which covered a wide range of malware families, actors, or campaigns - 962 different ones to be exact. These were identified by iterating over identifiers (eg "APT1", "Operation DustySky", "Hancitor") and, after removing duplicate files, I was left with a total of 17,932 unique samples. Out of that, 14,075 included Rich Header values, which was 78.5% of the total and seemed like a good baseline to begin testing against. This also highlights just how prevalent Rich Headers are in the wild.
I've parsed out the Rich Header from every one of these files and will be using that data for my analysis. As there is far too much to go through individually, I approached the problem with various analytical lenses and then explored interesting aspects that reared up during this exercise.
The first thing I was really interested in was whether the Rich Headers overlapped between those identifiers as it may be a strong indicator for profiling. A lot of this analysis will be referencing the XOR keys as I feel these are really the defining feature of the Rich Header when it comes to hunting, since it encompasses everything.
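For anyone wanting to repeat this kind of overlap check on their own corpus, the grouping step might look something like the sketch below. The record layout (hash, identifier, XOR key) and the function name are assumptions for illustration, not how my data was actually stored.

```python
from collections import defaultdict

def keys_shared_across_identifiers(records):
    """records: iterable of (sha256, identifier, xor_key) tuples.

    Returns [(xor_key, unique_sample_count, sorted_identifiers), ...]
    for keys seen under more than one identifier, largest group first.
    """
    hashes = defaultdict(set)
    idents = defaultdict(set)
    for sha256, identifier, key in records:
        hashes[key].add(sha256)
        idents[key].add(identifier)
    shared = [
        (key, len(hashes[key]), sorted(idents[key]))
        for key in hashes
        if len(idents[key]) > 1
    ]
    return sorted(shared, key=lambda row: row[1], reverse=True)
```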
Below are the top 10 XOR keys by unique sample count that shared more than one identifier.
Looking at the first entry, 1,009 unique hashes across three identifiers. TurnedUp was a piece of malware used in the DropShot campaign by APT33 so starting off, we have a known relationship between the samples. Furthermore, they have a fairly sizable Rich Header with 10 entries so it's less likely to be "generic". In total, I have 1,014 APT33 samples and 999 of them shared this XOR key. Looking at one of the samples which doesn't share that key highlights a previous point I made regarding the "counts". Check out the below parsed Rich Header output for 0xB6CCD171 and the output of 0xD78F6BA1 right next to it; I've highlighted the differences which exist only in the "uses" field.
So what does this mean? The entry "id171=40219" corresponds to VS2010 SP1 build 40219 and indicates 62 C++ files were linked on the left, 61 on the right. Similarly for the next highlighted entry, this is a count of the imported symbols and you can see on the right it's grown by quite a bit. Finally, the entry "id147=30729", which corresponds to VS2008 SP1 build 30729, indicates 19 DLLs were linked on the left and 21 on the right. This helps establish a pattern of consistency between the samples in that the structure of the array remained the same but, with increasing counts, was likely modified over time.
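This "same structure, different counts" comparison is simple to automate. The sketch below is one hypothetical way to do it; the two entries in the usage lines reuse only the counts quoted above, not the full arrays from either sample.

```python
def diff_uses(left, right):
    """left/right: ordered lists of (prod_id, build, uses).

    Returns None when the arrays differ in structure (entries or order),
    otherwise a list of (prod_id, build, left_uses, right_uses) for every
    entry whose count changed.
    """
    if [(p, b) for p, b, _ in left] != [(p, b) for p, b, _ in right]:
        return None
    return [
        (p, b, lu, ru)
        for (p, b, lu), (_, _, ru) in zip(left, right)
        if lu != ru
    ]

# Partial arrays built from the counts discussed above.
left = [(171, 40219, 62), (147, 30729, 19)]
right = [(171, 40219, 61), (147, 30729, 21)]
print(diff_uses(left, right))
```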
Outside of the Rich Header, we can take a look at some other artifacts and see if the samples remain consistent in any other ways.
Based on the above information, it would appear the original author (as indicated by the PDB path retained in the debug information when it was compiled) later updated this particular malware source. The compile times show 2014 and 2015 respectively and the PDB path on the newer one reflects the 2015 information. Outside of that, the user path is the same and they share a number of strings for dropped file names, User-Agent, so on and so forth. Either way, they seem to confirm the samples are related which means that the Rich Header in this case is likely accurate and, given that it's not particularly generic, should be suitable for hunting.
Alright, so let's take a look at a key with groups that seems to be unrelated. For XOR Key 0x69EB1175 there are 26 different identifiers and they appear to be all over the place - cryptominers, droppers, trojans, ransomware, etc. I've taken the parsed Rich Header and annotated it with what I could find regarding the Tool Id values.
If we assume VS98 and "Utc12_2_C" are the compiler/linker, then there is really just one additional Tool (but two object types) in use for this Rich Header. That doesn't build a lot of confidence in being unique so for now we'll just define this as low complexity/generic. Given that, I pulled a couple of samples into an analysis box to see what similarities, if any, exist between them. Almost immediately it's apparent what the commonality here is - everybody's favorite friend...Nullsoft's NSIS Installer! This makes sense why it spans so many disparate malware families and campaigns. NSIS Installer will compile a brand new executable so the entire "development" environment is tied to this version of NSIS. This also means the Rich Header is effectively useless for defining anything but NSIS, which does have value in its own right but not the purpose of this exploration. Different versions of NSIS most likely have different Rich Headers that are embedded when it compiles the scripts into the executable, but I haven't tested this.
Picking another key to look at, 0xD180F4F9 has samples matching identifiers for 7ev3n, Paskod, PoisonIvy, and TrickBot malware families; however, it has even fewer entries than the previous example.
While not high in entry volume, none of these Tool IDs and types are documented in any of the sources that I frequent to look them up. That alone is interesting to me - possibly newer or more obscure tools. Either way, assuming one is the compiler and one is the linker, then we have but one additional "Tool" so it's extremely generic.
Taking a look at a PoisonIvy and TrickBot sample, they both match signatures for Microsoft Visual Basic v5.0-6.0 and each have one imported library "msvbvm60.dll". A general review of the samples doesn't show anything that looks related but these strings stand out as interesting.
The "VB6.OLB" file existed in two different paths at compilation time. If these strings are to be believed, this would indicate a separate build environment. There does appear to be some similarity in the Visual Basic project files being somewhat randomized. Not a strong link but it's something. The language packs used in the binaries, along with the username listed in the one VBP, would make me lean towards it still being unrelated.
I believe this is likely another instance where the Rich Header is primarily useful for product identification and not much else unfortunately.
Below is a table of the aforementioned XOR keys and a short blurb about each. I've highlighted the ones in RED which seemed to support plausible attribution.
There are a couple of interesting observations I made while going through these. First, it seems clear that the Rich Header can be useful for identifying software that generates new executables, eg NSIS and AutoIT. Overall I'd say anything with a large disparate set of identifiers is going to be created through some automated tool or have a low complexity array which is likely to just be picked up by unrelated samples. All of the "good" hits included more complex Rich Headers, when not built by automated tools, and that seems like a decent value-gauge. I felt around 8-10 was a sweet spot for "complexity".
Second, traditional packers like UPX and Themida do not modify the actual DOS Header/Stub so the Rich Header, in the case of 0xFA28276F, remains tied to the packed executable. As such, it was able to identify the underlying malware through the packer even though multiple layers of obfuscation were employed. This is a solid match in my opinion and also a pretty cool potential window into embedded malware.
Finally, there is some value when you look at dynamic analysis. As the Rich Header is a static feature, it allows another opportunity to identify samples that may otherwise fail to detonate in a sandbox. I was able to identify 15 unknown Brambul malware samples associated to Lazarus group by Rich Header alone as the samples had failed to actually detonate/do anything of note within a sandboxed environment. That's pretty handy.
The next phase of analysis I'll look at is from the angle of just raw XOR key volume. The below are the Top 20 XOR key counts and the identifiers for each key. For the keys previously analyzed I've just marked with a double-dash and will skip talking about here.
Looking at the list, the obvious repetition of GandCrab and Qakbot really stands out so I'll dive into each of these.
Out of the 1,075 GandCrab listed in the Top 20, every single one of them triggered a signature mismatch in my script. This means that the extracted XOR key was found to be different than what the script generated when it rebuilt the Rich Header XOR key. This is a clear sign of tampering and manipulation.
Looking at the set with the extracted XOR key of 0x00A3CC80 shows that every single hash has a different generated key with no overlap across the set. Upon further inspection, this is not caused by the actual Rich Header information, which is consistent between each sample, but due to the DOS Stub, which is factored into the algorithm that generates the Rich Header XOR key. Below shows the DOS Stub from two samples.
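For reference, here is a sketch of how the key can be regenerated, based on my understanding of the algorithm as described in public reverse-engineering write-ups; treat it as an approximation rather than the linker's exact code. The function names are mine. `rich_start` is the file offset where the Rich Header begins (typically 0x80) and `entries` are the decoded DWORD pairs.

```python
def rol32(value, count):
    """Rotate a 32-bit value left by count bits (count taken modulo 32)."""
    count &= 31
    return ((value << count) | (value >> (32 - count))) & 0xFFFFFFFF

def rich_xor_key(data, rich_start, entries):
    """entries: list of decoded (comp_id, uses) DWORD pairs."""
    csum = rich_start
    # Every byte of the DOS header and stub contributes, rotated by its
    # offset, which is why a patched stub changes the generated key.
    for i in range(rich_start):
        if 0x3C <= i < 0x40:  # e_lfanew is treated as zero
            continue
        csum += rol32(data[i], i)
    # Each entry contributes its comp.id rotated by its use count.
    for comp_id, uses in entries:
        csum += rol32(comp_id, uses)
    return csum & 0xFFFFFFFF
```

Comparing the result with the key stored after the "Rich" marker is the mismatch check: if a single stub byte has been overwritten after linking, the two values will no longer agree.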
The DOS Stub is an actual DOS executable that the linker embeds in your program which simply prints the familiar "This program cannot be run in DOS mode" message before exiting. It appears that the bytes which account for the beginning of the message "This progra" are being overwritten after the Rich Header is inserted.
Four ASCII numerals followed by a null byte followed by six bytes. For this particular XOR key, the initial ASCII of the four bytes are listed below.
The six bytes after the null didn't stand out as having any observable pattern. I read through a handful of GandCrab blogs from the usual suspects and didn't see any mention of this particular artifact. Additionally, I checked the other XOR keys in this list to see if they had similar behaviors, which they all did. Some of the keys didn't modify the initial four bytes but all included the null followed by the six bytes.
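Pulling these bytes out across a sample set can be scripted. The sketch below assumes the rest of the stub message is left intact, locates it, and returns the 11 bytes sitting where "This progra" would normally be; the helper names are hypothetical.

```python
TAIL = b"m cannot be run in DOS mode"

def stub_message_prefix(data: bytes):
    """Return the 11 bytes located where "This progra" normally lives."""
    end = data.find(TAIL, 0x40)  # search past the 64-byte DOS header
    if end < 11:
        return None
    return data[end - 11:end]

def split_overwrite(prefix: bytes):
    """Split into (first four bytes, separator byte, trailing six bytes)."""
    return prefix[:4], prefix[4:5], prefix[5:]

def is_tampered(data: bytes) -> bool:
    prefix = stub_message_prefix(data)
    return prefix is not None and prefix != b"This progra"
```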
As this artifact seems to be unknown, I was curious if it's possibly campaign related or maybe some sort of decryption key. I ended up loading one of these samples into OllyDbg with a hardware breakpoint on access to any of the bytes in question. At some point the sample ends up overwriting the entire DOS Header and DOS Stub with a different value which triggered the breakpoint. The sample proceeds to unpack an embedded payload which contained its own Rich Header entry.
Searching my sample set for the new Rich Header (0xEC5AB2D9) led me to five other GandCrab samples. A quick pivot on the temporal aspect of these 5 samples, along with the XOR key I've been looking at, shows the 233 0x00A3CC80 samples starting to appear in the wild on 19APR2018 with a huge ramp up between 23APR2018-25APR2018, while the 5 samples which shared the newly extracted XOR key first appeared on 22APR2018, one day before the campaign kicked into high gear. I did a quick check on the rest of the GandCrab in this list and they are all from that time frame, but samples as recent as yesterday (months after April), show the same pattern, with the most recent sample displaying "1018" for the first four bytes. Interesting stuff to be sure and anytime I see an observable pattern I tend to think it's significant, but that would be a tangent for a separate blog. Either way, this was a good example of signature mismatch and why it's important to check Rich Headers on embedded/dropped payloads. These particular modifications add credibility to the idea that these samples were produced by the same environment and so, in this case, the Rich Headers led to artifacts that can be coupled together for profiling.
Shifting over to the two Qakbot entries, both XOR keys were from samples delivered during the same couple of days. One characteristic both XOR keys showed was that the samples, within their respective key group, were all the same file size except for a small crop of outliers in the XOR key 0xD5FE80D5. Specifically, 61 of the samples with that specific key are 548,864 bytes but there are 10 samples at 532,480 bytes and one sample at 520,192 bytes. It's always interesting when you see outliers like this when looking at a large sample set. I continued by analyzing a sample from each file size and they all share the same dynamic behaviors, along with a few static ones, but otherwise they appear quite different. For example, they each import 3 libraries and 7 symbols but none of them are the same - this also just looks wrong and out of place based on my experience.
Upon closer inspection, Qakbot is using some sort of packer so, after the success of embedded payloads from the GandCrab work, I manually unpacked one sample from each of the three file sizes to see if anything stood out. We know the files are the same in terms of family already due to my collection method, but the unpacked payloads all matched 225 functions and over a thousand basic blocks, so that adds a lot of weight to the idea of the malware being related. When I looked at the unpacked samples, they each had their own respective Rich Headers so I've dug in a bit more here. Keep in mind that the question I'm interested in is whether we can use these to determine if they were created by the same actor or in the same environment.
Below are the parsed Rich Headers for the 3 samples and I've sorted them from left to right based on their listed compile times, which incidentally aligns perfectly from smallest to largest of the original files. That is to say the 520,192 byte file's embedded payload was compiled first and the 548,864 byte file's embedded payload was compiled last.
In addition, I've listed out what each entry correlates to; however, the 0xDCE3EAC8 embedded payload includes one additional CompID (171).
Alright, first off if you look at the tools listed in the Rich Header above, you can see the payloads were built with components from VS2K8/10 SP1 with a specific build number that remains consistent across the 120 day period. Furthermore, you can see the "count" values increasing for some of the entries too. Now, this isn't rocket science, but as a developer continues working on their program, the code base will usually grow as new functionality is introduced. This could directly translate into more imports/objects and thus increases in these counts when compiled. The specific increases can be viewed here.
Another thing that stands out is that the earliest file included an entry for C++ object files created by the compiler that were later not present, along with the order of CompID 131 and 147 being flipped. All of this evidence seems to indicate a reordering of code that again may be related to development. Finally, the most interesting artifact from the earliest embedded payload is that it was compiled with the `/DEBUG` flag on and contains a path to the PDB.
Increasing counts for objects, a temporal and file size alignment, and a solid consistency of entries across the set show, to me, a clear and consistent pattern across the embedded payloads that establishes they were all created in the same environment by the same actor.
When talking about OPSEC, the more the bad guys have to worry about the more likely it is they'll screw up. Using the Rich Header from embedded payloads seems to be a very successful technique for painting a fairly convincing picture that makes contextual sense for attribution.
Wrapping things up then, I think the most important takeaway is essentially what Kaspersky showed - Rich Headers are NOT to be trusted. They are simply static bytes in a file and can be manipulated like anything else, and adversaries are already actively doing this. Rich Headers should be treated as a supporting artifact that can provide a lot of contextual value to your analysis and can help in attribution. In addition to that, the most "valuable" Rich Headers seem to be the ones with a high complexity and large entry count. These offer a more granular picture of the environment within which the program was constructed.
What I hope I've shown through this analysis though is that Rich Headers shouldn't be written off entirely. They are more nuanced than I originally thought they would be and they can provide context by establishing patterns in environmental details. This context is enhanced when you start peeling away layers of packed/embedded payloads. Finally, the metadata contained within Rich Headers can provide extremely useful context to other expected artifacts that should be found in a binary and vice versa.
Hopefully this proves to be helpful to other analysts out there getting started with Rich Headers and provides some insight into the pros and cons of what Rich Headers can offer.
*EDIT - 12AUG2018*
Received a few DM's and e-mails containing additional links and asking for more info so I decided to just add a short resources section for others interested in learning more.
Mappings for ToolId/Obj Types
Blogs about Rich Headers from a RE perspective
Blogs about Rich Headers from a Threat perspective
White paper/Con preso on the subject
In this blog post I want to share a Python script that I wrote a few years back which generates YARA rules by using a byte->opcode abstraction method to match similar parts of code across files. At a high level, it's pretty straight-forward. It takes a cluster of files (ideally PE files) and attempts to find the longest common sequence (LCS) of x86/x64 opcodes with the idea that once you abstract back to just opcodes, you can find similar functions in other files and possibly unknown malware. There are a few other matching techniques employed that I'll dive into later but that is the gist. Hopefully the script can complement hunters and analysts in searching for similar malware samples through automation. There are a lot of other similar tools already out there doing this, but it never hurts to have another option in the proverbial toolbox.
This script was written in Python and heavily utilizes Capstone and YARA Python modules. What initially started as a small script for CTF challenges, and an excuse to play with Capstone <3, has matured over time into more of an operational tool to leverage VirusTotal's Retrohunt capability. I've had a few wins with it but I can't say it's ever been terribly useful in my current capacity and thus it was more of an exercise in programming logic and interacting with files at a low-level.
The reality is that different versions of compilers, different compile time flags, architectures, variable values, etc, can all cause code that is similar in logic and structure to be quite different at a byte level. This is the gap that BinSequencer is trying to bridge.
The core matching technique in BinSequencer attempts to first determine where executable code may reside within a PE by looking at whether the "code" or "executable" bit is set within each section of the Windows PE file. Once it's identified the potential locations for code, it will disassemble the bytes into assembly instructions and strip out the opcode mnemonics. These opcode mnemonics will be strung together as a sequence in which the script attempts to find the longest match that exists across the sample set and then reverses the process of disassembling it back into bytes for YARA rules.
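The matching step on its own can be sketched without any disassembler. The brute-force version below is only an illustration of the idea, not the logic BinSequencer actually uses: it treats each sample as a list of mnemonics and looks for the longest contiguous run present in every one of them. The mnemonic lists in the usage lines are invented.

```python
def contains(seq, sub):
    """True if the list sub appears as a contiguous run inside seq."""
    n = len(sub)
    return any(seq[i:i + n] == sub for i in range(len(seq) - n + 1))

def longest_common_run(samples):
    """Longest contiguous mnemonic run shared by every sample."""
    base = min(samples, key=len)  # the answer cannot be longer than this
    for length in range(len(base), 0, -1):
        for start in range(len(base) - length + 1):
            candidate = base[start:start + length]
            if all(contains(s, candidate) for s in samples):
                return candidate
    return []

# Invented mnemonic sequences, for illustration only.
a = ["push", "mov", "xor", "call", "jnz", "ret"]
b = ["mov", "xor", "call", "jnz", "pop"]
c = ["nop", "xor", "call", "jnz", "ret"]
print(longest_common_run([a, b, c]))
```

This approach is far too slow for real code sections, but it shows why working on mnemonics rather than bytes makes the comparison itself simple.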
Below is an image I cobbled together to try and illustrate this particular process.
One of the pro's of this method is that it makes logic matching easier. The mnemonic strings are essentially byte-agnostic, meaning that going from byte to opcode is easy. Looking at a sequence of just mnemonic opcodes is also much faster than byte-by-byte comparison so it helps when it comes to scaling this across large sample sets as well. Unfortunately, we can't have a pro without an equal con. Converting from mnemonic opcode back into a byte can be extremely problematic and this additional processing can slow things down significantly. You end up having to test various byte-variations and variable lengths of the operands used by the instructions to properly generate an accurate YARA rule, not to mention the problems with following wrong branches when building up the YARA rules, but I'll touch on this more later.
Before getting into the actual script, it behooves me to attempt and give a mini-x86 assembly overview in order to help illustrate some of the aforementioned pros and cons. The x86 instruction layout is below.
When we go from low (bytes) to high (opcode) we will convert a byte like 0x75 to the opcode "JNZ" (jump if not zero). The bytes 0xF85 also use this same "JNZ" opcode so, regardless of the underlying bytes (0x75 vs 0xF85), we end up with the same opcode "JNZ" and the logic of the code remains intact. This is the major benefit of this abstraction technique.
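A toy version of this two-way mapping for "JNZ" alone might look like the following. It assumes 32-bit code, where the short form carries an 8-bit offset and the near form a 32-bit offset; the table and function names are made up for the example and cover only this one mnemonic.

```python
# Opcode bytes for JNZ mapped to the length of the offset that follows.
JNZ_ENCODINGS = {
    bytes([0x75]): 1,        # short form, 8-bit relative offset
    bytes([0x0F, 0x85]): 4,  # near form, 32-bit relative offset
}

def to_mnemonic(code: bytes):
    """Bytes -> mnemonic: both encodings collapse to the same string."""
    for opcode in JNZ_ENCODINGS:
        if code.startswith(opcode):
            return "JNZ"
    return None

def jnz_yara_alternatives():
    """Mnemonic -> bytes: every encoding has to be offered as a choice."""
    alternatives = []
    for opcode, operand_len in JNZ_ENCODINGS.items():
        head = " ".join(f"{b:02X}" for b in opcode)
        alternatives.append(" ".join([head] + ["??"] * operand_len))
    return "( " + " | ".join(alternatives) + " )"
```

Going up is a lookup; going back down means emitting every candidate encoding with wildcards for the operand, which is where the rule size and processing cost start to grow.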
Most instructions are fairly straight forward, "PUSHAD" will always be 0x60 and "CALL" will always be 0xE8 or 0xFF; however, the variations in the underlying bytes dictate things like operand length and change the overall size of the bytes in use.
One of the above "CALL" instructions is 5 bytes in total while the other is 6 bytes. This is one area of variation in which it begins to complicate things when we end up disassembling "CALL" back into bytes for matching in a YARA rule. There are also built-in optimizations that x86 uses, such as 0x5, which is "ADD EAX, ???" where a value is added to the EAX register. These optimizations create additional variations in potential bytes because the initial bytes change a lot. The "XOR" opcode is a good example of one with baked in optimizations creating a lot of variation which must be accounted for within our YARA rule.
These are 10 different byte-representations of the same opcode, so going from byte to opcode is easy, but reversing it can start to get tricky. As if that wasn't bad enough, it gets complicated further by things like the mod-r/m byte which changes the actual opcode meaning. The general 8-bit breakdown for the mod-r/m byte follows:
Using the 0x80 byte from the previous "XOR" example, take a look at the following two instructions.
The mod-r/m byte for the first instruction is 0x35 and 0x2D for the second. The 3 bits that dictate the opcode are bits 5, 4, and 3. These correspond to a table of potential opcodes for that particular byte, so we can't just assume 0x80 is "XOR" when we do our conversion. For the 0x35 mod-r/m byte, the bits in question are 110, which equals 0x6 - the bits for 0x2D are 101, which equals 0x5. The table for that specific 0x80 byte is below.
I've probably re-written this script 3 times over the past 2-3 years tackling problems as they've developed from these types of variations that would pop up. The "MOV" opcode has over 20 different variations by itself so sometimes things got quite messy!
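The bit extraction and lookup can be sketched in a few lines. The `GROUP1` list is the standard opcode extension table for the 0x80 byte as given in the Intel manuals; the function names are just for illustration.

```python
from typing import Tuple

# Opcode extension table for the 0x80 byte (x86 "group 1" instructions),
# indexed by bits 5-3 (the reg field) of the mod-r/m byte.
GROUP1 = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]


def split_modrm(modrm: int) -> Tuple[int, int, int]:
    """Split a mod-r/m byte into its (mod, reg, rm) fields."""
    mod = (modrm >> 6) & 0b11    # bits 7-6
    reg = (modrm >> 3) & 0b111   # bits 5-3
    rm = modrm & 0b111           # bits 2-0
    return mod, reg, rm


def mnemonic_for_0x80(modrm: int) -> str:
    """Resolve the real mnemonic of an 0x80 instruction from its mod-r/m byte."""
    _, reg, _ = split_modrm(modrm)
    return GROUP1[reg]


print(mnemonic_for_0x80(0x35))  # reg bits 110
print(mnemonic_for_0x80(0x2D))  # reg bits 101
```

So the same leading 0x80 byte resolves to "XOR" for one instruction and "SUB" for the other.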
Alright, so back to the script. The Python "pefile" module is used for the extraction of code from the files, assuming it is a PE, and then the Capstone module is used for the actual disassembly engine. For each file, it will convert the extracted area of data and pull each instruction out, followed by each opcode mnemonic, which get strung into a "blob" which is used as the base for matching.
There are various ways to tune the functionality of this script from the command line which can be seen below.
The `-c` flag lets you specify the percentage of samples the sequence needs to exist in and the `-l` flag sets the minimum length of linear opcodes required. I've found 25 to be around the sweet spot for bare minimum accuracy as it usually covers a couple of functional blocks but, in general, the higher the better. When you go too low, it leads to non-unique common sequences of instructions, such as with prologues and epilogues or basic code re-use. The "blobs" will look like "jnz|jmp|add|mov|xor|call" with hundreds to a few hundred thousand opcodes.
Since we're doing comparisons, the script tries to identify the best initial file to use for analysis as the "gold" file. If it's doing a 100% match, it will search for the file with the lowest volume of instructions since it has to exist in every other file; it does the reverse if it dips below 100%. As an aside, it's also worth noting a limitation of YARA at this point. YARA has a hard-coded limit of 10,000 hex bytes that can be used for a rule so I artificially limit the amount of opcodes that can be used to 4,000 as a safeguard.
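For illustration only, two flags like these could be wired up with `argparse` as below. The destination names, help text, and default values (100% commonality, minimum length of 25) are my assumptions based on the behavior described in this post, not a copy of the script's actual argument parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with only the two tuning flags discussed above."""
    parser = argparse.ArgumentParser(description="Illustrative tuning flags")
    # Assumed default: the sequence must exist in every sample.
    parser.add_argument("-c", type=int, default=100, dest="commonality",
                        help="percent of samples the sequence must exist in")
    # Assumed default: the minimum match length of 25 opcodes.
    parser.add_argument("-l", type=int, default=25, dest="length",
                        help="minimum number of linear opcodes in a match")
    return parser


args = build_parser().parse_args(["-c", "75", "-l", "30"])
defaults = build_parser().parse_args([])
print(args.commonality, args.length)
```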
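The gold file choice can be sketched as a simple min/max over instruction counts. The function name `pick_gold` and the sample names are hypothetical.

```python
from typing import Dict


def pick_gold(instruction_counts: Dict[str, int], commonality: int) -> str:
    """Pick the file to use as the base for matching."""
    if commonality >= 100:
        # A 100% match must exist in the smallest file, so start there.
        return min(instruction_counts, key=instruction_counts.get)
    # Below 100%, do the reverse and start from the largest file.
    return max(instruction_counts, key=instruction_counts.get)


# Hypothetical instruction counts per sample.
counts = {"sample_a": 1200, "sample_b": 477, "sample_c": 9800}

print(pick_gold(counts, 100))
print(pick_gold(counts, 75))
```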
To do the actual comparison, it uses a simple sliding window technique between the low match limit (25) and 4,000 opcodes, starting at the highest size and subtracting one opcode after each iteration until it empties and moves the window up by one offset. From there, it utilizes a number of tricks to speed things up, like black listing known bad sequences, so that it can zero in on the optimal matching length and reduce the number of iterations required.
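Here is a minimal sketch of that sliding window, assuming each sample has already been reduced to a list of mnemonics. The function names are mine, and the only speed-up shown is one simple assumption: if the shortest window at an offset fails, no longer window starting there can succeed. The real script uses more tricks than this.

```python
def to_blob(mnemonics):
    """Join mnemonics with delimiters on both ends so "or" never matches inside "xor"."""
    return "|" + "|".join(mnemonics) + "|"


def longest_shared_sequence(gold, others, min_len=25, max_len=4000, required=1.0):
    """Return the longest run of `gold` mnemonics found in enough of `others`."""
    blobs = [to_blob(sample) for sample in others]
    needed = required * len(blobs)

    def shared(window):
        needle = to_blob(window)
        return sum(needle in blob for blob in blobs) >= needed

    best = []
    for offset in range(len(gold) - min_len + 1):
        # If even the shortest window fails here, no longer one can succeed.
        if not shared(gold[offset:offset + min_len]):
            continue
        largest = min(max_len, len(gold) - offset)
        # Start at the largest window and drop one opcode per iteration,
        # never bothering with sizes that cannot beat the current best.
        for size in range(largest, max(min_len, len(best) + 1) - 1, -1):
            if shared(gold[offset:offset + size]):
                best = gold[offset:offset + size]
                break
    return best


gold = ["push", "push", "push", "call", "mov", "xor", "ret"]
others = [
    ["nop", "push", "push", "call", "mov", "xor", "jmp"],
    ["push", "call", "mov", "xor", "add"],
]
print(longest_shared_sequence(gold, others, min_len=3))
```

Lowering `required` to 0.5 on the same toy data lets the longer "push|push|call|mov|xor" run through, since it only needs to appear in half of the other samples.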
Once it has a match of sequenced opcodes, it moves into the actual YARA generation which is where things get fun. One nice feature of YARA is the ability to do a boolean OR in the hex match. A "CALL" opcode, which can be the 0xE8 or 0xFF byte, would be "(E8|FF)". You can also do wildcards in YARA so "PUSH", which is 0x50 through 0x57 and 0x6A, can be represented as "(5?|6A)". Finally, you can account for overall instruction length with a byte jump in YARA like "[4-5]", which skips between 4 to 5 bytes before the next match.
As I intended to use this primarily on VirusTotal, I began running into a few undocumented limitations VirusTotal imposed on the rules that can be used for retrohunting. Too many jumps, too many boolean ORs, and in some scenarios even just the length of a jump could cause your rule to fail. I suspect these limitations are primarily for efficiency/performance management but it's something that I had to account for throughout the script - your mileage may vary and I have not kept up with any changes.
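Putting those three YARA features together, a hex string could be assembled roughly like this. The `PATTERNS` table is a tiny hand-written assumption covering only "CALL", "PUSH", and "PUSHAD" with the byte patterns mentioned in this post; the real script derives far more variations.

```python
# Hypothetical mapping: mnemonic -> (opcode byte pattern, jump over operand bytes).
PATTERNS = {
    "call": ("(E8|FF)", "[4-5]"),  # 5 or 6 bytes in total
    "push": ("(5?|6A)", "[0-1]"),  # register push, or 0x6A plus one byte
    "pushad": ("60", ""),          # single fixed byte, no operand
}


def build_hex_string(mnemonics):
    """Turn a mnemonic sequence into a YARA hex string body."""
    parts = []
    for name in mnemonics:
        opcode, operand_jump = PATTERNS[name]
        parts.append(opcode)
        if operand_jump:
            parts.append(operand_jump)
    # A trailing jump only pads the match, so it is dropped here.
    if parts and parts[-1].startswith("["):
        parts.pop()
    return "{ " + " ".join(parts) + " }"


print(build_hex_string(["push", "push", "call", "pushad"]))
```

The output would then be placed in the strings section of a rule, for example as `$code = { ... }`.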
It's probably easier to just show some examples of the expected output to highlight what the script does. I've truncated parts of it and will interject commentary between the sections to explain each one. For this first example, I'll take a look at my favorite lame Malware - Hancitor!
The script takes a path to where the files reside and will iterate over each one to extract instructions. If you specify the `-n` flag, it will treat the files as non-PE. This was done for analyzing files of raw shellcode, and while the opcode technique holds up over other files as well (PCAP/JAR/whatever), I wouldn't recommend it and you may run into bugs.
Since I ran the script with its default settings, it attempts a 100% match across all samples and thus finds the sample with the lowest instruction count since any match *must* exist in this file.
The script should iterate over each section in the golden hash that it identified as potentially having code and check these against the other samples. In this case, it does 8 iterations of the sliding window for sizes above the minimum of 25 and finds no matches in ".text". Then it moves on to the ".edata" section, which has 477 instructions, and finds that this match exists in all of the samples.
Again, designing this with the intent of being used during analysis, it prompts the user multiple times to display the results in various ways. The above lets you see the opcode sequence and gives a quick impression of whether or not the code construct seems to make logical sense. For example, you can see near the beginning 3 "PUSH" and then a "CALL", which is common behavior for pushing values to the stack before calling a function. This tells me what we're looking at is most likely code as opposed to data that just happened to be intermingled within a code section and disassembled.
As was stated before, it doesn't have to be code to work but the abstraction method will be less useful if it's not.
Similar to the previous display but with more details so you can see exactly what the assembly looks like.
If you do not want to keep the match, possibly due to the code being part of a known library, not being unique, or the match being just data and not the actual code construct you're after, then it's possible to decline it and restart the matching process.
Once it has the match, it will show the offset within each file so that you can look at it further if needed. In this instance, you can see all of the matches happened in the ".edata" section at offset 0x10003000 through 0x100033FE. A quick look in IDA looks promising.
You can see the matched sequence extends across a number of function blocks and the code looks interesting with multiple instances of Win32 API calls commonly found in malware.
Before it actually does the YARA generation, the script can perform up to three different matching techniques in order to improve performance and accuracy. Each one builds upon the last and I'm going to spend a second detailing the three methods as they aren't all on display here.
You can skip the first two checks with the `-o` flag (although the third check will still utilize their techniques during morphing to some degree).
Once the generation is complete and it validates the YARA rule, it will dump the rule to your console.
You can manually validate the rule matches with YARA as expected before retrohunting.
Below is another example where I utilize some of the tuning features of the script. Specifically, the match must exist in at least 75% of the samples, it will include strings that exist across all 75% of them, it will skip the first two matching techniques, and it will set Capstone to x64 architecture.
That about sums it up. Hopefully someone finds it useful but if not, c'est la vie.
The GIT for this script is up. Enjoy.
Getting phished sucks. Getting phished really sucks when you've spent significant amounts of time analyzing phishing attacks only to end up falling prey to one anyway. It happens, this is my story...
If you're not familiar with who Shane Missler is, he's a 20 year old who recently won a Mega Millions jackpot. A HUGE jackpot, one of the largest they've ever had. Since I'm also in Florida, he dominated our local news coverage for a short period of time. One thing that kept recurring was that he kept saying he wanted to help people and do good with his money. Cool, that's a nice thing to do and I wish him the best of luck and then I moved on to real news.
A few days later though, I saw a Tweet pop up on my feed from an account claiming to be Shane, created in April of 2016, stating that he wanted to give back to everyone and offered USD $5,000 to the first 50,000 people who retweeted his message. Remembering his repeated messaging of wanting to do good I said "sure, what's the harm in retweeting?" and did so. I figured, if it's a fake account then I'll just unfollow and that'll be the end of it.
Fast forward another couple of days and I see a new Tweet pop up on my feed, again from Shane. This time he says he's hired a company to put together a website to process the payments for the 50,000 people who met the requirement. I thought, "no way..." and went to the Twitter account. It looked the way I remembered it from the media coverage, so I started browsing his tweets.
They were thoughtful and offering positive messages with seemingly a lot of engagements. Huh, "I'll be damned!" I thought, this guy is actually doing it.
Now, in hindsight, besides all of the obvious red flags I even acknowledged and willfully ignored as this phish built up, the basic math of it all should have been a no-brainer. At USD $5,000, across 50,000 people, you have USD $250M which, again in hindsight, was far more than I knew Shane had since he took the cash payout which was significantly less than what he won.
Naturally, I opted for PayPal. Now, the information asked for felt off, but as I do a lot of PayPal and they weren't asking for things that weren't already in the public domain or easily Googleable, I let the little devil on my left shoulder shut the little angel on the right out - "this is totally going to be legit" as I filled in my name, address, and e-mail with thoughts dancing through my head about getting my entire family to sign up ASAP!
At this point, it does some "checks" to "verify" your eligibility and, again, I thought "this dude is crazy but hey I'm all about that free money!".
The deeper I went into this rabbit hole, the more I convinced myself to ignore all of the glaringly obvious red flags.
Next up, it says it needs to verify you're a human on the totally legit site "areyouahuman[.]co". Sounds reasonable, we definitely don't want to give money away to a bot so let's see what we have to do...
Now, mind you, I was doing this all on my phone while also preoccupied with something else so when the "verify you're a human" page came up and said I needed to install 4 Apps on my phone and let them run for 30 seconds, it made me stop what I was doing and take pause. What the hell kind of verification is this? How does that even work? Is this malware? What are these apps? Are they trying to generate money to cover some of the costs of sending out USD $5,000 to every person? This last question was, if you haven't already figured it out, somewhat true. The apps were all legitimate, Google validated, and very popular games on the Google Play store. Regardless, I trudged on due to greed, played a game of Solitaire, and pondered why I was letting myself be fooled by such an obvious fraud.
I decided to skip ahead, dread beginning to rise, to the Amazon voucher. It was an Amazon survey for $1,000 which was the final straw, as there is no way it's tied to human verification. I decided to go back to Twitter and confirm my fears; almost every subtweet to the original was along the lines of "THIS IS FAKE!!!". Red faced, annoyed, had, I decided to figure out just exactly what I got myself into.
First up, I confirmed it didn't matter what options I picked, what bullshit I filled in, I would "verify" and get sent over to the "areyouahuman[.]co" site. On the PC, the entries were of course different so they are doing some device/source detection and redirecting based on that. I don't think this site is necessarily related to the other, but you can clearly see the affiliate ID at the top.
Going through the source code on the page, it luckily appears to just be a scheme to generate revenue, using the affiliate ID for tracking. The person behind the ID would get cash for each successful app install, link clicked, and survey taken, thus making me a pawn in their game. The links all followed a similar pattern of using this site "jump[.]ogtrk[.]net" preceded by the affiliate ID and whatever the AD is, as shown below:
The next logical question then was, "who the hell is the man behind the curtain?".
A quick WHOIS didn't provide any useful information. The domain was created fairly recently, which would make sense, but otherwise it had the usual GoDaddy abuse information; however, there was a Registrant Name which would be useful.
Sean Courtney isn't a lot to go on. Looking in PassiveTotal there are 106 recorded Registrants that share this name. A lot of them seem unrelated so name alone might not be strong enough to go off. Looking at the resolutions for the domain shows two IP addresses, both registered to GoDaddy. This implies that on the date it was registered they changed the IP address, which is always an interesting piece of data to pivot on.
Cross-correlating the two IP addresses with every other domain that shares the Registrant "Sean Courtney" showed a number of domains that overlapped.
Now, these are shared GoDaddy IP addresses, so each IP has a significant number of domains attached to it (500 and 1K, respectively) but it's building a stronger relationship. Additionally, the other domains also have another overlap which makes things more interesting. Specifically, they share an e-mail address used during registration.
Obviously, the one that immediately stands out in the domains is "seancourtney[.]org". Looking at this site's WHOIS information reveals a bit more. Relevant bits follow:
Before diving in further on that, I want to take a second to talk about the e-mail address. If you Google it, you receive back a handful of other domains they've registered in the past all related to video game cheating. Similarly, using PassiveTotal to look at the historical domain registrations from this e-mail, a picture begins to emerge around content and theme.
Based on these and the few Google hits, it seems this individual tries to profit off "hacks" for very popular games, such as PokémonGo, Clash of Clans, and OverWatch. I also suspect that not one of these serves actual hacks or cheats but that they are simply used as a lure for desperate people. Again, if you play off people's desire to win (or get free money) then you can more easily entice them into clicking your affiliate links and thus make money.
I have an e-mail, a name, and an address so let's see what else is available online.
Of course, almost immediately, I stumble onto the individual's LinkedIn page.
Sean Courtney, Advertiser, located in Muncie, Indiana.
He's also worked other Advertiser jobs in the past, but his most recent experience as an "Affiliate Advertiser" seems to line up perfectly with what we've uncovered so far. You'll also note the name of his current employer, "OGAds". My gosh, that sounds awfully familiar...
You'll recall that when I looked at the source code of the page the surveys and apps were being funneled through, the domain was "ogtrk[.]net" - I wonder what the chances are those two are related?
Well, pretty fucking likely apparently.
The final icing on the cake is a now all-too-familiar app-install offer that OGAds displays proudly on their site.
That about wraps things up here. I'm sure they made a good chunk of change off everyone and it was a good lure (for me) so hats off to Sean and, most likely, OGAds. You're all a bunch of twats.
*EDIT - 27AUG2018*
I converted this to a dedicated Cuckoo module and it far exceeds the capabilities here. Branch can be found here for Cuckoo and general code on GitHub here.
In this post I want to give a brief introduction to a new tool I'm working on called "Curtain". It will be complementary to another post I'm working on for $dayjob where I created a Curtain Cuckoo module.
Curtain is basically just a small script that can be used to detonate malicious files or PowerShell scripts and then scrape out the ScriptBlock logs for fast analysis. I'll go into the concept behind it more in my other post.
The idea here then is to get that same functionality but without Cuckoo, as it's not always available to everyone or people may not want to muck with installing custom modules, and I wanted something standalone.
That being said, there are still a few requirements for this alternate iteration which, admittedly, was my first version before I decided to try and streamline it more for work usage.
The usage of the script is fairly straightforward: you simply pass it either a PS1 script OR a file, which it will try to execute, and it then reports back the ScriptBlock event logs. For PS1 scripts, it will launch PowerShell; otherwise it will rely natively on the extension and the OS to recognize/execute it. It'll wait 10 seconds and then simply scrape the logs, parse them into a simple HTML file, and display it on the host.
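The actual tool is a shell script driving a guest machine, so the following is only a loose Python sketch of two pieces of that flow: picking how to launch the sample based on its extension, and turning scraped ScriptBlock text into a bare HTML page. Every name here, and the exact commands used to launch things, are assumptions for illustration.

```python
import html
from pathlib import PureWindowsPath

WAIT_SECONDS = 10  # delay before scraping the logs, per the description above


def launch_command(sample: str) -> list:
    """Choose how to execute a sample on the guest based on its extension."""
    if PureWindowsPath(sample).suffix.lower() == ".ps1":
        return ["powershell.exe", "-File", sample]
    # Assumed fallback: let the OS pick a handler from the file extension.
    return ["cmd.exe", "/c", "start", "", sample]


def render_report(script_blocks) -> str:
    """Render scraped ScriptBlock text as a very plain HTML page."""
    rows = "\n".join(f"<pre>{html.escape(block)}</pre>" for block in script_blocks)
    return f"<html><body>\n{rows}\n</body></html>"


print(launch_command("dropper.ps1"))
print(render_report(["IEX (New-Object Net.WebClient) <snip>"]))
```

Escaping the log text matters here since the deobfuscated PowerShell can contain anything, including markup.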
Below is a simple example of the script in action...
If you click HERE you can see the resulting output that gets created.
Nothing too crazy or fancy but it allows you to see how things flowed and have some visibility into the deobfuscated PowerShell which was executed on the Guest machine. I haven't yet beautified it so it's pretty raw at the moment but figured I'd put it out there if there was a need for it.
The GIT for this script is up and you'll need to fill out a few variables in the "curtain.sh" script to make it work.
Hopefully this helps if you do not have Cuckoo available or just want a quick way to be able to parse out PowerShell execution activity from a code sample.
There is also a file "psorder.py" I've included in the GIT repo. At some point I was writing a manual PowerShell deobfuscator and, while that is a fruitless endeavor, there are some benefits using it in tandem with Curtain for token replacement/etc.
When Magic the Gathering (MtG) came out I was beyond excited. As a fairly young kid, this was unlike anything I'd ever seen before and for a few years I would spend every penny from chores, lunch money, and birthday cards to round out my collection. It was a blast, I really enjoyed it, and developed many fond memories around playing and collecting the cards.
Alas, as with all good things, that came to an end as life started to kick in. Fast forward 20+ years to today and I'm free to indulge myself in my old(new) hobbies again. One thing I always wanted to do was buy a booster box. SO.MANY.CARDS! But holy shit is it not cheap, or at least not for past-me, so it was always a pipe dream until now.
I've been shitposting with friends at work about doing a MtG draft at a conference we'll be attending one night, rehashing old memories, and then I felt a familiar feeling...an itch that needed to be scratched. I read up on the latest Amonkhet set and fell in love with the theme of it so I went out and bought an Amonkhet Booster Box, the Amonkhet Deck Builder's Toolkit, and an Amonkhet Bundle Box. Indulge I shall.
It's a metric crap ton of cards, especially going from 0, and I found myself with an overwhelming amount of information to take in and try to process. Tons of new rules, new abilities, new everything. I spent the majority of time looking up card rules to try and get a grasp of the game, not even knowing where to begin with building a new deck. I cobbled together some decks as I opened packs but I wondered if, within my cache of newfound cards, there may already be a deck someone else has built and posted online. Surely that would be the case with 1,100 cards, right?
TL;DR don't buy packs and expect to have any semblance of a pre-constructed deck, official or otherwise.
To come to this conclusion, which I admittedly already suspected before buying all of these (it still didn't deter me from making the purchase though, at least to do it once), I created a script that searches the decks I scrape online and attempts to match them against my library. The script mtgdeckhunter.py is up on GitHub with usage examples and output data. You can skip everything below if you're just interested in that instead of my rambling about it.
After some time on the net looking up new MtG rules, I came across three main sites (Deckbox, MtG Goldfish, and MtG Top 8) that had tons of decks available to peruse in all kinds of formats. The problem then became not finding the decks, but identifying which decks I might have most of the cards for. The idea being I could then just craft some decks other people put thought into, give them a whirl, see if I liked that style, then maybe go from there. Unfortunately none of these sites have that feature available except for MtG Goldfish - part of their monthly pay service.
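The core of "which decks do I have most of the cards for" can be sketched as a coverage score between two card counts. This is a simplified illustration with placeholder card names and a hypothetical `coverage` function, not the logic from mtgdeckhunter.py.

```python
from collections import Counter


def coverage(deck: Counter, library: Counter) -> float:
    """Fraction of a deck's cards (counting copies) already in the library."""
    needed = sum(deck.values())
    owned = sum(min(count, library[card]) for card, count in deck.items())
    return owned / needed if needed else 0.0


# Placeholder card names and counts, purely for illustration.
library = Counter({"Forest": 10, "Card A": 2, "Card B": 1})
deck = Counter({"Forest": 8, "Card A": 4, "Card C": 4})

print(coverage(deck, library))
```

Scoring every scraped deck this way and sorting by the result gives a ranked list of decks that are closest to buildable.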
I decided I'd try to craft something simple in Python to scrape the publicly available decks and see if I had any matches. What I've now realized is that if you just have one "set" (eg Amonkhet) then you're pretty much SoL on finding pre-made decks. Since there are so many editions in rotation, you rarely find real decks solely focused on just one set except the officially released ones. In hindsight, I'd have bought a few of the "Deck Builder's Toolkits" and Bundle Boxes for maybe the most recent 2-3 sets or just a couple of the official pre-constructed ones that actually don't seem too bad. Initially the idea of buying a pre-constructed deck had me sticking my nose up as if I'm some kind of MtG legend (I'm not).
Anywho, I'm not providing the decks I've pulled from these sites, that is an exercise left to the user, but if you have a ton of cards that you've got listed out on your computer, then maybe this program will prove helpful to you. I ended up not really using it at all...go figure...and built two EDH decks that have been pretty fun in my limited playing. I may revisit this once I have a more well rounded collection.
You can see my cards from the three purchases mentioned previously on Deckbox or in text format here, if you're interested in what I got. Deckbox says the total value of cards in the set is $283.95, which I think is pretty good since it's about 2x what I paid (doesn't account for the foils I got). The new formats (eg EDH) seem really fun and I'm excited to get back into this hobby.
*NOTE*: I likely won't be updating this code anytime soon for the reasons above. It's probably quite buggy and, after having collected 100,000+ decks, I quickly recognized that using JSON as the storage format was a terrible idea (80MB+ file). I'd also say it's not particularly stable as it relies upon parsing these sites, which can change their code at any moment.