Why master YARA: from routine to extreme threat hunting cases. Follow-up
On 3rd of September, we were hosting our “Experts Talk. Why master YARA: from routine to extreme threat hunting cases“, in which several experts from our Global Research and Analysis Team and invited speakers shared their best practices on YARA usage. At the same time, we also presented our new online training covering some ninja secrets of using YARA to hunt for targeted attacks and APTs.
Here is a brief summary of the agenda from that webinar:
- Tips and insights on efficient threat hunting with YARA
- A detailed demo of our renowned training
- A threat hunting panel discussion with a lot of real-life yara-rules examples
Due to timing restrictions we were not able to answer all the questions, therefore we’re trying to answer them below. Thanks to everyone who participated and we appreciate all the feedback and ideas!
Questions about usage of YARA rules
- How practical (and what is the ROI), in your opinion, is it to develop in-house (in-company/custom) YARA rules (e.g. for e-mail / web-proxy filtering system), for mid-size and mid-mature (in security aspects) company, when there are already market-popular e-mail filtering/anti-virus solutions in use (with BIG security departments working on the same topic)?
- What is the biggest challenge in your daily YARA rule writing/management process? Is it a particular malware family, actor, or perhaps a specific anti-detection technique?
- Some malware uses YouTube or Twitter or other social media network comments for Command-and-Control. In that regard, where there are no C2 IPs, is it currently hard to detect these?
- So what is the size of the publicly available collections for people to use YARA against? What are some good ways to access a set of benign files, if you don’t have access to retrohunts/VTI?
- Can YARA be used to parse custom packers?
- What is the trade-off when we want to hunt for new malware using YARA rules? How many FPs should we accept when we need rules that detect new variants
- Could YARA help us to detect a fileless attack (malware)?
- We can use YARA, together with network monitoring tools like Zeek, to scan files like malicious documents. Can YARA be used against an encrypted protocol?
- What open source solution do you recommend in order to scan a network with YARA rules?
- Which is better, YARA or Snort, for looking at the resource utilization for detection in live environments?
In the case of mid-size companies, they can benefit a lot from three things connected to YARA, because YARA gives you some flexibility to tailor security for your environment.
First is the usage of YARA during incident response. Even if you don’t have an EDR Endpoint Detection and Response) solution, you can easily roll-out YARA and collect results through the network using PowerShell or bash. And it’s often the case that someone in a company should have experience developing YARA rules.
Second is the usage of third-party YARA rules. It’s an effective way to have one more layer of protection. On the other hand, you need to maintain hunting and detection sets and fix rules and remove false positives anyway. Which once again means that someone needs experience in writing YARA rules.
Third is that, as mentioned earlier, it might be really useful to have rules to look for organization-specific information or IT assets. It can be a hunting rule that triggers on specific project names, servers, domains, people, etc.So the short answer is yes, but it is important to invest time wisely, so as not to become overwhelmed with unrelated detections.
In our experience, certain file formats make writing YARA rules more difficult. For instance, malware stored in the Office Open XML file format is generally more tricky to detect than the OLE2 compound storage, because of the additional layer of ZIP compression. Since YARA itself doesn’t support ZIP decompression natively, you need to handle that with external tools. Other examples include HLL (high level language) malware, notably Python or Golang malware. Such executables can be several megabytes in size and contain many legitimate libraries. Finding good strings for detection of malicious code in such programs can be very tricky.
Yes and no. Yes, it’s hard to get the real C2, because you need to reverse engineer or dynamically run malware to get the final C2. No, it’s relatively easy to detect, because from a ML point of view it’s a pure anomaly when very unpopular software goes to a popular website.
You can use YARA on clean files and malware samples. Creating a comprehensive clean collection is a challenge, but in general, to avoid false positives, we recommend grabbing OS distributions and popular software. For this purpose, a good starting point could be sites like:
https://www.microsoft.com/en-us/download
https://sourceforge.net/
ftp://ftp.elf.stuba.sk/pub/pc/
For malware collection it’s a bit tricker. In an organization it’s easier, since you can collect executables from your own infrastructure. There are also websites with the collection of bad files for research purpose in Lenni Zeltser blogpost there is a good list of references:
https://zeltser.com/malware-sample-sources/
The final size of such a collection could be several terabytes or even more.
Yes, but not out-of-the-box. YARA has a modular architecture, so you can write a module that will first unpack the custom packer and then scan the resulting binary.
A more common option is to run YARA against already unpacked objects, e.g. results of unpacking tools like Kaspersky Deep Unpack or sandbox and emulator dumps.
It depends what you want to catch. In general, from a research perspective, it’s ok to have an average FP rate up to 30%. On the other hand, production rules should have no FPs whatsoever!
Yes, YARA can scan memory dumps and different data containers. Also, you can run YARA against telemetry, but it may take some additional steps to achieve it and properly modify the ruleset.
Only if you do a MITM (Man-in-the-Middle) and decrypt the traffic, since YARA rules most likely expect to run on decrypted content.
YARA itself plus PowerShell or bash scripts; or, as an alternative, you can use an incident response framework and monitoring agent like OSquery, Google Rapid Response, etc. Other options are based on EDR solutions which are mostly proprietary.
YARA and Snort are different tools providing different abilities. Snort is designed specifically as a network traffic scanner, while YARA is for scanning files and/or memory. The best approach is to combine usage of YARA and Snort rules together!
Questions about creating yara rules and training course questions
- Are we able to keep any of the materials after the course is finished?
- Is knowledge about string extraction or hashing sufficient to create solid YARA rules? Are there other things to learn as prerequisites?
- Can we add a tag to the rule that says it is elegant, efficient or effective, such as the tag on the exploit (in the metasploit): excellent, great, or normal?
- Maybe you can explain more about the fact that metadata strings don’t have a direct impact on the actual rule.
As we described before, a YARA rule can consist of meta, strings and conditions. While the condition is a mandatory element, the meta section is used only for providing more info about that specific YARA rule. and it is not at all used by the YARA scanning engine.
- ASCII is the default, so why do you need to put ASCII in the rule?
- Can we use RegEx in YARA? Is nesting possible in YARA?
- Is there a limit on the number of statements in a YARA rule?
- Can we say that YARA can be a double-edged sword? So a hacker can develop malware and then check with YARA if there’s anything similar out there and enhance it accordingly?
- This is a philosophical question: Juan said YARA has democratized hunting for malware. How have APTs and malware authors responded to this? Do they have anti-YARA techniques?
- Would be possible to create a YARA rule to find Morphy’s games among a large set of chess games?
- Would you recommend YARA for Territorial Dispute checks?
Yes, Kaspersky YARA cheat-sheets or training slides which include Kaspersky solutions to exercises are some of the things that are available for you to download and use even after the training session has finished.
This depends on case-by-case knowledge. Strings and hashing are basic building blocks for creating YARA rules. Other important things are PE structure and preferences and anomalies in headers, entropy, etc. Also, to create rules for a specific file format, you need some knowledge of the architecture of the corresponding platform and file types.
Sounds like a good idea. Actually, YARA rules also support tags in the name:
https://yara.readthedocs.io/en/stable/writingrules.html
Without ASCII, say ‘$a1 = “string” wide’, only the Unicode representation of the string would be searched. To search both ASCII and Unicode, we need ‘$a1= “string” ascii wide’.
Yes, it’s possible to use RegEx patterns in YARA. Be aware that RegEx patterns usually affect performance and can be rewritten in the form of lists. But in some cases you just cannot avoid using them and the YARA engine fully supports them.
Nesting is also possible in YARA. You can write private rules that will be used as a condition or as a pre-filter for your other rules.
We created several systems that create YARA rules automatically; and over time these have reached tens of megabytes in size. While these still work fine for us, having a very large number of strings in one rule can lead to issues. In many cases, setting a large stack size (see the yara -k option) helps.
Sure, although they would need access to your private stash of YARA rules. In essence, YARA offers organizations a way to add extra defenses by creating custom, proprietary YARA rules for malware that could be used against them. Malware developers can always test their creations with antivirus products they can just download or purchase. However, it would be harder to get access to private sets of YARArules.
A few years ago we observed a certain threat actor constantly avoiding our private YARA rules for one to two months after we published a report. Although the YARA rules were very strong, the changes the threat actor made to the malware kind of suggested they knew specifically what to change. For instance, in the early days they would use only a few encryption keys across different samples, which we, of course, used in our YARA rules. Later, they switched to a unique key per sample.
Probably! Morphy was one of the most famous players from the so-called romantic chess period, characterised by aggressive openings, gambits and risky play. Some of the openings that Morphy loved, such as the Evans Gambit or the King’s Gambit accepted, together with playing with odds (Morphy would sometimes play without a rook against a weaker opponent), might yield some interesting games. Or, you could just search for ‘$a1 = “Morphy, Paul” ascii wide nocase’, perhaps together with’ $a2 = “1. e4″‘ 🙂
Yes, of course. In essence, “Territorial Dispute” references a set of IoCs for various threat actors, identified through “SIGS”. While some of them have been identified, for instance in Boldi’s paper, many are still unknown. With YARA, you can search for unique filenames or other artifacts and try to find malware that matches those IoCs. Most recently, Juan Andres Guerrero-Saade was able to identify SIG37 as “Nazar”: check out his research here:
https://www.epicturla.com/blog/the-lost-nazar
Pro tips and tricks from the audience
- Using YARA programmatically (e.g. via py/c) allows you to use hit callbacks to get individual string matches. This enables you to check for partial rule coverage (k of n strings matched but without triggering the condition), which is great for aiding rule maintenance.
- On the top white list (clean stuff), known exploits and payloads should be also populated in our YARArule sets.
- I always find it easier to maintain code by grouping the strings together.
- As a dedicated/offline comment to JAG-S: The “weird” strings from the rule discussed most likely come from the reloc section (thus locking on encoded offsets), which would make the rule highly specific to a given build, even with a soft 15/22 strings required. That would still probably work well if the samples originate from a builder (i.e. configured stub) but should not generalize well. And for the IDA-extracted functions: consider wildcarding offsets to have better generalizing rules.
- When it comes to strings – besides the strings from disk, mem, network dump, etc., bringing context and offset should be a best practice. Then rank the strings in the context of the malware. And this requires human expertise but can be easily adapted into the YARA rule building process.
- Сombining, in a flexible way, the YARA rules build process with the enrichment of the recently announced Kaspersky Threat Attribution Engine, will be also GReAT 🙂
Feel free to follow us on Twitter and other social networks for updates, and feel free to reach out to us to discuss interesting topics.
On Twitter:
- Costin Raiu: @craiu
- Dan Demeter: @_xdanx
- Yury Namestnikov: @SomeGoodOmens