Handling Big Data in a Large DFIR Case

This post results from the project “21_DFIR” within the Munich Cyber Security Program (MCSP). The MCSP is a cooperation project between Champlain College and ComCode (Germany). The project 21_DFIR focuses on team collaboration and big data handling in large-scale DFIR cases for globally acting business organizations.

When people hear the word forensics, they often jump straight to the fantasy of crime shows, bad guys, and biological evidence being analyzed in a lab. Occasionally there’s a plot point about using cell records to shatter someone’s alibi, but very rarely is there a true deep dive into what it takes to analyze a computer, cell phone, tablet, server, cloud environment, entire social media account, or even a Fitbit. Despite that lack of visibility, not a day goes by without news of a data leak, ransomware attack, hack, new privacy or security law, or claim that technology will somehow be the downfall of society. Digital evidence is beginning to outweigh all other types of evidence in criminal, enterprise, and civil matters. Proper handling and analysis of that evidence, along with knowledge of the digital landscape, are of the utmost importance when responding to most incidents.

I think it’s important to quickly go over the current standards and best practices of forensic analysis. The National Institute of Standards and Technology (NIST) has broken the handling of digital evidence down into four basic steps:

  • Identification: Being able to identify what evidence (or device) is relevant to the investigation 
  • Collection: Copying the evidence in a way that is forensically sound and admissible in court (using write blockers, or taking a copy in such a way that doesn’t disturb the original)
  • Analysis: Using a suite of tools and methodologies to analyze what was collected
  • Presentation: Creating a deliverable that contains all evidence gathered, how it was gathered, and proof of its existence
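As a rough illustration of the Collection step, one common way to show that a copy is faithful to the original is to compare cryptographic hashes of the two. The sketch below is only a minimal example, not a full acquisition workflow: the function names and the chunk size are assumptions made for illustration, and a real collection would still rely on write blockers and dedicated imaging tools.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks.

    Reading in chunks keeps memory use flat even for very large images.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_matches_original(original_path, copy_path):
    """Hypothetical helper: True when both files hash to the same value."""
    return sha256_of_file(original_path) == sha256_of_file(copy_path)
```

Recording the hash at collection time and again before analysis gives the examiner a simple, repeatable check that the working copy has not changed.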

An important part of the analysis process is being able to make connections. If you have a phone, a smartwatch, a PC, and a user’s Twitter account, how do you put them all together to prove that a crime was committed? If you have four different file servers, an email server, an Active Directory server, and 300 user accounts, how do you put all of the logs together to figure out where the data breach was? Typically, you would analyze each source separately (usually with a different tool or methodology), and then manually and painstakingly combine the results into a timeline of what happened. There are a handful of tools that boast about being able to make connections (such as Cellebrite’s UFED suite), but none can connect every single artifact in a way that is meaningful to examiners.
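To make the timeline idea concrete, here is a minimal sketch of merging events from several sources into one chronological view. It assumes each source has already been parsed into time-sorted records with timestamps normalized to UTC, which in practice is the hard part. The `Event` type, the `merge_timelines` function, and the sample events are all hypothetical and exist only for illustration.

```python
import heapq
from datetime import datetime, timezone
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    timestamp: datetime
    source: str
    description: str


def merge_timelines(*sources: Iterable[Event]) -> Iterator[Event]:
    """Lazily merge already-sorted event streams into one timeline."""
    # heapq.merge never loads all events at once, which matters for large logs.
    return heapq.merge(*sources, key=lambda event: event.timestamp)


def utc(hour: int, minute: int) -> datetime:
    """Build a made-up UTC timestamp on a single example day."""
    return datetime(2021, 6, 1, hour, minute, tzinfo=timezone.utc)


# Made-up sample data standing in for parsed artifacts from three sources.
phone_events = [
    Event(utc(9, 5), "phone", "SMS received"),
    Event(utc(9, 40), "phone", "Call placed"),
]
file_server_events = [
    Event(utc(9, 0), "file-server", "User login"),
    Event(utc(9, 30), "file-server", "Archive created"),
]
email_events = [
    Event(utc(9, 35), "email-server", "Message with attachment sent"),
]

timeline = list(merge_timelines(phone_events, file_server_events, email_events))
for event in timeline:
    print(event.timestamp.isoformat(), event.source, event.description)
```

A merge like this only orders events. Deciding which events are actually related is still left to the examiner, which is the gap described above.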

On its own, this is a clear issue. Analyzing each source individually takes up a substantial amount of time, even with multiple examiners. On top of that, the lack of centralized tooling means that most places don’t have an easy way of splitting up work between analysts, and combining multiple people’s results afterward complicates investigations even further. To make matters more challenging, the amount of data being analyzed is not getting any smaller. The International Data Corporation estimates that the average person created roughly 1.7 megabytes of data per second in 2020. According to a survey done by Forensic Focus in 2013 (the latest figure I was able to find), over half of the investigators said that at least half of their investigations involved more than a terabyte of data, and that nearly 20% of all cases involved more than five terabytes. Considering the age of this survey, and the fact that you can now buy cell phones with one terabyte of storage from mainstream companies such as Samsung, I have to assume these figures are much higher now.

The goal of the project this summer is to find a solution, or a set of solutions, to the issues of Big Data in forensics. I will be looking at open-source, commercial, and even built-in tools within server environments to find a tool or suite of tools that can help examiners tackle large amounts of data in practical ways. This is a very large task, which is why I am working with Ian Eubanks on this endeavor.

One issue we have already run into is the lack of prior research on this topic. There are plenty of forum posts asking for tips and tools to handle large amounts of data, some short undergraduate theses that attempt to address the problem broadly, and a handful of hour-long presentations from security experts that talk about the issue in general but offer no real solutions. What little information there is seems to be locked behind a paywall (mostly SANS). Since starting, most of my time has been spent on research, getting used to working outside of my room, and plotting how best to tackle this project.

Follow us for more updates on this project!

For further questions about the Munich Cyber Security Program or this project, please feel free to contact mcsp@comcode.de

-Written by Kaya Overholtzer ‘22 //Digital Forensics & Cybersecurity