In this series, we will discuss the role of Big Data in litigation. Much has been written about the role of unstructured data (emails, documents, chats, etc.) and E-Discovery in litigation, so in this series we focus on the esoteric, but increasingly important, world of structured data. In this introductory article we will define “structured data” and “big data” in the context of litigation. The rest of the series will dive deeper into practical considerations for data discovery and extraction, common data pitfalls, our views on how to transform data into valuable insights, and how to present compelling analyses to the court.
It is important to note that not all litigations have Big Data components, but where they do, being able to effectively collect, manage and leverage the relevant data is often the difference between winning and losing. In our litigation support work, we partner with clients and legal counsel to use data efficiently to establish a fact base, support or disprove liability, and calculate and challenge damages. We will expand on these specific topics in future posts.
Between us we have over 35 years of experience working with structured data in the context of regulatory investigations and legal disputes. We look forward to sharing our collective experience through these posts and, hopefully, to conveying our passion for demystifying the often esoteric world of data and technology.
Structured data: simple to interpret, provided everyone uses the same structure (spoiler alert – they don’t!)
Structured data is Electronically Stored Information (ESI) that has been deliberately created, collected and stored in a highly defined manner, most commonly a table, like the one you would see in a Microsoft Excel spreadsheet. Each row in the table represents a record of a specific event or object (e.g., payments, trades, entries in financial ledgers, purchase and sales records, and increasingly data captured by Internet of Things (IoT) sensors). The columns in the table provide common data points relating to the event or object, such as the product type, customer name, or the date and time an event occurred. These tables of data are typically stored in a company’s financial or operational databases and preserve the complex relationships that exist between events and objects (e.g., products in inventory and sales of those products). Because the data is highly defined and structured, it is a very efficient way of storing information.

The challenge in interpreting structured data arises from the huge variety of systems that have been developed over the years by various companies (e.g., Microsoft, IBM, Oracle, etc.), all of which store data differently. The passage of time creates additional challenges as technology evolves, data is migrated to new systems, systems are retired, technologies cease to be supported, and data is deleted, destroyed or misplaced. All of this means there are countless ways to store structured data and many challenges in establishing what data still exists, whether it can be retrieved, and the time and cost implications of retrieving and producing the data.
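To make this concrete, here is a minimal sketch in Python (using the pandas library); the table names, columns and values are entirely hypothetical. It shows two related tables of the kind described above: each row records an event or object, each column holds a common data point, and a shared key column preserves the relationship between the tables.

```python
import pandas as pd

# Hypothetical, simplified example of two related tables of the kind a
# company's operational database might hold. Names and values are illustrative only.
products = pd.DataFrame({
    "product_id": [101, 102],
    "product_name": ["Widget A", "Widget B"],
    "unit_price": [25.00, 40.00],
})

sales = pd.DataFrame({
    "sale_id": [1, 2, 3],
    "product_id": [101, 101, 102],  # links each sale back to a product
    "sale_date": pd.to_datetime(["2023-01-05", "2023-01-06", "2023-01-06"]),
    "quantity": [10, 4, 7],
})

# The relationship between the tables is preserved through the shared
# product_id column, so the records can be joined for analysis.
sales_with_products = sales.merge(products, on="product_id", how="left")
print(sales_with_products)
```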
Big Data: It is big, but there is more to it than that…
Big Data refers to ESI whose scale and complexity are often described using the five V’s:
- Volume: The quantity of data that needs to be considered is so vast that ‘conventional’ analysis tools (for example, Microsoft Excel) are not practical, or indeed feasible, and therefore specialized data analytics hardware and software are required.
- Variety: The relevant data comes from multiple sources and is stored in many different formats that must be standardized before it can be consolidated into a single usable view. Large organizations typically store the same data in multiple systems (e.g., trades in the trading system and the resulting settlements in the settlements system), and this data often needs to be cross-checked, or a hybrid dataset created where data gaps exist.
- Veracity: The data is of variable quality, meaning it is not always complete and accurate, and any approach to analysis must carefully account for this, whether by avoiding the use of very dirty data, cleansing the data, or clearly articulating the limitations of any analysis that relies on it.
- Value: Fundamentally, data offers value to us if it tells us how the world was (i.e., the presence, or absence, of a digital record of a transaction or event); how the world currently is (e.g., readings from real-time sensors); how it might be in the future (through the use of predictive modeling); or how, through the use of statistics, the events captured and described in the data may or may not be related to one another.
- Velocity: The rate of flow of the data, i.e., how quickly it is being produced. This is most relevant to real-time systems that need to ingest, process and present data on the fly.
Our experience of the five V’s is that, perhaps counterintuitively, volume is often not a significant driver of complexity, although handling large quantities of data (millions or billions of rows) does require specialized hardware, software and coding skills (we’ll talk about this further in post #5). In our view, variety and veracity are the most significant drivers of complexity. Analyzing and combining smaller datasets from three systems is often far more challenging and time consuming than doing the same for a very large dataset from one system, as each system’s function and structure must be understood before the data can be analyzed correctly. Incorrect or incomplete data can present significant issues: it can skew analysis, give an incomplete picture of trends, patterns and behaviors, and ultimately lead to incorrect conclusions, as the illustrative sketch below suggests.
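The following sketch (Python with pandas; the system names, fields and values are entirely hypothetical) illustrates the variety and veracity challenges in miniature: extracts of the same trades from two systems arrive with different field names and date formats, must be standardized before they can be combined, and a cross-check flags records that appear in one system but not the other.

```python
import pandas as pd

# Hypothetical extracts of the "same" trades from two different systems.
# Field names, date formats and values are illustrative only.
trading_system = pd.DataFrame({
    "TradeRef": ["T-001", "T-002", "T-003"],
    "TradeDate": ["05/01/2023", "06/01/2023", "06/01/2023"],  # DD/MM/YYYY
    "Notional": [1_000_000, 250_000, 500_000],
})

settlement_system = pd.DataFrame({
    "trade_id": ["T-001", "T-002"],                # T-003 is missing here
    "settle_date": ["2023-01-07", "2023-01-08"],   # YYYY-MM-DD
    "settled_amount": [1_000_000, 250_000],
})

# Standardize field names and date formats before combining the sources.
trades = trading_system.rename(columns={
    "TradeRef": "trade_id",
    "TradeDate": "trade_date",
    "Notional": "notional",
})
trades["trade_date"] = pd.to_datetime(trades["trade_date"], dayfirst=True)
settlement_system["settle_date"] = pd.to_datetime(settlement_system["settle_date"])

# Cross-check the two sources; unmatched rows highlight potential data gaps.
combined = trades.merge(settlement_system, on="trade_id", how="left", indicator=True)
unmatched = combined[combined["_merge"] != "both"]
print(unmatched[["trade_id", "trade_date", "notional"]])
```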
Defining the fundamentals of data is the first step in understanding the evolving role that Big Data plays in litigation. Here’s a preview of what’s next:
Part #2 - Data archaeology: We discuss why Big Data is relevant to litigation, the challenges and pitfalls of discovering and extracting the data, and how to be like Indiana Jones.
Part #3 - Data analysis: We discuss our view of how to transform raw data into valuable insight and how data visualization techniques help us tell a compelling story.
Part #4 - Interview: We interview two veterans of the litigation world to get their reflections on how the use of data has evolved over time and hear their predictions of where this might be headed.
Part #5 - Conclusions: Our wrap-up of this series, recapping some of the key themes we’ve explored, along with our predictions on emerging trends such as the use of AI, data privacy and the impact of blockchain technologies.
About the authors:
Tom is a Director in our Forensic Data Analytics team, based in Dallas. He has over 20 years of experience leading clients through complex litigation and regulatory matters involving large amounts of structured and unstructured data with a focus on the financial services industry.
David is a Director in our Forensic Data Analytics team, based in London. He has over 18 years of experience assisting clients with high-profile regulatory and legal challenges, working to support blue-chip companies in areas such as antitrust, corporate investigations, commercial litigation and financial crime.