5 Questions with Mike DeCesaris: Extracting Meaningful Insights from Unstructured Data

5 Questions is a periodic feature produced by Cornerstone Research, which asks our professionals, senior advisors, or affiliated experts to answer five questions.

We interview Mike DeCesaris, vice president of Cornerstone Research’s Data Science Center, about the challenges of working with unstructured data and how his team has developed custom processes to turn it into valuable information clients can use.

What is unstructured data, and how does it relate to litigation?

Traditional data analytics typically involves the analysis of structured data, such as spreadsheets or relational databases. Unstructured data, on the other hand, is essentially any information not stored according to a predefined structure. Examples of unstructured data include text documents, emails, Adobe PDFs, and image files. Some estimate that unstructured data accounts for 80 percent or more of all data, and unstructured datasets are growing fast.

The information contained in unstructured documents can be crucial to supporting expert analyses, but locating and extracting the relevant information can be challenging when there are large volumes of unstructured data. Whereas structured data can be processed and analyzed using traditional database tools and data analysis programs, analyzing unstructured data requires either numerous hours of manual work or a significantly higher level of technical expertise and sophistication.

Given the large amount of unstructured data in our work, how is Cornerstone Research responding?

We’ve developed sophisticated tools that can be used in concert to create tailored approaches to turning unstructured data into structured data. The result can be leveraged for quantitative analysis. This can eliminate large-scale manual review, significantly reducing processing time and cost. Perhaps more importantly, this can unlock new analysis possibilities that would otherwise have been impossible.

For example, we have:

  • developed a parallelized data processing pipeline to convert hundreds of thousands of pages (hundreds of gigabytes) of daily reports across multiple distinct text report formats into tables and extract key information to enable cost-efficient analyses in multiple joint defense matters;
  • digitized a large set of image-based account statements with various counterparties and automated the creation of machine-readable transaction datasets;
  • identified PDFs of emails containing relevant trade tables in a document dump of 250 thousand pages, then programmatically extracted and aggregated the data into a database; and
  • extracted and structured entries from consumer complaint forms into a comprehensive database.

In litigation, we often deal with sensitive client information. That is why Cornerstone Research has invested heavily in secure infrastructure, including high-performance and high-throughput analytical servers and storage clusters. Our analytical infrastructure is on-premises, meaning client data is never exposed to the web. We have also invested in a number of software tools and programming languages to add high-quality text layers to documents, quickly extract tabular data, and develop tailored approaches to extracting key information. Finally, we have invested in people: we have exceptional data scientists and practitioners with many years of experience across a large number of different clients and projects.

What are some of the challenges of working with this kind of data?

Extracting meaningful information from unstructured data is nuanced for a number of reasons. We can use documents stored in PDF file format (.pdf) as an example. PDF files are stored as vector graphics (essentially an image). Some PDF files may also contain a layer of text that can be combined with the image to render a searchable PDF document, but not all do. So before any text extraction can begin, an interpreted text layer that is based on the underlying images must be added to the PDFs.

The number of files and size of each document also pose processing time issues. Clients can easily provide thousands, if not millions, of PDF files that are each thousands of pages long. Without the proper hardware, software, and coding capabilities, processing these files manually would take years of person-hours and be prohibitively expensive.

Finally, the content of the documents may vary widely. One document alone may contain information in several types of formats. This means any attempt to extract meaningful data from the files requires extremely high precision in distinguishing different reports from each other, but at the same time must have the flexibility to capture key information expressed in different formats.

Can you walk us through an overview of how Cornerstone Research typically approaches working with unstructured data?

We can use our example of PDF documents to show how we transform unstructured information into a structured format that can be used in analyses. The first stage in any text extraction exercise is to review a sample of the documents and determine the key pieces of information essential to analysis. This step is fundamental to understanding the structure of the contents.

The next step when we are preprocessing PDF files is to ensure that they contain what is commonly referred to as a “text layer.” The text layer of each document is then separated from its original PDF and stored as a plain text file (file extension .txt), which lends itself to highly efficient and flexible methods of processing.

Once documents are stored as plain text, we run them through proprietary software programs. Employing complex conditional logic and a text matching language, the programs discern relevant data, including different report types and sections, metadata such as dates and client identifiers, and tables containing records of interest.
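As a rough illustration of this kind of pattern-based extraction, the sketch below uses Python regular expressions to pull a report type, metadata, and table rows out of a plain-text report. It is not Cornerstone Research's proprietary software: the report layout, the field names, and the `parse_report` helper are all hypothetical and chosen only to make the idea concrete.

```python
import re

# Hypothetical plain-text report; the layout is invented for illustration.
SAMPLE = """\
DAILY POSITION REPORT
Report Date: 2020-03-31
Client ID: C-1042

Account    Symbol   Quantity
A100       XYZ      250
A101       ABC      1,200
"""

# One pattern per piece of information we want to recognize.
HEADER_RE = re.compile(r"^(?P<report_type>[A-Z ]+ REPORT)$", re.MULTILINE)
DATE_RE = re.compile(r"^Report Date:[ \t]*(?P<date>\d{4}-\d{2}-\d{2})$", re.MULTILINE)
CLIENT_RE = re.compile(r"^Client ID:[ \t]*(?P<client>\S+)$", re.MULTILINE)
ROW_RE = re.compile(
    r"^(?P<account>[A-Z]\d+)[ \t]+(?P<symbol>[A-Z]+)[ \t]+(?P<qty>[\d,]+)$",
    re.MULTILINE,
)


def parse_report(text):
    """Return one dict per table row, each tagged with the report's metadata."""
    header = HEADER_RE.search(text)
    date = DATE_RE.search(text)
    client = CLIENT_RE.search(text)
    # Conditional logic: refuse to guess when the report is not recognized.
    if not (header and date and client):
        raise ValueError("unrecognized report format")
    meta = {
        "report_type": header.group("report_type"),
        "report_date": date.group("date"),
        "client_id": client.group("client"),
    }
    rows = []
    for match in ROW_RE.finditer(text):
        row = dict(meta)
        row["account"] = match.group("account")
        row["symbol"] = match.group("symbol")
        # Strip thousands separators before converting to an integer.
        row["quantity"] = int(match.group("qty").replace(",", ""))
        rows.append(row)
    return rows


rows = parse_report(SAMPLE)
```

In practice each distinct report format would likely need its own set of patterns, which is where the balance between precision and flexibility described earlier comes in.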

To turn the extracted information into a format that can be analyzed, we load the now-structured text into a database. We take advantage of parallel processing to load multiple intermediate files at once, and data from all records are loaded to a single table or multiple tables.
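The loading step might look something like the following sketch, which reads several intermediate files concurrently and writes all records to a single table. The choices here are assumptions made for illustration: CSV as the intermediate format, SQLite as the database, a thread pool for concurrency, and the `records` table with its columns. A production pipeline on analytical servers would likely use a different database and process-level parallelism.

```python
import csv
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_intermediate(path):
    """Read one intermediate CSV file into (account, symbol, quantity) tuples."""
    with open(path, newline="") as fh:
        return [
            (r["account"], r["symbol"], int(r["quantity"]))
            for r in csv.DictReader(fh)
        ]


def load_all(paths, conn, workers=4):
    """Read files in parallel, then insert every record into one table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records "
        "(account TEXT, symbol TEXT, quantity INTEGER)"
    )
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Files are read concurrently; inserts happen on the calling thread,
        # so the SQLite connection is never shared across threads.
        for rows in pool.map(read_intermediate, paths):
            conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]


def demo():
    """Write three small hypothetical files, load them, and summarize."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(3):
            path = Path(tmp) / f"part_{i}.csv"
            with open(path, "w", newline="") as fh:
                writer = csv.writer(fh)
                writer.writerow(["account", "symbol", "quantity"])
                writer.writerow([f"A{i}", "XYZ", 100 * (i + 1)])
                writer.writerow([f"B{i}", "ABC", 10 * (i + 1)])
            paths.append(path)
        conn = sqlite3.connect(":memory:")
        total_rows = load_all(paths, conn)
        total_qty = conn.execute("SELECT SUM(quantity) FROM records").fetchone()[0]
        conn.close()
        return total_rows, total_qty


result = demo()
```

Keeping the reads parallel and the writes on one connection is a simple way to avoid write contention; a database server that supports concurrent bulk loads would allow the inserts to be parallelized as well.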

The final step is to validate the extracted data’s quality. Our QA processes include independently replicated text extraction to verify results; calculating coverage statistics to ensure there are no gaps in information; and frequent collaboration with subject matter experts to control the quality of the product.
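Two of these checks lend themselves to a short sketch: a coverage statistic that flags gaps against an expected set of report dates, and a comparison of two independently produced extractions. The function names, the use of calendar dates as the unit of coverage, and the sample values are all assumptions for illustration, not a description of the actual QA tooling.

```python
from datetime import date, timedelta

_MISSING = object()  # sentinel so an absent key never equals a stored None


def coverage(expected, extracted):
    """Return (share of expected items found, sorted list of missing items)."""
    expected = set(expected)
    found = expected & set(extracted)
    missing = sorted(expected - found)
    ratio = len(found) / len(expected) if expected else 1.0
    return ratio, missing


def mismatches(primary, replica):
    """Return keys whose values differ between two independent extractions."""
    keys = set(primary) | set(replica)
    return sorted(
        k for k in keys
        if primary.get(k, _MISSING) != replica.get(k, _MISSING)
    )


# Hypothetical example: five daily reports expected, three extracted.
expected_dates = [date(2020, 3, 1) + timedelta(days=i) for i in range(5)]
extracted_dates = [expected_dates[0], expected_dates[1], expected_dates[3]]
ratio, missing = coverage(expected_dates, extracted_dates)

# Hypothetical example: the replica disagrees on "b" and has an extra "c".
diff = mismatches({"a": 1, "b": 2}, {"a": 1, "b": 3, "c": 4})
```

Here `missing` lists the dates with no extracted report, and `diff` lists the records that would need manual review because the two extractions disagree.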

Briefly, what are some other examples of how Cornerstone Research works with unstructured data?

By far, the most common type of unstructured data processing in our work resembles the example above, where we extract and organize unstructured data that is visually tabular in nature. Increasingly, however, we deal with more complex extractions from and characterizations of text, image, and even audio and video documents. This work sometimes focuses on extracting concrete information from documents, like critical references in free-form text, text transcriptions from video clips, and logo and product detection in images.

In other instances, we aim to quantify more abstract concepts, like the sentiment associated with social media posts, topic composition of public press articles, and the characterization of multimedia marketing materials. This work typically utilizes AI, machine learning, and text analytics techniques to analyze unstructured data. We hope to cover these topics in more depth in future installments of this series.

Unstructured data can provide windows into every facet of an organization and its processes, and the growth of unstructured data is expected to accelerate as machine-generated data and machine learning initiatives become more widely used. The quality of data extracted from our process is repeatable and reliable and can be effectively leveraged to support expert analyses in litigation and regulatory settings.


The views expressed herein do not necessarily represent the views of Cornerstone Research.

Interviewee

  • San Francisco

Mike DeCesaris

Vice President, Data Science Center