Building machine learning pipelines to extract structured data from unstructured text is a popular problem with an unpopular development lifecycle. We’ll talk through how you can use LLMs so that your schema can interrogate your unstructured text for structured data in a declarative, type-safe way.
The challenge of converting unstructured text into structured, usable data is a well-known adversary of engineers, analysts, and data scientists alike. Traditionally, this task has been the exclusive domain of specialists, often requiring a bespoke model for each data feature. Missed a feature? Let’s circle back next quarter.
In this talk we’ll see that Large Language Models are surprisingly effective not only at rote extraction of structured data from documents, but at extracting derived information, and at doing both in a type-safe way that adheres to your data model. We’ll show how Marvin’s AI Models, grounded in Pydantic, let you interrogate your data through your data model by combining the potent reasoning capabilities of LLMs with the type boundaries Pydantic enforces. Because pipelines are built solely from the data model’s schema, engineers and analysts get a declarative development experience with NLP.
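As a minimal sketch of the idea: the schema itself is the pipeline. Assuming Marvin’s `ai_model` decorator (Marvin 1.x) and an `OPENAI_API_KEY` in the environment, decorating a Pydantic model lets you call the class directly on free text; the LLM call is commented out here so the type-boundary part runs on its own, and the example text is illustrative, not from the talk.

```python
from pydantic import BaseModel, ValidationError

# from marvin import ai_model  # Marvin 1.x decorator; requires an API key

# @ai_model  # uncomment (with the import) to let the class parse free text
class Patient(BaseModel):
    name: str
    age: int
    medications: list[str]

# With the decorator applied, calling the class on unstructured text
# would return a validated instance, roughly:
#   Patient("John Doe, 54, takes lisinopril and metformin daily")
#   -> Patient(name='John Doe', age=54, medications=[...])

# Even without the LLM, Pydantic enforces the type boundary:
# malformed values raise instead of flowing downstream.
try:
    Patient(name="John Doe", age="fifty-four", medications=[])
except ValidationError as e:
    print("rejected:", len(e.errors()), "validation error(s)")
```

The declarative payoff is that adding a feature means adding a field to the model, not training a new bespoke extractor.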
We’ll go through real-life applications of how LLMs are being used in production: structuring electronic health records, developing custom entity-extraction pipelines, generating synthetic data for test-driven development, and automating schema normalization for data warehousing.
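To make the schema-normalization case concrete, here is a hypothetical sketch (the `Department` categories and field names are invented for illustration): constraining a field to an `Enum` means an LLM-backed model can only emit canonical warehouse values, and anything else fails Pydantic validation.

```python
from enum import Enum
from pydantic import BaseModel

class Department(str, Enum):
    CARDIOLOGY = "cardiology"
    ONCOLOGY = "oncology"
    NEUROLOGY = "neurology"

# @ai_model  # with Marvin's decorator, messy free-text rows like
#            # "Heart & Vascular Clinic" would have to land on a
#            # canonical Department member or fail validation
class WarehouseRecord(BaseModel):
    department: Department
    visit_year: int

# Plain Pydantic already coerces known canonical strings onto the enum:
r = WarehouseRecord(department="oncology", visit_year=2023)
print(r.department.value)  # -> oncology
```

The enum acts as the normalization target: the data model, not ad hoc string-cleaning code, defines what a valid warehouse row looks like.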