What’s for dinner?
As my recipe collection expands, it’s only become harder to answer that question. I have text documents inherited from friends and family, a growing pile of paper cookbooks, and photos my mother sends me of newspaper clippings pasted into any of her dozen three-ring recipe binders. Searching across these sources is tricky enough when I know the exact item I’m looking for; when I want to survey my options under specific constraints, it becomes borderline impossible. Say I have half a cup of sour cream that needs to get used up today, my oven is out of commission, and I’m hosting a vegetarian diabetic with celiac disease… what’s for dinner now?
Intuitively, I want to treat this as a data schema problem. A recipe has a certain structure to it that’s immediately apparent even to an amateur cook; there’s a title, a list of paired ingredients and quantities, and instructions. Most professionally created recipes will provide even more data, like cook time, number of servings, or an equipment list. We might furthermore be inclined to sort our recipes into multiple functional categories like “entrée/side/dessert,” “vegan/vegetarian/omnivore,” and “keeper/to try/DO NOT MAKE AGAIN.”
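To make that concrete, here’s a sketch of the kind of record I have in mind. The recipe and every field name are invented for illustration; this isn’t any existing standard:

```python
# A hypothetical target schema for a single recipe, expressed as a Python
# dict. Field names and values are illustrative only.
recipe = {
    "title": "Mushroom Stroganoff",
    "servings": 4,
    "cook_time_minutes": 35,
    "categories": ["entrée", "vegetarian", "keeper"],
    "equipment": ["large skillet", "large pot"],
    "ingredients": [
        {"amount": "1 lb.", "ingredient": "cremini mushrooms, sliced"},
        {"amount": "½ cup", "ingredient": "sour cream"},
        {"amount": "to taste", "ingredient": "salt and black pepper"},
    ],
    # The instructions can stay a plain blob of text for human eyes.
    "instructions": "Sauté the mushrooms in butter over medium-high heat…",
}
```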
But chefs, for all their wisdom, do not generally publish their work in JSON format via a convenient API, and so the text of a recipe is, for all intents and purposes, unstructured data. This is particularly true for physical recipes, which don’t even have implicit HTML structure a web scraper could use to begin to organize the data. And thus, I have no standard, centralized, searchable recipe repository I can use the way I want.
With the problem defined, the shape of the solution becomes clear. We (me; my mother, whose recipe collection is even more sprawling and physical than mine; and maybe you?) need a system that will:
- Ingest the unstructured data of a user’s full recipe collection (via photos of paper recipes and copy-pastes from text documents and websites).
- Allow the user to query the data in the kind of terms they need for real-world meal planning: looking for specific ingredients, excluding specific equipment, feeding at least a specific number of diners, et cetera.
And for that, we need:
- A front end that supports the uploading of raw recipe data, checking the processed output for errors, and querying the database of recipes.
- A component that performs optical character recognition (OCR) on image input (likely the most common input type for many collections) to convert photos into text.
- A component that takes raw recipe text and converts it into a semistructured schema.
- A data storage layer that makes the recipes available for complex queries. (A sketch of how these components fit together follows this list.)
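Sketched as stubs, with every function standing in for one of the components above (the names are mine, purely illustrative):

```python
# Hypothetical end-to-end ingestion flow; each stub stands in for a
# component listed above.

def ocr_image(image_bytes: bytes) -> str:
    """OCR component: photo of a paper recipe in, raw text out."""
    ...

def structure_recipe(raw_text: str) -> dict:
    """Parsing component: raw recipe text in, semistructured record out."""
    ...

def store_recipe(recipe: dict) -> None:
    """Storage layer: persist the record where complex queries can reach it."""
    ...

def ingest(image_bytes: bytes) -> dict:
    raw_text = ocr_image(image_bytes)
    recipe = structure_recipe(raw_text)
    # The front end would surface `recipe` here for human error-checking
    # before committing it to the database.
    store_recipe(recipe)
    return recipe
```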
Early experiments
I first articulated this problem in early 2021, and there were more roadblocks at the time than I’d anticipated. OCR was hardly a new field, but early experiments with Amazon Textract were uninspiring. Moderate text distortions like curved pages caused the service to skip and reorder words, and worst of all, it wasn’t appropriately sensitive to columnar organization – that is, if the ingredients list was a neat stack on the left of the page alongside the instructions, the service would read straight across both columns line by line, rather than treating the left column as one unit followed by the right. This was disheartening.
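For reference, the experiment amounted to little more than Textract’s basic text-detection call; a minimal sketch, with a placeholder filename:

```python
import boto3

# Basic Textract text detection, roughly as tried in 2021. LINE blocks
# come back in detected order, which for a two-column recipe interleaves
# the ingredient list with the instructions.
textract = boto3.client("textract")

with open("recipe-photo.jpg", "rb") as f:  # placeholder filename
    response = textract.detect_document_text(Document={"Bytes": f.read()})

for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```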
Apple released iOS 15 in 2021, and with it a very heartening feature: on-device OCR. If you’ve never tried it, iPhone OCR is very sensitive to columns and quite good at its job, enabling users to copy and paste text from the camera into, for example, a front-end web form. It wasn’t the exact snap-and-send solution I envisioned, and it had a few downsides – there’s a fussiness to arranging the camera so the OCR will pick up the text block of interest, as well as the nuisance of having to run the feature multiple times to capture separate blocks.
At the time, I chose to consider the matter provisionally solved and turned my attention to the third component. The data that was most important to extract from the recipe, I figured, was the ingredients list; the instructions could be retained as a blob of text for human eyes only, but the ingredients needed to be queryable. I set about training a custom entity recognizer in Amazon Comprehend to detect amounts and ingredients. As I collected samples to train the model, it struck me that the problem was less tractable than it initially seemed. Consider that “1 tsp.,” “1,” “1½ to 2 cups,” “a pinch,” and “to taste” are all amounts, while “5- to 7-ounce bone-in chicken thighs,” “bay leaf, crumbled,” and “sour cream, Greek yogurt, or crème fraîche” are all ingredients. Nuances of numerals and punctuation preclude any hard-and-fast rules that could cleanly demarcate the two categories, but that’s what machine learning is for, right?
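To give a sense of those samples, here’s the gist of the labeled data. Comprehend’s actual annotation format is a CSV of character offsets into the training documents, so take this only as the shape of the examples:

```python
# Ingredient lines paired with labeled spans; illustrative examples only.
labeled_lines = [
    ("1 tsp. baking soda",
     [("1 tsp.", "AMOUNT"), ("baking soda", "INGREDIENT")]),
    ("1 egg",
     [("1", "AMOUNT"), ("egg", "INGREDIENT")]),
    ("salt, to taste",
     [("salt", "INGREDIENT"), ("to taste", "AMOUNT")]),
    ("1½ to 2 cups sour cream, Greek yogurt, or crème fraîche",
     [("1½ to 2 cups", "AMOUNT"),
      ("sour cream, Greek yogurt, or crème fraîche", "INGREDIENT")]),
]
```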
As it turned out, the custom entity recognizer had an overall F1 score of about 89, with precision a little lower and recall a little higher. The F1 subscore for recognizing amounts was higher (96); for ingredients, lower (80). These numbers were pretty impressive, but in practice, average-length recipes consistently contained enough errors that the correction step would have been too much of a chore to be worth implementing. More training data likely would have improved the scores, but at the time, I let the project lapse.
What if we used a sledgehammer to crack this nut?
In November of 2022, OpenAI launched ChatGPT to massive interest. Large language models had existed in less accessible forms before then, but the low barrier to entry for experimentation meant that “new” capabilities and techniques were entering public consciousness at an incredible rate.
Hey, I thought to myself. If these things can write code based on comments, maybe they can turn a recipe into a JSON?
Early tests were extremely promising, even with an unsophisticated prompt (a sketch of which follows the list below). Still, the models I tested made a few mistakes:
- Taking “1 egg” as an amount with no associated ingredient
- Adding undesired commentary like “This is a basic example of a recipe” or “Here is the recipe formatted in JSON”
- One missed an ingredient! But it was just walnuts, so maybe it was making an executive decision
- Struggling to infer a complete equipment list
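The unsophisticated prompt in question was essentially an instruction plus the raw text. The wording below is a reconstruction rather than the original, but it captures the approach; note the closing instruction, which is the obvious lever against the stray-commentary problem:

```python
# Roughly the shape of the early-test prompt; a reconstruction, not the
# original wording.
raw_recipe_text = "…"  # OCR output or pasted recipe text goes here

PROMPT_TEMPLATE = """Convert the following recipe into JSON with the keys
"title", "servings", "equipment", "ingredients" (a list of objects, each
with an "amount" and an "ingredient"), and "instructions".
Respond with only the JSON, with no additional commentary.

Recipe:
{recipe_text}
"""

prompt = PROMPT_TEMPLATE.format(recipe_text=raw_recipe_text)
```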
(To much less fanfare, in September 2023, Textract released a Layout feature that, in initial tests, does a splendid job of separating out chunks and even assigning them to tentative categories.)
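Enabling Layout is a matter of requesting it as a feature type on Textract’s existing AnalyzeDocument API; a minimal sketch, again with a placeholder filename:

```python
import boto3

textract = boto3.client("textract")

with open("recipe-photo.jpg", "rb") as f:  # placeholder filename
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["LAYOUT"],
    )

# Layout blocks come back typed (LAYOUT_TITLE, LAYOUT_SECTION_HEADER,
# LAYOUT_LIST, LAYOUT_TEXT, and so on) with the underlying LINE blocks
# attached as children, so a two-column recipe stays two chunks.
for block in response["Blocks"]:
    if block["BlockType"].startswith("LAYOUT_"):
        print(block["BlockType"])
```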
The tech is there. It’s time to apply it to the problem.
Next steps
To iron out irregularities in output, I intend to perform more prompt engineering and to fine-tune several models (Titan, Llama 2, and Command), with an eye toward optimizing them for the specific output of Textract’s Layout feature and for recipes pasted from websites and text documents. Comparing models fine-tuned on the same datasets will give a better sense of which is best suited to the problem. Additionally, I’ll start to build up document databases in DocumentDB and OpenSearch to compare their suitability as a data store. That just leaves the front end and the orchestration of all the components… we’ll see about that, too.
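As a preview of the payoff, the sour-cream scenario from the top of this post becomes a query. Here’s a sketch against DocumentDB via pymongo (DocumentDB speaks the MongoDB wire protocol); the connection string, collection name, and category tags are all placeholders of my own invention:

```python
from pymongo import MongoClient

# Hypothetical query layer; connection string, database/collection names,
# and category tags are placeholders.
client = MongoClient("<documentdb-connection-string>")
recipes = client.cookbook.recipes

# "Half a cup of sour cream, no oven, vegetarian, diabetic, celiac":
results = recipes.find({
    "ingredients.ingredient": {"$regex": "sour cream", "$options": "i"},
    "equipment": {"$ne": "oven"},
    "categories": {"$all": ["vegetarian", "gluten-free", "diabetic-friendly"]},
})

for recipe in results:
    print(recipe["title"])
```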