Using AI to list the lowest bulk-price products in Mercadona for me (Langchain + Ollama + openhermes)
So, let’s start with a disclaimer: this project is intended purely for researching the capabilities of current open-source LLMs on a task I think they might be suited for. So no commercial use, etc, etc, etc…
Moving on to… Why?
I’ll be scraping “Mercadona”, a well-known Spanish grocery chain. They have an API which I was able to locate and more or less reverse engineer using the dev tools in Chrome. I won’t slow down on this step, as it’s pretty basic: I got all the products, saved them in a JSON file, and moved on to analyzing my data.
The “scraper” (just a for loop making requests)
import requests
import json
from time import sleep

product_dict = {}
# Category IDs seem to range from 1 to 199
for i in range(1, 200):
    request = requests.get(f'https://tienda.mercadona.es/api/categories/{i}')
    sleep(1.5)  # be gentle with the API
    if request.status_code == 200:
        data = request.json()
        print(f'Category {i} OK')
        print(f'Name: {data["name"]}')
        print(f'Len of categories: {len(data["categories"])}')
        product_dict[data['name']] = data['categories']
    else:
        print(f'Error: {request.status_code}')

# Save the dictionary in a json file
with open('products.json', 'w') as fp:
    json.dump(product_dict, fp)
The script is pretty simple: it just iterates from 1 to 199 (which seems to be roughly the current number of categories). For some IDs I got a 404, so that category was probably deleted, which is why I check the status code (always check the status code!). Then I save the products to a JSON file, which comes out to about 9 MB.
Products.json file
So! What do we want from this file? We have the following structure:
{
  "Fruta y verdura": [ // category
    {
      "id": 27,
      "name": "Fruta", // subcategory
      "layout": 1,
      "products": [ // product list
        {
          "id": "3318",
          "limit": 1000,
          "badges": {
            "is_water": false,
            "requires_age_check": false
          },
          "packaging": "Bandeja",
          "thumbnail": "https://prod-mercadona.imgix.net/20190521/18/3318/vlc1/3318_00_10.jpg?fit=crop&h=206&w=206",
          "display_name": "Uva negra sin semillas",
          ...
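If you want to sanity-check the dump before doing anything clever with it, a quick throwaway snippet (not part of the final script) can walk that same structure and count what's inside:

import json

# Throwaway sanity check: how many subcategories and products does each category hold?
with open('products.json', 'r') as fp:
    data = json.load(fp)

for category, subcategories in data.items():
    n_products = sum(len(sub.get('products', [])) for sub in subcategories)
    print(f'{category}: {len(subcategories)} subcategories, {n_products} products')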
Okay, so now that we know the file structure, on to the next point: the prompter (or the script used to call Ollama).
AI time!
Okay, so now, let’s sit and talk here for a second. Every damn product I’ve built that worked with AI worked with either GPT-4 from OpenAI or GPT-4 from Azure; it did not work well on local models, mainly because of my own hardware limitations. I wasn’t able to run models like llama2:13b, they would crash every time. So I did what every developer does in this situation: ask another AI to fix my AI… Nah, just kidding. I decided that, in order to reduce the number of tokens in my prompt, I would filter the item I wanted to look up down to its corresponding category and subcategory, then sort by bulk_price and send at most 35 products. This got me coherent responses out of openhermes 2.5 7b and wizard-math 7b.
Here’s the script:
import json

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_community.llms.ollama import Ollama

# openhermes-custom is openhermes 2.5 7b with a lowered temperature (see the Modelfile note below)
ollama_llm = Ollama(model="openhermes-custom", callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]))
math_llm = Ollama(model="wizard-math:7b", callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]))

item = "Magdalenas|muffin|bollería"

with open('products.json', 'r') as fp:
    products = json.load(fp)

# Step 1: let the model pick the category that best matches the item
category_msg = f"Find the most similar product to {item} in the following categories: {products.keys()}, respond only with the category name, nothing else, no quotes. If the item matches more than one category, choose the most relevant one. Context is classifying food into categories"
category_to_use = ollama_llm(category_msg).strip()  # strip trailing whitespace/newlines from the model output
if category_to_use not in products.keys():
    print(f"Category {category_to_use} not found in {products.keys()}")
    exit(1)

# Step 2: same trick, one level down, to pick the subcategory
subcategories_list = products[category_to_use]
subcategory_names = [subcategory['name'] for subcategory in subcategories_list]
subcategory_msg = f"Find the most similar product to {item} in the following subcategories: {subcategory_names}, respond only with the subcategory name, nothing else, no quotes. If the item matches more than one subcategory, choose the most relevant one. Context is classifying food into subcategories"
subcategory_to_use = ollama_llm(subcategory_msg).strip()
if subcategory_to_use not in subcategory_names:
    print(f"Subcategory {subcategory_to_use} not found in {subcategory_names}")
    exit(1)

# Step 3: collect name and prices for every product in the chosen subcategory
product_parsed_list = []
for subcategory in subcategories_list:
    if subcategory['name'] == subcategory_to_use:
        for product in subcategory['products']:
            try:
                product_parsed_list.append({
                    'display_name': product['display_name'],
                    'bulk_price': product['price_instructions']['bulk_price'],
                    'unit_price': product['price_instructions']['unit_price'],
                    'iva': product['price_instructions']['iva'],
                })
            except KeyError:
                print(f"KeyError: {product['display_name']}")

if len(product_parsed_list) == 0:
    print(f"No products found for subcategory {subcategory_to_use}")
    exit(1)

# Step 4: sort by bulk price (cast to float in case the API returns it as a string)
product_parsed_list_sorted = sorted(product_parsed_list, key=lambda k: float(k['bulk_price']))
for product in product_parsed_list_sorted:
    print(
        f"{product['display_name']} bulk_price: {product['bulk_price']} unit_price: {product['unit_price']} iva: {product['iva']}"
    )
if len(product_parsed_list) >= 35:
    # keep only the 35 cheapest so the prompt fits in the local model's context
    print('Found too many elements, sorting and using the first 35')
    product_parsed_list_sorted = product_parsed_list_sorted[:35]
print(f"Found {len(product_parsed_list_sorted)} products for subcategory {subcategory_to_use}")

# Step 5: ask the math model to pick the top 3 cheapest matches
final_calculation_msg = f"""Using the below list, search the keyword or keywords "{item}" and order the most relevant products to the keyword or keywords by the LOWEST bulk price. Print a simple top 3 product list, no code, nothing but the list I ask you to print ordered by specified criteria. Context is finding the best price food product price for a given food item
List: {product_parsed_list_sorted}
"""
final_calculation_to_use = math_llm(final_calculation_msg)
Pretty rough code to read, I know! I spent way too much time figuring out how to not blow up the model, or how to stop getting random F R E N C H answers when asking for the lowest-priced products.
Note: the model is called openhermes-custom because I lowered its temperature using a Modelfile.
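I didn’t include my exact Modelfile, but a minimal sketch of the idea looks something like this (the base model tag and the temperature value here are placeholders, not my actual settings):

# Hypothetical Modelfile for a lower-temperature variant of openhermes
FROM openhermes:latest
PARAMETER temperature 0.2

The custom model then gets registered with ollama create openhermes-custom -f Modelfile and can be referenced by name from Langchain, as in the script above.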
Testing time :D
Okay, so:
- Asking for the lowest-priced muffins:
Top 3 products with the lowest bulk price for "Magdalenas", "muffin", or "bollería":
1. Magdalenas Hacendado - Bulk Price: 1.63
2. Sobaos Hacendado - Bulk Price: 2.21
3. Bizcochos al huevo Hacendado palitos - Bulk Price: 2.50
This is indeed a correct response.
- Asking for the lowest-priced beer:
Top 3 products with the lowest bulk price for "Cerverza|beer|cerveza":
1. Cerveza Suave Steinburg - Bulk Price: 0.65
2. Cerveza Clásica Steinburg - Bulk Price: 0.65
3. Cerveza Pilsen Sabor a Sur Steinburg - Bulk Price: 0.88
This one is almost correct too: third place should have gone to the same beer in another format, but the model decided not to repeat the product (not something I asked for) and promoted the 4th-place beer to 3rd instead.
But! Not everything is perfect; I ran into these main issues:
- For certain keywords, I just could not get a clean answer (something was making the model ignore the “no code” rule) and it tried to write code to find the best-priced items instead
- When matching the item against a category, I usually get the same result for the same words, but sometimes it varies and the model either hallucinates or answers with something unrelated (see the sketch after this list for one way I could make that check more forgiving)
- If you run out of context (which is pretty easy to do with local LLMs), the model will not answer at all, and I wasn’t able to get agents working properly
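For the category-matching hiccups specifically, one mitigation I could try is normalizing the model’s answer (trimming whitespace and stray quotes) and retrying a couple of times before bailing out. This is just a sketch built around the same prompt the script already uses; pick_category is a hypothetical helper, not something that exists in the code above:

# Hypothetical helper: normalize the model's answer and retry before giving up
def pick_category(llm, item, categories, retries=3):
    lookup = {name.lower(): name for name in categories}
    prompt = (
        f"Find the most similar product to {item} in the following categories: "
        f"{list(categories)}, respond only with the category name, nothing else, no quotes."
    )
    for _ in range(retries):
        answer = llm(prompt).strip().strip('"').strip("'")
        if answer.lower() in lookup:
            return lookup[answer.lower()]
    return None  # caller decides what to do if nothing matches

# Usage (replacing the direct call in the script):
# category_to_use = pick_category(ollama_llm, item, products.keys())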
Open-source LLMs that run on affordable computers are far from perfect, but you can play with them and get reasonably solid results.
That’s all folks!