When it comes to code refactoring, David can sometimes beat Goliath.
In our experiment, smaller and lesser-known LLMs like Claude-Haiku and Mistral surprised us by outperforming industry heavyweights such as GPT-4.
The task? Refactor a Shopify invoice generator to enhance efficiency and scalability using GraphQL.
As LLMs grow increasingly central to software development, their real-world efficacy becomes a pressing question. This experiment highlights an important insight: the size and fame of the model isn’t always the best predictor of success.
The Challenge: Simplifying Shopify Invoicing with GraphQL
Our experiment revolved around refactoring a Shopify invoice generator plagued by inefficiencies. The existing implementation, built on Shopify's REST API, required multiple redundant API calls for every order processed:
- 1 call for order details.
- 1 call per line item for inventory item IDs.
- 1 call per line item for HSN codes.
For an order with five line items, this approach generates 11 API calls—a significant performance bottleneck. Consolidating these calls into a single GraphQL query offered a clear path to optimization.
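To make the fan-out concrete, here is a rough sketch of the call pattern described above. It is illustrative only: the endpoint paths, API version, field names, and helper name are assumptions, not code from the repository.

```python
import requests

SHOP = "example.myshopify.com"               # assumed store domain
HEADERS = {"X-Shopify-Access-Token": "..."}  # assumed auth header

def fetch_invoice_rows_rest(order_id):
    """Illustrative only: one call for the order, then two more calls per line item."""
    order = requests.get(
        f"https://{SHOP}/admin/api/2024-01/orders/{order_id}.json",
        headers=HEADERS,
    ).json()["order"]                         # call 1: order details

    rows = []
    for item in order["line_items"]:
        variant = requests.get(
            f"https://{SHOP}/admin/api/2024-01/variants/{item['variant_id']}.json",
            headers=HEADERS,
        ).json()["variant"]                   # +1 call per item: inventory item ID

        inventory_item = requests.get(
            f"https://{SHOP}/admin/api/2024-01/inventory_items/{variant['inventory_item_id']}.json",
            headers=HEADERS,
        ).json()["inventory_item"]            # +1 call per item: HSN code

        rows.append({
            "title": item["title"],
            "hsn_code": inventory_item.get("harmonized_system_code"),
        })
    return rows                               # 5 line items -> 1 + 5 + 5 = 11 requests
```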
Why GraphQL?
Shopify's GraphQL API can fetch all necessary data in a single query, reducing latency and simplifying the codebase. Here's a sample query illustrating the improvement:
query GetOrderDetails($orderId: ID!) {
  order(id: $orderId) {
    id
    lineItems(first: 250) {  # connections require a pagination argument
      edges {
        node {
          variant {
            inventoryItem {
              harmonizedSystemCode  # HSN code, previously fetched via separate REST calls
            }
          }
        }
      }
    }
  }
}
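In the refactored code, a single call through the repository's graphql_request wrapper can replace the entire REST fan-out. The sketch below is illustrative only; the import path, the wrapper's signature, and its return shape are assumptions.

```python
from api_client import graphql_request  # provided wrapper; import path and signature assumed

ORDER_QUERY = """
query GetOrderDetails($orderId: ID!) {
  order(id: $orderId) {
    id
    lineItems(first: 250) {
      edges {
        node {
          variant {
            inventoryItem { harmonizedSystemCode }
          }
        }
      }
    }
  }
}
"""

def fetch_hsn_codes(order_gid):
    """Illustrative only: one GraphQL round trip replaces the REST fan-out."""
    # Assumes the wrapper returns the already-unwrapped `data` payload and
    # accepts query variables as a keyword argument.
    data = graphql_request(ORDER_QUERY, variables={"orderId": order_gid})
    return [
        ((edge["node"].get("variant") or {}).get("inventoryItem") or {}).get(
            "harmonizedSystemCode"
        )
        for edge in data["order"]["lineItems"]["edges"]
    ]
```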
How We Put LLMs to the Test
The evaluation process was designed to assess how effectively each LLM adapted to the task requirements and how quickly it arrived at a correct solution.
Setup
- Codebase: The task used the `invoice-rest2graphql` branch of the aurovilledotcom/gst-shopify repository as the baseline.
- Tools: The LLM Context tool extracted relevant code snippets and prepared structured prompts for the models.
Interaction Process
The interaction process involved iterative testing and refinement:
First Interaction: Initial Output
First Prompt: Context Setup
Each model received a system prompt and comprehensive code snippets, generated using the LLM Context tool. The provided files included:
- `/gst-shopify/e_invoice_exp_lut.py`: contains the invoice generation code to be refactored
- `/gst-shopify/api_client.py`: includes the GraphQL API wrapper for data retrieval
Second Prompt: Detailed Task Instructions
The second prompt outlined a clear, step-by-step guide to the solution, focusing on:
- Replacing REST API calls with a consolidated GraphQL query.
- Using the `graphql_request` wrapper for error handling and retries.

The output from the prompt pair was merged into the codebase as commit `out-1` in the branch `det-r2gql/<model-name>`. If the solution worked, the process ended. Otherwise, errors were reported, prompts were refined, and new outputs were tested iteratively until no further progress was made.
Iteration Process
If the initial output contained errors—such as schema mismatches, incorrect query structures, or misinterpretation of the task—these were addressed through iterative prompts:
- Error Feedback: Models were provided with specific error messages, including test outputs or stack traces.
- Refined Prompts: Task instructions were clarified to address misunderstandings or overlooked details, like camelCase conventions in GraphQL.
- Testing and Integration: Each revised output was tested and committed (e.g., `out-2`, `out-3`). Iterations continued until a correct solution was achieved or progress stalled for two consecutive attempts.
Evaluation Criteria
The evaluation focused on two key metrics:
Correctness: Did the model produce a working solution that matched the output of the original REST implementation?
Iteration Count: How many iterations were required for the model to produce a correct solution? Iteration count serves as a proxy for developer productivity, reflecting how quickly a model enables a developer to solve a problem.
Where LLMs Fell Short
The models encountered several recurring challenges during the experiments, which significantly influenced their rankings:
- Schema Mismatches: Some models demonstrated incomplete or outdated knowledge of Shopify's GraphQL schema, leading to issues like incorrectly named or referenced attributes.
- Case Conventions: The map key names in the code needed to be refactored from snake_case (REST) to camelCase (GraphQL). Successful models handled this seamlessly, but others struggled, leaving the keys unchanged (see the sketch after this list).
- Wrapper Misuse: Several models hallucinated implementations of `graphql_request` instead of using the provided wrapper.
- Barcode Handling Oversight: Some models initially excluded `barcode` from their GraphQL query and set the invoice value to `""` or `None`. The issue initially escaped detection because our test data lacked barcodes, so the blank fields in the REST outputs coincidentally matched those produced by the model-generated code. Once identified, we opted not to redo all experiments and instead penalized these models by one iteration, which may understate the actual work needed to fix the issue.
- Decimal Precision Issues: Minor inconsistencies in decimal precision were observed for calculated fields (CDP) and price-related fields (PDP). While these issues did not affect correctness, they are noted in the results.
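To make the case-convention and barcode pitfalls concrete, the sketch below shows the key rename and the fallback a correct refactor needs; the dictionary shapes are assumptions for illustration, not the repository's actual data model.

```python
# Assumed response shapes, for illustration only.
rest_inventory_item = {"harmonized_system_code": "48202000"}  # REST: snake_case keys
gql_inventory_item = {"harmonizedSystemCode": "48202000"}     # GraphQL: camelCase keys
gql_variant = {"barcode": None}                               # barcode may be null or omitted

# A correct refactor reads the camelCase key...
hsn_code = gql_inventory_item.get("harmonizedSystemCode") or ""

# ...and still emits the same blank value the REST output contained
# when a barcode is missing, instead of dropping the field entirely.
barcode = gql_variant.get("barcode") or ""
```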
Results and Model Comparison
Results Guide
This table ranks models based on two primary metrics:
- Iteration Count: The number of attempts required to produce a working solution. Fewer iterations reflect higher efficiency and productivity.
- Penalties for Challenges: Challenges are noted and penalties applied as described above.
Model | Iterations | Notes |
---|---|---|
claude-haiku | 1 | CDP Deltas: det-r2gql/claude-haiku Site: https://claude.ai/new |
claude-3.5-sonnet-new | 2 | Wrong 'graphql_request', CDP Deltas: det-r2gql/claude-3.5-sonnet Site: https://claude.ai/new |
mistral on LeChat | 2 | Missed barcode, CDP Deltas: det-r2gql/mistral Site: https://chat.mistral.ai/chat |
o1-preview | 3 | 2 extra tries to find correct schema, PDP Deltas: det-r2gql/o1-preview Transcript |
grok-2-mini-beta | 3 | 2 extra tries for schema, missed barcode, PDP Deltas: det-r2gql/grok-2-mini-beta Site: https://x.com/i/grok |
llama-3.2 on WhatsApp | 3 | case convention mixup, hallucinated barcode value, CDP Deltas: det-r2gql/WA-llama-3.2 Site: https://web.whatsapp.com/ |
grok-2-beta | 3 | Wrong 'graphql_request', 1 extra try for schema, missed barcode, PDP Deltas: det-r2gql/grok-2-beta Site: https://x.com/i/grok |
gpt-4o | 3 | 1 extra try to find correct schema, missed barcode, PDP Deltas: det-r2gql/gpt-4o Transcript |
gemini-1.5-pro | 4 | 1 extra try for schema, multiple case convention mixup, PDP Deltas: det-r2gql/gemini-1.5-pro Site: https://aistudio.google.com/app/prompts/new_chat |
deepseek-r1-lite-preview | 4 | Wrong 'graphql_request', 1 extra try to find correct schema, case convention mixup, PDP Deltas: det-r2gql/deep-think Site: https://chat.deepseek.com/ |
gpt-4o-mini | 6 | Wrong 'graphql_request', multiple tries for schema, case convention mixup, PDP Deltas: det-r2gql/gpt-4o-mini Transcript |
gpt-4 | 8 | 2 tries to find correct schema, case convention mixup. Deltas: det-r2gql/gpt-4 Transcript |
gemini-1.5-flash | ❌ | Couldn't find working schema in 2 extra tries. Deltas: det-r2gql/gemini-1.5-flash Site: https://gemini.google.com/app |
o1-mini | ❌ | Couldn't find working schema in 2 extra tries Deltas: det-r2gql/o1-mini Transcript |
Note on Model Attribution: Some interfaces (WhatsApp, chat.mistral.ai) don't specify exact model versions. We use their provided names ('llama-3.2', 'mistral') though underlying versions may vary.
Untested Models
A few models that seemed interesting could not be evaluated due to external factors:
- nemotron-70b-instruct: (on https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-instruct) Persistent web interface errors prevented successful completion.
- llama-3.2-90b-text-preview: (on https://console.groq.com/playground?model=llama-3.2-90b-text-preview) Message size rate limits restricted testing.
- mixtral-8x7b: (on https://console.groq.com/playground?model=mixtral-8x7b-32768) Per minute rate limits prevented testing.
Diverse Models, Surprising Outcomes
This experiment revealed that smaller or lesser-known LLMs like Claude-Haiku and Mistral can outperform larger, more established models. Emerging models like Grok-2 and Llama-3.2 showed promising results, positioning themselves as serious contenders. In contrast, of industry leader OpenAI’s suite of five models, only two (o1-preview and gpt-4o) ranked among the top performers, while one (o1-mini) failed the test entirely.
While these results are specific to this experiment, they highlight the value of exploring diverse tools for development tasks.
Future Work
This experiment focused on guided problem-solving, where models executed a predefined solution plan. While this structured approach ensured straightforward comparisons between models, it also limited their opportunity to demonstrate creativity and independent problem-solving.
Future studies could explore how LLMs perform with minimal guidance, testing their ability to identify bottlenecks, propose solutions, and implement them autonomously.
Have a model you'd like us to test? Comment on the GitHub issue, and we'll include it in future evaluations.
Streamline your LLM interactions with LLM Context—a tool for automating code sharing with AI models.