r/LocalLLaMA is the real benchmark of LLM usability
Honestly, who really cares about an MMLU score if it doesn't provide any real value to the user?
There have been at least a dozen models that claimed to beat GPT-4 on some metric (Krutrim having the latest idiotic take on this).
Anyone who has used GPT-4 alongside open-source models knows the delta to close is still big. One thing OpenAI has done really well is trying to break GPT-4 as much as possible, rather than getting involved in the dick-measuring competition over MMLU.
LocalLLaMA has been my constant source of info, where people are actually testing these new models and techniques on real-world use cases. Unfortunately, they are also hard to keep up with! (LMSYS Chatbot Arena is another good one, but it still feels a bit drifted from the actual stuff people are using LLMs for.)
Even Karpathy agrees!
https://twitter.com/karpathy/status/1737544497016578453
Can we convert the LocalLLaMA forum into an actual scoreboard? Definitely an experiment worth trying out.
Anyway, for now, I wrote a small script to summarise daily threads from LocalLLaMA. Will continue posting the summaries regularly.
# Code to retrieve and summarize comments from a LocalLLaMA subreddit thread
import praw

# Initialize PRAW with your credentials
reddit = praw.Reddit(client_id='YourClientID',
                     client_secret='YourClientSecret',
                     user_agent='YourUserAgent')

post_url = 'https://www.reddit.com/r/LocalLLaMA/comments/ThreadID/'

# Retrieve the submission and expand all "load more comments" links
submission = reddit.submission(url=post_url)
submission.comments.replace_more(limit=None)

# Flatten the comment tree into "Author: ..." strings
all_comments = [f"Author: {comment.author}\n{comment.body}\n"
                for comment in submission.comments.list()]

# Prepare for summarization: concatenate everything into one string
total_str = '\n'.join(all_comments)

def get_summary(total_comment_str):
    # Your code to connect and send data to GPT-4 for summarization
    pass

# Example usage
summary = get_summary(total_str)
print(summary)
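For completeness, here is a minimal sketch of what get_summary could look like, assuming the openai Python client and a chat-completions call; the model name, prompt, and naive truncation (instead of proper chunking) are assumptions, not part of the original script.

# Hedged sketch (assumption): summarize the thread with the OpenAI chat completions API.
# Requires `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def get_summary(total_comment_str):
    # Truncate to keep the request small; long threads would need chunking or a long-context model.
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model name; swap for whatever you have access to
        messages=[
            {"role": "system", "content": "Summarize this Reddit thread for a technical audience."},
            {"role": "user", "content": total_comment_str[:30000]},
        ],
    )
    return response.choices[0].message.content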
Summary for the thread -
Which Models are you currently using?
(No RAG, 128k context length ftw)
Model Overviews:
Deacon-3B: Described as prone to hallucinations and tangential responses, Deacon-3B might not be the first choice for those seeking precise answers. For example, when tasked with a direct question, users found that it often veers off-topic.
Rocket-3B: Noted for its bland responses, Rocket-3B is critiqued for its preambles and disclaimers, often acknowledging its limitations upfront. However, it shows potential in creative writing domains.
Starling-LM-7B-Alpha: A star performer! It shines across various tasks, especially as a physics research assistant. It's not just good; it's exceptional in areas like common knowledge and analysis. A new favorite for many users.
Python-Code-13B & Python-Code-33B: While both have their hallucination tendencies, they have been instrumental in teaching users about certain libraries and generating valid code. The 33B version shows improvement over its 13B counterpart.
Chupacabra-7B-v2: Strong in analysis and RAG (Retrieval-Augmented Generation), but may not excel in all areas. It's known for its tendency to ramble.
Deepseek-Coder-6.7B-Instruct: Outperforms Python-Code-33B in language versatility and usability. However, it refuses to infer code for specific tasks like web-scraping.
Meditron-7B: Perhaps not the most useful for all, with responses criticized as short and lacking depth. Its training seems to have a unique flaw regarding its EOS token.
Bling-Stable-LM-3B-4e1t-v0: Fast at generating brief answers, this might be your go-to for quick categorizations and yes/no questions. Beware of some inaccuracies.
Dopeystableplats-3B-v1: An all-rounder with a penchant for creative writing and analysis. It might surprise you with its detailed, passionate critiques in unexpected areas.
Community Favorites & Tips:
Mixtral Variants: Highly praised for performance and versatility. With the right hardware, users enjoy substantial speed and stability improvements. For instance, offloading 10 GPU layers instead of 8 took one user from ~1.4 tokens/second to a stable ~4.4 tokens/second (see the offload sketch after this list).
Roleplay Models: For those interested in character-driven interactions, Noromaid Mixtral and various others have been making waves with their prose quality and adherence to character cards.
General Use Models: OpenHermes 2.5 is a popular fallback, with Nous-Capybara 34B and Mistral-7B-Instruct-v0.2 being on many users' lists to explore next.
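As a rough illustration of the layer-offload tweak mentioned under Mixtral Variants, here is a minimal sketch assuming a GGUF Mixtral quant and the llama-cpp-python bindings; the file name and layer counts are placeholders, not values from the thread.

# Hedged sketch (assumption): adjusting GPU layer offload for a GGUF Mixtral build
# with llama-cpp-python; model path and layer counts are illustrative only.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=4096,        # context window
    n_gpu_layers=10,   # bump from 8 -> 10 layers offloaded to the GPU
)

out = llm("Q: What is Mixtral? A:", max_tokens=64)
print(out["choices"][0]["text"])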
Innovative Techniques & Tinkerings:
Layer and Quantization Experiments: Users are not just using models; they're tweaking them. By adjusting layers and playing with quantization, they're finding sweet spots for performance and efficiency.
Handling Model Repetitions and Biases: The community is actively discussing and sharing strategies to mitigate common issues like repetition and unexpected biases, employing various sampler settings and prompt adjustments (a sampler-settings sketch follows this list).
Exploring Model Fusion and Merges: Some are venturing into uncharted territory by merging different models or their aspects to create hybrids that potentially offer the best of multiple worlds.
Customizing Inference Scripts: There's a growing trend of writing custom scripts and using alternative platforms to run these models, optimizing for specific use cases or hardware setups.
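As a rough illustration of the repetition-mitigation settings mentioned above, here is a minimal sketch assuming llama-cpp-python; the model file and the exact sampler values are illustrative assumptions, not community-recommended settings.

# Hedged sketch (assumption): the kind of sampler adjustments people use to curb repetition.
from llama_cpp import Llama

llm = Llama(model_path="openhermes-2.5-mistral-7b.Q4_K_M.gguf", n_ctx=4096)  # hypothetical file

out = llm(
    "Write a short, non-repetitive product description for a mechanical keyboard.",
    max_tokens=200,
    temperature=0.8,      # higher temperature adds variety
    top_p=0.9,            # nucleus sampling cutoff
    repeat_penalty=1.15,  # penalize recently generated tokens
)
print(out["choices"][0]["text"])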
Community Discussions:
Performance Tracking Across Parameters: With an ever-increasing array of models and settings, users are sharing their findings and methodologies for tracking and comparing performance effectively.
Custom Evaluation Scripts: Many in the community write their own scripts to test models under specific conditions. These scripts might feed a series of prompts to the model and evaluate the responses based on criteria like relevance, coherence, and creativity. The discussion often centers around the best practices for scripting these evaluations and interpreting the results.
A/B Testing: A/B testing is another frequently mentioned technique. Users discuss setting up controlled experiments where only one variable changes between two models at a time, allowing them to isolate the effects of that variable. This could involve tweaking one model's temperature setting, for instance, and comparing the results to a baseline.
Parameter Sweep and Grid Search: Some users delve into more complex methodologies like parameter sweeps and grid search. These involve systematically varying multiple parameters (like temperature, top-p, and token count) to observe how performance changes across a range of settings. The discussions often include tips on automating these tests and visualizing the high-dimensional data that results (a small grid-search sketch follows this section).
Real-World Task Evaluation: Beyond synthetic benchmarks, there's a strong emphasis on evaluating models based on real-world tasks. Users discuss setting up scenarios that closely mimic the intended use case of the model, whether it's generating code, writing creative fiction, or conducting research. They share insights on creating a diverse and representative set of tasks and the importance of human evaluation in assessing the model's utility.
Hardware and Inference Time: With the computationally intensive nature of LLMs, performance isn't just about accuracy and fluency. Discussions around tracking the inference time and resource utilization of models are common. Enthusiasts share methodologies for logging and analyzing the time taken for model inference and the impact of different hardware configurations.
Community Datasets and Leaderboards: Some users advocate for community-maintained datasets and leaderboards where different models' performances are tracked and compared. They discuss the benefits of having a centralized and transparent benchmarking system to foster a competitive yet collaborative environment for improvement.
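To make the parameter-sweep, evaluation, and timing ideas above concrete, here is a toy sketch assuming llama-cpp-python; the model file, prompt, and scoring function are placeholders standing in for whatever evaluation a user actually cares about, not anything shared in the thread.

# Hedged sketch (assumption): tiny grid search over sampler settings with per-run timing.
import itertools
import time
from llama_cpp import Llama

llm = Llama(model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf", n_ctx=4096)  # hypothetical file

prompt = "Explain retrieval-augmented generation in two sentences."

def score(text):
    # Placeholder metric: prefer answers around 40 words; real evals would check relevance/coherence.
    return -abs(len(text.split()) - 40)

results = []
for temperature, top_p in itertools.product([0.2, 0.7, 1.0], [0.8, 0.95]):
    start = time.perf_counter()
    out = llm(prompt, max_tokens=128, temperature=temperature, top_p=top_p)
    elapsed = time.perf_counter() - start
    text = out["choices"][0]["text"]
    results.append((score(text), temperature, top_p, elapsed))

# Rank settings by the placeholder score and report inference time per run
for s, t, p, secs in sorted(results, reverse=True):
    print(f"score={s:4d}  temperature={t:.1f}  top_p={p:.2f}  time={secs:.1f}s")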
Anticipation for Upcoming Innovations: There's palpable excitement for what's next. Discussions aren't just about what's working now but about what future developments beyond GPT-4 might bring to the table.
Increased Efficiency: There's anticipation around models becoming more efficient, requiring less computational power and memory. This includes interest in techniques like quantization and sparsification, which could allow users to run more powerful models on less powerful hardware.
Fine-Tuning and Specialization: Users are excited about the prospects of more effectively fine-tuning models for specific industries or tasks, such as legal analysis, medical diagnostics, or creative writing. This includes discussions around transfer learning and how new models might better adapt to specialized data.
Interactivity and Multimodality: Users are looking forward to models that offer greater interactivity and can understand and generate not just text but other types of data, such as images, audio, and video. This includes discussions about models that can seamlessly interact with users in a multimodal manner.
Personalization and Adaptability: Enthusiasts are looking forward to models that can better understand and adapt to individual user preferences, styles, and needs over time, providing a more personalized and intuitive AI experience.