Machine Learning

1
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Annual-Minute-9391 on 2025-01-19 00:37:06+00:00.


Some background: I work as a data scientist/ML engineer for a small startup. I also adjunct for the department from which I got my PhD (in statistics).

For the last few years, I’ve been teaching a series of statistical programming courses for master’s students and early PhD students. This semester, my class unfortunately got canceled due to low enrollment, which I was told was due to poor recruitment and poor advertising last fall. We are thinking of offering that course every other year. I would like to propose a third course in the series with more advanced topics.

First course: programming fundamentals for both R and Python. Some basic analytical stuff for each.

Second course: Python-based analysis course (many R courses exist already) which covers statistical routines from the basics to mixed modeling and Bayesian analysis. We also go through the classic models with PyTorch as well as a few transformer-based applications, and work in some explainable AI techniques.

Third course: optimization, variational inference, other Bayesian deep learning approaches, MLops concepts, ????

The thing is, I need to work in a fair amount of stochastic approaches, because it’s a statistics department after all.

Hope that’s clear. I would like to provide relevant information especially to PhD students who would like to live at the cutting edge with an emphasis on experimentation and implementation. I know there is a lot out there but at work I need to focus on my specific tasks.

Thanks so much for any advice!

2
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/samim23 on 2025-01-18 19:56:44+00:00.

3
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Wise_Panda_7259 on 2025-01-18 21:18:11+00:00.


I do a lot of experimentation in Jupyter notebooks, and for most projects, I end up with multiple notebooks: one for EDA, one for data transformations, and several for different experiments. This workflow works great until it’s time to take the model to production.

At that point I have to take all the code from my notebooks and refactor it for production. This can sometimes take weeks. It feels like I'm duplicating effort and losing momentum.

Is there something I'm missing that I could be using to make my life easier? Or is this a problem y'all have too?

*Not a huge fan of nbdev because it presupposes a particular structure
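
One lightweight pattern that helps here (just a sketch of the general idea rather than a specific tool, with made-up file and function names) is to move transformations out of notebook cells into a plain importable module as soon as they stabilize, so the notebooks and the production entry point share the same code:

# features.py - hypothetical shared module
import numpy as np
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Feature engineering that would otherwise live in a notebook cell."""
    out = df.copy()
    out["log_amount"] = np.log1p(out["amount"].clip(lower=0))
    return out

# In the EDA/experiment notebooks:
#   from features import build_features
#   train = build_features(raw_train)
#
# The production pipeline imports the same function, so there is nothing to copy
# out of the notebooks when it's time to deploy.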

4
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/moschles on 2025-01-18 10:54:40+00:00.

5
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Sea_Farmer5942 on 2025-01-18 03:29:40+00:00.


Hey guys,

So I am very new to Bayesian methods and am curious about them from a data science and modelling point of view, and about how they could determine causal relationships.

I don't really know where to start, but I've read some papers on Bayesian Networks and have heard interesting things about Bayesian Deep Learning so would be happy to see any recommendations on those topics.

I would also be happy to hear about any papers you may have recently read. I'm not looking for an application in any specific domain, just anything you've found interesting; I'm interested in learning the theory for now (unless you suggest that I pick a domain first).

Many thanks

6
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Sad-Razzmatazz-5188 on 2025-01-18 10:05:15+00:00.


This is a half joke, and the core concepts are quite easy, but I'm sure the community will cite lots of evidence to both support and dismiss the claim that softmax sucks, and actually make it into a serious and interesting discussion.

What is softmax? It's the operation of applying an element-wise exponential function and normalizing by the sum of activations. What does it do intuitively? One point is that the outputs sum to 1. Another is that the relatively larger outputs become even larger relative to the smaller ones: big and small activations are torn apart.

One problem is that you never get zero outputs if the inputs are finite (e.g. without masking you can't attribute 0 attention to some elements). The one that makes me go crazy is that for most applications, magnitudes and ratios of magnitudes are meaningful, but in softmax they are not: softmax only cares about differences. Take softmax([0.1, 0.9]), softmax([1, 9]), and softmax([1000.1, 1000.9]). Which do you think are equal? In what applications is that the more natural way to go?
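
A quick numerical check of that point (a minimal sketch):

import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x - x.max())   # subtracting the max changes nothing: softmax is shift-invariant
    return e / e.sum()

print(softmax([0.1, 0.9]))        # ~[0.31, 0.69]
print(softmax([1000.1, 1000.9]))  # identical: only the difference 0.8 matters
print(softmax([1, 9]))            # ~[0.0003, 0.9997]: same input ratio as the first case, very different output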

Numerical instabilities, strange gradients, and embedding norms are all affected by these simple properties. Of course, in the meantime softmax is one of the workhorses of deep learning; it does quite a job.

Is someone else such a hater? Is someone keen to redeem softmax in my eyes?

7
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/DarkAutumn on 2025-01-17 22:52:06+00:00.


A year ago I started trying to use PPO to play the original Legend of Zelda, and I was able to train a model to beat the first boss after a few months of work. I wanted to share the project just for show and tell. I'd love to hear feedback and suggestions, as this is just a hobby project and I don't do this for a living. The code for that lives in the original-design branch of my Triforce repo. I'm currently tinkering with new designs, so the main branch is much less stable.

Here's a video of the agent beating the first dungeon, which was trained with 5,000,000+ steps. At 38 seconds, it learned that it's invulnerable at the screen edge, it exploits that to avoid damage from a projectile. At 53 seconds it steps up to avoid damage, even though it takes a -0.06 penalty for moving the wrong way (taking damage would be a larger penalty.) At 55 seconds it walks towards the rock projectile to block it. And so on, lots of little things the model does is easy to miss if you don't know the game inside and out.

As a TLDR, here's an early version of my new (single) model. This doesn't make it quite as far, but if you watch closely its combat is already far better, and it's only trained on 320,000 steps (~6% of the first model's training steps).

This is pretty far along from my very first model.

Original Design

I got the original project working using stable-baselines' PPO and default neural network (Shared NatureCNN, I believe). SB was great to get started with but ultimately stifling. In the new version of the project I've implemented PPO from scratch with torch, with my own simple neural network similar to stable-baselines' default. I'm playing with all kinds of changes and designs now that I have more flexibility and control. Here is my rough original design:

Overall Strategy

My first pass through this project was basically "imagine playing Zelda with your older sibling telling you where to go and what to do". I give the model an objective vector which points to where I want it to go on the screen (as the crow flies; the agent still had to learn pathfinding to avoid damage and navigate around the map). This vector either points at the nearest enemy I want it to kill or is a NSEW direction if it's supposed to move to the next room.

Due to a few limitations with stable-baselines (especially around action masking), I ended up training unique models for traversing the overworld vs the dungeon (since they have entirely different tilesets). I also trained a different model for when we have sword beams vs not. In the video above you can see which model is being used onscreen.

In my current project I've removed this objective vector as it felt too much like cheating. Instead I give it a one-hot encoded objective (move north to the next room, pick up items, kill enemies, etc.). So far it's working quite well without that crutch. The new project also does a much better job of combat, even without multiple models to handle beams vs not.

Observation/Action Space

Image - The standard neural network had a really tough time being fed the entire screen. No amount of training seemed to help. I solved this by creating a viewport around Link that keeps him centered. This REALLY helped the model learn.
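
For reference, a centered viewport like that is just a crop with edge padding; a sketch with made-up sizes, not the repo's actual code:

import numpy as np

def viewport(frame: np.ndarray, link_xy: tuple[int, int], size: int = 128) -> np.ndarray:
    """Return a size x size window of the H x W x C frame centered on Link's (x, y) pixel."""
    half = size // 2
    padded = np.pad(frame, ((half, half), (half, half), (0, 0)), mode="edge")
    x, y = link_xy
    return padded[y : y + size, x : x + size]   # centered because of the padding offset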

I also had absolutely zero success with stacking frames to give Link a way to see enemy/projectile movement. The model simply never trained with stable-baselines when I implemented frame stacking and I never figured out why. I just added it to my current neural network and it seems to be working...

Though my early experiments show that giving it 3 frames (skipping two in between, so frames curr, curr-3, curr-6) doesn't really give us that much better performance. It might if I took away some of the vectors. We'll see.

Vectors - Since the model cannot see beyond its little viewport, I gave the model a vector to the closest item, enemy, and projectile onscreen. This made it so the model can shoot enemies across the room outside of its viewport. My new model gives it multiple enemies/items/projectiles and I plan to try to use an attention mechanism as part of the network to see if I can just feed it all of that data.

Information - It also gets a couple of one-off datapoints like whether it currently has sword beams. The new model also gives it a "source" room (to help better understand dungeons where we have to backtrack), and a one-hot encoded objective.

Action Space

My original project just has a few actions: 4 for moving in the cardinal directions and 4 for attacking in each direction (I also added bombs but never spent any time training it). I had an idea to use masking to help speed up training, i.e. if Link bumps into a wall, don't let him move in that direction again until he moves elsewhere, as the model would often spend an entire memory buffer running headlong straight into a wall before an update. Better to do it once and get a huge negative penalty, which is essentially the same result but faster.

Unfortunately SB made it really annoying architecturally to pass that info down to the policy layer. I could have hacked it together, but eventually I just reimplemented PPO and my own neural network so I could properly mask actions in the new version. For example, when we start training a fresh model, it cannot attack when there aren't enemies on screen and I can disallow it from leaving certain areas.
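
The masking itself is only a couple of lines once you own the policy head; a minimal sketch of the idea (not the repo's implementation):

import torch
from torch.distributions import Categorical

def sample_masked_action(logits: torch.Tensor, valid: torch.Tensor):
    """logits: (num_actions,) policy output; valid: bool tensor of the same shape.

    Invalid actions get -inf logits, so they receive zero probability and are never sampled.
    """
    masked = logits.masked_fill(~valid, float("-inf"))
    dist = Categorical(logits=masked)
    action = dist.sample()
    return action, dist.log_prob(action)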

The new model actually treats swinging the sword at short range and firing sword beams as two different actions, though I haven't had a chance to fully train with that split yet.

Frameskip/Cooldowns - In the game I don't use a fixed frame skip for actions. Instead I use the internal RAM state of the game to know when Link is animation-locked or not, and only allow the agent to take actions when it's actually possible to give meaningful input to the game. This greatly sped up training. We also force movement to be between tiles on the game map. This means that when the agent decides to move it loses control for longer than a player would; a player can make more split-second decisions. This made it easier to implement movement rewards though, and might be something to clean up in the future.

Other interesting details

Pathfinding - To facilitate rewards, the original version of this project used A* to pathfind from Link to what he should be doing. Here's a video of it in action. This information wasn't given to the model directly; instead the agent would only be given the rewards if it exactly followed that path or the transposed version of it. It would also pathfind around enemies and not walk through them.

This was a nightmare though. The corner cases were significant, and pushing Link towards enemies but not into them was really tricky. The new version just uses a wavefront algorithm: I calculate a wave from the tiles we want to get to outwards, then make sure we are following the gradient. Also, calculating the A* around enemies every frame (even with caching) was super slow. Wavefront was faster, especially because I give the new model no special rewards for walking around enemies; it's faster to compute and the model has to learn from taking damage or not.
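
A wavefront field is just a breadth-first distance transform over the walkable tiles; a sketch of the general technique (not the project's code):

from collections import deque

def wavefront(goals, walkable):
    """BFS distance field: goals are (row, col) tiles, walkable is a 2D grid of bools.

    Rewarding moves that step to a neighbor with a smaller value follows the gradient to the goal.
    """
    rows, cols = len(walkable), len(walkable[0])
    dist = [[float("inf")] * cols for _ in range(rows)]
    queue = deque()
    for r, c in goals:
        dist[r][c] = 0
        queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and walkable[nr][nc] and dist[nr][nc] == float("inf"):
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist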

Either way, both the old and new models successfully learned how to pathfind around danger and obstacles, with or without the cheaty objective vector.

Rewards - I programmed very dense rewards in both the old and new model. At basically every step, the model is getting rewarded or punished for something. I actually have some ideas I can't wait to try out to make the rewards more sparse. Or maybe we start with dense rewards for the first training, then fine-tune the model with sparser rewards. We'll see.

Predicting the Future - Speaking of rewards, one interesting wrinkle is that the agent can do a lot of things that will eventually deal damage, but not on that frame. For example, when Link sets a bomb it takes several seconds before it explodes, killing things. This can be a massive reward or penalty, since he spent an extremely valuable resource but may have done massive damage. PPO and other RL methods propagate rewards backwards, of course, but that spike in reward could land on a weird frame where we took damage or moved in the wrong direction.

I probably could have just not solved that problem and let it shake out over time, but instead I used the fact that we are in an emulator to just see what the outcome of every decision is. When planting a bomb, shooting sword...


Content cut off. Read original on https://old.reddit.com/r/MachineLearning/comments/1i3t4c3/p_building_an_reinforcement_learning_agent_to/

8
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Cranyx on 2025-01-17 21:15:06+00:00.


For the past few years I've had a job with the official title "machine learning engineer", but as I hunt for other jobs online, I wonder if that's actually accurate. Based on the experience requirements and responsibilities listed, it doesn't seem to match up with what I do.

I have a master's with a focus in ML (though that was pre-LLM-boom, so things have changed a lot) but struggled to find work in my area pertaining to that out of college. Post-COVID, when everyone went remote, I got my current job. In it, I work on a team building and deploying software that utilizes machine learning to accomplish tasks. However, I'm never the one actually building the learning models (there's a researcher on our team who does that); I just create the systems around them. I'm actually pretty happy in my "machine learning adjacent" role, but should I be searching for different job titles to find something similar?

9
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/treblenalto on 2025-01-17 05:36:33+00:00.


Hi, I'm putting together a list of papers to recommend to students just starting out in compsci.

What are some must-read papers to give them that are not too deep?

These days, statistical learning theory is within reach through online courses, but I want them to grow into reading academic papers.

I'm starting off with Ilya Sutskever's reading list.

A brief explanation of why you’re recommending the paper would be welcome too!

10
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/JohnnyAppleReddit on 2025-01-17 01:06:00+00:00.


Grokking, the sudden generalization that occurs after prolonged overfitting, is a surprising phenomenon challenging our understanding of deep learning. Although significant progress has been made in understanding grokking, the reasons behind the delayed generalization and its dependence on regularization remain unclear. In this work, we argue that without regularization, grokking tasks push models to the edge of numerical stability, introducing floating point errors in the Softmax function, which we refer to as Softmax Collapse (SC). We demonstrate that SC prevents grokking and that mitigating SC enables grokking without regularization. Investigating the root cause of SC, we find that beyond the point of overfitting, the gradients strongly align with what we call the naïve loss minimization (NLM) direction. This component of the gradient does not alter the model's predictions but decreases the loss by scaling the logits, typically by scaling the weights along their current direction. We show that this scaling of the logits explains the delay in generalization characteristic of grokking and eventually leads to SC, halting further learning. To validate our hypotheses, we introduce two key contributions that address the challenges in grokking tasks: StableMax, a new activation function that prevents SC and enables grokking without regularization, and ⊥Grad, a training algorithm that promotes quick generalization in grokking tasks by preventing NLM altogether. These contributions provide new insights into grokking, elucidating its delayed generalization, reliance on regularization, and the effectiveness of existing grokking-inducing methods.

Paper:

(not my paper, just something that was recommended to me)
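
As a toy illustration of the floating-point saturation the abstract describes (my own sketch, not the paper's code), scaling the logits up eventually makes a float32 softmax exactly one-hot, at which point the cross-entropy loss and gradient for the argmax class are exactly zero:

import numpy as np

def softmax32(x):
    x = np.asarray(x, dtype=np.float32)
    e = np.exp(x - x.max())
    return e / e.sum()

for scale in (10, 100, 200):
    p = softmax32(np.array([2.0, 1.0, 0.0]) * scale)
    print(scale, p, "exactly one-hot:", bool(p[1] == 0.0 and p[2] == 0.0))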

11
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/BubblyOption7980 on 2025-01-16 09:12:27+00:00.


What are the initial impressions about their work? Can it be a game changer? How quickly can this be incorporated into new products? Looking forward to the conversation!

12
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/imadade on 2025-01-15 20:43:05+00:00.


Abstract:

Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.

Arxiv link:

13
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/danielhanchen on 2025-01-15 18:22:48+00:00.


Hey r/MachineLearning! Last week, Microsoft released Phi-4, a 14B open-source model that rivals OpenAI's GPT-4o-mini. I managed to find & fix 4 bugs impacting its output quality. You might remember me previously from fixing 8 bugs in Google's Gemma model! :)

I'm going to walk you through how I found & fixed the bugs. Phi-4's benchmarks were amazing, but many users reported weird or just plain wrong outputs. Since I maintain the open-source project called 'Unsloth' (fine-tuning LLMs 2x faster with 70% less VRAM) with my brother, I first tested Phi-4 for inference and found many errors. Our GitHub repo:

This time, the model had no implementation issues (unlike Gemma 2), but it did have problems in the model card. For my first inference run, I randomly found an extra token, which is obviously incorrect (2 EOS tokens are never a good idea). Also, during more runs, I found there was an extra assistant prompt, which is once again incorrect. And lastly, from past experience with Unsloth's bug fixes, I already knew fine-tuning was wrong when I read the code.

These bugs caused Phi-4 to have some drop in accuracy and also broke fine-tuning runs. Our fixes are now under review by Microsoft to be officially added to Hugging Face. We uploaded the fixed versions to

Here’s a breakdown of the bugs and their fixes:

1. Tokenizer bug fixes

The Phi-4 tokenizer interestingly uses <|endoftext|> as the BOS (beginning of sentence), EOS (end of sentence) and PAD (padding) tokens. The main issue is the EOS token is wrong - it should be <|im_end|>. Otherwise, you will get <|im_end|><|endoftext|> in generations.

2. Fine-tuning bug fixes

The padding token should be a designated pad token like in Llama (<|finetune_right_pad_id|>) or we can use an untrained token - for example we use <|dummy_87|>, fixing infinite generations and outputs.

3. Chat template issues

The Phi-4 tokenizer always adds an assistant prompt - it should only do this if prompted by add_generation_prompt. Most LLM serving libraries expect the assistant prompt not to be added automatically, and this might cause issues during serving.
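
You can check this behavior yourself with the Hugging Face tokenizer; a quick sketch (point it at whichever Phi-4 upload you want to test):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-4")
messages = [{"role": "user", "content": "Hello!"}]

# With a correct chat template, the assistant header should appear only when requested:
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=False))
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print("EOS:", tok.eos_token, "PAD:", tok.pad_token)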

We dive deeper into the bugs in our blog: https://unsloth.ai/blog/phi4

Do our Fixes Work?

Yes! Our fixed Phi-4 uploads show clear performance gains, with even better scores than Microsoft's original uploads on the Open LLM Leaderboard.

Some redditors even tested our fixes to show greatly improved results in:

We also made a Colab notebook to fine-tune Phi-4 completely for free using Google's free Tesla T4 (16GB) GPUs:

Thank you for reading this long post and hope you all found this insightful! If you have any questions, please feel free to ask! :)

How I found the bugs:

  1. I first downloaded the original Phi-4 from , and tested inference out. Weirdly I found <|im_start|>assistant<|im_sep|> to be appended at the end even with add_generation_prompt = False in Hugging Face, so I theorized there was a chat template problem. Adding assistant prompts by default can break serving libraries.
  2. And yes, the chat template had added the assistant prompt by default - I fixed this first!
  3. I then found <|endoftext|> to be used for the BOS, EOS and PAD tokens, which is a common issue amongst models - I ignored the BOS, since Phi-4 did not have one anyways, but changed the PAD token to <|dummy_87|>. You can select any of the tokens since they're empty and not trained. This counteracts issues of infinite generations during finetuning.
  4. For Llama-fication, I used torch.allclose to confirm all tensors are in fact equivalent. I also used some fake random data to check all activations are also mostly similar bitwise. I also uploaded the model to the HF Open LLM Leaderboard to confirm if the original Phi-4 arch and the new Llama-fied models are equivalent.
  5. Finally I verified all finetuning runs with Unsloth in a Colab Notebook to confirm all runs were correct.
14
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/ToThePastMe on 2025-01-15 16:17:30+00:00.


There is this dataset (won't link here as I don't want my kaggle and reddit associated) with a few input features (5-6) used to predict one target value.

But one of the features is basically perfectly linearly correlated with the target (>0.99).

An example would be data from a trucking company with a single model of trucks:

Target: truck fuel consumption / year
Features: driver's age, tire type, truck age, DISTANCE TRAVELED / year

Obviously, on average the fuel consumption will be linearly proportional to the number of miles traveled. Normally you'd just use that to calculate a new target like fuel/distance.
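
That normalization is a one-liner; a sketch with made-up column names for the trucking analogy (not the actual Kaggle dataset):

import pandas as pd

df = pd.DataFrame({
    "fuel_per_year": [12000.0, 8000.0, 15000.0],
    "distance_per_year": [60000.0, 41000.0, 74000.0],
    "truck_age": [3, 7, 2],
})

# The raw target is almost entirely explained by distance...
print(df.corr(numeric_only=True)["fuel_per_year"])

# ...so normalize it and model consumption per mile instead:
df["fuel_per_mile"] = df["fuel_per_year"] / df["distance_per_year"]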

Yet not a single person/notebook did this kind of normalization. So everyone's model has >0.99 accuracy, as that one feature drowns out everything else.

Is this something other people have noticed: more and more, the code looks fine (data loading, training many types of models), maybe thanks to LLMs, but the decision-making process is often quite bad?

15
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/hardmaru on 2025-01-15 00:41:09+00:00.


Paper:

Abstract

Self-adaptive large language models (LLMs) aim to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse tasks. We introduce Transformer², a novel self-adaptation framework that adapts LLMs for unseen tasks in real-time by selectively adjusting only the singular components of their weight matrices. During inference, Transformer² employs a two-pass mechanism: first, a dispatch system identifies the task properties, and then task-specific "expert" vectors, trained using reinforcement learning, are dynamically mixed to obtain targeted behavior for the incoming prompt. Our method outperforms ubiquitous approaches such as LoRA, with fewer parameters and greater efficiency. Transformer² demonstrates versatility across different LLM architectures and modalities, including vision-language tasks. Transformer² represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs, paving the way for truly dynamic, self-organizing AI systems.

Blog Summary:

GitHub:
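
As a rough picture of what "adjusting only the singular components of their weight matrices" could mean (my own reading of the abstract, not the paper's actual SVF/expert-vector code): take the SVD of a weight matrix and let a learned vector rescale its singular values:

import torch

def scale_singular_values(W: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Rescale the singular values of W (out x in) by a learned vector z, one entry per value."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U @ torch.diag(S * z) @ Vh

W = torch.randn(64, 32)
z = torch.ones(32)   # z == 1 leaves W unchanged; training z adapts the layer with very few parameters
assert torch.allclose(scale_singular_values(W, z), W, atol=1e-4)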

16
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/TheDevilIsInDetails on 2025-01-14 22:11:39+00:00.


Hello,

I am wondering how people usually search for or discover new papers on arXiv. Do you just use its search engine? Any tips/hints?

17
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/imadade on 2025-01-15 00:51:52+00:00.


Abstract: “Over more than a decade there has been an extensive research effort of how effectively utilize recurrent models and attentions. While recurrent models aim to compress the data into a fixed-size memory (called hidden state), attention allows attending to the entire context window, capturing the direct dependencies of all tokens. This more accurate modeling of dependencies, however, comes with a quadratic cost, limiting the model to a fixed-length context. We present a new neural long-term memory module that learns to memorize historical context and helps an attention to attend to the current context while utilizing long past information. We show that this neural memory has the advantage of a fast parallelizable training while maintaining a fast inference. From a memory perspective, we argue that attention due to its limited context but accurate dependency modeling performs as a short-term memory, while neural memory due to its ability to memorize the data, acts as a long-term, more persistent, memory. Based on these two modules, we introduce a new family of architectures, called Titans, and present three variants to address how one can effectively incorporate memory into this architecture. Our experimental results on language modeling, common-sense reasoning, genomics, and time series tasks show that Titans are more effective than Transformers and recent modern linear recurrent models. They further can effectively scale to larger than 2M context window size with higher accuracy in needle-in-haystack tasks compared to baselines.”

Arxiv:

18
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/wonder-why-I-wonder on 2025-01-14 15:08:32+00:00.


Some people say that AI research scientists (PhD holders) are pretty much irreplaceable because of their ability to push the boundaries of knowledge and come up with groundbreaking methods and algorithms. But let’s be real—tech companies don’t need a ton of researchers, especially if their work doesn’t directly boost profits.

On the flip side, Machine Learning Engineers are the ones putting those algorithms into action, scaling systems, and keeping production pipelines running—all things that directly bring in the $$$. That’s why some people think MLE roles will grow faster than AI research scientist roles in the future.

What do you think? Are there trends or experiences you’ve seen that suggest one of these roles will be more in demand down the line? I'm currently a PhD student by the way.

For a fair comparison, let’s assume both roles are at a FAANG company.

19
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Pringled101 on 2025-01-13 11:59:22+00:00.


Hi! A friend and I have been working on a project called SemHash which I wanted to share. We found that text deduplication is more complex than it appears, so we built this to simplify the process.

Duplicate samples can skew model training, return redundant samples in RAG workflows, reduce generalization, and cause train-test leakage, leading to unreliable results. Techniques like MinHash handle exact or near-exact duplicates, but semantic deduplication also catches semantically redundant samples, which we believe is an important aspect of deduplication. Furthermore, it's not trivial to see why something was removed with MinHash, which we also believe is important. For this reason, we've added explainability features as well, so that you can inspect why something was removed. We already found some interesting results on some well-known datasets in our benchmarks, which are included in the repo.

The package can be installed with pip install semhash, and the basic usage looks like this (this example assumes you have the datasets library installed):

from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
train = load_dataset("ag_news", split="train")["text"]
test = load_dataset("ag_news", split="test")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=train)

# Deduplicate the train set
deduplicated_train = semhash.self_deduplicate().deduplicated

# Or deduplicate the test set against the train set
deduplicated_test = semhash.deduplicate(records=test).deduplicated

I’m very interested in hearing your thoughts on this! Is deduplication a part of your current ML workflows, and if so, what techniques do you use?

20
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/MagnoliaPotato on 2025-01-13 15:16:18+00:00.


Hi everyone, I recently noticed that most LLM observability providers (Arize AI, Galileo AI, LangSmith) use a simple LLM-as-a-Judge framework to detect hallucinations in deployed RAG applications. There's a ton of hallucination detection research out there, like this or this survey, so I wondered why none of these providers offer more advanced, research-backed methods. Given the user input query, retrieved context, and LLM output, one can pass this data to another LLM to evaluate whether the output is grounded in the context. So I benchmarked this LLM-as-a-Judge framework against a couple of research methods on the HaluBench dataset - and it turns out they're probably right! A strong base model with chain-of-thought prompting seems to work better than various research methods. Code here. Partial results:

| Framework | Accuracy | F1 Score | Precision | Recall |
|---|---|---|---|---|
| Base (GPT-4o) | 0.754 | 0.760 | 0.742 | 0.778 |
| Base (GPT-4o-mini) | 0.717 | 0.734 | 0.692 | 0.781 |
| Base (GPT-4o, sampling) | 0.765 | 0.766 | 0.762 | 0.770 |
| CoT (GPT-4o) | 0.833 | 0.831 | 0.840 | 0.822 |
| CoT (GPT-4o, sampling) | 0.823 | 0.820 | 0.833 | 0.808 |
| Fewshot (GPT-4o) | 0.737 | 0.773 | 0.680 | 0.896 |
| Lynx | 0.766 | 0.780 | 0.728 | 0.840 |
| RAGAS Faithfulness (GPT-4o) | 0.660 | 0.684 | 0.639 | 0.736 |
| RAGAS Faithfulness (HHEM) | 0.588 | 0.644 | 0.567 | 0.744 |
| G-Eval Hallucination (GPT-4o) | 0.686 | 0.623 | 0.783 | 0.517 |
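
For reference, the LLM-as-a-Judge setup from the table is only a few lines; a hedged sketch (the prompt wording and parsing are mine, not the benchmark's exact code, and it assumes the openai v1 Python client):

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are checking a RAG answer for hallucinations.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Think step by step, then output a final line containing only PASS (fully supported by the context) or FAIL."""

def is_grounded(question: str, context: str, answer: str, model: str = "gpt-4o") -> bool:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
    )
    verdict = resp.choices[0].message.content.strip().splitlines()[-1]
    return verdict.strip().upper().startswith("PASS")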

21
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/skeltzyboiii on 2025-01-13 16:11:00+00:00.


Netflix and Cornell University researchers have exposed significant flaws in cosine similarity. Their study reveals that regularization in linear matrix factorization models introduces arbitrary scaling, leading to unreliable or meaningless cosine similarity results. These issues stem from the flexibility of embedding rescaling, affecting downstream tasks like recommendation systems. The research highlights the need for alternatives, such as Euclidean distance, dot products, or normalization techniques, and suggests task-specific evaluations to ensure robustness.
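
The core argument is easy to reproduce: a diagonal rescaling of the embedding dimensions leaves a factorization's predictions untouched while changing cosine similarities arbitrarily (a minimal sketch):

import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=(5, 4))            # "user" factors
Q = rng.normal(size=(6, 4))            # "item" factors
D = np.diag([10.0, 1.0, 0.1, 1.0])     # arbitrary per-dimension rescaling

P2, Q2 = P @ D, Q @ np.linalg.inv(D)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

assert np.allclose(P @ Q.T, P2 @ Q2.T)           # identical predictions
print(cosine(P[0], P[1]), cosine(P2[0], P2[1]))  # different "similarities" between the same two users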

Read the full paper review of 'Is Cosine-Similarity of Embeddings Really About Similarity?' here:

22
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Singularian2501 on 2025-01-12 22:14:03+00:00.

23
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/South-Conference-395 on 2025-01-12 22:04:47+00:00.


I have encountered cases where optimizing with looser bounds gives better performance on test data. For example, in this paper:

authors state: "It seems that, at least for misspecified models such as overparametrized neural networks, training a looser bound on the log-likelihood leads to improved predictive performance. We conjecture that this might simply be a case of ease of optimization allowing the model to explore more distinct modes throughout the training procedure."

More details can be found below Eq. 14 in the appendix.

Are there other problems where a similar observation has been made?

thanks!

24
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/AuspiciousApple on 2025-01-12 19:30:24+00:00.

Original Title: [D] Is a ViT with local window attention (SAM-style) not that much more efficient than a vanilla ViT with global attention in all layers? Especially at high resolution where global attention should be super expensive.


I was reading this blog post by Lucas Beyer:

When he compares ViT-B/16 and the SAM variant with mostly local attention (window size 14), I was a bit surprised that the throughput improvements are slight (left) and that the SAM variant requires more peak memory.

Now this is inference only, so maybe during training the difference is larger, but I naively would have thought that local attention is much faster still, especially at high resolutions.

At 1024x1024, we should have 1024/16 = 64, so 64x64 = 4096 patches - the global attention operation should be extremely expensive, no? Am I missing something?
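
A rough back-of-envelope on where the savings go (my own numbers, assuming window 14 with padding as in SAM):

tokens = (1024 // 16) ** 2                  # 64 * 64 = 4096 tokens at 1024x1024
global_pairs = tokens ** 2                  # ~16.8M query-key pairs per global attention layer

window = 14
windows = (-(-64 // window)) ** 2           # ceil(64/14)^2 = 25 windows after padding
local_pairs = windows * (window ** 2) ** 2  # 25 * 196^2 ~ 0.96M pairs

print(global_pairs / local_pairs)           # ~17x fewer pairs in the attention matrices
# But the QKV/output projections and the MLP still scale linearly with all 4096 tokens,
# so end-to-end throughput can improve far less than the attention-only math suggests.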

25
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/habitante on 2025-01-11 20:02:48+00:00.


Hi, I've been toying with a simple idea for developing a future-proof, dynamic AI model benchmark. The idea is pretty simple: a hidden function transforms data, and the model only gets to see the before and after and has to deduce the hidden logic. I've carefully curated several levels of slightly increasing difficulty, and I've been surprised to see that most current models I can access (GPT, o1, Sonnet, Gemini) suck at it.

For instance, the first puzzle simply does ^= 0x55 to the bytes of the input buffers, yet most models struggle to see it or deduce it.
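
Concretely, the model is shown before/after byte buffers and has to recover something like this (a sketch of the described first-level transform):

def hidden_transform(buf: bytes) -> bytes:
    """The first puzzle as described: XOR every byte with 0x55."""
    return bytes(b ^ 0x55 for b in buf)

before = b"hello"
after = hidden_transform(before)
print(before.hex(), "->", after.hex())   # 68656c6c6f -> 3d3039393a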

I've spun up an open-source MIT repo with a live demo, so others can give this idea a try or contribute. I appreciate any feedback. Thanks!
