This is an automated archive made by the Lemmit Bot.

The original was posted on /r/todayilearned by /u/tipoftheiceberg1234 on 2025-01-14 16:36:24+00:00.

 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/todayilearned by /u/ObjectiveAd6551 on 2025-01-14 16:10:09+00:00.

Original Title: TIL 2015’s Star Wars: The Force Awakens is the most expensive movie ever made, with a total cost of $447 million. Disney reduced costs using the UK’s Film Tax Relief, receiving $86.6 million in reimbursements. The movie grossed $2.1 billion worldwide.

 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/geocaching by /u/may-flowers1 on 2025-01-14 13:11:04+00:00.

 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/selfhosted by /u/eightstreets on 2025-01-14 12:38:15+00:00.


About 3 weeks ago I decided to block OpenAI's bots from my websites, as they kept crawling them even after I explicitly stated in my robots.txt that I don't want them to.

I already checked the file for syntax errors, and there aren't any.
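For anyone unfamiliar, the standard robots.txt opt-out for OpenAI's crawlers looks like this (GPTBot and ChatGPT-User are the user agents OpenAI documents; this is the generic pattern, not my exact file):

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /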

After that I decided to block by User-Agent, only to find out they sneakily dropped the user agent string so they could keep crawling my website.

Now I'll block them by IP range. Have you experienced anything like this with AI companies?

I find it annoying, as I spend hours writing high-quality blog articles just for them to come along and do whatever they want with my content.
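For anyone wanting to do the same at the application layer, here's a minimal Python sketch of IP-range blocking using only the standard library. The CIDR ranges below are placeholder documentation ranges, not OpenAI's actual ones; substitute whatever ranges the crawler operator currently publishes:

import ipaddress

# Placeholder CIDRs (RFC 5737 documentation ranges) -- replace with the
# crawler operator's currently published ranges before using this.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Known OpenAI crawler user agent substrings, per their public docs.
BLOCKED_AGENTS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot")

def is_blocked(remote_ip: str, user_agent: str) -> bool:
    """Return True if a request should be rejected."""
    ip = ipaddress.ip_address(remote_ip)
    if any(ip in net for net in BLOCKED_RANGES):
        return True
    # Also match by user agent in case the IP list is stale.
    return any(agent in user_agent for agent in BLOCKED_AGENTS)

# Example: is_blocked("192.0.2.10", "Mozilla/5.0") returns True.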

 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/gardening by /u/Alarmed_Hedgehog5173 on 2025-01-14 11:45:39+00:00.

 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/gardening by /u/AmphibianSimilar3899 on 2025-01-14 11:31:43+00:00.

 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/gardening by /u/DebateDisastrous7121 on 2025-01-14 08:08:14+00:00.

 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/armenia by /u/Weird-Round3987 on 2025-01-14 10:52:44+00:00.


Hey, I'm a 24-year-old male.

I was born and raised in Armenia.

I didn't have many friends growing up and was obsessed with making money on the internet. I found pretty good success with that at around 17-18, and immediately after I went off to travel the world and became a digital nomad.

So for the last 6 years I've mostly lived outside of Armenia, and the only people I know here are my family and relatives.

I kind of feel like an alien here and don't know where to start.

I want to make some cool friends, but it feels like most friendships here come from childhood or school, and outsiders aren't all that welcome.

I like Armenian girls too and would like to date, but it feels like there's no chance if you don't have friends here.

I've cold-approached quite a few times; it doesn't work that well, to say the least.

People have suggested I try some hobbies, but all of mine are sea sports, so there's not much for me to do here.

I'd like your advice on how to integrate into society here.

 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/armenia by /u/Typical_Effect_9054 on 2025-01-14 08:16:16+00:00.

 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/armenia by /u/Total_Pin_3983 on 2025-01-14 00:09:15+00:00.


If anyone has witness accounts or experiences of the genocide from their family or friends, would they be willing to share them?

 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Pringled101 on 2025-01-13 11:59:22+00:00.


Hi! A friend and I have been working on a project called SemHash that I wanted to share. We found that text deduplication is more complex than it appears, so we built this to simplify the process.

Duplicate samples can skew model training, return redundant samples in RAG workflows, reduce generalization, and cause train-test leakage, leading to unreliable results. Techniques like MinHash handle exact or near-exact duplicates, but semantic deduplication also catches samples that are semantically redundant, which we believe is an important part of deduplication. Furthermore, with MinHash it's not trivial to see why something was removed, which we also think matters, so we've added explainability features that let you inspect why each record was dropped. We already found some interesting results on some well-known datasets in our benchmarks, which are included in the repo.

The package can be installed with pip install semhash, and the basic usage looks like this (this example assumes you have the datasets library installed):

from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
train = load_dataset("ag_news", split="train")["text"]
test = load_dataset("ag_news", split="test")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=train)

# Deduplicate the train set
deduplicated_train = semhash.self_deduplicate().deduplicated

# Or deduplicate the test set against the train set
deduplicated_test = semhash.deduplicate(records=test).deduplicated
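On the explainability side, the idea is that the result object also exposes the removed records together with what they matched, so you can audit each removal. A rough sketch of what that inspection might look like; attribute names other than .deduplicated are assumptions here, not the confirmed API:

# Hedged sketch -- `.duplicates` and its fields are assumed names for the
# explainability output; only `.deduplicated` is shown above.
result = semhash.self_deduplicate()
for dup in result.duplicates:
    print("removed:", dup.record)
    print("matched against:", dup.duplicates)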

I’m very interested in hearing your thoughts on this! Is deduplication a part of your current ML workflows, and if so, what techniques do you use?

 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/todayilearned by /u/GetYerHandOffMyPen15 on 2025-01-14 15:06:40+00:00.

Original Title: TIL that Winston Churchill’s famous “Iron Curtain” speech was given at a college in rural Missouri with about 600 students. The college later purchased a ruined historic church from London, transported it stone by stone, rebuilt it and turned part of it into a Churchill museum.
