Need to let loose a primal scream without collecting footnotes first? Have a sneer percolating in your system but not enough time/energy to make a whole post about it? Go forth and be mid: Welcome to the Stubsack, your first port of call for learning fresh Awful you’ll near-instantly regret.

Any awful.systems sub may be subsneered in this subthread, techtakes or no.

If your sneer seems higher quality than you thought, feel free to cut’n’paste it into its own post: there’s no quota for posting and the bar really isn’t that high.

The post Xitter web has spawned soo many “esoteric” right wing freaks, but there’s no appropriate sneer-space for them. I’m talking redscare-ish, reality challenged “culture critics” who write about everything but understand nothing. I’m talking about reply-guys who make the same 6 tweets about the same 3 subjects. They’re inescapable at this point, yet I don’t see them mocked (as much as they should be)

Like, there was one dude a while back who insisted that women couldn’t be surgeons because they didn’t believe in the moon or in stars? I think each and every one of these guys is uniquely fucked up and if I can’t escape them, I would love to sneer at them.

(Semi-obligatory thanks to @dgerard for starting this.)

  • aio@awful.systems
    1 month ago

    That o3 does well on frontier math held-out set is impressive, no doubt

    I think there is plenty of room for doubt still. elliotglazer on reddit writes:

    Epoch’s lead mathematician here. Yes, OAI funded this and has the dataset, which allowed them to evaluate o3 in-house. We haven’t yet independently verified their 25% claim. To do so, we’re currently developing a hold-out dataset and will be able to test their model without them having any prior exposure to these problems.

    My personal opinion is that OAI’s score is legit (i.e., they didn’t train on the dataset), and that they have no incentive to lie about internal benchmarking performances. However, we can’t vouch for them until our independent evaluation is complete.

    (emphasis mine). So there is good reason to doubt that the “held-out dataset” even exists.