I recently learned that traditionally in Shipibo culture, ayahuasca was never meant to be given to "the normal mind". Instead the maestras would be the ones taking the ayahuasca in order to help guide them into diagnosing people dealing with various sicknesses.
These maestras were also ranked by how many different plants they'd done a dieta on. A dieta is kinda similar to fasting. You can't shower with soap, you can't have sex, you can't have too much salt/seasoning, can't be exposed to too much smoke, can't have alcohol, etc. And you use that specific plant throughout your time. Basically you want to eliminate any conflicting variables so you can experience the plant as purely as possible to understand its effects. Traditionally these dietas could last over a year but modern day maestros typically do them for just a few weeks.
I don't really have a point to this. Just found it fascinating how deeply and strictly they study certain plant medicines and wanted to share.
(Fwiw I've accumulated a couple years worth of dieta under my belt and am well aware of the restrictions! It's indeed very fascinating, been pretty serious about it the last few years and I've barely scratched the surface)
FYI - Lens on Android does in-place language translation, including attempting to use the same/similar font that the original language is written/printed in.
Unfortunately, I don't think Lens can be used in an automated batch translation mode to convert an entire book/multiple pages
And that translation is likely only a rough approximation, as words don't often translate directly. Adding an extra layer (Spanish -> English) seems like another layer of imperfect (due to language) abstraction.
Of course your efforts are targeting a niche, so likely people will understand the attempt and be thankful. I hope this suggestion isn't too forward, but this being an electronic version, you could allow some way for the original Spanish to be shown if desired. That sort of functionality would be quite helpful; even non-native Spanish speakers might get a clearer picture.
What tools are you using to abstract all of this?
If the spacing and columns of the images are consistent, I'd think ImageMagick would allow you to automate extraction by column (e.g., cutting the individual pages up), and OCR could then get to work.
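A minimal sketch of that column cut, assuming ImageMagick 7 (the `magick` binary; on ImageMagick 6 the command is `convert`) and equal-width columns. The function name, the `-col%d` naming, and the paths are made up for illustration. It only builds the command, so you can eyeball it before running anything.

```python
from pathlib import Path


def column_split_command(page: Path, columns: int, out_dir: Path) -> list[str]:
    """Build an ImageMagick command that cuts one page into equal-width columns."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    # ImageMagick replaces "%d" with the tile index (0, 1, ...) when it
    # writes more than one output image.
    out_pattern = out_dir / f"{page.stem}-col%d.png"
    return [
        "magick", str(page),
        "-crop", f"{columns}x1@",  # N equal tiles across, 1 tile down
        "+repage",                 # reset the canvas offsets left over from the crop
        str(out_pattern),
    ]
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)`. If the columns are not equal width, you'd need explicit `WxH+X+Y` crop geometries instead of the `Nx1@` shortcut.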
For the Shipibo side, I'd want to turn off all LLM interpretation. That tends to use known groupings of words to probabilistically determine best-match, and that'd wreak havoc in this case.
Back to the images: once you have ImageMagick chop and sort, writing a very short script to iterate over the pages, display them, and prompt with y/n would be a massive time saver. Doing so at each step would be helpful.
For example, one step? Cut off header and footer, save to dir. Use helpful naming conventions (page-1, and page-1-noheader_footer). You could then use ImageMagick to combine page-1 and page-1-noheader_footer side by side.
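Roughly what that step could look like, as a sketch. It assumes ImageMagick 7's `magick` binary and fixed header/footer heights in pixels; the function names and the `-compare` suffix are my own invention, only the `-noheader_footer` name comes from the convention above.

```python
from pathlib import Path


def trim_command(page: Path, header_px: int, footer_px: int) -> tuple[list[str], Path]:
    """Command that removes a fixed-height header and footer from one page."""
    trimmed = page.with_name(f"{page.stem}-noheader_footer{page.suffix}")
    cmd = [
        "magick", str(page),
        "-gravity", "North", "-chop", f"0x{header_px}",  # remove rows from the top
        "-gravity", "South", "-chop", f"0x{footer_px}",  # remove rows from the bottom
        "+repage",
        str(trimmed),
    ]
    return cmd, trimmed


def side_by_side_command(original: Path, trimmed: Path) -> tuple[list[str], Path]:
    """Command that joins the original and the trimmed page left to right."""
    preview = original.with_name(f"{original.stem}-compare{original.suffix}")
    # "+append" stacks images horizontally ("-append" would stack vertically).
    return ["magick", str(original), str(trimmed), "+append", str(preview)], preview
```

Each command list can be handed to `subprocess.run(cmd, check=True)`. Keeping the originals untouched and writing new files means a bad cut never costs you anything.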
Now run a simple bash vet script. Each of 500 pages pops up, you instantly see the original and the cut result, and you hit y or n. One could go through 500 pages like this in 10 to 20 minutes, and you'd be left with a small subset of pages that didn't get cut properly (extra large footer or whatever). If it's down to 10 pages or some such, that's an easy tweak and fix for those.
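The comment says bash; here is the same idea sketched in Python so the pieces stay testable. Everything here is hypothetical naming: `show` is whatever opens an image in your viewer, and `ask` defaults to the built-in `input`.

```python
from pathlib import Path
from typing import Callable, Iterable


def vet_pages(
    previews: Iterable[Path],
    show: Callable[[Path], None],
    ask: Callable[[str], str] = input,
) -> list[Path]:
    """Show each preview, ask y/n, and return the pages that were rejected."""
    rejected: list[Path] = []
    for preview in previews:
        show(preview)
        answer = ""
        # Keep asking until we get a clear y or n, so a stray keypress
        # doesn't silently accept or reject a page.
        while answer not in ("y", "n"):
            answer = ask(f"{preview.name} cut correctly? [y/n] ").strip().lower()
        if answer == "n":
            rejected.append(preview)
    return rejected
```

The returned list is the small subset that needs a manual tweak. The same function works unchanged for the column-cut pass, you just feed it a different set of previews.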
Once done, you could do the same for column cuts. You'd already have all the scripts, so it's just tweaking.
I'm mentioning all of this because a combo of automation plus human intervention is often the best approach for something like this.
Anyhow, good luck!
EDIT: you can try it yourself for free at https://console.mistral.ai/build/document-ai/ocr-playground once you create a developer account! Fingers crossed to see how well it works for my use case.
It seems like the EU in general should be heavily invested in Mistral's development, but it doesn't seem like they are.
I don't know... this sort of area isn't nearly as sexy as video production or coding or (etc.), but it seems like reaching a better-than-human performance level should be easier for these kinds of workloads.
Until then, they seem to be able to keep enough talent in the EU to train reasonably good models. The kernel is there, which seems like the attainable goal.
Are they? IIRC their best model is still worse than the gpt-oss-120B?
Though I haven't checked other benchmarks and they only report swe
The EU is extremely invested in Mistral's development: half of the effort is finding ways to tax them (hello Zucman tax), the other half is wondering how to regulate them (hello AI act)
Maybe. I think it will be to our benefit when the bubble pops that we are not heavily invested; no harm investing a little.
> can someone help folks at Mistral find more weak baselines to add here? since they can't stomach comparing with SoTA....
> (in case y'all wanna fix it: Chandra, dots.ocr, olmOCR, MinerU, Monkey OCR, and PaddleOCR are a good start)
Its failure modes are also vastly different. VLM-based extraction can misread entire sentences or miss entire paragraphs. Sonnet 3 had that issue. Computer vision models instead will make in-word typos.
Edit: Gemini 2.0 was good enough for VLM cleanup, and now 2.5 or above with structured output makes reconstruction even easier.
On their website, the benchmarks say “Multilingual (Chinese), Multilingual (East-asian), Multilingual (Eastern europe), Multilingual (English), Multilingual (Western europe), Forms, Handwritten, etc.” However, there’s no reference to the benchmark data.
I don’t know how they can make this statement with 79% accuracy rate. For any serious use case, this is an unacceptable number.
I work with scientific journals, and issues like 2.9+0.5 being read as 29+0.5 are something we regularly run into. It means we can never fully trust automated processes and require human verification at every step.
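One way to at least point the human at the risky spots, as a rough sketch: run two independent OCR passes and flag every place where the extracted numbers disagree. This is my own illustration, not something described above, and it cannot catch a digit that both passes get wrong in the same way, so it narrows the review rather than replacing it.

```python
import re
from itertools import zip_longest

# Integers and simple decimals; signs, exponents and units are ignored here.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numeric_mismatches(pass_a: str, pass_b: str) -> list[tuple[str, str]]:
    """Pair up the numbers from two OCR passes and return the pairs that differ."""
    nums_a = NUMBER.findall(pass_a)
    nums_b = NUMBER.findall(pass_b)
    # zip_longest keeps going if one pass found more numbers; the missing
    # side shows up as an empty string.
    return [
        (a, b)
        for a, b in zip_longest(nums_a, nums_b, fillvalue="")
        if a != b
    ]
```

For the example above, comparing `"2.9+0.5"` with `"29+0.5"` flags the pair `("2.9", "29")` and leaves the matching `0.5` alone.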
What matters is whether this is better than competition/alternatives. Of course nobody is just going to take the output as is. If you do that, that's your problem.
If I am wildly off, I am happy to learn.
The previous version already achieved up to 99% accuracy in multiple benchmarks, already better than most OCR software.
- paddleOCR-VL
- olmOCR-2
- chandra
- dots.ocr
I kind of miss having leaderboard sections or arenas for OCR and CV, and providers hosting those models. They're neglected on both Artificial Analysis and OpenRouter.
https://www.ocrarena.ai/leaderboard
Hasn't been updated for Mistral, but so far Gemini seems to top the leaderboard.
Getting the wrong answer really quickly is not the best goal.
It took an hour and a half to install 12 gigabytes of PyTorch dependencies that can't even run on my device, and then it told me it had some sort of versioning conflict. (I think I was supposed to use uv, but I had run out of steam by that point.)
Maybe I should have asked Claude to install it for me. I gave Claude root on a $3 VPS, and it seems to enjoy the sysadmin stuff a lot more than I do...
Incidentally, I had a similar experience installing Open WebUI... It installed 12 GB of PyTorch crap.. I rage quit and deleted the whole thing, and replicated the functionality I actually needed in 100 lines of HTML.... Too bad I can't do that with OCR ;)
But yes, in general, you want to use uv. Otherwise, the next Python application you install WILL break the last one you installed.
I suppose you could use gemini-cli as a substitute for proper Python virtual environment management, always letting it fix whatever broke since the last time you tried to run the program, but that'd be like burning down a rainforest to toast a marshmallow.
E.g. with Gemini 3.0 Flash you might think that model pricing increased only slightly compared to Gemini 2.5 Flash, until you test it and see that what used to be 258 input tokens per 384x384 image is now around 3x more.
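To make the arithmetic concrete, a small sketch. The 258 tokens and the roughly 3x figure come from the comment above; the per-million-token prices in the usage note are placeholders, not real Gemini pricing.

```python
def image_cost_usd(tokens_per_image: float, usd_per_million_tokens: float) -> float:
    """Input cost of one image, given how many tokens it is billed as."""
    return tokens_per_image * usd_per_million_tokens / 1_000_000


def effective_increase(
    old_tokens: float, old_price: float, new_tokens: float, new_price: float
) -> float:
    """How many times more one image costs after both changes are applied."""
    return image_cost_usd(new_tokens, new_price) / image_cost_usd(old_tokens, old_price)
```

With placeholder prices of 1.00 and 1.50 USD per million tokens, `effective_increase(258, 1.00, 3 * 258, 1.50)` comes out at about 4.5: the 1.5x list-price increase multiplies with the 3x token count, so the per-image cost grows far more than the price sheet suggests.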
I've got some foreign artbooks that I would like to get translated. The translations would need to be in place since the placement of the text relative to the pictures around it is fairly important. I took a look at some paid options online, but they seemed to choke - mostly because of the non-standard text placements and all.
The best solution I could come up with is using Google Lens to overlay a translation while I go through the books, but holding a camera/tablet up to my screen isn't very comfortable. Chrome has Lens built in, but (IIRC) I still need to manually select sections for it to translate - it's not as easy to use as just holding my phone up.
Anyone know of any progress towards in-place OCR/translations?
Wonder if Word uses the same system Edge has. I remember Edge was also good, but like Chrome's Lens, I'd need to highlight sections for it to get translated. Edge also OCR'd everything very well - just didn't do the translation part automatically.
1. Use native PDF parsing if the model supports it
2. Use this Mistral OCR model (we updated to this version yesterday)
3. UNLESS you override the "engine" param to use an alternate. We support a JS-based (non-LLM) parser as well [0]
So yes, in practice a lot of OCR jobs go to Mistral, but not all of them.
Would love to hear requests for other parsers if folks have them!
[0] https://openrouter.ai/docs/guides/overview/multimodal/pdfs#p...
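The precedence described in that list, written out as a sketch. This is only my reading of the three steps (an explicit "engine" override wins, then native parsing, then Mistral OCR); the function name and the engine labels are illustrative placeholders, so check the linked docs for the real parameter values.

```python
from typing import Optional


def pick_pdf_engine(
    model_supports_native_pdf: bool,
    engine_override: Optional[str] = None,
) -> str:
    """Mirror the fallback order described above (labels are placeholders)."""
    if engine_override is not None:
        # Step 3: an explicit "engine" override takes priority over the defaults.
        return engine_override
    if model_supports_native_pdf:
        # Step 1: let the model parse the PDF itself when it can.
        return "native"
    # Step 2: otherwise fall back to the Mistral OCR model.
    return "mistral-ocr"
```

So a model without native PDF support and no override ends up on Mistral OCR, which matches the remark that a lot of OCR jobs go there, but not all.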
I will pay a premium for an inferior product or service if it means I don't have to deal with sales people.
I also hate dealing with sales people and am not going to reach out to them via another avenue as they will try and posture as if they’re doing us a huge favor (in contrast to me begging gdb for gpt4 api access).
Regular Gemini Thinking can actually get 70-80% of the documents correct, except for lots of mistakes on given names. ChatGPT maybe understands like 50-60%.
This Mistral model butchered the whole text, literally not a word was usable. To the point I think I'm doing something wrong.
The test document: https://files.fm/u/3hduyg65a5
The model might need tuning in order to be effective - this is normal for releases of image mode models, and after a couple days, there will be properly set up endpoints to test from, so it might be much better than you think. Or it could be really bad with turn of the 19th century Portuguese cursive.
We were mind blown how good Gemini was at it.
Huge timesaver.