Hank G ☑️

3 years ago from Twitter Web App •

One thing I've enjoyed about hosting my own #friendica server on the #fediverse is being able to dabble with the addons, such as the language detection feature, which auto-hides posts in languages that aren't on your list of known languages. Works pretty well.

Hypolite Petovan

3 years ago •

It still is hit-and-miss with mostly image posts and quote-share posts. But I'm still happy with it.

Hank G ☑️

3 years ago •

Yes, very. I was surprised it was triggering at all in those, since the text is below the threshold, but I think it is counting the raw text, not just the displayable text (i.e. it is going through all the markup, URLs, etc.).

Hypolite Petovan

3 years ago •

Yeah, we don't have a good way to extract the displayable text from the markup yet.

Hank G ☑️

3 years ago •

Is the plugin operating client-side or server-side? If client-side, the post could be run through the HTML parser, which should be able to separate the markup from the non-markup, and then only the non-markup would be run through detection. It would then only run on text added to the quote text. So if I did a simple reshare of a German post it would not trigger at all, though that would be a side effect.
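A rough sketch of the idea in the comment above, in Python rather than the server's own code: feed the post's HTML to a parser and keep only the text nodes, so attribute values such as link and image URLs never reach language detection. The names `TextExtractor` and `displayable_text` are made up for this illustration and are not part of Friendica.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tag names and attributes
        # (including href/src URLs) are never passed here.
        self.parts.append(data)


def displayable_text(html):
    """Return the concatenated text nodes of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)
```

For example, `displayable_text('<p>Hallo <a href="https://example.com/bild.png">Welt</a></p>')` yields only the words, without the URL.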

Hypolite Petovan

3 years ago •

It runs server-side, but we have the HTML at this stage, so we possibly could do what you're saying.

Hypolite Petovan

3 years ago •

Hum, looking into it, it seems we already are either stripping the tags of the HTML output or converting the BBCode to plaintext if we don't have the HTML output. The latter would be imperfect as image and link URLs would end up showing in the text we then parse to guess the language.

Hmm.

Hypolite Petovan

3 years ago •

Ok, I found the issue. Removing the tags doesn't remove the whitespace, so we run the language detection on messages that have very little content but still reach the minimum length thanks to the spaces and tabs. I'll have a fix shortly. It should prevent most false positives, especially with share posts that are heavy on HTML tags with a lot of indentation whitespace.
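The whitespace problem described above can be illustrated with a small Python sketch. This is not the addon's code: the `MIN_LENGTH` value, the function names, and the crude regex tag stripper are all assumptions chosen for the example. The point is only that the length check should happen after runs of whitespace are collapsed.

```python
import re

MIN_LENGTH = 32  # hypothetical threshold; the real value is not given in this thread


def strip_tags(html):
    """Crude tag removal: the whitespace between tags is left behind."""
    return re.sub(r"<[^>]+>", "", html)


def normalized(text):
    """Collapse every run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def should_detect(html, min_length=MIN_LENGTH):
    """Only run language detection when enough real text remains."""
    return len(normalized(strip_tags(html))) >= min_length


# A share-style post: heavy indentation, almost no content.
post = "<div>\n" + "\t" * 20 + "<p>Hallo</p>\n" + "\t" * 20 + "</div>"
```

With this `post`, the stripped text is mostly tabs and newlines and passes a naive length check, while the normalized text is just the single word and does not.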

Hypolite Petovan

3 years ago •

Haven't found a solution for Wordle posts interpreted as Dutch, though.

Andy H3

3 years ago •

Yes hit-and-miss with posts and quote-share posts.

But some languages are totally beyond it. Portuguese is one example: almost all of the text posts in Portuguese are consistently identified as Spanish.

Hank G ☑️

3 years ago •

Does Italian get conflated with Spanish too?

Hypolite Petovan

3 years ago •

Nope, it’s less similar than you seem to think 😉

Hank G ☑️

3 years ago •

I know my Italian grandfather could watch Spanish shows but I had heard that written language wise it was too different. I just started some Portuguese DuoLingo exercises last month. I was surprised how different from Spanish it was, not that my Spanish is great to begin with.

Hypolite Petovan

3 years ago •

The language filter works by identifying common letter associations in a given language. So it is possible that the words are different between Spanish and Portuguese but the letter associations are similar. The other way around, it's possible that Italian words are close enough to Spanish for humans to guess their meaning while the letter associations are very different.
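To make the "letter associations" idea concrete, here is a minimal Python sketch that compares texts by their character trigrams. The thread does not say which n-gram size or scoring the addon uses, so trigrams, cosine similarity, and the names `trigrams`, `similarity`, and `closest_language` are all assumptions for illustration only.

```python
import math
from collections import Counter


def trigrams(text):
    """Count overlapping three-character sequences, padded with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def similarity(a, b):
    """Cosine similarity between the trigram counts of two texts."""
    ta, tb = trigrams(a), trigrams(b)
    dot = sum(ta[g] * tb[g] for g in ta.keys() & tb.keys())
    norm = (math.sqrt(sum(v * v for v in ta.values()))
            * math.sqrt(sum(v * v for v in tb.values())))
    return dot / norm if norm else 0.0


def closest_language(sample, references):
    """Pick the language whose reference text is most similar to the sample.

    `references` maps a language name to a reference text for it.
    """
    return max(references, key=lambda lang: similarity(sample, references[lang]))
```

Two languages can share few whole words and still share many trigrams, which is how a detector of this kind could confuse them. A real detector would build its profiles from large corpora rather than from a single reference string.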

Hank G ☑️

3 years ago •

Oh I thought it was doing a dictionary lookup of some sort. Good to know.

Hypolite Petovan

3 years ago •

No, it's cheaper this way, but it's also less accurate, even before the weird message text we feed it.
