linuxmemes

339

Parsing HTML with regex (lemmy.sdf.org)

submitted 8 months ago by [email protected] to c/linuxmemes

40 comments

cross-posted from: https://lemmy.sdf.org/post/12950329


[–] [email protected] 16 points 8 months ago (1 children)

There's a difference between 'processing' the text and 'parsing' it. The processing described in the section you posted is fine, and you can manage a similar level of processing on HTML. The tricky/impossible bit is parsing the languages. For instance, you can't write a regex that'll reliably find the subject, object and verb in any English sentence, and you can't write a regex that'll break an HTML document down into a hierarchy of tags, as regexes don't support counting depth of recursion. HTML is irregular anyway, meaning it can't be reliably parsed with a regular parser.
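To make the depth-counting point concrete, here is a rough Python sketch (not from the original post). It assumes a tiny well-formed snippet and uses the standard library's `html.parser`; the `DepthTracker` class and the sample string are made up for illustration. The regex has no notion of nesting, so it pairs the first opening tag with the first closing tag, while the parser-based version keeps a depth counter.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample input: one div nested inside another.
html = "<div>outer <div>inner</div> tail</div>"

# A non-greedy regex pairs the first <div> with the FIRST </div> it sees,
# so the "content" of the outer div is cut off in the middle.
match = re.search(r"<div>(.*?)</div>", html)
regex_result = match.group(1)


class DepthTracker(HTMLParser):
    """Records each start tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        self.depth -= 1


tracker = DepthTracker()
tracker.feed(html)
tracker.close()

print(repr(regex_result))   # 'outer <div>inner'
print(tracker.outline)      # [(0, 'div'), (1, 'div')]
```

A greedy `.*` would happen to work on this one input but breaks on two sibling divs instead, which is the same problem from the other side: the pattern cannot count how many tags are currently open.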

[–] Blue_Morpho -2 points 8 months ago (1 children)

For instance you can’t write a regex that’ll reliably find the subject, object and verb in any English sentence

Identifying parts of speech isn't a requirement of the word parse. That's the linguistic definition. In computer science identifying tokens is parsing.

https://en.m.wikipedia.org/wiki/Parsing

[–] [email protected] 9 points 8 months ago (1 children)

That's certainly one level of parsing, and sometimes all you need, but as the article you posted says, it more usually refers to generating a parse tree. To do that in a natural language isn't happening with a regex.
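A small Python sketch of the two levels, under loud assumptions: a toy tag language with no attributes, always well-formed, and nothing like real HTML. The names `TOKEN`, `tokenize` and `build_tree` are invented for this example. The regex handles the token level fine; turning tokens into a tree needs a stack, which is the part a regex can't provide.

```python
import re

# Token level: a regex is perfectly adequate for splitting the input
# into tags and text runs (toy grammar: no attributes, no comments).
TOKEN = re.compile(r"</?\w+>|[^<]+")


def tokenize(text):
    return TOKEN.findall(text)


def build_tree(tokens):
    """Tree level: a stack tracks which element is currently open."""
    root = {"tag": "root", "children": []}  # hypothetical wrapper node
    stack = [root]
    for tok in tokens:
        if tok.startswith("</"):
            stack.pop()                      # close the current element
        elif tok.startswith("<"):
            node = {"tag": tok[1:-1], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)               # descend into the new element
        else:
            stack[-1]["children"].append(tok)  # plain text
    return root


tokens = tokenize("<p>Hello <b>world</b></p>")
tree = build_tree(tokens)

print(tokens)  # ['<p>', 'Hello ', '<b>', 'world', '</b>', '</p>']
```

The token list is flat; only `build_tree` knows that `<b>` sits inside `<p>`.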

[–] uranibaba 1 points 8 months ago

Thanks for all the explaining. Ever since I first saw the Stack Overflow post, I've wondered why you can't parse HTML with regex when you can take any HTML code you find and write an expression to work against that set of data.

I never understood the word parse to mean understanding and building a structure based on any input.