Free Open-Source Artificial Intelligence

Where to start? Text Extraction (self.fosai)

submitted 11 months ago by Loopedcandle to c/fosai


I am absolutely new to AI/ML and need some guidance/direction.

Every "New to AI, try this" guide I find either goes down a path that isn't right for the project I'm working on, or is so convoluted with terms I need to look up that I get rather frustrated. Maybe I'm too old to learn/use AI? Anyway . . .

This is my project, and any guidance, pointers, help would be super appreciated. I'm working on a job aggregator. I have a simple web crawler that goes to a url, fetches the HTML, cleans a lot of the text and structure, and outputs the content of the job posting.
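A minimal sketch of what the "clean the HTML" step could look like, assuming only Python's standard library. The names `TextExtractor` and `html_to_text` are made up for illustration; this is not the poster's actual crawler, and fetching the page is left out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []        # one entry per non-empty text node
        self._skip_depth = 0    # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "\n".join(parser.chunks)
```

Joining the text nodes with newlines keeps the boundary between blocks such as a heading and the paragraph after it, which can make the later extraction step easier.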

I then go in manually, look at that simplified HTML and extract the actual job description (vs Company description, benefits, other stuff on a job posting) to be used in another database. I use the exact wording, straight copy and paste, no summarization or interpretation.

I have about 400 data points in a database, all manually extracted, that look like this:

job_site: "COMPANY_NAME"
raw_html: "Job TitleThis is what we doWe are looking for someone who"
job_description: "We are looking for someone who"

I feel like I can use that as training data to do some form of text . . . extraction ?? . . . from an HTML document. But I don't have any clue where to start.
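Because the job description is copied word for word from the page text, one possible way to frame records like the one described above is as span extraction: each example becomes the full text plus the start and end character positions of the description. The sketch below assumes the description really does appear verbatim in `raw_html` (any whitespace differences would need normalizing first); `to_span_example` is a hypothetical helper name.

```python
def to_span_example(record):
    """Turn one hand-labelled record into a character-span training example.

    Returns None when the description is empty or is not found verbatim
    in the page text, so those records can be reviewed by hand.
    """
    text = record["raw_html"]
    answer = record["job_description"]
    if not answer:
        return None
    start = text.find(answer)
    if start == -1:
        return None
    return {"text": text, "start": start, "end": start + len(answer)}


# The sample record from the post.
record = {
    "job_site": "COMPANY_NAME",
    "raw_html": "Job TitleThis is what we doWe are looking for someone who",
    "job_description": "We are looking for someone who",
}

example = to_span_example(record)
# Slicing the text with the stored span gives back the description.
extracted = example["text"][example["start"]:example["end"]]
```

Running this over all 400 records is also a cheap sanity check: any record that comes back as `None` has a description that does not match the page text exactly.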


[–] Loopedcandle 2 points 11 months ago (1 children)

Thanks for this! I'll start learning!

A friend mentioned I should start with a pre-trained model, because 400 examples (growing by 50ish a week with my crawler) is not nearly enough, and then do continued learning on that pre-trained model. Does that sound right?

[–] coolkicks 1 points 11 months ago

Yeah, model training is hard. Like capital H HARD. You need a bunch of data, and it needs to be high quality.

New York is the financial center of the USA, so a model can easily confuse "finance job" with "posting written by someone using New England vernacular". Separating the two is a step you need to go through to make sure your data is high enough quality.

So if you are just starting, use the 20 Newsgroups dataset from those links. It’s pretty good data with a ton of resources written about it. It’s not fun data, but it isn’t as likely to fall victim to biases you aren’t expecting.