this post was submitted on 10 Jul 2023
Technology
Sort of - the models are able to predict numerical property values given a large amount of data to observe during training. In other words, given the scope of known data, we can extrapolate predictions for new data. The predictive capabilities of the model are only as reliable as the data used to train it, and unfortunately in our case we only have hundreds of samples per property, as opposed to other ML tasks with millions of samples. This highlights how much time it actually takes to find, synthesize, and experimentally test molecules!
Unfortunately neural networks, especially traditional multi-layered feed-forward networks, are often seen as a "black box" approach to regression and classification, where we don't really understand how a network learns or why its weights are tuned the way they are. Analysis methods have come a long way, but ambiguity still exists.
What we have done, however, is find the statistical significance of specific molecular substructures as they relate to combustion properties. For example, when we trained our models to predict sooting propensity (the amount of pollution formed during combustion), we noticed that various algorithms, such as random forest regression, were putting a heck of a lot more weight on a molecular variable measuring path length (length of carbon chains, number of higher-order bonds). From this, we were able to conclude that long-chain hydrocarbons with a higher number of double or triple bonds form more soot, which also gave us an idea of which mechanistic pathways we should stay away from when producing bio-oil.
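To make the "which variable carries the weight" idea concrete, here is a minimal sketch using only the Python standard library. It is not our actual pipeline: the data is synthetic, the descriptor names `path_length` and `unrelated` are hypothetical, and a small k-nearest-neighbour regressor stands in for random forest regression. The importance measure is permutation importance (shuffle one descriptor and see how much the error grows), which is a different measure from the impurity-based weights a random forest reports, but it answers the same question.

```python
import random

def knn_predict(train_X, train_y, x, k=5):
    """Average the targets of the k training rows closest to x."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(train_X, train_y, test_X, test_y):
    """Mean squared error of the k-NN predictions on the test rows."""
    errors = [
        (knn_predict(train_X, train_y, x) - y) ** 2
        for x, y in zip(test_X, test_y)
    ]
    return sum(errors) / len(errors)

def permutation_importance(train_X, train_y, test_X, test_y, names, rng):
    """Increase in test error after shuffling each descriptor column."""
    baseline = mse(train_X, train_y, test_X, test_y)
    scores = {}
    for j, name in enumerate(names):
        column = [row[j] for row in test_X]
        rng.shuffle(column)  # break the link between descriptor j and the target
        shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(test_X, column)]
        scores[name] = mse(train_X, train_y, shuffled, test_y) - baseline
    return scores

rng = random.Random(0)
names = ["path_length", "unrelated"]  # hypothetical descriptor names

def make_rows(n):
    """Synthetic molecules: the target depends on path_length only."""
    X = [[rng.random(), rng.random()] for _ in range(n)]
    y = [10.0 * row[0] + rng.gauss(0.0, 0.1) for row in X]
    return X, y

train_X, train_y = make_rows(200)  # "hundreds of samples", as in our case
test_X, test_y = make_rows(100)

scores = permutation_importance(train_X, train_y, test_X, test_y, names, rng)
for name, score in scores.items():
    print(f"{name}: error increase {score:.3f}")
```

Shuffling `path_length` should wreck the predictions, while shuffling `unrelated` should barely matter. That gap is the kind of signal that pointed us toward chain length and bond order.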
As for fuel-grade molecules, we've found that furanic compounds and compounds with cyclohexane substructures generally match diesel fuel in operating efficiency (cetane number) and energy density (lower heating value, MJ/kg), operate well in various environments (optimal flash, boiling, and cloud points, deg. C), and produce much less soot (yield sooting index). The next step is finding a cheap way to mass produce the stuff!
Recently we've started down the rabbit hole of fungus-derived bio-oils: terpenes (yes, those terpenes!) derived from fungus may be useful as soot-reducing fuel additives.
this is actually very interesting. i take it you've heard of the concept of "mechanistic interpretability"? perhaps you could learn something about your networks by implementing some of that methodology. here's a glossary. also recommend poking around neel nanda's blog if you want to learn more.
Thanks for sharing! These seem to focus on LLMs/transformers, but since they use MLPs I should be able to find a way to adapt them for my use!