Machine Learning - Data | Vector | Storage | Processing

founded 2 years ago
Common Crawl (lemmy.intai.tech)
submitted 2 years ago* (last edited 2 years ago) by [email protected] to c/[email protected]
 
 

https://commoncrawl.org/

Frequently Asked Questions

https://commoncrawl.org/big-picture/frequently-asked-questions/

General Questions

What is Common Crawl?

Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.

What can you do with a copy of the web?

The possibilities are endless, but people have used the data to improve language translation software, predict trends, track disease propagation and much more.

Can’t Google or Microsoft just do that?

Our goal is to democratize the data so everyone, not just big companies, can do high quality research and analysis.

What terms is the data released under?

As strong believers in open data, we apply as few restrictions as possible to the dataset. The terms we add, primarily in an effort to prevent abusive or illegal usage, are fully described on our terms of use page.

Technical Questions

What is the CCBot crawler?

CCBot is a Nutch-based web crawler that makes use of the Apache Hadoop project. We use Map-Reduce to process and extract crawl candidates from our crawl database. This candidate list is sorted by host (domain name) and then distributed to a set of spider (bot) servers.

How does the bot identify itself?

Our older bot identified itself with the following User-Agent string: CCBot/1.0 (+https://commoncrawl.org/bot.html). The current version identifies itself as CCBot/2.0. We may increment the version number in the future. Contact information (a link to the FAQs) is sent along with the User-Agent string.

Will your bot make my website slow for other users?

The CCBot crawler has a number of algorithms designed to prevent undue load on web servers for a given domain. We have taken great care to ensure that our crawler will never cause web servers to slow down or be inaccessible to other users.

The crawler uses an adaptive back-off algorithm that slows down requests to your website if your web server is responding with an HTTP 429 or 5xx status. By default our crawler waits a few seconds before sending the next request to the same site.
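The back-off behaviour described above can be illustrated with a minimal sketch. This is not CCBot's actual algorithm, which the FAQ does not spell out; the doubling rule, the 2-second base delay, the 300-second ceiling, and the function name `next_delay` are all assumptions made for illustration.

```python
def next_delay(current_delay: float, status: int,
               base_delay: float = 2.0, max_delay: float = 300.0) -> float:
    """Return the wait (in seconds) before the next request to the same host.

    Hypothetical policy: double the delay on HTTP 429 or 5xx,
    otherwise fall back to the default politeness delay.
    """
    if status == 429 or 500 <= status <= 599:
        # Server is rate limiting or failing: back off, up to a ceiling.
        return min(current_delay * 2, max_delay)
    # Healthy response: return to the default delay.
    return base_delay


delay = 2.0
for status in (200, 503, 503, 429, 200):
    delay = next_delay(delay, status)
    print(status, delay)
```

A real crawler would probably decay the delay gradually instead of resetting it at once; the immediate reset here only keeps the sketch short.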

How can I ask for a slower crawl if the bot is taking up too much bandwidth?

We obey the Crawl-delay parameter for robots.txt. By increasing that number, you will indicate to CCBot to slow down the rate of crawling. For instance, to limit our crawler from requesting pages more than once every 2 seconds, add the following to your robots.txt file:

User-agent: CCBot
Crawl-Delay: 2
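One way to check that a crawl-delay rule parses as intended is Python's standard `urllib.robotparser` module. The sketch below feeds the two-line rule to the parser directly instead of fetching a live file; it shows how one standards-following parser reads the rule, not how CCBot itself parses it.

```python
from urllib.robotparser import RobotFileParser

# The rule from the FAQ, written one directive per line.
ROBOTS_TXT = """\
User-agent: CCBot
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns the delay in seconds for the given user agent.
ccbot_delay = parser.crawl_delay("CCBot")
print(ccbot_delay)
```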

How can I block this bot?

You can configure your robots.txt file, which uses the Robots Exclusion Protocol, to block the crawler. Our bot’s Exclusion User-Agent string is: CCBot. Add these lines to your robots.txt file and our crawler will stop crawling your website:

User-agent: CCBot
Disallow: /

We will periodically continue to check whether the robots.txt file has been updated.

How can I ensure this bot can crawl my site effectively?

The crawler supports the sitemap protocol and utilizes sitemaps announced in the robots.txt file.

Does the bot support conditional gets/compression?

We do support conditional get requests. We also currently support the gzip and brotli encoding formats.
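For readers unfamiliar with conditional GETs, here is a rough sketch of the client side of that exchange. The helper names (`conditional_headers`, `decode_body`, `handle_response`) are hypothetical, no network request is made, and only gzip is decoded because brotli is not in the Python standard library.

```python
import gzip


def conditional_headers(etag=None, last_modified=None):
    """Build request headers for a conditional, compressed fetch."""
    headers = {"Accept-Encoding": "gzip"}
    if etag:
        # Server answers 304 if the entity tag still matches.
        headers["If-None-Match"] = etag
    if last_modified:
        # Server answers 304 if the page is unchanged since this date.
        headers["If-Modified-Since"] = last_modified
    return headers


def decode_body(body: bytes, content_encoding: str) -> bytes:
    """Undo the Content-Encoding applied by the server (gzip only here)."""
    if content_encoding.lower() == "gzip":
        return gzip.decompress(body)
    return body


def handle_response(status: int, body: bytes, content_encoding: str,
                    cached_body: bytes) -> bytes:
    """Reuse the cached copy on 304 Not Modified, else decode the new body."""
    if status == 304:
        return cached_body
    return decode_body(body, content_encoding)


print(conditional_headers(etag='"abc123"'))
print(handle_response(200, gzip.compress(b"fresh page"), "gzip", b"old page"))
```

The benefit for a site owner is that an unchanged page costs only a small 304 response instead of a full transfer.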

Why is the bot crawling pages I don’t have links to?

The bot may have found your pages by following links from other sites.

What is the IP range of the bot?

Older versions used the IPs 38.107.191.66 through 38.107.191.119. The current version crawls from Amazon AWS.

Does the bot support nofollow?

Currently, we do honor the nofollow attribute as it applies to links embedded on your site. It should be noted that the nofollow attribute value is not meant for blocking access to content or preventing content from being indexed by search engines. Instead, the nofollow attribute is primarily used by site authors to prevent search engines such as Google from having the source page’s PageRank impact the PageRank of linked targets. If we ever did ignore nofollow in the future, we would do so only for the purposes of link discovery and would never create any association between the discovered link and the source document.

What parts of robots.txt does the bot support?

We support Disallow as well as Disallow / Allow combinations. We also support the crawl-delay directive and the sitemap directive.
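To see how a Disallow / Allow combination and a sitemap directive interact, the sketch below runs a sample robots.txt through Python's `urllib.robotparser`. The `example.com` URLs and paths are placeholders. Python's parser applies the first matching rule, so the Allow line is placed before the broader Disallow; other parsers, possibly including CCBot's, may resolve overlaps differently. `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block everything except /public/ for CCBot.
ROBOTS_TXT = """\
User-agent: CCBot
Allow: /public/
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("CCBot", "https://example.com/public/page.html")
blocked = parser.can_fetch("CCBot", "https://example.com/private/page.html")
sitemaps = parser.site_maps()  # list of announced sitemap URLs, or None

print(allowed, blocked, sitemaps)
```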

What robots meta tags does the bot support?

We support the NOFOLLOW meta-tag.

What to do with the crawled content?

The crawl data is stored on Amazon’s S3 service, allowing it to be bulk downloaded as well as directly accessed for map-reduce processing in EC2.

 
 

Actual Open-Orca Dataset from openorca team

 
 

A member of the orca team is taking the orca data set and forking the project. This is billed as the "uncensored" data set. The orca team claims it is an earlier set with less refinement.

 
 

cross-posted from: https://lemmy.intai.tech/post/41747

On the Coverage of Cognitive mmWave Networks with Directional Sensing and Communication

Authors: Shuchi Tripathi, Abhishek K. Gupta, SaiDhiraj Amuru

Word Count: 5400

Average Reading Time: ~30 minutes

Highlights:

• The authors propose an analytical framework to evaluate the performance of a cognitive mmWave network consisting of a primary link and multiple secondary links using stochastic geometry.

• They consider directional channel sensing and communication, in contrast to omnidirectional sensing. This lets secondary transmitters decide whether to transmit based on their orientation, rather than only on whether they lie outside a certain distance, which provides better spatial reuse for secondary transmitters.

• They analyze the medium access probability, activity factor, and coverage probability of the primary and secondary links considering various parameters like directionality, threshold, density, etc.

• They show that directionality can improve the trade-off between the primary and secondary link performances by increasing both link coverages for appropriate threshold values.

• However, the effect of primary and secondary directionality depends on the location and orientation of the secondary links. While primary directionality does not always aid secondary coverage, secondary directionality always improves it.

• They propose an adaptive directional sensing where secondary links can choose higher or lower directionality based on their location to achieve similar coverage performances.

In summary, this work provides useful analytical insights into the performance of cognitive mmWave networks with directional sensing and communication. The proposed mathematical framework and results could potentially aid in the design and optimization of such networks.

Regarding applications of large language models, the analytical approach and results in this work could provide useful guidelines and inputs for developing agent-based simulations of cognitive mmWave networks. The simulations could leverage language models to emulate the behaviors of cognitive transmitters based on the derived insights to validate and extend the proposed framework.

submitted 2 years ago* (last edited 2 years ago) by [email protected] to c/[email protected]
 
 

This is confirmed as the ACTUAL Open-Orca set; the other copy of the data set has been renamed Dolphin.

 
 

An open-source replication of OpenAI's WebText dataset, which was used to train GPT-2.

This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University.

Initial Data Collection and Normalization

The authors started by extracting all Reddit post URLs from the Reddit submissions dataset. These links were deduplicated, filtered to exclude non-HTML content, and then shuffled randomly. The links were then distributed to several machines in parallel for download, and all web pages were extracted using the newspaper Python package. Using Facebook FastText, non-English web pages were filtered out.

Subsequently, near-duplicate documents were identified using locality-sensitive hashing (LSH). Documents were hashed into sets of 5-grams and all documents that had a similarity threshold of greater than 0.5 were removed. The remaining documents were tokenized, and documents with fewer than 128 tokens were removed. This left 38GB of text data (40GB using SI units) from 8,013,769 documents.
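The deduplication and length filter described above can be approximated in a few lines. This sketch is not the authors' code: it computes exact Jaccard similarity over word 5-gram sets instead of using LSH (which approximates the same comparison at scale), it tokenizes on whitespace, and it keeps the first document of each near-duplicate group. All three choices, and the function names, are assumptions.

```python
def shingles(tokens, n=5):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.5, min_tokens=128):
    """Drop near-duplicates (similarity > threshold), then short documents."""
    kept, kept_shingles = [], []
    for text in docs:
        doc_shingles = shingles(text.split())
        # Quadratic comparison: fine for a demo, too slow for millions of docs.
        if any(jaccard(doc_shingles, s) > threshold for s in kept_shingles):
            continue
        kept.append(text)
        kept_shingles.append(doc_shingles)
    # Length filter comes after deduplication, as in the description above.
    return [text for text in kept if len(text.split()) >= min_tokens]


doc_a = " ".join(f"w{i}" for i in range(200))
doc_b = doc_a + " extra"                      # near-duplicate of doc_a
doc_c = " ".join(f"x{i}" for i in range(200)) # unrelated document
doc_d = "too short"                           # fewer than 128 tokens

result = deduplicate([doc_a, doc_b, doc_c, doc_d])
print(len(result))
```

The pairwise comparison is the part LSH replaces: hashing lets similar documents land in the same bucket so that only likely matches are compared.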

 
 

Introducing HomoScriptor - A human-written, community-driven dataset for fine-tuning large language models.

Greetings, AI Community!

I am thrilled to announce the launch of HomoScriptor, a collaborative project that aims to revolutionize language models and drive innovation in natural language processing. And I want YOU to join me on this incredible journey!

What is HomoScriptor?

HomoScriptor is a vibrant and collaborative initiative where language model enthusiasts like myself can come together to create a remarkable human-written dataset for fine-tuning language models. I have curated a diverse collection of meticulously organized JSON files, specifically designed to enhance the training of large language models (LLMs).

Key Features:

📁 Categorized JSON Files: The dataset in HomoScriptor is thoughtfully organized into various categories, each with its own JSON file. This structured approach makes it effortless for us to explore specific linguistic domains and seamlessly incorporate them into our LLM training pipeline.

📋 Short and Long Variant Outputs: Versatility is important! Every task in the JSON files includes both short and long variant outputs. This flexibility allows us to tailor the dataset to meet our specific needs, accommodating a wide range of applications and use cases.

🤝 Open-Source and Collaborative: At HomoScriptor, I embrace the power of collaboration. I actively encourage and welcome contributors from all backgrounds to join our project and help it grow. By sharing your expertise and insights, we can enhance the overall quality of the dataset and ensure its relevance to the broader language model research community.

Join the HomoScriptor Community: https://discord.gg/9C5ec9Eysk

Together, let's create a remarkable dataset that fuels innovation and drives the progress of language models!

Best regards,
