Top AI dataset pulls data from BitcoinTalk, Steemit, and U.S. SEC


The Colossal Clean Crawled Corpus (C4), an AI dataset used by major tech companies, contains data from various crypto-related websites.

C4 dataset draws from crypto sites

The Washington Post and the Allen Institute for AI recently analyzed the C4 dataset, ranking websites by the number of “tokens” or text snippets taken from each source.

The website of the U.S. Securities and Exchange Commission (sec.gov), which includes content on cryptocurrency regulation, was among the dataset’s largest sources. It ranked #39 and accounted for 36 million tokens, or 0.02% of C4’s total.

Bitcointalk.org, a blockchain discussion board created by Satoshi Nakamoto, ranked at #780. It accounted for 6.1 million, or 0.004%, of C4’s tokens.
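The token counts and percentages above also hint at the overall size of C4. The short Python sketch below back-calculates the total implied by each reported figure. It assumes only the numbers quoted in this article, and the function name `implied_total` is illustrative. Because the percentages are rounded, the two estimates differ and should be read as rough bounds rather than an exact count.

```python
# Figures as reported in this article: (token count, share of C4 in percent).
# The percentages are rounded, so the implied totals are only approximate.
reported = {
    "sec.gov": (36_000_000, 0.02),
    "bitcointalk.org": (6_100_000, 0.004),
}


def implied_total(tokens: int, percent: float) -> float:
    """Return the total corpus size implied by a token count and its percent share."""
    return tokens / (percent / 100)


for site, (tokens, percent) in reported.items():
    total_billions = implied_total(tokens, percent) / 1e9
    print(f"{site}: about {total_billions:.1f} billion tokens implied")
```

Both estimates land in the range of roughly 150 to 180 billion tokens, which shows how small even the highest-ranked crypto-related sources are relative to the dataset as a whole.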

Cryptocurrency news and aggregation sites such as Cointelegraph and Coinmarketcap.com were also represented. Eight such sites collectively accounted for at least 0.008% of C4’s tokens, though other sites not counted here likely push the true total higher.

Websites related to specific cryptocurrencies and exchanges were also represented in the dataset but accounted for a negligible number of tokens.

Two crypto-adjacent sites also ranked highly. IPFS (ipfs.io) ranked #16, while Steemit (steemit.com) ranked #594. IPFS is a distributed network from the blockchain firm Protocol Labs, while Steemit makes direct use of blockchain. However, these sites do not necessarily contain content related to cryptocurrency.

Mainstream sites topped the list

The C4 dataset has been used to train AI language models from major tech companies, including Google’s T5 and Facebook’s LLaMA, according to the Washington Post.

Though the above sites are among C4’s most significant crypto-related sources, they are outranked by mainstream websites and news outlets, which often cover cryptocurrency topics and are likely the primary source of crypto-related data in the dataset.

C4 has also been criticized for containing hate speech and pirated data. Though the dataset’s name suggests that it has been “cleaned,” its assemblers used only a list of 400 words to filter out specific content, meaning that much controversial content remains intact.

The presence of crypto sites, as well as the presence of controversial data, could affect the level of bias seen in content produced by AI chatbots trained on the dataset.


