Last Saturday, Triplegangers CEO Oleksandr Tomchuk received alarming news: his company’s e-commerce site had gone offline. At first glance, it appeared to be a distributed denial-of-service (DDoS) attack. However, the culprit turned out to be a bot from OpenAI, aggressively attempting to scrape the website’s content.

“We have over 65,000 products, each with a dedicated page,” Tomchuk told TechCrunch. “Each page has at least three photos.” OpenAI’s bot was reportedly sending “tens of thousands” of server requests, aiming to download hundreds of thousands of images and their accompanying descriptions.

Tomchuk revealed that OpenAI used more than 600 IP addresses in its attempts to access his site. “We are still analyzing logs from last week—it’s possible the number is even higher,” he said. “Their crawlers were crushing our site. It was basically a DDoS attack.”

A Small Business in a Big Data World

Triplegangers, a small seven-person company based in Ukraine with a U.S. license in Tampa, Florida, has spent over a decade building what it describes as the largest database of “human digital doubles.” These include 3D image files of human models, ranging from hands to hair, skin, and full-body scans. The company’s business model revolves around selling these detailed assets to 3D artists, video game developers, and other digital creators.

Triplegangers’ terms of service explicitly prohibit bots from accessing its data without permission, but that alone wasn’t enough to stop OpenAI’s bot. To keep crawlers like GPTBot away, a site must serve a properly configured robots.txt file; OpenAI’s documentation says its crawlers respect such files, but only when the file explicitly names its user agents.

However, as Tomchuk pointed out, this system unfairly shifts the burden to website owners. Without the right setup, AI crawlers treat sites as open to scraping. “It’s not an opt-in system,” Tomchuk said.
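For illustration, OpenAI’s published guidance is that GPTBot honors standard robots.txt directives addressed to it by name. A minimal file along these lines (a sketch of the documented mechanism, not necessarily the configuration Triplegangers deployed) tells GPTBot to stay out while leaving other crawlers unaffected:

```
# robots.txt served from the site root
# Disallow OpenAI's GPTBot crawler from every path
User-agent: GPTBot
Disallow: /
```

As Tomchuk’s complaint suggests, the burden is on the site owner to know these user-agent names exist and to list them preemptively; a site that never mentions them is treated as open to crawling.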

Costly Consequences

The relentless scraping attempts not only took Triplegangers offline during U.S. business hours but also left Tomchuk anticipating a steep AWS bill from the excessive CPU usage and downloads triggered by the bot.

Even after implementing a robots.txt file and deploying Cloudflare to block OpenAI’s bot, along with other crawlers such as TikTok’s ByteSpider and Barkrowler, Tomchuk remains uncertain how much data was already taken. OpenAI offers no way to contact it or to request removal of scraped content, and a long-promised opt-out tool has yet to materialize.
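Blocking at the edge typically works the same way: match on the crawler’s declared user agent and deny the request. As a rough sketch (the exact rule Triplegangers used isn’t public, and the user-agent strings shown are assumptions), a Cloudflare WAF custom rule with the action set to Block might use an expression like this:

```
(http.user_agent contains "GPTBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "Barkrowler")
```

The obvious limitation, and part of Tomchuk’s frustration, is that user-agent matching only stops crawlers that identify themselves honestly.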

A Worrying Trend

The implications for Triplegangers are especially significant, given its reliance on data rights and GDPR compliance. The company’s assets involve scans of real individuals, making unauthorized use a legal gray area. “They cannot just take a photo of anyone on the web and use it,” Tomchuk explained, highlighting the need for stricter data governance in the AI age.

This is not an isolated incident. New research from digital advertising firm DoubleVerify reported an 86% increase in invalid traffic—non-human or bot-related—in 2024, fueled in part by AI crawlers. Tomchuk warns that most small businesses remain unaware of such risks: “Most sites remain clueless that they were scraped by these bots.”

While Triplegangers was able to block further scraping, Tomchuk says the incident revealed vulnerabilities in how AI companies operate. “It’s scary because there seems to be a loophole,” he said, emphasizing that businesses must actively monitor their logs to detect such activity.

A Call for Responsibility

Tomchuk believes the onus should be on AI companies to seek permission before scraping data. “They should be asking permission, not just scraping data,” he said, summarizing the frustrations shared by many small online businesses navigating the new realities of AI-driven data collection.