Web Scraping via JavaScript Runtime Heap Snapshots

摘要

In recent years, the web has gotten very hostile to the lowly web scraper. It's a result of the natural progression of web technologies away from statically rendered pages to dynamic apps built with frameworks like React and CSS-in-JS. Developers no longer need to label their data with class-names or ids - it's only a courtesy to screen readers now.

There's also been a concerted effort by large companies to protect their public data. Facebook, for example, employs a team of over 100 people to make sure it is as difficult as possible for any data to escape the black hole. Granted, some of these large companies do offer APIs for their data but rarely is this unrestricted. You're usually at the whim of their app review process or granted access only to a partial view of the data. Data that would be otherwise public if you were to do a Google search and click through to their website manually.

This can be frustrating if you're like me - somebody who wanted to build a small, local, non-profit app that uses data hosted on a closed platform. The data is public but completely inaccessible to machines because of aggressive anti-web-scraping measures. That gave me two options - input the data manually or play the web-scraping game. Of course, I chose the latter.

欢迎在评论区写下你对这篇文章的看法。

评论

首页 - Wiki
Copyright © 2011-2024 iteam. Current version is 2.137.1. UTC+08:00, 2024-11-19 14:39
浙ICP备14020137号-1 $访客地图$