Web Scraping via JavaScript Runtime Heap Snapshots
In recent years, the web has become very hostile to the lowly web scraper. It's a result of the natural progression of web technologies away from statically rendered pages towards dynamic apps built with frameworks like React and styling techniques like CSS-in-JS. Developers no longer need to label their data with class names or ids; it's only a courtesy to screen readers now.
There's also been a concerted effort by large companies to protect their public data. Facebook, for example, employs a team of over 100 people to make sure it is as difficult as possible for any data to escape the black hole. Granted, some of these large companies do offer APIs for their data, but access is rarely unrestricted. You're usually at the whim of their app review process or granted only a partial view of the data, even though that same data is public to anyone who does a Google search and clicks through to the website manually.
How HTML looks nowadays
This can be frustrating if you're like me - somebody who wanted to build a small, local, non-profit app that uses data hosted on a closed platform. The data is public but completely inaccessible to machines because of aggressive anti-web-scraping measures. That gave me two options - input the data manually or play the web-scraping game. Of course, I chose the latter.
After a couple of attempts at extracting the data using the usual CSS selector method...