Ripping a Website in One Line
I posted this as an answer to a question on "site-ripping" on another site, but I figured I'd mirror it here since I wrote it.
Question: "I need software that will rip a site via HTTP. It needs to download the images, HTML, CSS, and JavaScript as well as organize it in a file system."
Here's my answer: Use GNU wget.
wget -erobots=off --no-parent --wait=3 --limit-rate=20K -r -p \
-U "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" \
-A htm,html,css,js,json,gif,jpeg,jpg,bmp http://example.com
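This runs in the console. It crawls the site recursively (-r), fetches the images and stylesheets each page needs (-p), waits 3 seconds between requests, and caps the download rate at 20 KB/s so it doesn't kill the site. It also ignores robots.txt (-erobots=off), stays below the starting directory (--no-parent), and sends a browser User-Agent string (-U) so the site doesn't cut you off with an anti-leech mechanism.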
Note the -A parameter: it takes a comma-separated list of the file suffixes you want to download. The list above has no png, so add it if the site uses PNG images.
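For reference, here is a rough sketch of the same command with every flag spelled out in its long form, which makes it easier to see what each piece does. The rip_site function and the DRY_RUN variable are names I made up for this sketch, not part of wget. By default it only prints the command it would run.

```shell
#!/bin/sh
# Sketch only: rip_site and DRY_RUN are hypothetical names, not wget features.

rip_site() {
    url=$1
    # Build the argument list in "$@" so the User-Agent string stays one argument.
    set -- \
        --execute robots=off \
        --no-parent \
        --wait=3 \
        --limit-rate=20K \
        --recursive \
        --page-requisites \
        --user-agent="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" \
        --accept=htm,html,css,js,json,gif,jpeg,jpg,bmp \
        "$url"

    if [ "${DRY_RUN:-1}" = "0" ]; then
        # Actually start the crawl.
        wget "$@"
    else
        # Default: only show what would be run.
        echo "wget $*"
    fi
}

rip_site http://example.com
```

Set DRY_RUN=0 in the environment to start the crawl for real instead of printing the command.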
You can also use another flag,
-D domain1.com,domain2.com
to indicate a series of domains you want to download from, if the site uses another server or whatever for hosting different kinds of files. -D by itself does not turn on host spanning, so pair it with -H. There's no safe way to automate that for all cases, so you just have to try it and keep an eye on it.
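As a rough sketch of the multi-domain case, the snippet below assembles a command that combines -H with -D. The domain names are the placeholders from above, and I've assumed the starting host has to be in the -D list too so that its own links keep being followed. It prints the command for review instead of running it.

```shell
#!/bin/sh
# Placeholder domains; replace them with the hosts the site actually uses.
# The starting host (example.com) is listed as well so its own links are still followed.
domains="example.com,domain1.com,domain2.com"

# -H lets the recursive crawl leave the starting host;
# -D then restricts it to the listed domains.
cmd="wget -r -p --no-parent --wait=3 --limit-rate=20K -H -D $domains http://example.com"

# Print the command so it can be checked before pasting it into the console.
echo "$cmd"
```

Watch the first few minutes of output when you run it: if a listed domain turns out to host far more than you expected, stop the crawl and trim the list.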
Wget is commonly preinstalled on Linux, and it compiles easily on most other Unix systems. For Windows, a prebuilt binary is available: GnuWin32 Wget