LWP module - parse one line at a time (only download part of a page)

Alf McLaughlin - 20 Jan 2006 18:50 GMT
Hello-
  My apologies if this is an old topic, but I did a lot of searching
first and couldn't quite find the best answer.  Here is my problem
(very briefly):

 I want to download a fairly large amount of data from a webpage
(~10MB), but the stuff I'm really interested in is always toward the
top of the page (however, I don't know exactly where).  Since I'm only
interested in two or three lines, I don't want to download the whole
page.  I would like to download until I see what I want (such as my
$current_line =~ /WHAT I WANT/) and then kill the download.

 The problem isn't that 10MB is such a big deal, but I have to call
different webpages for about 5000 of these things.  Any advice would be
greatly appreciated.

Thanks,
Alf
nobull@mail.com - 20 Jan 2006 20:56 GMT
>   I want to download a fairly large amount of data from a webpage
> (~10MB), but the stuff I'm really interested in is always toward the
> top of the page (however, I don't know exactly where).  Since I'm only
> interested in two or three lines, I don't want to download the whole
> page.  I would like to download until I see what I want (such as my
> $current_line =~ /WHAT I WANT/) and then kill the download.

Read the description of the get() method of LWP::UserAgent.

In particular note the existence of the callback and the bit where it
says "The callback can abort the request by invoking die()."
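
For illustration, here is a minimal sketch of the callback approach. It
assumes get() is called with the ':content_cb' and ':read_size_hint'
options described in the LWP::UserAgent documentation. The function name
fetch_until_match and the 4096-byte hint are made-up choices for the
example, not anything prescribed by LWP.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Fetch $url, but stop reading as soon as $pattern matches the data
# received so far. Returns the matched text, or undef if it never appears.
# (fetch_until_match is a hypothetical helper name.)
sub fetch_until_match {
    my ($url, $pattern) = @_;
    my $ua     = LWP::UserAgent->new;
    my $buffer = '';
    my $found;

    $ua->get(
        $url,
        ':read_size_hint' => 4096,    # preferred chunk size; only a hint
        ':content_cb'     => sub {
            my ($chunk, $response, $protocol) = @_;
            $buffer .= $chunk;        # keep everything seen so far
            if ($buffer =~ /($pattern)/) {
                $found = $1;
                die "found\n";        # die() aborts the rest of the download
            }
        },
    );
    return $found;
}
```

Called as fetch_until_match($url, qr/WHAT I WANT/), this keeps everything
read so far in memory, which should be acceptable when the match is near
the top of the page.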
xhoster@gmail.com - 21 Jan 2006 00:16 GMT
> >   I want to download a fairly large amount of data from a webpage
> > (~10MB), but the stuff I'm really interested in is always toward the
[quoted text clipped - 4 lines]
>
> Read the description of the get() method of LWP::UserAgent.

I think you mean request() rather than get().

> In particular note the existence of the callback and the bit where it
> says "The callback can abort the request by invoking die()."

This method is the direct answer to the OP's question, but he will have to
be careful to account for the chance that his desired string will span a
chunk boundary.
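
One possible way to deal with the chunk boundary, sketched in plain Perl
with no LWP involved: hold back whatever follows the last newline and only
test complete lines. The name make_line_scanner and the sample chunks are
made up for the example.

```perl
use strict;
use warnings;

# Build a chunk handler that tests $pattern against whole lines only.
# Text after the last newline is held back until the next chunk arrives.
# (make_line_scanner is a hypothetical helper name.)
sub make_line_scanner {
    my ($pattern) = @_;
    my $pending = '';
    return sub {
        my ($chunk) = @_;
        $pending .= $chunk;
        # Consume every complete line currently in the buffer.
        while ($pending =~ s/\A([^\n]*)\n//) {
            my $line = $1;
            return $line if $line =~ $pattern;
        }
        return;    # no match yet; $pending keeps the unfinished line
    };
}

# The wanted string is split across two simulated chunks.
my $scan   = make_line_scanner(qr/WHAT I WANT/);
my @chunks = ("first line\nsome WHAT I", " WANT here\nrest of the page\n");
for my $chunk (@chunks) {
    my $hit = $scan->($chunk);
    if (defined $hit) {
        print "matched: $hit\n";    # prints "matched: some WHAT I WANT here"
        last;
    }
}
```

Inside a real callback, one would call the scanner on each chunk and die()
as soon as it returns a defined value. Note that a final line with no
trailing newline is never tested by this sketch.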

I think a simpler but less rigorous option would be to set the
$ua->max_size to his best guess of an upper limit on how far into the
response the desired string can be.  But there is always the danger that
the upper limit turns out to be set too low, and you miss things that the
callback method would find. Of course, there is the corresponding hazard
that the guess will be set too high, and he will still be reading far more
data than necessary.
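
A rough sketch of the max_size variant, assuming the documented behaviour
that LWP adds a "Client-Aborted" header when it cuts the body short. The
function name fetch_head_of_page and the byte limit passed to it are
hypothetical.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Read at most roughly $limit bytes of $url.
# Returns (content, truncated_flag), or an empty list on failure.
# (fetch_head_of_page is a hypothetical helper name.)
sub fetch_head_of_page {
    my ($url, $limit) = @_;
    my $ua = LWP::UserAgent->new;
    $ua->max_size($limit);    # LWP stops reading once this is exceeded

    my $response = $ua->get($url);
    return unless $response->is_success;

    # LWP sets this header when the body was cut short by max_size.
    my $truncated = defined $response->header('Client-Aborted') ? 1 : 0;
    return ($response->content, $truncated);
}
```

The content can come out somewhat longer than the limit, because the read
is only abandoned after the chunk that pushes it over. If the flag is set
and the string was not found, the guess was too low for that page.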

Xho


Paul Lalli - 20 Jan 2006 21:05 GMT
>   I want to download a fairly large amount of data from a webpage
> (~10MB), but the stuff I'm really interested in is always toward the
> top of the page (however, I don't know exactly where).  Since I'm only
> interested in two or three lines, I don't want to download the whole
> page.  I would like to download until I see what I want (such as my
> $current_line =~ /WHAT I WANT/) and then kill the download.

I've never done this, but I wonder if these two references might point
you in the right direction:

http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
http://search.cpan.org/~gaas/libwww-perl-5.805/lib/LWP/UserAgent.pm#REQUEST_METHODS
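
The first link is the section on the HTTP Range header. An untested-in-anger
sketch of how it might be combined with get(), assuming the server honours
byte ranges (many do not for dynamic pages). The function name
fetch_first_bytes is made up for the example.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Ask the server for only the first $n bytes of $url via a Range header.
# Returns (content, is_partial), or an empty list on failure.
# (fetch_first_bytes is a hypothetical helper name.)
sub fetch_first_bytes {
    my ($url, $n) = @_;
    my $ua   = LWP::UserAgent->new;
    my $last = $n - 1;    # byte ranges are zero-based and inclusive

    my $response = $ua->get($url, Range => "bytes=0-$last");
    return unless $response->is_success;

    # 206 Partial Content means the Range was honoured; a plain 200 means
    # the server ignored it and sent the whole page.
    my $is_partial = $response->code == 206 ? 1 : 0;
    return ($response->content, $is_partial);
}
```

If the flag comes back 0, the server sent everything anyway, and one of
the client-side approaches is needed to cut the transfer short.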

Paul Lalli
 