[Python-au] Webscraping

Sergio Oliveira seocam at seocam.com
Wed Apr 27 23:03:04 UTC 2011


Scrapy is not bad: http://scrapy.org/

On Wed, Apr 27, 2011 at 10:09 AM, Tennessee Leeuwenburg <
tleeuwenburg at gmail.com> wrote:

> I actually found I had some pages that lxml couldn't handle, so I ran the
> html2txt Linux utility, which gave me the text I needed in ASCII. It wasn't
> pretty, but it worked for what I was doing at the time. In this one
> particular case, I was able to pick out the content I wanted more easily
> with a regexp.
>
> I did something like
>
> try:
>    lxml parse the doc
>    find my info
> except parse error:
>    convert to ascii
>    regexp find my info
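
For anyone reading this in the archive, the pseudocode above could be
sketched roughly as follows. This is only an illustration under stated
assumptions: the standard library's xml.etree.ElementTree stands in for lxml
so the sketch runs without extra installs, a plain tag-stripping regexp
stands in for the external text-conversion utility, and the names find_price,
PRICE_RE and TAG_RE, as well as the "price" markup, are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern for the value we are after; adjust to the real page.
PRICE_RE = re.compile(r"Price:\s*\$?(\d+(?:\.\d+)?)")
# Crude tag stripper, standing in for an external HTML-to-text tool.
TAG_RE = re.compile(r"<[^>]+>")


def find_price(markup):
    """Return the price as a string, or None if it cannot be found."""
    try:
        # First choice: parse the document and query the tree.
        root = ET.fromstring(markup)
        node = root.find(".//span[@class='price']")
        if node is not None and node.text:
            return node.text.strip()
        return None
    except ET.ParseError:
        # Fallback: flatten to ASCII text and search it with a regexp.
        text = TAG_RE.sub(" ", markup)
        text = text.encode("ascii", "ignore").decode("ascii")
        match = PRICE_RE.search(text)
        return match.group(1) if match else None
```

With lxml installed, the try branch would use lxml's own parser and the
except clause would catch its parse error instead; the overall shape stays
the same.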
>
> Cheers,
> -T
>
>
> On Wed, Apr 27, 2011 at 11:52 AM, Richard Jones <
> richardjones at optushome.com.au> wrote:
>
>> On Wed, Apr 27, 2011 at 11:08 AM, Ishwor Gurung <ishwor.gurung at gmail.com>
>> wrote:
>> > cURL / wget for doing RESTful stuff (POST / GET)
>>
>> If you're just doing a GET, then "python -m urllib <url>"
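
The same simple GET can also be done from inside a script. A minimal sketch,
assuming Python 3, where the relevant module is urllib.request (in Python 2 it
was urllib / urllib2); the helper name fetch and the 10-second default timeout
are my own choices, not something from this thread.

```python
import urllib.request


def fetch(url, timeout=10):
    """GET `url` and return the response body decoded as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch("http://scrapy.org/") would return that page's HTML as a string,
ready to hand to whichever parser you prefer.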
>>
>>
>>     Richard
>>
>> _______________________________________________
>> python-au maillist  -  python-au at starship.python.net
>> http://starship.python.net/mailman/listinfo/python-au
>>
>
>
>
> --
> --------------------------------------------------
> Tennessee Leeuwenburg
> http://myownhat.blogspot.com/
> "Don't believe everything you think"
>