Wednesday, September 12, 2012

Build a Web spider on Linux

Biological motivation
When you think of a spider in nature, you think of it interacting with its environment, not in isolation. The spider sees and feels its way around, moving from one place to another in a meaningful way. Web spiders operate in a similar way. A Web spider is a program written in a high-level language. It interacts with its environment through networking protocols, such as the Hypertext Transfer Protocol (HTTP) for the Web. If your spider wants to communicate with you, it can use the Simple Mail Transfer Protocol (SMTP) to send an e-mail message.
Spiders aren't limited to HTTP or SMTP, though. Some spiders use Web services, such as SOAP or the Extensible Markup Language Remote Procedure Call (XML-RPC) protocol. Other spiders scour newsgroups with the Network News Transfer Protocol (NNTP) or look for interesting news items in Really Simple Syndication (RSS) feeds. While most spiders in nature can see only light-dark intensity and movement changes, Web spiders can see and feel using many types of protocols.
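To make the HTTP case concrete, here is a minimal sketch in Python of the two steps a simple spider repeats: fetch a page over HTTP, then pull out the links it could visit next. This is not code from the linked article; the names `LinkExtractor`, `extract_links`, and `fetch` are hypothetical, and it uses only the Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href (or with an empty one).
            if name == "href" and value:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def fetch(url, timeout=10):
    """Download a page over HTTP and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `extract_links(fetch(url), url)` on a starting URL would give the next set of pages to visit. A real spider would also need to track visited URLs, honor robots.txt, and limit its request rate, none of which this sketch attempts.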
http://www.ibm.com/developerworks/library/l-spider/index.html
