
How to gather URL to documents from a google search

Discussion in 'Programming' started by karnavico, Sep 10, 2019.

  1. karnavico

    karnavico Newbie

    Joined:
    Sep 10, 2019
    Messages:
    1
    Likes Received:
    0
    Gender:
    Male
    Hi all,

    For the last few weeks I have been trying to develop a Python-based tool that retrieves document URLs from search engines by issuing queries like "site:http://www.example.com filetype:pdf", then parsing the HTML and extracting the links. At some point I start getting a "429: Too Many Requests" error.
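    A minimal stdlib sketch of those two steps (building the query URL, then pulling .pdf links out of the returned HTML). The `num` parameter and the flat `<a href>` structure are assumptions for illustration; real result pages wrap links in redirect URLs, and without delays or proxies you will still hit the 429 limit:

    ```python
    import urllib.parse
    from html.parser import HTMLParser

    def build_query_url(site, filetype):
        """Build a search URL for a site:/filetype: query.
        The 'num' parameter is an assumption about the results-per-page knob."""
        query = f"site:{site} filetype:{filetype}"
        return "https://www.google.com/search?" + urllib.parse.urlencode(
            {"q": query, "num": 100})

    class PdfLinkParser(HTMLParser):
        """Collect hrefs ending in .pdf from a page.
        Simplified: real result pages nest links inside redirect URLs."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href", "")
                if href.lower().endswith(".pdf"):
                    self.links.append(href)

    def extract_pdf_links(html):
        parser = PdfLinkParser()
        parser.feed(html)
        return parser.links
    ```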

    When I use https://www.elevenpaths.com/labstools/foca/index.html to retrieve the same documents, it can get thousands of document URLs in a very short time. This tool has been mentioned on the forum in this thread: https://www.blackhatworld.com/seo/tool-for-finding-peoples-email-adress.765012/#post-7954049.

    I have a few questions you may be able to help me with:
    • Do you know of any command-line tool to automate custom search queries?
    • Is it possible to automate that search using Python and get the URLs as quickly as FOCA does?
    Thanks
     
  2. Akermi

    Akermi Registered Member

    Joined:
    Dec 23, 2017
    Messages:
    64
    Likes Received:
    13
    Gender:
    Male
    You must associate each request with a unique proxy to bypass Google's anti-bot detection.
    For example, if you have 5k proxies, rotate over them and pick a random proxy for each new request.
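    A sketch of that rotation using Python's stdlib `urllib`; the `PROXIES` addresses below are placeholders, and in practice you would load your pool from a file:

    ```python
    import random
    import urllib.request

    PROXIES = [  # placeholder pool; load your real proxy list from a file
        "203.0.113.10:8080",
        "203.0.113.11:8080",
        "203.0.113.12:8080",
    ]

    def random_proxy_map(proxies):
        """Pick one proxy at random and map both schemes to it."""
        proxy = random.choice(proxies)
        return {"http": f"http://{proxy}", "https": f"http://{proxy}"}

    def opener_with_random_proxy(proxies):
        """Build a urllib opener that routes the next request through a
        randomly chosen proxy, so consecutive requests come from different IPs."""
        handler = urllib.request.ProxyHandler(random_proxy_map(proxies))
        return urllib.request.build_opener(handler)

    # usage (not executed here):
    # opener_with_random_proxy(PROXIES).open(url, timeout=10)
    ```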
     
  3. grammakov

    grammakov Jr. VIP Jr. VIP

    Joined:
    Feb 26, 2018
    Messages:
    149
    Likes Received:
    72
    Not only that, eventually you'll have to start solving their captchas.