
Web Scraping & Email Automation Project

Discussion in 'Hire a Freelancer' started by MMnemonic, Sep 15, 2014.

  1. MMnemonic

    MMnemonic Registered Member

    Jul 20, 2010
    We are a front end web services company that provides websites and applications for small/medium sized companies and web agencies in the USA.

    Our clients are all acquired through email marketing.
    We respond daily to a considerable number of job advertisements requesting very specific website solutions.

    1. Search & Selection:

    Main goals: Speed and Proxy Preservation

    To find the job proposals that match our services, we have developed a PHP web crawler that searches for job postings based on a set of keywords that best describe our skill set. The tool runs on an Amazon EC2 instance, and the results are stored in a MySQL database by date, job description, job title, email contact, offer URL, etc. We use a set of private proxies combined with a proxy rotation script so that we can crawl a higher number of pages at the best possible speed. Speed itself is crucial: the faster we answer a given ad, the better our chances of getting that project.

    What do we need:

    1.1. Proxy Authentication: Our proxy rotation script imports our proxy list from a TXT file (proxies.txt). The proxy format is IP:PORT. The script does not support proxy authentication (username and password), as it should. This is a one-line edit to the script.
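To illustrate 1.1, here is a minimal sketch of reading proxy lines with optional credentials. It is written in Python for brevity (the crawler itself is PHP). The `IP:PORT:USER:PASS` layout for authenticated entries and the names `Proxy`, `parse_proxy_line` and `load_proxies` are assumptions made for the example, not the format the existing script reads.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import quote


@dataclass(frozen=True)
class Proxy:
    host: str
    port: int
    username: Optional[str] = None
    password: Optional[str] = None

    def url(self) -> str:
        """Return the proxy as an http:// URL, with credentials if present."""
        if self.username is not None:
            user = quote(self.username, safe="")
            pwd = quote(self.password or "", safe="")
            auth = f"{user}:{pwd}@"
        else:
            auth = ""
        return f"http://{auth}{self.host}:{self.port}"


def parse_proxy_line(line: str) -> Optional[Proxy]:
    """Parse 'IP:PORT' or the assumed 'IP:PORT:USER:PASS'; skip blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split(":", 3)  # maxsplit=3 so a password may contain ':'
    if len(parts) == 2:
        return Proxy(parts[0], int(parts[1]))
    if len(parts) == 4:
        return Proxy(parts[0], int(parts[1]), parts[2], parts[3])
    raise ValueError(f"unrecognized proxy line: {line!r}")


def load_proxies(path: str = "proxies.txt") -> List[Proxy]:
    """Read the proxy list file, one proxy per line."""
    with open(path, encoding="utf-8") as fh:
        return [p for p in map(parse_proxy_line, fh) if p is not None]
```

Old `IP:PORT` lines keep working unchanged, so authenticated and unauthenticated proxies can sit in the same file.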

    1.2. Proxy Script Optimization: Our proxy script is not well optimized. We burn a lot of proxies and therefore spend a lot more on buying new ones. The rotation is based on the number of requests per proxy and the total requests per minute. We have set a lot of timeouts between requests in order to preserve the working proxies, but those timeouts slow the crawl down considerably. You'll have to improve this algorithm to find a better balance between speed and the number of proxies used.
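One possible direction for 1.2, sketched in Python under stated assumptions: rather than fixed timeouts for every request, each proxy gets its own short rest interval, and only proxies that fail are backed off exponentially. The `ProxyPool` class, its parameters and the default values are hypothetical. This is not the existing rotation algorithm.

```python
import time
from typing import Callable, Dict, List, Optional


class ProxyPool:
    """Hand out the proxy that has rested longest; back off failing ones."""

    def __init__(self, proxies: List[str], min_interval: float = 2.0,
                 max_backoff: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._min_interval = min_interval  # rest between uses of one proxy
        self._max_backoff = max_backoff    # upper bound on the failure cooldown
        self._ready_at: Dict[str, float] = {p: 0.0 for p in proxies}
        self._failures: Dict[str, int] = {p: 0 for p in proxies}

    def acquire(self) -> Optional[str]:
        """Return a proxy that is ready now, or None if all are cooling down."""
        now = self._clock()
        ready = [p for p, t in self._ready_at.items() if t <= now]
        if not ready:
            return None
        proxy = min(ready, key=lambda p: self._ready_at[p])
        self._ready_at[proxy] = now + self._min_interval
        return proxy

    def report(self, proxy: str, ok: bool) -> None:
        """Reset the failure count on success; double the cooldown on failure."""
        if ok:
            self._failures[proxy] = 0
            return
        self._failures[proxy] += 1
        delay = min(self._min_interval * 2 ** self._failures[proxy],
                    self._max_backoff)
        self._ready_at[proxy] = self._clock() + delay

    def seconds_until_ready(self) -> float:
        """How long the caller should sleep before the next acquire()."""
        return max(0.0, min(self._ready_at.values()) - self._clock())
```

The caller sleeps for `seconds_until_ready()` only when `acquire()` returns `None`, so healthy proxies are never held back by the cooldown of a failing one.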

    1.3. Crawler Speed Optimization: The crawler is well organized and the code itself is clean, simple and effective, BUT we need it to make fewer requests. You'll need to read the code carefully and suggest improvements that increase its speed and accuracy and reduce the likelihood of errors and burned proxies.

    1.4. Database: Add a "Type" column to our MySQL database. Each result (each collected job posting) will now have a "Type" value (e.g. WordPress Developer / Graphic Designer / Ruby Developer) that identifies the type of worker being requested (if not already present in the job title). That value will be assigned by analysing the technologies requested in the job posting description. For instance, if the ad requests "HTML5", "CSS3" and "Twitter Bootstrap", the "Type" value for this result will be "Front End Developer". If instead the ad requests "Android" and "iOS" experience, the "Type" value should be "Mobile Developer". We need to be able to group job postings by type.
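The rule described in 1.4 could be sketched as a keyword table plus a scoring function. The example below is in Python and uses only the two keyword groups named above. The names `TYPE_KEYWORDS` and `classify`, and the choice to pick the type with the most keyword hits, are assumptions for illustration.

```python
import re
from typing import Dict, List, Optional

# Keyword groups taken from the examples in 1.4; a real table would be longer.
TYPE_KEYWORDS: Dict[str, List[str]] = {
    "Front End Developer": ["HTML5", "CSS3", "Twitter Bootstrap"],
    "Mobile Developer": ["Android", "iOS"],
}


def classify(description: str,
             rules: Dict[str, List[str]] = TYPE_KEYWORDS) -> Optional[str]:
    """Return the type whose keywords match most often, or None if none match."""
    best_type: Optional[str] = None
    best_hits = 0
    for job_type, keywords in rules.items():
        hits = sum(
            1 for kw in keywords
            # Lookarounds keep "iOS" from matching inside words like "curiosity".
            if re.search(rf"(?<!\w){re.escape(kw)}(?!\w)", description,
                         re.IGNORECASE)
        )
        if hits > best_hits:
            best_type, best_hits = job_type, hits
    return best_type
```

A posting that matches no keywords returns `None`, which leaves the "Type" column empty rather than guessing.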

    1.5. We want the crawler running constantly, 24 hours per day. Instead of setting up cron jobs at specific points in the day we want the script to restart after the URL list has been completely searched. Only new results should be added to the database.
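A minimal sketch of the loop in 1.5, in Python: the crawl restarts as soon as a pass over the URL list finishes, and a uniqueness constraint on the offer URL keeps duplicates out. SQLite stands in for MySQL here so the example runs on its own (`INSERT IGNORE` is the MySQL counterpart of `INSERT OR IGNORE`). The table layout, the `crawl_once` callback and the `max_passes` limit are assumptions for illustration.

```python
import sqlite3
import time
from typing import Callable, Iterable, Optional, Tuple

Job = Tuple[str, str]  # (offer_url, job_title); the real table has more columns


def init_db(conn: sqlite3.Connection) -> None:
    """Create a stand-in jobs table keyed on the offer URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " offer_url TEXT PRIMARY KEY,"
        " job_title TEXT NOT NULL)"
    )


def count_jobs(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]


def save_new(conn: sqlite3.Connection, jobs: Iterable[Job]) -> int:
    """Insert only unseen offer URLs; return how many rows were added."""
    before = count_jobs(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (offer_url, job_title) VALUES (?, ?)", jobs
    )
    conn.commit()
    return count_jobs(conn) - before


def run(conn: sqlite3.Connection, crawl_once: Callable[[], Iterable[Job]],
        pause: float = 5.0, max_passes: Optional[int] = None) -> int:
    """Crawl, store new results, restart; loops forever if max_passes is None."""
    total = 0
    passes = 0
    while max_passes is None or passes < max_passes:
        total += save_new(conn, crawl_once())
        passes += 1
        time.sleep(pause)  # short breather between full passes
    return total
```

In production the script would be started once with `max_passes=None` and kept alive by a process supervisor, not by cron.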

    2. Email automation:

    Main goals: Automation and Personalized Response

    At the moment we reply to the job postings manually. We export the database to an Excel file, which we then download. Then we choose the email template that best suits each advertisement based on its description. We need to be able to do all this automatically: reply to every single email contact with a personalized response based on the job posting description. For instance, if the ad requests a "developer with HTML5, CSS3 experience", the reply should state that "HTML5" and "CSS3" are part of our skill set.

    What do we need:

    2.1. SMTP Auto-responder: We need to read the database regularly and check for new results. The email contacts in each new result should be contacted immediately. You can integrate third-party email providers such as Mailgun and SendGrid for this purpose.
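The polling step in 2.1 might look like the Python sketch below. It assumes the jobs table has an `id`, an `email_contact`, a `job_title` and a nullable `contacted_at` column (all hypothetical names), and it takes the actual sending as a `send` callback, so SMTP, Mailgun or SendGrid can be plugged in without changing the loop. SQLite is used only so the example runs on its own.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Callable


def contact_new_results(conn: sqlite3.Connection,
                        send: Callable[[str, str], None]) -> int:
    """Email every row not yet contacted and mark it; return the number sent."""
    rows = conn.execute(
        "SELECT id, email_contact, job_title FROM jobs"
        " WHERE contacted_at IS NULL ORDER BY id"
    ).fetchall()
    sent = 0
    for job_id, email, title in rows:
        try:
            send(email, title)  # provider-specific delivery lives in `send`
        except Exception as exc:
            # Leave the row unmarked so the next poll retries it.
            print(f"send failed for job {job_id}: {exc}")
            continue
        conn.execute(
            "UPDATE jobs SET contacted_at = ? WHERE id = ?",
            (datetime.now(timezone.utc).isoformat(), job_id),
        )
        conn.commit()  # commit per row so a crash never re-sends earlier mail
        sent += 1
    return sent
```

Calling this function every few seconds, or right after each crawler pass, keeps the delay between finding an ad and answering it short.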

    2.2. We want a personalized response based on the job description. The response should be an HTML email with a given set of variables. Those variables will be filled with the technologies, ad location, compensation, relevant portfolio links and a few other elements drawn from the job posting description.
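As an illustration of 2.2, the Python sketch below fills an HTML template with variables and wraps it in an email message using only the standard library. The template wording, the variable names (`$technologies`, `$location`, `$portfolio_url`) and the `build_reply` function are hypothetical. Compensation and other fields would be added the same way.

```python
import html
from email.message import EmailMessage
from string import Template
from typing import List

# Hypothetical template; the real wording would come from the existing replies.
REPLY_TEMPLATE = Template(
    "<p>Hello,</p>"
    "<p>We saw your posting in $location. "
    "$technologies are part of our skill set.</p>"
    '<p>Relevant work: <a href="$portfolio_url">$portfolio_url</a></p>'
)


def join_names(names: List[str]) -> str:
    """['HTML5', 'CSS3'] -> 'HTML5 and CSS3'."""
    if len(names) <= 1:
        return "".join(names)
    return ", ".join(names[:-1]) + " and " + names[-1]


def build_reply(to_addr: str, sender: str, subject: str,
                technologies: List[str], location: str,
                portfolio_url: str) -> EmailMessage:
    """Build a multipart message with a plain-text fallback and an HTML body."""
    body = REPLY_TEMPLATE.substitute(
        # Escape scraped values so ad text cannot inject markup into the email.
        technologies=html.escape(join_names(technologies)),
        location=html.escape(location),
        portfolio_url=html.escape(portfolio_url, quote=True),
    )
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["From"] = sender
    msg["Subject"] = subject
    msg.set_content("This reply is best viewed as HTML.")
    msg.add_alternative(body, subtype="html")
    return msg
```

The resulting message can be handed to `smtplib.SMTP.send_message`, or its HTML part can be passed to a provider's API.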
  2. handmadebots

    handmadebots Senior Member

    Nov 8, 2012
    I can help you out :) Sending you a PM
    Finally a well organized and detailed thread!
    Last edited: Sep 15, 2014
  3. pavan

    pavan Elite Member

    Mar 30, 2008
    I can do this and make the script multi-threaded as well.
    Add me on Skype.
  4. PrateekSinghania

    PrateekSinghania Junior Member

    Aug 21, 2014
    I cannot PM you as I am a newbie. Please contact us on Codemongoosesupport. We have a solution for your query.