
Help Extract Data

Discussion in 'Black Hat SEO' started by CoyoteAssassin, Dec 7, 2012.

  1. CoyoteAssassin

    CoyoteAssassin Elite Member

    Joined:
    Jan 3, 2010
    Messages:
    1,862
    Likes Received:
    3,906
    Occupation:
    Full Time IMer
    Location:
    USA
    UPDATE: See code sample in post #6 below.

    I need a tool that will let me download the HTML or text from a list of web pages. The pages are in this format: "nameDetails.aspx?sid=1991658"

    I will generate a list of URLs by modifying the number at the end.

    I can do this one by one in Firefox, IE, or any other browser. The problem is that I first have to log in to my account for the pages to work, since they are HTTPS.

    Normal scrapers will not work, since they need to log in first.

    So, I need a scraper with a web interface that will let me log in and then go to each page and copy either the text or the HTML.

    I've tried Website Ripper Copier and HTTrack. I have not tried to build something myself, but I am open to it as long as I do not have to provide credentials.

    Any suggestions?

    I wish that I could hire someone to make a bot to do this but due to the sensitive data provided, I do not have that option.

    Will Ubot do this? I'd be willing to learn if so since there are more than 2M records (w/email).

    Thanks.
     
    Last edited: Dec 7, 2012
  2. handmadebots

    handmadebots Senior Member

    Joined:
    Nov 8, 2012
    Messages:
    905
    Likes Received:
    204
    Home Page:
    I can help you with this one, not uBot, but real programming. Sending you a PM.
     
  3. moonlighsunligh

    moonlighsunligh Jr. VIP Premium Member

    Joined:
    May 1, 2010
    Messages:
    1,623
    Likes Received:
    218
    Use the free demo of ZennoPoster to do this. But if you have millions of pages, custom programming (httprequest) is probably much faster, unless you buy the pro version of ZennoPoster, which allows unlimited threads. The demo allows only one, so you would probably need 30-60 seconds per scrape.
     
    Last edited: Dec 7, 2012
  4. CoyoteAssassin

    CoyoteAssassin Elite Member

    Clean your PM box. It's full.
     
  5. AdGate

    AdGate Jr. VIP Premium Member

    Joined:
    Feb 23, 2011
    Messages:
    179
    Likes Received:
    37
    Have you tried making a custom one in iMacros? It's pretty easy to use and is very flexible.
     
  6. CoyoteAssassin

    CoyoteAssassin Elite Member

    No, I haven't tried to build anything myself using ZennoPoster or iMacros. I usually hire others to do it and I stick with what I know.

    I hate to download a program and learn it, only to find out that it doesn't work as planned.

    I'd prefer to pay someone. This may help...


    Here's how it could work.

    Always have two tabs open. Tab 1) The window I used to log in. Tab 2) The page being scraped. (Note: I do not need to see the page that is being scraped.)

    I should be able to load a list of links (HTTPS) and have the bot go to each page and download the data. I'd prefer that it be saved in a CSV based on column names, but if it all has to be saved as text, that is fine. I'll figure something out... but again, a CSV file would be best.

    Each page is formatted with the same code as below. Notice that it already has the field names. Those can stay in place and I'll do Find & Replace later. I'll keep them for now so that I can make sure they sort to the right column.


    Thanks.

    -CA

    <table width="100%" border="0" cellpadding="0" cellspacing="0" class="grid">
    <tr>
    <th colspan="4">student information</th>
    </tr>
    <tr class="gridnormalrow">
    <td width="15%" align="left">Name:</td>
    <td width="35%" align="left" class="clred">Enver Ersen</td>
    <td width="20%" align="left">Phone:</td>
    <td width="30%" align="left" class="clred">203xx84477</td>
    </tr>
    <tr class="gridalternaterow">
    <td>Address:</td>
    <td class="clred"> 123 west avenue </td>
    <td>Fax:</td>
    <td class="clred"></td>
    </tr>
    <tr class="gridnormalrow">
    <td align="left">City:</td>
    <td align="left" class="clred">norwalk</td>
    <td align="left">License:</td>
    <td align="left" class="clred"></td>
    </tr>
    <tr class="gridalternaterow">
    <td>State:</td>
    <td class="clred">VA</td>
    <td>License Expiration Date:</td>
    <td class="clred"></td>
    </tr>
    <tr class="gridnormalrow">
    <td align="left">Zip Code:</td>
    <td align="left" class="clred">40901</td>
    <td align="left">SSN:</td>
    <td align="left" class="clred"><span id="ctl00_cntMain_rptStudentDetailView_ctl00_lblSSN"></span></td>
    </tr>
    <div id="ctl00_cntMain_rptStudentDetailView_ctl00_pnlCompany">

    <tr class="gridalternaterow">
    <td align="left">Co. Name:</td>
    <td align="left" class="clred">Taylor Farm</td>
    <td align="left">Co. Address:</td>
    <td align="left" class="clred"><br />  </td>
    </tr>

    </div>
    <tr class="gridnormalrow">
    <td>Email:</td>
    <td class="clred"><a href='mailto:enver@gmail .com' class="alblk">enver@gmail .com</a></td>
    <td>User Name:</td>
    <td class="clred"><span id="ctl00_cntMain_rptStudentDetailView_ctl00_lblUserName">enver@gmail .com</span></td>
    </tr>
    <tr class="gridalternaterow">
    <td align="left">DOB:</td>
    <td align="left" class="clred"><span id="ctl00_cntMain_rptStudentDetailView_ctl00_lblDOB">xx/xx/xxxx</span></td>
    <td align="left">Password:</td>
    <td align="left" class="clred">12345</td>
    </tr>
    <tr class="gridnormalrow">
    <td>Gender:</td>
    <td class="clred"><span id="ctl00_cntMain_rptStudentDetailView_ctl00_lblGender">N/A</span></td>
    <td>Referred From:</td>
    <td class="clred"></td>
    </tr>
    <div id="ctl00_cntMain_rptStudentDetailView_ctl00_pnlLocation">

    <tr class="gridalternaterow">
    <td>Location:</td>
    <td class="clred">0</td>
    <td>Department:</td>
    <td class="clred">283</td>
    </tr>

    </div>
    </table><br />
     
  7. SEOWhizz

    SEOWhizz Power Member

    Joined:
    Oct 22, 2011
    Messages:
    606
    Likes Received:
    432
    Location:
    Lat: 38N 43' 11.298" Long: 27W 12' 7.733"
    Here are two suggestions:
    Code:
    http://www.webextract.net/default.aspx
    http://www.visualwebripper.com/
    They both allow you to log in to websites and extract data.
    See:
    Code:
    http://www.webextract.net/TutorialVideo.aspx
    > Easy Web Extract Demo 5 - Submit extracted data via HTTP Post
    http://www.visualwebripper.com/Demonstrations/IntroductionVideo.aspx
    > Extracting data from Facebook - PART 1
     
    Last edited: Dec 7, 2012
  8. CoyoteAssassin

    CoyoteAssassin Elite Member

    Thanks.

    I'm giving the first one a try.