
How do I extract this from web pages?

Discussion in 'Black Hat SEO' started by Zeprokon, Apr 28, 2008.

  1. Zeprokon

    Zeprokon Newbie

    Hello folks,

    I have a list of web pages that I pull from
    a database I have access to, and I'm looking
    for a way to extract one piece of markup from
    each page, since I only want the phone numbers
    from these pages.

    Here is a sample of the code

    Code:

    HTML:
    <td class="lbl">Phone:</td>
    <td>000-000-0000</td>
    The phone numbers may be separated by dashes, dots, or
    spaces, but this will give you the general
    idea. Is there a quick way to do this for all
    of the ASP pages? I can get thousands of them.

    Please let me know - thank you & rep will
    be happily given.

    Zeprokon
     
  2. Stumickel

    Stumickel Junior Member

    I would use a simple find-and-replace function, which any number of programs offer.

    Find: <td>
    Replace with: [do not fill in anything]

    Do that with each element you want to strip out (the closing </td>, the "Phone:" label cell, and so on).
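
    If you would rather script it than click through a dialog thousands of times, the same find-and-replace idea can be sketched in Python. This is only a rough sketch: the SAMPLE string copies the markup from the first post, and the REPLACEMENTS list and the strip_elements name are made up for illustration, not taken from any particular program.

```python
# Sample markup copied from the first post; real pages will contain more.
SAMPLE = '<td class="lbl">Phone:</td>\n<td>000-000-0000</td>'

# Each tuple is one "Find" / "Replace with" pair. Every replacement is empty,
# which is the scripted version of leaving the "Replace with" box blank.
REPLACEMENTS = [
    ('<td class="lbl">', ""),
    ("<td>", ""),
    ("</td>", ""),
    ("Phone:", ""),
]


def strip_elements(text, replacements=REPLACEMENTS):
    """Apply each find/replace pair in order, then drop the empty lines."""
    for find, replace_with in replacements:
        text = text.replace(find, replace_with)
    return [line.strip() for line in text.splitlines() if line.strip()]


print(strip_elements(SAMPLE))  # prints ['000-000-0000']
```

    Plain replacement like this only works if you list every element that appears on the page, so it suits pages that contain little besides the phone table.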

    Depending on the program, you can also use wildcards; check which ones it recognizes. There are codes for removing whole lines that match a wildcard or contain a specific word, but I don't know them yet.
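
    In a scripting language the wildcard role is usually played by regular expressions. Here is a hedged Python sketch that pulls the number out of the cell following the "Phone:" label and that can also drop whole lines containing a given word. It assumes the numbers are 3-3-4 digits separated by a dash, dot, or space, as in the sample, and that the pages have been saved to a folder as .html files. The function names and the folder layout are hypothetical.

```python
import re
from pathlib import Path

# Matches the label cell followed by the value cell, as in the sample markup.
# [-. ] accepts a dash, dot, or space between the digit groups.
PHONE_CELL = re.compile(
    r'<td class="lbl">\s*Phone:\s*</td>\s*'
    r'<td>\s*(\d{3}[-. ]\d{3}[-. ]\d{4})\s*</td>',
    re.IGNORECASE,
)


def extract_phones(html):
    """Return every phone number found after a "Phone:" label cell."""
    return PHONE_CELL.findall(html)


def drop_lines_containing(text, word):
    """Remove every line that contains the given word."""
    return "\n".join(line for line in text.splitlines() if word not in line)


def extract_from_folder(folder, pattern="*.html"):
    """Map each saved page's file name to the phone numbers found in it."""
    results = {}
    for path in sorted(Path(folder).glob(pattern)):
        html = path.read_text(encoding="utf-8", errors="replace")
        results[path.name] = extract_phones(html)
    return results
```

    If the real pages use other number formats (country codes, parentheses around the area code), the digit pattern would need to be loosened to match.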

    Also, in a program like Word, you can use "^p", "^l", and "^t" (all lower case) for paragraph breaks, line breaks, and tabs respectively.

    I usually like to do the following routine in Word to clean up data I have mined with copy/paste.

    Find: ^l
    Replace with: ^p

    Then:

    Find: ^p^p
    Replace with: ^p

    (repeat this until there are zero results or one result)

    Then:

    Find: ^t
    Replace with: [put in one blank space]

    Then:

    Find: [put in two blank spaces]
    Replace with: [put in one blank space]

    (repeat this until there are zero results or one result)

    There are variations on all this you can do depending on your document and formatting needs.
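
    For anyone who prefers a script, the Word routine above can be approximated in Python. This is a sketch under a few assumptions: the pasted text is plain text, Word's "^l" line break shows up as a vertical tab or a carriage return, and "^p" corresponds to a newline. The clean_pasted_text name is made up for this example.

```python
def clean_pasted_text(text):
    """Mimic the Word find/replace cleanup routine on a plain-text string."""
    # Find ^l, replace with ^p: turn every kind of line break into "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\v", "\n")

    # Find ^p^p, replace with ^p, repeated until no doubles are left.
    while "\n\n" in text:
        text = text.replace("\n\n", "\n")

    # Find ^t, replace with one blank space.
    text = text.replace("\t", " ")

    # Find two blank spaces, replace with one, repeated until none are left.
    while "  " in text:
        text = text.replace("  ", " ")

    return text


print(repr(clean_pasted_text("a\vb\n\n\nc\t\td   e")))  # prints 'a\nb\nc d e'
```

    The two while loops are the scripted version of "repeat this until there are zero results".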