i need help to configure a linkscanner cfg

Discussion in 'General Programming Chat' started by spla, Jun 22, 2010.

  1. spla

    spla Junior Member

    Jan 28, 2008
    I need to configure a link scanner (LinkScan) and I don't understand it 100% >_>

    here is the original thread

    here is a sample config:
    # LinkScan.cfg !Version 11.0
    # Your comments here
    Homeurl = http://www.network.com/
    Homedir =
    Homefile =
    Mask =
    Casesensitive = 1
    Http = 1
    Noorphan = 1
    Projectdesc = Network
    Organization = My Organization
    Useragent = Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
    Scriptdisable = 0
    Maxcgi = 1000
    Maxlevels = 0
    Maxclicks = 0
    Maxdocs = 0
    Excludehidden = 0
    Noexternal = 0
    Followext = 1
    Retryext = 1
    Showredirext = 1
    Login =
    Logout = (?i).*(login|logoff|logout)
    Useloginfile = 1
    Usecookiefile = 1
    Slaves1 = 5
    Slaves2 = 5
    Import = 0
    Importfile =
    Errordoc =
    Errorbody =
    Errorbodyext =
    Defaultpages = index.html, index.shtml, index.htm, home.html, home.shtml, home.htm
    Indexoptions = 0
    Orphanfile =
    Autoowner = 1
    Defaultowner =
    Collectmeta = 0
    Noforms = 1
    Imgtags =
    Closeatag = 0
    Probe = 1
    Exclude mltQuery.html
    Exclude about:blank
    Exclude javascript:void
    Filetypes doc T
    Filetypes htm H
    Filetypes html H
    Filetypes pdf
    Filetypes shtml H
    Filetypes swf S
    Filetypes txt T
    Filetypes xml X
    Mimetypes application/msword T
    Mimetypes application/pdf
    Mimetypes application/x-javascript
    Mimetypes application/x-shockwave-flash S
    Mimetypes audio/x-pn-realaudio T
    Mimetypes text/html H
    Mimetypes text/plain T
    Mimetypes text/xml X
    Onlyfollow profilewin.html
    Onlyfollow srpage.html
    Profiler = 0
    Profilerlog = 0
    Profilermax = 300
    Substitutescript = .*openProfilewin\('(\d+)'.* '/profilewin.html?profileId=$1'
    Substituteraw = (.*)\\/(.*) $1/$2
    Substituteraw = (.*)\\/(.*) $1/$2
    Substituteraw = (.*)\\/(.*) $1/$2
    Substituteraw = (.*)\\/(.*) $1/$2
    Substituteraw = (.*)\\/(.*) $1/$2
    Substituteraw = (.*)\\/(.*) $1/$2
    Extrahome = onlinesearch.html??distance=-1&gender=13000&minAge=18&maxAge=99&country=100&zip =&photoRequired=false
    Extrahome = srpage.html?pageNumber=1 
    - Why does Substituteraw appear six times in there?
    - As far as I understand it, Extrahome sets where to look for links at the beginning of the scan.
    - Can someone explain exactly what the Substitutescript line does?
    - What would you need to crawl something like Tagged or MySpace? A MySpace user page looks like domain.com/12344365456.
    Someone help, please.
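One possible reading of the six identical Substituteraw lines, sketched in Python. This is an assumption and not something the help file confirms: if each line is applied once per raw link, then a greedy pattern like `(.*)\\/(.*)` rewrites only one escaped slash (`\/`) per pass, so six copies would clean up to six of them. The function name `unescape_slashes` and the sample link are hypothetical.

```python
import re

# Same pattern as the Substituteraw lines: a literal backslash followed by "/".
ESCAPED_SLASH = re.compile(r"(.*)\\/(.*)")

def unescape_slashes(raw_link, passes=6):
    """Apply the rule `passes` times, assuming one substitution per rule line."""
    for _ in range(passes):
        # The greedy (.*) makes each pass rewrite the LAST remaining "\/".
        raw_link = ESCAPED_SLASH.sub(
            lambda m: m.group(1) + "/" + m.group(2), raw_link, count=1
        )
    return raw_link

if __name__ == "__main__":
    sample = r"http:\/\/www.network.com\/profilewin.html"  # hypothetical raw link
    print(unescape_slashes(sample, passes=1))
    print(unescape_slashes(sample))
```

With a single pass only the last `\/` is fixed; with six passes all three in the sample are. A link with more than six escaped slashes would still keep some under this reading.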

    Quote from the help file; to me it's not much help >_>

    11.6 How to manipulate URLs on-the-fly
    One of the most powerful (and complex) customization features of LinkScan concerns the real-time manipulation of links during the course of the scan. This is typically used to control the testing of sites with complex dynamic content. The basic commands available are:
    Sessionmatch expression
    Substitute relative-path-expression expression
    Substituteraw relative-path-expression expression
    Substitutescript relative-path-expression expression
    The Sessionmatch command is used to manipulate Session numbers. The Substitute command is used to perform transformations on resolved links. The Substituteraw is used to perform transformations on unresolved links (i.e. the raw contents of a tag or tag attribute). The Substitutescript is used to perform transformations of blocks of JavaScript code.
    We shall consider a number of examples which may be adapted according to your specific needs.
    Example 1
    Consider a site that produces links such as:
    It is entirely possible that page1.asp has been designed in such a manner that it delivers the same basic content with minor variations in formatting depending upon the presence or absence of the Print query string. One might configure LinkScan with:
    Substitute (.*\.asp)\?Print $1
    Whenever LinkScan encounters a link matching the specified pattern it will make the substitution indicated before it tries to validate or follow that link. In this example, a link to:
    will immediately be transformed to:
    Note, however, this is not the same as Excluding links which contain the Print query string; that would cause LinkScan to simply ignore the link. In this case, LinkScan will process the link but transform it on-the-fly during the scan.
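The effect of the Example 1 rule can be approximated in Python. This is only a sketch: LinkScan's exact matching semantics are not described here, so the code assumes the pattern is applied to a relative link and that `$1` corresponds to Python's `\g<1>`. The function name `strip_print` and the sample links are hypothetical.

```python
import re

# Pattern taken from: Substitute (.*\.asp)\?Print $1
PRINT_RULE = re.compile(r"(.*\.asp)\?Print")

def strip_print(link):
    """Rewrite a link ending in '?Print' to the bare .asp page."""
    # Links that do not match the pattern are returned unchanged.
    return PRINT_RULE.sub(r"\g<1>", link, count=1)

if __name__ == "__main__":
    print(strip_print("/page1.asp?Print"))  # hypothetical link
    print(strip_print("/page1.asp"))
```

The link is still followed after the rewrite, which is the difference from an Exclude rule.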
    Example 2
    Next we will consider a significantly more complex scenario.
    Sessionmatch .*&token=([^&]+)
    Substitute (.*&token=)[^&]*(.*)$ $1!S$2
    In this case, we use the special Sessionmatch command to capture and save the first value of the query parameter token that LinkScan sees. This is most likely some kind of session number assigned by the target server immediately following the submission of a login form. The Substitute command then instructs LinkScan to replace all subsequent values of token with the saved value (represented by the special parameter !S).
    In this scenario, LinkScan ensures that the value of token can never change during the course of the scan from the originally assigned value.
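A rough Python model of the Sessionmatch and Substitute pair from Example 2. It assumes the first captured token is stored once and then written over every later token value, which is how the text describes `!S`. The class name `SessionPinner` and the sample links are hypothetical.

```python
import re

SESSION_MATCH = re.compile(r".*&token=([^&]+)")   # Sessionmatch expression
SUBSTITUTE = re.compile(r"(.*&token=)[^&]*(.*)$")  # Substitute expression

class SessionPinner:
    def __init__(self):
        self.saved = None  # plays the role of the special !S value

    def rewrite(self, link):
        # Capture the first token value seen, and only the first.
        if self.saved is None:
            found = SESSION_MATCH.match(link)
            if found:
                self.saved = found.group(1)
        if self.saved is None:
            return link
        # Replace whatever token value the link carries with the saved one.
        return SUBSTITUTE.sub(
            lambda m: m.group(1) + self.saved + m.group(2), link, count=1
        )

if __name__ == "__main__":
    pinner = SessionPinner()
    print(pinner.rewrite("/view.jsp?id=1&token=abc123&x=1"))
    print(pinner.rewrite("/view.jsp?id=2&token=zzz999&x=2"))
```

The second link comes back carrying `token=abc123`, so the session value stays fixed for the whole run.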
    Example 3
    Next we'll consider a JSP site that produces URLs with the following structure:
    It may not be productive or efficient for LinkScan to scan all of the pages using every combination and permutation of values for the parameters A, B, C, D, etc. We can control that by manipulating the individual name-value pairs during the scan. For example:
    Substitute (content\.jsp\?.*)&B=[^&]*(.*) $1&B=456$2
    Substitute (content\.jsp\?.*)&C=[^&]*(.*) $1$2
    Taglimit content\.jsp\?.*&D= 20
    The first command fixes the value of B=456. Whatever value the parameter B takes on during the scan, LinkScan will force the value back to 456. The second command deletes any references to the C parameter from every link that it finds. We have also included the third Taglimit command; this will cause LinkScan to completely ignore the twenty-first and subsequent links that include a D parameter. In other words, in this case, we only want to test a representative sample (20) of links that include a D parameter.
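The three commands in Example 3 can be modelled in Python as follows. This is a sketch under assumptions: the value patterns are written as `[^&]*` so that a whole parameter value is consumed, the substitutions are assumed to run before the limit check, and Taglimit is assumed to count every matching link it sees. The class name `LinkFilter` and the sample links are hypothetical.

```python
import re

FIX_B = re.compile(r"(content\.jsp\?.*)&B=[^&]*(.*)")   # force B=456
DROP_C = re.compile(r"(content\.jsp\?.*)&C=[^&]*(.*)")  # delete the C parameter
HAS_D = re.compile(r"content\.jsp\?.*&D=")              # Taglimit pattern

class LinkFilter:
    def __init__(self, limit=20):
        self.limit = limit
        self.d_links_seen = 0

    def process(self, link):
        """Return the rewritten link, or None if it is over the Taglimit."""
        link = FIX_B.sub(r"\g<1>&B=456\g<2>", link, count=1)
        link = DROP_C.sub(r"\g<1>\g<2>", link, count=1)
        if HAS_D.search(link):
            self.d_links_seen += 1
            if self.d_links_seen > self.limit:
                return None  # ignored: beyond the representative sample
        return link

if __name__ == "__main__":
    f = LinkFilter(limit=20)
    print(f.process("content.jsp?A=1&B=999&C=7&D=3"))
```

For the sample link this prints `content.jsp?A=1&B=456&D=3`: B is forced to 456 and C is gone.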
    Example 4
    For our next example, we shall consider a site that generates pages containing some links with the following structure:
    Rather than linking directly to Yahoo!, this page links to a script that generates a frameset that includes the referenced page. In a default configuration, LinkScan will happily follow the link, validating the frameset and the ultimate link to Yahoo!. However, it may not be productive to do that for potentially thousands of links. Furthermore, in the (extremely unlikely) event that the link to http://www.yahoo.com/ was broken, the error would appear in one of the GenerateFrame documents and not the original referring document. In order to repair that link, one would have to backtrack through the frameset to locate the original source of the trouble.
    Substitute cgi-bin/GenerateFrame.*&Link=([^&]+).* !U$1
    This command will extract the value of the Link= parameter, and the special !U token instructs LinkScan that the string needs to be un-encoded. So the original link:
    is transformed on-the-fly to:
    and then decoded to:
    And this means LinkScan can validate the link to Yahoo! directly without checking the GenerateFrame script many, many times. Furthermore, any errors will be flagged against the original document (and not one or more steps removed).
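The Example 4 rewrite amounts to pulling out the `Link=` value and URL-decoding it, which is what the `!U` token is described as doing. A minimal Python sketch, assuming the whole link is replaced by the decoded value; the function name `unwrap_frame_link` and the sample link (including its `Title` parameter) are hypothetical.

```python
import re
from urllib.parse import unquote

# Pattern taken from: Substitute cgi-bin/GenerateFrame.*&Link=([^&]+).* !U$1
FRAME_RULE = re.compile(r"cgi-bin/GenerateFrame.*&Link=([^&]+).*")

def unwrap_frame_link(link):
    """Return the decoded target of a GenerateFrame link, else the link itself."""
    found = FRAME_RULE.search(link)
    if found is None:
        return link
    # $1 is the still-encoded target; unquote() stands in for !U.
    return unquote(found.group(1))

if __name__ == "__main__":
    wrapped = "cgi-bin/GenerateFrame?Title=Yahoo&Link=http%3A%2F%2Fwww.yahoo.com%2F"
    print(unwrap_frame_link(wrapped))
```

For the sample link this prints `http://www.yahoo.com/`, the address that would then be checked directly.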
    Example 5
    For our final example, we include for illustration the complete configuration for a real-world large and very complex dynamic site:
    # Set the CGI limit to be very large
    # Include all file types on the Map
    Maxcgi = 10000
    Mapinclude .*
    # Force &A=B and insert it immediately after the '?'
    Substitute (cgi-bin.*[&\?])A=[^&=]*&*(.*) $1$2
    Substitute (cgi-bin.*\?)(.*) $1A=B&$2
    # Discard null and undefined values
    Substitute (cgi-bin.*)&B=(null|undefined)(.*) $1$3
    Substitute (cgi-bin.*)&C=(null|undefined)(.*) $1$3
    Substitute (cgi-bin.*)&D=(null|undefined)(.*) $1$3
    Substitute (cgi-bin.*)&R=(null|undefined)(.*) $1$3
    # For 'category', take the &C= if present, otherwise the &B=
    Substitute (cgi-bin/bv/scripts/category.*\?A=B).*?(&C=[^&=]*).* $1$2
    Substitute (cgi-bin/bv/scripts/category.*\?A=B).*?(&B=[^&=]*).* $1$2
    # For 'content', take the &D= or &R= if present (call it &D=). Otherwise take the &B=
    Substitute (cgi-bin/bv/scripts/content.*\?A=B).*?&[DR]=([^&=]*).* $1&D=$2
    Substitute (cgi-bin/bv/scripts/content.*\?A=B).*?(&B=[^&=]*).* $1$2
    # For 'frame', take the &D= or &R= if present (call it &D=). Otherwise take the &B=
    Substitute (cgi-bin/bv/scripts/frame.*\?A=B).*?&[DR]=([^&=]*).* $1&D=$2
    Substitute (cgi-bin/bv/scripts/frame.*\?A=B).*?(&B=[^&=]*).* $1$2
    # For 'mailing...', take the &R=
    Substitute (cgi-bin/bv/scripts/mailing.*\?A=B).*?(&R=[^&=]*).* $1$2
    # For 'contact', take the &B=, &C= and &Comments
    Substitute (cgi-bin/bv/scripts/contact.*\?A=B).*?(&B=[^&=]*).*?(&C=[^&=]*).*?(&Comments=[^&=]*).* $1$2$3$4
    # Mark redirects to Error page as 404
    # Mark documents containing 'Error Code:' as 404
    Errordoc cgi-bin/bv/scripts/error.jsp
    Errorbody Error\s+Code:[^\n<]*
    # Hide some frequently arising errors
    Noforms = 1
    Exclude images/arrow.gif
    Example 6
    Next we will consider a reference to a JavaScript function:
    <a href="javascript:MyFunction(4,5,6);">
    The following Substitutescript command:
    Substitutescript .*:MyFunction\((\d+),(\d+),(\d+)\) '/somepage.jsp?Par1=$1&Par2=$2&Par3=$3'
    will transform the function call into the following link which will then be validated/processed by LinkScan.
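The Substitutescript rule from Example 6 can be tried out in Python. This is a sketch that assumes the pattern is matched from the start of the `href` value and that `$1`, `$2` and `$3` are the three captured numbers; the function name `script_to_link` is hypothetical.

```python
import re

# Pattern taken from the Substitutescript command above.
SCRIPT_RULE = re.compile(r".*:MyFunction\((\d+),(\d+),(\d+)\)")

def script_to_link(href):
    """Turn a javascript:MyFunction(a,b,c) call into a plain link, else None."""
    found = SCRIPT_RULE.match(href)
    if found is None:
        return None
    par1, par2, par3 = found.groups()
    return f"/somepage.jsp?Par1={par1}&Par2={par2}&Par3={par3}"

if __name__ == "__main__":
    print(script_to_link("javascript:MyFunction(4,5,6);"))
```

For the call shown above this prints `/somepage.jsp?Par1=4&Par2=5&Par3=6`. The `openProfilewin` rule in the sample config follows the same idea with one captured number.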
  2. admhat

    admhat Newbie

    Jun 13, 2010
    What does this actually do? Is it for scraping pages? If I knew more about the context I'd probably be able to help. I need a tl;dr.