Deprecation notice
This wiki page should be considered deprecated. The manual page is the up-to-date documentation for scraping; changes to the documentation should be made via pull request.
Scraping Configuration
As of develop release 5078402, custom scraping of performer and scene details is now supported.
By default, Stash looks for scraper configurations in the `scrapers` sub-directory of the directory where the stash `config.yml` is read. This will either be the `$HOME/.stash` directory or the current working directory.
Custom scrapers are added by placing configuration yaml files (format: `scrapername.yml`) in the `scrapers` directory. The configuration file looks like the following:
Basic scraper configuration file structure
```yaml
name: <site>
performerByName:
  <single scraper config>
performerByFragment:
  <single scraper config>
performerByURL:
  <multiple scraper URL configs>
sceneByFragment:
  <single scraper config>
sceneByURL:
  <multiple scraper URL configs>
<other configurations>
```
`name` is mandatory; all other top-level fields are optional. The inclusion of each top-level field determines what capabilities the scraper has.
A scraper configuration in any of the top-level fields must have at least an `action` field. The other required fields depend on the value of the `action` field.
The scraping types and their required fields are outlined in the following table:
| Behaviour | Required configuration |
|---|---|
| Scraper in `Scrape...` dropdown button in Performer Edit page | Valid `performerByName` and `performerByFragment` configurations. |
| Scrape performer from URL | Valid `performerByURL` configuration with matching URL. |
| Scraper in `Scrape...` dropdown button in Scene Edit page | Valid `sceneByFragment` configuration. |
| Scrape scene from URL | Valid `sceneByURL` configuration with matching URL. |
URL-based scraping accepts multiple scrape configurations, and each configuration requires a `url` field. Stash iterates through these configurations, attempting to match the entered URL against the `url` fields, and executes the first scraping configuration where the entered URL contains the value of the `url` field.
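For example, in the following sketch (the site and scraper names are hypothetical), a URL containing `example-site.com/video` is handled by the first configuration, and any other `example-site.com` URL falls through to the second:

```yaml
sceneByURL:
  - action: scrapeXPath
    url:
      - example-site.com/video   # checked first
    scraper: videoScraper
  - action: scrapeXPath
    url:
      - example-site.com         # broader match, used only if the first does not match
    scraper: genericScraper
```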
Scraper Actions
Script
Executes a script to perform the scrape. The `script` field is required for this action and accepts a list of string arguments. For example:

```yaml
action: script
script:
  - python
  - iafdScrape.py
  - query
```

This configuration would execute `python iafdScrape.py query`.
Stash sends data to the script process's `stdin` stream and expects the output on its `stdout` stream. Any errors and progress messages should be output to `stderr`.
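As a minimal sketch of this contract (the echo-style response is a placeholder, not part of stash):

```python
import json
import sys

# stash writes the input JSON to the script's stdin
query = json.loads(sys.stdin.read())

# progress and error messages belong on stderr
sys.stderr.write("searching for: %s\n" % query.get("name", ""))

# the result is written to stdout as JSON; this placeholder just
# echoes the query back as a single search result
result = [{"name": query.get("name", ""), "url": "https://example-site.com/performer"}]
print(json.dumps(result))
```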
The script is sent input and expects output based on the scraping type, as detailed in the following table:
| Scrape type | Input | Output |
|---|---|---|
| `performerByName` | `{"name": "<performer query string>"}` | Array of JSON-encoded performer fragments (including at least `name`) |
| `performerByFragment` | JSON-encoded performer fragment | JSON-encoded performer fragment |
| `performerByURL` | `{"url": "<url>"}` | JSON-encoded performer fragment |
| `sceneByFragment` | JSON-encoded scene fragment | JSON-encoded scene fragment |
| `sceneByURL` | `{"url": "<url>"}` | JSON-encoded scene fragment |
For `performerByName`, only `name` is required in the returned performer fragments. When the user selects a result, the entire selected object is sent back to `performerByFragment` to scrape that specific performer, so the other fields may be included to assist that scrape. For example, the `url` field may be filled in with the specific performer page, and `performerByFragment` can then extract details using its value.
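For illustration (the performer name and URL are hypothetical), a `performerByName` query of `{"name": "jane"}` might return:

```json
[
  {"name": "Jane Example", "url": "https://example-site.com/performers/jane-example"}
]
```

If the user selects that result, the whole object is sent as the input to `performerByFragment`.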
As an example, the following python code snippet can be used to scrape a performer:
```python
import json
import sys

def readJSONInput():
    # stash sends the input JSON on stdin
    return json.loads(sys.stdin.read())

def searchPerformer(name):
    # perform scraping here - using name for the query
    # fill in the output
    ret = []
    # example shown for a single found performer
    p = {}
    p['name'] = "some name"
    p['url'] = "performer url"
    ret.append(p)
    return ret

def scrapePerformer(input):
    # get the url from the input fragment and scrape the performer page
    url = input['url']
    return scrapePerformerURL(url)

def debugPrint(t):
    # progress messages go to stderr
    sys.stderr.write(t + "\n")

def scrapePerformerURL(url):
    debugPrint("Reading url...")
    debugPrint("Parsing html...")
    # parse html
    # fill in performer details - single object
    ret = {}
    ret['name'] = "fred"
    ret['aliases'] = "freddy"
    ret['ethnicity'] = ""
    # and so on
    return ret

# read the input
i = readJSONInput()

if sys.argv[1] == "query":
    ret = searchPerformer(i['name'])
    print(json.dumps(ret))
elif sys.argv[1] == "scrape":
    ret = scrapePerformer(i)
    print(json.dumps(ret))
elif sys.argv[1] == "scrapeURL":
    ret = scrapePerformerURL(i['url'])
    print(json.dumps(ret))
```
scrapeXPath
This action scrapes a web page and parses it using an xpath configuration. It is valid for `performerByName`, `performerByURL` and `sceneByURL` only.
This action requires that the top-level `xPathScrapers` configuration is populated. The `scraper` field is required and must match the name of a scraper configured in `xPathScrapers`. For example:
```yaml
sceneByURL:
  - action: scrapeXPath
    url:
      - pornhub.com/view_video.php
    scraper: sceneScraper
```
The above configuration requires that `sceneScraper` exists in the `xPathScrapers` configuration.
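A minimal matching entry might look like the following sketch (the Title selector is a hypothetical placeholder):

```yaml
xPathScrapers:
  sceneScraper:
    scene:
      # hypothetical selector; replace with one matching the target page
      Title: //h1[@class="videoTitle"]
```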
Use with performerByName
For `performerByName`, the `queryURL` field must also be present. This field is the search query URL used to look up performer names; the placeholder sequence `{}` is replaced with the performer name search string. For the subsequent performer scrape to work, the `URL` field of each result must be filled in with the URL of a performer page that matches a URL given in a `performerByURL` scraping configuration. For example:
```yaml
name: Boobpedia
performerByName:
  action: scrapeXPath
  queryURL: http://www.boobpedia.com/wiki/index.php?title=Special%3ASearch&search={}&fulltext=Search
  scraper: performerSearch
performerByURL:
  - action: scrapeXPath
    url:
      - boobpedia.com/boobs/
    scraper: performerScraper
xPathScrapers:
  performerSearch:
    performer:
      Name: # name element
      URL: # URL element that matches the boobpedia.com/boobs/ URL above
  performerScraper:
    # ... performer scraper details ...
```
XPath scrapers configuration
The top-level `xPathScrapers` field contains the xpath scraping configurations, each keyed by a freely chosen name. A scraping configuration may contain a `common` field, and must contain `performer` or `scene` depending on the scraping type it is used for.
Within the `performer`/`scene` field are key/value pairs corresponding to the golang fields (see below) on the performer/scene object. These fields are case-sensitive.
The value of each may be either a simple xpath string, which tells the system where to get the value of the field from, or a more advanced configuration (see below). For example:
```yaml
performer:
  Name: //h1[@itemprop="name"]
```
This will set the `Name` attribute of the returned performer to the text content of the element that matches `<h1 itemprop="name">...`.
The value may also be a sub-object, indicating that post-processing is required. In that case, the xpath must be set in the `selector` key of the sub-object. For example, using the same xpath as above:
```yaml
performer:
  Name:
    selector: //h1[@itemprop="name"]
    # post-processing config values
```
Common fragments
The `common` field is used to configure xpath fragments that can be referenced in the xpath strings. These are key-value pairs where the key is the string used to reference the fragment, and the value is the string that the fragment will be replaced with. For example:
```yaml
common:
  $infoPiece: //div[@class="infoPiece"]/span
performer:
  Measurements: $infoPiece[text() = 'Measurements:']/../span[@class="smallInfo"]
```
The `Measurements` xpath string will replace `$infoPiece` with `//div[@class="infoPiece"]/span`, resulting in: `//div[@class="infoPiece"]/span[text() = 'Measurements:']/../span[@class="smallInfo"]`.
Post-processing options
The following post-processing keys are available:
- `concat`: if an xpath matches multiple elements and `concat` is present, then all of the elements will be concatenated together.
- `replace`: contains an array of sub-objects. Each sub-object must have a `regex` and a `with` field. The `regex` field is the regex pattern to replace, and `with` is the string to replace it with. `$` is used to reference capture groups: `$1` is the first capture group, `$2` the second, and so on. Replacements are performed in order of the array. Due to the way data is cleaned during post-processing, newlines are removed from text fields; if you want to add a newline, a replace regex with a `with: "\n"` clause is required (#579).
- `subScraper`: if present, the sub-scraper will be executed after all other post-processes are complete and before `parseDate`. It takes the value and performs an http request, using the value as the URL. Within the `subScraper` config is a nested scraping configuration. This allows you to traverse to other webpages to get the attribute value you are after (see the sketch after this list). For more info and examples have a look at #370 and #606.
- `parseDate`: if present, the value is the date format using go's reference date (2006-01-02). For example, if an example date was `14-Mar-2003`, then the date format would be `02-Jan-2006`. See the time.Parse documentation for details. When present, the scraper will convert the input string into a date, then convert it to the string format used by stash (`YYYY-MM-DD`).
- `split`: the inverse of `concat`. Splits a string into multiple elements using the given separator. For more info and examples have a look at PR #579.
Post-processing is done in order of the fields above - `concat`, `replace`, `subScraper`, `parseDate` and then `split`.
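As an illustration of `subScraper` (a sketch only; the selectors and page layout are hypothetical), the scraped `href` value is fetched as a URL and the nested selector is applied to the resulting page:

```yaml
performer:
  Image:
    # grabs a link from the performer page...
    selector: //a[@class="profile-link"]/@href
    subScraper:
      # ...then selects the image from the linked page
      selector: //img[@id="profile-image"]/@src
```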
Example
A performer and scene xpath scraper is shown as an example below:
```yaml
name: Pornhub
performerByURL:
  - action: scrapeXPath
    url:
      - pornhub.com
    scraper: performerScraper
sceneByURL:
  - action: scrapeXPath
    url:
      - pornhub.com/view_video.php
    scraper: sceneScraper
xPathScrapers:
  performerScraper:
    common:
      $infoPiece: //div[@class="infoPiece"]/span
    performer:
      Name: //h1[@itemprop="name"]
      Birthdate:
        selector: //span[@itemprop="birthDate"]
        parseDate: Jan 2, 2006
      Twitter: //span[text() = 'Twitter']/../@href
      Instagram: //span[text() = 'Instagram']/../@href
      Measurements: $infoPiece[text() = 'Measurements:']/../span[@class="smallInfo"]
      Height:
        selector: $infoPiece[text() = 'Height:']/../span[@class="smallInfo"]
        replace:
          - regex: .*\((\d+) cm\)
            with: $1
      Ethnicity: $infoPiece[text() = 'Ethnicity:']/../span[@class="smallInfo"]
      FakeTits: $infoPiece[text() = 'Fake Boobs:']/../span[@class="smallInfo"]
      Piercings: $infoPiece[text() = 'Piercings:']/../span[@class="smallInfo"]
      Tattoos: $infoPiece[text() = 'Tattoos:']/../span[@class="smallInfo"]
      CareerLength:
        selector: $infoPiece[text() = 'Career Start and End:']/../span[@class="smallInfo"]
        replace:
          - regex: \s+to\s+
            with: "-"
  sceneScraper:
    common:
      $performer: //div[@class="pornstarsWrapper"]/a[@data-mxptype="Pornstar"]
      $studio: //div[@data-type="channel"]/a
    scene:
      Title: //div[@id="main-container"]/@data-video-title
      Tags:
        Name: //div[@class="categoriesWrapper"]//a[not(@class="add-btn-small ")]
      Performers:
        Name: $performer/@data-mxptext
        URL: $performer/@href
      Studio:
        Name: $studio
        URL: $studio/@href
```
See also #333 for more examples.
XPath resources:
- Test XPaths in Firefox: https://addons.mozilla.org/en-US/firefox/addon/try-xpath/
- XPath cheatsheet: https://devhints.io/xpath
Object fields
Performer
- Name
- Gender
- URL
- Twitter
- Instagram
- Birthdate
- Ethnicity
- Country
- EyeColor
- Height
- Measurements
- FakeTits
- CareerLength
- Tattoos
- Piercings
- Aliases
- Image

Note: `Gender` must be one of `male`, `female`, `transgender_male`, `transgender_female` (case insensitive).
Scene
From the scene page, Studio, Movies, Tags, and Performers are matched based on their `Name` field.

- Title
- Details
- URL
- Date
- Image
- Studio (see Studio fields)
- Movies (see Movie fields)
- Tags (see Tag fields)
- Performers (list of Performer fields)
Studio
- Name
- URL

Tag

- Name

Movie

- Name
- Aliases
- Duration
- Date
- Rating
- Director
- Synopsis
- URL
Stash
A different stash server can be configured as a scraping source. This action applies only to `performerByName`, `performerByFragment`, and `sceneByFragment` types. This action requires that the top-level `stashServer` field is configured.
`stashServer` contains a single `url` field for the remote stash server. The username and password can be embedded in this string using `username:password@host`.
An example stash scrape configuration is below:
```yaml
name: stash
performerByName:
  action: stash
performerByFragment:
  action: stash
sceneByFragment:
  action: stash
stashServer:
  url: http://stashserver.com:9999
```
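To embed credentials, use the `username:password@host` form described above (the values here are hypothetical):

```yaml
stashServer:
  url: http://myuser:mypassword@stashserver.com:9999
```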
Debugging support
To print the received html from a scraper request to the log file, add the following to your scraper yml file:
```yaml
debug:
  printHTML: true
```
Community Scrapers
You can always have a look at the scrapers provided by the stash community in the stashapp/CommunityScrapers repository on GitHub.