
Commit 91bd9dd

Code Express committed
Switched the XML parser package and enabled modules on the project
1 parent e627cd8 commit 91bd9dd

7 files changed, +206 -62 lines

README.md

Lines changed: 102 additions & 3 deletions
```diff
@@ -1,11 +1,110 @@
 # Web Plucker
 
-`webpluck` scrapes a specific values from a web page. It works as a standalone
-binary as well as in a API mode.
+`webpluck` scrapes specific values from a web page. It works as a
+standalone binary as well as in API mode.
 
-`webpluck` takes the following as input:
+## Download
+
+### Latest Releases
+`webpluck` is available for 64-bit Linux, OS X and Windows systems.
+Latest versions can be downloaded from the
+[Release](https://github.com/codeexpress/webpluck/releases) tab above. This is the preferred way.
+
+### Build from source
+This is a Go project. Assuming you have the Go compiler installed,
+the following will build the binary from scratch:
+```
+$ git clone https://github.com/codeexpress/webpluck
+$ cd webpluck
+$ go get
+$ go build -o webpluck ./cmd
+```
+
+## Usage
+`webpluck` takes the following input:
 - URL of the webpage
 - XPATH of the element
 - optional regex to further narrow the selection
 
 and outputs the selected value.
+
+### 1. URL of the webpage
+This is the link to the webpage that has the desired information we want to extract.
+
+For example, if we want to scrape the founders of the StackOverflow website from its company page, the URL is:
+https://stackoverflow.com/company. The desired value we want to extract is: **Joel Spolsky and Jeff Atwood**
+
+<img width="507" alt="baseUrl" src="https://user-images.githubusercontent.com/14211134/81618604-5335bf00-9405-11ea-8b8c-ddb75e194983.png">
+
+### 2. XPATH of the element
+This is the **xpath** of the element on the page that contains the required information. A good way to get it (in the Chrome browser) is to:
+- Right click on the place where the information is present
+- Click "Inspect" to open the Chrome developer tools window with the element highlighted
+- On the highlighted value in the HTML source code, `Right click -> Copy -> Copy xpath`
+- The copied value is the xpath we need
+
+
+Get xpath Step 1 | Get xpath Step 2
+:-------------------------:|:-------------------------:
+<img width="352" alt="Screen Shot 2020-05-12 at 4 06 13 AM" src="https://user-images.githubusercontent.com/14211134/81619156-8d539080-9406-11ea-99bf-17e9e4da7e87.png" > | <img width="355" alt="Screen Shot 2020-05-12 at 4 08 02 AM" src="https://user-images.githubusercontent.com/14211134/81619157-8e84bd80-9406-11ea-8941-b6c6e0dfab46.png">
+
+The xpath in the example above comes out to be: ```//*[@id="content"]/section[3]/ol/li[1]/ol/li[2]/text()```
+
+### 3. `regex` to pluck the right value
+
+Note that the xpath above leads us to the value: *Joel Spolsky and Jeff Atwood launch Stack Overflow*
+
+Since we want to trim that down further, we'll provide a regex value to extract just the names.
+
+This regex will fetch just the names (the value captured in parentheses):
+``` ^(.*) launch .* ```
+
+## Sample standalone invocation
+
+`webpluck` can be run as a standalone binary. To extract the names using the three params we just obtained, copy the `targets.yml` file and populate it with the parameters. The resulting `targets.yml` should look like this:
+
+```yaml
+targetList:
+- name: stackoverflow_founders
+  baseUrl: https://stackoverflow.com/company
+  xpath: //*[@id="content"]/section[3]/ol/li[1]/ol/li[2]/text()
+  regex: ^(.*) launch .*
+```
+
+Now invoke webpluck as follows and obtain the answer:
+```bash
+$ ./webpluck_osx -f /path/to/targets.yml
+{
+  "stackoverflow_founders": "Joel Spolsky and Jeff Atwood"
+}
+```
+
+## Sample API invocation
+
+`webpluck` can be run in server mode as well. Thereafter, clients written in other programming languages can scrape web pages using the `webpluck` API over the network.
+
+To run `webpluck` in server mode listening on localhost port 8080:
+```bash
+$ ./webpluck -p 8080
+```
+
+An instance of the `webpluck` API is running at `https://api.code.express/webpluck/`. You can use that for your light extraction needs. If your load is heavy, consider spinning up your own server running `webpluck`.
+
+Armed with the knowledge of `baseUrl`, `xpath` and `regex`, we can now call the API endpoint by POSTing these three params.
+Example `curl` invocation for the server mode:
+```bash
+curl 'https://api.code.express/webpluck/' \
+  --data-urlencode 'baseUrl=https://stackoverflow.com/company' \
+  --data-urlencode 'xpath=//*[@id="content"]/section[3]/ol/li[1]/ol/li[2]/text()' \
+  --data-urlencode 'regex=^(.*) launch .*' -g
+```
+
+The result from the API is as follows. The `pluckedData` field returns the value extracted:
+```json
+{
+  "baseUrl": "https://stackoverflow.com/company",
+  "pluckedData": "Joel Spolsky and Jeff Atwood",
+  "regex": "^(.*) launch .*",
+  "xpath": "//*[@id=\"content\"]/section[3]/ol/li[1]/ol/li[2]/text()"
+}
+```
```

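The three inputs the README above walks through (`baseUrl`, `xpath`, `regex`) map onto a handful of library calls. Below is a minimal sketch of that flow using the `antchfx/htmlquery` package this commit switches to, fed with the README's StackOverflow example; it is illustrative only, does not reuse the project's own `ExtractTextFromUrl` (shown further down in this diff), and its error handling is an assumption rather than how `webpluck` reports failures.

```go
package main

import (
    "fmt"
    "log"
    "regexp"

    "github.com/antchfx/htmlquery"
)

func main() {
    baseUrl := "https://stackoverflow.com/company"
    xpath := `//*[@id="content"]/section[3]/ol/li[1]/ol/li[2]/text()`
    regex := `^(.*) launch .*`

    // Fetch the page and parse it into an HTML node tree.
    doc, err := htmlquery.LoadURL(baseUrl)
    if err != nil {
        log.Fatal(err)
    }

    // Select the node the XPath points at and take its text content.
    node := htmlquery.FindOne(doc, xpath)
    if node == nil {
        log.Fatal("xpath matched nothing")
    }
    value := htmlquery.InnerText(node)

    // The first capture group of the regex holds the names we want.
    m := regexp.MustCompile(regex).FindStringSubmatch(value)
    if len(m) < 2 {
        log.Fatal("regex did not match")
    }
    fmt.Println(m[1]) // e.g. Joel Spolsky and Jeff Atwood
}
```
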
main.go renamed to cmd/main.go

Lines changed: 30 additions & 21 deletions
```diff
@@ -10,23 +10,32 @@ import (
     "os"
     "strconv"
 
+    "github.com/codeexpress/webpluck/logger"
+    "github.com/codeexpress/webpluck/webpluck"
     "gopkg.in/yaml.v2"
 )
 
-const (
-    UserAgent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36"
-)
-
 var (
     //argument flags
     filePtr       *string
     outputTextPtr *bool
     serverModePtr *int
 )
 
+type targetList struct {
+    TargetList []dataLocation `yaml:"targetList"`
+}
+
+type dataLocation struct {
+    Name    string `yaml:"name"`
+    BaseUrl string `yaml:"baseUrl"`
+    Xpath   string `yaml:"xpath"`
+    Regex   string `yaml:"regex"`
+}
+
 func main() {
     initFlags()
-    initLogger()
+    logger.InitLogger()
 
     serverMode := isFlagPassed("p")
 
@@ -42,7 +51,7 @@ Listens on a port and answers online queries of type:
 http://localhost:8080?baseUrl="example.com"&xpath="/html/body"&regex=""
 */
 func serveApi() {
-    logIt("Started HTTP server on localhost: "+strconv.Itoa(*serverModePtr), true)
+    logger.LogIt("Started HTTP server on localhost: "+strconv.Itoa(*serverModePtr), true)
 
     http.HandleFunc("/", handleHttp)
     fmt.Println(http.ListenAndServe(":"+strconv.Itoa(*serverModePtr), nil))
@@ -59,22 +68,22 @@ func handleHttp(w http.ResponseWriter, req *http.Request) {
     results["xpath"] = xpath
     results["regex"] = regex
 
-    logIt(getIp(req) + " " + req.Header.Get("User-Agent") + " Request: ")
-    logIt(results)
+    logger.LogIt(getIp(req) + " " + req.Header.Get("User-Agent") + " Request: ")
+    logger.LogIt(results)
     defer func() { // in case of panic
         if err := recover(); err != nil {
-            http.Error(w, "my own error message", http.StatusInternalServerError)
+            http.Error(w, "Internal Server Error", http.StatusInternalServerError)
             fmt.Fprintf(w, "Webpluck encountered an error. Make sure that the baseUrl is a valid URL and xpath and regex are valid\n")
-            fmt.Fprintf(w, "Error encountered is:\n%s\n", err)
-            logIt(err)
+            fmt.Fprintf(w, "Error description:\n%s\n", err)
+            logger.LogIt(err)
         }
     }()
-    text := ExtractTextFromUrl(baseUrl, xpath, regex)
+    text := webpluck.ExtractTextFromUrl(baseUrl, xpath, regex)
     results["pluckedData"] = text
     jsonString, err := json.MarshalIndent(results, "", " ")
     check(err)
     fmt.Fprintf(w, string(jsonString))
-    logIt("Answer: " + text)
+    logger.LogIt("Answer: " + text)
 }
 
 func pluckFromFile() {
@@ -88,15 +97,15 @@ func pluckFromFile() {
     results := make(map[string]string)
 
     for _, t := range list.TargetList {
-        text := ExtractTextFromUrl(t.BaseUrl, t.Xpath, t.Regex)
+        text := webpluck.ExtractTextFromUrl(t.BaseUrl, t.Xpath, t.Regex)
         results[t.Name] = text
         if *outputTextPtr { // if output to text (t) flag is set
             fmt.Println(t.Name + ": " + text)
         }
     }
 
-    logIt("Webpluck invoked. Reading from file: " + *filePtr)
-    logIt(results)
+    logger.LogIt("Webpluck invoked. Reading from file: " + *filePtr)
+    logger.LogIt(results)
 
     if !*outputTextPtr { // default case is to print in JSON
         jsonString, err := json.MarshalIndent(results, "", " ")
@@ -143,12 +152,12 @@ func check(e error) {
 // Get IP address of the incoming HTTP request based on forwarded-for
 // header (present in case of proxy). If not, use the remote address
 func getIp(req *http.Request) string {
-    forwarded := req.Header.Get("X-FORWARDED-FOR")
-    var addr string
-    if forwarded != "" {
-        addr = forwarded
+    forwardedIp := req.Header.Get("X-Forwarded-For")
+    if forwardedIp != "" {
+        return forwardedIp
     }
-    addr = req.RemoteAddr
+
+    addr := req.RemoteAddr
     ip, _, _ := net.SplitHostPort(addr)
     return ip
 }
```

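The `handleHttp` handler above echoes the three request parameters back together with a `pluckedData` field, which is exactly what the README's `curl` example exercises. Below is a hedged client sketch in Go that POSTs the same URL-encoded form values to a locally running server (`./webpluck -p 8080`); it assumes the handler reads `baseUrl`, `xpath` and `regex` as form values, as the `--data-urlencode` flags in the `curl` invocation imply.

```go
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "net/url"
    "strings"
)

func main() {
    // The same three parameters the README's curl example sends.
    form := url.Values{}
    form.Set("baseUrl", "https://stackoverflow.com/company")
    form.Set("xpath", `//*[@id="content"]/section[3]/ol/li[1]/ol/li[2]/text()`)
    form.Set("regex", `^(.*) launch .*`)

    resp, err := http.Post("http://localhost:8080/",
        "application/x-www-form-urlencoded", strings.NewReader(form.Encode()))
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // The server replies with a flat JSON object; pluckedData holds the result.
    var result map[string]string
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        log.Fatal(err)
    }
    fmt.Println(result["pluckedData"])
}
```
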
go.mod

Lines changed: 15 additions & 0 deletions
```diff
@@ -0,0 +1,15 @@
+module github.com/codeexpress/webpluck
+
+go 1.17
+
+require (
+    github.com/antchfx/htmlquery v1.2.4
+    golang.org/x/net v0.0.0-20211011170408-caeb26a5c8c0
+    gopkg.in/yaml.v2 v2.4.0
+)
+
+require (
+    github.com/antchfx/xpath v1.2.0 // indirect
+    github.com/golang/groupcache v0.0.0-20200121045136-8c9f03a8e57e // indirect
+    golang.org/x/text v0.3.6 // indirect
+)
```

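With the module path declared above, the scraper and logger become importable subpackages instead of files in `package main`. A minimal consumer sketch, assuming `ExtractTextFromUrl` takes three string arguments and returns a string, as its call sites in `cmd/main.go` suggest:

```go
package main

import (
    "fmt"

    // Import path follows the module declared in go.mod.
    "github.com/codeexpress/webpluck/webpluck"
)

func main() {
    // Same baseUrl, xpath and regex as the README example.
    text := webpluck.ExtractTextFromUrl(
        "https://stackoverflow.com/company",
        `//*[@id="content"]/section[3]/ol/li[1]/ol/li[2]/text()`,
        `^(.*) launch .*`,
    )
    fmt.Println(text) // expected: Joel Spolsky and Jeff Atwood
}
```
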
go.sum

Lines changed: 23 additions & 0 deletions
```diff
@@ -0,0 +1,23 @@
+github.com/antchfx/htmlquery v1.2.4 h1:qLteofCMe/KGovBI6SQgmou2QNyedFUW+pE+BpeZ494=
+github.com/antchfx/htmlquery v1.2.4/go.mod h1:2xO6iu3EVWs7R2JYqBbp8YzG50gj/ofqs5/0VZoDZLc=
+github.com/antchfx/xpath v1.2.0 h1:mbwv7co+x0RwgeGAOHdrKy89GvHaGvxxBtPK0uF9Zr8=
+github.com/antchfx/xpath v1.2.0/go.mod h1:i54GszH55fYfBmoZXapTHN8T8tkcHfRgLyVwwqzXNcs=
+github.com/golang/groupcache v0.0.0-20200121045136-8c9f03a8e57e h1:1r7pUrabqp18hOBcwBwiTsbnFeTZHV9eER/QT5JVZxY=
+github.com/golang/groupcache v0.0.0-20200121045136-8c9f03a8e57e/go.mod h1:cIg4eruTrX1D+g88fzRXU5OdNfaM+9IcxsU14FzY7Hc=
+golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=
+golang.org/x/net v0.0.0-20200421231249-e086a090c8fd/go.mod h1:qpuaurCH72eLCgpAm/N6yyVIVM9cpaDIP3A8BGJEC5A=
+golang.org/x/net v0.0.0-20211011170408-caeb26a5c8c0 h1:qOfNqBm5gk93LjGZo1MJaKY6Bph39zOKz1Hz2ogHj1w=
+golang.org/x/net v0.0.0-20211011170408-caeb26a5c8c0/go.mod h1:9nx3DQGgdP8bBQD5qxJ1jj9UTztislL4KSBs9R2vV5Y=
+golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
+golang.org/x/sys v0.0.0-20200323222414-85ca7c5b95cd/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
+golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
+golang.org/x/sys v0.0.0-20210423082822-04245dca01da/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
+golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo=
+golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
+golang.org/x/text v0.3.6 h1:aRYxNxv6iGQlyVaZmk6ZgYEDa+Jg18DxebPSrd6bg1M=
+golang.org/x/text v0.3.6/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
+golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=
+gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405 h1:yhCVgyC4o1eVCa2tZl7eS0r+SDo693bJlVdllGtEeKM=
+gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
+gopkg.in/yaml.v2 v2.4.0 h1:D8xgwECY7CYvx+Y2n4sBz93Jn9JRvxdiyyo8CTfuKaY=
+gopkg.in/yaml.v2 v2.4.0/go.mod h1:RDklbk79AGWmwhnvt/jBztapEOGDOx6ZbXqjP6csGnQ=
```

logger.go renamed to logger/logger.go

Lines changed: 3 additions & 3 deletions
```diff
@@ -1,4 +1,4 @@
-package main
+package logger
 
 import (
     "encoding/json"
@@ -14,7 +14,7 @@ var (
 )
 
 // Initializing the logger and customizing prefix
-func initLogger() {
+func InitLogger() {
     f, err := os.OpenFile("run.log", os.O_RDWR|os.O_CREATE|os.O_APPEND, 0666)
     if err != nil {
         panic(err)
@@ -26,7 +26,7 @@ func initLogger() {
 // Logs to log file
 // takes generic object and then based on the type of object,
 // logs is in appropriate style.
-func logIt(val interface{}, console ...bool) {
+func LogIt(val interface{}, console ...bool) {
     // if console is passed, print to stdout as well
     if len(console) != 0 {
         fmt.Println(val)
```

targets.yml

Lines changed: 13 additions & 9 deletions
```diff
@@ -1,13 +1,17 @@
 targetList:
-- name: example.com
-  baseUrl: http://example.com/
-  xpath: /html/body/div/p[2]/a/@href
-  regex: ^(?:https?://)?(?:[^@\n]+@)?([^:/\n]+)
-- name: stackoverflow_founders
-  baseUrl: https://stackoverflow.com/company
-  xpath: //*[@id="content"]/section[3]/ol/li[1]/ol/li[2]
-  regex: ^(.*) launch .*
-- name: stackoverflow_example_without_regex
+- name: ufl_ms_cs_fall
+  baseUrl: https://www.cise.ufl.edu/admissions/graduate/
+  xpath: //*[@id="tablepress-14"]/tbody/tr[2]/td[2]
+  regex:
+- name: ufl_ms_cs_spring
+  baseUrl: https://www.cise.ufl.edu/admissions/graduate/
+  xpath: //*[@id="tablepress-15"]/tbody/tr[1]/td[2]
+  regex:
+- name: ncsu_cs_ms
+  baseUrl: https://www.csc.ncsu.edu/academics/graduate/admdeadlines.php
+  xpath: //*[@id="main"]/ol/li[1]/ul/li[1]/strong
+  regex:
+- name: stackoverflow_extract_asked_date_of_a_question
   baseUrl: https://stackoverflow.com/questions/18361750
   xpath: //*[@id="question"]/div[2]/div[2]/div[3]/div/div[3]/div/div[1]
   regex:
```

webpluck.go renamed to webpluck/webpluck.go

Lines changed: 20 additions & 26 deletions
```diff
@@ -1,24 +1,18 @@
-package main
+package webpluck
 
 import (
     "io/ioutil"
     "net/http"
     "regexp"
     "strings"
 
-    "gopkg.in/xmlpath.v2"
+    "github.com/antchfx/htmlquery"
+    "golang.org/x/net/html"
 )
 
-type targetList struct {
-    TargetList []dataLocation `yaml:"targetList"`
-}
-
-type dataLocation struct {
-    Name    string `yaml:"name"`
-    BaseUrl string `yaml:"baseUrl"`
-    Xpath   string `yaml:"xpath"`
-    Regex   string `yaml:"regex"`
-}
+const (
+    UserAgent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36"
+)
 
 /* Params:
    url - URL of the page to be scraped
@@ -35,23 +29,24 @@ func ExtractTextFromUrl(
     text := ""
 
     //logIt("Fetch from URL: "+url, 1)
-    parsedHtml := fetchUrl(url) // returns a xmlpath.Node object
-    path := xmlpath.MustCompile(xpath)
-    value, ok := path.String(parsedHtml)
-    if ok {
-        if regex != "" {
-            // try applying regex
-            regexMatch := regexp.MustCompile(regex)
-            text = regexMatch.FindStringSubmatch(string(value))[1]
-        } else {
-            text = value // no regex, the xpath element is the value
-        }
+    parsedHtml := fetchUrl(url) // returns a xmlquery.Node object
+
+    node := htmlquery.FindOne(parsedHtml, xpath)
+    value := htmlquery.InnerText(node)
+
+    if regex != "" {
+        // try applying regex
+        regexMatch := regexp.MustCompile(regex)
+        text = regexMatch.FindStringSubmatch(string(value))[1]
+    } else {
+        text = value // no regex, the xpath element is the value
     }
+
     return strings.TrimSpace(text)
 }
 
 // does a HTTP GET and returns the HTML body for that URL
-func fetchUrl(url string) *xmlpath.Node {
+func fetchUrl(url string) *html.Node {
     client := &http.Client{}
     req, err := http.NewRequest("GET", url, nil)
     if err != nil {
@@ -66,8 +61,7 @@ func fetchUrl(url string) *xmlpath.Node {
 
     html, _ := ioutil.ReadAll(resp.Body)
     htmlStr := string(html)
-
-    parsedHtml, err := xmlpath.ParseHTML(strings.NewReader(htmlStr))
+    parsedHtml, err := htmlquery.Parse(strings.NewReader(htmlStr))
     if err != nil {
         panic(err)
     }
```

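One behavioural note on the parser switch above: the old `xmlpath` code reported a missing match through the `ok` flag, while `htmlquery.FindOne` returns `nil` when the XPath matches nothing, so the following `InnerText` call panics and, in server mode, it is the `recover()` in `handleHttp` that turns that panic into an error response. A defensive variant, purely as a sketch built around a hypothetical `findText` helper:

```go
package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/antchfx/htmlquery"
    "golang.org/x/net/html"
)

// findText is a hypothetical checked helper; webpluck itself lets the nil
// dereference panic and relies on recover() in the HTTP handler.
func findText(doc *html.Node, xpath string) (string, error) {
    node := htmlquery.FindOne(doc, xpath)
    if node == nil {
        return "", fmt.Errorf("xpath %q matched no node", xpath)
    }
    return htmlquery.InnerText(node), nil
}

func main() {
    doc, err := htmlquery.Parse(strings.NewReader(`<html><body><p id="a">hello</p></body></html>`))
    if err != nil {
        log.Fatal(err)
    }

    if text, err := findText(doc, `//p[@id="a"]/text()`); err == nil {
        fmt.Println(text) // hello
    }

    // A non-matching XPath surfaces as an error instead of a nil-pointer panic.
    if _, err := findText(doc, `//p[@id="missing"]`); err != nil {
        fmt.Println(err)
    }
}
```
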
0 commit comments