Commit 876cedd

Merge pull request #8 from opsdisk/make-http-429-optional

Added yagooglesearch_manages_http_429s option

2 parents 30cd185 + 12a1c19 commit 876cedd

File tree

3 files changed: +89 -25 lines changed

README.md

Lines changed: 51 additions & 14 deletions
```diff
@@ -77,16 +77,16 @@ for url in urls:
 ## Google is blocking me!
 
 Low and slow is the strategy when executing Google searches using `yagooglesearch`. If you start getting HTTP 429
-responses, Google has rightfully detected you as a bot and will block your IP for a set period of time.
-`yagooglesearch` is not able to bypass CAPTCHA, but you can do this manually by performing a Google search from a
-browser and proving you are a human.
+responses, Google has rightfully detected you as a bot and will block your IP for a set period of time. `yagooglesearch`
+is not able to bypass CAPTCHA, but you can do this manually by performing a Google search from a browser and proving you
+are a human.
 
 The criteria and thresholds to getting blocked is unknown, but in general, randomizing the user agent, waiting enough
-time between paged search results (7-17 seconds), and waiting enough time between different Google searches
-(30-60 seconds) should suffice. Your mileage will definitely vary though. Using this library with Tor will likely get
-you blocked quickly.
+time between paged search results (7-17 seconds), and waiting enough time between different Google searches (30-60
+seconds) should suffice. Your mileage will definitely vary though. Using this library with Tor will likely get you
+blocked quickly.
 
-## HTTP 429 detection and recovery
+## HTTP 429 detection and recovery (optional)
 
 If `yagooglesearch` detects an HTTP 429 response from Google, it will sleep for `http_429_cool_off_time_in_minutes`
 minutes and then try again. Each time an HTTP 429 is detected, it increases the wait time by a factor of
```
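That compounding cool-off can be sketched in standalone Python (defaults taken from this commit's `__init__` signature; this sketch is illustrative only and is not part of the diff):

```python
# Sketch of the documented cool-off math: each detected HTTP 429 multiplies
# the sleep time by http_429_cool_off_factor before the next retry.
cool_off_minutes = 60.0  # http_429_cool_off_time_in_minutes default
cool_off_factor = 1.1    # http_429_cool_off_factor default

sleeps = []
for _ in range(3):  # first three HTTP 429 detections
    sleeps.append(round(cool_off_minutes, 1))
    cool_off_minutes *= cool_off_factor

print(sleeps)  # minutes slept per successive HTTP 429: [60.0, 66.0, 72.6]
```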
```diff
@@ -95,6 +95,43 @@ minutes and then try again. Each time an HTTP 429 is detected, it increases the
 The goal is to have `yagooglesearch` worry about HTTP 429 detection and recovery and not put the burden on the script
 using it.
 
+If you do not want `yagooglesearch` to handle HTTP 429s and would rather handle it yourself, pass
+`yagooglesearch_manages_http_429s=False` when instantiating the yagooglesearch object. If an HTTP 429 is detected, the
+string "HTTP_429_DETECTED" is added to a list object that will be returned, and it's up to you on what the next step
+should be. The list object will contain any URLs found before the HTTP 429 was detected.
+
+```python
+import yagooglesearch
+
+query = "site:twitter.com"
+
+client = yagooglesearch.SearchClient(
+    query,
+    tbs="li:1",
+    verbosity=4,
+    num=10,
+    max_search_result_urls_to_return=1000,
+    minimum_delay_between_paged_results_in_seconds=1,
+    yagooglesearch_manages_http_429s=False,  # Add to manage HTTP 429s.
+)
+client.assign_random_user_agent()
+
+urls = client.search()
+
+if "HTTP_429_DETECTED" in urls:
+    print("HTTP 429 detected...it's up to you to modify your search.")
+
+    # Remove HTTP_429_DETECTED from list.
+    urls.remove("HTTP_429_DETECTED")
+
+    print("URLs found before HTTP 429 detected...")
+
+    for url in urls:
+        print(url)
+```
+
+![http429_detection_string_in_returned_list.png](img/http429_detection_string_in_returned_list.png)
+
 ## HTTP and SOCKS5 proxy support
 
 `yagooglesearch` supports the use of a proxy. The provided proxy is used for the entire life cycle of the search to
```
```diff
@@ -119,10 +156,10 @@ Supported proxy schemes are based off those supported in the Python `requests` l
 
 * `http`
 * `https`
-* `socks5` - "causes the DNS resolution to happen on the client, rather than on the proxy server." You likely
-  **do not** want this since all DNS lookups would source from where `yagooglesearch` is being run instead of the proxy.
-* `socks5h` - "If you want to resolve the domains on the proxy server, use socks5h as the scheme." This is the
-  **best** option if you are using SOCKS because the DNS lookup and Google search is sourced from the proxy IP address.
+* `socks5` - "causes the DNS resolution to happen on the client, rather than on the proxy server." You likely **do
+  not** want this since all DNS lookups would source from where `yagooglesearch` is being run instead of the proxy.
+* `socks5h` - "If you want to resolve the domains on the proxy server, use socks5h as the scheme." This is the **best**
+  option if you are using SOCKS because the DNS lookup and Google search is sourced from the proxy IP address.
 
 ## HTTPS proxies and SSL/TLS certificates
 
```
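The socks5 vs. socks5h distinction quoted above can be checked with a standalone sketch (the helper function and the proxy address are hypothetical, introduced only for illustration):

```python
from urllib.parse import urlsplit

def dns_resolved_on_proxy(proxy_url: str) -> bool:
    # Per the requests behavior quoted above: the socks5h scheme resolves DNS
    # on the proxy server; plain socks5 resolves it on the client.
    return urlsplit(proxy_url).scheme == "socks5h"

print(dns_resolved_on_proxy("socks5://127.0.0.1:1080"))   # False: client-side DNS
print(dns_resolved_on_proxy("socks5h://127.0.0.1:1080"))  # True: proxy-side DNS
```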
```diff
@@ -232,8 +269,8 @@ The `&tbs=` parameter is used to specify either verbatim or time-based filters.
 ## Limitations
 
 Currently, the `.filter_search_result_urls()` function will remove any url with the word "google" in it. This is to
-prevent the returned search URLs from being polluted with Google URLs. Note this if you are trying to explicitly
-search for results that may have "google" in the URL, such as `site:google.com computer`
+prevent the returned search URLs from being polluted with Google URLs. Note this if you are trying to explicitly search
+for results that may have "google" in the URL, such as `site:google.com computer`
 
 ## License
 
```
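The limitation described above can be illustrated with a minimal sketch (an assumed re-implementation for illustration, not the library's actual `.filter_search_result_urls()` code):

```python
def filter_google_urls(urls):
    # Hypothetical stand-in for the filtering described above: drop any URL
    # containing the word "google".
    return [url for url in urls if "google" not in url.lower()]

results = [
    "https://example.com/page",
    "https://www.google.com/policies",
    "https://sites.google.com/view/computer",
]
print(filter_google_urls(results))  # ['https://example.com/page']
```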
```diff
@@ -248,4 +285,4 @@ Project Link: [https://github.com/opsdisk/yagooglesearch](https://github.com/ops
 ## Acknowledgements
 
 * [Mario Vilas](https://github.com/MarioVilas) for his amazing work on the original
-[googlesearch](https://github.com/MarioVilas/googlesearch) library.
+  [googlesearch](https://github.com/MarioVilas/googlesearch) library.
```

setup.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -5,7 +5,7 @@
 
 setuptools.setup(
     name="yagooglesearch",
-    version="1.3.0",
+    version="1.4.0",
     author="Brennon Thomas",
     author_email="info@opsdisk.com",
     description="A Python library for executing intelligent, realistic-looking, and tunable Google searches.",
```

yagooglesearch/__init__.py

Lines changed: 37 additions & 10 deletions
```diff
@@ -12,7 +12,7 @@
 
 # Custom Python libraries.
 
-__version__ = "1.2.0"
+__version__ = "1.4.0"
 
 # Logging
 ROOT_LOGGER = logging.getLogger("yagooglesearch")
```
```diff
@@ -77,8 +77,9 @@ def __init__(
         country="",
         extra_params=None,
         max_search_result_urls_to_return=100,
-        delay_between_paged_results_in_seconds=list(range(7, 18)),
+        minimum_delay_between_paged_results_in_seconds=7,
         user_agent=None,
+        yagooglesearch_manages_http_429s=True,
         http_429_cool_off_time_in_minutes=60,
         http_429_cool_off_factor=1.1,
         proxy="",
```
```diff
@@ -88,28 +89,31 @@
 
         """
         SearchClient
-        :param str query: Query string. Must NOT be url-encoded.
+        :param str query: Query string. Must NOT be url-encoded.
         :param str tld: Top level domain.
         :param str lang: Language.
         :param str tbs: Verbatim search or time limits (e.g., "qdr:h" => last hour, "qdr:d" => last 24 hours, "qdr:m"
             => last month).
         :param str safe: Safe search.
         :param int start: First page of results to retrieve.
         :param int num: Max number of results to pull back per page. Capped at 100 by Google.
-        :param str country: Country or region to focus the search on. Similar to changing the TLD, but does not yield
+        :param str country: Country or region to focus the search on. Similar to changing the TLD, but does not yield
             exactly the same results. Only Google knows why...
-        :param dict extra_params: A dictionary of extra HTTP GET parameters, which must be URL encoded. For example if
+        :param dict extra_params: A dictionary of extra HTTP GET parameters, which must be URL encoded. For example if
             you don't want Google to filter similar results you can set the extra_params to {'filter': '0'} which will
             append '&filter=0' to every query.
         :param int max_search_result_urls_to_return: Max URLs to return for the entire Google search.
-        :param int delay_between_paged_results_in_seconds: Time to wait between HTTP requests for consecutive pages for
-            the same search query.
+        :param int minimum_delay_between_paged_results_in_seconds: Minimum time to wait between HTTP requests for
+            consecutive pages for the same search query. The actual time will be a random value between this minimum
+            value and value + 11 to make it look more human.
         :param str user_agent: Hard-coded user agent for the HTTP requests.
+        :param bool yagooglesearch_manages_http_429s: Determines if yagooglesearch will handle HTTP 429 cool off and
+            retries. Disable if you want to manage HTTP 429 responses.
         :param int http_429_cool_off_time_in_minutes: Minutes to sleep if an HTTP 429 is detected.
         :param float http_429_cool_off_factor: Factor to multiply by http_429_cool_off_time_in_minutes for each HTTP 429
             detected.
         :param str proxy: HTTP(S) or SOCKS5 proxy to use.
-        :param bool verify_ssl: Verify the SSL certificate to prevent traffic interception attacks. Defaults to True.
+        :param bool verify_ssl: Verify the SSL certificate to prevent traffic interception attacks. Defaults to True.
             This may need to be disabled in some HTTPS proxy instances.
         :param int verbosity: Logging and console output verbosity.
```
```diff
@@ -127,8 +131,9 @@ def __init__(
         self.country = country
         self.extra_params = extra_params
         self.max_search_result_urls_to_return = max_search_result_urls_to_return
-        self.delay_between_paged_results_in_seconds = delay_between_paged_results_in_seconds
+        self.minimum_delay_between_paged_results_in_seconds = minimum_delay_between_paged_results_in_seconds
         self.user_agent = user_agent
+        self.yagooglesearch_manages_http_429s = yagooglesearch_manages_http_429s
         self.http_429_cool_off_time_in_minutes = http_429_cool_off_time_in_minutes
         self.http_429_cool_off_factor = http_429_cool_off_factor
         self.proxy = proxy
```
```diff
@@ -362,14 +367,24 @@ def get_page(self, url):
 
         if http_response_code == 200:
             html = response.text
+
         elif http_response_code == 429:
+
             ROOT_LOGGER.warning("Google is blocking your IP for making too many requests in a specific time period.")
+
+            # Calling script does not want yagooglesearch to handle HTTP 429 cool off and retry. Just return a
+            # notification string.
+            if not self.yagooglesearch_manages_http_429s:
+                ROOT_LOGGER.info("Since yagooglesearch_manages_http_429s=False, yagooglesearch is done.")
+                return "HTTP_429_DETECTED"
+
             ROOT_LOGGER.info(f"Sleeping for {self.http_429_cool_off_time_in_minutes} minutes...")
             time.sleep(self.http_429_cool_off_time_in_minutes * 60)
             self.http_429_detected()
 
             # Try making the request again.
             html = self.get_page(url)
+
         else:
             ROOT_LOGGER.warning(f"HTML response code: {http_response_code}")
```
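The branch added above can be condensed into a standalone sketch (a hypothetical simplified function, not the library code): on HTTP 429 the caller either gets the sentinel string back immediately, or the function sleeps and retries.

```python
import time

def get_page(status_code, manages_http_429s=True, cool_off_minutes=0.0):
    # Simplified mirror of the diff's control flow. Real code fetches a URL;
    # here the status code is passed in directly for illustration.
    if status_code == 200:
        return "<html>results</html>"
    if status_code == 429:
        if not manages_http_429s:
            # Caller opted out of 429 management: return the sentinel.
            return "HTTP_429_DETECTED"
        time.sleep(cool_off_minutes * 60)
        return get_page(200)  # pretend the retried request succeeds
    return ""

print(get_page(429, manages_http_429s=False))  # HTTP_429_DETECTED
print(get_page(429, manages_http_429s=True))   # <html>results</html>
```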
```diff
@@ -432,6 +447,13 @@ def search(self):
             # Request Google search results.
             html = self.get_page(url)
 
+            # HTTP 429 message returned from get_page() function, add "HTTP_429_DETECTED" to the set and return to the
+            # calling script.
+            if html == "HTTP_429_DETECTED":
+                unique_urls_set.add("HTTP_429_DETECTED")
+                self.unique_urls_list = list(unique_urls_set)
+                return self.unique_urls_list
+
             # Create the BeautifulSoup object.
             soup = BeautifulSoup(html, "html.parser")
 
```
```diff
@@ -509,6 +531,11 @@ def search(self):
             url = self.url_next_page_num
 
             # Randomize sleep time between paged requests to make it look more human.
-            random_sleep_time = random.choice(self.delay_between_paged_results_in_seconds)
+            random_sleep_time = random.choice(
+                range(
+                    self.minimum_delay_between_paged_results_in_seconds,
+                    self.minimum_delay_between_paged_results_in_seconds + 11,
+                )
+            )
             ROOT_LOGGER.info(f"Sleeping {random_sleep_time} seconds until retrieving the next page of results...")
             time.sleep(random_sleep_time)
```
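The rewritten delay logic draws from `range(minimum, minimum + 11)`, i.e. `minimum` through `minimum + 10` inclusive. A standalone check with the new default of 7 (not part of the diff) shows the delays stay inside the 7-17 second window the README mentions:

```python
import random

minimum_delay = 7  # minimum_delay_between_paged_results_in_seconds default

# Sample the same expression the diff uses for the randomized sleep time.
samples = {random.choice(range(minimum_delay, minimum_delay + 11)) for _ in range(2000)}

# Every sampled delay falls in the documented 7-17 second window.
print(min(samples), max(samples))
```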
