
Commit c1dff72: Merge pull request #404 from antoineeripret/fix/oom_errors_json_ld_errors
2 parents: 035e7cb + 48035ee


advertools/spider.py

Lines changed: 31 additions & 2 deletions
@@ -343,6 +343,35 @@
         include_url_regex='/shop/'
     )
 
+Customizing the Columns Included in the Output File
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+By default, the crawler extracts all the available elements described in the documentation. If you want to customize the columns included in the output file, you can use the ``keep_columns`` and ``discard_columns`` parameters.
+
+1. The ``keep_columns`` parameter allows you to specify which columns should always be included in the output file.
+2. The ``discard_columns`` parameter lets you specify which columns should be excluded.
+
+The filtering is done in the following order: first the ``keep_columns`` regex patterns are applied, and then the ``discard_columns`` patterns. You cannot discard the ``url`` and ``errors`` columns, as they are always kept by default.
+
+For example, if we execute the following code snippet:
+
+>>> adv.crawl(
+...     "http://example.com",
+...     "output_file.jl",
+...     keep_columns=["url", "title"],
+... )
+
+our ``output_file.jl`` file will contain the ``url``, ``title``, and ``errors`` columns. If we execute the following:
+
+>>> adv.crawl(
+...     "http://example.com",
+...     "output_file.jl",
+...     discard_columns=["title"],
+... )
+
+our ``output_file.jl`` file will contain all columns except ``title``.
+
 Spider Custom Settings and Additional Functionality
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -522,7 +551,7 @@ def _filter_crawl_dict(d, keep_columns=None, discard_columns=None):
     Returns:
         dict: A filtered dictionary.
     """
-    always_include = {"url", "errors", "jsonld_errors"}
+    always_include = {"url", "errors"}
 
     def matches_any(patterns, key):
         return any(re.search(pattern, key) for pattern in patterns)
@@ -1072,7 +1101,7 @@ def crawl(
         A list of regex patterns for the columns to discard in the output. If not
         specified, all columns are kept. If both ``keep_columns`` and
         ``discard_columns`` are specified, the columns will be filtered based on the
-        ``keep_columns`` regex patterns first, and then the ``discard_columns``.
+        ``keep_columns`` regex patterns first, and then the ``discard_columns``. You cannot discard the ``url`` and ``errors`` columns, as they are always kept by default.
 
     Examples
     --------
     Crawl a website and let the crawler discover as many pages as available
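Because both parameters are regex patterns matched with ``re.search`` (visible in the ``matches_any`` context lines of this commit), an unanchored pattern matches any column whose name merely contains it. The snippet below illustrates the difference using plain ``re``; the column names are hypothetical examples, not a guaranteed crawl output schema:

```python
import re

# Hypothetical column names; a real advertools crawl output has many more.
columns = ["url", "title", "og:title", "twitter:title", "h1", "errors"]

# Unanchored pattern: re.search matches anywhere in the column name.
loose = [c for c in columns if re.search("title", c)]
print(loose)  # ['title', 'og:title', 'twitter:title']

# Anchored pattern: matches the exact column name only.
exact = [c for c in columns if re.search("^title$", c)]
print(exact)  # ['title']
```

So ``discard_columns=["title"]`` would also drop columns like ``og:title``; anchoring the pattern restricts it to an exact match.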
