Commit 5ee7498

Update - Add LazySources for Lazy Data Sources
1 parent 0267bc1 commit 5ee7498

File tree

9 files changed: +507 −1 lines changed

lazyops/lazyio/models.py

Lines changed: 0 additions & 1 deletion

```diff
@@ -191,7 +191,6 @@ def readlines(self, *args, **kwargs):
     def get_num_lines(self):
         return sum(1 for _ in File.tflines(self._filename))
 
-    @timed_cache(10)
     @property
     def filesize(self):
         self._ensure_open()
```

lazyops/lazysources/README.md

Lines changed: 5 additions & 0 deletions

# LazySources

Lazy Data Sources

- `lazyops.lazysources.gdelt`: [GDELT](https://www.gdeltproject.org/)

lazyops/lazysources/__init__.py

Lines changed: 5 additions & 0 deletions

```python
from . import gdelt

__all__ = [
    'gdelt'
]
```

lazyops/lazysources/gdelt/README.md

Lines changed: 121 additions & 0 deletions

# LazySources - GDELT

Lazy Data Source for [GDELT](https://www.gdeltproject.org/)

Extended from [gdelt-doc-api](https://github.com/alex9smith/gdelt-doc-api).
Credit to the original authors @alex9smith and @FelixKleineBoesing

- Async Support
- Multiple Result Formats Available
    - Dict
    - Pandas DataFrame
    - JSON string
    - Object: GDELTArticle class
        - Can be called on to fully parse the URL
        - `article = articles[0]; article.parse()`


API client for the GDELT 2.0 Doc API. Supports async methods.

```python
from lazyops.lazysources.gdelt import GDELT, GDELTFilters

# Formats = [
#   'obj',  # GDELTArticle, which can be called to extract the url
#   'json', # Pure JSON string output
#   'dict', # Python dict
#   'pd',   # Pandas DataFrame
# ]

f = GDELTFilters(
    keyword = "climate change",
    start_date = "2021-05-10",
    end_date = "2021-05-15"
)
gd = GDELT(result_format='obj')

# Search for articles matching the filters
articles = gd.article_search(f)

# Or call the .search method directly
articles = gd.search(method='article', filters=f)

# Async methods
articles = await gd.async_search(method='article', filters=f)
articles = await gd.async_article_search(f)

# Parsing articles - synchronous
english_articles = [i for i in articles if i.language == 'English']

for article in english_articles:
    article.parse()
    print(article.text)

# Parsing articles - asynchronous
english_articles = [await article.async_parse() for article in english_articles]


# Get a timeline of the number of articles matching the filters
# timeline = gd.timeline_search("timelinevol", f)
```
### Article List
The article list mode of the API generates a list of news articles that match the filters.
The client returns this as a pandas DataFrame with columns `url`, `url_mobile`, `title`,
`seendate`, `socialimage`, `domain`, `language`, `sourcecountry`.

### Timeline Search
There are 5 available modes when making a timeline search:
* `timelinevol` - a timeline of the volume of news coverage matching the filters,
  represented as a percentage of the total news articles monitored by GDELT.
* `timelinevolraw` - similar to `timelinevol`, but returns the actual number of articles
  and a total rather than a percentage.
* `timelinelang` - similar to `timelinevol`, but breaks the total articles down by published language.
  Each language is returned as a separate column in the DataFrame.
* `timelinesourcecountry` - similar to `timelinevol`, but breaks the total articles down by the country
  they were published in. Each country is returned as a separate column in the DataFrame.
* `timelinetone` - a timeline of the average tone of the news coverage matching the filters.
  See [GDELT's documentation](https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/)
  for more information about the tone metric.

### Constructing filters for the GDELT API
Filters for `keyword`, `domain`, `domain_exact`, `country` and `theme`
can be passed either as a single string or as a list of strings. If a list is
passed, the values in the list are wrapped in a boolean OR.

Params
------
* `start_date`
    The start date for the filter in YYYY-MM-DD format. The API officially only supports the
    most recent 3 months of articles. Making a request for an earlier date range may still
    return data, but it's not guaranteed.
    Must provide either `start_date` and `end_date` or `timespan`.
* `end_date`
    The end date for the filter in YYYY-MM-DD format.
* `timespan`
    A timespan to search for, relative to the time of the request. Must match one of the API's timespan
    formats - https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/
    Must provide either `start_date` and `end_date` or `timespan`.
* `num_records`
    The number of records to return. Only used in article list mode and can be up to 250.
* `keyword`
    Return articles containing the exact phrase `keyword` within the article text.
* `domain`
    Return articles from the specified domain. Does not require an exact match, so
    passing "cnn.com" will match articles from "cnn.com", "subdomain.cnn.com" and "notactuallycnn.com".
* `domain_exact`
    Similar to `domain`, but requires an exact match.
* `near`
    Return articles containing words close to each other in the text. Use `near()` to construct,
    e.g. `near = near(5, "airline", "climate")`.
* `repeat`
    Return articles containing a single word repeated at least a number of times. Use `repeat()`
    to construct, e.g. `repeat = repeat(3, "environment")`.
    If you want to construct a filter with multiple repeated words, construct with `multi_repeat()`
    instead, e.g. `repeat = multi_repeat([(2, "airline"), (3, "airport")], "AND")`.
* `country`
    Return articles published in a country, formatted as the FIPS 2-letter country code.
* `theme`
    Return articles that cover one of GDELT's GKG Themes. A full list of themes can be
    found here: http://data.gdeltproject.org/api/v2/guides/LOOKUP-GKGTHEMES.TXT
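The `near()`, `repeat()` and `multi_repeat()` helpers described above map onto GDELT's text query syntax. A minimal stand-alone sketch — the exact clause format (`near{n}:"..."`, `repeat{n}:"..."`) follows the upstream gdelt-doc-api and is an assumption here, not a verbatim copy of this package's implementation:

```python
def near(n: int, *words: str) -> str:
    # Articles where the given words appear within `n` words of each other
    return f'near{n}:"{" ".join(words)}"'

def repeat(n: int, word: str) -> str:
    # Articles where `word` appears at least `n` times
    return f'repeat{n}:"{word}"'

def multi_repeat(repeats: list, how: str = "AND") -> str:
    # Combine several repeat clauses with a boolean operator ("AND" / "OR")
    return f" {how} ".join(repeat(n, w) for n, w in repeats)
```

For example, `near(5, "airline", "climate")` yields the clause `near5:"airline climate"`, which can be embedded in the query string sent to the Doc API.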

lazyops/lazysources/gdelt/__init__.py

Lines changed: 13 additions & 0 deletions

```python
from .core import GDELT
from .filters import GDELTFilters, near, repeat, multi_repeat
from .models import GDELTArticle, Article

__all__ = [
    'GDELT',
    'GDELTFilters',
    'near',
    'repeat',
    'multi_repeat',
    'GDELTArticle',
    'Article'
]
```

lazyops/lazysources/gdelt/_base.py

Lines changed: 17 additions & 0 deletions

```python
from lazyops import lazy_init, get_logger, timed_cache, LazyObject
lazy_init('pandas')

import pandas as pd

from enum import Enum
from typing import Dict, Optional, List, Union, Tuple

from dataclasses import dataclass
from lazyops.apis import LazySession, async_req
from lazyops.lazyio import LazyJson
from lazyops.lazyclasses import lazyclass


logger = get_logger('LazySources', 'GDELT')
```

lazyops/lazysources/gdelt/core.py

Lines changed: 141 additions & 0 deletions

```python
from ._base import *

from .filters import GDELTFilters
from .models import GDELTArticle

class GDELTFormat(Enum):
    dict = 'dict'
    obj = 'obj'
    json = 'json'
    pandas = 'pd'


class GDELTMethods(Enum):
    article = 'article'
    timeline = 'timeline'

class GDELT:
    api_url = 'https://api.gdeltproject.org/api/v2/doc/doc'
    available_modes = ["artlist", "timelinevol", "timelinevolraw", "timelinetone", "timelinelang", "timelinesourcecountry"]

    def __init__(self, result_format: Union[GDELTFormat, str] = GDELTFormat.obj, json_parsing_max_depth: int = 100, *args, **kwargs) -> None:
        self.max_depth_json_parsing = json_parsing_max_depth
        # Accept either a GDELTFormat member or its string value, e.g. 'obj'
        self._output_format = GDELTFormat(result_format) if isinstance(result_format, str) else result_format
        self.sess = LazySession()

    def return_article_result(self, articles: Dict = None):
        if not articles or not articles.get('articles'):
            return None
        if self._output_format.value == 'dict':
            return articles['articles']

        if self._output_format.value == 'pd':
            return pd.DataFrame(articles["articles"])

        if self._output_format.value == 'json':
            return LazyJson.dumps(articles['articles'])

        if self._output_format.value == 'obj':
            return [GDELTArticle(**article) for article in articles['articles']]

    def return_timeline_search(self, results: Dict = None):
        if not results:
            return None

        if self._output_format.value == 'dict':
            return results

        if self._output_format.value == 'pd':
            formatted = pd.DataFrame(results)
            formatted["datetime"] = pd.to_datetime(formatted["datetime"])
            return formatted

        if self._output_format.value == 'json':
            return LazyJson.dumps(results)

        if self._output_format.value == 'obj':
            return [LazyObject(res) for res in results]
```
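The per-format dispatch in `return_article_result` can be sketched stand-alone for the `dict` and `json` branches — a minimal sketch using the stdlib `json` module in place of `LazyJson`; `format_articles` is a hypothetical helper, not part of the package:

```python
import json

def format_articles(payload: dict, fmt: str = "dict"):
    """Dispatch a raw article payload to the requested output format."""
    articles = (payload or {}).get("articles")
    if not articles:
        return None           # empty or missing result set
    if fmt == "dict":
        return articles       # the raw list of article dicts
    if fmt == "json":
        return json.dumps(articles)
    raise ValueError(f"Unsupported format: {fmt}")
```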
```python
    def article_search(self, filters: GDELTFilters) -> Union[pd.DataFrame, Dict, str]:
        articles = self._query("artlist", filters.query_string)
        return self.return_article_result(articles)

    def timeline_search(self, mode: str, filters: GDELTFilters) -> Union[pd.DataFrame, Dict, str]:
        timeline = self._query(mode, filters.query_string)
        results = {"datetime": [entry["date"] for entry in timeline["timeline"][0]["data"]]}
        for series in timeline["timeline"]:
            results[series["series"]] = [entry["value"] for entry in series["data"]]

        if mode == "timelinevolraw":
            results["All Articles"] = [entry["norm"] for entry in timeline["timeline"][0]["data"]]
        return self.return_timeline_search(results)

    def search(self, method: Union[GDELTMethods, str], filters: GDELTFilters, mode: str = 'timelinevol') -> Union[pd.DataFrame, Dict, str]:
        # Accept either a GDELTMethods member or its string value, e.g. 'article'
        if isinstance(method, str): method = GDELTMethods(method)
        if method.value == 'article':
            return self.article_search(filters)
        if method.value == 'timeline':
            return self.timeline_search(mode, filters)

    async def async_search(self, method: Union[GDELTMethods, str], filters: GDELTFilters, mode: str = 'timelinevol') -> Union[pd.DataFrame, Dict, str]:
        if isinstance(method, str): method = GDELTMethods(method)
        if method.value == 'article':
            return await self.async_article_search(filters)
        if method.value == 'timeline':
            return await self.async_timeline_search(mode, filters)

    async def async_article_search(self, filters: GDELTFilters) -> Union[pd.DataFrame, Dict, str]:
        articles = await self._async_query("artlist", filters.query_string)
        return self.return_article_result(articles)

    async def async_timeline_search(self, mode: str, filters: GDELTFilters) -> Union[pd.DataFrame, Dict, str]:
        timeline = await self._async_query(mode, filters.query_string)
        results = {"datetime": [entry["date"] for entry in timeline["timeline"][0]["data"]]}
        for series in timeline["timeline"]:
            results[series["series"]] = [entry["value"] for entry in series["data"]]

        if mode == "timelinevolraw":
            results["All Articles"] = [entry["norm"] for entry in timeline["timeline"][0]["data"]]
        return self.return_timeline_search(results)
```
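The reshaping in `timeline_search` — one `datetime` column plus one column per returned series, with an extra raw-count column in `timelinevolraw` mode — can be exercised stand-alone against a minimal, hypothetical payload:

```python
def reshape_timeline(timeline: dict, mode: str = "timelinevol") -> dict:
    """Flatten a GDELT timeline payload into a column-oriented dict."""
    # All series share the same dates; take them from the first series
    results = {"datetime": [e["date"] for e in timeline["timeline"][0]["data"]]}
    for series in timeline["timeline"]:
        results[series["series"]] = [e["value"] for e in series["data"]]
    if mode == "timelinevolraw":
        # The raw mode also carries the total article count per interval
        results["All Articles"] = [e["norm"] for e in timeline["timeline"][0]["data"]]
    return results

# Hypothetical minimal payload in the shape the Doc API returns
payload = {"timeline": [{"series": "Volume Intensity", "data": [
    {"date": "20210510", "value": 0.1, "norm": 1000},
    {"date": "20210511", "value": 0.2, "norm": 1100},
]}]}
```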
```python
    def _decode_json(self, content, max_recursion_depth: int = 100, recursion_depth: int = 0):
        try:
            result = LazyJson.loads(content, recursive=True)
        except Exception as e:
            if recursion_depth >= max_recursion_depth:
                raise ValueError("Max recursion depth reached. JSON can't be parsed!")
            # Blank out the offending character and retry
            idx_to_replace = int(e.pos)
            if isinstance(content, bytes): content = content.decode("utf-8")
            json_message = list(content)
            json_message[idx_to_replace] = ' '
            new_message = ''.join(str(m) for m in json_message)
            return self._decode_json(content=new_message, max_recursion_depth=max_recursion_depth, recursion_depth=recursion_depth+1)
        return result

    async def _async_decode_json(self, content, max_recursion_depth: int = 100, recursion_depth: int = 0):
        try:
            result = LazyJson.loads(content, recursive=True)
        except Exception as e:
            if recursion_depth >= max_recursion_depth:
                raise ValueError("Max recursion depth reached. JSON can't be parsed!")
            idx_to_replace = int(e.pos)
            if isinstance(content, bytes): content = content.decode("utf-8")
            json_message = list(content)
            json_message[idx_to_replace] = ' '
            new_message = ''.join(str(m) for m in json_message)
            return await self._async_decode_json(content=new_message, max_recursion_depth=max_recursion_depth, recursion_depth=recursion_depth+1)
        return result

    def _query(self, mode: str, query_string: str) -> Dict:
        if mode not in GDELT.available_modes:
            raise ValueError(f"Mode {mode} not in supported API modes")
        resp = self.sess.fetch(url=GDELT.api_url, decode_json=False, method='GET', params={'query': query_string, 'mode': mode, 'format': 'json'})
        if resp.status_code not in [200, 202]:
            raise ValueError(f"The GDELT API returned a non-successful status code. Response message: {resp.text}")
        return self._decode_json(resp.content, max_recursion_depth=self.max_depth_json_parsing)

    async def _async_query(self, mode: str, query_string: str) -> Dict:
        if mode not in GDELT.available_modes:
            raise ValueError(f"Mode {mode} not in supported API modes")
        resp = await self.sess.async_fetch(url=GDELT.api_url, decode_json=False, method='GET', params={'query': query_string, 'mode': mode, 'format': 'json'})
        if resp.status_code not in [200, 202]:
            raise ValueError(f"The GDELT API returned a non-successful status code. Response message: {resp.text}")
        return await self._async_decode_json(resp.content, max_recursion_depth=self.max_depth_json_parsing)
```
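The retry loop in `_decode_json` blanks the character at the failure position and re-parses, up to a depth limit. A stand-alone sketch with the stdlib `json` module in place of `LazyJson` (same strategy; `decode_lenient` is a hypothetical name) behaves the same way:

```python
import json

def decode_lenient(content: str, max_depth: int = 100, depth: int = 0):
    """Parse JSON, blanking one offending character per retry."""
    try:
        return json.loads(content)
    except json.JSONDecodeError as e:
        if depth >= max_depth:
            raise ValueError("Max recursion depth reached. JSON can't be parsed!")
        chars = list(content)
        chars[e.pos] = ' '   # e.pos is the index where parsing failed
        return decode_lenient(''.join(chars), max_depth, depth + 1)
```

This trades strictness for robustness: a stray control character in an API response costs one extra parse pass instead of failing the whole request.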
