Python web scraping for JavaScript-generated content

I am trying to use Python 3 to return the BibTeX citation generated by . The URLs are predictable, so the script can work out the URL without having to interact with the web page. I have tried using selenium, bs4, etc., but can't get the text inside the box.

import urllib.request
from bs4 import BeautifulSoup

url = "#/doi/10.1007/s00425-007-0544-9"
# Name the parser explicitly so bs4 does not have to guess one
text = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
print(text)

Can anyone suggest a way of returning the BibTeX citation as a string (or in some other form) in Python?


ANSWERS:


You don't need BeautifulSoup here. The page sends an additional XHR request to the server to fill in the BibTeX citation. You can simulate that request yourself, for example with requests:

import requests

bibtex_id = '10.1007/s00425-007-0544-9'

url = "#/doi/{id}".format(id=bibtex_id)
xhr_url = 'doi2bib'

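with requests.Session() as session:
    # Visit the page first, in case the site sets cookies the XHR call needs
    session.get(url)

    # Repeat the XHR request; .text is a decoded str, .content would be bytes
    response = session.get(xhr_url, params={'id': bibtex_id})
    print(response.text)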

Prints:

@article{Burgert_2007,
    doi = {10.1007/s00425-007-0544-9},
    url = {http://dx.doi.org/10.1007/s00425-007-0544-9},
    year = 2007,
    month = {jun},
    publisher = {Springer Science $\mathplus$ Business Media},
    volume = {226},
    number = {4},
    pages = {981--987},
    author = {Ingo Burgert and Michaela Eder and Notburga Gierlinger and Peter Fratzl},
    title = {Tensile and compressive stresses in tracheids are induced by swelling based on geometrical constraints of the wood cell},
    journal = {Planta}
}
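Since the question asks for the citation as a string, the request above can be wrapped in a small helper. This is only a sketch: fetch_bibtex and its base_url, session and timeout parameters are hypothetical names, and base_url stands in for the site root, which is not shown in this thread and has to be supplied by the caller. The "#/doi/" and "doi2bib" paths are the ones used in the snippet above.

```python
def fetch_bibtex(doi, base_url, session=None, timeout=10):
    """Return the BibTeX entry for `doi` as a str.

    base_url is the site root (an assumption: it is not given in the thread).
    session is any object with a requests-style get(); a requests.Session
    is created when none is passed in.
    """
    if session is None:
        import requests  # imported lazily so a stub session can be used instead
        session = requests.Session()

    root = base_url.rstrip("/")

    # Load the page first, mirroring the snippet above.
    session.get(root + "/#/doi/" + doi, timeout=timeout)

    # Repeat the XHR call that the page makes to fetch the citation.
    response = session.get(root + "/doi2bib", params={"id": doi}, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of returning an error page
    return response.text.strip()
```

Called as fetch_bibtex('10.1007/s00425-007-0544-9', base_url), it should return the entry shown above as a str, provided the site still answers the same request.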

You can also solve it with selenium. The key trick here is to use an Explicit Wait to wait for the citation to become visible:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('#/doi/10.1007/s00425-007-0544-9')

element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//pre[@ng-show="bib"]')))
print(element.text)

driver.quit()  # end the WebDriver session and close the browser

Prints the same as the above solution.
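To make the idea behind an explicit wait concrete, here is a minimal sketch of the polling pattern in plain Python. It is not Selenium's implementation; wait_until, condition and poll_interval are hypothetical names used only for illustration. The point is that the condition is retried until it returns something truthy or the timeout runs out, which is why the citation can be read even though it is filled in after the page loads.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)  # back off briefly before checking again
```

In the Selenium snippet above, WebDriverWait(driver, 10).until(...) plays the role of wait_until, and EC.visibility_of_element_located(...) plays the role of condition.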


